Computer scientists recently examined the performance of dialog systems, such as personal assistants and chatbots designed to interact with humans. The team found that when these systems are confronted with dialog that includes idioms or similes, their performance drops by 10 to 20 percent.
The research team also developed a partial remedy. They wrote a simple script that identifies figurative phrases and replaces them with their literal meanings. As a result, the performance of dialog systems improved by up to 15 percent.
The researchers are presenting their findings at the 2021 Conference on Empirical Methods in Natural Language Processing, which takes place Nov. 7 to 11, 2021.
Applications for this work include not only personal assistants, but also systems that are designed to summarize information, such as the box summarizing search results at the top of a Google page. Automated systems that need to answer questions, for example when a bill needs to be paid or an appointment to be made, would also benefit from this work.
“We want to enable more natural conversations between people and dialog systems,” said Harsh Jhamtani, the paper’s first author.
Jhamtani is a Ph.D. student at Carnegie Mellon University and is currently working as a visiting researcher with senior author Taylor Berg-Kirkpatrick, a faculty member in the UC San Diego Department of Computer Science and Engineering.
The study was inspired by Jhamtani’s own struggles with figurative language. He is a native Hindi speaker and also speaks English, India’s other official language. But he had to learn the many U.S. idioms and metaphors his colleagues use.
For example, he panicked when a colleague said they were starving, because in Hindi that might indicate a medical emergency. The colleague then explained that it just meant they were hungry. That left Jhamtani wondering whether artificial dialog systems would have the same problem he did.
In the study, researchers tested five different systems designed to talk with humans, including GPT-2, which is trained to predict the next word in 40GB of Internet text and was developed by research company OpenAI.
Researchers first ran the dialog systems through a dataset of 13.1K conversations on colloquial topics like tourism, health and so on. They then extracted the conversations that included figurative language from the dataset and ran the systems through those only. They observed a drop in performance ranging from 10 to 20 percent.
They then wrote a script that allowed the systems to quickly check dictionaries that translate figurative speech into literal speech. This is faster and more efficient than re-training systems to learn the complete content of these dictionaries. Researchers observed that performance improved by as much as 15 percent.
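The dictionary lookup described above can be illustrated with a minimal Python sketch. This is not the researchers' script: the three dictionary entries, the names `FIGURATIVE_TO_LITERAL` and `literalize`, and the regular-expression matching strategy are all assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionary mapping figurative phrases to literal ones.
# The dictionaries used in the study are not reproduced here.
FIGURATIVE_TO_LITERAL = {
    "starving": "very hungry",
    "a piece of cake": "very easy",
    "under the weather": "unwell",
}

# Try longer phrases first so multi-word idioms win over shorter overlaps.
_phrases = sorted(FIGURATIVE_TO_LITERAL, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in _phrases) + r")\b",
    re.IGNORECASE,
)


def literalize(utterance: str) -> str:
    """Replace each known figurative phrase with its literal meaning."""
    def swap(match: re.Match) -> str:
        # Dictionary keys are lowercase, so normalize the matched text.
        return FIGURATIVE_TO_LITERAL[match.group(0).lower()]

    return _PATTERN.sub(swap, utterance)


if __name__ == "__main__":
    print(literalize("I am starving, let's get lunch."))
    print(literalize("The exam was a piece of cake."))
```

In a setup like this, the rewritten utterance would be passed to the dialog system in place of the original, so the system itself needs no retraining. A plain string substitution of this kind does not check whether the result is still grammatical.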
The researchers still had to partially rely on human observers to identify figurative language within the dataset, before the text could be converted. Further study is needed in this area.
It will take several iterations before the algorithms the researchers developed are ready for implementation. For example, they found that in some rare cases, replacing the figurative language with literal language distorted the grammar of a sentence to the point where the dialog systems could no longer understand it.
Investigating Robustness of Dialog Models to Popular Figurative Language Constructs
Harsh Jhamtani, Varun Gangal, Eduard Hovy, School of Computer Science, Carnegie Mellon University
Taylor Berg-Kirkpatrick, Department of Computer Science and Engineering, University of California San Diego