News Release

Awkward. Humans are still better than AI at reading the room

Johns Hopkins research shows AI models fall short in predicting social interactions

Reports and Proceedings

Johns Hopkins University

Humans, it turns out, are better than current AI models at describing and interpreting social interactions in a moving scene—a skill necessary for self-driving cars, assistive robots, and other technologies that rely on AI systems to navigate the real world.

The research, led by scientists at Johns Hopkins University, finds that artificial intelligence systems fail at understanding social dynamics and context necessary for interacting with people and suggests the problem may be rooted in the infrastructure of AI systems.

“AI for a self-driving car, for example, would need to recognize the intentions, goals, and actions of human drivers and pedestrians. You would want it to know which way a pedestrian is about to start walking, or whether two people are in conversation versus about to cross the street,” said lead author Leyla Isik, an assistant professor of cognitive science at Johns Hopkins University. “Any time you want an AI to interact with humans, you want it to be able to recognize what people are doing. I think this sheds light on the fact that these systems can’t right now.”

Kathy Garcia, a doctoral student working in Isik’s lab at the time of the research and co–first author, will present the research findings at the International Conference on Learning Representations on April 24. 

To determine how AI models measure up compared to human perception, the researchers asked human participants to watch three-second videoclips and rate features important for understanding social interactions on a scale of one to five. The clips included people either interacting with one another, performing side-by-side activities, or conducting independent activities on their own. 

The researchers then asked more than 350 AI language, video, and image models to predict how humans would judge the videos and how their brains would respond to watching. For large language models, the researchers had the AIs evaluate short, human-written captions. 

Participants, for the most part, agreed with each other on all the questions; the AI models, regardless of size or the data they were trained on, did not. Video models were unable to accurately describe what people were doing in the videos. Even image models that were given a series of still frames to analyze could not reliably predict whether people were communicating. Language models were better at predicting human behavior, while video models were better at predicting neural activity in the brain.

The results provide a sharp contrast to AI’s success in reading still images, the researchers said. 

“It’s not enough to just see an image and recognize objects and faces. That was the first step, which took us a long way in AI. But real life isn’t static. We need AI to understand the story that is unfolding in a scene. Understanding the relationships, context, and dynamics of social interactions is the next step, and this research suggests there might be a blind spot in AI model development,” Garcia said.

Researchers believe this is because AI neural networks were inspired by the infrastructure of the part of the brain that processes static images, which is different from the area of the brain that processes dynamic social scenes. 

“There’s a lot of nuances, but the big takeaway is none of the AI models can match human brain and behavior responses to scenes across the board, like they do for static scenes,” Isik said. “I think there’s something fundamental about the way humans are processing scenes that these models are missing.”


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.