AI still can’t beat humans at reading social cues
Despite rapid advancement, AI models still struggle to identify how people interact with each other.

AI models have progressed rapidly in recent years and can already outperform humans in various tasks, from generating basic code to dominating games like chess and Go. But despite massive computing power and billions of dollars in investor funding, these advanced models still can’t hold up to humans when it comes to truly understanding how real people interact with one another in the world. In other words, AI still fundamentally struggles at “reading the room.”
That’s the claim made in a new paper by researchers from Johns Hopkins University. In the study, researchers asked a group of human volunteers to watch three-second video clips and rate the various ways individuals in those videos were interacting with one another. They then tasked more than 350 AI models—including image, video, and language-based systems—with predicting how the humans had rated those interactions. While the humans completed the task with ease, the AI models, regardless of their training data, struggled to accurately interpret what was happening in the clips. The researchers say their findings suggest that AI models still have significant difficulty understanding human social cues in real-world environments. That insight could have major implications for the growing industry of AI-enabled driverless cars and robots, which inherently need to navigate the physical world alongside people.
“Anytime you want an AI system to interact with humans, you want to be able to know what those humans are doing and what groups of humans are doing with each other,” Johns Hopkins University assistant professor of cognitive science and paper lead author Leyla Isik told Popular Science. “This really highlights how a lot of these models fall short on those tasks.”
Isik will present the research findings today at the International Conference on Learning Representations.
Human observers reached consensus while AI models were all over the place
Though previous research has shown that AI models can accurately describe what’s happening in still images at a level comparable to humans, this study aimed to see whether that still holds true for video. To do that, Isik says she and her fellow researchers selected hundreds of videos from a computer vision dataset and clipped them down to three seconds each. They then narrowed the sample to include only videos featuring two humans interacting. Human volunteers viewed these clips and answered a series of questions about what was happening, rating each on a scale from 1 to 5. The questions ranged from objective prompts like “Do you think these bodies are facing each other?” to more subjective ones, such as whether the interaction appeared emotionally positive or negative.
In general, the human respondents gave similar ratings, suggesting that people share a basic observational understanding of social interactions.
The researchers then posed similar types of questions to image, video, and language models. (The language models were given human-written captions to analyze instead of raw video.) Across the board, the AI models failed to demonstrate the same level of consensus as the human participants. The language models generally performed better than the image and video models, but Isik notes that may be partly because they were analyzing captions that were already quite descriptive.
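To get a concrete sense of what “consensus” means here, the comparison boils down to checking how closely each rater’s scores track everyone else’s, and how closely a model’s scores track the human average. The short Python sketch below is a hypothetical illustration of that idea using made-up 1-to-5 ratings and simple correlations; it is not the researchers’ actual analysis, and every number and function name in it is invented for the example.

```python
import numpy as np

# Hypothetical 1-5 ratings: rows are raters (or models), columns are video clips.
human_ratings = np.array([
    [4, 1, 5, 2, 3],
    [5, 1, 4, 2, 3],
    [4, 2, 5, 1, 3],
])
model_ratings = np.array([
    [2, 4, 3, 5, 1],
    [5, 5, 1, 2, 3],
    [3, 2, 4, 3, 5],
])

def mean_agreement(ratings: np.ndarray) -> float:
    """Average correlation of each rater with the leave-one-out group mean."""
    corrs = []
    for i in range(len(ratings)):
        others = np.delete(ratings, i, axis=0).mean(axis=0)
        corrs.append(np.corrcoef(ratings[i], others)[0, 1])
    return float(np.mean(corrs))

def model_vs_human(model: np.ndarray, humans: np.ndarray) -> float:
    """Correlation between one model's ratings and the human consensus (mean rating)."""
    return float(np.corrcoef(model, humans.mean(axis=0))[0, 1])

print("human-human agreement:", round(mean_agreement(human_ratings), 2))
for i, m in enumerate(model_ratings):
    print(f"model {i} vs human consensus:", round(model_vs_human(m, human_ratings), 2))
```

In a toy setup like this, tightly clustered human ratings produce a high human-human agreement score, while models whose scores do not track the human average land near zero or negative, which is the qualitative pattern the study describes.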
The researchers primarily evaluated open-access models, some of which were several years old. The study did not include the latest models recently released by leading AI companies like OpenAI and Anthropic. Still, the stark contrast between human and AI responses suggests there may be something fundamentally different about how models and humans process social and contextual information.
“It’s not enough to just see an image and recognize objects and faces,” Johns Hopkins University doctoral student and paper co-author Kathy Garcia said in a statement. “We need AI to understand the story that is unfolding in a scene. Understanding the relationships, context, and dynamics of social interactions is the next step, and this research suggests there might be a blind spot in AI model development.”
Understanding human social dynamics will be critical for “embodied AI”
The findings come as tech companies race to integrate AI into an increasing number of physical robots—a concept often referred to as “embodied AI.” Cities like Los Angeles, Phoenix, and Austin have become test beds for this new era thanks to the increasing presence of driverless Waymo robotaxis sharing the roads with human-driven vehicles. Limited understanding of complex environments has led some driverless cars to behave erratically or even get stuck in loops, driving in circles. While some recent studies suggest that driverless vehicles may currently be less prone to accidents than the average human driver, federal regulators have nonetheless opened investigations into Waymo and Amazon-owned Zoox for driving behavior that allegedly violated safety laws.
Other companies—like Figure AI, Boston Dynamics, and Tesla—are taking things a step further by developing AI-enabled humanoid robots designed to work alongside humans in manufacturing environments. Figure has already signed a deal with BMW to deploy one of its bipedal models at a facility in South Carolina, though its exact purpose remains somewhat vague. In these settings, properly understanding human social cues and context is even more critical, as even small misjudgments of intention carry a risk of injury. Some experts have even suggested that advanced humanoid robots could one day assist with elder and child care. Isik suggested the study’s results mean there are still several steps to take before that vision becomes a reality.
“[The research] really highlights the importance of bringing neuroscience, cognitive science, and AI into these more dynamic real-world settings,” Isik said.