The "First AI Software Engineer" Is Bungling the Vast Majority of Tasks It's Asked to Do
Researchers have found that AI tech company Cognition's Devin, which it claims to be the "first AI software engineer," is astonishingly bad at its job.
In a recent analysis, first spotted by The Register, a team of machine learning data scientists behind the independent AI research and development lab Answer.AI spent a month with the AI assistant, concluding that despite almost a year of hype, it "rarely worked."
"Out of 20 tasks we attempted, we saw 14 failures, three inconclusive results, and just three successes," the researchers found — a meager success rate of just 15 percent.
Sure, we've all had coworkers like that. But for tech that's supposed to represent the future, it's not inspiring confidence.
"More concerning was our inability to predict which tasks would succeed," the team wrote. "Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability — Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers."
For instance, Devin was asked to deploy multiple applications to a deployment platform called Railway, but instead of realizing it was "not actually possible to do this," Devin "marched forward and tried to do this and hallucinated some things about how to interact with Railway."
The results highlight that despite Cognition AI's boisterous marketing about Devin being able to "build and deploy apps end to end" when the tool was first introduced in March 2024, the tech is still struggling with some fundamental problems.
It's a pertinent topic, with Meta CEO Mark Zuckerberg recently announcing that he intends to replace "midlevel engineers" with AI as soon as this year. OpenAI is also rumored to "announce a next-level breakthrough that unleashes PhD-level super-agents to do complex human tasks," according to a recent column by Axios cofounder Mike Allen and CEO Jim VandeHei.
But whether the tech will actually live up to the hype and be ready to start replacing human workers in such a tight time frame — or even at all — remains an open question.
Devin is an amalgamation of several AI models that operates through the messaging platform Slack and has access to an entire computing environment, including a web browser, code editor, and terminal.
Devin was only made available to a select group of users when it was first announced, but saw a much wider release last month, starting at a steep $500 a month for "engineering teams."
As the Answer.AI team points out, early demos of the AI assistant were impressive. In a March video, Cognition claimed Devin could be used to "make money taking on messy tasks" on the freelancing platform Upwork.
It didn't take long for researchers to call foul, with a number of software developers analyzing Cognition's video and accusing the company of "lying" about its claims.
"All of this stuff makes it look like Devin did a bunch of work," said software engineer Carl Brown from the YouTube channel Internet of Bugs in an April video. "It makes it look like Devin accomplished a lot of stuff."
"So it is honestly, as far as I'm concerned, kind of impressive," he added. "But in the context of what an Upwork job should have been, and especially in the context of a bunch of people saying that Devin is 'taking jobs off of Upwork and doing them,' and especially in the context of the company saying that this video will let us watch Devin get paid for doing work, which is, again, just a lie."
Both Answer.AI and Brown found that Devin also took far longer than any human coder when completing tasks.
"Tasks that seemed straightforward often took days rather than hours," the Answer.AI researchers wrote, "with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions."
In short, Cognition's Devin highlights the often wide gap between AI companies' claims and reality, which has plagued the industry for years now.
So whether an AI assistant will ever be able to competently replace a software engineer — without causing any major headaches for its human coworkers, at least — remains to be seen.
More on replacing workers with AI: CEO Who Bragged About Replacing Workers With AI Now Distressed That AI Will Replace His Job Too
The post The "First AI Software Engineer" Is Bungling the Vast Majority of Tasks It's Asked to Do appeared first on Futurism.