Can AI Really Replace Humans In Real World Jobs? This New Study Says Not Yet.

You can hardly open a website these days without reading about how artificial intelligence (AI) is advancing and will soon be taking everyone’s jobs. While it is undeniable that large language model AI systems are impressive, many people question whether they can actually perform as a human would in a real work environment.
With this question in mind, a group of researchers conducted a study in which they simulated a company staffed entirely by AI agents to see how well the agents would complete common tasks. The whole company would be run by AI in a contained environment, so the agents would have to work together to complete many different types of projects. The company itself was set up as a software company, but the agents would also have to handle HR tasks and the other everyday work that comes with running a full company.
In the paper, the authors explain:
“To measure the progress of these LLM [large language model] agents’ performance on performing real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company.”
The team ran the simulation with several different large language models to see which ones performed best and which ones were lacking. The instructions that the AI agents had to follow were written in plain English, just as if they were coming from a human. The agents’ performance was measured at various checkpoints to see how they were progressing. They would also evaluate the company’s success financially to see just how it could compete with a human-based company.

Perhaps not surprisingly, the AI agents didn’t live up to the hype. Not only were they unable to complete many of the tasks, they actually procrastinated and straight up lied to try to convince the system that they were getting the work done. I suppose in that way, they are more like humans than one might expect.
The team writes in the study:
“Interestingly, we find that for some tasks, when the agent is not clear what the next steps should be, it sometimes try to be clever and create fake ‘shortcuts’ that omit the hard part of a task. For example, during the execution of one task, the agent cannot find the right person to ask questions on RocketChat. As a result, it then decides to create a shortcut solution by renaming another user to the name of the intended user.”
Of course, humans have had a very long time to get good at the types of things that are required in a job. Large language model AIs are still very new, and they are progressing extremely quickly. In this study, however, it is clear that none of the AI agents earned a passing grade. Even the best-performing models would fall firmly in the “Does Not Meet” category in a traditional performance review.
The authors report:
“We can see that the Claude-3.5-Sonnet is the clear winner across all models. However, even with the strongest frontier model, it only manages to complete 24% of the total tasks and achieves a score of 34.4% taking into account partial completion credits. Note that this result comes at a cost: It requires an average of almost 30 steps and more than $6 to complete each task, making it the most expensive model to run both in time and in cost.”
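For readers wondering how a model can finish only about a quarter of its tasks yet post a higher score once partial credit is counted, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the `TaskResult` class, the checkpoint counts, and the 50/50 split between checkpoint progress and full completion are invented here and are not taken from this article or presented as the study's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task: checkpoints passed out of a total."""
    checkpoints_passed: int
    checkpoints_total: int

    @property
    def completed(self) -> bool:
        # A task counts as fully completed only if every checkpoint passed.
        return self.checkpoints_passed == self.checkpoints_total


def partial_score(result: TaskResult, completion_weight: float = 0.5) -> float:
    """Blend checkpoint progress with a bonus for finishing the whole task.

    The 0.5 weighting is an assumed value chosen for this illustration.
    """
    progress = result.checkpoints_passed / result.checkpoints_total
    finished = 1.0 if result.completed else 0.0
    return (1 - completion_weight) * progress + completion_weight * finished


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (share of tasks fully completed, average partial-credit score)."""
    n = len(results)
    completion_rate = sum(r.completed for r in results) / n
    average_score = sum(partial_score(r) for r in results) / n
    return completion_rate, average_score


# Made-up results for four tasks; these are not data from the study.
results = [
    TaskResult(4, 4),  # finished
    TaskResult(2, 4),  # got halfway
    TaskResult(0, 5),  # made no progress
    TaskResult(1, 4),  # passed one checkpoint
]

rate, score = summarize(results)
print(f"fully completed: {rate:.1%}, score with partial credit: {score:.1%}")
```

In this invented example only one of the four tasks is finished, but the half-done work on the others still pulls the overall score above the completion rate, which is the same kind of gap the authors describe between their two figures.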

It will be interesting to see whether this type of study is repeated in a few years (or even sooner) to show how far AI has advanced.
For now, the study is available on the pre-print server arXiv and it has yet to be peer reviewed.



