ChatGPT Beats Claude, Google’s Gemini, DeepSeek In Test Of AI Agents

By John Koetsier, Senior Contributor
Rating AI agents including ChatGPT’s o3, Claude from Anthropic, and Google’s Gemini on web search tasks
ChatGPT’s recent o3 AI model beat Anthropic’s Claude, Google’s Gemini, and Hangzhou-based DeepSeek in a test of AI agents for web research. But there’s still a considerable gap between human capabilities and the best AI agents.
Research firm FutureSearch put 11 major large language models through 89 messy, real-world research tasks and evaluated each model on its ability to find original sources, seek out data, gather evidence, compile findings, and validate claims.
The highest score achieved was 0.51 on a scale where an estimated “perfect” agent would hit about 0.8. That means even the best AI agents available now are easily outperformed by skilled human researchers.
“We can conclude that frontier agents … substantially underperform smart generalist researchers who are given ample time,” the study says.
FutureSearch’s full report charts how each of the 11 models scored.
Still, AI agents are rapidly improving. Based on the year-old GPT-4 Turbo’s score of 0.27, the researchers say that “about 45% of the gap between smart generalist researchers and frontier agents” was closed within a year of development.
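To see where that 45% figure comes from (a back-of-the-envelope check, assuming the study’s roughly 0.8 “perfect agent” score as the human-level target): the gap a year ago was about 0.8 − 0.27 = 0.53, today’s best score of 0.51 leaves a gap of 0.8 − 0.51 = 0.29, and (0.53 − 0.29) / 0.53 works out to roughly 45%.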
Free or cheap agents such as DeepSeek are also not far behind the paid, top-end AI agents. OpenAI’s o3 leads the pack, with Claude and Gemini close behind; for now, closed models are clearly superior for research-heavy tasks, but free and open-source models are increasingly capable.
All LLM-based AI agents still have major issues, however. They fall short of smart human researchers, especially on strategic planning, thoroughness, evaluating sources for quality, and “memory management”: they tend to forget earlier findings mid-task. A particular problem is that AI agents often engage in “satisficing,” accepting a good-enough answer rather than continuing to search for the highest-quality response.
That’s a core reason why ChatGPT’s o3 model came in first: it tended to validate its answers more thoroughly and was less likely to stop short of better available answers.
Since a single year of development closed almost half the gap between elite humans and the best AI agents, it may not be long before AI agents outperform even the best human researchers.
However, given ChatGPT’s recent challenges with a model update that proved too agreeable, it’s clear the path to improvement isn’t a straight line.
For now, at least, it remains essential to double-check results from generative AI applications, AI agents included, to ensure accuracy.
