CW Pakistan

Google FACTS Benchmark Reveals AI Chatbots Struggle With Factual Accuracy

  • December 23, 2025

Google has published new findings that raise questions about the reliability of modern AI chatbots, showing that even the most advanced systems often struggle to deliver factually correct information. Using its newly developed FACTS Benchmark Suite, the company found that leading AI models fail to surpass a 70% accuracy rate, even when presenting answers with high confidence. Gemini 3 Pro led the pack with a score of 69%, while other widely used models from OpenAI, Anthropic, and xAI scored lower, demonstrating that factual errors remain a persistent challenge in current AI technology.

The FACTS Benchmark was created to address a critical gap in how AI performance is evaluated. Traditional assessments often focus on task completion or fluency rather than factual correctness. This distinction is particularly important for sectors such as healthcare, finance, and law, where inaccuracies can have serious consequences. An AI chatbot might produce responses that sound authoritative, yet contain errors that mislead users who assume the output is fully reliable. Google emphasized that the benchmark tests models in ways that go beyond surface-level performance to evaluate the accuracy and trustworthiness of their outputs.

The FACTS Benchmark Suite evaluates AI performance across four key categories. Parametric knowledge tests whether a model can correctly answer questions using information learned during training. Search performance examines the model’s ability to retrieve accurate information using web tools. Grounding evaluates whether the chatbot can stay faithful to a given document without adding false details. Finally, multimodal understanding measures a model’s ability to interpret charts, diagrams, and images accurately. Google’s tests revealed that multimodal tasks remain the weakest area across all models, with accuracy frequently dropping below 50%, highlighting the risk of confidently presenting incorrect numerical data or misinterpreted visuals.
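A grounding check of the kind described above can be approximated with a toy heuristic: flag answer sentences whose content words are mostly absent from the source document. This is an illustrative sketch only, not Google's FACTS methodology; the function name, threshold, and sample strings are invented for the example.

```python
import re

def ungrounded_sentences(document: str, answer: str, threshold: float = 0.5):
    """Return answer sentences poorly supported by the document (toy heuristic)."""
    doc_words = set(re.findall(r"[a-z']+", document.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        # Fraction of the sentence's words that also appear in the document.
        support = sum(w in doc_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged

doc = "The FACTS suite tests parametric knowledge, search, grounding and multimodal understanding."
ans = "The suite tests grounding. It was invented in 1895 by a butler."
flagged = ungrounded_sentences(doc, ans)
print(flagged)  # only the fabricated second sentence is flagged
```

Real grounding evaluation is far more sophisticated (entailment models rather than word overlap), but the sketch conveys the core idea: faithfulness is measured against a supplied document, not against the model's own training data.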

The benchmark results illustrate notable performance differences between AI systems. Gemini 3 Pro achieved the highest overall score at 69%, followed closely by Gemini 2.5 Pro and OpenAI’s ChatGPT-5 at around 62%. Anthropic’s Claude 4.5 Opus scored roughly 51%, while xAI’s Grok 4 reached approximately 54%. Google noted that even the top-performing models make errors in roughly one out of every three responses. These findings underscore the importance of human oversight, especially in areas where factual accuracy is critical. Google clarified that these results do not diminish the value of AI chatbots, but they highlight the ongoing need for safeguards, verification processes, and careful validation to ensure reliable use in professional or high-stakes environments.
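The reported figures can be tabulated alongside the error rate they imply per response; the scores are those cited above, while the one-in-N calculation is a simple arithmetic sketch, not part of Google's methodology.

```python
# Reported FACTS Benchmark accuracy scores (as cited in the article).
scores = {
    "Gemini 3 Pro": 0.69,
    "Gemini 2.5 Pro": 0.62,
    "ChatGPT-5": 0.62,
    "Grok 4": 0.54,
    "Claude 4.5 Opus": 0.51,
}

for model, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    error = 1 - acc  # fraction of responses containing an error
    # At 69% accuracy, roughly one in every three responses is wrong.
    print(f"{model:16s} accuracy {acc:.0%}, ~1 error in {1 / error:.1f} responses")
```

Even the top score leaves an error rate near one in three, which is the basis for the article's caution about human oversight.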

As AI systems continue to evolve, the FACTS Benchmark provides a standardized way to evaluate not only whether models can complete tasks but also whether they provide correct and trustworthy information. For now, users are advised to treat AI-generated content with caution, particularly in fields where incorrect data could lead to serious errors or financial and legal consequences. Google’s findings make it clear that while AI chatbots are improving rapidly, human judgment remains a critical part of ensuring the accuracy and reliability of AI-assisted workflows.

Follow the SPIN IDG WhatsApp Channel for updates across the Smart Pakistan Insights Network covering all of Pakistan’s technology ecosystem. 

Related Topics
  • AI
  • AI Accuracy
  • AI Factual Errors
  • Artificial Intelligence
  • chatbots
  • ChatGPT-5
  • FACTS Benchmark
  • Google Gemini 3 Pro
  • multimodal AI