Results for "AI benchmark"
115 results found

AI Benchmark Prompt for GeoGuessr Fails After Model Update
A well-known prompt used to test AI geography skills no longer works on the O3 model, prompting debate about benchmark reliability and model drift.

DeepMind Veteran Warns AI Benchmarks Are Not Enough
A former DeepMind researcher warns that current benchmarks fail to ensure AI safety. The call for new evaluation methods comes as AI systems grow more powerful.

AI Coding Benchmarks Overlook Long-Term Code Health Risks
Current AI coding benchmarks measure one-shot performance but ignore quality erosion from repeated edits. This oversight could lead to unmaintainable codebases at scale.

AI therapy startup claims 95% safety score in mental health benchmark
The Path claims its AI model scored 95 on the Vera-MH safety benchmark, far above rivals like ChatGPT. The startup was co-founded by Tony Robbins and Calm veterans.

Antigravity 2.0 Dominates First OpenSCAD 3D LLM Benchmark
Antigravity 2.0 tops the OpenSCAD Architectural 3D LLM Benchmark, demonstrating superior ability to generate valid 3D models from natural language prompts.

Google's Gemini 3.5 Flash Reshapes Enterprise AI Cost Equation
Google claims its new Gemini 3.5 Flash model can save enterprises over $1 billion annually by delivering near-frontier performance at triple the speed and half the cost.

AI IQ site ignites debate by scoring large language models on the bell curve
A startup called AI IQ is assigning IQ scores to over 50 AI models. The project draws praise for clarity and criticism for oversimplifying machine intelligence.

AI Bots Fool Nearly Half of Participants in New Online Test
Surfshark's experiment reveals 47% of people can't tell AI bots from humans online. The test challenges users to identify bots in simulated social interactions.

AI coding boom creates production chaos, Resolve AI launches multi-agent fix
Resolve AI expands its platform with multi-agent investigation to tackle production failures caused by rapid AI code generation. The system uses coordinated agents that verify each other's findings.

SpaceX Acquires xAI, Declares AI Its Core Business Ahead of IPO
SpaceX's IPO filing reveals AI as its primary market, projecting $26.5 trillion opportunity. The company positioned Grok against OpenAI and Anthropic.

Enterprises stuck in AI's 'chat phase' as gap between insight and action widens
Many enterprises use AI only for chat and queries, failing to translate insights into business outcomes. A shift toward integrated execution is critical.

Cerebras wafer-scale chip runs trillion-parameter model 7x faster than GPU clouds
Cerebras claims its wafer-scale chip runs a trillion-parameter AI model nearly seven times faster than GPU-based clouds, challenging Nvidia's dominance in inference.

Open-source coding model NousCoder-14B matches big rivals in just 4 days
An open-source AI coding model trained in four days matches proprietary systems, highlighting the rapid progress of open-source alternatives in AI-assisted software development.

OpenClaw AI Agent Steps Into the Physical World With a Robot Body
An AI coding agent named OpenClaw has been given a physical robot body, demonstrating how AI models can simplify robot building and deployment.

Anthropic Surpasses OpenAI in Corporate AI Adoption for First Time
Anthropic's Claude overtakes OpenAI's ChatGPT in business AI adoption. But escalating costs and competition threaten its lead.

Why Autonomous AI Fails Without a Body-Like Feedback System
AI systems that rely on pure autonomy often fail. A new framework compares AI to the human body, arguing that feedback loops build trust.

Salesforce Turns Slackbot Into a Full AI Agent for the Enterprise
Salesforce rebuilt Slackbot from a simple notification tool into an AI agent that searches data, drafts documents and takes actions, intensifying workplace AI competition.

AI demand forces a fundamental shift in enterprise data center strategy
Rising AI workloads are pushing companies to rethink infrastructure, moving from general-purpose servers to specialized GPU clusters and liquid-cooled data centers.

AI Outpaces Human Patching, Making Vulnerability Windows Obsolete
AI-powered bug detection finds vulnerabilities faster than humans can patch. The industry shifts from reactive patching to building resilient software from the start.

Anthropic Nears First Profit as AI Race Intensifies
Anthropic is set to report its first profitable quarter since founding in 2021, marking a milestone in the competitive AI landscape.