Results for "AI evaluation"
406 results found

Why AI Evaluation Startups Struggle to Survive
Eval startups face commoditization, open-source pressure and weak business models. A look at why so many fail to scale.

AI Benchmark Prompt for GeoGuessr Fails After Model Update
A well-known prompt used to test AI geography skills no longer works on the o3 model, sparking debate about benchmark reliability and model drift.

DeepMind Veteran Warns AI Benchmarks Are Not Enough
A former DeepMind researcher warns that current benchmarks fail to ensure AI safety. The call for new evaluation methods comes as AI systems grow more powerful.

AI Coding Benchmarks Overlook Long-Term Code Health Risks
Current AI coding benchmarks measure one-shot performance but ignore quality erosion from repeated edits. This oversight could lead to unmaintainable codebases at scale.

AI therapy startup claims 95% safety score in mental health benchmark
The Path claims its AI model scored 95 on the Vera-MH safety benchmark, far above rivals like ChatGPT. The startup was co-founded by Tony Robbins and Calm veterans.

Large Context Windows in AI Models Face Reliability Challenges
New research reveals that large context windows in AI models suffer from significant performance degradation, raising questions about their practical utility for complex tasks.

AI coding boom creates production chaos, Resolve AI launches multi-agent fix
Resolve AI expands its platform with multi-agent investigation to tackle production failures caused by rapid AI code generation. The system uses coordinated agents that verify each other's findings.

Open-source coding model NousCoder-14B matches big rivals in just 4 days
An open-source AI coding model trained in four days matches proprietary systems, highlighting the rapid progress of open-source alternatives in AI-assisted software development.

The New Complexity of Large Language Models
Large language models are growing more complex with new architectures and techniques. This shift has implications for performance, interpretability, and the future of AI research.

Antigravity 2.0 Dominates First OpenSCAD 3D LLM Benchmark
Antigravity 2.0 tops the OpenSCAD Architectural 3D LLM Benchmark, demonstrating superior ability to generate valid 3D models from natural language prompts.

Claude AI's free tier tightens as Anthropic shifts focus to paid subscribers
Anthropic has quietly reduced free access to Claude AI, capping daily messages and reserving faster models for paying users.

Enterprise AI Matures as Autonomous Agents Draw Record Investment
A specialist AI agent company raised $950M at a $15B valuation, signaling a shift from workflow automation to autonomous enterprises. The investment reflects confidence in AI agents.

Enterprise AI Investment Reaches Measurable Returns, Google Cloud Reports
Google Cloud VP says companies are seeing ROI from AI, signaling a shift from pilots to production. This marks a potential tipping point for enterprise AI adoption.

Military Smart Glasses Let Soldiers Order Drone Strikes With Eye Tracking
Anduril and Meta are developing AR headsets that use eye-tracking and AI to order drone strikes. The systems face technical and attention hurdles before potential production in 2028.

Why Some Users Want Search Engines to Stop Thinking for Them
AI summaries in search results frustrate users who prefer traditional link lists. Critics argue search engines should retrieve information, not interpret it.

PostgresBench Brings Reproducible Testing to Cloud Database Choices
A new open-source benchmark, PostgresBench, aims to standardize performance testing for PostgreSQL services. It offers reproducible results across self-managed and cloud providers, helping developers make informed infrastructure decisions.

Anthropic Files for IPO, Signaling Escalation in AI Public Market Race
Anthropic has confidentially filed for an IPO, following a $65 billion funding round that valued it at $965 billion. The move intensifies the race among top AI companies to go public.

Secretive AI Startup Hark Raises $700M at $6 Billion Valuation
Hark, Brett Adcock's stealth AI startup, raised a $700M Series A, valuing the 'universal' interface company at $6 billion.

China Robotics Investment Hits Record as Embodied AI Startups Attract Billions
China-based robotics startups raised $5.6 billion through mid-2026, matching the 2021 peak. Embodied AI companies drive the surge, with several startups reaching billion-dollar valuations.

VCs Warn AI Frenzy Fuels Dangerous Groupthink Among Startups
Top venture capitalists see an AI funding bubble with young founders raising millions easily. They warn of groupthink and inflated valuations in the startup ecosystem.