Results for "AI evaluation"

406 results found

Startups / Funding

Why AI Evaluation Startups Struggle to Survive

Eval startups face commoditization, open-source pressure and weak business models. A look at why so many fail to scale.

Jun 24, 2026 · 3 min read

AI / Machine Learning

AI Benchmark Prompt for GeoGuessr Fails After Model Update

A well-known prompt used to test AI geography skills no longer works on the O3 model, sparking debate about benchmark reliability and model drift.

May 21, 2026 · 2 min read

AI / Machine Learning

DeepMind Veteran Warns AI Benchmarks Are Not Enough

A former DeepMind researcher warns that current benchmarks fail to ensure AI safety. The call for new evaluation methods comes as AI systems grow more powerful.

May 22, 2026 · 3 min read

AI / Machine Learning

AI Coding Benchmarks Overlook Long-Term Code Health Risks

Current AI coding benchmarks measure one-shot performance but ignore quality erosion from repeated edits. This oversight could lead to unmaintainable codebases at scale.

May 21, 2026 · 3 min read

AI / Machine Learning

AI therapy startup claims 95% safety score in mental health benchmark

The Path claims its AI model scored 95 on the Vera-MH safety benchmark, far above rivals like ChatGPT. The startup was co-founded by Tony Robbins and Calm veterans.

May 21, 2026 · 3 min read

AI / Machine Learning

Large Context Windows in AI Models Face Reliability Challenges

New research reveals that large context windows in AI models suffer from significant performance degradation, raising questions about their practical utility for complex tasks.

Jun 14, 2026 · 2 min read

AI / Machine Learning

AI coding boom creates production chaos, Resolve AI launches multi-agent fix

Resolve AI expands its platform with multi-agent investigation to tackle production failures caused by rapid AI code generation. The system uses coordinated agents that verify each other's findings.

May 21, 2026 · 3 min read

AI / Machine Learning

Open-source coding model NousCoder-14B matches big rivals in just 4 days

An open-source AI coding model trained in four days matches proprietary systems, highlighting the rapid progress of open-source alternatives in AI-assisted software development.

May 19, 2026 · 2 min read

AI / Machine Learning

The New Complexity of Large Language Models

Large language models are growing more complex with new architectures and techniques. This shift has implications for performance, interpretability, and the future of AI research.

Jun 20, 2026 · 3 min read

AI / Machine Learning

Antigravity 2.0 Dominates First OpenSCAD 3D LLM Benchmark

Antigravity 2.0 tops the OpenSCAD Architectural 3D LLM Benchmark, demonstrating superior ability to generate valid 3D models from natural language prompts.

May 22, 2026 · 3 min read

AI / Machine Learning

Claude AI's free tier tightens as Anthropic shifts focus to paid subscribers

Anthropic has quietly reduced free access to Claude AI, capping daily messages and reserving faster models for paying users.

Jun 3, 2026 · 3 min read

AI / Machine Learning

Enterprise AI Matures as Autonomous Agents Draw Record Investment

A specialist AI agent company raised $950M at a $15B valuation, signaling a shift from workflow automation to autonomous enterprises. The investment reflects confidence in AI agents.

Jun 12, 2026 · 2 min read

Big Tech

Enterprise AI Investment Reaches Measurable Returns, Google Cloud Reports

Google Cloud VP says companies are seeing ROI from AI, signaling a shift from pilots to production. This marks a potential tipping point for enterprise AI adoption.

Jun 20, 2026 · 3 min read

AI / Machine Learning

Military Smart Glasses Let Soldiers Order Drone Strikes With Eye Tracking

Anduril and Meta are developing AR headsets that let soldiers use eye tracking and AI to order drone strikes. The systems face technical and attention-related hurdles before potential production in 2028.

May 25, 2026 · 3 min read

Big Tech

Why Some Users Want Search Engines to Stop Thinking for Them

AI summaries in search results frustrate users who prefer traditional link lists. Critics argue search engines should retrieve information, not interpret it.

Jun 3, 2026 · 3 min read

Software Development

PostgresBench Brings Reproducible Testing to Cloud Database Choices

A new open-source benchmark, PostgresBench, aims to standardize performance testing for PostgreSQL services. It offers reproducible results across self-managed and cloud providers, helping developers make informed infrastructure decisions.

Jun 21, 2026 · 2 min read

Startups / Funding

Anthropic Files for IPO, Signaling Escalation in AI Public Market Race

Anthropic has confidentially filed for an IPO, following a $65 billion funding round that valued it at $965 billion. The move intensifies the race among top AI companies to go public.

Jun 2, 2026 · 3 min read

Startups / Funding

Secretive AI Startup Hark Raises $700M at $6 Billion Valuation

Hark, Brett Adcock's stealth AI startup, raised a massive $700M Series A, valuing the 'universal' interface company at $6 billion.

May 21, 2026 · 2 min read

Startups / Funding

China Robotics Investment Hits Record as Embodied AI Startups Attract Billions

China-based robotics startups raised $5.6 billion through mid-2026, matching the 2021 peak. Embodied AI companies drive the surge, with several startups reaching billion-dollar valuations.

May 20, 2026 · 3 min read

Startups / Funding

VCs Warn AI Frenzy Fuels Dangerous Groupthink Among Startups

Top venture capitalists see an AI funding bubble with young founders raising millions easily. They warn of groupthink and inflated valuations in the startup ecosystem.

May 30, 2026 · 2 min read