Results for "evaluation methods"
19 results found

DeepMind Veteran Warns AI Benchmarks Are Not Enough
A former DeepMind researcher warns that current benchmarks fail to ensure AI safety. The call for new evaluation methods comes as AI systems grow more powerful.

AI therapy startup claims 95% safety score in mental health benchmark
The Path claims its AI model scored 95 on the Vera-MH safety benchmark, far above rivals like ChatGPT. The startup was co-founded by Tony Robbins and Calm veterans.

How Pull Requests Are Replacing Whiteboards in Tech Hiring
A growing number of tech companies are replacing traditional whiteboard interviews with real-world coding tasks using pull requests. This shift aims to evaluate candidates more fairly and accurately.

Open-source coding model NousCoder-14B matches big rivals in just 4 days
An open-source AI coding model trained in four days matches proprietary systems, highlighting the rapid progress of open-source alternatives in AI-assisted software development.

AI Coding Benchmarks Overlook Long-Term Code Health Risks
Current AI coding benchmarks measure one-shot performance but ignore quality erosion from repeated edits. This oversight could lead to unmaintainable codebases at scale.

Antigravity 2.0 Dominates First OpenSCAD 3D LLM Benchmark
Antigravity 2.0 tops the OpenSCAD Architectural 3D LLM Benchmark, demonstrating superior ability to generate valid 3D models from natural language prompts.

AI Benchmark Prompt for GeoGuessr Fails After Model Update
A well-known prompt used to test AI geography skills no longer works on the o3 model, sparking debate about benchmark reliability and model drift.

France Leads EU's Charge Away From US Tech Giants
France is replacing Zoom and Microsoft Teams with homegrown tools, and other EU countries are following. The Trump-era push for digital sovereignty is reshaping Europe's tech landscape.

AI coding boom creates production chaos, Resolve AI launches multi-agent fix
Resolve AI expands its platform with multi-agent investigation to tackle production failures caused by rapid AI code generation. The system uses coordinated agents that verify each other's findings.

Mercury Hits $5.2B Valuation as Fintech Startup Pursues Own Banking License
Mercury raised $200M at a $5.2B valuation and secured regulatory approval to establish its own bank. The digital banking startup serves over 300,000 companies and reported $650M in annualized revenue.

Fresha Reaches $1 Billion Valuation With KKR Investment
Beauty booking platform Fresha secured $80M from KKR, pushing its valuation to $1 billion. The funding underscores growth in service marketplace tech.

Secretive AI Startup Hark Raises $700M at $6 Billion Valuation
Hark, Brett Adcock's stealth AI startup, raised a massive $700M Series A, valuing the 'universal' interface company at $6 billion.

General Catalyst bets $63M on India's travel payments startup Scapia
General Catalyst leads a $63 million funding round in Scapia, an Indian travel booking and payments startup. The investment doubles the company's valuation.

Typewise Hires AI Growth Engineer as Startup Expands Reach
Typewise, the YC-backed keyboard startup, is hiring an AI Growth Engineer, based in Zurich or remote. The move signals a push to integrate AI into growth and product development.

Quantum Physics, AI Join Forces to Supercharge Enzyme Engineering
Imperagen raises £5 million to blend quantum physics simulations with AI for faster, more precise enzyme design, aiming to green industrial processes.

China Robotics Investment Hits Record as Embodied AI Startups Attract Billions
China-based robotics startups raised $5.6 billion through mid-2026, matching the 2021 peak. Embodied AI companies drive the surge, with several startups reaching billion-dollar valuations.

SoftBank CEO's Recent Bets Raise Alarm Among Executives
Insiders at SoftBank worry that Masayoshi Son's recent investment decisions signal a losing streak. The CEO known for bold bets may be overpaying for deals.

Why Autonomous AI Fails Without a Body-Like Feedback System
AI systems that rely on pure autonomy often fail. A new framework compares AI to the human body, arguing that feedback loops build trust.

Google’s Gemini Voice Push Redefines How We Talk to AI
Google is leaning into voice interaction with Gemini, encouraging users to speak naturally. The shift capitalizes on voice dictation’s popularity and aims to make AI conversations feel human.