AI IQ site scores large language models on the bell curve, dividing tech

A startup called AI IQ is assigning IQ scores to over 50 AI models. The project draws praise for clarity and criticism for oversimplifying machine intelligence.

A new startup project called AI IQ is applying an old concept to artificial intelligence. The website scores dozens of frontier language models on the standard human IQ scale. It then plots them on a bell curve. The result has sparked intense debate across social media.

Some technology experts say the charts make a complex market easy to understand. Others argue the approach is misleading and dangerous. The core question: Can a single number capture a highly uneven set of AI capabilities?

How the scoring system works

AI IQ was created by Ryan Shea, an engineer and entrepreneur who co-founded the blockchain platform Stacks. He also invested in early unicorns such as OpenSea and Mercury.

The methodology groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic and academic. Each benchmark score is first mapped to an implied IQ using hand-calibrated difficulty curves, and the system is designed so that easier benchmarks do not inflate results. The composite IQ is then a simple average of the four dimension scores. Missing data is handled conservatively, and a model must have scores on at least two dimensions to receive a derived IQ.

The abstract dimension includes ARC-AGI-1 and ARC-AGI-2, the pattern-recognition tests for fluid intelligence. Mathematical reasoning uses FrontierMath, AIME and ProofBench. Programmatic reasoning relies on Terminal-Bench 2.0, SWE-Bench Verified and SciCode. Academic reasoning uses Humanity's Last Exam, CritPt and GPQA Diamond.
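The scoring steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not AI IQ's actual code: the function names, the curve anchor points, the piecewise-linear shape of the curve, the averaging of benchmarks within a dimension, and the choice to average only the dimensions that have data are all assumptions made for the example. The article does not say how the site's conservative handling of missing data works.

```python
from statistics import mean
from typing import Optional

# Dimension -> benchmarks, using the names given in the article.
DIMENSIONS = {
    "abstract": ["ARC-AGI-1", "ARC-AGI-2"],
    "mathematical": ["FrontierMath", "AIME", "ProofBench"],
    "programmatic": ["Terminal-Bench 2.0", "SWE-Bench Verified", "SciCode"],
    "academic": ["Humanity's Last Exam", "CritPt", "GPQA Diamond"],
}

# The article states a model needs scores on at least two dimensions.
MIN_DIMENSIONS = 2


def implied_iq(score: float, curve: list[tuple[float, float]]) -> float:
    """Map a benchmark score to an implied IQ.

    `curve` is a hypothetical list of distinct (benchmark_score, iq) anchor
    points standing in for a hand-calibrated difficulty curve. Values between
    anchors are linearly interpolated; values outside are clamped.
    """
    points = sorted(curve)
    if score <= points[0][0]:
        return points[0][1]
    if score >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= score <= x1:
            return y0 + (y1 - y0) * (score - x0) / (x1 - x0)
    raise ValueError("score not covered by curve")


def composite_iq(benchmark_iqs: dict[str, float]) -> Optional[float]:
    """Average per-dimension IQs into one composite.

    Assumption: each dimension score is the mean of whichever of its
    benchmarks have data, and the composite is the mean of the dimensions
    that have any data. Returns None below the two-dimension minimum.
    """
    dimension_scores = []
    for benchmarks in DIMENSIONS.values():
        available = [benchmark_iqs[b] for b in benchmarks if b in benchmark_iqs]
        if available:
            dimension_scores.append(mean(available))
    if len(dimension_scores) < MIN_DIMENSIONS:
        return None
    return mean(dimension_scores)
```

For example, a model with implied IQs on only ARC-AGI-1, ARC-AGI-2 and AIME would be averaged across two dimensions under these assumptions, while a model with a single AIME result would receive no composite at all.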

Top models converge tightly

As of mid-May 2026, OpenAI holds the top spot, with GPT-5.5 scoring near 136 IQ. It is closely followed by Anthropic's Opus 4.7 at roughly 132 and Google's Gemini 3.1 Pro near 131, which puts the top three within about five points of one another. The cluster of top models has never been tighter.
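To see where such scores fall on a bell curve, an IQ can be converted to a percentile. The sketch below assumes the convention used by most human IQ tests, a normal distribution with mean 100 and standard deviation 15. Whether AI IQ uses exactly these parameters is an assumption, and the function name is made up for illustration.

```python
from statistics import NormalDist

# Assumed convention: human IQ scale with mean 100 and standard deviation 15.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def iq_percentile(iq: float) -> float:
    """Return the fraction of the assumed human distribution scoring below `iq`."""
    return IQ_SCALE.cdf(iq)


# Approximate scores reported in the article.
for name, iq in [("GPT-5.5", 136), ("Opus 4.7", 132), ("Gemini 3.1 Pro", 131)]:
    print(f"{name}: IQ {iq} -> percentile {iq_percentile(iq):.1%}")
```

Under that assumption, all three leading scores land at about the 98th to 99th percentile of the human scale, so a gap of a few IQ points at the top corresponds to a very thin slice of the curve.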

Other independent rankings show the same compression. Visual Capitalist recently noted that the leaderboard is extremely crowded at the top. Below that cluster, Chinese models such as Kimi and DeepSeek populate a broad midfield. The charts reveal a widening gap between frontier models and the tiers below.

Why this matters

AI IQ matters to enterprise buyers, researchers and policymakers who rely on benchmarks to compare models. A single IQ score can oversimplify a model's real strengths and weaknesses, and businesses may make procurement decisions based on a misleading number. Researchers warn that compressing jagged AI abilities into one metric creates a false sense of precision. The debate highlights a broader challenge: how to measure progress in a field where no single standard exists.

Critics say the map is not the territory. Supporters say clear visualizations help people make sense of rapid advances. Both sides agree that the way we measure AI will shape how it is built and deployed.
