A new benchmark for large language models in architectural 3D design has a clear winner. Antigravity 2.0 outperformed every other model tested at generating valid OpenSCAD code for complex structures. The results point to measurable progress for generative AI in computer-aided design.
Benchmark Details
The OpenSCAD Architectural 3D LLM Benchmark tests how well language models can produce OpenSCAD scripts from natural language descriptions. OpenSCAD is a script-based 3D modeling language widely used in parametric design. The benchmark evaluates correctness, completeness and efficiency of generated code across dozens of architectural tasks such as walls, roofs and complex facades.
Antigravity 2.0 achieved the highest overall score, beating models from major AI labs and open-source projects. It excelled at handling multi-part assemblies and avoiding syntax errors. The benchmark authors noted that Antigravity 2.0’s output required the fewest manual corrections.
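As a rough illustration of how the three criteria named above (correctness, completeness and efficiency) could be rolled up into an overall score, here is a minimal Python sketch. The weights, the `TaskResult` type and both function names are assumptions made for this example; the benchmark's actual scoring formula is not described here.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical weights chosen for illustration only; the benchmark's
# real weighting of the three criteria is not known.
WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "efficiency": 0.2}


@dataclass
class TaskResult:
    """Scores for one architectural task, each in the range 0.0 to 1.0."""
    task: str            # e.g. "wall", "roof", "facade"
    correctness: float   # does the script render the intended geometry?
    completeness: float  # are all requested parts present?
    efficiency: float    # how compact or fast is the generated code?


def task_score(result: TaskResult) -> float:
    """Weighted sum of the three criteria for a single task."""
    return (WEIGHTS["correctness"] * result.correctness
            + WEIGHTS["completeness"] * result.completeness
            + WEIGHTS["efficiency"] * result.efficiency)


def overall_score(results: list[TaskResult]) -> float:
    """Unweighted mean of per-task scores across the whole task set."""
    if not results:
        raise ValueError("at least one task result is required")
    return mean([task_score(r) for r in results])
```

Under these assumed weights, a model that is fully correct on a task but scores zero on the other two criteria would receive half credit for that task.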
Why This Matters
Architects and designers increasingly use AI to speed up modeling workflows. A model that reliably translates plain language into usable OpenSCAD code could save hours per project. The benchmark suggests that specialized LLMs can now assist with real-world design tasks, not just text or image generation, provided a human reviews the output.
The results also push the field toward better evaluation standards. Many existing benchmarks focus on general coding or simple geometry. This one targets architectural domain knowledge, forcing models to understand spatial relationships and construction logic.
What Sets Antigravity Apart
Antigravity 2.0 uses a custom training pipeline focused on structured code generation. The model was fine-tuned on a large corpus of validated OpenSCAD files and architectural blueprints. Its architecture includes a dedicated spatial reasoning module that helps maintain coordinate consistency across multiple components.
Early user reports indicate that Antigravity 2.0 can generate load-bearing wall layouts and staircases with accurate dimensions. These are tasks that tripped up earlier models. The developers behind Antigravity 2.0 have not yet released the model publicly, but they plan to share benchmark evaluation scripts and a limited API for testing.
Other models in the top tier include GPT-4o and a specialized variant of CodeLlama, but neither matched Antigravity 2.0’s reliability on complex scenes. The gap was widest on tasks requiring nested loops and conditional geometry.
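To make "nested loops and conditional geometry" concrete, the Python sketch below fills in an OpenSCAD template for a facade wall with a grid of window openings, leaving one ground-floor bay solid for a door. It is not taken from the benchmark's task set; the template, the `facade_scad` function and all default dimensions are hypothetical.

```python
from string import Template

# Illustrative OpenSCAD source; dimensions are in millimetres.
FACADE_TEMPLATE = Template("""\
// Facade wall with a grid of window openings (illustrative only).
wall_w = $wall_w;
wall_h = $wall_h;
wall_t = $wall_t;
cols = $cols;
rows = $rows;
win_w = $win_w;
win_h = $win_h;
door_col = $door_col;

cell_w = wall_w / cols;
cell_h = wall_h / rows;

difference() {
    cube([wall_w, wall_t, wall_h]);
    // Nested loops: one iteration per bay (column) and storey (row).
    for (col = [0 : cols - 1]) {
        for (row = [0 : rows - 1]) {
            // Conditional geometry: keep the ground-floor door bay solid.
            if (!(row == 0 && col == door_col)) {
                translate([col * cell_w + (cell_w - win_w) / 2,
                           -1,
                           row * cell_h + (cell_h - win_h) / 2])
                    cube([win_w, wall_t + 2, win_h]);
            }
        }
    }
}
""")


def facade_scad(wall_w=6000, wall_h=6000, wall_t=200, cols=4, rows=2,
                win_w=900, win_h=1200, door_col=1) -> str:
    """Return OpenSCAD source for the facade, after basic sanity checks."""
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    if not 0 <= door_col < cols:
        raise ValueError("door_col must index an existing column")
    if win_w >= wall_w / cols or win_h >= wall_h / rows:
        raise ValueError("window does not fit inside its grid cell")
    return FACADE_TEMPLATE.substitute(
        wall_w=wall_w, wall_h=wall_h, wall_t=wall_t, cols=cols, rows=rows,
        win_w=win_w, win_h=win_h, door_col=door_col)


if __name__ == "__main__":
    print(facade_scad())
```

Even in a scene this small, a model has to keep the loop ranges, the cell offsets and the door condition consistent with one another, which is the kind of bookkeeping the widest gaps in the benchmark appear to reflect.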
The benchmark’s authors stress that no current model is production-ready for unsupervised use. Human review remains essential. But Antigravity 2.0 reduces error rates enough to be a practical assistant for experienced designers.



