After more than a month of rumors and feverish speculation — including Polymarket wagering on the release date — Google today unveiled Gemini 3, its newest proprietary frontier model family and the company’s most comprehensive AI release since the Gemini line debuted in 2023.
The models are proprietary (closed-source), available exclusively through Google products, developer platforms, and paid APIs, including Google AI Studio, Vertex AI, the Gemini command line interface (CLI) for developers, and third-party integrations across the broader integrated developer environment (IDE) ecosystem.
Gemini 3 arrives as a full portfolio, including:

- Gemini 3 Pro: the flagship frontier model
- Gemini 3 Deep Think: an enhanced reasoning mode
- Generative interface models powering Visual Layout and Dynamic View
- Gemini Agent for multi-step task execution
- The Gemini 3 engine embedded in Google Antigravity, the company's new agent-first development environment
“This is the best model in the world, by a crazy wide margin!” wrote Google DeepMind Research Scientist Yi Tay on X.
Indeed, independent AI benchmarking and analysis organization Artificial Analysis has already crowned Gemini 3 Pro the "new leader in AI" globally, with a top score of 73 on the organization's index. That vaults Google up from ninth place overall, where the preceding Gemini 2.5 Pro scored 60, trailing models from OpenAI, Moonshot AI, xAI, Anthropic, and MiniMax. As Artificial Analysis wrote on X: "For the first time, Google has the most intelligent model."
Another independent leaderboard site, LMArena, reported that Gemini 3 Pro ranked first in the world across all of its major evaluation tracks, including text reasoning, vision, coding, and web development.
In a public post, the @arena account on X said the model surpassed even the newly released, hours-old Grok-4.1, as well as Claude 4.5 and GPT-5-class systems, in categories such as math, long-form queries, creative writing, and several occupational benchmarks.
The post also highlighted the scale of gains over Gemini 2.5 Pro, including a 50-point jump in text Elo, a 70-point increase in vision, and a 280-point rise in web-development tasks.
While these results reflect live community voting and remain preliminary, they signal unusually broad performance improvements across domains where previous Gemini models trailed competitors.
What It Means for Google in the Hotly Competitive AI Race
The launch represents one of Google’s largest, most tightly coordinated model releases.
Gemini 3 is shipping simultaneously across Google Search, the Gemini app, Google AI Studio, Vertex AI, and a range of developer tools.
Executives emphasized that this integration reflects Google's control of its tensor processing unit (TPU) hardware, the company's homegrown rival to Nvidia's GPUs, along with its data center infrastructure and consumer products.
According to the company, the Gemini app now has more than 650 million monthly active users, more than 13 million developers build with Google’s AI tools, and more than 2 billion monthly users engage with Gemini-powered AI Overviews in Search.
At the center of the release is a shift toward agentic AI — systems that plan, act, navigate interfaces, and coordinate tools, rather than just generating text.
Gemini 3 is designed to translate high-level instructions into multi-step workflows across devices and applications, with the ability to generate functional interfaces, run tools, and manage complex tasks.
Major Performance Gains Over Gemini 2.5 Pro
Gemini 3 Pro introduces large gains over Gemini 2.5 Pro across reasoning, mathematics, multimodality, tool use, coding, and long-horizon planning. Google’s benchmark disclosures show substantial improvements in many categories.
Gemini 3 Pro debuted at the top of the LMArena text-reasoning leaderboard, posting a preliminary Elo score of 1501 based on pre-release community voting — the first LLM ever to cross the 1500 threshold.
That places it above xAI’s newly announced Grok-4.1-thinking model (1484) and Grok-4.1 (1465), both of which were unveiled just hours earlier, as well as above Gemini 2.5 Pro (1451) and recent Claude Sonnet and Opus releases.
While LMArena covers only text-reasoning performance and the results are labeled preliminary, this ranking positions Gemini 3 Pro as the strongest publicly evaluated model on that benchmark as of its launch day — though not necessarily the top performer in the world across all modalities, tasks, or evaluation suites.
In mathematical and scientific reasoning, Gemini 3 Pro scored 95 percent on AIME 2025 without tools and 100 percent with code execution, compared to 88 percent for its predecessor.
On GPQA Diamond, it reached 91.9 percent, up from 86.4 percent. The model also recorded a major jump on MathArena Apex, reaching 23.4 percent versus 0.5 percent for Gemini 2.5 Pro, and delivered 31.1 percent on ARC-AGI-2 compared to 4.9 percent previously.
ARC-AGI-2 is the second-generation version of the Abstraction and Reasoning Corpus (ARC), a benchmark introduced by AI researcher François Chollet to measure generalization, not memorization.
Unlike typical multiple-choice or dataset-based evaluations, ARC-AGI-2 presents models with tiny grid-based puzzles that require discovering and applying abstract rules.
Each task provides a few input–output examples, and the model must infer the underlying transformation and apply it to a new test case. The problems span visual pattern recognition, symbolic manipulation, object transformations, spatial reasoning, and rule induction — all designed to test reasoning capabilities that do not depend on training-set familiarity.
The new ARC-AGI-2 variant is deliberately constructed to be out-of-distribution and resistant to memorization, making it one of the most difficult benchmarks for large language models. Its tasks are engineered to stress-test whether a model can infer a previously unseen rule purely from examples, a proxy for early forms of generalized problem-solving.
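As a toy illustration of the format (a deliberately miniature example, far simpler than real ARC-AGI-2 puzzles), the task loop can be sketched as: given a few input→output grid pairs, search for a transformation rule consistent with all of them, then apply it to a held-out test input. The candidate rules here are placeholders invented for this sketch.

```python
# Toy ARC-style rule induction: find a transformation consistent with
# every example pair, then apply it to a new test grid.

def transpose(grid):
    return [list(row) for row in zip(*grid)]

def flip_rows(grid):
    return grid[::-1]

CANDIDATE_RULES = {"transpose": transpose, "flip_rows": flip_rows}

def induce_rule(examples):
    """Return the first candidate rule that fits every (input, output) pair."""
    for name, fn in CANDIDATE_RULES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name, fn
    return None

examples = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[5, 6], [7, 8]], [[5, 7], [6, 8]]),
]
name, rule = induce_rule(examples)
print(name)                    # rule that fit the examples
print(rule([[0, 1], [2, 3]]))  # applied to the held-out test input
```

Real ARC-AGI-2 tasks have no such fixed rule library; the model must invent the transformation from scratch, which is precisely what makes the benchmark resistant to memorization.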
Astonishingly, the “Deep Think” version of Gemini 3, designed to take longer to solve problems and use more reasoning, scored 45.1%, representing a substantial jump over prior frontier models, which typically score in the mid-teens to low-twenties. It also far exceeds Gemini 3 Pro’s 31.1% and is an order-of-magnitude improvement over older Gemini releases.
These results suggest that Deep Think’s architecture is particularly effective at multi-step hypothesis generation, checking, and revision — the specific capabilities ARC-AGI-2 is designed to measure.
Multimodal performance increased across the board. Gemini 3 Pro scored 81 percent on MMMU-Pro, up from 68 percent, and 87.6 percent on Video-MMMU, compared to 83.6 percent. Its result on ScreenSpot-Pro, a key benchmark for agentic computer use, rose from 11.4 percent to 72.7 percent. Document understanding and chart reasoning also improved.
Coding and tool-use performance showed equally significant gains. The model’s LiveCodeBench Pro score reached 2,439, up from 1,775. On Terminal-Bench 2.0 it achieved 54.2 percent versus 32.6 percent previously. SWE-Bench Verified, which measures agentic coding through structured fixes, increased from 59.6 percent to 76.2 percent. The model also posted 85.4 percent on t2-bench, up from 54.9 percent.
Long-context and planning benchmarks indicate more stable multi-step behavior. Gemini 3 achieved 77 percent on MRCR v2 at 128k context (versus 58 percent) and 26.3 percent at 1 million tokens (versus 16.4 percent). Its Vending-Bench 2 score reached $5,478.16, compared to $573.64 for Gemini 2.5 Pro, reflecting stronger consistency during long-running decision processes.
Language understanding scores improved on SimpleQA Verified (72.1 percent versus 54.5 percent), MMLU (91.8 percent versus 89.5 percent), and the FACTS Benchmark Suite (70.5 percent versus 63.4 percent), supporting more reliable fact-based work in regulated sectors.
Generative Interfaces Move Gemini Beyond Text
Gemini 3 introduces a new class of generative interface capabilities in the consumer-facing Google Search AI Mode and for developers through Google AI Studio.
Visual Layout produces structured, magazine-style pages with images, diagrams, and modules tailored to the query.
Dynamic View generates functional interface components such as calculators, simulations, galleries, and interactive graphs.
These experiences will be available globally starting today in Google Search's AI Mode, enabling models to surface information in visual, interactive formats beyond static text.
Developers can reproduce similar UI elements through Google AI Studio and the Gemini API, but the full consumer-facing interface types are not available as direct API outputs; instead, developers receive the underlying code or schema to render these components themselves. The branded Visual Layout and Dynamic View formats are therefore specific to Search and not exposed as standalone API features.
Google says the model analyzes user intent to construct the layout best suited to a task. In practice, this includes everything from automatically building diagrams for scientific concepts to generating custom UI components that respond to user input.
Google held a press call the day before the Gemini 3 announcement to brief reporters on the model family, its intended use cases, and how it differed from earlier Gemini releases. The call was led by multiple Google and DeepMind executives who walked through the model’s capabilities and framed Gemini 3 as a step toward more reliable, multi-step agentic systems that can operate across Google’s ecosystem.
During the briefing, speakers emphasized that Gemini 3 was engineered to support more consistent long-horizon reasoning, better tool use, and smoother planning loops than Gemini 2.5 Pro.
One presenter said the model benefits from an architecture that allows it to generate and evaluate multiple hypotheses in parallel, improving reliability on mathematically hard questions and complex procedural tasks.
Another speaker explained that Gemini 3’s improved spatial reasoning enables more robust interaction with interface elements, which supports agentic workflows across screens and applications.
Presenters highlighted growing enterprise adoption, noting strong demand for multimodal analysis, structured document reasoning, and agentic coding tools. They said Gemini 3's performance on multimodal and scientific benchmarks reflected Google's focus on grounded, verifiable reasoning. And they discussed Gemini 3's safety processes and improvements, including reduced sycophancy, stronger prompt-injection resistance, and a more structured evaluation pipeline guided by Google's Frontier Safety Framework, introduced in 2024.
A portion of the call was dedicated to developer experience. Google described updates to its AI Studio and API that allow developers to control thinking depth, adjust model “resolution,” and combine new grounding tools with URL context and Search.
Demos showed Gemini 3 generating application interfaces, managing tool sequences, and debugging code in Antigravity, illustrating the model's shift toward agentic operation rather than single-step generation.
The call positioned Gemini 3 as an upgrade across reasoning, planning, multimodal understanding, and developer workflows, with Google framing these advances as the foundation for its next generation of agent-driven products and enterprise services.
Gemini Agent Introduces Multi-Step Workflow Automation
Gemini Agent marks Google’s effort to move beyond conversational assistance toward operational AI. The system coordinates multi-step tasks across tools like Gmail, Calendar, Canvas, and live browsing. It reviews inboxes, drafts replies, prepares plans, triages information, and reasons through complex workflows, while requiring user approval before performing sensitive actions.
On the press call with journalists ahead of the release, Google said the agent is designed to handle multi-turn planning and tool-use sequences with a consistency that was not feasible in earlier generations.
It is rolling out first to Google AI Ultra subscribers in the Gemini app.
Google Antigravity and Developer Toolchain Integration
Antigravity is Google’s new agent-first development environment designed around Gemini 3. Developers collaborate with agents across an editor, terminal, and browser. The system orchestrates full-stack tasks, including code generation, UI prototyping, debugging, live execution, and report generation.
Across the broader developer ecosystem, Google AI Studio now includes a Build mode that automatically wires the right models and APIs to speed up AI-native app creation. Annotations support allows developers to attach prompts to UI elements for faster iteration. Spatial reasoning improvements enable agents to interpret mouse movements, screen annotations, and multi-window layouts to operate computer interfaces more effectively.
Developers also gain new reasoning controls through “thinking level” and “model resolution” parameters in the Gemini API, along with stricter validation of thought signatures for multi-turn consistency. A hosted server-side bash tool supports secure, multi-language code generation and prototyping. Grounding with Google Search and URL context can now be combined to extract structured information for downstream tasks.
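A minimal sketch of how these controls might fit together in a raw API request body follows. The field names used here (`thinkingConfig`, `thinkingLevel`, `googleSearch`, `urlContext`) are assumptions based on Google's description of the features, not confirmed Gemini API parameter names; consult the official API reference before relying on them.

```python
# Hypothetical request body combining a thinking-level control with the
# Search-grounding and URL-context tools described in the announcement.
# All field names are assumptions, not verified Gemini API parameters.
import json

def build_request(prompt: str, thinking_level: str = "high") -> dict:
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingLevel": thinking_level},
        },
        # Per the briefing, Search grounding and URL context can be combined.
        "tools": [{"googleSearch": {}}, {"urlContext": {}}],
    }

body = build_request("Extract the pricing details from the linked page")
print(json.dumps(body, indent=2))
```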
Enterprise Impact and Adoption
Enterprise teams gain multimodal understanding, agentic coding, and long-horizon planning needed for production use cases. The new model unifies analysis of documents, audio, video, workflows, and logs. Improvements in spatial and visual reasoning support robotics, autonomous systems, and scenarios requiring navigation of screens and applications. High-frame-rate video understanding helps developers detect events in fast-moving environments.
Gemini 3’s structured document understanding capabilities support legal review, complex form processing, and regulated workflows. Its ability to generate functional interfaces and prototypes with minimal prompting reduces engineering cycles. In addition, the gains in system reliability, tool-calling stability, and context retention make multi-step planning viable for operations like financial forecasting, customer support automation, supply chain modeling, and predictive maintenance.
Developer and API Pricing
Google has disclosed initial API pricing for Gemini 3 Pro.
In preview, the model is priced at $2 per million input tokens and $12 per million output tokens for prompts up to 200,000 tokens in Google AI Studio and Vertex AI. For prompts longer than 200,000 tokens, input pricing doubles to $4 per million tokens, while output rises to $18 per million tokens.
Compared to API pricing for other frontier models from rival labs, Gemini 3 sits in the mid-to-high range, which may affect adoption as cheaper, permissively licensed Chinese models are increasingly taken up by U.S. startups. Here's how it stacks up:
| Model | Input (/1M tokens) | Output (/1M tokens) | Total Cost |
|---|---|---|---|
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Qwen3 (Coder ex.) | $0.85 | $3.40 | $4.25 |
| GPT-5.1 | $1.25 | $10.00 | $11.25 |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 |
Gemini 3 Pro is also available at no charge with rate limits in Google AI Studio for experimentation.
The company has not yet announced pricing for Gemini 3 Deep Think, extended context windows, generative interfaces, or tool invocation.
Enterprises planning deployment at scale will require these details to estimate operational costs.
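Using the preview rates in the table above, a back-of-envelope per-call cost estimate can be sketched as follows (the function name and workload sizes are illustrative, and the tier split at 200,000 tokens follows the published rates):

```python
# Rough per-call cost estimate from the preview rates quoted above:
# $2 in / $12 out per 1M tokens for prompts up to 200K tokens,
# $4 in / $18 out per 1M tokens beyond that threshold.
def gemini3_pro_cost(input_tokens: int, output_tokens: int) -> float:
    long_prompt = input_tokens > 200_000
    input_rate = 4.00 if long_prompt else 2.00
    output_rate = 18.00 if long_prompt else 12.00
    return ((input_tokens / 1_000_000) * input_rate
            + (output_tokens / 1_000_000) * output_rate)

print(gemini3_pro_cost(100_000, 20_000))  # short-prompt call
print(gemini3_pro_cost(500_000, 50_000))  # long-prompt call at the higher tier
```

Note this covers token charges only; costs for Deep Think, extended context, generative interfaces, and tool invocation remain unannounced.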
Multimodal, Visual, and Spatial Reasoning Enhancements
Gemini 3’s improvements in embodied and spatial reasoning support pointing and trajectory prediction, task progression, and complex screen parsing. These capabilities extend to desktop and mobile environments, enabling agents to interpret screen elements, respond to on-screen context, and unlock new forms of computer-use automation.
The model also delivers improved video reasoning with high-frame-rate understanding for analyzing fast-moving scenes, along with long-context video recall for synthesizing narratives across hours of footage. Google’s examples show the model generating full interactive demo apps directly from prompts, illustrating the depth of multimodal and agentic integration.
Vibe Coding and Agentic Code Generation
Gemini 3 advances Google’s concept of “vibe coding,” where natural language acts as the primary syntax. The model can translate high-level ideas into full applications with a single prompt, handling multi-step planning, code generation, and visual design. Enterprise partners like Figma, JetBrains, Cursor, Replit, and Cline report stronger instruction following, more stable agentic operation, and better long-context code manipulation compared to prior models.
Rumors and Rumblings
In the weeks leading up to the announcement, X became a hub of speculation about Gemini 3.
Well-known accounts such as @slow_developer suggested internal builds were significantly ahead of Gemini 2.5 Pro and likely exceeded competitor performance in reasoning and tool use. Others, including @synthwavedd and @VraserX, noted mixed behavior in early checkpoints but acknowledged Google’s advantage in TPU hardware and training data.
Viral clips from users like @lepadphone and @StijnSmits showed the model generating websites, animations, and UI layouts from single prompts, adding to the momentum.
Prediction markets on Polymarket amplified the speculation. Whale accounts drove the odds of a mid-November release sharply upward, prompting widespread debate about insider activity. A temporary dip during a global Cloudflare outage became a moment of humor and conspiracy before odds surged again.
The key moment came when users including @cheatyyyy shared what appeared to be an internal model-card benchmark table for Gemini 3 Pro.
The image circulated rapidly, with commentary from figures like @deedydas and @kimmonismus arguing the numbers suggested a significant lead.
When Google published the official benchmarks, they matched the leaked table exactly, confirming the document’s authenticity.
By launch day, enthusiasm reached a peak. A brief “Geminiii” post from Sundar Pichai triggered widespread attention, and early testers quickly shared real examples of Gemini 3 generating interfaces, full apps, and complex visual designs.
While some concerns about pricing and efficiency appeared, the dominant sentiment framed the launch as a turning point for Google and a display of its full-stack AI capabilities.
Safety and Evaluation
Google says Gemini 3 is its most secure model yet, with reduced sycophancy, stronger prompt-injection resistance, and better protection against misuse. The company partnered with external groups, including Apollo and Vaultis, and conducted evaluations using its Frontier Safety Framework.
Deployment Across Google Products
Gemini 3 is available across Google Search AI Mode, the Gemini app, Google AI Studio, Vertex AI, the Gemini CLI, and Google’s new agentic development platform, Antigravity. Google says additional Gemini 3 variants will arrive later.
Conclusion
Gemini 3 represents Google’s largest step forward in reasoning, multimodality, enterprise reliability, and agentic capabilities. The model’s performance gains over Gemini 2.5 Pro are substantial across mathematical reasoning, vision, coding, and planning. Generative interfaces, Gemini Agent, and Antigravity demonstrate a shift toward systems that not only respond to prompts but plan tasks, construct interfaces, and coordinate tools. Combined with an unusually intense hype and leak cycle, the launch marks a significant moment in the AI landscape as Google moves aggressively to expand its presence across both consumer-facing and enterprise-facing AI workflows.
