Chinese AI and tech firms continue to impress with their development of cutting-edge, state-of-the-art AI language models.
Today, the one drawing eyeballs is Alibaba Cloud’s Qwen Team of AI researchers and its unveiling of a new proprietary language reasoning model, Qwen3-Max-Thinking.
You may recall, as VentureBeat covered last year, that Qwen has made a name for itself in the fast-moving global AI marketplace by shipping a variety of powerful, open source models in various modalities, from text to image to spoken audio. The company even earned an endorsement from U.S. tech lodgings giant Airbnb, whose CEO and co-founder Brian Chesky said the company was relying on Qwen’s free, open source models as a more affordable alternative to U.S. offerings like those of OpenAI.
Now, with the proprietary Qwen3-Max-Thinking, the Qwen Team is aiming to match and, in some cases, outpace the reasoning capabilities of GPT-5.2 and Gemini 3 Pro through architectural efficiency and agentic autonomy.
The release comes at a critical juncture. Western labs have largely defined the “reasoning” category (often dubbed “System 2” logic), but Qwen’s latest benchmarks suggest the gap has closed.
In addition, the company’s relatively affordable API pricing strategy aggressively targets enterprise adoption. However, as it is a Chinese model, some U.S. firms with strict national security requirements and considerations may be wary of adopting it.
The Architecture: “Test-Time Scaling” Redefined
The core innovation driving Qwen3-Max-Thinking is a departure from standard inference methods. While most models generate tokens linearly, Qwen3 utilizes a “heavy mode” driven by a technique known as “Test-time scaling.”
In simple terms, this technique allows the model to trade compute for intelligence. But unlike naive “best-of-N” sampling—where a model might generate 100 answers and pick the best one — Qwen3-Max-Thinking employs an experience-cumulative, multi-round strategy.
This approach mimics human problem-solving. When the model encounters a complex query, it doesn’t just guess; it engages in iterative self-reflection. It uses a proprietary “take-experience” mechanism to distill insights from previous reasoning steps. This allows the model to:
-
Identify Dead Ends: Recognize when a line of reasoning is failing without needing to fully traverse it.
-
Focus Compute: Redirect processing power toward “unresolved uncertainties” rather than re-deriving known conclusions.
The efficiency gains are tangible. By avoiding redundant reasoning, the model integrates richer historical context into the same window. The Qwen team reports that this method drove massive performance jumps without exploding token costs:
-
GPQA (PhD-level science): Scores improved from 90.3 to 92.8.
-
LiveCodeBench v6: Performance jumped from 88.0 to 91.4.
Beyond Pure Thought: Adaptive Tooling
While “thinking” models are powerful, they have historically been siloed — great at math, but poor at browsing the web or running code. Qwen3-Max-Thinking bridges this gap by effectively integrating “thinking and non-thinking modes”.
The model features adaptive tool-use capabilities, meaning it autonomously selects the right tool for the job without manual user prompting. It can seamlessly toggle between:
-
Web Search & Extraction: For real-time factual queries.
-
Memory: To store and recall user-specific context.
-
Code Interpreter: To write and execute Python snippets for computational tasks.
In “Thinking Mode,” the model supports these tools simultaneously. This capability is critical for enterprise applications where a model might need to verify a fact (Search), calculate a projection (Code Interpreter), and then reason about the strategic implication (Thinking) all in one turn.
Empirically, the team notes that this combination “effectively mitigates hallucinations,” as the model can ground its reasoning in verifiable external data rather than relying solely on its training weights.
Benchmark Analysis: The Data Story
Qwen is not shy about direct comparisons.
On HMMT Feb 25, a rigorous reasoning benchmark, Qwen3-Max-Thinking scored 98.0, edging out Gemini 3 Pro (97.5) and significantly leading DeepSeek V3.2 (92.5).
However, the most significant signal for developers is arguably Agentic Search. On “Humanity’s Last Exam” (HLE) — the benchmark that measures performance on 3,000 “Google-proof” graduate-level questions across math, science, computer science, humanities and engineering — Qwen3-Max-Thinking, equipped with web search tools, scored 49.8, beating both Gemini 3 Pro (45.8) and GPT-5.2-Thinking (45.5) .
This suggests that Qwen3-Max-Thinking’s architecture is uniquely suited for complex, multi-step agentic workflows where external data retrieval is necessary.
In coding tasks, the model also shines. On Arena-Hard v2, it posted a score of 90.2, leaving competitors like Claude-Opus-4.5 (76.7) far behind.
The Economics of Reasoning: Pricing Breakdown
For the first time, we have a clear look at the economics of Qwen’s top-tier reasoning model. Alibaba Cloud has positioned qwen3-max-2026-01-23 as a premium but accessible offering on its API.
-
Input: $1.20 per 1 million tokens (for standard contexts <= 32k).
-
Output: $6.00 per 1 million tokens.
On a base level, here’s how Qwen3-Max-Thinking stacks up:
|
Model |
Input (/1M) |
Output (/1M) |
Total Cost |
Source |
|
Qwen 3 Turbo |
$0.05 |
$0.20 |
$0.25 |
|
|
Grok 4.1 Fast (reasoning) |
$0.20 |
$0.50 |
$0.70 |
|
|
Grok 4.1 Fast (non-reasoning) |
$0.20 |
$0.50 |
$0.70 |
|
|
deepseek-chat (V3.2-Exp) |
$0.28 |
$0.42 |
$0.70 |
|
|
deepseek-reasoner (V3.2-Exp) |
$0.28 |
$0.42 |
$0.70 |
|
|
Qwen 3 Plus |
$0.40 |
$1.20 |
$1.60 |
|
|
ERNIE 5.0 |
$0.85 |
$3.40 |
$4.25 |
|
|
Gemini 3 Flash Preview |
$0.50 |
$3.00 |
$3.50 |
|
|
Claude Haiku 4.5 |
$1.00 |
$5.00 |
$6.00 |
|
|
Qwen3-Max Thinking (2026-01-23) |
$1.20 |
$6.00 |
$7.20 |
|
|
Gemini 3 Pro (≤200K) |
$2.00 |
$12.00 |
$14.00 |
|
|
GPT-5.2 |
$1.75 |
$14.00 |
$15.75 |
|
|
Claude Sonnet 4.5 |
$3.00 |
$15.00 |
$18.00 |
|
|
Gemini 3 Pro (>200K) |
$4.00 |
$18.00 |
$22.00 |
|
|
Claude Opus 4.5 |
$5.00 |
$25.00 |
$30.00 |
|
|
GPT-5.2 Pro |
$21.00 |
$168.00 |
$189.00 |
This pricing structure is aggressive, undercutting many legacy flagship models while offering state-of-the-art performance.
However, developers should note the granular pricing for the new agentic capabilities, as Qwen separates the cost of “thinking” (tokens) from the cost of “doing” (tool use).
-
Agent Search Strategy: Both standard
search_strategy:agentand the more advancedsearch_strategy:agent_maxare priced at $10 per 1,000 calls.-
Note: The
agent_maxstrategy is currently marked as a “Limited Time Offer,” suggesting its price may rise later.
-
-
Web Search: Priced at $10 per 1,000 calls via the Responses API.
Promotional Free Tier:To encourage adoption of its most advanced features, Alibaba Cloud is currently offering two key tools for free for a limited time:
-
Web Extractor: Free (Limited Time).
-
Code Interpreter: Free (Limited Time).
This pricing model (low token cost + à la carte tool pricing) allows developers to build complex agents that are cost-effective for text processing, while paying a premium only when external actions—like a live web search—are explicitly triggered.
Developer Ecosystem
Recognizing that performance is useless without integration, Alibaba Cloud has ensured Qwen3-Max-Thinking is drop-in ready.
-
OpenAI Compatibility: The API supports the standard OpenAI format, allowing teams to switch models by simply changing the
base_urlandmodelname. -
Anthropic Compatibility: In a savvy move to capture the coding market, the API also supports the Anthropic protocol. This makes Qwen3-Max-Thinking compatible with Claude Code, a popular agentic coding environment.
The Verdict
Qwen3-Max-Thinking represents a maturation of the AI market in 2026. It moves the conversation beyond “who has the smartest chatbot” to “who has the most capable agent.”
By combining high-efficiency reasoning with adaptive, autonomous tool use—and pricing it to move—Qwen has firmly established itself as a top-tier contender for the enterprise AI throne.
For developers and enterprises, the “Limited Time Free” windows on Code Interpreter and Web Extractor suggest now is the time to experiment. The reasoning wars are far from over, but Qwen has just deployed a very heavy hitter.
