Chinese frontier models compared: GLM-5, MiniMax M2.5, Kimi K2.5, and Qwen 3.5
A benchmark-driven comparison of the new wave of Chinese frontier models against Claude Opus 4.6 and GPT-5.2 — with pricing, architecture, and practical guidance for production teams.
February 2026 was a landmark month for Chinese AI labs. Within weeks of each other, four models shipped that match or exceed Western frontier systems on major benchmarks — at a fraction of the API cost. For anyone running AI in production, the model landscape just got a lot more interesting.
This article compares GLM-5 (Zhipu AI), MiniMax M2.5 (MiniMax), Kimi K2.5 (Moonshot AI), and Qwen 3.5-397B (Alibaba) head-to-head against Claude Opus 4.6 and GPT-5.2 across coding, math, reasoning, agentic tasks, and multimodal benchmarks. We include full pricing breakdowns, architecture deep dives, and a practical decision framework.
Quick links
- Benchmark tables — Jump to Head-to-head benchmarks
- Pricing — Jump to Pricing comparison
- Decision guide — Jump to When to use which model
- Inference infrastructure — See Inference stacks compared
The models at a glance
| Model | Lab | Total params | Active params | Architecture | Context | License | Release |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Undisclosed | Undisclosed | Dense (assumed) | 200K (1M beta) | Proprietary | Feb 5, 2026 |
| GPT-5.2 | OpenAI | Undisclosed | Undisclosed | Dense (assumed) | 400K | Proprietary | Dec 10, 2025 |
| GLM-5 | Zhipu AI | 744B | 40B | MoE | 200K | MIT | Feb 11, 2026 |
| MiniMax M2.5 | MiniMax | 230B | 10B | MoE (256E / 8A) | 204K | Open weights | Feb 12, 2026 |
| Kimi K2.5 | Moonshot AI | 1T | 32B | MoE (384E / 8A) | 256K | Modified MIT | Jan 27, 2026 |
| Qwen 3.5-397B | Alibaba | 397B | 17B | MoE (hybrid attn) | 1M | Apache 2.0 | Feb 16, 2026 |
The architecture story is striking. Every Chinese model uses a Mixture-of-Experts (MoE) design, activating only a fraction of total parameters per token. Kimi K2.5 packs 1 trillion total parameters but activates just 32B per forward pass — roughly the footprint of a mid-range open-source dense model. MiniMax M2.5 takes efficiency even further: 10B active parameters achieving frontier-class SWE-Bench scores. This is the core mechanism behind the dramatic cost reduction.
For context, Anthropic and OpenAI have not disclosed whether Opus 4.6 or GPT-5.2 use MoE architectures. But the pricing gap between proprietary dense models and open-weight sparse models tells the story.
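To make sparse activation concrete, here is a toy top-k MoE routing layer in NumPy. This is an illustration of the general technique, not any lab's actual implementation; the 256-expert, 8-active shape mirrors MiniMax M2.5's reported configuration, but the gating and expert functions here are invented for demonstration.

```python
import numpy as np

def topk_moe_layer(x, experts, gate_w, k=8):
    """Illustrative Mixture-of-Experts forward pass.

    Only the top-k experts (by gating score) run for each token,
    so compute scales with k, not with the total expert count.
    """
    logits = x @ gate_w                      # one gating score per expert
    topk = np.argsort(logits)[-k:]           # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Weighted sum of the k active expert outputs; all other experts stay idle.
    return sum(w * experts[i](x) for i, w in zip(topk, weights))

# Toy setup: 256 experts, 8 active per token (MiniMax M2.5-style ratio).
rng = np.random.default_rng(0)
d, n_experts = 16, 256
expert_mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda x, m=m: x @ m for m in expert_mats]
gate_w = rng.standard_normal((d, n_experts))

token = rng.standard_normal(d)
out = topk_moe_layer(token, experts, gate_w, k=8)
print(out.shape)  # (16,)
```

Because per-token compute scales with the number of active experts rather than the total, a 1T-parameter model like Kimi K2.5 can run with roughly the cost profile of a 32B dense model.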
Head-to-head benchmarks
We grouped benchmarks into four categories: coding, math & reasoning, agentic & tool use, and vision & multimodal. Bold scores indicate the best result in each row. Dashes indicate scores that were not publicly reported or not applicable.
Coding
| Benchmark | Claude Opus 4.6 | GPT-5.2 | GLM-5 | MiniMax M2.5 | Kimi K2.5 | Qwen 3.5 |
|---|---|---|---|---|---|---|
| SWE-Bench Verified | **80.8%** | 80.0% | 77.8% | 80.2% | 76.8% | 76.4% |
| SWE-Bench Pro | — | **55.6%** | 38.6% | 51.3% | — | — |
| LiveCodeBench v6 | 68.1% | 66.8% | 55.4% | — | **85.0%** | 83.6% |
| Terminal-Bench 2.0 | **65.4%** | 54.1% | 42.1% | — | 50.8% | — |
SWE-Bench Verified tests whether a model can resolve real GitHub issues end-to-end. MiniMax M2.5 essentially matches Claude Opus 4.6 here (80.2% vs 80.8%) while activating just 10B parameters per forward pass, the smallest active footprint in this comparison. GLM-5 holds the top position among MIT-licensed models at 77.8%, and was the first Chinese open-source model to top VendingBench 2 — a long-horizon benchmark testing operational capability across extended agentic coding sessions.
On LiveCodeBench (competitive programming problems released after training cutoffs), the Chinese models diverge sharply. Kimi K2.5 leads at 85.0%, followed closely by Qwen 3.5 at 83.6% — both substantially ahead of Opus 4.6 (68.1%) and GPT-5.2 (66.8%). This suggests the Chinese models have a particular strength in algorithmic reasoning under novel conditions.
Terminal-Bench 2.0, which tests command-line tool use and shell scripting, remains a stronghold for Claude Opus 4.6 at 65.4%.
Math & reasoning
| Benchmark | Claude Opus 4.6 | GPT-5.2 | GLM-5 | MiniMax M2.5 | Kimi K2.5 | Qwen 3.5 |
|---|---|---|---|---|---|---|
| AIME 2025 | 94.0% | **100%** | 92.7%† | — | 96.1% | 91.3% |
| HMMT Feb 2025 | — | **99.4%** | — | — | 95.4% | 94.8% |
| GPQA Diamond | 87.0% | **92.4%** | — | — | 87.6% | 88.4% |
| MMLU-Pro | **88.2%** | 86.3% | 70.4% | — | 87.1% | 87.8% |
| ARC-AGI 2 | 68.8% | **86.2%** | — | — | — | — |
†GLM-5's reported AIME score is on the 2026 problem set, not 2025.
GPT-5.2 still dominates pure mathematical reasoning — a perfect 100% on AIME 2025 and 99.4% on HMMT. But the margins are thinning. Kimi K2.5 reaches 96.1% on AIME, and even the smallest active-parameter model in the group (Qwen 3.5 at 17B active) hits 91.3%.
On MMLU-Pro (graduate-level knowledge across 14 disciplines), Claude Opus 4.6 leads at 88.2%, with Qwen 3.5 (87.8%) and Kimi K2.5 (87.1%) within striking distance. GLM-5's lower MMLU-Pro score (70.4%) suggests its training data may have been more narrowly focused on coding and STEM benchmarks.
GPQA Diamond (expert-level science questions written by PhDs) shows GPT-5.2 pulling ahead at 92.4%, with Qwen 3.5 (88.4%) and Kimi K2.5 (87.6%) both outperforming Claude Opus 4.6 (87.0%).
Agentic & tool use
| Benchmark | Claude Opus 4.6 | GPT-5.2 | GLM-5 | MiniMax M2.5 | Kimi K2.5 | Qwen 3.5 |
|---|---|---|---|---|---|---|
| BrowseComp | **84.0%** | 65.8% | 75.9% | 76.3% | 78.4% | 78.6% |
| HLE (w/ tools) | **53.1%** | 50.0% | — | — | 50.2% | — |
| τ2-Bench (telecom) | — | **98.7%** | — | — | — | 86.7% |
| τ2-Bench (retail) | **91.9%** | — | — | — | — | — |
BrowseComp tests a model's ability to locate hard-to-find information across the web. Claude Opus 4.6 leads convincingly at 84.0%, but all four Chinese models cluster between 75.9% and 78.6% — a meaningful jump from GPT-5.2's 65.8%. This matters for document research, competitive intelligence, and any workflow that requires synthesizing information from scattered online sources.
Kimi K2.5's Agent Swarm architecture is the standout here. It orchestrates up to 100 parallel sub-agents coordinating up to 1,500 tool calls, reducing multi-step task execution time by 3–4.5x compared to sequential approaches. This is trained via Parallel-Agent Reinforcement Learning (PARL) — a novel training paradigm that rewards effective task decomposition and parallel execution.
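The wall-clock benefit of this kind of fan-out is easy to sketch. The snippet below is a generic asyncio illustration of parallel sub-agent dispatch; it is not Kimi's actual API, and every name in it is hypothetical. Ten sub-tasks that would take ~0.1s of simulated tool latency sequentially finish together in roughly the time of the slowest one.

```python
import asyncio

async def sub_agent(task: str) -> str:
    """Stand-in for one sub-agent's tool-call loop (hypothetical)."""
    await asyncio.sleep(0.01)          # simulated tool-call latency
    return f"result:{task}"

async def swarm(tasks: list[str]) -> list[str]:
    # Fan the decomposed sub-tasks out concurrently, then gather results in order.
    return await asyncio.gather(*(sub_agent(t) for t in tasks))

results = asyncio.run(swarm([f"subtask-{i}" for i in range(10)]))
print(results[0])  # result:subtask-0
```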
Humanity's Last Exam (HLE), a benchmark of expert-level questions crowd-sourced from researchers, shows Claude Opus 4.6 at the top (53.1% with tools), with GPT-5.2 (50.0%) and Kimi K2.5 (50.2%) effectively tied.
Vision & multimodal
| Benchmark | Claude Opus 4.6 | GPT-5.2 | GLM-5 | MiniMax M2.5 | Kimi K2.5 | Qwen 3.5 |
|---|---|---|---|---|---|---|
| MMMU Pro | 77.3% | **79.5%** | — | — | 78.5% | — |
| MathVista (mini) | — | — | — | — | **90.1%** | 88.6% |
| OCRBench | — | — | — | — | **92.3%** | 90.8% |
| MRCR v2 (1M ctx) | **76.0%** | — | — | — | — | — |
Kimi K2.5 dominates the vision benchmarks that matter most for document processing: 92.3% on OCRBench and 90.1% on MathVista. These scores are critical for anyone building pipelines that extract structured data from PDFs, invoices, or scanned documents. Qwen 3.5 follows closely at 90.8% and 88.6% respectively.
GPT-5.2 leads on MMMU Pro (university-level multimodal reasoning) at 79.5%. Claude Opus 4.6 stands alone on MRCR v2 — a needle-in-a-haystack retrieval test at 1M token context — where its extended context window gives it a structural advantage.
Pricing comparison
This is where the gap becomes dramatic. We list standard API pricing; batch and cached-prompt discounts can reduce costs further.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Effective cost index |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 16.7x |
| GPT-5.2 | $1.75 | $14.00 | 5.8x |
| GLM-5 | $1.00 | $3.20 | 3.3x |
| Qwen 3.5-397B | $0.60 | $3.60 | 2x |
| Kimi K2.5 | $0.60 | $2.00 | 2x |
| MiniMax M2.5 | $0.30 | $1.10 | 1x |
Cost index calculated as each model's input price relative to MiniMax M2.5, the cheapest model in the comparison.
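The index column is straightforward to reproduce: dividing each model's input price by MiniMax M2.5's recovers the published values.

```python
# Reproduce the cost-index column from the input prices in the table above.
input_price = {                 # USD per 1M input tokens
    "Claude Opus 4.6": 5.00,
    "GPT-5.2": 1.75,
    "GLM-5": 1.00,
    "Qwen 3.5-397B": 0.60,
    "Kimi K2.5": 0.60,
    "MiniMax M2.5": 0.30,
}
baseline = input_price["MiniMax M2.5"]
index = {m: round(p / baseline, 1) for m, p in input_price.items()}
print(index["Claude Opus 4.6"])  # 16.7
```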
MiniMax M2.5 matches Opus 4.6 on SWE-Bench while costing 16.7x less on input tokens and 22.7x less on output tokens. Even the most expensive Chinese model (GLM-5 at $1.00 input) is 5x cheaper than Opus 4.6 on input.
What this means at scale
For teams running high-throughput background agents processing millions of calls per day, the annual cost difference is staggering.
| Scenario: 1M calls/day (1K input tok/call) | Claude Opus 4.6 | GPT-5.2 | GLM-5 | MiniMax M2.5 |
|---|---|---|---|---|
| Daily cost | $5,000 | $1,750 | $1,000 | $300 |
| Monthly cost | $150,000 | $52,500 | $30,000 | $9,000 |
| Annual cost | $1.83M | $639K | $365K | $110K |
| Annual savings vs Opus 4.6 | — | $1.19M | $1.46M | $1.72M |
At 10M calls/day, multiply these figures by 10. A team spending $18.3M/year on Claude Opus 4.6 could achieve comparable coding performance with MiniMax M2.5 for $1.1M — freeing $17.2M annually.
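The scale figures follow from simple arithmetic on input-token pricing (the scenario, like the table, counts input tokens only). A quick sanity check:

```python
# 1M calls/day at 1K input tokens each = 1B input tokens/day.
CALLS_PER_DAY = 1_000_000
TOKENS_PER_CALL = 1_000
input_price = {"Claude Opus 4.6": 5.00, "MiniMax M2.5": 0.30}  # USD per 1M input tokens

def annual_cost(price_per_m_tokens: float) -> float:
    daily_tokens = CALLS_PER_DAY * TOKENS_PER_CALL
    return daily_tokens / 1e6 * price_per_m_tokens * 365

opus = annual_cost(input_price["Claude Opus 4.6"])   # 1,825,000.0  (~$1.83M)
m2_5 = annual_cost(input_price["MiniMax M2.5"])      # 109,500.0    (~$110K)
print(round(opus - m2_5))  # 1715500  -> the ~$1.72M/year savings in the table
```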
Architecture deep dives
GLM-5 — Zhipu AI (Z.AI)
GLM-5 is notable for being trained entirely on Huawei Ascend 910B chips using the MindSpore framework — achieving complete independence from NVIDIA hardware. This alone makes it strategically important for Chinese AI deployment. At 744B total parameters with 40B active, it has the highest active parameter count of the four Chinese models.
The model uses DeepSeek Sparse Attention for long-context handling across its 200K-token window and was trained on 28.5 trillion tokens. GLM-5 was the first Chinese model to top VendingBench 2, a long-horizon agentic coding benchmark requiring sustained operational capability across extended interactions. It is fully MIT-licensed.
Strengths: Agentic coding, long-horizon tasks, BrowseComp, fully open-source. Weaknesses: MMLU-Pro (70.4%) lags behind peers; highest-cost Chinese option.
MiniMax M2.5
The efficiency outlier. MiniMax M2.5 activates just 10 billion parameters — the fewest in this comparison — yet matches Opus 4.6 on SWE-Bench Verified (80.2% vs 80.8%). The architecture uses 256 experts with 8 active per token across 230B total parameters.
In practice, M2.5 completes SWE-Bench tasks in roughly half the time of GLM-5 (21 minutes vs 44 minutes). The "Lightning" variant delivers 100 tokens/second at $1/hour for continuous operation, making it the cheapest frontier-class model by a significant margin.
M2.5 was trained with reinforcement learning across hundreds of thousands of complex real-world environments, with a focus on coding and productivity tasks rather than pure reasoning.
Strengths: Best cost-per-quality ratio, fastest inference, frontier SWE-Bench. Weaknesses: Limited public scores on math/reasoning; narrower benchmark coverage.
Kimi K2.5 — Moonshot AI
The largest model in the comparison at 1 trillion total parameters (32B active, 384 experts with 8 active per token). Kimi K2.5's key differentiator is native multimodality — jointly pre-trained on approximately 15 trillion mixed text and vision tokens — and the Agent Swarm feature.
Agent Swarm splits complex tasks across up to 100 parallel sub-agents, trained with Parallel-Agent Reinforcement Learning (PARL). This gives it a 3–4.5x latency reduction on multi-step agentic tasks compared to sequential execution, while coordinating up to 1,500 tool calls in a single workflow. The result: the strongest open-weight scores on tool-augmented reasoning (HLE, Terminal-Bench) and top marks on vision tasks (OCRBench, MathVista).
Kimi K2.5 also leads on LiveCodeBench v6 (85.0%) and AIME 2025 (96.1%), making it the strongest all-around Chinese contender.
Strengths: Agentic tasks, math, vision, LiveCodeBench, Agent Swarm parallelism. Weaknesses: Higher latency for single-agent tasks; SWE-Bench trails leaders.
Qwen 3.5-397B — Alibaba
Qwen 3.5 is the context window champion at 1 million tokens — 2.5x GPT-5.2's 400K and 4x what Kimi K2.5 offers. At 397B total parameters with 17B active, it uses a novel hybrid attention mechanism that enables 8.6x–19x faster inference than Qwen3-Max while maintaining comparable quality.
The model delivers strong all-around performance and reportedly outperforms GPT-5.2 on approximately 80% of evaluated benchmark categories. It is particularly strong for document-heavy pipelines, long-context tasks, and multilingual workloads — supporting 201 languages and dialects natively. On BrowseComp (78.6%), it edges out both Kimi K2.5 and MiniMax M2.5.
Qwen 3.5 also shines on OS-level agent tasks: 62.2% on OSWorld-Verified and 66.8% on AndroidWorld, both strong results for desktop and mobile UI automation.
Strengths: 1M context, multilingual, desktop/mobile agents, balanced performance. Weaknesses: Math lags GPT-5.2 and Kimi K2.5; SWE-Bench trails by 4+ points.
When to use which model
Coding (SWE-Bench) → Claude Opus 4.6 or MiniMax M2.5. Opus 4.6 has a marginal lead (80.8% vs 80.2%), but M2.5 gets there at 1/17th the input cost and roughly 2x the speed. For high-throughput coding agents, M2.5 is the pragmatic choice.
Competitive programming → Kimi K2.5. 85.0% on LiveCodeBench v6 is the highest score in this comparison by a significant margin — 17 percentage points above Claude Opus 4.6. Qwen 3.5 (83.6%) is the runner-up.
Math → GPT-5.2. Still unmatched at 100% AIME 2025 and 99.4% HMMT. If you need the absolute best on mathematical reasoning and can afford $1.75/1M input, GPT-5.2 is the answer. Kimi K2.5 (96.1% AIME) is the closest open-weight challenger.
Agentic workflows → Kimi K2.5. The Agent Swarm architecture, plus strong showings on HLE, Terminal-Bench, and BrowseComp, makes it the natural choice for complex, tool-heavy pipelines. The 3–4.5x latency reduction from parallel sub-agents is a structural advantage no other model offers.
Massive context windows → Qwen 3.5. 1M tokens is unmatched among the Chinese models (Opus 4.6 offers 1M in beta, but at more than 8x the input price). Strong all-around performance and particularly valuable for document processing, RAG pipelines, and multilingual tasks.
Document OCR & vision → Kimi K2.5 or Qwen 3.5. Kimi leads on OCRBench (92.3%) and MathVista (90.1%). Qwen follows at 90.8% and 88.6%. Both are strong choices for extracting structured data from scanned documents, invoices, and PDFs.
Pure cost optimization → MiniMax M2.5. At $0.30/1M input tokens with frontier-class SWE-Bench performance, it is the clear winner for high-throughput workloads where cost is the primary constraint. At scale, this translates to $1.7M+/year in savings compared to Opus 4.6.
General-purpose frontier quality → Claude Opus 4.6 or GPT-5.2. They still lead on several key benchmarks (BrowseComp, AIME, τ2-Bench), but the gap is closing fast.
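The guide above collapses naturally into a lookup-table router. The sketch below is our own simplification of these recommendations, with invented task labels; a production router would also weigh latency, context length, and per-request budget.

```python
# Minimal task-type router derived from the decision guide (labels are our own).
ROUTES = {
    "coding":       "MiniMax M2.5",    # frontier SWE-Bench at the lowest cost
    "competitive":  "Kimi K2.5",       # LiveCodeBench leader
    "math":         "GPT-5.2",         # AIME / HMMT leader
    "agentic":      "Kimi K2.5",       # Agent Swarm parallelism
    "long_context": "Qwen 3.5-397B",   # 1M-token window
    "ocr_vision":   "Kimi K2.5",       # OCRBench / MathVista leader
    "cost_first":   "MiniMax M2.5",
}

def route(task_type: str, default: str = "Claude Opus 4.6") -> str:
    """Fall back to a general-purpose frontier model for unmapped tasks."""
    return ROUTES.get(task_type, default)

print(route("math"))     # GPT-5.2
print(route("unknown"))  # Claude Opus 4.6
```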
The bigger picture
The February 2026 model releases mark a turning point. For the first time, multiple open-weight Chinese models match or exceed Western proprietary systems on major benchmarks — at 5–17x lower cost. The MoE architecture has proven that you don't need to activate hundreds of billions of parameters on every token to achieve frontier quality.
For teams running high-throughput background agents, this changes the calculus entirely. The question is no longer "can we afford frontier-quality AI on every input?" but "which model gives us the best quality-per-dollar for our specific task?"
That is exactly the question Maniac answers. We automatically evaluate these models on your production data, fine-tune task-specialized variants, and route to the optimal model — so you get frontier quality at a fraction of the cost, regardless of which lab ships the next breakthrough.
Benchmark data sourced from official model releases, Benchable.ai, LLM Stats, Awesome Agents, OpenRouter, and Artificial Analysis. Scores may vary by evaluation harness and configuration. Pricing as of February 2026.