> This is the markdown version of https://www.maniac.ai/blog/qwen-3-5-vs-gemma-4-benchmarks-by-size. Visit the full page for interactive content.


# Qwen 3.5 vs Gemma 4: the benchmark-by-size comparison


April 2, 2026

\[ Next step \]

## Work with our research team to build your specialized agents

Maniac helps teams compare, fine-tune, and route open models on real traffic so you can choose the right model for your workload, not just the one with the best benchmark row.

[Book a demo](/book-demo)[Join the waitlist](/waitlist)

Google DeepMind just shipped [Gemma 4](https://ai.google.dev/gemma/docs/core), its newest open-weight family in `E2B`, `E4B`, `26B A4B`, and `31B` sizes. Alibaba's [Qwen 3.5](https://github.com/QwenLM/Qwen3.5) is already one of the strongest open families on the market, spanning `2B`, `4B`, `9B`, `27B`, `35B-A3B`, `122B-A10B`, and `397B-A17B`.

If you are choosing an open model for local agents, laptop inference, or self-hosted production, the useful question is not "which family has the biggest headline model?" It is **which family wins at the size you can actually run**.

There is not yet a single **Vals-style third-party table** covering _every_ `Gemma 4` and `Qwen 3.5` size. So the most reliable comparison today has to separate **two different kinds of evidence**:

1.  **Third-party chat-preference evidence**, where Google cites [Arena AI's open-source text leaderboard](https://arena.ai/leaderboard/text?license=open-source).
2.  **Official model-card overlap**, where we match models by **deployment class** and compare only the benchmark rows both families actually publish.

**Short version:** the evidence is **mixed, not contradictory**. On the currently published **official benchmark overlap**, `Qwen 3.5` wins more rows in the `2B`, `4B`, and `mid-size MoE` classes, while `Gemma 4` is most competitive in the **dense ~30B** class and has a better story for **audio at the edge**, **multilingual**, and some **multimodal** workloads. On **Arena AI**, though, `Gemma 4 31B` and `Gemma 4 26B A4B` both rank above the comparable open `Qwen 3.5` large models in chat preference.

## Quick verdict

| Size class | Gemma 4 counterpart | Qwen 3.5 counterpart | Current best reading |
| --- | --- | --- | --- |
| Edge / mobile | Gemma 4 E2B | Qwen3.5-2B | No good third-party leaderboard yet; official overlap favors **Qwen** |
| 4B class | Gemma 4 E4B | Qwen3.5-4B | No good third-party leaderboard yet; official overlap favors **Qwen** |
| Large dense models | Gemma 4 31B | Qwen3.5-27B | Official overlap is **split**; Arena AI chat preference favors **Gemma 4 31B** |
| Efficient MoE | Gemma 4 26B A4B | Qwen3.5-35B-A3B | Official overlap favors **Qwen**; Arena AI chat preference favors **Gemma 4 26B A4B** |

Qwen also has **sizes that Gemma 4 does not match directly**: a mid-range `9B` and an upper tier of `122B-A10B` and `397B-A17B`.

* * *

## Why Google's claim can still be true

If you only read the model-card tables below, it is easy to conclude that `Qwen 3.5` broadly beats `Gemma 4`. That is **too strong**. Google's launch post is pointing at a **different kind of evidence**.

On the [Arena AI open-source text leaderboard](https://arena.ai/leaderboard/text?license=open-source) page dated **March 31, 2026**:

-   `Gemma 4 31B` is **#3 open model** at **1452 ± 9**
-   `Qwen3.5-397B-A17B` is **#4** at **1449 ± 6**
-   `Gemma 4 26B A4B` is **#6** at **1441 ± 9**
-   `Qwen3.5-122B-A10B` is at **1416 ± 6**
-   `Qwen3.5-27B` is at **1404 ± 6**
-   `Qwen3.5-35B-A3B` is at **1400 ± 6**

That is real **third-party** evidence in Gemma's favor, especially for Google's `byte for byte` and `intelligence-per-parameter` framing. It does **not** mean Gemma 4 beats Qwen 3.5 on every benchmark. It means that on a large-scale **chat-preference leaderboard**, `Gemma 4 31B` is currently placed above every open Qwen 3.5 model listed here, and `Gemma 4 26B A4B` is placed above all of them except `Qwen3.5-397B-A17B`.

The honest reading is:

-   **Arena AI** currently leans **Gemma 4** for large-model assistant quality.
-   **Official model-card overlap** leans **Qwen 3.5** on many static reasoning, coding, and agent rows.
-   **Small-model third-party evidence is still thin**, so the `2B` and `4B` conclusions remain more provisional than the large-model ones.
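The Arena AI scores listed above each carry a ± band, and some of the gaps are smaller than the bands. A minimal sketch of how one might check which pairs are actually separated, assuming each ± value can be treated as a simple symmetric interval (the `ArenaScore` type and `bands_overlap` helper are illustrative names, not part of any Arena AI tooling, and overlapping intervals are only a rough screen, not a formal significance test):

```python
from typing import NamedTuple


class ArenaScore(NamedTuple):
    model: str
    score: float
    margin: float  # the ± value printed next to the score

    @property
    def low(self) -> float:
        return self.score - self.margin

    @property
    def high(self) -> float:
        return self.score + self.margin


# Scores copied from the leaderboard list above (page dated March 31, 2026).
SCORES = [
    ArenaScore("Gemma 4 31B", 1452, 9),
    ArenaScore("Qwen3.5-397B-A17B", 1449, 6),
    ArenaScore("Gemma 4 26B A4B", 1441, 9),
    ArenaScore("Qwen3.5-122B-A10B", 1416, 6),
    ArenaScore("Qwen3.5-27B", 1404, 6),
    ArenaScore("Qwen3.5-35B-A3B", 1400, 6),
]


def bands_overlap(a: ArenaScore, b: ArenaScore) -> bool:
    """True when the two ± intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


def clearly_ahead(scores: list[ArenaScore]) -> list[tuple[str, str]]:
    """Pairs (leader, trailer) whose intervals do not overlap at all."""
    pairs = []
    for a in scores:
        for b in scores:
            if a.score > b.score and not bands_overlap(a, b):
                pairs.append((a.model, b.model))
    return pairs


if __name__ == "__main__":
    for leader, trailer in clearly_ahead(SCORES):
        print(f"{leader} is clearly ahead of {trailer}")
```

Under this reading, the `Gemma 4 31B` versus `Qwen3.5-397B-A17B` ordering sits inside the bands, while the gaps between the two larger Gemma 4 models and the `27B` / `35B-A3B` Qwen models do not.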

* * *

## The bigger trend: open weights are improving about as fast as closed

One useful way to step back from the `Gemma 4` vs `Qwen 3.5` question is to chart the **best open-weight model available** and the **best closed model available** at each point over the last year on a single third-party metric.

For that, `Arena AI` is imperfect but useful. It measures **human-preference chat quality**, not static benchmark accuracy. That means it should not replace the model-card tables below. But it _does_ put open and closed models on the same scale, which makes it a good proxy for **frontier pace**.

![Step chart of open-weights frontier versus closed frontier over the last year on Arena AI overall text score](/images/blog/open-vs-closed-frontier-pace-arena.svg)

On this metric, the **open-weights frontier** moved from **1398** to **1456** over the last year (**+58**), while the **closed frontier** moved from **1448** to **1504** (**+56**). So open weights improved at **essentially the same pace** as closed weights on this leaderboard (the 2-point difference in gains is well inside the stated uncertainty bands), and the absolute gap barely changed: **50 points** at the start of the window versus **48 points** at the end.

### Frontier milestones behind the chart

| Track | Became frontier | Model | Arena AI score |
| --- | --- | --- | --- |
| Open weights | Window start (Apr 2, 2025) | DeepSeek-R1 | 1398 ± 5 |
| Open weights | Aug 21, 2025 | DeepSeek-V3.1 | 1418 ± 6 |
| Open weights | Dec 22, 2025 | GLM-4.7 | 1443 ± 6 |
| Open weights | Jan 27, 2026 | Kimi K2.5 Thinking | 1453 ± 5 |
| Open weights | Feb 11, 2026 | GLM-5 | 1456 ± 6 |
| Closed weights | Window start (Apr 2, 2025) | Gemini 2.5 Pro Experimental | 1448 ± 3 |
| Closed weights | Aug 5, 2025 | Claude Opus 4.1 Thinking | 1449 ± 3 |
| Closed weights | Nov 12, 2025 | GPT-5.1 High | 1455 ± 4 |
| Closed weights | Nov 24, 2025 | Claude Opus 4.5 Thinking | 1474 ± 4 |
| Closed weights | Feb 5, 2026 | Claude Opus 4.6 Thinking | 1504 ± 6 |

Three things this chart makes clear:

-   **Open weights are no longer moving at a dramatically slower pace.** On this third-party chat metric, the slope is now very similar.
-   **The gap is still real.** Open models gained slightly more points than closed models (**+58** versus **+56**), but that did **not** materially erase the closed-model lead over the last year.
-   **Most of the open-model catch-up came from a few big releases, not a smooth monthly grind.** The same is true on the closed side, especially around `Claude Opus 4.5` and `Claude Opus 4.6`.

_Method note: the chart uses the **March 31, 2026** `Arena AI` overall text leaderboard as a single common score scale, then plots the frontier envelope by model release date. This is useful for comparing **pace**, but it is still a **chat-preference** metric, not a lab benchmark. Small score changes within the stated uncertainty bands should be treated cautiously. `GLM-5` uses its first public availability on Hugging Face (**Feb 11, 2026**) as the date anchor because Z.ai's public launch details were not accessible in a machine-readable post when this article was updated._
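The frontier-envelope idea in the method note can be sketched in a few lines: sort releases by date and keep only the ones that set a new best score. This is a minimal illustration using the milestone rows from this section, not the code behind the chart; `frontier_envelope` and `summarize` are hypothetical helper names, and the window-start rows are anchored at April 2, 2025 because those models were already the frontier when the window opened.

```python
from datetime import date

WINDOW_START = date(2025, 4, 2)

# (frontier date, model, Arena AI score), copied from the milestone rows.
OPEN = [
    (WINDOW_START, "DeepSeek-R1", 1398),
    (date(2025, 8, 21), "DeepSeek-V3.1", 1418),
    (date(2025, 12, 22), "GLM-4.7", 1443),
    (date(2026, 1, 27), "Kimi K2.5 Thinking", 1453),
    (date(2026, 2, 11), "GLM-5", 1456),
]
CLOSED = [
    (WINDOW_START, "Gemini 2.5 Pro Experimental", 1448),
    (date(2025, 8, 5), "Claude Opus 4.1 Thinking", 1449),
    (date(2025, 11, 12), "GPT-5.1 High", 1455),
    (date(2025, 11, 24), "Claude Opus 4.5 Thinking", 1474),
    (date(2026, 2, 5), "Claude Opus 4.6 Thinking", 1504),
]


def frontier_envelope(rows):
    """Keep only the rows that raise the running best score, in date order."""
    best = float("-inf")
    envelope = []
    for row in sorted(rows, key=lambda r: r[0]):
        if row[2] > best:
            best = row[2]
            envelope.append(row)
    return envelope


def summarize(open_rows, closed_rows):
    """Gains over the window and the closed-minus-open gap at each end."""
    open_env = frontier_envelope(open_rows)
    closed_env = frontier_envelope(closed_rows)
    return {
        "open_gain": open_env[-1][2] - open_env[0][2],
        "closed_gain": closed_env[-1][2] - closed_env[0][2],
        "start_gap": closed_env[0][2] - open_env[0][2],
        "end_gap": closed_env[-1][2] - open_env[-1][2],
    }


if __name__ == "__main__":
    print(summarize(OPEN, CLOSED))
```

Because the milestone rows already list only frontier-setting models, every row survives the filter here; fed a full release list, the same function would drop any model that did not beat the running best on its release date.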

### Interactive across benchmarks

The static chart above is the cleanest single-view story because it keeps open and closed on the same `Arena AI` chat-preference scale. But if you want to stress-test the takeaway across **task benchmarks**, the explorer below recomputes the `open-weight` and `closed-weight` frontier envelope over the same one-year window using current sourced rows from [Vals AI](https://www.vals.ai/benchmarks), whose benchmark pages also keep older model rows visible.

### Open vs closed frontier, benchmark by benchmark

**Explorer snapshot (default view):** `Knowledge`, a Vals AI academic reasoning benchmark.

| Metric | Value | How it is computed |
| --- | --- | --- |
| Open gain | +7.1 pts | Change from the first to the current open frontier point |
| Closed gain | +11.4 pts | Change from the first to the current closed frontier point |
| End gap | Closed +8.1 pts | Closed minus open at the current frontier edge |

_Chart: frontier score (y-axis, 76.3% to 99.5%) against frontier date (x-axis, Apr 2025 to Apr 2026), with one step line for open weights and one for closed weights._

| Track | Model | Frontier date | Score |
| --- | --- | --- | --- |
| Open weights | DeepSeek V3.2 (Thinking) | Dec 1, 2025 | 80.3% |
| Open weights | Kimi K2.5 | Jan 27, 2026 | 84.1% |
| Open weights | Qwen 3.5 Plus | Feb 16, 2026 | 87.4% |
| Closed weights | o3 | Apr 16, 2025 | 84.1% |
| Closed weights | GPT 5 | Aug 7, 2025 | 85.6% |
| Closed weights | Claude Opus 4.5 (Thinking) | Nov 1, 2025 | 85.9% |
| Closed weights | GPT 5.1 | Nov 13, 2025 | 86.6% |
| Closed weights | Gemini 3 Pro | Nov 18, 2025 | 91.7% |
| Closed weights | Gemini 3.1 Pro Preview | Feb 19, 2026 | 95.5% |

_Explorer note: this uses current `Vals AI` benchmark pages rather than release-day model-card snapshots. That makes it useful for comparing the **shape** of progress by benchmark, but it should not be read as a release-day historical record. If a model is missing a sourced row on a given benchmark, that milestone simply does not appear for that metric. Historical frontier points can still include superseded models when they really were best on that benchmark at the time._

* * *

## How I matched the models

Two caveats matter before looking at the tables:

1.  **Gemma's small models use effective parameters.** `Gemma 4 E2B` is **2.3B effective / 5.1B loaded with embeddings**, and `Gemma 4 E4B` is **4.5B effective / 8B loaded**. These are best understood as **deployment-class matches** to Qwen's `2B` and `4B`, not exact raw-weight matches.
2.  **Qwen publishes different benchmark modes by size.** `Qwen3.5-4B`, `Qwen3.5-27B`, and `Qwen3.5-35B-A3B` run in **thinking mode by default** on their model cards. `Qwen3.5-2B` publishes separate **thinking** and **non-thinking** scores; this post uses the **thinking** number whenever the card shows `thinking / non-thinking`.
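To make the effective-versus-loaded distinction in the first caveat concrete, here is a rough weight-only memory estimate. It is a sketch under stated assumptions: `weight_memory_gb` is a hypothetical helper, the bytes-per-parameter values (2 for 16-bit weights, 0.5 for 4-bit quantization) are generic rules of thumb rather than figures from either model card, and KV cache, activations, and runtime overhead are ignored.

```python
# Parameter counts in billions, taken from the caveat above.
GEMMA_SMALL = {
    "Gemma 4 E2B": {"effective": 2.3, "loaded": 5.1},
    "Gemma 4 E4B": {"effective": 4.5, "loaded": 8.0},
}


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in decimal gigabytes.

    1e9 parameters * bytes_per_param bytes / 1e9 bytes per GB
    simplifies to params_billions * bytes_per_param.
    """
    return params_billions * bytes_per_param


if __name__ == "__main__":
    for name, counts in GEMMA_SMALL.items():
        for kind, billions in counts.items():
            at_16_bit = weight_memory_gb(billions, 2.0)
            at_4_bit = weight_memory_gb(billions, 0.5)
            print(f"{name} ({kind}): ~{at_16_bit:.1f} GB at 16-bit, ~{at_4_bit:.1f} GB at 4-bit")
```

The gap between the effective and loaded rows is the reason these pairings are described as deployment-class matches rather than raw-weight matches.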

I also exclude benchmarks that do not give a clean head-to-head row here, either because only one family publishes them for the matched sizes or because the two methodologies differ:

-   `AIME 2026 no tools`, because Gemma publishes it but the matched Qwen sizes do not.
-   `SWE-bench Verified`, because Qwen publishes it for larger models but Gemma 4 does not.
-   `CodeForces`, because Qwen footnotes that its `CodeForces` result is measured on **its own query set**, making it a poor direct comparison to Google's published `Codeforces ELO`.

That means the tables below should be read as **reliable but narrow**: they are good for comparing the published overlap, but they are **not** the full story of overall assistant quality.

Across the tables below, read the rows as:

-   `MMLU-Pro`: general knowledge and reasoning
-   `GPQA Diamond`: expert science reasoning
-   `LiveCodeBench v6`: coding
-   `Tau2 / TAU2-Bench`: agentic/tool-use behavior
-   `MMMLU`: multilingual reasoning
-   `MMMU-Pro`: multimodal reasoning

### Size matching

| Deployment class | Gemma 4 | Qwen 3.5 | Why this is the right pairing |
| --- | --- | --- | --- |
| Edge / mobile | E2B | 2B | smallest local models |
| 4B class | E4B | 4B | small-laptop / edge-plus tier |
| Large dense models | 31B dense | 27B dense | largest dense model in each family |
| Efficient MoE | 26B A4B | 35B-A3B | closest mid-size MoE class, with ~4B vs ~3B active parameters |

* * *

## Edge / mobile class

This is the closest comparison for **phone, browser, Raspberry Pi, and lightweight local assistant** deployments.

| Benchmark | Gemma 4 E2B | Qwen3.5-2B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 60.0 | **66.5** | Qwen |
| Tau2 / TAU2-Bench\* | 24.5 | **48.8** | Qwen |
| MMMLU | **67.4** | 63.1 | Gemma |
| MMMU-Pro | 44.2 | **50.3** | Qwen |

`Qwen3.5-2B` wins **3 of the 4 overlap rows**, and the `Tau2` gap is large enough to matter if you care about **tool-using assistants** or other structured workflows. On the **currently published overlap**, Qwen is the stronger small text-and-agent model.

`Gemma 4 E2B` still has a real case, though. It leads on `MMMLU`, supports **native audio** on the small-model tier, and is designed around Google's mobile and Android ecosystem. If your edge workload is **voice + vision + lightweight reasoning**, Gemma is not just a consolation prize.

The deployment tradeoff is also different: `Gemma 4 E2B` gives you **128K** context and native audio, while `Qwen3.5-2B` gives you **262K native** context. If your workload involves long docs, repo snippets, or multi-turn tool traces, Qwen's context advantage may matter more than a few benchmark points.

* * *

## 4B class

This is the class most teams actually evaluate for **single-user local copilots**, **low-cost API inference**, and **small-GPU agents**.

| Benchmark | Gemma 4 E4B | Qwen3.5-4B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 69.4 | **79.1** | Qwen |
| GPQA Diamond | 58.6 | **76.2** | Qwen |
| LiveCodeBench v6 | 52.0 | **55.8** | Qwen |
| Tau2 / TAU2-Bench\* | 42.2 | **79.9** | Qwen |
| MMMLU | **76.6** | 76.1 | Gemma |
| MMMU-Pro | 52.6 | **66.3** | Qwen |

This is the **clearest result in the official overlap tables**. `Qwen3.5-4B` is ahead on nearly every row that matters for **reasoning**, **science**, **coding**, **agents**, and **multimodal reasoning**. Gemma only nudges ahead on `MMMLU`, and even there the gap is just **0.5 points**.

If you want the **strongest small open model** on the published benchmark tables, `Qwen3.5-4B` is the more impressive release. The `Tau2` margin is especially notable: it suggests that Qwen's reinforcement-learning and agent training stack is showing up in behavior, not just in static knowledge benchmarks.

`Gemma 4 E4B` still keeps the same two structural advantages as `E2B`: **native audio** and tighter **Google edge tooling**. But on pure benchmark output, the `4B` class is currently **Qwen's strongest win**.

* * *

## Large dense models

This is the most interesting matchup in the post because it is the one where the **official overlap** and the **third-party chat leaderboard** pull in slightly different directions.

| Benchmark | Gemma 4 31B | Qwen3.5-27B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 85.2 | **86.1** | Qwen |
| GPQA Diamond | 84.3 | **85.5** | Qwen |
| LiveCodeBench v6 | 80.0 | **80.7** | Qwen |
| Tau2 / TAU2-Bench\* | 76.9 | **79.0** | Qwen |
| MMMLU | **88.4** | 85.9 | Gemma |
| MMMU-Pro | **76.9** | 75.0 | Gemma |

By row count, `Qwen3.5-27B` wins the text-heavy side of the table: `MMLU-Pro`, `GPQA Diamond`, `LiveCodeBench`, and `TAU2`. But the margins on the first three are **small**, while `Gemma 4 31B` puts up the better numbers on `MMMLU` and `MMMU-Pro`.

That makes this the most **balanced** size class on the official overlap:

-   If you care most about **text reasoning**, **general problem solving**, and **agentic behavior**, `Qwen3.5-27B` gets the nod.
-   If you want a large dense model with stronger **multilingual** and **multimodal** behavior, `Gemma 4 31B` has the better argument than a quick winner-count suggests.

This is also the size class where Google's launch framing has the strongest independent support. On Arena AI, `Gemma 4 31B` at **1452 ± 9** sits above even `Qwen3.5-397B-A17B` at **1449 ± 6**, which is exactly the kind of result Google is pointing to with `byte for byte`.

So the reliable conclusion here is not "Qwen wins" or "Gemma wins." It is: **static benchmark overlap slightly favors Qwen on text-heavy rows, while third-party chat preference favors Gemma 4 31B overall.**

* * *

## MoE ~4B-active class

This is the right comparison for teams that want **mid-size MoE efficiency** without jumping all the way to Qwen's upper-tier `122B-A10B` and `397B-A17B` models.

| Benchmark | Gemma 4 26B A4B | Qwen3.5-35B-A3B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 82.6 | **85.3** | Qwen |
| GPQA Diamond | 82.3 | **84.2** | Qwen |
| LiveCodeBench v6 | **77.1** | 74.6 | Gemma |
| Tau2 / TAU2-Bench\* | 68.2 | **81.2** | Qwen |
| MMMLU | **86.3** | 85.2 | Gemma |
| MMMU-Pro | 73.8 | **75.1** | Qwen |

`Qwen3.5-35B-A3B` is the stronger **all-around MoE on the published overlap**: better text reasoning, better expert-science reasoning, a large `Tau2` lead, and a small edge on `MMMU-Pro`.

`Gemma 4 26B A4B` is not just a speed-oriented compromise, though. It still wins `LiveCodeBench v6`, and it also beats Qwen on `MMMLU`. So if your MoE workload is mostly **coding** plus **multilingual** inference, Gemma remains worth a real A/B test.

For mixed workloads, the official overlap still favors `Qwen3.5-35B-A3B`. But again, the external chat-preference signal points the other way: on Arena AI, `Gemma 4 26B A4B` scores **1441 ± 9** versus **1400 ± 6** for `Qwen3.5-35B-A3B`.

That suggests Gemma's assistant-style tuning may be stronger than the static overlap rows alone would imply.

* * *

## What the benchmark pattern says

-   **On official model-card overlap, Qwen 3.5 wins more rows.** That is most obvious in the `4B` class, and still directionally true in `2B` and the MoE matchup.
-   **On Arena AI, Gemma 4's big models currently look stronger.** `Gemma 4 31B` and `26B A4B` outrank the comparable open `Qwen 3.5` models on third-party chat preference.
-   **The large dense-model class is the real battleground.** `Gemma 4 31B` is where the two narratives meet: static overlap is split, but third-party assistant preference currently leans Gemma.
-   **Qwen's lineup is broader.** There is no direct Gemma 4 answer to `Qwen3.5-9B`, `Qwen3.5-122B-A10B`, or `Qwen3.5-397B-A17B`.
-   **Qwen has the long-context advantage.** Qwen's model cards advertise **262K native context** across the family, with support for extending toward **~1.01M** via scaling. Gemma 4 gives you **128K** on `E2B/E4B` and **256K** on `26B/31B`.
-   **The small-model verdicts are less settled than the big-model ones.** There is not yet a strong third-party leaderboard covering `Gemma 4 E2B/E4B` versus `Qwen3.5-2B/4B`, so those sections rely mostly on vendor-published benchmark overlap.
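The row-count claims in this list can be reproduced directly from the four overlap tables. The sketch below hard-codes the published scores as `(Gemma, Qwen)` pairs and tallies winners per size class; `tally` and `biggest_gap` are illustrative helper names, and a raw win count deliberately ignores margin size, which is why the second helper reports the widest gap as well.

```python
from collections import Counter

# benchmark -> (Gemma 4 score, Qwen 3.5 score), copied from the tables above.
ROWS = {
    "Edge / mobile": {
        "MMLU-Pro": (60.0, 66.5),
        "Tau2 / TAU2-Bench": (24.5, 48.8),
        "MMMLU": (67.4, 63.1),
        "MMMU-Pro": (44.2, 50.3),
    },
    "4B class": {
        "MMLU-Pro": (69.4, 79.1),
        "GPQA Diamond": (58.6, 76.2),
        "LiveCodeBench v6": (52.0, 55.8),
        "Tau2 / TAU2-Bench": (42.2, 79.9),
        "MMMLU": (76.6, 76.1),
        "MMMU-Pro": (52.6, 66.3),
    },
    "Large dense models": {
        "MMLU-Pro": (85.2, 86.1),
        "GPQA Diamond": (84.3, 85.5),
        "LiveCodeBench v6": (80.0, 80.7),
        "Tau2 / TAU2-Bench": (76.9, 79.0),
        "MMMLU": (88.4, 85.9),
        "MMMU-Pro": (76.9, 75.0),
    },
    "Efficient MoE": {
        "MMLU-Pro": (82.6, 85.3),
        "GPQA Diamond": (82.3, 84.2),
        "LiveCodeBench v6": (77.1, 74.6),
        "Tau2 / TAU2-Bench": (68.2, 81.2),
        "MMMLU": (86.3, 85.2),
        "MMMU-Pro": (73.8, 75.1),
    },
}


def tally(rows):
    """Count how many benchmark rows each family wins in one size class."""
    wins = Counter()
    for gemma, qwen in rows.values():
        if gemma > qwen:
            wins["Gemma"] += 1
        elif qwen > gemma:
            wins["Qwen"] += 1
        else:
            wins["Tie"] += 1
    return dict(wins)


def biggest_gap(rows):
    """Return (benchmark, Qwen minus Gemma) for the row with the widest margin."""
    name = max(rows, key=lambda k: abs(rows[k][0] - rows[k][1]))
    gemma, qwen = rows[name]
    return name, round(qwen - gemma, 1)


if __name__ == "__main__":
    for size_class, rows in ROWS.items():
        print(size_class, tally(rows), biggest_gap(rows))
```

The `Tau2 / TAU2-Bench` rows are included here as in the tables, with the same caveat that the two vendors report slightly different variants of that benchmark.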

* * *

## Which one would I pick?

**For `2B` edge deployments:** `Qwen3.5-2B` if you care about text quality, tool use, and long context; `Gemma 4 E2B` if **audio** and **Google-edge integration** matter more.

**For `4B`:** `Qwen3.5-4B`. This is the easiest call in the post.

**For large dense models:** `Qwen3.5-27B` for text-first assistants and agents, `Gemma 4 31B` for more balanced multilingual and multimodal local workloads.

**For mid-size MoE:** `Qwen3.5-35B-A3B` unless your core KPI is coding-heavy multilingual work and you specifically want to test `Gemma 4 26B A4B`.

The important part is that the family-level prior is now clear enough to guide a shortlist. But the prior is **not one-dimensional**: if you care about **static task benchmarks**, the published overlap often points to `Qwen 3.5`; if you care about **assistant-style chat quality**, today's strongest third-party signal points to `Gemma 4` at the top end.

The posterior still comes from **your prompts**, **your context length**, **your tool calls**, and **your latency budget**. That is exactly why real deployment teams should treat public benchmarks as a filter, not as the final answer.
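One way to use public numbers as a filter rather than a verdict is to apply hard constraints first and only then rank what is left. The sketch below is illustrative, not a recommendation engine: `Candidate` and `shortlist` are hypothetical names, context windows are in thousands of tokens as quoted in this post, and `native_audio` is set to `True` only for the two Gemma small models because those are the only ones this post describes as having it (treating every other model as lacking audio is an assumption, not something the model cards were checked for).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    context_k: int       # context window in thousands of tokens, as quoted in the post
    native_audio: bool   # assumption: True only where the post says so
    mmlu_pro: float
    tau2: float


CANDIDATES = [
    Candidate("Gemma 4 E2B", 128, True, 60.0, 24.5),
    Candidate("Qwen3.5-2B", 262, False, 66.5, 48.8),
    Candidate("Gemma 4 E4B", 128, True, 69.4, 42.2),
    Candidate("Qwen3.5-4B", 262, False, 79.1, 79.9),
    Candidate("Gemma 4 31B", 256, False, 85.2, 76.9),
    Candidate("Qwen3.5-27B", 262, False, 86.1, 79.0),
    Candidate("Gemma 4 26B A4B", 256, False, 82.6, 68.2),
    Candidate("Qwen3.5-35B-A3B", 262, False, 85.3, 81.2),
]


def shortlist(candidates, min_context_k=0, need_audio=False, rank_by="tau2"):
    """Drop models that fail a hard constraint, then sort by one benchmark row."""
    kept = [
        c for c in candidates
        if c.context_k >= min_context_k and (c.native_audio or not need_audio)
    ]
    return sorted(kept, key=lambda c: getattr(c, rank_by), reverse=True)


if __name__ == "__main__":
    for c in shortlist(CANDIDATES, min_context_k=200, rank_by="mmlu_pro"):
        print(c.name, c.mmlu_pro)
```

The output of a filter like this is a list of models to A/B test on your own traffic, which is where the latency budget and real tool-call behavior actually get measured.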

* * *

_Source note: `Gemma 4` rows come from Google's [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) and [launch post](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/). `Qwen 3.5` rows come from the official model cards for [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B), [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B), [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B), and [Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B), accessed April 2, 2026. Arena AI numbers come from the [open-source text leaderboard](https://arena.ai/leaderboard/text?license=open-source), page dated **March 31, 2026**, accessed April 2, 2026. The interactive benchmark explorer uses current sourced rows from [Vals AI benchmark pages](https://www.vals.ai/benchmarks), updated April 2, 2026. `Qwen3.5-2B` publishes some vision rows as `thinking / non-thinking`; this post uses the thinking score. `Tau2 / TAU2-Bench` is included as a directional agent row, but Google reports `Tau2 (average over 3)` while Qwen reports `TAU2-Bench` with the airline-domain fix noted in its model card._


---

*Maniac, High throughput background agents. Opus-quality outputs at 1/50 of the cost. Learn more at [maniac.ai](https://www.maniac.ai).*