

# MiniMax M2.7 vs GLM 5.1 for long-horizon agents


April 12, 2026

If you care about **long-horizon agent work**, the important question is not who looks best on a giant mixed leaderboard. It is who looks stronger on tasks that resemble what agents actually do: **multi-step coding, tool use, repository repair, and workflows that can go wrong halfway through**.

The most relevant rows are:

-   **Finance Agent (v1.1)**
-   **Terminal-Bench 2.0**
-   **SWE-bench**
-   **LiveCodeBench**
-   **Vibe Code Bench**

I also look at **latency**, **token pricing**, **context window**, and **max output tokens**, because those operational constraints compound quickly once an agent starts taking multiple turns.

* * *

## TL;DR

-   **GLM 5.1** wins all five of the long-horizon rows I would care about first: **Finance Agent**, **Terminal-Bench 2.0**, **SWE-bench**, **LiveCodeBench**, and **Vibe Code Bench**.
-   The biggest gaps are **Finance Agent (57.66% vs 48.40%)** and **Terminal-Bench 2.0 (53.93% vs 47.19%)**.
-   A simple average across those five rows gives **GLM 5.1 at 60.17%** versus **MiniMax M2.7 at 55.27%**. That average is my own rollup from the cited public numbers, not an official benchmark metric.
-   **MiniMax M2.7** is still much cheaper: **$0.30 / $1.20** input/output versus **$1.00 / $3.20** for GLM 5.1.
-   For benchmarked long-horizon performance, **GLM looks stronger today**. For cost-constrained long runs, **MiniMax is the budget play**.

* * *

## The long-horizon rows

| Benchmark | GLM 5.1 | MiniMax M2.7 | Edge |
| --- | --- | --- | --- |
| **Finance Agent (v1.1)** | **57.66%** | **48.40%** | **GLM +9.26** |
| **Terminal-Bench 2.0** | **53.93%** | **47.19%** | **GLM +6.74** |
| **SWE-bench** | **76.40%** | **73.80%** | **GLM +2.60** |
| **LiveCodeBench** | **81.38%** | **79.93%** | **GLM +1.45** |
| **Vibe Code Bench** | **31.46%** | **27.04%** | **GLM +4.42** |
| **Simple average of rows above** | **60.17%** | **55.27%** | **GLM +4.89** |

This is the cleanest summary I can get if the question is "which model looks better for agentic work that unfolds over multiple steps?" On every row that looks most relevant to that question, **GLM is ahead**.
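
If you want to reproduce the rollup, it is just an unweighted mean of the five scores above. A minimal Python sketch, using only the cited numbers:

```python
# Unweighted rollup of the five long-horizon rows cited above.
# Per-row scores come from the Vals comparison page; the average
# itself is the author's own rollup, not an official benchmark metric.
rows = {
    "Finance Agent (v1.1)": (57.66, 48.40),
    "Terminal-Bench 2.0":   (53.93, 47.19),
    "SWE-bench":            (76.40, 73.80),
    "LiveCodeBench":        (81.38, 79.93),
    "Vibe Code Bench":      (31.46, 27.04),
}

glm_avg = sum(glm for glm, _ in rows.values()) / len(rows)
minimax_avg = sum(mm for _, mm in rows.values()) / len(rows)

print(f"GLM 5.1 average:      {glm_avg:.2f}%")                    # ~60.17%
print(f"MiniMax M2.7 average: {minimax_avg:.2f}%")                # ~55.27%
print(f"Edge:                 GLM +{glm_avg - minimax_avg:.2f}")  # ~+4.89
```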

* * *

## Why these rows matter more than the overall index

The overall **Index** still favors **GLM 5.1**, **63.17%** to **59.58%**, and the overall **coding** category average does too, **60.79%** to **56.99%**. That is directionally useful.

But if your use case is specifically **long-horizon work**, the overall index is not the main thing to watch because it mixes in categories that are much less agent-like. A strong **CaseLaw** or **GPQA** row can matter for some buyers, but it does not tell you nearly as much about a model that has to:

-   plan across multiple turns
-   recover after a bad step
-   use tools or act through a runtime
-   modify code or reason over a repository

That is why the **Finance Agent**, **Terminal-Bench**, **SWE-bench**, and broader coding rows are the real signal here.

* * *

## Where GLM creates the clearest separation

The most important thing here is not that GLM wins by a little. It is that the rows most associated with multi-step agent loops all point in the same direction.

Two rows stand out most:

-   **Finance Agent (v1.1): 57.66% vs 48.40%**
-   **Terminal-Bench 2.0: 53.93% vs 47.19%**

Those are the biggest gaps among the long-horizon rows, and they are also the rows that most obviously suggest **agents acting through tools or extended workflows** rather than pure one-shot completion.

The coding rows reinforce the same story:

-   **SWE-bench:** 76.40% vs 73.80%
-   **LiveCodeBench:** 81.38% vs 79.93%
-   **Vibe Code Bench:** 31.46% vs 27.04%

None of those individual coding gaps are gigantic on their own. The point is that **they all stack in the same direction**. If you care about whether an agent can keep a repo task alive across multiple steps, the public benchmark picture is more favorable to **GLM 5.1**.

* * *

## What MiniMax buys you anyway

MiniMax's case is not benchmark leadership on these rows. Its case is **economics** and a couple of operational constraints that can matter once long runs get expensive.

| Operational metric | GLM 5.1 | MiniMax M2.7 | Read |
| --- | --- | --- | --- |
| **Input cost** | **$1.00 / 1M** | **$0.30 / 1M** | MiniMax is **70% cheaper** |
| **Output cost** | **$3.20 / 1M** | **$1.20 / 1M** | MiniMax is **62.5% cheaper** |
| **Latency** | **7.63m** | **10.33m** | GLM is lower on Vals' latency row |
| **Context window** | **200K** | **197K** | Essentially tied |
| **Max output tokens** | **131K** | **197K** | MiniMax allows longer output |

On a workload with **1M input tokens and 1M output tokens**, the cited public prices imply:

-   **GLM 5.1:** **$4.20**
-   **MiniMax M2.7:** **$1.50**

That makes MiniMax about **64% cheaper** on a balanced input/output workload. If your agent loops are long, branchy, or retry-heavy, that cost gap can buy you a lot of extra attempts.
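
The arithmetic behind those numbers is just price-per-million times tokens, summed over input and output. A minimal Python sketch using the cited rates (the 1M/1M split is illustrative, not a measured workload):

```python
# Cost of a workload at the cited per-million-token prices (USD).
# Prices are from the Vals comparison page; the 1M/1M split is just
# an illustrative balanced workload, not a measured trace.
def run_cost(input_tokens_m: float, output_tokens_m: float,
             in_price: float, out_price: float) -> float:
    return input_tokens_m * in_price + output_tokens_m * out_price

glm = run_cost(1, 1, in_price=1.00, out_price=3.20)      # $4.20
minimax = run_cost(1, 1, in_price=0.30, out_price=1.20)  # $1.50

print(f"GLM 5.1:      ${glm:.2f}")
print(f"MiniMax M2.7: ${minimax:.2f}")
print(f"MiniMax is {100 * (1 - minimax / glm):.0f}% cheaper on this split")  # ~64%
```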

So even though **GLM looks better on the long-horizon benchmark rows**, **MiniMax is still economically interesting** if you are willing to trade some benchmark headroom for much cheaper inference.

* * *

## Do not over-read the close coding rows

The cited benchmark rows also include `+/-` ranges beside the point estimates. Some of the coding rows are close enough that I would not call them decisive from this page alone:

-   **LiveCodeBench:** 81.38% +/- 1.06 for GLM vs 79.93% +/- 1.10 for MiniMax
-   **SWE-bench:** 76.40% +/- 1.90 for GLM vs 73.80% +/- 1.97 for MiniMax
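
As a quick sanity check, you can compare each gap against the sum of the reported `+/-` ranges. This is a rough overlap heuristic using the cited figures, not a proper significance test:

```python
# Back-of-the-envelope check: does the gap between the two point
# estimates exceed the sum of the reported +/- ranges? This is only a
# rough overlap heuristic, not a statistical significance test.
close_rows = {
    "LiveCodeBench": ((81.38, 1.06), (79.93, 1.10)),
    "SWE-bench":     ((76.40, 1.90), (73.80, 1.97)),
}

for name, ((glm, glm_err), (mm, mm_err)) in close_rows.items():
    gap = glm - mm
    combined = glm_err + mm_err
    verdict = "clear" if gap > combined else "within the reported ranges"
    print(f"{name}: gap {gap:.2f} vs combined range {combined:.2f} -> {verdict}")
```

On both rows the gap falls inside the combined ranges, which is exactly why I would not call them decisive on their own.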

That is why I would not summarize this as "GLM crushes MiniMax." The more honest read is:

-   **GLM is ahead on every long-horizon row selected here**
-   **The strongest practical gaps are on Finance Agent and Terminal-Bench**
-   **MiniMax is much cheaper, which may matter more than a few benchmark points in production**
-   **Some coding deltas are real but not huge, so your own workload still matters**

* * *

## What to choose

If you optimize for the **best public benchmark picture on long-horizon tasks**, **GLM 5.1** is the cleaner pick today. It leads on the rows that look most like tool use, multi-step coding, and agentic workflows, and it also posts lower latency on Vals' comparison.

If you optimize for **cost per token** and expect long runs to dominate your bill, **MiniMax M2.7** is still very much alive in the conversation. A model that is roughly **64% cheaper** on balanced token spend, with a slightly longer max output allowance, can still be the right economic choice even if the benchmark rows are weaker.

That is the real takeaway: **GLM buys you more benchmark headroom on long-horizon work, while MiniMax buys you more room to spend on long-horizon inference**.

* * *

## Source

-   Benchmark rows, latency, context, max output, and token pricing cited from [Vals' GLM 5.1 vs MiniMax M2.7 comparison](https://www.vals.ai/comparison?modelA=zai%2Fglm-5.1-thinking&modelB=minimax%2FMiniMax-M2.7)

---

*Arendil: high-throughput background agents. Opus-quality outputs at 1/50 of the cost. Learn more at [maniac.ai](https://www.maniac.ai).*