# Composer 2 vs Kimi K2.5 on coding benchmarks: what post-training is buying

March 20, 2026

In March 2026, Cursor shipped **Composer 2** as an in-product coding model, describing it as the product of **continued pretraining** plus **reinforcement learning on long-horizon coding tasks** ([Cursor's Composer 2 announcement](https://cursor.com/blog/composer-2)). Separately, developers observed tokenizer overlap and internal identifiers consistent with **Kimi K2.5** (Moonshot's open-weights MoE flagship), fueling discussion that Composer 2 is effectively a **post-trained derivative** rather than a from-scratch architecture.

This post stays on **public numbers**: what Moonshot reports for Kimi K2.5, what Cursor reports for Composer 2, how **Cursor bills** each model, and what that combination implies about **where post-training shows up** (quality, latency tiers, and go-to-market) versus **where the base model already did the heavy lifting**.

* * *

## Lineage in one paragraph

**Kimi K2.5** is documented as continual pretraining on the order of **~15 trillion** mixed vision-and-text tokens atop **Kimi-K2-Base**, with native multimodality and strong coding rows on Moonshot's public evaluation table ([Hugging Face model card](https://huggingface.co/moonshotai/Kimi-K2.5)).

**Composer 2**, per Cursor, improves over prior Composer generations through **continued pretraining** and **RL** aimed at agentic coding, including self-summarization-style training for long tasks ([Composer 2 post](https://cursor.com/blog/composer-2)).

If Composer 2 does share Kimi K2.5–class foundations, as community forensics and Moonshot commentary have suggested, the honest framing is: **a strong open-weights coding base plus a second-stage training stack tuned for a specific product and infrastructure**. The interesting question is not "is it the same transformer?" but **which gaps post-training closes at what marginal cost**.

* * *

## Coding benchmarks: overlap and caveats

Benchmarks are only comparable when the **task definition, tools, harness, and model mode** match. They often don't. Treat the table below as **side-by-side public reporting**, not a rigorous head-to-head.

### Scores you can line up (with footnotes)

| Benchmark | Kimi K2.5 (Moonshot table) | Composer 2 (Cursor table) |
| --- | --- | --- |
| **SWE-Bench Multilingual** | **73.0** | **73.7** |
| **Terminal-Bench 2.0** | **50.8** | **61.7** |
| **SWE-Bench Verified** | **76.8** | _(not in Cursor's Composer 2 table)_ |

Sources: Moonshot's evaluation table on the [Kimi K2.5 model card](https://huggingface.co/moonshotai/Kimi-K2.5) (coding section; SWE-Bench and Terminal-Bench methodology described in the card footnotes); Cursor's Composer 2 announcement ([cursor.com/blog/composer-2](https://cursor.com/blog/composer-2)).

**Why this is messy:**

-   Moonshot documents **Terminal-Bench 2.0** for Kimi K2.5 with the **Terminus-2** agent framework in **non-thinking** mode, because their thinking-mode context management is incompatible with that harness ([model card footnotes](https://huggingface.co/moonshotai/Kimi-K2.5)).
-   Cursor reports Terminal-Bench 2.0 for Composer 2 using the **Harbor** framework, the benchmark authors' designated harness for 2.0, and averages over five runs per model–agent pair ([Composer 2 post footnote](https://cursor.com/blog/composer-2)).

So the **~11 point** gap on Terminal-Bench in the table mixes **model change** with **evaluation stack change**. The **0.7 point** lift on **SWE-Bench Multilingual** is closer to a like-for-like task family, but still not guaranteed identical tooling and prompts across vendors.
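To make the arithmetic and the caveat explicit, here is a minimal Python sketch that computes each delta and tags it with whether the two harnesses are known to match. The scores and harness names come from the sources cited above; the `Row` type, the `comparability` labels, and the choice to record the SWE-Bench Multilingual harnesses as "not stated" are illustrative assumptions, not anything Moonshot or Cursor publishes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    benchmark: str
    kimi: float
    composer: float
    # None means this post does not state the harness for that vendor.
    kimi_harness: Optional[str]
    composer_harness: Optional[str]


ROWS = [
    Row("SWE-Bench Multilingual", 73.0, 73.7, None, None),
    Row("Terminal-Bench 2.0", 50.8, 61.7,
        "Terminus-2 (non-thinking)", "Harbor (5-run average)"),
]


def delta(row: Row) -> float:
    """Composer 2 score minus Kimi K2.5 score, in points."""
    return round(row.composer - row.kimi, 1)


def comparability(row: Row) -> str:
    """Hypothetical label describing how far the delta can be trusted."""
    if row.kimi_harness is None or row.composer_harness is None:
        return "harness not stated, unverified"
    if row.kimi_harness == row.composer_harness:
        return "same harness"
    return "different harness"


for row in ROWS:
    print(f"{row.benchmark}: {delta(row):+.1f} points ({comparability(row)})")
```

The point of carrying the harness alongside the score is that a delta without it looks more precise than it is.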

### CursorBench (Composer-only)

Cursor also publishes **CursorBench** scores for Composer generations (**61.3** for Composer 2 vs **44.2** for Composer 1.5 and **38.0** for Composer 1) ([Composer 2 post](https://cursor.com/blog/composer-2)). That benchmark is **Cursor-internal**, so it tells you about **product-aligned iteration**, not about universal leaderboard rank.

### Where Kimi looks stronger on public coding rows

On Moonshot's table, Kimi K2.5 scores higher on **SWE-Bench Verified** (**76.8**) than on **SWE-Bench Multilingual** (**73.0**). Cursor's Composer 2 post reports **SWE-Bench Multilingual** and **Terminal-Bench 2.0** but not Verified, so there is no public Composer 2 number to set against the 76.8. Readers comparing "coding IQ" should decide **which benchmark matches their real workflow** (English-only repo repair vs multilingual issues vs terminal agents).

**LiveCodeBench v6** is reported for Kimi K2.5 at **85.0** on the same model card. Cursor does not quote LiveCodeBench in the Composer 2 announcement, so there is no clean public pairing.

* * *

## Cost: Kimi K2.5 pricing varies by host; here is Cursor's

For many teams, the decisive comparison is not a leaderboard but **invoice math** under a specific router. Cursor publishes per-million-token **API pool** rates for both models ([Models & pricing](https://cursor.com/docs/models-and-pricing)):

| Model | Input ($/1M) | Cache read ($/1M) | Output ($/1M) |
| --- | --- | --- | --- |
| **Composer 2** | $0.50 | $0.20 | $2.50 |
| **Kimi K2.5** | $0.60 | $0.10 | $3.00 |

A few implications:

-   **Composer 2 is cheaper on plain input and output** in Cursor's published table, while **Kimi K2.5 is cheaper on cache reads**. Workloads with heavy prompt caching may tilt the comparison.
-   **Composer 2 also sits in Cursor's "Auto + Composer" usage pool**, which is how Cursor positions default agentic coding ([usage pools](https://cursor.com/docs/models-and-pricing#usage-pools)). Kimi K2.5 is billed as a **third-party API pool** model. Pricing is not just a per-token rate; it also depends on **which budget bucket** a request draws down.

So even if two models were **bit-for-bit identical**, **routing, pooling, and speed tiers** would still change perceived cost. Cursor additionally offers a **faster Composer 2 variant** at higher per-token rates ([Composer 2 post](https://cursor.com/blog/composer-2)).
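The invoice math can be sketched in a few lines of Python. The rates are Cursor's published per-million-token prices quoted above; the two workloads (token counts and the "no caching" / "cache-heavy" labels) are made-up illustrations, and the sketch ignores usage pools and the faster Composer 2 variant entirely.

```python
# Cursor's published API pool rates, in USD per 1M tokens (from the table above).
RATES = {
    "Composer 2": {"input": 0.50, "cache_read": 0.20, "output": 2.50},
    "Kimi K2.5": {"input": 0.60, "cache_read": 0.10, "output": 3.00},
}


def cost_usd(model: str, input_tokens: int, cache_read_tokens: int,
             output_tokens: int) -> float:
    """List-price cost of a workload; pools and speed tiers are not modeled."""
    rate = RATES[model]
    total = (input_tokens * rate["input"]
             + cache_read_tokens * rate["cache_read"]
             + output_tokens * rate["output"])
    return total / 1_000_000


# Hypothetical workloads, chosen only to show how caching shifts the result.
WORKLOADS = {
    "no caching": dict(input_tokens=2_000_000, cache_read_tokens=0,
                       output_tokens=500_000),
    "cache-heavy": dict(input_tokens=2_000_000, cache_read_tokens=10_000_000,
                        output_tokens=500_000),
}

for name, workload in WORKLOADS.items():
    for model in RATES:
        print(f"{name:12s} {model:11s} ${cost_usd(model, **workload):.2f}")
```

Under these assumed workloads, Composer 2 is cheaper without caching ($2.25 vs $2.70), while Kimi K2.5 is cheaper once cache reads dominate ($3.70 vs $4.25). Which side of that line a team lands on depends on its own cache hit pattern.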

* * *

## What this says about the value add of post-training

### 1\. Narrow gains on the same benchmark family can still be economically rational

If we take **SWE-Bench Multilingual** at face value, the delta between **73.0** and **73.7** is modest. In production, though, **sub-point improvements** on tasks with long tool loops can translate into noticeable differences in **hours saved** and **failed-agent retries**. Post-training is sometimes less about moving a public leaderboard and more about **shaping failure modes** for a harness users actually run.
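One way to see why small gains can matter in long tool loops is a toy compounding model. This is a sketch under strong assumptions: every step succeeds independently with the same probability, a task succeeds only if all steps do, and failed tasks are retried from scratch. The step-success values (0.990 and 0.991) and the 50-step length are hypothetical and are not derived from any benchmark in this post.

```python
def task_success(step_success: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


def expected_attempts(success_rate: float) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    if success_rate <= 0.0:
        raise ValueError("success_rate must be positive")
    return 1.0 / success_rate


# Hypothetical per-step reliabilities that differ by 0.1 percentage points.
for step_success in (0.990, 0.991):
    rate = task_success(step_success, steps=50)
    print(f"step={step_success:.3f} task={rate:.3f} "
          f"attempts={expected_attempts(rate):.2f}")
```

In this toy setup, a 0.1-point difference per step grows to roughly three points of end-to-end task success over 50 steps. Real agent steps are neither independent nor equally hard, so treat this as intuition rather than a forecast.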

### 2\. Large swings may reflect the harness as much as the weights

The **Terminal-Bench** column is a reminder that **agent benchmarks are part model, part runtime**. Post-training that teaches **better tool use under Cursor's stack** can look like a leap on one leaderboard while another lab's harness tells a different story. For buyers, the actionable test is **your** repository, **your** CI, **your** linter stack, not a single number.

### 3\. Post-training buys distribution and product fit, not just perplexity

Even when base models are open or API-available, **a packaged model** can win on:

-   **Integrated latency and capacity** (including "fast" SKUs)
-   **Billing pools** that align incentives for everyday agent use
-   **Safety, compliance, and support** expectations in enterprise procurement

That is a different "value add" from raw **pretraining FLOPs**, and it explains why vendors invest in **continued pretraining + RL** on top of strong bases.

### 4\. For teams building their own stack: the lesson generalizes

The Composer 2 / Kimi K2.5 story is an extreme version of a pattern we see across the industry: **frontier-ish capability from a base model**, then **smaller, targeted investment** to align behavior with a product surface (tools, context management, refusal style, formatting, long-horizon stability). At [Maniac](https://www.maniac.ai/product), we care about this because the same structure shows up in customer workloads: **most ROI comes from aligning a model to repeated traffic**, not from retraining the world.

* * *

## Takeaways

-   **Public coding benchmarks** put **Kimi K2.5** and **Composer 2** in a similar band on **SWE-Bench Multilingual**, with Cursor reporting a **small edge** for Composer 2 on that row.
-   **Terminal-Bench** numbers are **not directly comparable** across Moonshot's and Cursor's published methodologies; treat big gaps as **hypotheses to validate in your environment**.
-   On **Cursor's posted API rates**, **Composer 2 undercuts Kimi K2.5 on input/output**, while **Kimi wins on cache read**, and **pooling** may matter more than list price.
-   **Post-training** here is best understood as **(a)** marginal capability shaping on overlapping tasks, plus **(b)** **product and economics**: speed tiers, routing, and a bundled usage story, not only a raw score delta.

If you are deciding between them in Cursor, run the same real task on both with **token accounting** turned on. The leaderboard is a prior; **your** trace is the posterior.
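As a starting point for that kind of token accounting, the sketch below sums per-request usage by model. The `Usage` record, its field names, and the sample numbers are all hypothetical; how you actually obtain per-request token counts depends on your tooling, and nothing here reflects a Cursor API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Usage:
    """One request's token counts (hypothetical record shape)."""
    model: str
    input_tokens: int
    cache_read_tokens: int
    output_tokens: int


def totals_by_model(records: Iterable[Usage]) -> Dict[str, Dict[str, int]]:
    """Sum request counts and token counts per model."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0,
                 "cache_read_tokens": 0, "output_tokens": 0}
    )
    for record in records:
        bucket = totals[record.model]
        bucket["requests"] += 1
        bucket["input_tokens"] += record.input_tokens
        bucket["cache_read_tokens"] += record.cache_read_tokens
        bucket["output_tokens"] += record.output_tokens
    return dict(totals)


# Made-up trace of the same task run once on each model.
SAMPLE = [
    Usage("Composer 2", 12_000, 40_000, 3_000),
    Usage("Composer 2", 2_000, 52_000, 1_500),
    Usage("Kimi K2.5", 12_000, 40_000, 3_500),
    Usage("Kimi K2.5", 2_500, 52_000, 1_200),
    Usage("Kimi K2.5", 1_000, 55_000, 900),
]

for model, total in totals_by_model(SAMPLE).items():
    print(model, total)
```

Counting requests alongside tokens matters because a model that needs an extra turn to finish the same task pays for the re-read context as well as the extra output.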

---

*Maniac: high-throughput background agents. Opus-quality outputs at 1/50 of the cost. Learn more at [maniac.ai](https://www.maniac.ai).*