Maniac Docs

Continuous Batching

Single-flight default serialization, opt-in ContinuousBatchScheduler, hybrid spec-dec policy, SSE fan-out, and tunables.

By default, mach-serve processes one generation at a time. Continuous batching is an opt-in scheduler that coalesces concurrent /v1/chat/completions requests into shared decode steps.

Single-flight default

All generations serialize behind a process-wide _GENERATION_LOCK because the engine shares mutable state:

  • Expert bank LRU and eviction
  • KV cache
  • Sampling context
  • Grammar / JSON-schema enforcer

Concurrent callers queue; only one decode loop runs at a time. This is the safe production default for DFlash v7 + streaming residency.
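
The queueing behaviour described above can be sketched in a few lines. This is a minimal illustration, not the server's code: it assumes an asyncio lock, and generate, stats, and the sleep that stands in for the decode loop are all hypothetical names.

```python
import asyncio


async def generate(lock, stats, prompt):
    """Run one (fake) generation while holding the shared lock."""
    async with lock:
        stats["active"] += 1
        stats["max_active"] = max(stats["max_active"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for the real decode loop
        stats["active"] -= 1
    return "done: " + prompt


async def main():
    # Stand-in for the process-wide _GENERATION_LOCK (assumed to be a lock
    # that every generation must hold for its whole decode loop).
    lock = asyncio.Lock()
    stats = {"active": 0, "max_active": 0}
    # Three concurrent callers; they queue behind the lock.
    results = await asyncio.gather(
        *(generate(lock, stats, p) for p in ("a", "b", "c"))
    )
    return results, stats


results, stats = asyncio.run(main())
```

Even with three concurrent callers, stats["max_active"] never exceeds 1, which is the single-flight property.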

Implications:

  • /v1/cache/stats may return stale_generation_in_flight during a long request
  • Stats sampled between requests give reliable snapshots

See Troubleshooting.

Opt-in continuous batching

Enable:

mach-serve /path/to/checkpoint --streaming --continuous-batching --port 8080

Or:

LME_CONTINUOUS_BATCHING=1 mach-serve ...

ContinuousBatchScheduler batches concurrent chat completions into one decode step, with at most LME_CONTINUOUS_BATCHING_MAX_B requests per step (default 4).
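
As a rough sketch of how the two settings interact, the snippet below reads them from an environment mapping and caps one step at the batch limit. The helper names batching_config and next_batch are hypothetical, and treating only the value "1" as enabled is an assumption based on the example above.

```python
import os


def batching_config(env=None):
    """Return (enabled, max_b) from an environment mapping."""
    env = os.environ if env is None else env
    # Assumption: "1" enables the scheduler; anything else leaves it off.
    enabled = env.get("LME_CONTINUOUS_BATCHING", "0") == "1"
    # Documented default batch width is 4.
    max_b = int(env.get("LME_CONTINUOUS_BATCHING_MAX_B", "4"))
    return enabled, max_b


def next_batch(pending, max_b):
    """Split off up to max_b waiting requests for one decode step."""
    return pending[:max_b], pending[max_b:]


enabled, max_b = batching_config({"LME_CONTINUOUS_BATCHING": "1"})
batch, waiting = next_batch(["r1", "r2", "r3", "r4", "r5", "r6"], max_b)
```

With six waiting requests and the default width, four go into the step and two wait for the next one.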

Hybrid speculative decoding policy

| Active requests | Decode path |
| --- | --- |
| 1 (lone request) | Full DFlash v7 speculative decoding |
| ≥ 2 | Target-only BatchedDecoder with per-slot sampling and grammar (specdec/batch_sampling.py) |

Spec-dec and multi-request batching do not share the same code path — batching trades draft acceptance for throughput.

When the batched decoder is active, responses carry the header X-Decode-Mode: batched.
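
The policy can be summarised as a small decision function. This is a sketch under stated assumptions: select_decode_path, response_headers, and the returned labels are hypothetical, and the header is only emitted for the batched path because that is the only value documented here.

```python
def select_decode_path(active_requests):
    """Pick a decode path from the number of active requests."""
    if active_requests < 1:
        raise ValueError("need at least one active request")
    if active_requests == 1:
        return "dflash_v7_specdec"   # lone request: full speculative decoding
    return "batched_target_only"     # two or more: BatchedDecoder, no draft


def response_headers(path):
    """Headers implied by the chosen path (only the documented value)."""
    if path == "batched_target_only":
        return {"X-Decode-Mode": "batched"}
    return {}


lone = select_decode_path(1)
shared = select_decode_path(3)
```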

Architecture guard

Continuous batching needs a KV cache that to_batch_cache can clone per row. The server resolves this from the arch's kv_cache_kind, using the same continuous_batching_capability predicate that backs mach check:

  • standard / hybrid_gdn: batch-compatible
  • compressed (DeepSeek-V4 CompressedKVCache): blocked
  • rotating (sliding-window gpt-oss / Gemma 4): unknown (only batchable at keep == 0)

When --continuous-batching is requested for a blocked or unknown arch, the server disables continuous batching and serves the serial path instead of crashing in to_batch_cache.

Keeping learned spec-dec engaged at B≥2 is narrower still — only qwen3_5_moe (Qwen3.6: DFlash specdec_v7 draft + hybrid_gdn cache) qualifies. mach check --all-archs surfaces this as the batched_specdec row, and /v1/capabilities reports it under speculation.batched_specdec. All other archs keep batched target-only decode.
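
The guard can be pictured as a lookup followed by a fallback. The sketch below is illustrative only: BATCH_CAPABILITY, resolve_scheduler, and batched_specdec_supported are hypothetical names, the real predicate's signature is not shown in this page, and mapping unrecognised cache kinds to "unknown" is an assumption.

```python
# Capability per kv_cache_kind, as listed in this section.
BATCH_CAPABILITY = {
    "standard": "compatible",
    "hybrid_gdn": "compatible",
    "compressed": "blocked",
    "rotating": "unknown",
}


def resolve_scheduler(kv_cache_kind, requested):
    """Return the scheduler the server would end up using."""
    # Assumption: kinds not listed above are treated as unknown.
    capability = BATCH_CAPABILITY.get(kv_cache_kind, "unknown")
    if requested and capability == "compatible":
        return "continuous_batching"
    # Blocked or unknown archs fall back to the serial path.
    return "serial"


def batched_specdec_supported(arch):
    """Only qwen3_5_moe keeps learned spec-dec engaged at B >= 2."""
    return arch == "qwen3_5_moe"


chosen = resolve_scheduler("compressed", requested=True)
```

Here chosen is "serial": the request for batching is honoured only when the cache kind is known to be batch-compatible.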

SSE fan-out

Each streaming client receives events on its own asyncio.Queue. The scheduler multiplexes token deltas from the shared batched step onto per-client queues.

Compatible with OpenAI-style SSE on POST /v1/chat/completions (stream: true).
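
A minimal sketch of the fan-out pattern follows. It assumes each batched step yields a mapping from client id to token delta; fan_out, sse_events, and the simplified JSON payload are hypothetical and much smaller than real OpenAI-style chunks.

```python
import asyncio
import json


async def fan_out(step_outputs, queues):
    """Push each step's per-client deltas onto that client's queue."""
    for step in step_outputs:
        for client_id, delta in step.items():
            await queues[client_id].put(delta)
    for queue in queues.values():
        await queue.put(None)  # sentinel: this stream is finished


async def sse_events(queue):
    """Drain one client's queue into SSE-formatted event strings."""
    events = []
    while True:
        delta = await queue.get()
        if delta is None:
            events.append("data: [DONE]\n\n")
            return events
        events.append("data: " + json.dumps({"content": delta}) + "\n\n")


async def demo():
    queues = {"a": asyncio.Queue(), "b": asyncio.Queue()}
    # Two shared decode steps, each producing one delta per client.
    steps = [{"a": "Hel", "b": "Good"}, {"a": "lo", "b": "bye"}]
    producer = asyncio.create_task(fan_out(steps, queues))
    a, b = await asyncio.gather(sse_events(queues["a"]), sse_events(queues["b"]))
    await producer
    return a, b


a_events, b_events = asyncio.run(demo())
```

Each client sees only its own deltas, in order, followed by the terminating [DONE] event.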

Admission and scheduling

| Env / flag | Default | Purpose |
| --- | --- | --- |
| LME_CONTINUOUS_BATCHING | off | Master enable |
| LME_CONTINUOUS_BATCHING_MAX_B | 4 | Max batch size |
| LME_CONTINUOUS_BATCHING_WINDOW_MS | scheduler default | Coalescing window for admitting requests into a step |
| LME_CONTINUOUS_BATCHING_KV_BUDGET_GB | 10 | KV-budget admission: reject or defer when exceeded |
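
To make the interaction of the three admission knobs concrete, here is a hedged sketch of one admission pass. The admit function, the tuple layout, and the per-request KV estimates are hypothetical; the 5 ms window is an illustrative value rather than the scheduler default, and this version always defers instead of rejecting.

```python
def admit(pending, window_ms, max_b, kv_budget_gb, kv_in_use_gb=0.0):
    """Choose requests for the next step.

    pending: list of (arrival_ms, kv_estimate_gb, request_id), oldest first.
    Returns (admitted_ids, deferred_ids).
    """
    if not pending:
        return [], []
    # The coalescing window opens when the oldest waiting request arrived.
    window_end = pending[0][0] + window_ms
    admitted, deferred = [], []
    used = kv_in_use_gb
    for arrival_ms, kv_gb, request_id in pending:
        fits = (
            len(admitted) < max_b                 # batch width cap
            and arrival_ms <= window_end          # inside the window
            and used + kv_gb <= kv_budget_gb      # KV-budget admission
        )
        if fits:
            admitted.append(request_id)
            used += kv_gb
        else:
            deferred.append(request_id)
    return admitted, deferred


pending = [(0, 4.0, "r1"), (2, 4.0, "r2"), (3, 4.0, "r3"), (50, 1.0, "r4")]
admitted, deferred = admit(pending, window_ms=5, max_b=4, kv_budget_gb=10)
```

In this example r3 is deferred because it would push KV use to 12 GB, and r4 is deferred because it arrived after the window closed.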

Response header X-Active-Loops reports how many generation loops are active in the scheduler.
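
A client can inspect both response headers to see how a request was served. The helper below is a sketch: describe_response is a hypothetical name, and it assumes X-Active-Loops carries a plain decimal integer.

```python
def describe_response(headers):
    """Summarise the scheduler-related headers of one response."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    loops = lowered.get("x-active-loops")
    return {
        "batched": lowered.get("x-decode-mode") == "batched",
        # Assumption: the header value is a decimal integer.
        "active_loops": int(loops) if loops is not None else None,
    }


info = describe_response({"X-Decode-Mode": "batched", "X-Active-Loops": "3"})
```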

Mutual exclusions

| Feature | Compatible with continuous batching? |
| --- | --- |
| DFlash v7 (B=1 only) | Partial: full spec-dec only for a lone request |
| TurboQuant KV | No: cannot coexist |
| Prefix cache serialize/deserialize under TurboQuant | N/A: TurboQuant is off when batching |

Enable either continuous batching or TurboQuant KV, not both. See Caching.
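
A startup check for this exclusion might look like the sketch below. The function name and both parameters are hypothetical; how the server actually detects or reports the conflict is not described here.

```python
def validate_features(continuous_batching, turboquant_kv):
    """Raise if both mutually exclusive features are requested."""
    if continuous_batching and turboquant_kv:
        raise ValueError(
            "continuous batching and TurboQuant KV cannot be enabled together"
        )


# Either feature alone passes the check.
validate_features(continuous_batching=True, turboquant_kv=False)
validate_features(continuous_batching=False, turboquant_kv=True)
```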

When to enable

| Scenario | Recommendation |
| --- | --- |
| Single agent / OpenCode session | Default single-flight + DFlash |
| Multiple concurrent API clients, throughput over spec-dec | --continuous-batching |
| Memory-tight machines | Watch LME_CONTINUOUS_BATCHING_KV_BUDGET_GB |

Tunables summary

| Knob | Default | Notes |
| --- | --- | --- |
| --continuous-batching | off | CLI enable |
| LME_CONTINUOUS_BATCHING | 0 | Env enable |
| LME_CONTINUOUS_BATCHING_MAX_B | 4 | Upper bound on batch width |
| LME_CONTINUOUS_BATCHING_WINDOW_MS | scheduler default | Admission coalescing |
| LME_CONTINUOUS_BATCHING_KV_BUDGET_GB | 10 | KV admission cap |
