Maniac Docs

Caching

L1/L2 prefix caches for lower TTFT, warm_prefix endpoints, GDN snapshots, and optional TurboQuant KV compression.

Caching in mach reduces time-to-first-token (TTFT) for repeated system and tool prefixes. TurboQuant can optionally reduce KV memory as well. Expert bank caching is separate; see Expert residency.

Prefix cache overview

Production DFlash serving enables prefix caching by default (L2 on disk). Disable with --no-prefix-cache.

flowchart LR
  prompt["Rendered prompt"] --> hash["SHA1 key"]
  hash --> l1["L1 in-memory snapshots"]
  hash --> l2["L2 DiskKVCache on disk"]
  l2 --> restore["Restore KV on cache hit"]
  restore --> ttft["Skip full prefill → lower TTFT"]
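
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the real DiskKVCache: the class name TwoTierPrefixCache, the one-file-per-key layout, the .kv suffix, UTF-8 encoding of the prompt, and promotion of L2 hits into L1 are all assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Optional


def prefix_key(rendered_prompt: str) -> str:
    """SHA1 hex digest of the rendered prompt (UTF-8 encoding is an assumption)."""
    return hashlib.sha1(rendered_prompt.encode("utf-8")).hexdigest()


class TwoTierPrefixCache:
    """Hypothetical stand-in for the L1/L2 lookup; not the real DiskKVCache."""

    def __init__(self, disk_dir: Path) -> None:
        self.l1 = {}              # L1: in-memory snapshots, keyed by SHA1
        self.disk_dir = disk_dir  # L2: one file per key (assumed layout)
        self.disk_dir.mkdir(parents=True, exist_ok=True)

    def get(self, rendered_prompt: str) -> Optional[bytes]:
        key = prefix_key(rendered_prompt)
        if key in self.l1:
            return self.l1[key]
        path = self.disk_dir / f"{key}.kv"
        if path.exists():
            blob = path.read_bytes()
            self.l1[key] = blob   # promote the L2 hit into L1 (assumption)
            return blob
        return None               # cold miss: the caller runs a full prefill

    def put(self, rendered_prompt: str, blob: bytes) -> None:
        key = prefix_key(rendered_prompt)
        self.l1[key] = blob
        (self.disk_dir / f"{key}.kv").write_bytes(blob)
```

Because the key is a hash of the full rendered text, any change to the system or tool prefix, including whitespace, yields a different key and therefore a cold miss.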

L1 — in-memory

Hot prefix snapshots kept in process memory for immediate reuse within a session.

L2 — on-disk DiskKVCache

  • Keys: SHA1 of rendered prompt text
  • Eviction: value-scored LRU (eviction_policy="value")
  • Score: (effective_hits + 1) * tokens / file_size with hit decay and optional anchor multiplier
  • Triggers: cold miss save, continued generation, displacement, shutdown (--disk-kv-dir)

Flag                                  Purpose
--prefix-cache / --no-prefix-cache    Enable or disable
--disk-kv-dir                         L2 directory (production default sets location)
--disk-kv-budget-gb                   Cap disk usage
--prefix-cache-max-entries            Entry count limit
--prefix-cache-max-gb                 Memory budget for L1
--prefix-cache-block-size             Block granularity

Environment variable: set LME_DISK_KV_EVICTION_POLICY=value to select value-scored eviction.
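
To make the value score concrete, here is a rough sketch of how entries might be ranked for eviction. Only the formula (effective_hits + 1) * tokens / file_size comes from this page. The Entry fields, the exponential half-life decay, and the default anchor multiplier of 2.0 are hypothetical choices for illustration; the actual decay and multiplier may differ.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: str
    hits: float             # raw hit count
    tokens: int             # tokens covered by the cached prefix
    file_size: int          # bytes on disk
    age_s: float = 0.0      # seconds since the last hit (assumed input to decay)
    anchored: bool = False  # whether the optional anchor multiplier applies


def value_score(e: Entry, half_life_s: float = 3600.0, anchor_mult: float = 2.0) -> float:
    # Assumed decay model: hits halve every half_life_s seconds.
    effective_hits = e.hits * 0.5 ** (e.age_s / half_life_s)
    score = (effective_hits + 1) * e.tokens / e.file_size
    return score * anchor_mult if e.anchored else score


def evict_order(entries):
    """Return entries sorted so the lowest-value ones are evicted first."""
    return sorted(entries, key=value_score)
```

The + 1 keeps a never-hit entry from scoring zero, so a fresh cold-miss save is still ranked by how many tokens it covers per byte on disk.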

GDN hybrid snapshots

Qwen3.5/3.6 hybrid GDN + attention models use PrefixKVSnapshotStore for recurrent GDN state that is not trimmable like standard KV. Prefix restore must rehydrate both attention KV and GDN caches.

DFlash prefix snapshot management integrates with dflash-mlx; cold snapshot save is deferred until after first-token emission.

Warming prefixes

Agent workloads repeat stable system and tool prefixes. Warm them before user traffic:

POST /v1/cache/warm_prefix

Prefill and cache a stable system/tool prefix without generating a full completion.

curl -s -X POST http://127.0.0.1:8080/v1/cache/warm_prefix \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"system","content":"You are a coding assistant."}]}'
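
For agents that warm prefixes from code rather than the shell, the same request can be built with Python's standard library. This sketch mirrors the curl call above; the helper names build_warm_request and warm_prefix are made up for the example, and since the response schema is not described here, the body is returned as raw bytes.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # same host and port as the curl example


def build_warm_request(system_prompt: str) -> urllib.request.Request:
    """Build the POST /v1/cache/warm_prefix request without sending it."""
    body = json.dumps(
        {"messages": [{"role": "system", "content": system_prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/cache/warm_prefix",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_prefix(system_prompt: str, timeout: float = 60.0) -> bytes:
    """Send the request; the response format is not documented here, so return raw bytes."""
    with urllib.request.urlopen(build_warm_request(system_prompt), timeout=timeout) as resp:
        return resp.read()
```

Calling warm_prefix("You are a coding assistant.") once at startup, before user traffic arrives, matches the usage described above.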

GET /v1/cache/warm_prefix

Inspect warm-prefix status and cache entries.

Warming pairs with production opencode-sampled-v1 speculative decoding (spec-dec) for tool-calling sessions. See Serving.

Observability

GET /v1/cache/stats reports:

  • prefix_cache_* counters
  • DFlash-specific prefix stats
  • streaming_summary for expert bank (see Expert residency)
  • stale_generation_in_flight when the single-flight lock is held

GET /v1/stats also exposes adaptive_block_policy alongside cache-related fields.
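
A small polling helper can pull out just the prefix counters. This is a sketch under one assumption: that the prefix_cache_* counters appear as top-level keys of the JSON object returned by /v1/cache/stats. If the real response nests them, the filter needs adjusting. The function names are hypothetical.

```python
import json
import urllib.request


def fetch_cache_stats(base_url: str = "http://127.0.0.1:8080", timeout: float = 5.0) -> dict:
    """GET /v1/cache/stats and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/cache/stats", timeout=timeout) as resp:
        return json.load(resp)


def prefix_counters(stats: dict) -> dict:
    """Keep only top-level prefix_cache_* fields (flat layout is an assumption)."""
    return {k: v for k, v in stats.items() if k.startswith("prefix_cache_")}
```

Typical use would be prefix_counters(fetch_cache_stats()) before and after a warm_prefix call to confirm that the counters moved.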

TurboQuant KV compression

Optional approximate compression for full-attention KV only. GDN recurrent caches stay fp32.

mach-serve /path/to/checkpoint --streaming --turboquant-kv --turboquant-bits 4 ...

Flag                       Options / notes
--turboquant-kv            Master enable
--turboquant-bits          Quantization width
--turboquant-group-size    Group size for quant kernels
Modes                      v2_lean, v2_rotated, v3_* variants

Expected benefit: up to ~5.5× KV memory reduction for long contexts.
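
As a back-of-the-envelope illustration of what that factor could mean, the snippet below uses the standard KV-size formula (keys plus values, per layer, per KV head, per token). The model dimensions are hypothetical and do not describe any specific checkpoint. Only full-attention layers are counted, since GDN recurrent caches are not compressed, and 5.5 is treated as the best case stated above.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Uncompressed KV size; the leading 2 accounts for keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions: 32 full-attention layers, 8 KV heads, head_dim 128,
# a 128K-token context, and 16-bit values.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=131072)
best_case = baseline / 5.5  # "up to ~5.5x" treated as an upper bound

print(f"{baseline / 2**30:.1f} GiB -> {best_case / 2**30:.1f} GiB")
# prints: 16.0 GiB -> 2.9 GiB
```

Actual savings depend on the bit width, group size, and mode chosen, so treat the second number as a floor on memory use rather than a prediction.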

Constraints

Rule                                                          Reason
No prefix-cache (de)serialization while TurboQuant active     Exact snapshot format incompatible with compressed KV
Cannot coexist with continuous batching                       Shared KV mutation paths conflict

Choose TurboQuant or continuous batching, not both. See Continuous batching.

Production defaults

Capability          Production default
Prefix cache        Enabled (L2 on disk)
TurboQuant KV       Off
Disk KV eviction    Value-scored when DiskKVCache uses eviction_policy="value"
