Speculative Decoding
DFlash v7 draft-verify-accept cycle, adaptive block policies, acceptance sampling, draft-free request presets, target-only fallback, and DeepSeek-V4 rho-gate.
The default served decode path is DFlash v7 speculative decoding: a small draft model proposes token blocks; the target model verifies them in parallel; accepted prefixes commit in one step.
Startup logs: decode_path=specdec-v7.
Draft → verify → accept → commit
sequenceDiagram
participant Draft as DFlash draft (8 layers)
participant Target as Target MoE model
participant Policy as Adaptive block policy
Draft->>Policy: Propose block (up to 16 tokens)
Policy->>Target: Verify draft tokens
Target->>Target: Longest-prefix accept (greedy or sampled)
Target->>Target: Commit accepted run; roll forward

- Draft: the block-diffusion head (8-layer draft) proposes up to 16 tokens per forward pass.
- Verify: the target runs on the drafted continuation.
- Accept: keep the longest prefix of the draft that the target agrees with (argmax match when greedy, rejection sampling when sampled).
- Commit: append the accepted tokens; repeat until a stop condition or the max-token limit.
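The cycle above can be sketched with toy stand-ins for the two models. This is a minimal illustration under stated assumptions, not the DFlash v7 implementation: `toy_target`, `toy_draft`, `greedy_generate`, and `speculative_generate` are hypothetical names, the target is queried one position at a time where a real engine would verify the whole block in one forward pass, and committing the target's own token right after the accepted prefix is a common speculative-decoding convention assumed here.

```python
from typing import Callable, List


def longest_accepted_prefix(draft: List[int], target: List[int]) -> int:
    """Number of leading draft tokens that equal the target's greedy tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def speculative_generate(
    prompt: List[int],
    draft_block: Callable[[List[int], int], List[int]],
    target_next: Callable[[List[int]], int],
    block_size: int = 16,
    max_new_tokens: int = 32,
) -> List[int]:
    """Greedy draft -> verify -> accept -> commit loop (illustrative only)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        draft = draft_block(tokens, block_size)
        # Verify: the target's greedy token at each drafted position, plus one
        # extra position after the block. Done sequentially here for clarity.
        ctx = list(tokens)
        target_tokens = []
        for d in draft:
            target_tokens.append(target_next(ctx))
            ctx.append(d)
        target_tokens.append(target_next(ctx))
        # Accept the longest matching prefix, then commit it together with the
        # target's own token at the first mismatch (assumed convention), so
        # every cycle makes progress even when nothing is accepted.
        n = longest_accepted_prefix(draft, target_tokens)
        tokens.extend(draft[:n])
        tokens.append(target_tokens[n])
    return tokens[:limit]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int], k: int) -> List[int]:
    """Hypothetical draft: same rule as the target, but wrong whenever 5 is due."""
    out, last = [], ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 5:
            nxt = 0  # deliberate draft error
        out.append(nxt)
        last = nxt
    return out


def greedy_generate(prompt: List[int], target_next, max_new_tokens: int) -> List[int]:
    """Plain autoregressive reference decode."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens
```

In this toy setup the speculative loop and `greedy_generate` produce the same token sequence, because every committed token is one the target itself would have chosen.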
Runtime counters on GET /v1/stats: specdec_drafted, specdec_accepted, recent_alpha, observed_cycles, low_cycles.
DFlash v7 engine
Mode: BlockSpecDecEngineMode.NATIVE_V7.
The draft checkpoint is a block-diffusion head trained for the target family:
- Local: experiments/pipeline_v1/results/eagle3_training/dflash_draft
- Hugging Face: z-lab/Qwen3.6-35B-A3B-DFlash
Load programmatically:
from mach import BlockSpecDecEngine, BlockSpecDecEngineConfig, BlockSpecDecEngineMode
engine = BlockSpecDecEngine.from_checkpoint(
    target_checkpoint="/path/to/target",
    draft_checkpoint="/path/to/dflash_draft",
    config=BlockSpecDecEngineConfig(mode=BlockSpecDecEngineMode.NATIVE_V7),
)

Requires the [dflash] extra (dflash-mlx). See Installation.
v7 optimizations (enabled on the DFlash server): lazy commit and trim hidden, which together reduce the per-cycle target feature-projection overhead.
Adaptive block policies
--serving-adaptive-block-policy controls how many draft tokens to propose per cycle. Tool-calling workloads default to opencode-sampled-v1.
| Policy | Behavior |
|---|---|
| opencode-sampled-v1 | Production default: window=4, alpha_threshold=0.40, low_block=4, default_block=16 |
| balanced-v1 | Deterministic ablations / greedy comparisons |
| off | Static block size 16 |
Per-knob overrides:
- --adaptive-block-alpha-window
- --adaptive-block-alpha-threshold
- --adaptive-block-low-block
- --adaptive-block-default-block
- --block-size (static override when policy is off)
Confirm active policy via /v1/stats → adaptive_block_policy (name, knobs, observed_cycles, recent_alpha).
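One plausible reading of the knobs above is sketched below. The decision rule is an assumption, not something the documentation states: the policy averages the per-cycle acceptance rate over the last `window` cycles and drops to `low_block` when that average falls below `alpha_threshold`. The class name `AdaptiveBlockPolicy` and its methods are hypothetical; only the knob names and default values come from the opencode-sampled-v1 row.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class AdaptiveBlockPolicy:
    """Hypothetical sketch of a windowed acceptance-rate block policy."""

    window: int = 4
    alpha_threshold: float = 0.40
    low_block: int = 4
    default_block: int = 16
    _alphas: Deque[float] = field(default_factory=deque, init=False, repr=False)

    def observe(self, drafted: int, accepted: int) -> None:
        """Record one cycle's acceptance rate (accepted / drafted)."""
        if drafted <= 0:
            return
        self._alphas.append(accepted / drafted)
        while len(self._alphas) > self.window:
            self._alphas.popleft()

    @property
    def recent_alpha(self) -> Optional[float]:
        """Mean acceptance rate over the retained window, or None if empty."""
        if not self._alphas:
            return None
        return sum(self._alphas) / len(self._alphas)

    def next_block_size(self) -> int:
        """Pick the draft length for the next cycle."""
        alpha = self.recent_alpha
        # Assumption: stay at the default until a full window has been observed.
        if alpha is None or len(self._alphas) < self.window:
            return self.default_block
        return self.low_block if alpha < self.alpha_threshold else self.default_block
```

Usage: call `observe(drafted, accepted)` after each cycle and `next_block_size()` before the next draft. A run of poorly accepted blocks then shrinks the proposal to 4 tokens, and the size returns to 16 once the windowed rate recovers.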
Acceptance
| Temperature | Strategy |
|---|---|
| 0 (greedy) | Longest-prefix match between draft and target argmax |
| > 0 | Sampled acceptance (Leviathan rejection sampling) |
Fallbacks:
- Penalty fallback → target-only autoregressive step
- Zero-acceptance fallback → target-only AR for that cycle
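For the sampled row, the standard speculative rejection rule accepts a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resamples from the normalized residual max(0, p - q). The sketch below shows that rule on plain Python lists. It is a generic illustration, not the engine's code: the function names are hypothetical, and the bonus token a real engine may sample after a fully accepted block is omitted.

```python
import random
from typing import List, Sequence, Tuple


def accept_or_resample(
    token: int,
    p_target: Sequence[float],
    q_draft: Sequence[float],
    rng: random.Random,
) -> Tuple[bool, int]:
    """One rejection-sampling step; returns (accepted, committed_token)."""
    p, q = p_target[token], q_draft[token]
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) <= 0.0:
        # Degenerate case (p == q everywhere): fall back to the target itself.
        residual = list(p_target)
    replacement = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, replacement


def verify_block(
    draft_tokens: Sequence[int],
    p_rows: Sequence[Sequence[float]],
    q_rows: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Commit drafted tokens left to right, stopping at the first rejection."""
    committed: List[int] = []
    for token, p_row, q_row in zip(draft_tokens, p_rows, q_rows):
        accepted, out = accept_or_resample(token, p_row, q_row, rng)
        committed.append(out)
        if not accepted:
            break  # later draft tokens were conditioned on the rejected one
    return committed
```

The point of this rule is that each committed token is distributed exactly according to the target distribution, regardless of how good the draft is; a better draft only raises the acceptance rate.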
Target-only mode
Disable DFlash entirely:
mach-serve /path/to/checkpoint --streaming --target-only --port 8080

Auto-fallback also occurs when no usable --draft-dir is provided (no [dflash] install or missing draft weights).
The fallback is decided once by the arch gate (_resolve_effective_decode_path): a stray --draft-dir on a non-Qwen architecture, or a missing/unusable draft, resolves cleanly to target-only instead of mis-building the DFlash stack. The resolved path is surfaced as effective_decode_path (+ reason) in GET /v1/capabilities.
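The decision described above can be pictured as a single pure function. The sketch below is a hypothetical reconstruction, not the body of `_resolve_effective_decode_path`: the function name, the `DecodePath` tuple, the reason strings, the `"target-only"` path label, and the assumption that `qwen3_5_moe` is the only DFlash-capable architecture are all illustrative. It also ignores `deepseek_v4`, which uses a separate rho-gate path.

```python
from typing import NamedTuple, Optional

# Assumption: only the Qwen DFlash architecture can build the DFlash stack.
DFLASH_ARCHS = frozenset({"qwen3_5_moe"})


class DecodePath(NamedTuple):
    effective_decode_path: str
    reason: str


def resolve_decode_path_sketch(
    arch: str,
    draft_dir: Optional[str],
    draft_usable: bool,
    target_only: bool = False,
) -> DecodePath:
    """Decide the decode path once, before any engine is constructed."""
    if target_only:
        return DecodePath("target-only", "target-only requested")
    if arch not in DFLASH_ARCHS:
        # A stray draft directory on another architecture is ignored, not an error.
        return DecodePath("target-only", "architecture has no DFlash path")
    if draft_dir is None or not draft_usable:
        return DecodePath("target-only", "no usable draft")
    return DecodePath("specdec-v7", "draft available")
```

Resolving the path in one place means later code can branch on a single value instead of re-checking flags, architecture, and draft availability separately.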
Draft-free request presets
Independently of the trained DFlash draft, callers can request a draft-free speculation preset per request via an additive speculation field on the chat-completions body. These presets need no draft head — they propose tokens from the running context and verify them against the target in a single block.
| Preset | Strategy | Best for |
|---|---|---|
| none (default) | Engine defaults: CopySpec for the Qwen DFlash path, plain autoregressive for target-only. | Anything; identical to the no-speculation path. |
| summarization | Prompt-lookup decoding (PLD): propose the continuation of the most recent n-gram match of the running suffix. | Outputs that echo the input (summaries, edits, refactors). |
| classification | Fixed-token: when the target's just-emitted token starts a candidate label, propose the rest of that label so it resolves in ~one verified block. | Short, fixed-vocabulary answers (labels, routing). |
All three presets are exactness-preserving: greedy decode with any preset is token-identical to greedy decode without it, because every committed token is the target's own argmax. A draft can only save forward passes; it never changes what the target would have emitted. When the speculation field is absent (or preset: "none"), decoding is byte-for-byte identical to the existing no-preset path.
Request shape (injected into the POST body):
{
  "speculation": {
    "preset": "summarization",
    "min_ngram": 3,
    "max_ngram": 5,
    "max_draft": 4
  }
}

{
  "speculation": {
    "preset": "classification",
    "candidates": ["positive", "negative", "neutral"]
  }
}

candidates label strings are tokenized server-side into candidate token-id sequences (the engine never needs a tokenizer of its own). min_ngram / max_ngram (defaults 3 / 5) tune the PLD lookup window for summarization; max_draft caps the per-cycle draft length (the engine's block size caps it otherwise).
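To make the summarization knobs concrete, here is a minimal prompt-lookup proposer over token ids. It is a sketch under assumptions rather than the engine's implementation: the function name `pld_propose` is hypothetical, and trying the longest n-gram first is an assumed tie-breaking order. The parameter names mirror the `min_ngram`, `max_ngram`, and `max_draft` request fields.

```python
from typing import List, Sequence


def pld_propose(
    tokens: Sequence[int],
    min_ngram: int = 3,
    max_ngram: int = 5,
    max_draft: int = 4,
) -> List[int]:
    """Propose the tokens that followed the most recent earlier occurrence
    of the running suffix, or an empty list if there is no match."""
    tokens = list(tokens)
    # Assumption: prefer the longest matching n-gram, then shorter ones.
    for n in range(max_ngram, min_ngram - 1, -1):
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan right to left so the most recent earlier match wins; the
        # start bound excludes the suffix matching itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + max_draft]
    return []
```

For example, with the context `[1, 2, 3, 4, 5, 6, 7, 1, 2, 3]` the suffix `[1, 2, 3]` matches the start of the sequence, so the proposer returns `[4, 5, 6, 7]`. The target then verifies that block, so a wrong proposal costs a cycle but never changes the output.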
The Maniac harness can auto-select summarization for memory/summarizer-style sub-agent roles (where the output echoes the input); classification stays an explicit API capability because it requires a caller-supplied candidate label set.
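The role of the caller-supplied label set can be illustrated with a small fixed-token proposer. This is a hypothetical sketch: `fixed_token_propose` is not an engine function, the token ids are made up, and proposing only the shared prefix when several candidates start with the same token is an assumption about how ambiguity might be handled.

```python
from typing import List, Sequence


def fixed_token_propose(
    last_token: int,
    candidates: Sequence[Sequence[int]],
) -> List[int]:
    """If the just-emitted token starts a candidate label, propose the rest."""
    rests = [list(c[1:]) for c in candidates if c and c[0] == last_token]
    if not rests:
        return []
    # With one match this is the whole remainder; with several (assumed
    # behavior) it is only the part that all matching labels share.
    prefix: List[int] = []
    for column in zip(*rests):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix
```

With candidate id sequences `[[10, 11, 12], [20, 21], [30]]`, emitting token `10` yields the proposal `[11, 12]`, which the target can confirm in a single verified block.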
How presets ride the decode paths
A preset engages on one of two paths depending on architecture:
- Qwen DFlash path (qwen3_5_moe): the preset layers onto the existing DFlash/CopySpec loop.
- Target-only overlay (gpt_oss, gemma4, qwen3_moe): a shared draft-free overlay (specdec/draft_free.py) runs the propose → verify → longest-prefix-accept → trim cycle.
The target-only overlay requires every KV-cache layer to support O(1) trim() — i.e. all-attention models. Hybrid linear-attention caches (Qwen3-Next / Qwen3.6 GDN) are not trimmable, so the dispatcher gates on a trimmability probe and falls back to plain decoding for those architectures (the preset becomes a no-op rather than an error).
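The gating logic can be sketched with two toy cache types. Everything here is illustrative: the class names, the `is_trimmable()` / `trim(n)` method names, and the path labels returned by `choose_path` are assumptions, not the dispatcher's actual interface.

```python
from typing import Sequence


class ToyAttentionCache:
    """KV cache that tracks a length; rejected tokens are dropped in O(1)."""

    def __init__(self) -> None:
        self.offset = 0

    def append(self, n: int) -> None:
        self.offset += n

    def is_trimmable(self) -> bool:
        return True

    def trim(self, n: int) -> int:
        n = min(n, self.offset)
        self.offset -= n  # only the logical length changes
        return n


class ToyLinearAttentionState:
    """Recurrent state folded over all tokens; the last n cannot be undone."""

    def is_trimmable(self) -> bool:
        return False


def all_layers_trimmable(layers: Sequence[object]) -> bool:
    """Probe: every layer must expose trim() and report itself trimmable."""
    for layer in layers:
        probe = getattr(layer, "is_trimmable", None)
        if not callable(getattr(layer, "trim", None)) or not callable(probe):
            return False
        if not probe():
            return False
    return True


def choose_path(layers: Sequence[object], preset: str) -> str:
    """Fall back to plain decoding instead of raising when trim is unavailable."""
    if preset == "none" or not all_layers_trimmable(layers):
        return "plain"
    return "draft-free-overlay"
```

Trimming matters because verification writes cache entries for the whole drafted block; when only a prefix is accepted, the entries for the rejected tail have to be discarded before the next cycle.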
DeepSeek-V4: rho-gate instead of DFlash
deepseek_v4 uses MTP / EAGLE-3.1 drafts with a rho-gate acceptance path, not DFlash v7.
| Env / knob | Role |
|---|---|
| LME_V4_DRAFT | Draft source selection |
| LME_RHO_GATE* | Rho-gate thresholds and behavior |
See Architecture for the load_v4_flash entry point.
Research and experimental paths
Additional draft strategies exist behind experimental gating (LME_ALLOW_EXPERIMENTAL):
- EAGLE-3 tree drafts
- Zerocost drafts
These are not the production mach-serve default. Use for ablations and development only.
The prompt-lookup (PLD) and CopySpec strategies that previously lived here as experimental paths are now the supported, per-request draft-free presets above (summarization / classification).
Interaction with continuous batching
Default single-flight serving always uses DFlash when available. Opt-in continuous batching demotes to target-only BatchedDecoder when batch size ≥ 2. See Continuous batching.