Architecture
How load_moe detects architecture, swaps expert blocks, and wires decode, caching, and serving together.
The engine centers on a shared load → swap → dispatch → decode pipeline. Every registered MoE family (REGISTRY / MoeArchAdapter) follows the same shape; architecture-specific code plugs in at detection and model construction.
This page is the conceptual hub linking Expert residency, Speculative decoding, Continuous batching, and Caching.
Load pipeline
Entry points: load(repo_or_path) and load_moe(checkpoint, arch=...). The flow in io/load.py:
- Read config.json from the checkpoint directory.
- detect_arch(): match model_type and layout hints against REGISTRY.
- Construct base model via the adapter for that architecture.
- _swap_switch_mlp(): replace routed-expert MLP blocks with engine-specific switch modules.
- Load non-expert weights (attention, norms, embeddings, shared experts).
- Optional runtime hooks: prewarm, mx.compile, fast-RMSNorm fusion, streaming embedding.
Return value: (model, tokenizer, store, cache) where store and cache back expert residency and KV state.
from mach import load_moe
model, tokenizer, store, cache = load_moe(
"/path/to/checkpoint",
arch="qwen3_5_moe", # optional; auto-detected when omitted
)
See Library for programmatic APIs.
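The detection step of the load flow can be illustrated with a minimal sketch. It is not the engine's detect_arch(): TOY_REGISTRY, the adapter names in it, and detect_arch_sketch are hypothetical, and the sketch matches only on model_type, leaving out the layout hints the real function also considers.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for REGISTRY: architecture key -> adapter name.
# The adapter names are invented for illustration only.
TOY_REGISTRY = {
    "qwen3_5_moe": "Qwen35MoeAdapter",
    "qwen3_moe": "Qwen3MoeAdapter",
    "gpt_oss": "GptOssAdapter",
}


def detect_arch_sketch(checkpoint_dir, arch=None):
    """Return a registry key for the checkpoint, or raise if it is not registered."""
    if arch is None:
        # No explicit override: read model_type from the checkpoint's config.json.
        config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
        arch = config.get("model_type")
    if arch not in TOY_REGISTRY:
        raise ValueError(f"unsupported architecture: {arch!r}")
    return arch


# Usage: write a minimal config.json, then detect with and without an override.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(json.dumps({"model_type": "qwen3_moe"}))
    detected = detect_arch_sketch(tmp)
    overridden = detect_arch_sketch(tmp, arch="gpt_oss")
```

Routing both the auto-detected and the explicit arch value through the same registry lookup keeps family checks in one place, which is the property the page attributes to detect_arch().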
REGISTRY and MoeArchAdapter
REGISTRY maps architecture keys (qwen3_5_moe, qwen3_moe, gpt_oss, gemma4, deepseek_v4, …) to MoeArchAdapter implementations. Each adapter knows how to:
- Build the MLX model class for that family (model_module)
- Name on-disk per-expert / stacked tensors and classify checkpoint layout (sliced per-expert vs axis-0 stacked)
- Apply the SwiGLU/GeGLU activation variant (swiglu_variant) and per-expert bias (has_expert_bias)
- Choose compatible expert store and cache backends
Each adapter also carries two capability fields on the same seam:
| Field | Values | Drives |
|---|---|---|
| kv_cache_kind | standard, hybrid_gdn, compressed, rotating | The static continuous-batching / prefix-reuse predictions. standard and hybrid_gdn batch cleanly; compressed (DeepSeek-V4) cannot; rotating (sliding-window) batches only when keep == 0, so it is reported honestly as unknown until verified at load. |
| specdec_kind | specdec_v7, mtp_eagle3, null | Whether the arch has a speculative-decode draft family (and which) or serves target-only. |
Both flow into arch_summary for free, so they appear in mach-archs --json and the GET /v1/capabilities arch block, and back the mach check path × architecture preflight matrix (continuous-batching capability is derived from kv_cache_kind).
detect_arch() is the single front door so CLI and server code do not hard-code family checks. The static, pre-load support evaluator behind mach check reuses this same registry (plus the loader's layout / routed-quant inspection and the server's serving gates) to predict what a checkpoint can run without loading weights.
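As a rough illustration of how the two capability fields could drive a static prediction, the sketch below derives a continuous-batching verdict from kv_cache_kind and a target-only flag from specdec_kind. AdapterSketch, predict_continuous_batching, summarize_sketch, and the "supported" / "unsupported" / "unknown" labels are assumptions made for this example; they are not the real MoeArchAdapter or arch_summary. The "example_rotating" entry is a made-up architecture key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdapterSketch:
    """Hypothetical, trimmed-down stand-in for a MoeArchAdapter."""
    key: str
    kv_cache_kind: str            # standard | hybrid_gdn | compressed | rotating
    specdec_kind: Optional[str]   # specdec_v7 | mtp_eagle3 | None


def predict_continuous_batching(adapter):
    """Static prediction from kv_cache_kind, following the table above."""
    if adapter.kv_cache_kind in ("standard", "hybrid_gdn"):
        return "supported"
    if adapter.kv_cache_kind == "compressed":
        return "unsupported"
    if adapter.kv_cache_kind == "rotating":
        # Batches only when keep == 0, which is verified at load time.
        return "unknown"
    raise ValueError(f"unrecognized kv_cache_kind: {adapter.kv_cache_kind!r}")


def summarize_sketch(adapter):
    """Build a small capability record, loosely in the spirit of arch_summary."""
    return {
        "arch": adapter.key,
        "kv_cache_kind": adapter.kv_cache_kind,
        "specdec_kind": adapter.specdec_kind,
        "continuous_batching": predict_continuous_batching(adapter),
        "target_only": adapter.specdec_kind is None,
    }


v4 = summarize_sketch(AdapterSketch("deepseek_v4", "compressed", "mtp_eagle3"))
rotating = summarize_sketch(AdapterSketch("example_rotating", "rotating", None))
```

Because the prediction depends only on adapter fields, it can be evaluated without loading weights, which matches the pre-load role described for mach check.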
DiskSwitchGLU vs GatherSwitchGLU
Routed experts are not executed as a single dense MLP. After swap:
| Module | Role | Production default |
|---|---|---|
| DiskSwitchGLU | Legacy per-slot disk loop; loads experts into fixed bank slots one route at a time | Off when gather dispatch is enabled |
| GatherSwitchGLU | Batches routed expert matmuls through gather kernels (LME_USE_GATHER_DISPATCH=1) | On (production path for agent workloads) |
Gather dispatch avoids Python per-route loops on the hot path. Stats surface via GatherSwitchGLU.stats() (grouping, fuse gate/up experiments, hit-only fast path).
LME_USE_GATHER_DISPATCH defaults to on. Disable only for diagnostics comparing legacy disk loops.
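The difference between a per-route loop and grouped dispatch can be shown with a conceptual, pure-Python sketch. It is not the GatherSwitchGLU kernel: the function names, the toy scaling "experts", and the top-1 routing (one expert per token) are all simplifying assumptions. The point is only that grouping token indices by expert reduces the number of expert invocations while producing the same outputs.

```python
from collections import defaultdict


def per_route_loop(tokens, routes, experts):
    """Legacy-style shape: one expert call per (token, route) pair."""
    out = [0.0] * len(tokens)
    calls = 0
    for i, (x, expert_id) in enumerate(zip(tokens, routes)):
        out[i] = experts[expert_id]([x])[0]
        calls += 1
    return out, calls


def grouped_dispatch(tokens, routes, experts):
    """Gather-style shape: group token indices by expert, one batched call each."""
    groups = defaultdict(list)
    for i, expert_id in enumerate(routes):
        groups[expert_id].append(i)
    out = [0.0] * len(tokens)
    calls = 0
    for expert_id, idxs in groups.items():
        batch = [tokens[i] for i in idxs]       # gather inputs for this expert
        results = experts[expert_id](batch)     # single batched expert call
        calls += 1
        for i, y in zip(idxs, results):         # scatter results back in place
            out[i] = y
    return out, calls


# Toy experts: expert e multiplies every input in its batch by (e + 1).
experts = {e: (lambda batch, s=e + 1: [s * x for x in batch]) for e in range(3)}
tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
routes = [0, 2, 0, 1, 2, 0]

loop_out, loop_calls = per_route_loop(tokens, routes, experts)
gather_out, gather_calls = grouped_dispatch(tokens, routes, experts)
```

With six tokens routed to three experts, the loop makes six expert calls and the grouped version makes three, with identical outputs.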
Checkpoint layouts
Two physical layouts drive store selection:
Sliced (per-expert) layout
- Experts stored as separate slices or packed sidecar records (expert_sidecar/layer_XX.bin).
- Default for engine-format sliced artifacts (e.g. full-expert k256 checkpoints).
- Pairs with streaming residency: bounded BankCache / LayerExpertBank, misses pread from SSD.
- Production quant path uses native sidecar I/O + direct pread.
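To make "bounded bank, misses read from SSD" concrete, here is a minimal sketch of a capacity-limited expert bank that calls a loader on a miss. BoundedBankSketch and fake_read are hypothetical, and least-recently-used eviction is an assumption chosen for the example; the page does not state which eviction policy BankCache or LayerExpertBank use.

```python
from collections import OrderedDict


class BoundedBankSketch:
    """Hypothetical bounded expert bank; LRU eviction is an assumption."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # called on a miss, e.g. a disk read
        self.slots = OrderedDict()    # (layer, expert) -> weights
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.slots:
            self.hits += 1
            self.slots.move_to_end(key)       # mark as most recently used
            return self.slots[key]
        self.misses += 1
        weights = self.load_fn(layer, expert)
        self.slots[key] = weights
        if len(self.slots) > self.capacity:
            self.slots.popitem(last=False)    # evict the least recently used
        return weights


loads = []


def fake_read(layer, expert):
    """Stand-in for a disk read; records which experts were fetched."""
    loads.append((layer, expert))
    return f"weights[{layer}][{expert}]"


bank = BoundedBankSketch(capacity=2, load_fn=fake_read)
for expert in [0, 1, 0, 2, 1]:
    bank.get(0, expert)
```

With capacity 2, the access pattern above re-reads expert 1 after it is evicted, which is the kind of miss traffic a bounded bank trades for lower memory use.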
Stacked (axis-0) layout
- All experts concatenated along axis 0 in MLX safetensors.
- stacked residency (--no-streaming): ResidentStackedExpertCache keeps all compatible experts resident; requires enough RAM to hold the full expert set.
- bf16_streaming: streams full-precision stacked slices into bounded BF16 banks for bakeoffs.
load_moe prints an explicit guard when regular streaming is pointed at a stacked layout: it uses lazy per-expert slices, not resident stacked banks. Use --no-streaming only when full resident quant banks are intentional.
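The layout and residency pairings described in this section can be summarized as a small decision function. This is a sketch only: select_store_sketch, its string labels, and its return shape are hypothetical, and combinations this page does not describe are rejected here rather than guessed at.

```python
def select_store_sketch(layout, residency):
    """Map (checkpoint layout, residency mode) to a store description."""
    if layout == "sliced" and residency == "streaming":
        return {"store": "bounded BankCache / LayerExpertBank", "guard": None}
    if layout == "stacked" and residency == "stacked":
        return {"store": "ResidentStackedExpertCache", "guard": None}
    if layout == "stacked" and residency == "bf16_streaming":
        return {"store": "bounded BF16 banks", "guard": None}
    if layout == "stacked" and residency == "streaming":
        # Mirrors the guard described above: regular streaming on a stacked
        # layout reads lazy per-expert slices instead of resident banks.
        return {
            "store": "lazy per-expert slices",
            "guard": "streaming on a stacked layout does not use resident stacked banks",
        }
    # Assumption: anything else is outside what this page documents.
    raise ValueError(f"combination not described: {layout!r} + {residency!r}")


default_sliced = select_store_sketch("sliced", "streaming")
resident = select_store_sketch("stacked", "stacked")
guarded = select_store_sketch("stacked", "streaming")
```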
Where subsystems plug in
flowchart TB
load["load_moe + _swap_switch_mlp"] --> store["Expert store / bank"]
store --> dispatch["GatherSwitchGLU dispatch"]
dispatch --> prefill["Prefill (transient or resident)"]
dispatch --> decode["Decode loop"]
prefill --> pcache["Prefix cache / DiskKVCache"]
decode --> spec["DFlash v7 (default)"]
decode --> batch["ContinuousBatchScheduler (opt-in)"]
decode --> kv["KV cache + optional TurboQuant"]
spec --> http["mach-serve HTTP"]
batch --> http
- Residency: store implementation (BankCache, ResidentStackedExpertCache, BF16 streaming) selected by --streaming / --no-streaming / expert_residency. See Expert residency.
- Decode: BlockSpecDecEngine (default) or target-only / batched decoders. See Speculative decoding and Continuous batching.
- Caches: in-memory L1 + disk L2 prefix snapshots; GDN hybrid caches for Qwen3.5/3.6. See Caching.
- Serving: FastAPI app wraps the loaded model and decode engine. See Serving.
DeepSeek-V4 branch
deepseek_v4 uses a parallel load path (load_v4_flash) with hybrid routing and MTP/EAGLE-3.1 draft integration instead of DFlash v7. Rho-gate env knobs (LME_V4_DRAFT, LME_RHO_GATE*) control draft acceptance. See Speculative decoding.