Architecture

How load_moe detects architecture, swaps expert blocks, and wires decode, caching, and serving together.

The engine centers on a shared load → swap → dispatch → decode pipeline. Every registered MoE family (REGISTRY / MoeArchAdapter) follows the same shape; architecture-specific code plugs in at detection and model construction.

This page is the conceptual hub linking Expert residency, Speculative decoding, Continuous batching, and Caching.

Load pipeline

Entry points: load(repo_or_path) and load_moe(checkpoint, arch=...). The flow in io/load.py:

Read config.json from the checkpoint directory.
detect_arch() — match model_type and layout hints against REGISTRY.
Construct base model via the adapter for that architecture.
_swap_switch_mlp() — replace routed-expert MLP blocks with engine-specific switch modules.
Load non-expert weights (attention, norms, embeddings, shared experts).
Optional runtime hooks — prewarm, mx.compile, fast-RMSNorm fusion, streaming embedding.

Return value: (model, tokenizer, store, cache) where store and cache back expert residency and KV state.

from mach import load_moe

model, tokenizer, store, cache = load_moe(
    "/path/to/checkpoint",
    arch="qwen3_5_moe",  # optional; auto-detected when omitted
)
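
The first, second, and fourth steps of the flow listed above can be pictured with a toy stand-in. This is a minimal sketch under stated assumptions, not the code in io/load.py: `TOY_REGISTRY`, `DenseExperts`, `SwitchStub`, and the three helper functions are hypothetical names, and the detection step matches only `model_type` (the real `detect_arch()` also uses layout hints).

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in: the real REGISTRY maps keys to MoeArchAdapter objects.
TOY_REGISTRY = {"qwen3_5_moe": "adapter-a", "gpt_oss": "adapter-b"}


class DenseExperts:
    """Placeholder for a routed-expert MLP block built by the base model."""


class SwitchStub:
    """Placeholder for an engine-specific switch module."""

    def __init__(self, layer_index):
        self.layer_index = layer_index


def read_config(checkpoint_dir):
    # Read config.json from the checkpoint directory.
    return json.loads((Path(checkpoint_dir) / "config.json").read_text())


def detect_arch_sketch(config, arch=None):
    # An explicit arch wins; otherwise fall back to model_type from the config.
    key = arch or config.get("model_type")
    if key not in TOY_REGISTRY:
        raise ValueError(f"unsupported architecture: {key!r}")
    return key


def swap_switch_mlp_sketch(layers):
    # Replace each routed-expert block with a switch module, in place.
    swapped = 0
    for i, layer in enumerate(layers):
        if isinstance(layer.get("mlp"), DenseExperts):
            layer["mlp"] = SwitchStub(layer_index=i)
            swapped += 1
    return swapped


with tempfile.TemporaryDirectory() as ckpt:
    (Path(ckpt) / "config.json").write_text(json.dumps({"model_type": "qwen3_5_moe"}))
    arch_key = detect_arch_sketch(read_config(ckpt))

# Toy "model": only the second layer carries routed experts.
toy_layers = [{"mlp": object()}, {"mlp": DenseExperts()}]
num_swapped = swap_switch_mlp_sketch(toy_layers)
print(arch_key, num_swapped)
```

Keeping detection as a lookup against one registry is what lets the swap step stay architecture-agnostic: it only needs to recognize the block type to replace.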

See Library for programmatic APIs.

REGISTRY and MoeArchAdapter

REGISTRY maps architecture keys (qwen3_5_moe, qwen3_moe, gpt_oss, gemma4, deepseek_v4, …) to MoeArchAdapter implementations. Each adapter knows how to:

Build the MLX model class for that family (model_module)
Name on-disk per-expert / stacked tensors and classify checkpoint layout (sliced per-expert vs axis-0 stacked)
Apply the SwiGLU/GeGLU activation variant (swiglu_variant) and per-expert bias (has_expert_bias)
Choose compatible expert store and cache backends

Each adapter also carries two capability fields on the same seam:

Field	Values	Drives
`kv_cache_kind`	`standard`, `hybrid_gdn`, `compressed`, `rotating`	The static continuous-batching / prefix-reuse predictions. `standard` and `hybrid_gdn` batch cleanly; `compressed` (DeepSeek-V4) cannot; `rotating` (sliding-window) batches only when `keep == 0`, so it is reported honestly as `unknown` until verified at load.
`specdec_kind`	`specdec_v7`, `mtp_eagle3`, `null`	Whether the arch has a speculative-decode draft family (and which) or serves target-only.

Both fields flow into arch_summary without extra wiring, so they appear in mach-archs --json and in the arch block of GET /v1/capabilities. They also back the path × architecture preflight matrix of mach check, where continuous-batching capability is derived from kv_cache_kind.
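
The way a static batching prediction can fall out of `kv_cache_kind` is easy to sketch. The snippet below is illustrative only: `AdapterSketch` and `predict_continuous_batching` are hypothetical names, the `"supported"` / `"unsupported"` labels are assumptions (only `unknown` appears in the text), and the example field values are not read from the real REGISTRY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdapterSketch:
    # Field names follow the table above; the class is a hypothetical stand-in.
    key: str
    kv_cache_kind: str            # "standard" | "hybrid_gdn" | "compressed" | "rotating"
    specdec_kind: Optional[str]   # "specdec_v7" | "mtp_eagle3" | None (target-only)


def predict_continuous_batching(adapter):
    """Static prediction from kv_cache_kind alone, made before any weights load."""
    if adapter.kv_cache_kind in ("standard", "hybrid_gdn"):
        return "supported"
    if adapter.kv_cache_kind == "compressed":
        return "unsupported"
    # "rotating" batches only when keep == 0, which is not known statically.
    # Unrecognized kinds also land here rather than being guessed at.
    return "unknown"


# Illustrative values only.
adapters = [
    AdapterSketch("qwen3_5_moe", "hybrid_gdn", "specdec_v7"),
    AdapterSketch("deepseek_v4", "compressed", "mtp_eagle3"),
    AdapterSketch("toy_sliding_window", "rotating", None),
]
predictions = {a.key: predict_continuous_batching(a) for a in adapters}
print(predictions)
```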

detect_arch() is the single front door so CLI and server code do not hard-code family checks. The static, pre-load support evaluator behind mach check reuses this same registry (plus the loader's layout / routed-quant inspection and the server's serving gates) to predict what a checkpoint can run without loading weights.

DiskSwitchGLU vs GatherSwitchGLU

Routed experts are not executed as a single dense MLP. After the swap, one of two modules runs them:

Module	Role	Production default
`DiskSwitchGLU`	Legacy per-slot disk loop; loads experts into fixed bank slots one route at a time	Off when gather dispatch is enabled
`GatherSwitchGLU`	Batches routed expert matmuls through gather kernels (`LME_USE_GATHER_DISPATCH=1`)	On — production path for agent workloads

Gather dispatch avoids Python per-route loops on the hot path. Stats surface via GatherSwitchGLU.stats() (grouping, fused gate/up experiments, hit-only fast path).

LME_USE_GATHER_DISPATCH defaults to on. Disable only for diagnostics comparing legacy disk loops.
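
The difference between the two dispatch shapes can be shown with plain Python lists. This is a sketch of the idea, not the engine's kernels: the toy experts are scalar functions rather than matmuls, routing is top-1 for simplicity, and treating the value `"0"` as "off" for `LME_USE_GATHER_DISPATCH` is an assumption about how the variable is parsed.

```python
import os
from collections import defaultdict


def gather_dispatch_enabled():
    # Default-on per the text; interpreting "0" as off is an assumption.
    return os.environ.get("LME_USE_GATHER_DISPATCH", "1") != "0"


# Toy experts: each one scales its input by a different factor.
EXPERTS = {0: lambda x: x * 10, 1: lambda x: x * 100, 2: lambda x: x * 1000}


def per_route_loop(tokens, routes):
    # Legacy shape: one expert call per (token, route) pair.
    return [EXPERTS[e](t) for t, e in zip(tokens, routes)]


def gather_dispatch(tokens, routes):
    # Group token positions by expert, process each group together,
    # then scatter the results back into the original token order.
    groups = defaultdict(list)
    for pos, expert_id in enumerate(routes):
        groups[expert_id].append(pos)
    out = [None] * len(tokens)
    for expert_id, positions in groups.items():
        batch = [tokens[p] for p in positions]
        # Stands in for a single batched matmul over the whole group.
        results = [EXPERTS[expert_id](x) for x in batch]
        for p, r in zip(positions, results):
            out[p] = r
    return out


tokens = [1, 2, 3, 4]
routes = [2, 0, 2, 1]
dispatch = gather_dispatch if gather_dispatch_enabled() else per_route_loop
outputs = dispatch(tokens, routes)
print(outputs)
```

Both functions return the same values; the grouped form makes one pass per active expert instead of one per route, which is where the hot-path saving comes from.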

Checkpoint layouts

Two physical layouts drive store selection:

Sliced (per-expert) layout

Experts stored as separate slices or packed sidecar records (expert_sidecar/layer_XX.bin).
Default for engine-format sliced artifacts (e.g. full-expert k256 checkpoints).
Pairs with streaming residency: bounded BankCache / LayerExpertBank, misses pread from SSD.
Production quant path uses native sidecar I/O + direct pread.
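
A bounded bank that serves misses with positioned reads can be sketched as follows. The details are assumptions for illustration: the fixed `RECORD_SIZE`, the one-record-per-expert file layout, and the LRU eviction policy are hypothetical (the text only says the bank is bounded), and `os.pread` is available on Unix-like systems only.

```python
import os
import tempfile
from collections import OrderedDict

RECORD_SIZE = 8  # hypothetical fixed record size per expert, in bytes


class BoundedBankSketch:
    """LRU-bounded expert bank; a miss reads one record with os.pread."""

    def __init__(self, fd, capacity):
        self.fd = fd
        self.capacity = capacity
        self.slots = OrderedDict()  # expert_id -> bytes, least recent first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.slots:
            self.hits += 1
            self.slots.move_to_end(expert_id)  # mark as most recently used
            return self.slots[expert_id]
        self.misses += 1
        # Positioned read: no seek, no shared file offset to coordinate.
        data = os.pread(self.fd, RECORD_SIZE, expert_id * RECORD_SIZE)
        self.slots[expert_id] = data
        if len(self.slots) > self.capacity:
            self.slots.popitem(last=False)  # evict the least recently used
        return data


# Build a toy sidecar file: 4 experts, each record filled with its own id.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for expert_id in range(4):
        f.write(bytes([expert_id]) * RECORD_SIZE)
    sidecar_path = f.name

fd = os.open(sidecar_path, os.O_RDONLY)
try:
    bank = BoundedBankSketch(fd, capacity=2)
    reads = [bank.get(e) for e in (0, 1, 0, 2, 1)]
finally:
    os.close(fd)
    os.remove(sidecar_path)

print(bank.hits, bank.misses)
```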

Stacked (axis-0) layout

All experts concatenated along axis 0 in MLX safetensors.
stacked residency (--no-streaming): ResidentStackedExpertCache keeps all compatible experts resident — requires enough RAM to hold the full expert set.
bf16_streaming: streams full-precision stacked slices into bounded BF16 banks for bakeoffs.

When regular streaming is pointed at a stacked layout, load_moe prints an explicit guard message: this combination uses lazy per-expert slices, not resident stacked banks. Use --no-streaming only when fully resident quant banks are intended.
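
The layout and residency pairings described in this section can be summarized as a small decision function. This is a sketch, not the loader's logic: `select_expert_store`, the `"streaming"` residency value, and the guard wording are hypothetical, and combinations the text does not describe are deliberately left unmodeled.

```python
def select_expert_store(layout, expert_residency="streaming"):
    """Map a (layout, residency) pair to a store name and an optional guard message."""
    if layout not in ("sliced", "stacked"):
        raise ValueError(f"unknown layout: {layout!r}")
    if expert_residency not in ("streaming", "stacked", "bf16_streaming"):
        raise ValueError(f"unknown residency: {expert_residency!r}")

    if layout == "sliced":
        if expert_residency != "streaming":
            # The text only describes streaming for sliced layouts.
            raise NotImplementedError("combination not modeled in this sketch")
        return "BankCache", None
    if expert_residency == "stacked":
        # --no-streaming: all experts resident, needs RAM for the full set.
        return "ResidentStackedExpertCache", None
    if expert_residency == "bf16_streaming":
        return "bounded BF16 banks", None
    # Regular streaming on a stacked layout: proceed, but state what it means.
    guard = "stacked layout + streaming: lazy per-expert slices, not resident stacked banks"
    return "lazy per-expert slices", guard


for combo in [("sliced", "streaming"), ("stacked", "stacked"), ("stacked", "streaming")]:
    print(combo, select_expert_store(*combo))
```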

Where subsystems plug in

flowchart TB
  load["load_moe + _swap_switch_mlp"] --> store["Expert store / bank"]
  store --> dispatch["GatherSwitchGLU dispatch"]
  dispatch --> prefill["Prefill (transient or resident)"]
  dispatch --> decode["Decode loop"]
  prefill --> pcache["Prefix cache / DiskKVCache"]
  decode --> spec["DFlash v7 (default)"]
  decode --> batch["ContinuousBatchScheduler (opt-in)"]
  decode --> kv["KV cache + optional TurboQuant"]
  spec --> http["mach-serve HTTP"]
  batch --> http

Residency — store implementation (BankCache, ResidentStackedExpertCache, BF16 streaming) selected by --streaming / --no-streaming / expert_residency. See Expert residency.
Decode — BlockSpecDecEngine (default) or target-only / batched decoders. See Speculative decoding and Continuous batching.
Caches — in-memory L1 + disk L2 prefix snapshots; GDN hybrid caches for Qwen3.5/3.6. See Caching.
Serving — FastAPI app wraps the loaded model and decode engine. See Serving.
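
The in-memory L1 plus disk L2 arrangement mentioned under Caches can be illustrated with a two-tier lookup. Everything in this sketch is an assumption made for illustration: the class name, the SHA-256 key over token ids, the JSON snapshot format, and the exact-match lookup (a real prefix cache would match the longest stored prefix and hold KV tensors rather than a small dict).

```python
import hashlib
import json
import tempfile
from pathlib import Path


class TwoTierPrefixCacheSketch:
    def __init__(self, disk_dir):
        self.l1 = {}                    # in-memory tier
        self.disk_dir = Path(disk_dir)  # on-disk tier

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(json.dumps(token_ids).encode()).hexdigest()

    def put(self, token_ids, snapshot):
        # Write through to both tiers.
        key = self._key(token_ids)
        self.l1[key] = snapshot
        (self.disk_dir / f"{key}.json").write_text(json.dumps(snapshot))

    def get(self, token_ids):
        key = self._key(token_ids)
        if key in self.l1:
            return "l1", self.l1[key]
        path = self.disk_dir / f"{key}.json"
        if path.exists():
            snapshot = json.loads(path.read_text())
            self.l1[key] = snapshot     # promote to L1 for the next lookup
            return "l2", snapshot
        return "miss", None


with tempfile.TemporaryDirectory() as disk_dir:
    cache = TwoTierPrefixCacheSketch(disk_dir)
    cache.put([1, 2, 3], {"num_tokens": 3})
    cache.l1.clear()                    # simulate a process restart
    first = cache.get([1, 2, 3])        # served from disk, then promoted
    second = cache.get([1, 2, 3])       # now served from memory
    absent = cache.get([9, 9])

print(first[0], second[0], absent[0])
```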

DeepSeek-V4 branch

deepseek_v4 uses a parallel load path (load_v4_flash) with hybrid routing and MTP/EAGLE-3.1 draft integration instead of DFlash v7. Rho-gate env knobs (LME_V4_DRAFT, LME_RHO_GATE*) control draft acceptance. See Speculative decoding.
