Maniac Docs

Architecture

How load_moe detects architecture, swaps expert blocks, and wires decode, caching, and serving together.

The engine centers on a shared load → swap → dispatch → decode pipeline. Every registered MoE family (REGISTRY / MoeArchAdapter) follows the same shape; architecture-specific code plugs in at detection and model construction.

This page is the conceptual hub linking Expert residency, Speculative decoding, Continuous batching, and Caching.

Load pipeline

Entry points: load(repo_or_path) and load_moe(checkpoint, arch=...). The flow in io/load.py:

  1. Read config.json from the checkpoint directory.
  2. detect_arch() — match model_type and layout hints against REGISTRY.
  3. Construct base model via the adapter for that architecture.
  4. _swap_switch_mlp() — replace routed-expert MLP blocks with engine-specific switch modules.
  5. Load non-expert weights (attention, norms, embeddings, shared experts).
  6. Optional runtime hooks — prewarm, mx.compile, fast-RMSNorm fusion, streaming embedding.

Return value: (model, tokenizer, store, cache) where store and cache back expert residency and KV state.

from mach import load_moe

model, tokenizer, store, cache = load_moe(
    "/path/to/checkpoint",
    arch="qwen3_5_moe",  # optional; auto-detected when omitted
)

See Library for programmatic APIs.

REGISTRY and MoeArchAdapter

REGISTRY maps architecture keys (qwen3_5_moe, qwen3_moe, gpt_oss, gemma4, deepseek_v4, …) to MoeArchAdapter implementations. Each adapter knows how to:

  • Build the MLX model class for that family (model_module)
  • Name on-disk per-expert / stacked tensors and classify checkpoint layout (sliced per-expert vs axis-0 stacked)
  • Apply the SwiGLU/GeGLU activation variant (swiglu_variant) and per-expert bias (has_expert_bias)
  • Choose compatible expert store and cache backends

Each adapter also carries two capability fields on the same seam:

| Field | Values | Drives |
| --- | --- | --- |
| kv_cache_kind | standard, hybrid_gdn, compressed, rotating | The static continuous-batching / prefix-reuse predictions. standard and hybrid_gdn batch cleanly; compressed (DeepSeek-V4) cannot; rotating (sliding-window) batches only when keep == 0, so it is reported honestly as unknown until verified at load. |
| specdec_kind | specdec_v7, mtp_eagle3, null | Whether the arch has a speculative-decode draft family (and which) or serves target-only. |

Both flow into arch_summary for free, so they appear in mach-archs --json and the GET /v1/capabilities arch block, and back the mach check path × architecture preflight matrix (continuous-batching capability is derived from kv_cache_kind).
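The derivation of continuous-batching capability from kv_cache_kind can be sketched as a small lookup. This is a minimal illustration of the rules stated in the table, not the engine's code: the function names predict_continuous_batching and resolve_rotating_at_load are hypothetical, and the use of None for "unknown" is an assumption.

```python
from typing import Optional


def predict_continuous_batching(kv_cache_kind: str) -> Optional[bool]:
    """Static, pre-load prediction (hypothetical helper).

    True  -> batches cleanly
    False -> cannot batch
    None  -> unknown until verified at load
    """
    if kv_cache_kind in ("standard", "hybrid_gdn"):
        return True
    if kv_cache_kind == "compressed":
        return False
    if kv_cache_kind == "rotating":
        # Sliding-window caches batch only when keep == 0, which is not
        # known statically, so the prediction stays "unknown".
        return None
    raise ValueError(f"unrecognized kv_cache_kind: {kv_cache_kind!r}")


def resolve_rotating_at_load(keep: int) -> bool:
    """Load-time check for rotating caches (hypothetical helper)."""
    return keep == 0


for kind in ("standard", "hybrid_gdn", "compressed", "rotating"):
    print(kind, predict_continuous_batching(kind))
```

Keeping the static answer three-valued lets a preflight report say "unknown" for rotating caches instead of guessing.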

detect_arch() is the single front door so CLI and server code do not hard-code family checks. The static, pre-load support evaluator behind mach check reuses this same registry (plus the loader's layout / routed-quant inspection and the server's serving gates) to predict what a checkpoint can run without loading weights.
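The single-front-door idea can be illustrated with a toy registry. Everything in this sketch is hypothetical (ToyAdapter, TOY_REGISTRY, toy_detect_arch, and the per-architecture field values are illustrative); the real detect_arch() also consults layout hints, which this version leaves out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ToyAdapter:
    """Stand-in for an adapter; fields mirror the capability fields above."""
    key: str
    model_types: Tuple[str, ...]
    kv_cache_kind: str
    specdec_kind: Optional[str]


# Illustrative entries only; the values are assumptions for the example.
TOY_REGISTRY: Dict[str, ToyAdapter] = {
    "qwen3_5_moe": ToyAdapter("qwen3_5_moe", ("qwen3_5_moe",), "hybrid_gdn", "specdec_v7"),
    "deepseek_v4": ToyAdapter("deepseek_v4", ("deepseek_v4",), "compressed", "mtp_eagle3"),
}


def toy_detect_arch(config: dict, arch: Optional[str] = None) -> ToyAdapter:
    """Resolve an adapter from an explicit key or from config['model_type']."""
    if arch is not None:
        if arch not in TOY_REGISTRY:
            raise ValueError(f"unknown arch: {arch!r}")
        return TOY_REGISTRY[arch]
    model_type = config.get("model_type")
    for adapter in TOY_REGISTRY.values():
        if model_type in adapter.model_types:
            return adapter
    raise ValueError(f"no registered adapter for model_type={model_type!r}")


adapter = toy_detect_arch({"model_type": "deepseek_v4"})
print(adapter.key, adapter.kv_cache_kind, adapter.specdec_kind)
```

Because callers only ever go through one lookup function, CLI, server, and preflight code can share the same answer without family-specific branches.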

DiskSwitchGLU vs GatherSwitchGLU

Routed experts are not executed as a single dense MLP. After swap:

| Module | Role | Production default |
| --- | --- | --- |
| DiskSwitchGLU | Legacy per-slot disk loop; loads experts into fixed bank slots one route at a time | Off when gather dispatch is enabled |
| GatherSwitchGLU | Batches routed expert matmuls through gather kernels (LME_USE_GATHER_DISPATCH=1) | On; production path for agent workloads |

Gather dispatch avoids Python per-route loops on the hot path. Stats surface via GatherSwitchGLU.stats() (grouping, fuse gate/up experiments, hit-only fast path).

LME_USE_GATHER_DISPATCH defaults to on. Disable only for diagnostics comparing legacy disk loops.
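The difference between a per-route loop and grouped dispatch can be shown in plain Python. This is a conceptual sketch only: the real GatherSwitchGLU runs gather kernels on device, and the names per_route_loop and grouped_dispatch, along with the toy "experts", are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Route = Tuple[int, int]  # (token_index, expert_id)
Expert = Callable[[List[float]], List[float]]  # maps a batch to a batch


def per_route_loop(routes: Sequence[Route], inputs: Sequence[float],
                   experts: Dict[int, Expert]):
    """Legacy-style shape: one expert call per (token, expert) route."""
    outputs, calls = [], 0
    for token_idx, expert_id in routes:
        out = experts[expert_id]([inputs[token_idx]])[0]
        outputs.append((token_idx, expert_id, out))
        calls += 1
    return outputs, calls


def grouped_dispatch(routes: Sequence[Route], inputs: Sequence[float],
                     experts: Dict[int, Expert]):
    """Group routes by expert so each expert runs once on a batch."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for token_idx, expert_id in routes:
        groups[expert_id].append(token_idx)
    results, calls = {}, 0
    for expert_id, token_indices in groups.items():
        batch_out = experts[expert_id]([inputs[i] for i in token_indices])
        calls += 1
        for i, out in zip(token_indices, batch_out):
            results[(i, expert_id)] = out
    # Restore the original route order for a like-for-like comparison.
    return [(t, e, results[(t, e)]) for t, e in routes], calls


toy_experts: Dict[int, Expert] = {
    1: lambda xs: [x * 2.0 for x in xs],
    2: lambda xs: [x + 10.0 for x in xs],
}
toy_routes = [(0, 1), (0, 2), (1, 1), (2, 2)]
toy_inputs = [1.0, 2.0, 3.0]

loop_out, loop_calls = per_route_loop(toy_routes, toy_inputs, toy_experts)
group_out, group_calls = grouped_dispatch(toy_routes, toy_inputs, toy_experts)
print(loop_calls, group_calls, loop_out == group_out)
```

Both paths produce the same outputs; the grouped version issues one call per distinct expert rather than one per route, which is the property that removes per-route Python work from the hot path.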

Checkpoint layouts

Two physical layouts drive store selection:

Sliced (per-expert) layout

  • Experts stored as separate slices or packed sidecar records (expert_sidecar/layer_XX.bin).
  • Default for engine-format sliced artifacts (e.g. full-expert k256 checkpoints).
  • Pairs with streaming residency: bounded BankCache / LayerExpertBank, misses pread from SSD.
  • Production quant path uses native sidecar I/O + direct pread.
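Bounded streaming residency can be pictured as a fixed-capacity cache in front of a disk reader. The sketch below is an assumption-heavy illustration: ToyExpertBank is hypothetical, the LRU eviction policy is assumed (this page does not state which policy BankCache uses), and the loader callback stands in for a pread from SSD.

```python
from collections import OrderedDict
from typing import Callable, Tuple

Key = Tuple[int, int]  # (layer_index, expert_id)


class ToyExpertBank:
    """Fixed-capacity expert cache; LRU eviction is an assumption."""

    def __init__(self, capacity: int, load_from_disk: Callable[[int, int], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_from_disk
        self._slots: "OrderedDict[Key, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, layer: int, expert: int) -> bytes:
        key = (layer, expert)
        if key in self._slots:
            self._slots.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._slots[key]
        self.misses += 1
        weights = self._load(layer, expert)  # stands in for a disk read
        self._slots[key] = weights
        if len(self._slots) > self.capacity:
            self._slots.popitem(last=False)  # evict least recently used
        return weights


disk_reads = []


def fake_pread(layer: int, expert: int) -> bytes:
    disk_reads.append((layer, expert))
    return f"L{layer}E{expert}".encode()


bank = ToyExpertBank(capacity=2, load_from_disk=fake_pread)
for expert_id in (1, 2, 1, 3, 2):
    bank.get(0, expert_id)
print(bank.hits, bank.misses, disk_reads)
```

The bank never holds more than its capacity, so memory stays bounded regardless of how many experts the checkpoint contains; the cost is a disk read on each miss.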

Stacked (axis-0) layout

  • All experts concatenated along axis 0 in MLX safetensors.
  • stacked residency (--no-streaming): ResidentStackedExpertCache keeps all compatible experts resident — requires enough RAM to hold the full expert set.
  • bf16_streaming: streams full-precision stacked slices into bounded BF16 banks for bakeoffs.

load_moe prints an explicit guard when regular streaming is pointed at a stacked layout: in that mode the loader reads lazy per-expert slices and does not build resident stacked banks. Use --no-streaming only when fully resident quant banks are intended.
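The layout-to-residency pairing described in this section can be summarized as a small decision function. This is a sketch of the rules as stated on this page, not the loader's logic: toy_select_residency is a hypothetical name, the returned notes list stands in for the printed guard, and combinations this page does not describe raise NotImplementedError rather than guess.

```python
from typing import List, Optional, Tuple


def toy_select_residency(layout: str, streaming: bool = True,
                         expert_residency: Optional[str] = None) -> Tuple[str, List[str]]:
    """Return (residency_mode, notes) for a checkpoint layout (hypothetical)."""
    notes: List[str] = []
    if layout not in ("sliced", "stacked"):
        raise ValueError(f"unknown layout: {layout!r}")
    if expert_residency == "bf16_streaming":
        if layout != "stacked":
            raise NotImplementedError("bf16_streaming is described only for stacked layouts")
        return "bf16_streaming", notes
    if layout == "sliced" and streaming:
        return "streaming", notes
    if layout == "stacked" and not streaming:
        return "stacked", notes
    if layout == "stacked" and streaming:
        # Mirrors the guard: streaming over a stacked layout uses lazy slices.
        notes.append("stacked layout with streaming: using lazy per-expert "
                     "slices, not resident stacked banks")
        return "streaming", notes
    raise NotImplementedError("sliced layout with --no-streaming is not described here")


print(toy_select_residency("stacked", streaming=True))
```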

Where subsystems plug in

flowchart TB
  load["load_moe + _swap_switch_mlp"] --> store["Expert store / bank"]
  store --> dispatch["GatherSwitchGLU dispatch"]
  dispatch --> prefill["Prefill (transient or resident)"]
  dispatch --> decode["Decode loop"]
  prefill --> pcache["Prefix cache / DiskKVCache"]
  decode --> spec["DFlash v7 (default)"]
  decode --> batch["ContinuousBatchScheduler (opt-in)"]
  decode --> kv["KV cache + optional TurboQuant"]
  spec --> http["mach-serve HTTP"]
  batch --> http

  • Residency — store implementation (BankCache, ResidentStackedExpertCache, BF16 streaming) selected by --streaming / --no-streaming / expert_residency. See Expert residency.
  • Decode — BlockSpecDecEngine (default) or target-only / batched decoders. See Speculative decoding and Continuous batching.
  • Caches — in-memory L1 + disk L2 prefix snapshots; GDN hybrid caches for Qwen3.5/3.6. See Caching.
  • Serving — FastAPI app wraps the loaded model and decode engine. See Serving.

DeepSeek-V4 branch

deepseek_v4 uses a parallel load path (load_v4_flash) with hybrid routing and MTP/EAGLE-3.1 draft integration instead of DFlash v7. Rho-gate env knobs (LME_V4_DRAFT, LME_RHO_GATE*) control draft acceptance. See Speculative decoding.
