Expert Residency
Streaming, stacked, and bf16_streaming modes — sidecar format, direct-pread, bank sizing, and transient prefill with resident decode.
Expert residency is the core memory story for MoE on Apple Silicon. Routed experts far exceed GPU RAM; the engine keeps a bounded bank of hot experts and streams misses from disk (or holds everything resident when the checkpoint and machine allow).
Three residency modes
| Mode | Selector | Experts live | When to use |
|---|---|---|---|
| streaming | --streaming | Disk (sliced safetensors + packed sidecar); bounded GPU bank | Memory-constrained machines; experts streamed from SSD into a bounded bank |
| stacked | --no-streaming | All compatible experts resident in GPU (ResidentStackedExpertCache) | Machines with enough RAM to hold every expert resident, using a stacked MLX export |
| bf16_streaming | expert_residency=bf16_streaming / LME_BF16_STACKED_STREAMING=1 | Full-precision stacked HF/SwiftLM slices streamed into BF16 banks | Bakeoffs / full-precision comparison, not the production sidecar path |
Use mach check to predict which mode fits a given checkpoint on your machine before loading weights.
mach-serve requires an explicit --streaming or --no-streaming choice. --streaming enables transient prefill + resident decode for streaming sidecar serving without manually setting LME_TRANSIENT_PREFILL=1.
Streaming (--streaming)
- Experts on disk; BankCache/LayerExpertBank holds a working subset per layer.
- Misses pread from SSD via native sidecar I/O or safetensors slices.
- Transient prefill: prefill uses a shared transient scratch arena so large prompt windows do not pin the full expert set.
- Resident decode: decode keeps a smaller resident bank; optional decode-arena reclaim and hot-expert pinning via LME_DECODE_RESIDENCY.
This is the default streaming path for full-expert sliced checkpoints; run mach check to confirm it fits your machine.
Stacked (--no-streaming)
- Requires axis-0 stacked MLX checkpoint compatible with ResidentStackedExpertCache.
- No admission/eviction: all routed experts resident.
- Fails fast if the checkpoint layout cannot support resident stacked banks.
Use only when you intentionally load every compatible quantized expert into memory.
BF16 streaming
- Opt-in full-precision streaming for plain HF/SwiftLM stacked checkpoints.
- Native safetensor pread into bounded BF16 banks (fallback: staged MLX load).
- Counters: bf16_pread_*, staged_load_*.
- Not the production quant sidecar path.
Transient prefill + resident decode
For quant sidecar streaming:
    mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Startup should report:

    production_streaming_mode=streaming
    expert_residency=streaming
    transient_prefill=1
    native_transient_prefill=1

Prefill experts are loaded transiently for the current prompt window; decode retains a resident expert bank sized by --expert-cache-gb and related knobs.
Expert sidecar format
Production direct-pread expects:
    checkpoint/
    ├── config.json
    ├── *.safetensors
    └── expert_sidecar/
        ├── layout.json
        ├── layer_00.bin
        ├── layer_01.bin
        └── ...

- layout.json: metadata covering format (lme-expert-sidecar-v1, GGUF v1/v2), layer count, expert count, and record layout.
- layer_XX.bin: packed per-layer expert records for native I/O.

Export with:
    python scripts/export_expert_sidecar.py \
      --checkpoint /path/to/<your-engine-checkpoint> \
      --output /path/to/<your-engine-checkpoint>/expert_sidecar \
      --num-experts <routed-experts-per-layer> \
      --bits 4

--num-experts must match the model's routed expert count per layer (e.g. 256 for Qwen3.6-35B-A3B).
Missing or invalid sidecar → production fail-fast (sidecar_valid=0) unless diagnostic fallback is explicitly enabled.
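
A pre-flight check along the following lines can catch a missing sidecar before launch. It is a minimal sketch that only verifies the directory layout shown above (layout.json plus contiguous layer_NN.bin files). The function name `check_sidecar_layout` is hypothetical, and the sketch does not inspect layout.json fields because their schema is not specified on this page; it is not the engine's own `sidecar_valid` check.

```python
import json
import re
from pathlib import Path


def check_sidecar_layout(checkpoint_dir):
    """Return a list of problems found in <checkpoint>/expert_sidecar (empty if none).

    Hypothetical helper: checks file presence and naming only, not record contents.
    """
    sidecar = Path(checkpoint_dir) / "expert_sidecar"
    if not sidecar.is_dir():
        return [f"missing directory: {sidecar}"]

    problems = []
    layout = sidecar / "layout.json"
    if not layout.is_file():
        problems.append("missing layout.json")
    else:
        try:
            json.loads(layout.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"layout.json is not valid JSON: {exc}")

    # Collect the numeric indices of files named layer_NN.bin.
    indices = sorted(
        int(m.group(1))
        for p in sidecar.iterdir()
        if (m := re.fullmatch(r"layer_(\d+)\.bin", p.name))
    )
    if not indices:
        problems.append("no layer_XX.bin files found")
    elif indices != list(range(len(indices))):
        problems.append(f"layer files are not contiguous from 0: {indices}")
    return problems
```

An empty return value means the layout looks plausible; it does not guarantee that the engine will accept the sidecar.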
Direct-pread fast path
lme_mlx_pread_ext preads sidecar bytes directly into persistent bank slots.
| Signal | Meaning |
|---|---|
| direct_pread=1 | Fast path active |
| direct_pread_bytes / direct_pread_syscalls | Runtime counters |
| native_extension=ready | Extension import succeeded |
| fallback_policy=fail-fast | Production default |
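
The idea behind direct pread can be illustrated in pure Python with `os.pread`, which reads at an explicit offset without moving the file position. This is only a sketch under assumptions: it assumes fixed-size expert records packed back to back in a layer file, and `read_expert_into_slot` and the bytearray bank are hypothetical stand-ins, not the lme_mlx_pread_ext API or the real record layout.

```python
import os


def read_expert_into_slot(fd, expert_id, record_bytes, bank, slot):
    """Copy one expert record from the file into bank[slot]; return bytes copied.

    Assumes (hypothetically) fixed-size records stored in expert-id order.
    """
    offset = expert_id * record_bytes
    dst = memoryview(bank)[slot * record_bytes:(slot + 1) * record_bytes]
    filled = 0
    # pread may return fewer bytes than requested, so loop until the slot is full.
    while filled < record_bytes:
        chunk = os.pread(fd, record_bytes - filled, offset + filled)
        if not chunk:
            raise EOFError(f"expert {expert_id}: file ended after {filled} bytes")
        dst[filled:filled + len(chunk)] = chunk
        filled += len(chunk)
    return filled
```

Because the bank is allocated once and slots are overwritten in place, a miss costs a read into existing memory rather than a fresh allocation, which is the property the persistent bank slots are meant to provide.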
Related production optimizations (on by default with sidecar):
- LME_NATIVE_PREFILL_FUSED=1: plan admission + read misses in one native pass
- LME_NATIVE_BANK_HANDLE=1: reuse native sidecar/bank handles across commits
- LME_NATIVE_ROUTE_MAP=1: update expert-id → slot map in native commit path
- LME_DIRECT_PREAD_EVAL_MODE=minimal: eval only changed slots after bank mutation
Fail-fast vs diagnostic Python fallback
Production fails fast when native extension or sidecar is incomplete. For intentional debugging only:
    LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 mach-serve ... --streaming

This path is slow (~6 tok/s prefill) and not a serving mode. See Troubleshooting.
Gather dispatch
LME_USE_GATHER_DISPATCH=1 (default on) routes production traffic through GatherSwitchGLU instead of legacy DiskSwitchGLU per-slot loops. Works across streaming, stacked, and bf16_streaming when the bank backend supports gather.
Memory and bank sizing
| Knob | Default | Purpose |
|---|---|---|
| --wired-gb | 9 | Metal wired memory limit |
| --expert-cache-gb | (profile-dependent) | Resident expert bank size for decode |
| --bank-capacity-per-layer | (none) | Cap slots per layer |
| LME_BANK_EVICTION_POLICY | lookahead (production) | Retain experts visible in future dispatch windows |
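
For a rough feel of how a cache budget translates into slots, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not the engine's sizing logic: it assumes the budget is split evenly across layers, that one slot holds exactly one expert record, and that "GB" means GiB. The function name and all example numbers are hypothetical.

```python
def slots_per_layer(expert_cache_gb, num_layers, bytes_per_expert, capacity_cap=None):
    """Estimate resident slots per layer for a given expert cache budget.

    Assumes an even split across layers and one expert record per slot.
    """
    budget_bytes = int(expert_cache_gb * 2**30)
    slots = budget_bytes // (num_layers * bytes_per_expert)
    if capacity_cap is not None:
        # Mirrors the idea of a per-layer cap such as --bank-capacity-per-layer.
        slots = min(slots, capacity_cap)
    return slots
```

With hypothetical values of 8 GiB, 40 layers, and 1.5 MiB per expert, `slots_per_layer(8, 40, 1_572_864)` gives 136 slots per layer; passing `capacity_cap=64` would limit that to 64.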
Eviction policy lookahead — under tight slot budgets, keeps experts needed soon rather than pure LRU.
Hit-only fast path — when all requested experts are already resident, skips miss planning, scatter, and cleanup.
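
The two behaviours above can be sketched together in a few lines. This is an illustrative model, not the engine's implementation: the function `admit`, the list-based bank, and the `upcoming` window are hypothetical. It evicts the resident expert whose next use in the upcoming window is farthest away (or absent), and returns immediately when every requested expert is already resident.

```python
def admit(resident, capacity, requested, upcoming):
    """Admit requested expert ids into `resident` (a list, mutated in place).

    `upcoming` lists expert ids expected in future dispatch windows.
    Returns the ids that were evicted to make room.
    """
    missing = [e for e in dict.fromkeys(requested) if e not in resident]
    if not missing:
        return []  # hit-only fast path: no miss planning, reads, or eviction

    protected = set(requested)
    if len(protected) > capacity:
        raise ValueError("request needs more slots than the bank has")

    upcoming = list(upcoming)

    def next_use(expert):
        # Experts absent from the window sort last, so they are evicted first.
        return upcoming.index(expert) if expert in upcoming else len(upcoming)

    evicted = []
    for expert in missing:
        if len(resident) >= capacity:
            # Never evict an expert that the current request itself needs.
            candidates = [r for r in resident if r not in protected]
            victim = max(candidates, key=next_use)
            resident.remove(victim)
            evicted.append(victim)
        resident.append(expert)
    return evicted
```

A pure LRU policy would ignore `upcoming` and could evict an expert that is needed again in the next window, which is the case lookahead is meant to avoid under tight slot budgets.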
Observability
GET /v1/cache/stats includes streaming_summary:
- hits / misses / evictions
- direct_pread_*, native_miss_*, bf16_pread_*
- hit_only_fastpath_*, inline_prev_prefetch_*
High evictions → raise --wired-gb and/or --expert-cache-gb, or confirm --streaming so prefill stays transient. See Troubleshooting.
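
To judge whether evictions are "high", it can help to turn the raw counters into ratios. The sketch below assumes the response is a JSON object with a `streaming_summary` object containing integer `hits`, `misses`, and `evictions` keys; the exact key names and nesting are assumptions based on the list above, and `summarize_streaming` is a hypothetical helper.

```python
def summarize_streaming(stats):
    """Derive hit rate and evictions per miss from a parsed cache-stats payload.

    Key names are assumed; missing counters are treated as zero.
    """
    summary = stats.get("streaming_summary", {})
    hits = summary.get("hits", 0)
    misses = summary.get("misses", 0)
    evictions = summary.get("evictions", 0)
    lookups = hits + misses
    return {
        "hit_rate": hits / lookups if lookups else None,
        "evictions_per_miss": evictions / misses if misses else None,
    }
```

The input would be the parsed body of GET /v1/cache/stats, for example from `json.loads(...)` on the response text. A ratio close to one eviction per miss suggests the bank is full and every new expert displaces another.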
Optional experiments (off in production)
| Flag | Notes |
|---|---|
| LME_ASYNC_PREFETCH=1 | Heuristic staged prefetch; opt-in |
| LME_INLINE_PREV_PREFETCH=1 | Previous-route prefetch hints |
| LME_GATHER_DISPATCH_GROUPING=expert | Sort prefill routes by expert |
| LME_FUSE_GATE_UP=1 | Fuse gate/up gather (streaming concat cache via LME_FUSE_GATE_UP_STREAMING=1) |
Related pages
- Architecture: GatherSwitchGLU and layout detection
- Installation: build lme_mlx_pread_ext
- Serving: --streaming startup contract