Expert Residency

Streaming, stacked, and bf16_streaming modes — sidecar format, direct-pread, bank sizing, and transient prefill with resident decode.

Expert residency is the core memory story for MoE on Apple Silicon. The routed experts of a large MoE checkpoint can far exceed the unified memory available to the GPU, so the engine keeps a bounded bank of hot experts and streams misses from disk, or holds everything resident when the checkpoint and machine allow.

Three residency modes

Mode	Selector	Experts live	When to use
`streaming`	`--streaming`	Disk (sliced safetensors + packed sidecar); bounded GPU bank	Memory-constrained machines; experts streamed from SSD into a bounded bank
`stacked`	`--no-streaming`	All compatible experts resident in GPU (`ResidentStackedExpertCache`)	Machines with enough RAM to hold every expert resident, using a stacked MLX export
`bf16_streaming`	`expert_residency=bf16_streaming` / `LME_BF16_STACKED_STREAMING=1`	Full-precision stacked HF/SwiftLM slices streamed into BF16 banks	Bakeoffs / full-precision comparison, not the production sidecar path

Use mach check to predict which mode fits a given checkpoint on your machine before loading weights.

mach-serve requires an explicit --streaming or --no-streaming choice. --streaming enables transient prefill + resident decode for streaming sidecar serving without manually setting LME_TRANSIENT_PREFILL=1.

Streaming (`--streaming`)

Experts on disk; BankCache / LayerExpertBank holds a working subset per layer.
Misses are read from SSD with pread, via native sidecar I/O or safetensors slices.
Transient prefill: prefill uses a shared transient scratch arena so large prompt windows do not pin the full expert set.
Resident decode: decode keeps a smaller resident bank; optional decode-arena reclaim and hot-expert pinning via LME_DECODE_RESIDENCY.

This is the default streaming path for full-expert sliced checkpoints; run mach check to confirm it fits your machine.
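The bounded-bank idea can be illustrated with a toy model. This is a minimal sketch, not the engine's BankCache or LayerExpertBank: the class name, the `load_fn` callback, and the plain LRU eviction are assumptions chosen for brevity (the production eviction policy is `lookahead`, described under Memory and bank sizing).

```python
from collections import OrderedDict


class ToyExpertBank:
    """Toy bounded bank for one layer: expert_id -> weights, LRU eviction."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: fetches one expert from disk
        self.slots = OrderedDict()      # insertion order doubles as recency order
        self.hits = self.misses = self.evictions = 0

    def get(self, expert_id):
        if expert_id in self.slots:
            self.hits += 1
            self.slots.move_to_end(expert_id)   # mark as most recently used
            return self.slots[expert_id]
        self.misses += 1
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)      # evict the least recently used expert
            self.evictions += 1
        weights = self.load_fn(expert_id)       # a miss costs one disk read
        self.slots[expert_id] = weights
        return weights


bank = ToyExpertBank(capacity=2, load_fn=lambda e: f"weights-{e}")
for expert_id in [0, 1, 0, 2, 1]:
    bank.get(expert_id)
print(bank.hits, bank.misses, bank.evictions)   # 1 4 2
```

With only two slots and three distinct experts in the route, most lookups miss. That is the situation a larger bank or a transient prefill arena is meant to avoid.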

Stacked (`--no-streaming`)

Requires axis-0 stacked MLX checkpoint compatible with ResidentStackedExpertCache.
No admission/eviction — all routed experts resident.
Fails fast if the checkpoint layout cannot support resident stacked banks.

Use only when you intentionally load every compatible quantized expert into memory.

BF16 streaming

Opt-in full-precision streaming for plain HF/SwiftLM stacked checkpoints.
Native safetensor pread into bounded BF16 banks (fallback: staged MLX load).
Counters: bf16_pread_*, staged_load_*.
Not the production quant sidecar path.

Transient prefill + resident decode

For quant sidecar streaming:

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Startup should report:

production_streaming_mode=streaming
expert_residency=streaming
transient_prefill=1
native_transient_prefill=1

Prefill experts are loaded transiently for the current prompt window; decode retains a resident expert bank sized by --expert-cache-gb and related knobs.
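A startup check along these lines could be scripted. The sketch below assumes the report appears as one bare `key=value` pair per line, as shown above; real log output may carry prefixes or timestamps that need stripping first. The function names are hypothetical.

```python
EXPECTED = {
    "production_streaming_mode": "streaming",
    "expert_residency": "streaming",
    "transient_prefill": "1",
    "native_transient_prefill": "1",
}


def parse_startup_report(text):
    """Collect key=value pairs from startup output, ignoring other lines."""
    report = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key and " " not in key:
            report[key] = value.strip()
    return report


def mismatches(report, expected=EXPECTED):
    """Map each unexpected key to (expected_value, reported_value_or_None)."""
    return {k: (v, report.get(k)) for k, v in expected.items() if report.get(k) != v}


report_text = """\
production_streaming_mode=streaming
expert_residency=streaming
transient_prefill=1
native_transient_prefill=1
"""
print(mismatches(parse_startup_report(report_text)))   # {}
```

An empty result means all four signals match. Any other result names the keys that differ or are missing.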

Expert sidecar format

Production direct-pread expects:

checkpoint/
├── config.json
├── *.safetensors
└── expert_sidecar/
    ├── layout.json
    ├── layer_00.bin
    ├── layer_01.bin
    └── ...

layout.json — metadata: format (lme-expert-sidecar-v1, GGUF v1/v2), layer count, expert count, record layout.
layer_XX.bin — packed per-layer expert records for native I/O.

Export with:

python scripts/export_expert_sidecar.py \
  --checkpoint /path/to/<your-engine-checkpoint> \
  --output /path/to/<your-engine-checkpoint>/expert_sidecar \
  --num-experts <routed-experts-per-layer> \
  --bits 4

--num-experts must match the model's routed expert count per layer (e.g. 256 for Qwen3.6-35B-A3B).

Missing or invalid sidecar → production fail-fast (sidecar_valid=0) unless diagnostic fallback is explicitly enabled.
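Before serving, the directory shape described above can be checked with a few lines of Python. This is a structural sketch only: it does not reproduce the engine's own `sidecar_valid` logic and does not interpret any field inside layout.json. The function name `check_sidecar` and the rule that layer files are numbered contiguously from 0 are assumptions.

```python
import json
import re
import tempfile
from pathlib import Path

LAYER_FILE = re.compile(r"layer_(\d+)\.bin")


def check_sidecar(checkpoint_dir):
    """Return a list of structural problems; an empty list means the shape looks right."""
    sidecar = Path(checkpoint_dir) / "expert_sidecar"
    if not sidecar.is_dir():
        return ["missing expert_sidecar directory"]
    problems = []
    layout = sidecar / "layout.json"
    if not layout.is_file():
        problems.append("missing layout.json")
    else:
        try:
            json.loads(layout.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"layout.json is not valid JSON: {exc}")
    matches = (LAYER_FILE.fullmatch(p.name) for p in sidecar.glob("layer_*.bin"))
    indices = sorted(int(m.group(1)) for m in matches if m)
    if not indices:
        problems.append("no layer_XX.bin files")
    elif indices != list(range(len(indices))):
        problems.append("layer files are not contiguous from layer_00")
    return problems


# Demo on a throwaway directory with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = check_sidecar(tmp)
    sidecar_dir = Path(tmp) / "expert_sidecar"
    sidecar_dir.mkdir()
    (sidecar_dir / "layout.json").write_text("{}")
    for i in (0, 1):
        (sidecar_dir / f"layer_{i:02d}.bin").write_bytes(b"")
    ok_result = check_sidecar(tmp)
print(empty_result, ok_result)
```

A passing check here says only that the files exist. The engine remains the authority on whether the sidecar contents are valid.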

Direct-pread fast path

lme_mlx_pread_ext preads sidecar bytes directly into persistent bank slots.

Signal	Meaning
`direct_pread=1`	Fast path active
`direct_pread_bytes` / `direct_pread_syscalls`	Runtime counters
`native_extension=ready`	Extension import succeeded
`fallback_policy=fail-fast`	Production default
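The idea of reading sidecar bytes at a computed offset into a reusable slot can be sketched in pure Python. This is an illustration under stated assumptions, not the `lme_mlx_pread_ext` implementation: it assumes fixed-size records packed back to back (the real record layout is defined by layout.json), and Python's `os.pread` returns a new bytes object, so this version copies once where a native extension could write into the slot directly.

```python
import os
import tempfile


def pread_into_slot(fd, slot, expert_id, record_size):
    """Read one fixed-size expert record into a preallocated slot buffer."""
    offset = expert_id * record_size            # assumed: records packed back to back
    view = memoryview(slot)
    done = 0
    while done < record_size:                   # pread may return fewer bytes than asked
        chunk = os.pread(fd, record_size - done, offset + done)
        if not chunk:
            raise EOFError(f"short read for expert {expert_id}")
        view[done:done + len(chunk)] = chunk
        done += len(chunk)
    return done


# Demo with a throwaway file standing in for a layer_XX.bin file.
record_size = 8
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"".join(bytes([i]) * record_size for i in range(4)))
    path = f.name

slot = bytearray(record_size)                   # persistent slot, reused across reads
fd = os.open(path, os.O_RDONLY)
try:
    pread_into_slot(fd, slot, expert_id=2, record_size=record_size)
finally:
    os.close(fd)
    os.unlink(path)
print(slot)
```

Because pread takes an explicit offset, it never moves a shared file position, so several readers can use one descriptor without seeking.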

Related production optimizations (on by default with sidecar):

LME_NATIVE_PREFILL_FUSED=1 — plan admission + read misses in one native pass
LME_NATIVE_BANK_HANDLE=1 — reuse native sidecar/bank handles across commits
LME_NATIVE_ROUTE_MAP=1 — update expert-id → slot map in native commit path
LME_DIRECT_PREAD_EVAL_MODE=minimal — eval only changed slots after bank mutation

Fail-fast vs diagnostic Python fallback

Production fails fast when the native extension is unavailable or the sidecar is missing or invalid. For intentional debugging only:

LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 mach-serve ... --streaming

This path is slow (~6 tok/s prefill) and not a serving mode. See Troubleshooting.

Gather dispatch

LME_USE_GATHER_DISPATCH=1 (default on) routes production traffic through GatherSwitchGLU instead of legacy DiskSwitchGLU per-slot loops. Works across streaming, stacked, and bf16_streaming when the bank backend supports gather.

Memory and bank sizing

Knob	Default	Purpose
`--wired-gb`	9	Metal wired memory limit
`--expert-cache-gb`	(profile-dependent)	Resident expert bank size for decode
`--bank-capacity-per-layer`	—	Cap slots per layer
`LME_BANK_EVICTION_POLICY`	`lookahead` (production)	Retain experts visible in future dispatch windows
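For a rough sense of how these knobs interact, the following back-of-envelope sketch estimates resident slots per layer. It assumes the cache budget is counted in GiB and split evenly across layers, which may not match the engine's accounting. The function name and the example numbers are illustrative, not measured values.

```python
def slots_per_layer(expert_cache_gb, num_layers, expert_bytes, num_experts, cap=None):
    """Estimate how many experts fit per layer under an even split of the budget."""
    budget = int(expert_cache_gb * 1024**3)         # assumed: GiB, not decimal GB
    slots = budget // num_layers // expert_bytes
    slots = min(slots, num_experts)                 # never more slots than experts
    if cap is not None:
        slots = min(slots, cap)                     # analogous to --bank-capacity-per-layer
    return slots


# Illustrative numbers: 2 GiB budget, 32 layers, 1 MiB per expert, 256 experts per layer.
print(slots_per_layer(2, 32, 1024**2, 256))             # 64
print(slots_per_layer(2, 32, 1024**2, 256, cap=48))     # 48
```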

Eviction policy lookahead: under tight slot budgets, the bank keeps experts that upcoming dispatch windows will need instead of evicting purely by recency (LRU).

Hit-only fast path — when all requested experts are already resident, skips miss planning, scatter, and cleanup.
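Both behaviours can be sketched in a few lines. This is a simplified model with hypothetical function names: the engine's actual lookahead window, tie-breaking, and fast-path conditions are not documented here, and the LRU fallback when every resident expert is needed soon is an assumption.

```python
def pick_victim(resident_lru_order, upcoming):
    """Choose an expert to evict.

    resident_lru_order: resident expert ids, least recently used first.
    upcoming: set of expert ids visible in future dispatch windows.
    """
    for expert_id in resident_lru_order:
        if expert_id not in upcoming:
            return expert_id                # oldest expert that is not needed soon
    return resident_lru_order[0]            # all are needed soon: fall back to LRU


def all_resident(requested, resident):
    """Hit-only check: True when no miss planning is needed for this request."""
    return set(requested) <= set(resident)


print(pick_victim([3, 7, 1], upcoming={3, 1}))   # 7
print(all_resident([1, 3], [3, 7, 1]))           # True
```

In the first call, expert 3 is the least recently used, but it appears in the upcoming window, so expert 7 is evicted instead.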

Observability

GET /v1/cache/stats includes streaming_summary:

hits / misses / evictions
direct_pread_*, native_miss_*, bf16_pread_*
hit_only_fastpath_*, inline_prev_prefetch_*

A high eviction count suggests the resident bank is too small for the workload: raise --wired-gb and/or --expert-cache-gb, or confirm that --streaming is set so prefill stays transient. See Troubleshooting.
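A small script can turn these counters into ratios that are easier to watch over time. The sketch assumes `streaming_summary` is a top-level key in the JSON response and that `hits`, `misses`, and `evictions` are integer fields inside it. The exact response shape is not specified here, so adjust the lookups to what your server returns. The port matches the earlier `--port 8080` example.

```python
import json
from urllib.request import urlopen


def fetch_streaming_summary(base_url="http://localhost:8080"):
    """Fetch cache stats from a running server (assumed response shape)."""
    with urlopen(f"{base_url}/v1/cache/stats") as resp:
        return json.load(resp)["streaming_summary"]


def summarize(summary):
    """Derive hit rate and evictions per miss from raw counters."""
    hits = summary.get("hits", 0)
    misses = summary.get("misses", 0)
    evictions = summary.get("evictions", 0)
    total = hits + misses
    return {
        "hit_rate": hits / total if total else 0.0,
        "evictions_per_miss": evictions / misses if misses else 0.0,
    }


# Offline example with made-up counters; call fetch_streaming_summary() against a live server.
sample = {"hits": 900, "misses": 100, "evictions": 80}
print(summarize(sample))   # {'hit_rate': 0.9, 'evictions_per_miss': 0.8}
```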

Optional experiments (off in production)

Flag	Notes
`LME_ASYNC_PREFETCH=1`	Heuristic staged prefetch; opt-in
`LME_INLINE_PREV_PREFETCH=1`	Previous-route prefetch hints
`LME_GATHER_DISPATCH_GROUPING=expert`	Sort prefill routes by expert
`LME_FUSE_GATE_UP=1`	Fuse gate/up gather (streaming concat cache via `LME_FUSE_GATE_UP_STREAMING=1`)

Architecture — GatherSwitchGLU and layout detection
Installation — build lme_mlx_pread_ext
Serving — --streaming startup contract
