Mach
MoE-specialized inference for Apple Silicon — disk-streamed experts, DFlash speculative decoding, and an OpenAI-compatible serving API.
mach is a Mixture-of-Experts (MoE) inference engine built for Apple Silicon. It runs on MLX and mlx-lm, streams routed experts from disk into bounded GPU banks, and serves an OpenAI-compatible HTTP API through mach-serve.
The engine is architecture-generic across the supported MoE families. Its production fast path — quantized sidecar artifacts, native direct-pread I/O, transient prefill with resident decode, DFlash v7 speculative decoding, and prefix caching for agent workloads — applies to any supported full-expert checkpoint. Qwen3.6-35B-A3B is the reference checkpoint used throughout these examples.
Platform requirements
| Requirement | Details |
|---|---|
| Hardware | Apple Silicon Mac (Metal) |
| OS | macOS |
| Python | 3.12+ |
| Primary stack | MLX, mlx-lm, optional dflash-mlx for spec-dec |
For the full production fast path you also need the native extensions (lme_mlx_pread_ext, liblme_expert_io) and a checkpoint with a valid expert_sidecar/ layout. See Installation.
Two operating modes
mach-serve exposes one top-level selector: --streaming or --no-streaming. Backend and profile defaults are implied; you do not need separate --backend production flags for normal use.
| Mode | Flag | When to use |
|---|---|---|
| Streaming | --streaming | Experts streamed from SSD into a bounded GPU bank; full-expert sliced sidecar checkpoints; large prefill windows. Transient prefill + resident decode. |
| Resident stacked | --no-streaming | Machines with enough RAM to hold all experts resident, using a stacked-compatible MLX checkpoint. Fails fast if layout is incompatible. |
Use mach check to predict whether a given checkpoint fits your machine (fits / tight / wont_fit) before loading weights.
A third residency mode, bf16_streaming, is opt-in for full-precision bakeoffs against plain HF/SwiftLM stacked checkpoints. See Expert residency.
Generic, architecture-agnostic serving without the production optimizations uses --backend openai (mlx-lm path). See Serving.
Mental model
flowchart LR
ckpt["Checkpoint (config.json + safetensors + expert_sidecar/)"] --> load["load_moe(): detect arch, swap expert blocks"]
load --> residency["Expert residency: streaming | stacked | bf16_streaming"]
residency --> dispatch["GatherSwitchGLU dispatch via bounded GPU bank"]
dispatch --> decode["Decode loop"]
decode -->|"default"| specdec["DFlash v7 spec-dec (draft → verify → accept)"]
decode -->|"opt-in"| batch["Continuous batching scheduler (B up to 4)"]
specdec --> serve["mach-serve: OpenAI /v1 API + SSE"]
batch --> serve
caches["Prefix cache (L2 disk + GDN snapshots) + optional TurboQuant KV"] --- decodeSupported architectures
Registered MoE families share a common load, swap, and dispatch pipeline. detect_arch() matches config.json against REGISTRY (MoeArchAdapter).
model_type | Reference target | Notes |
|---|---|---|
qwen3_5_moe | Qwen3.6-35B-A3B | Hybrid GDN + full attention |
qwen3_moe | Qwen3-Coder-30B-A3B-Instruct | Pure attention |
gpt_oss | gpt-oss-20b | Pure attention + sliding window/sinks |
gemma4 | Gemma 4 26B MoE | Pure attention with multimodal wrap |
deepseek_v4 | DeepSeek-V4-Flash | Hybrid routing; engine layout required; MTP/EAGLE drafts with rho-gate |
This table is the human-readable view of the in-code registry. Run mach-archs (or mach-archs --json) to print the same set with each architecture's static facts straight from the engine; preflight a specific checkpoint — its support, runnable serving paths, required artifacts, and memory fit — with the static mach check before paying a load; and ask a running server about the loaded model with GET /v1/capabilities.
Details on the load pipeline and block swapping are in Architecture.
Subsystems
| Topic | What it covers |
|---|---|
| Installation | pip extras, native extensions, tests |
| Architecture | load_moe, REGISTRY, DiskSwitchGLU vs GatherSwitchGLU |
| Expert residency | Streaming, stacked, sidecar format, direct-pread, memory knobs |
| Speculative decoding | DFlash v7, adaptive blocks, acceptance, DeepSeek-V4 rho-gate |
| Continuous batching | Single-flight default, opt-in scheduler, hybrid spec-dec |
| Caching | Prefix cache L1/L2, warm_prefix, TurboQuant KV |
| Serving | mach-serve, endpoints, startup signals |
| CLI | mach-archs, mach-models, mach check, mach-generate, mach convert, mach-prune, mach-reap, artifacts |
| Conversion | mach convert — master → 2-bit IQ2 GGUF, ConversionConfig seam, gates |
| Library | load(), load_moe(), BlockSpecDecEngine API |
| Maniac integration | Vendored snapshot, desktop spawn path, env overrides |
| Troubleshooting | Slow-path diagnostics and fixes |
Quick start (production)
pip install -e ".[dev,dflash,native]"
python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"
mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080Point OpenAI-compatible clients at http://127.0.0.1:8080/v1 with any API key string. Confirm the fast path via startup logs (decode_path=specdec-v7, direct_pread=1, native_extension=ready) or GET /v1/stats.