Mach

MoE-specialized inference for Apple Silicon — disk-streamed experts, DFlash speculative decoding, and an OpenAI-compatible serving API.

mach is a Mixture-of-Experts (MoE) inference engine built for Apple Silicon. It runs on MLX and mlx-lm, streams routed experts from disk into bounded GPU banks, and serves an OpenAI-compatible HTTP API through mach-serve.

The engine is architecture-generic across the supported MoE families. Its production fast path — quantized sidecar artifacts, native direct-pread I/O, transient prefill with resident decode, DFlash v7 speculative decoding, and prefix caching for agent workloads — applies to any supported full-expert checkpoint. Qwen3.6-35B-A3B is the reference checkpoint used throughout these examples.

Platform requirements

Requirement	Details
Hardware	Apple Silicon Mac (Metal)
OS	macOS
Python	3.12+
Primary stack	MLX, mlx-lm, optional `dflash-mlx` for spec-dec

For the full production fast path you also need the native extensions (lme_mlx_pread_ext, liblme_expert_io) and a checkpoint with a valid expert_sidecar/ layout. See Installation.

Two operating modes

mach-serve exposes one top-level selector: --streaming or --no-streaming. Backend and profile defaults are implied; you do not need separate --backend production flags for normal use.

Mode	Flag	When to use
Streaming	`--streaming`	Experts streamed from SSD into a bounded GPU bank; full-expert sliced sidecar checkpoints; large prefill windows. Transient prefill + resident decode.
Resident stacked	`--no-streaming`	Machines with enough RAM to hold all experts resident, using a stacked-compatible MLX checkpoint. Fails fast if layout is incompatible.

Use mach check to predict whether a given checkpoint fits your machine (fits / tight / wont_fit) before loading weights.

A third residency mode, bf16_streaming, is opt-in for full-precision bakeoffs against plain HF/SwiftLM stacked checkpoints. See Expert residency.

Generic, architecture-agnostic serving without the production optimizations uses --backend openai (mlx-lm path). See Serving.

Mental model

flowchart LR
  ckpt["Checkpoint (config.json + safetensors + expert_sidecar/)"] --> load["load_moe(): detect arch, swap expert blocks"]
  load --> residency["Expert residency: streaming | stacked | bf16_streaming"]
  residency --> dispatch["GatherSwitchGLU dispatch via bounded GPU bank"]
  dispatch --> decode["Decode loop"]
  decode -->|"default"| specdec["DFlash v7 spec-dec (draft → verify → accept)"]
  decode -->|"opt-in"| batch["Continuous batching scheduler (B up to 4)"]
  specdec --> serve["mach-serve: OpenAI /v1 API + SSE"]
  batch --> serve
  caches["Prefix cache (L2 disk + GDN snapshots) + optional TurboQuant KV"] --- decode

Supported architectures

Registered MoE families share a common load, swap, and dispatch pipeline. detect_arch() matches config.json against REGISTRY (MoeArchAdapter).

`model_type`	Reference target	Notes
`qwen3_5_moe`	Qwen3.6-35B-A3B	Hybrid GDN + full attention
`qwen3_moe`	Qwen3-Coder-30B-A3B-Instruct	Pure attention
`gpt_oss`	gpt-oss-20b	Pure attention + sliding window/sinks
`gemma4`	Gemma 4 26B MoE	Pure attention with multimodal wrap
`deepseek_v4`	DeepSeek-V4-Flash	Hybrid routing; engine layout required; MTP/EAGLE drafts with rho-gate

This table is the human-readable view of the in-code registry. Run mach-archs (or mach-archs --json) to print the same set with each architecture's static facts straight from the engine; preflight a specific checkpoint — its support, runnable serving paths, required artifacts, and memory fit — with the static mach check before paying a load; and ask a running server about the loaded model with GET /v1/capabilities.

Details on the load pipeline and block swapping are in Architecture.

Subsystems

Topic	What it covers
Installation	pip extras, native extensions, tests
Architecture	`load_moe`, `REGISTRY`, `DiskSwitchGLU` vs `GatherSwitchGLU`
Expert residency	Streaming, stacked, sidecar format, direct-pread, memory knobs
Speculative decoding	DFlash v7, adaptive blocks, acceptance, DeepSeek-V4 rho-gate
Continuous batching	Single-flight default, opt-in scheduler, hybrid spec-dec
Caching	Prefix cache L1/L2, `warm_prefix`, TurboQuant KV
Serving	`mach-serve`, endpoints, startup signals
CLI	`mach-archs`, `mach-models`, `mach check`, `mach-generate`, `mach convert`, `mach-prune`, `mach-reap`, artifacts
Conversion	`mach convert` — master → 2-bit IQ2 GGUF, `ConversionConfig` seam, gates
Library	`load()`, `load_moe()`, `BlockSpecDecEngine` API
Maniac integration	Vendored snapshot, desktop spawn path, env overrides
Troubleshooting	Slow-path diagnostics and fixes

Quick start (production)

pip install -e ".[dev,dflash,native]"
python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Point OpenAI-compatible clients at http://127.0.0.1:8080/v1 with any API key string. Confirm the fast path via startup logs (decode_path=specdec-v7, direct_pread=1, native_extension=ready) or GET /v1/stats.