Maniac Docs

Mach

MoE-specialized inference for Apple Silicon — disk-streamed experts, DFlash speculative decoding, and an OpenAI-compatible serving API.

mach is a Mixture-of-Experts (MoE) inference engine built for Apple Silicon. It runs on MLX and mlx-lm, streams routed experts from disk into bounded GPU banks, and serves an OpenAI-compatible HTTP API through mach-serve.

The engine is architecture-generic across the supported MoE families. Its production fast path — quantized sidecar artifacts, native direct-pread I/O, transient prefill with resident decode, DFlash v7 speculative decoding, and prefix caching for agent workloads — applies to any supported full-expert checkpoint. Qwen3.6-35B-A3B is the reference checkpoint used throughout these examples.

Platform requirements

RequirementDetails
HardwareApple Silicon Mac (Metal)
OSmacOS
Python3.12+
Primary stackMLX, mlx-lm, optional dflash-mlx for spec-dec

For the full production fast path you also need the native extensions (lme_mlx_pread_ext, liblme_expert_io) and a checkpoint with a valid expert_sidecar/ layout. See Installation.

Two operating modes

mach-serve exposes one top-level selector: --streaming or --no-streaming. Backend and profile defaults are implied; you do not need separate --backend production flags for normal use.

ModeFlagWhen to use
Streaming--streamingExperts streamed from SSD into a bounded GPU bank; full-expert sliced sidecar checkpoints; large prefill windows. Transient prefill + resident decode.
Resident stacked--no-streamingMachines with enough RAM to hold all experts resident, using a stacked-compatible MLX checkpoint. Fails fast if layout is incompatible.

Use mach check to predict whether a given checkpoint fits your machine (fits / tight / wont_fit) before loading weights.

A third residency mode, bf16_streaming, is opt-in for full-precision bakeoffs against plain HF/SwiftLM stacked checkpoints. See Expert residency.

Generic, architecture-agnostic serving without the production optimizations uses --backend openai (mlx-lm path). See Serving.

Mental model

flowchart LR
  ckpt["Checkpoint (config.json + safetensors + expert_sidecar/)"] --> load["load_moe(): detect arch, swap expert blocks"]
  load --> residency["Expert residency: streaming | stacked | bf16_streaming"]
  residency --> dispatch["GatherSwitchGLU dispatch via bounded GPU bank"]
  dispatch --> decode["Decode loop"]
  decode -->|"default"| specdec["DFlash v7 spec-dec (draft → verify → accept)"]
  decode -->|"opt-in"| batch["Continuous batching scheduler (B up to 4)"]
  specdec --> serve["mach-serve: OpenAI /v1 API + SSE"]
  batch --> serve
  caches["Prefix cache (L2 disk + GDN snapshots) + optional TurboQuant KV"] --- decode

Supported architectures

Registered MoE families share a common load, swap, and dispatch pipeline. detect_arch() matches config.json against REGISTRY (MoeArchAdapter).

model_typeReference targetNotes
qwen3_5_moeQwen3.6-35B-A3BHybrid GDN + full attention
qwen3_moeQwen3-Coder-30B-A3B-InstructPure attention
gpt_ossgpt-oss-20bPure attention + sliding window/sinks
gemma4Gemma 4 26B MoEPure attention with multimodal wrap
deepseek_v4DeepSeek-V4-FlashHybrid routing; engine layout required; MTP/EAGLE drafts with rho-gate

This table is the human-readable view of the in-code registry. Run mach-archs (or mach-archs --json) to print the same set with each architecture's static facts straight from the engine; preflight a specific checkpoint — its support, runnable serving paths, required artifacts, and memory fit — with the static mach check before paying a load; and ask a running server about the loaded model with GET /v1/capabilities.

Details on the load pipeline and block swapping are in Architecture.

Subsystems

TopicWhat it covers
Installationpip extras, native extensions, tests
Architectureload_moe, REGISTRY, DiskSwitchGLU vs GatherSwitchGLU
Expert residencyStreaming, stacked, sidecar format, direct-pread, memory knobs
Speculative decodingDFlash v7, adaptive blocks, acceptance, DeepSeek-V4 rho-gate
Continuous batchingSingle-flight default, opt-in scheduler, hybrid spec-dec
CachingPrefix cache L1/L2, warm_prefix, TurboQuant KV
Servingmach-serve, endpoints, startup signals
CLImach-archs, mach-models, mach check, mach-generate, mach convert, mach-prune, mach-reap, artifacts
Conversionmach convert — master → 2-bit IQ2 GGUF, ConversionConfig seam, gates
Libraryload(), load_moe(), BlockSpecDecEngine API
Maniac integrationVendored snapshot, desktop spawn path, env overrides
TroubleshootingSlow-path diagnostics and fixes

Quick start (production)

pip install -e ".[dev,dflash,native]"
python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Point OpenAI-compatible clients at http://127.0.0.1:8080/v1 with any API key string. Confirm the fast path via startup logs (decode_path=specdec-v7, direct_pread=1, native_extension=ready) or GET /v1/stats.

On this page