Run models bigger than your memory
Mach is a Mixture-of-Experts inference engine built for Apple Silicon. It streams experts off your SSD as they're routed, drafts tokens ahead of the model, and serves an OpenAI-compatible API — entirely on your Mac.
$ mach-serve ./qwen3.6-35b-a3b --streaming --port 8080
detected qwen3_5_moe · sliced layout + expert sidecar
expert_residency=streaming transient_prefill=1
native_extension=ready direct_pread=1
decode_path=specdec-v7 prefix_cache=on
serving on http://127.0.0.1:8080/v1 2-bit
quantized experts (IQ2 GGUF)
16
tokens drafted per cycle
5.5×
max KV compression (opt-in TurboQuant)
5
MoE families, one pipeline
Experts live on your SSD, not in your RAM
A 35B mixture-of-experts model only activates a few billion parameters per token. Mach exploits that: routed experts stay on disk in a packed sidecar format while a bounded bank of hot experts lives on the GPU, and misses are read straight into place through a native direct-pread path — no Python on the hot loop. Prefill runs through a transient scratch arena so a long prompt never pins the full expert set, and `mach check` predicts whether a checkpoint fits your machine before you download a single weight.
How expert residency worksVerified in blocks, not token by token
The default decode path is DFlash speculative decoding: a small draft head proposes up to 16 tokens, the target model verifies the whole block in one parallel pass, and the longest accepted prefix commits at once. Acceptance is exactness-preserving — greedy output is token-identical to running the target alone. Per-request presets add draft-free speculation for summarization and classification workloads.
Inside DFlash speculative decodingAgents repeat themselves. Mach notices.
Every agent turn re-sends the same system prompt and tool definitions. Mach keeps a two-level prefix cache — hot snapshots in memory, a keyed KV store on disk — and restores it instead of re-prefilling, so repeated prefixes skip straight to new tokens. A warm_prefix endpoint lets Maniac warm a session before your first message, and the adaptive draft policy is tuned for tool-calling output.
Prefix caching in depthFrom a bf16 master to a Mac-sized checkpoint in one command
mach convert turns a higher-precision MoE master into a servable 2-bit IQ2 GGUF checkpoint: slice, calibrate with an importance matrix, quantize, pack, and certify. Validation gates compare the result against the bf16 teacher, so 2-bit is a measured trade — not a leap of faith. Every routed expert is kept; nothing is pruned.
The conversion pipelineOne pipeline, five families — and an open API
Every supported architecture flows through the same load, swap, dispatch, and decode pipeline. The server speaks the OpenAI API, so any client, SDK, or coding agent can point at localhost — and ask /v1/capabilities what the loaded model can actually do.
Serving & API referenceQwen3.6-35B-A3B
hybrid GDN + attention
Qwen3-Coder-30B-A3B
pure attention
gpt-oss-20b
sliding window + sinks
Gemma 4 26B
multimodal wrap
DeepSeek-V4-Flash
MTP / EAGLE drafts
GET /v1/capabilities · ask the server, don't guess
Zero terminals required
Maniac manages the engine for you: it installs the runtime, picks the right serving mode for each checkpoint, spawns the server, and shows live throughput, time-to-first-token, and memory in the Server panel. Everything on this page is also a pip install away — same engine, your terminal.
How Maniac integrates Mach
Local inference, without the leftovers
Download Maniac and the engine comes with it — or read the docs and run mach-serve yourself.