Mach · the engine inside Maniac

Run models bigger than your memory

Mach is a Mixture-of-Experts inference engine built for Apple Silicon. It streams experts off your SSD as they're routed, drafts tokens ahead of the model, and serves an OpenAI-compatible API — entirely on your Mac.

Download for Mac Read the engine docs

mach-serve

$ mach-serve ./qwen3.6-35b-a3b --streaming --port 8080
detected qwen3_5_moe · sliced layout + expert sidecar
expert_residency=streaming transient_prefill=1
native_extension=ready direct_pread=1
decode_path=specdec-v7 prefix_cache=on
serving on http://127.0.0.1:8080/v1

2-bit

quantized experts (IQ2 GGUF)

tokens drafted per cycle

5.5×

max KV compression (opt-in TurboQuant)

MoE families, one pipeline

Experts live on your SSD, not in your RAM

A 35B mixture-of-experts model only activates a few billion parameters per token. Mach exploits that: routed experts stay on disk in a packed sidecar format while a bounded bank of hot experts lives on the GPU, and misses are read straight into place through a native direct-pread path — no Python on the hot loop. Prefill runs through a transient scratch arena so a long prompt never pins the full expert set, and `mach check` predicts whether a checkpoint fits your machine before you download a single weight.

How expert residency works

expert streamingall routes resident

expert_sidecar/ · ssdlayer_00.binlayer_01.binlayer_02.binlayer_03.bin…

pread

bounded gpu bank · 24 of 256 experts resident

e011e048e085e122e159e196e233e014e051e088e125e162e199e236e017e054e091e128e165e202e239e020e057e094

hits 1,204misses 86evictions 61direct_pread=1

Verified in blocks, not token by token

The default decode path is DFlash speculative decoding: a small draft head proposes up to 16 tokens, the target model verifies the whole block in one parallel pass, and the longest accepted prefix commits at once. Acceptance is exactness-preserving — greedy output is token-identical to running the target alone. Per-request presets add draft-free speculation for summarization and classification workloads.

Inside DFlash speculative decoding

speculative decodingdraft → verify

Here'sthesummaryyouaskedfor.Revenuerose18%monthquarterovermonth,drivenbythenewenterprisetierlaunch.

draft · 8 layers proposetarget · 35b verifies19 tokens accepted

2.4× decode speedup

Agents repeat themselves. Mach notices.

Every agent turn re-sends the same system prompt and tool definitions. Mach keeps a two-level prefix cache — hot snapshots in memory, a keyed KV store on disk — and restores it instead of re-prefilling, so repeated prefixes skip straight to new tokens. A warm_prefix endpoint lets Maniac warm a session before your first message, and the adaptive draft policy is tuned for tool-calling output.

Prefix caching in depth

time to first tokensame prefix, two turns

turn 1 · coldfull prefill of system + tools

first token

turn 2 · prefix cache hitKV restored from disk — prefill skipped

first token

L1 · in-memory snapshotsL2 · disk KV cachePOST /v1/cache/warm_prefix

From a bf16 master to a Mac-sized checkpoint in one command

mach convert turns a higher-precision MoE master into a servable 2-bit IQ2 GGUF checkpoint: slice, calibrate with an importance matrix, quantize, pack, and certify. Validation gates compare the result against the bf16 teacher, so 2-bit is a measured trade — not a leap of faith. Every routed expert is kept; nothing is pruned.

The conversion pipeline

mach convertresumable phases

slice

calibrate

quantize

pack

gates

bf16 expert weights16-bit

2-bit IQ2 GGUF8× smaller

imatrix calibrationevery expert keptgates · certified vs bf16 teacher

One pipeline, five families — and an open API

Every supported architecture flows through the same load, swap, dispatch, and decode pipeline. The server speaks the OpenAI API, so any client, SDK, or coding agent can point at localhost — and ask /v1/capabilities what the loaded model can actually do.

Serving & API reference

Qwen3.6-35B-A3B

hybrid GDN + attention

Qwen3-Coder-30B-A3B

pure attention

gpt-oss-20b

sliding window + sinks

Gemma 4 26B

multimodal wrap

DeepSeek-V4-Flash

MTP / EAGLE drafts

GET /v1/capabilities · ask the server, don't guess

Zero terminals required

Maniac manages the engine for you: it installs the runtime, picks the right serving mode for each checkpoint, spawns the server, and shows live throughput, time-to-first-token, and memory in the Server panel. Everything on this page is also a pip install away — same engine, your terminal.

How Maniac integrates Mach

The Maniac desktop app running a local model, with a chat using Gmail, Slack, Calendar and Notes alongside a live local inference server panel.

Local inference, without the leftovers

Download Maniac and the engine comes with it — or read the docs and run mach-serve yourself.

Download for Mac Read the Mach docs