Maniac Docs

Serving

Run mach-serve on the production fast path or generic OpenAI backend, confirm startup signals, and use HTTP endpoints.

mach-serve is the primary entry point for OpenAI-compatible HTTP serving on Apple Silicon.

Production fast path

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Pick exactly one residency mode:

FlagResidency
--streamingTransient prefill + resident decode (experts streamed from SSD sidecar)
--no-streamingFull resident stacked (compatible checkpoint that fits in RAM)

Production stack (implicit):

  • DFlash v7 + opencode-sampled-v1 adaptive blocks
  • Native expert sidecar I/O + direct pread
  • Gather dispatch + vLLM-parity tool handling
  • Prefix cache (L2 on disk)

Compatibility aliases (not primary docs): --backend production, --backend dflash, scripts/serve_production.py.

Generic OpenAI backend

Architecture-agnostic serving without DFlash production optimizations:

mach-serve mlx-community/Qwen3.6-35B-A3B-MLX-4bit --backend openai --port 8000

Use for quick mlx-lm compatibility checks, not production benches on engine-format sidecar artifacts.

Startup signals

Confirm the fast path in stdout logs:

startup mode profile=production production_streaming_mode=streaming ... transient_prefill=1 ... expert_residency=streaming ...
production optimizations direct_pread=1 ... native_prefill_fused=1 ... async_prefetch=0 ... resident_metal=0
startup direct-pread expected=1 sidecar_present=1 sidecar_valid=1 ... native_extension=ready ... fallback_policy=fail-fast
startup decode_path=specdec-v7 adaptive_block_policy=opencode-sampled-v1
SignalExpected (production streaming)
production_streaming_modestreaming
transient_prefill1
expert_residencystreaming
direct_pread1
native_extensionready
decode_pathspecdec-v7
adaptive_block_policyopencode-sampled-v1

Red flags: sidecar_valid=0, native_extension=missing, fallback_policy=diagnostic-python. See Troubleshooting.

OpenAI client configuration

Base URL: http://127.0.0.1:PORT/v1

API key: any non-empty string (server does not validate keys locally).

Example with curl smoke script:

./examples/curl_smoke.sh http://127.0.0.1:8080

OpenCode example:

export XDG_CONFIG_HOME=$(pwd)/.opencode-config
opencode run "Write a binary search in Python"

HTTP endpoints

MethodPathPurpose
GET/v1/modelsModel list (readiness probe for Maniac Desktop)
GET/v1/capabilitiesLoaded model's architecture + active serving path (both backends)
POST/v1/chat/completionsChat completions — streaming SSE, tools, JSON schema
GET/v1/statsDecode stats, adaptive block policy, DFlash counters
GET/v1/cache/statsPrefix cache + streaming_summary expert bank rollup
GET/v1/cache/warm_prefixWarm-prefix status
POST/v1/cache/warm_prefixPrefill and cache stable system/tool prefix

Capabilities

GET /v1/capabilities reports, for the currently loaded model, which architecture it is and what serving path is active. It is served by both the production fast path (--streaming / --no-streaming) and the generic --backend openai path.

curl -s http://127.0.0.1:8080/v1/capabilities | jq .
{
  "model": "your-engine-checkpoint",
  "backend": "production",
  "supported": true,
  "arch": {
    "model_type": "qwen3_5_moe",
    "swiglu_variant": "plain_swiglu",
    "stacked_supported": true,
    "has_expert_bias": false,
    "text_config_wrapped": true,
    "model_module": "mach.models.qwen3_5_moe",
    "default_bits": 4,
    "default_group_size": 64,
    "kv_cache_kind": "hybrid_gdn",
    "specdec_kind": "specdec_v7",
    "continuous_batching": "ok"
  },
  "serving": {
    "expert_residency": "streaming",
    "serving_mode": "streaming",
    "decode_path": "specdec-v7",
    "effective_decode_path": "specdec_v7",
    "effective_decode_path_reason": "specdec_v7 draft is usable",
    "continuous_batching": false,
    "turboquant_kv": false,
    "decode_policy": { "mode": "greedy-specdec", "engine_mode": "native_v7" },
    "speculation": {
      "supported_presets": ["none", "summarization", "classification"],
      "additive": true,
      "exactness_preserving": true
    },
    "rho_gate": null
  },
  "grammar": { "xgrammar": true, "lm_format_enforcer": false }
}
FieldMeaning
modelLoaded model id
backendproduction or openai
supportedWhether the loaded model_type is in the registry (arch is null when not)
archStatic architecture facts — the same arch_summary that mach-archs --json prints
servingLive config of this process — expert residency, serving mode, decode path, the arch-gated effective_decode_path (+ reason), continuous batching, the decode policy, and the speculation capability block
grammarWhich constrained-decoding backends are importable (xgrammar, lm_format_enforcer)

decode_path vs effective_decode_path

decode_path is the configured path label; effective_decode_path is what the arch gate (_resolve_effective_decode_path) actually resolved at load, one of specdec_v7, target_only, v4_mtp, or v4_eagle3, with a one-line effective_decode_path_reason. The gate reuses the same specdec predicate as mach check, so a stray --draft-dir on a non-Qwen architecture (or a missing/unusable draft) cleanly falls back to target_only instead of mis-constructing the DFlash stack. See Target-only mode.

speculation

The speculation block advertises the draft-free request presets the loaded engine accepts: supported_presets (["none", "summarization", "classification"] on the DFlash and target-only paths; ["none"] on engines without a speculation seam, e.g. DeepSeek-V4 MTP/EAGLE), plus additive: true and exactness_preserving: true.

The generic openai backend returns backend: "openai" with a backend-specific serving block (serving_mode: "openai-mlx-stream-generate", a decode_path of spec-dec or target-only, and a disk_kv flag).

This endpoint reports only what the running server is actually doing. The full path × architecture support matrix (which paths each arch could run) is intentionally out of scope here — that is the job of the static, pre-load mach check preflight, which predicts a checkpoint's runnable paths, required artifacts, and memory fit without loading weights.

Chat completions

Supports:

  • stream: true — SSE token deltas
  • Tool calling — OpenAI/vLLM-compatible parsing
  • JSON schema response format
  • Optional --continuous-batching — see Continuous batching

Headers of interest when batching is enabled: X-Decode-Mode, X-Active-Loops.

Runtime observability

curl -s http://127.0.0.1:8080/v1/stats | jq .
curl -s http://127.0.0.1:8080/v1/cache/stats | jq .

Confirm adaptive_block_policy.name=opencode-sampled-v1, alpha_window=4, alpha_threshold=0.40, low_block=4, default_block=16.

Memory tuning examples

Default streaming (sidecar experts streamed, bounded decode bank):

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Larger resident decode bank when you have spare RAM:

mach-serve /path/to/<your-engine-checkpoint> --streaming --expert-cache-gb 8 --port 8080

Resident stacked (compatible checkpoint that fits in RAM):

mach-serve /path/to/stacked-mlx-checkpoint --no-streaming --port 8080

To size these knobs for a specific machine, run mach check <checkpoint> --expert-cache-gb N, which reports a fits / tight / wont_fit verdict without loading weights.

On this page