Conversion
mach convert — one command to turn a higher-precision MoE master into a servable 2-bit IQ2 GGUF checkpoint, with the per-arch ConversionConfig seam and validation gates.
mach convert turns a higher-precision MoE master (bf16 / fp8 / mxfp4) into a servable 2-bit IQ2 GGUF checkpoint in one command. It replaces the seven-step manual recipe that previously lived in recipes/2bit-moe/scripts/* — those scripts are now thin wrappers over the in-package mach.conversion.* modules that this command drives.
The output is a drop-in checkpoint with a GGUF v2 expert sidecar (lme-gguf-expert-sidecar-v2) and a config.json that declares quantization.expert_format="gguf". The engine self-selects the fused GGUF decode path from that config.json, so the result is served exactly like any other checkpoint:
mach convert mlx-community/Qwen3.6-35B-A3B-bf16 --out ./qwen36-a3b-iq2 --gates full
mach-serve ./qwen36-a3b-iq2 --streaming --port 8080The pipeline
mach convert threads six resumable phases, writing each phase's artifact under --out plus a conversion_report.json manifest:
resolve master (bf16/fp8/mxfp4)
│
├─ slice per-expert split (keep-all — every routed expert retained)
├─ calibrate importance matrix (imatrix); + per-layer Hessian for --gptq
├─ quantize experts → 2-bit IQ2 GGUF blocks in the sidecar layout
├─ pack servable checkpoint (config.json + non-experts + expert_sidecar/)
└─ gates validation gates certify the result| Phase | What it does |
|---|---|
slice | Splits the master into per-expert tensors (keep-all; not a quality-reducing prune). |
calibrate | Collects the per-projection importance matrix (imatrix); with --gptq, also captures the per-layer input Hessian for block-GPTQ error feedback. |
quantize | Quantizes every expert to 2-bit IQ2 GGUF blocks using the per-arch codec schedule (+ imatrix, + optional GPTQ). |
pack | Assembles the servable checkpoint — config.json (with quantization.expert_format="gguf"), non-expert tensors, and the expert_sidecar/ (lme-gguf-expert-sidecar-v2). |
gates | Runs the validation gates to certify quality (see Gates). |
Each phase is resumable: --resume skips a phase whose artifact already exists, and --phase runs a single phase in isolation.
Usage
mach convert <hf_repo_or_path> --out DIR
[--arch X] # auto-detected from config.json via detect_arch
[--variant V] [--layers all] # variant defaults to the arch's default_variant
[--imatrix|--no-imatrix] [--gptq]
[--gates {off,fast,full}] # default fast; "full" certifies vs a bf16 teacher
[--phase {slice,calibrate,quantize,pack,gates}]
[--master-dtype {bf16,mxfp4,fp8}] # default per-arch ConversionConfig
[--backend {local,modal}] # local default
[--teacher-dir DIR] [--resume]| Flag | Purpose |
|---|---|
<hf_repo_or_path> | Higher-precision master — a Hugging Face repo id or a local checkpoint directory. |
--out DIR | Output directory for phase artifacts, the packed checkpoint, and conversion_report.json. |
--arch | Architecture key; auto-detected from the master's config.json (detect_arch) when omitted. |
--variant | Codec-schedule variant from the arch's ConversionConfig (defaults to that config's default_variant; DeepSeek-V4 ships the validated variant a). |
--layers | Layer subset to convert (default all). |
--imatrix / --no-imatrix | Enable/disable importance-matrix calibration. |
--gptq | Add block-GPTQ error feedback (captures a per-layer Hessian in the calibrate phase). |
--gates {off,fast,full} | Gate level (default fast); full certifies against a bf16 teacher. |
--phase {slice,calibrate,quantize,pack,gates} | Run a single phase in isolation. |
--master-dtype {bf16,mxfp4,fp8} | Read dtype of the master (default from the arch's ConversionConfig). |
--backend {local,modal} | Execution backend; local is the default. modal is not yet implemented as an in-process path (see Backends). |
--teacher-dir DIR | Teacher checkpoint for the full gates. |
--resume | Skip phases whose artifacts already exist under --out. |
Exit codes
| Code | Meaning |
|---|---|
0 | Conversion succeeded (gates passed, or gates off). |
1 | A required validation gate failed. |
2 | Usage or runtime error — including a missing native libiqk codec (see Native dependency). |
The ConversionConfig seam
The constants the old recipe hardcoded for DeepSeek-V4 are now policy on a per-model_type ConversionConfig (mach.conversion.config), while shapes are derived from config.json at runtime so one config serves any SKU of an architecture:
| Was hardcoded | New home |
|---|---|
HID, INTER, NE, NL dims | resolve_shape(config_json, adapter) (reuses the arch registry + slicer expert-count key path) |
depth-graded down codec schedule | ConversionConfig.schedules[variant] |
| master is mxfp4 | ConversionConfig.master_dtype |
num_hash_layers, top-k | ConversionConfig.num_hash_layers / moe_top_k |
| corpus paths, teacher/refppl protocol | ConversionConfig.calibration / gate |
CONVERSION_REGISTRY is kept in lockstep with the arch registry (io.arch_registry.REGISTRY) — the module raises at import if they drift — so a new architecture cannot get a serving adapter without a conversion policy, or vice versa.
Validated vs. untuned architectures
| Architecture | Conversion status |
|---|---|
deepseek_v4 | Validated — ships the certified codec schedule (variant a). |
qwen3_5_moe (Qwen3.6-35B-A3B) | Validated |
qwen3_moe | Registered, untuned placeholder schedule |
gpt_oss | Registered, untuned placeholder schedule |
gemma4 | Registered, untuned placeholder schedule |
Architectures whose depth schedule is still a conservative placeholder carry schedule_tuned=False. The full gates surface a non-fatal schedule_untuned warning for them, so a placeholder can't masquerade as certified. Re-derive the per-layer bit schedule from the model's own sensitivity before treating an untuned arch as production-ready.
Validation gates
| Level | What runs |
|---|---|
off | No gates — fastest, no certification. |
fast (default) | Cheap structural checks (codec round-trip, shape/geometry validation). |
full | Adds output-space certification against a bf16 teacher (KL / top-1 vs the teacher; pass --teacher-dir), plus the schedule_untuned warning for untuned archs. |
A failed required gate exits 1. See recipes/2bit-moe/README.md for the full gate rationale (block recon, Metal parity, end-to-end decode parity, per-domain KL/top-1, on-box fit).
Backends
--backend local (the default) runs the full pipeline in-process on Apple Silicon. --backend modal is recognized but not yet implemented as an in-process path; large-scale Modal-based quantization still goes through the recipe's standalone Modal workers (recipes/2bit-moe/scripts/modal_quant.py), because the per-shard quant runs MLX-free on Linux.
Native dependency
The IQ2 codec (IQ2_K / IQ2_KS / IQ2_KL quantize + dequant) is provided by the native libiqk library. Build it once before converting:
python vendor/local_moe_engine/scripts/build_libiqk.pyA missing libiqk makes mach convert exit 2. The codec also ships as the conversion optional-dependency extra.
Relationship to the recipe
The reproducible 2-bit MoE technique writeup — the measured quality/speed results, the codec-and-bit-allocation rationale, and the per-model calibration guidance — lives in recipes/2bit-moe/README.md. That recipe is now driven by mach convert; its scripts/* remain only as thin back-compat wrappers over mach.conversion.* for running a single phase by hand.
Related pages
- CLI — discovery, generation, and artifact commands
- Serving — serve the converted checkpoint with
mach-serve --streaming - Expert residency — the expert sidecar format the pack phase emits
- Installation — extras and native builds