Conversion

mach convert — one command to turn a higher-precision MoE master into a servable 2-bit IQ2 GGUF checkpoint, with the per-arch ConversionConfig seam and validation gates.

mach convert turns a higher-precision MoE master (bf16 / fp8 / mxfp4) into a servable 2-bit IQ2 GGUF checkpoint in one command. It replaces the seven-step manual recipe that previously lived in recipes/2bit-moe/scripts/* — those scripts are now thin wrappers over the in-package mach.conversion.* modules that this command drives.

The output is a drop-in checkpoint with a GGUF v2 expert sidecar (lme-gguf-expert-sidecar-v2) and a config.json that declares quantization.expert_format="gguf". The engine self-selects the fused GGUF decode path from that config.json, so the result is served exactly like any other checkpoint:

mach convert mlx-community/Qwen3.6-35B-A3B-bf16 --out ./qwen36-a3b-iq2 --gates full
mach-serve ./qwen36-a3b-iq2 --streaming --port 8080

The pipeline

mach convert threads six resumable phases, writing each phase's artifact under --out plus a conversion_report.json manifest:

resolve master (bf16/fp8/mxfp4)
   │
   ├─ slice        per-expert split (keep-all — every routed expert retained)
   ├─ calibrate    importance matrix (imatrix); + per-layer Hessian for --gptq
   ├─ quantize     experts → 2-bit IQ2 GGUF blocks in the sidecar layout
   ├─ pack         servable checkpoint (config.json + non-experts + expert_sidecar/)
   └─ gates        validation gates certify the result

Phase	What it does
`slice`	Splits the master into per-expert tensors (keep-all; not a quality-reducing prune).
`calibrate`	Collects the per-projection importance matrix (imatrix); with `--gptq`, also captures the per-layer input Hessian for block-GPTQ error feedback.
`quantize`	Quantizes every expert to 2-bit IQ2 GGUF blocks using the per-arch codec schedule (+ imatrix, + optional GPTQ).
`pack`	Assembles the servable checkpoint — `config.json` (with `quantization.expert_format="gguf"`), non-expert tensors, and the `expert_sidecar/` (`lme-gguf-expert-sidecar-v2`).
`gates`	Runs the validation gates to certify quality (see Gates).

Each phase is resumable: --resume skips a phase whose artifact already exists, and --phase runs a single phase in isolation.

Usage

mach convert <hf_repo_or_path> --out DIR
  [--arch X]                         # auto-detected from config.json via detect_arch
  [--variant V] [--layers all]      # variant defaults to the arch's default_variant
  [--imatrix|--no-imatrix] [--gptq]
  [--gates {off,fast,full}]          # default fast; "full" certifies vs a bf16 teacher
  [--phase {slice,calibrate,quantize,pack,gates}]
  [--master-dtype {bf16,mxfp4,fp8}]  # default per-arch ConversionConfig
  [--backend {local,modal}]          # local default
  [--teacher-dir DIR] [--resume]

Flag	Purpose
`<hf_repo_or_path>`	Higher-precision master — a Hugging Face repo id or a local checkpoint directory.
`--out DIR`	Output directory for phase artifacts, the packed checkpoint, and `conversion_report.json`.
`--arch`	Architecture key; auto-detected from the master's `config.json` (`detect_arch`) when omitted.
`--variant`	Codec-schedule variant from the arch's `ConversionConfig` (defaults to that config's `default_variant`; DeepSeek-V4 ships the validated variant `a`).
`--layers`	Layer subset to convert (default `all`).
`--imatrix` / `--no-imatrix`	Enable/disable importance-matrix calibration.
`--gptq`	Add block-GPTQ error feedback (captures a per-layer Hessian in the calibrate phase).
`--gates {off,fast,full}`	Gate level (default `fast`); `full` certifies against a bf16 teacher.
`--phase {slice,calibrate,quantize,pack,gates}`	Run a single phase in isolation.
`--master-dtype {bf16,mxfp4,fp8}`	Read dtype of the master (default from the arch's `ConversionConfig`).
`--backend {local,modal}`	Execution backend; `local` is the default. `modal` is not yet implemented as an in-process path (see Backends).
`--teacher-dir DIR`	Teacher checkpoint for the `full` gates.
`--resume`	Skip phases whose artifacts already exist under `--out`.

Exit codes

Code	Meaning
`0`	Conversion succeeded (gates passed, or gates `off`).
`1`	A required validation gate failed.
`2`	Usage or runtime error — including a missing native `libiqk` codec (see Native dependency).

The `ConversionConfig` seam

The constants the old recipe hardcoded for DeepSeek-V4 are now policy on a per-model_type ConversionConfig (mach.conversion.config), while shapes are derived from config.json at runtime so one config serves any SKU of an architecture:

Was hardcoded	New home
`HID, INTER, NE, NL` dims	`resolve_shape(config_json, adapter)` (reuses the arch registry + slicer expert-count key path)
depth-graded `down` codec schedule	`ConversionConfig.schedules[variant]`
master is mxfp4	`ConversionConfig.master_dtype`
`num_hash_layers`, top-k	`ConversionConfig.num_hash_layers` / `moe_top_k`
corpus paths, teacher/refppl protocol	`ConversionConfig.calibration` / `gate`

CONVERSION_REGISTRY is kept in lockstep with the arch registry (io.arch_registry.REGISTRY) — the module raises at import if they drift — so a new architecture cannot get a serving adapter without a conversion policy, or vice versa.

Validated vs. untuned architectures

Architecture	Conversion status
`deepseek_v4`	Validated — ships the certified codec schedule (variant `a`).
`qwen3_5_moe` (Qwen3.6-35B-A3B)	Validated
`qwen3_moe`	Registered, untuned placeholder schedule
`gpt_oss`	Registered, untuned placeholder schedule
`gemma4`	Registered, untuned placeholder schedule

Architectures whose depth schedule is still a conservative placeholder carry schedule_tuned=False. The full gates surface a non-fatal schedule_untuned warning for them, so a placeholder can't masquerade as certified. Re-derive the per-layer bit schedule from the model's own sensitivity before treating an untuned arch as production-ready.

Validation gates

Level	What runs
`off`	No gates — fastest, no certification.
`fast` (default)	Cheap structural checks (codec round-trip, shape/geometry validation).
`full`	Adds output-space certification against a bf16 teacher (KL / top-1 vs the teacher; pass `--teacher-dir`), plus the `schedule_untuned` warning for untuned archs.

A failed required gate exits 1. See recipes/2bit-moe/README.md for the full gate rationale (block recon, Metal parity, end-to-end decode parity, per-domain KL/top-1, on-box fit).

Backends

--backend local (the default) runs the full pipeline in-process on Apple Silicon. --backend modal is recognized but not yet implemented as an in-process path; large-scale Modal-based quantization still goes through the recipe's standalone Modal workers (recipes/2bit-moe/scripts/modal_quant.py), because the per-shard quant runs MLX-free on Linux.

Native dependency

The IQ2 codec (IQ2_K / IQ2_KS / IQ2_KL quantize + dequant) is provided by the native libiqk library. Build it once before converting:

python vendor/local_moe_engine/scripts/build_libiqk.py

A missing libiqk makes mach convert exit 2. The codec also ships as the conversion optional-dependency extra.

Relationship to the recipe

The reproducible 2-bit MoE technique writeup — the measured quality/speed results, the codec-and-bit-allocation rationale, and the per-model calibration guidance — lives in recipes/2bit-moe/README.md. That recipe is now driven by mach convert; its scripts/* remain only as thin back-compat wrappers over mach.conversion.* for running a single phase by hand.

CLI — discovery, generation, and artifact commands
Serving — serve the converted checkpoint with mach-serve --streaming
Expert residency — the expert sidecar format the pack phase emits
Installation — extras and native builds

Conversion

On this page