Maniac Docs

Troubleshooting

Diagnose production slow paths — direct pread, cache churn, slow prefill, stats timeouts, and diagnostic fallback.

Production mach-serve is designed to fail fast when the native fast path is unavailable. Use this page when startup logs or /v1/cache/stats indicate you are on a diagnostic or degraded path.

See Serving for expected startup signals and Installation for building native extensions.

direct_pread.active=0 or direct_pread.mode=none

Symptom: The streaming_summary object in /v1/cache/stats shows direct_pread.active=0 or mode=none.

Cause:

  • lme_mlx_pread_ext not built or not importable
  • expert_sidecar/ missing or invalid
  • Direct pread explicitly disabled

Fix:

pip install -e ".[dev,dflash,native]"
python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"

Restart mach-serve and confirm startup logs:

startup direct-pread ... native_extension=ready ... fallback_policy=fail-fast
production optimizations direct_pread=1 ...

Re-export the sidecar if the startup log shows sidecar_valid=0:

python scripts/export_expert_sidecar.py \
  --checkpoint /path/to/checkpoint \
  --output /path/to/checkpoint/expert_sidecar \
  --num-experts 256 \
  --bits 4

See Expert residency and CLI.
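
Before restarting, the first two causes listed above can be checked quickly from Python. The following is a minimal sketch, not part of the shipped tooling: the helper name preflight is hypothetical, and it assumes expert_sidecar/ sits directly inside the checkpoint directory, as in the export command above. It only checks that the sidecar directory exists and is non-empty, so a corrupt sidecar would still pass.

```python
import importlib.util
from pathlib import Path


def preflight(checkpoint_dir: str, module_name: str = "lme_mlx_pread_ext") -> dict:
    """Report whether the native extension is locatable and the sidecar directory exists.

    Hypothetical helper: the real validity check is whatever mach-serve
    logs as sidecar_valid at startup.
    """
    sidecar = Path(checkpoint_dir) / "expert_sidecar"
    return {
        # find_spec locates the module on sys.path without executing it
        "extension_found": importlib.util.find_spec(module_name) is not None,
        # Presence check only: a non-empty directory can still be invalid
        "sidecar_present": sidecar.is_dir() and any(sidecar.iterdir()),
    }
```

Calling preflight("/path/to/checkpoint") returns a dict with two booleans. If extension_found is False, rebuild the extension; if sidecar_present is False, re-export the sidecar.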

High evictions / cache churn

Symptom: streaming_summary shows high evictions, low hit rate, or rising misses under steady decode.

Cause: Resident decode bank too small for the workload's expert working set.

Fix:

  • Increase --wired-gb (Metal wired limit, default 9)
  • Increase --expert-cache-gb (resident bank size)
  • On constrained machines, use --streaming so prefill stays transient while decode remains resident

Example:

mach-serve /path/to/<your-engine-checkpoint> --streaming --wired-gb 10 --expert-cache-gb 8 --port 8080

Review LME_BANK_EVICTION_POLICY=lookahead behavior in Expert residency.
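
To judge whether a larger resident bank actually helped, it can be useful to compare two stats snapshots taken between requests rather than reading cumulative totals. The sketch below assumes the snapshots expose cumulative counters named hits, misses, and evictions. Those key names are hypothetical, so adjust them to whatever your streaming_summary actually reports.

```python
def churn_between(before: dict, after: dict) -> dict:
    """Compare two cumulative counter snapshots and summarize cache churn.

    The keys 'hits', 'misses', and 'evictions' are assumed names,
    not confirmed fields of streaming_summary.
    """
    delta = {k: after[k] - before[k] for k in ("hits", "misses", "evictions")}
    lookups = delta["hits"] + delta["misses"]
    return {
        # None means no lookups happened between the two snapshots
        "hit_rate": delta["hits"] / lookups if lookups else None,
        "evictions_per_lookup": delta["evictions"] / lookups if lookups else None,
    }
```

A falling hit_rate or a rising evictions_per_lookup under steady decode matches the symptom described in this section.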

Very slow prefill (~6 tok/s)

Symptom: Prefill throughput is orders of magnitude below the expected rate, and logs show diagnostic fallback.

Cause:

  • Python diagnostic fallback path (fallback_policy=diagnostic-python)
  • Native pread disabled or missing
  • Non-production backend (--backend openai)

Fix:

  • Confirm effective_backend=production profile=production
  • Confirm startup decode_path=specdec-v7
  • Confirm native_extension=ready and direct_pread=1
  • Avoid generic openai backend for production benchmarks

If you intentionally enabled diagnostic fallback for debugging, expect slow prefill — this is not a serving mode.

/v1/cache/stats timeout or stale_generation_in_flight

Symptom: The cache stats request hangs, times out, or returns stale_generation_in_flight.

Cause: Single-flight _GENERATION_LOCK is held during an in-progress generation. Stats sampling contends with the active decode loop.
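
The contention can be illustrated with a toy model. This is not the server's code: GENERATION_LOCK here is a stand-in for the real _GENERATION_LOCK, and the returned dict shape is hypothetical. The sketch only shows why a stats call that needs the same lock cannot complete while a generation holds it.

```python
import threading

# Stand-in for the server's single-flight lock (illustrative only)
GENERATION_LOCK = threading.Lock()


def sample_stats(timeout_s: float = 0.05) -> dict:
    """Return a snapshot if the lock is free, else a busy marker."""
    # acquire() returns False if the lock is still held after timeout_s
    if not GENERATION_LOCK.acquire(timeout=timeout_s):
        return {"status": "stale_generation_in_flight"}
    try:
        return {"status": "ok"}
    finally:
        GENERATION_LOCK.release()
```

In this model, calling sample_stats() while the lock is held returns the busy marker after the timeout, and calling it again once the lock is released returns the ok snapshot.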

Fix:

  • Retry after the current generation completes
  • Sample stats between requests for reliable snapshots
  • For concurrent clients, consider opt-in Continuous batching, which gives up DFlash at batch sizes B≥2
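
The retry advice above can be sketched as a small polling helper with exponential backoff. The function names are hypothetical, and the sketch assumes a busy server either raises a timeout or returns a body containing the string stale_generation_in_flight. The exact response shape is not documented here, so the check is a plain substring match.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_cache_stats(url: str = "http://127.0.0.1:8080/v1/cache/stats") -> dict:
    """Fetch one stats snapshot; raises on timeout or connection failure."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def stats_when_idle(fetch: Callable[[], dict], attempts: int = 5,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call fetch() until it returns a snapshot that is not marked stale."""
    for attempt in range(attempts):
        try:
            snapshot = fetch()
            # Assumed marker: any occurrence of the string means "busy"
            if "stale_generation_in_flight" not in str(snapshot):
                return snapshot
        except OSError:
            pass  # timeouts and connection errors are treated as "busy"
        if attempt < attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # wait 0.5s, 1s, 2s, ...
    return None
```

Calling stats_when_idle(fetch_cache_stats) returns the first clean snapshot, or None if every attempt found the server busy. Passing fetch and sleep as parameters keeps the retry logic testable without a running server.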

Diagnostic fallback override (explicitly slow)

Python fallback is diagnostic only, not production.

LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 mach-serve /path/to/checkpoint --streaming --port 8080

Use only when debugging startup blockers. Maniac Desktop may keep --streaming with this policy when expert_sidecar/ is missing from a catalog path — prefer fixing the checkpoint or setting MANIAC_LOCAL_MOE_CHECKPOINT_PATH.

Startup red flags checklist

Log value | Meaning | Action
sidecar_valid=0 | Sidecar missing or corrupt | Export or fetch the engine-format sidecar artifact
native_extension=missing | lme_mlx_pread_ext import failed | Rebuild the extension
fallback_policy=diagnostic-python | On the Python slow path | Fix the native extension and the sidecar
decode_path≠specdec-v7 | Spec-dec disabled or wrong backend | Install [dflash], pass --draft-dir, avoid --target-only
effective_backend=openai | Generic mlx-lm path | Use production mach-serve without --backend openai
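
The checklist can also be applied mechanically to captured startup output. The sketch below is a hypothetical helper, not a shipped tool. It assumes the log consists of whitespace-separated key=value tokens, as in the startup lines quoted earlier on this page, and it reads the decode_path row as "any value other than specdec-v7".

```python
import re

# Red flags from the checklist: key -> predicate on the logged value
RED_FLAGS = {
    "sidecar_valid": lambda v: v == "0",
    "native_extension": lambda v: v == "missing",
    "fallback_policy": lambda v: v == "diagnostic-python",
    # Assumption: any decode path other than specdec-v7 counts as a flag
    "decode_path": lambda v: v != "specdec-v7",
    "effective_backend": lambda v: v == "openai",
}


def red_flags(log_text: str) -> list:
    """Return the sorted 'key=value' tokens that match a red flag."""
    found = set()
    for key, value in re.findall(r"(\w+)=(\S+)", log_text):
        if key in RED_FLAGS and RED_FLAGS[key](value):
            found.add(f"{key}={value}")
    return sorted(found)
```

An empty list means none of the listed flags appeared in the text. It does not prove the expected signals were logged at all, so still confirm native_extension=ready and direct_pread=1 by eye.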

Maniac Desktop-specific

Issue | Check
Engine not starting | MANIAC_LOCAL_MOE_ENABLED=1, venv install logs
Wrong checkpoint | MANIAC_LOCAL_MOE_CHECKPOINT_PATH, catalog sidecar presence
Stale vendored code | pnpm run vendor:local-moe:check
Readiness timeout | First expert load can take minutes; the startup budget is 600s

See Maniac integration.

Observability commands

curl -s http://127.0.0.1:8080/v1/stats | jq '.adaptive_block_policy, .specdec_drafted, .specdec_accepted'
curl -s http://127.0.0.1:8080/v1/cache/stats | jq '.streaming_summary, .prefix_cache'

Focused residency probe (upstream script):

python scripts/run_expert_residency_probe.py --url http://127.0.0.1:8080
