Evaluations

Measure how your task-specialized models compare against frontier baselines.

Evaluations are how Maniac proves your optimized models outperform frontier baselines on your niche domain tasks. Every optimization run is gated by evals — a candidate model only gets promoted to production if it beats the current model on your defined criteria.

Maniac supports two types of evaluations:

  • Code Evals — Deterministic scoring with custom Python functions. Use these for schema validation, exact match, unit tests, or any business logic you can express in code.
  • Judge Evals — LLM-based comparison using a judge prompt. Use these when success criteria are best described in natural language (e.g., "more accurate", "follows the format better").
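A code eval is just a deterministic scoring function. Here is a minimal sketch of what one might look like; the `score(output, expected)` signature and the 0.0–1.0 return convention are assumptions for illustration, not Maniac's actual eval interface:

```python
import json

def score(output: str, expected: str) -> float:
    """Deterministic code eval: exact match on parsed JSON.

    Hypothetical signature -- check the Code Evals docs for the
    real interface. Returns 1.0 on a pass, 0.0 on a fail.
    """
    try:
        # Compare parsed values so key order and whitespace don't matter.
        return 1.0 if json.loads(output) == json.loads(expected) else 0.0
    except json.JSONDecodeError:
        # Malformed model output never passes.
        return 0.0
```

The same shape works for schema validation or unit-test-style checks: anything you can decide in code can return a score.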

Evaluations run automatically during optimization. You can also trigger standalone eval runs to compare any model against your production baseline.
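For judge evals, the judge prompt typically presents both model outputs and asks the judge to pick a winner against your natural-language criteria. A sketch of such a prompt, with the template text and `build_judge_prompt` helper invented here for illustration:

```python
# Hypothetical judge prompt -- the wording and placeholders are
# illustrative, not Maniac's built-in template.
JUDGE_PROMPT = """You are comparing two model responses to the same task.

Task: {task}
Response A: {a}
Response B: {b}

Which response better follows the required format? Answer "A" or "B"."""

def build_judge_prompt(task: str, a: str, b: str) -> str:
    """Fill the judge template with the task and the two candidate outputs."""
    return JUDGE_PROMPT.format(task=task, a=a, b=b)
```

In practice you would randomize which model appears as A or B to avoid position bias in the judge.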

How Evals Fit the Pipeline

  1. Your container collects production telemetry from your high-throughput agent traffic.
  2. Maniac trains candidate models optimized for your specific task.
  3. Evals score each candidate against the frontier baseline on your data.
  4. Only candidates that pass are promoted — ensuring your agents always run the best model.
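The gating step above reduces to a simple comparison: aggregate each model's eval scores and promote only on a win. A sketch of that logic, where the function name and the optional `min_lift` margin are assumptions, not part of Maniac's API:

```python
def should_promote(candidate_scores: list[float],
                   baseline_scores: list[float],
                   min_lift: float = 0.0) -> bool:
    """Promote the candidate only if its mean eval score beats the
    baseline by at least min_lift. Illustrative gating logic only.
    """
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    # Strict inequality: a tie keeps the current production model.
    return candidate_mean > baseline_mean + min_lift
```

A nonzero `min_lift` is one way to avoid churning production on statistically insignificant wins.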

For implementation details, see Code Evals and Judge Evals.