Judge evals
LLM-based pairwise comparison to verify that your task-specialized models outperform frontier baselines.
Judge evals use a judge prompt to compare two candidate outputs (A vs B) for the same input — typically your optimized task-specialized model vs. the frontier baseline.
Use judge evals when you want fast iteration and your success criteria can be described in natural language (e.g. "more accurate on this domain", "follows the format better", "handles edge cases correctly").
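As a minimal sketch of the A-vs-B flow described above, the snippet below builds a judge prompt for two candidate outputs and parses the verdict. The prompt wording, function names (`build_judge_prompt`, `parse_verdict`, `judged_pair`), and the A/B side-swapping are illustrative assumptions, not the hosted API; the call to the judge model itself is omitted.

```python
import random

# Hypothetical judge prompt; the real template used by the platform may differ.
JUDGE_TEMPLATE = """You are an impartial judge. Given an input and two candidate
answers, decide which answer better satisfies the success criteria.

Input:
{input}

Answer A:
{answer_a}

Answer B:
{answer_b}

Reply with exactly one letter: A or B."""


def build_judge_prompt(task_input: str, answer_a: str, answer_b: str) -> str:
    """Fill the judge template with the shared input and both candidates."""
    return JUDGE_TEMPLATE.format(
        input=task_input, answer_a=answer_a, answer_b=answer_b
    )


def parse_verdict(judge_response: str) -> str:
    """Extract 'A' or 'B' from the judge model's reply; reject anything else."""
    verdict = judge_response.strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected judge verdict: {judge_response!r}")
    return verdict


def judged_pair(task_input: str, candidate: str, baseline: str, rng=random):
    """Randomize which side each model appears on to reduce position bias.

    Returns the prompt plus a mapping from verdict letter to model role,
    so the verdict can be attributed after the sides were shuffled.
    """
    if rng.random() < 0.5:
        prompt = build_judge_prompt(task_input, candidate, baseline)
        return prompt, {"A": "candidate", "B": "baseline"}
    prompt = build_judge_prompt(task_input, baseline, candidate)
    return prompt, {"A": "baseline", "B": "candidate"}
```

The prompt would be sent to a judge model, and `parse_verdict` applied to its reply; the side mapping from `judged_pair` then tells you whether the winning letter was your task-specialized model or the baseline.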
For implementation details and examples, see the hosted docs at https://docs.maniac.ai/evaluations/creating-evaluations.