LLM Evaluation: How to Measure the Quality of AI Features
Prologue: "It Seemed Fine When I Tested It"
When I first shipped an AI feature to production, my quality standard was:
"I tried it a few times and it looked good."
In retrospect that's terrifying. It was the equivalent of deploying code with no unit tests. I'd only tested the happy path, and had no coverage for the weird queries and edge cases that real users bring.
Sure enough, strange responses started appearing in production. Confident hallucinations on certain question patterns. Sudden language switching. Complete disregard for user context.
That's when it clicked: AI features need tests too. Just not the same kind as regular software tests.
That's where LLM Evaluation (Eval) comes in.
1. What Is Eval?
LLM Eval is a methodology for systematically measuring the output quality of AI models or AI-powered features.
The fundamental difference from regular testing:
| | Unit Tests | LLM Eval |
|---|---|---|
| Input | Fixed values | Diverse natural language |
| Output verification | Exact value match (===) | Quality/relevance/accuracy judgment |
| Determinism | Same input = same output | Same input ≠ same output |
| Failure criteria | Clear pass/fail | Spectrum (0.0 - 1.0) |
The key shift: instead of "right or wrong," you're measuring "how good."
2. Three Types of Eval
Automatic Evaluation
Code-automated checks. Fastest and easiest to integrate into CI.
// Exact match — for classification, structured extraction
const exactMatch = (output: string, expected: string) =>
  output.trim() === expected.trim() ? 1.0 : 0.0;

// Contains check — for required information presence
const containsCheck = (output: string, required: string[]) => {
  const matched = required.filter((p) =>
    output.toLowerCase().includes(p.toLowerCase())
  );
  return matched.length / required.length;
};

// Semantic similarity — cosine similarity of embeddings, for meaning-level comparison
import { embed } from "ai";
import { openai } from "@ai-sdk/openai";

async function semanticSimilarity(a: string, b: string): Promise<number> {
  const [embedA, embedB] = await Promise.all([
    embed({ model: openai.embedding("text-embedding-3-small"), value: a }),
    embed({ model: openai.embedding("text-embedding-3-small"), value: b }),
  ]);
  const dot = embedA.embedding.reduce((s, v, i) => s + v * embedB.embedding[i], 0);
  const magA = Math.sqrt(embedA.embedding.reduce((s, v) => s + v * v, 0));
  const magB = Math.sqrt(embedB.embedding.reduce((s, v) => s + v * v, 0));
  return dot / (magA * magB);
}
Human Evaluation
Most accurate, but slow and expensive. Use it to set quality baselines or evaluate subjective criteria (creativity, cultural appropriateness, domain-specific accuracy).
Practical tip: use automatic eval to filter first, then apply human eval only to ambiguous or high-stakes cases.
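That triage can be a few lines of code. A minimal sketch (the threshold values and the "high stakes" flag here are illustrative assumptions, not recommendations):

```typescript
// Route each case: cheap automatic score first, humans only where it matters.
type Route = "auto-pass" | "auto-fail" | "human-review";

function triage(autoScore: number, highStakes: boolean): Route {
  if (highStakes) return "human-review";    // always escalate high-stakes cases
  if (autoScore >= 0.9) return "auto-pass"; // clearly good: skip human review
  if (autoScore <= 0.3) return "auto-fail"; // clearly bad: skip human review
  return "human-review";                    // ambiguous middle band goes to humans
}
```

The middle band is where human judgment pays for itself; widen or narrow it based on how much reviewer time you have.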
LLM-as-Judge
An LLM evaluates another LLM's output. Automated, yet closer to human-quality judgment than simple heuristics.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function llmJudge(
  input: string,
  output: string,
  criteria: { name: string; description: string }[]
): Promise<Record<string, number>> {
  const response = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 512,
    messages: [{
      role: "user",
      content: `Evaluate this AI response on the given criteria.
Input: ${input}
Output: ${output}
Criteria: ${criteria.map(c => `- ${c.name}: ${c.description} (1-5 scale)`).join("\n")}
Respond with JSON: { "scores": { "criterionName": score } }`,
    }],
  });
  const text = response.content[0].type === "text" ? response.content[0].text : "{}";
  const match = text.match(/\{[\s\S]*\}/); // extract the first JSON object in the reply
  return JSON.parse(match?.[0] ?? "{}").scores ?? {};
}
Watch out for judge model bias: models tend to favor their own stylistic patterns. Use multiple judge models and average the scores when accuracy matters.
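Combining judges can be a plain per-criterion average, sketched below (equal weighting is an assumption; you may prefer to weight judges by how well they track your human baseline):

```typescript
// Average per-criterion scores across judge models to dampen single-judge bias.
function averageJudgeScores(
  runs: Record<string, number>[]
): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const run of runs) {
    for (const [criterion, score] of Object.entries(run)) {
      sums[criterion] ??= { total: 0, count: 0 };
      sums[criterion].total += score;
      sums[criterion].count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([c, { total, count }]) => [c, total / count])
  );
}
```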
3. Key Metrics
For RAG systems:
- Faithfulness: Does the response stay within the provided context? (No hallucinations)
- Answer Relevance: Does the response actually answer the question?
- Context Precision: What fraction of the retrieved documents were actually relevant to answering the question? (Measures retrieval quality, not generation)
For general generation:
- BLEU/ROUGE for translation and summarization
- LLM-as-judge quality scores for open-ended responses
- Task-specific heuristics (format checks, required field presence)
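As an example of a task-specific heuristic, here is a hypothetical format check for a feature that must emit JSON: it verifies the output parses and scores how many required fields are present.

```typescript
// Returns 0.0-1.0: does the output parse as JSON and contain the required fields?
function jsonFormatCheck(output: string, requiredFields: string[]): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return 0.0; // not valid JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) return 0.0;
  const present = requiredFields.filter((f) => f in (parsed as object));
  return present.length / requiredFields.length;
}
```

Checks like this cost nothing to run, so they belong in the fast automatic tier even when you also use an LLM judge.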
4. Building an Eval Dataset
A good eval starts with a good dataset. Dataset construction is half the work.
interface EvalCase {
  id: string;
  input: string;
  expected?: string;
  criteria: string[];
  tags: string[];
  difficulty: "easy" | "medium" | "hard" | "adversarial";
}

// Recommended composition
const composition = {
  happyPath: 0.4,   // Normal use cases
  edgeCases: 0.3,   // Empty input, very long input, etc.
  adversarial: 0.2, // Deliberately tricky inputs
  regression: 0.1,  // Past failure cases
};
Source 1: Production logs — Real user queries are the most representative. Prioritize low-rated interactions.
Source 2: Synthetic generation — Use an LLM to generate diverse test cases across difficulty levels. Fast way to get broad coverage.
5. Eval Tools
promptfoo (open source CLI): Quick setup, easy CI integration, great for comparing models and prompts side-by-side.
# promptfooconfig.yaml
providers:
  - anthropic:claude-opus-4-5
  - openai:gpt-4o
tests:
  - vars:
      question: "What's your refund policy?"
    assert:
      - type: contains
        value: "refund"
      - type: llm-rubric
        value: "Is the response helpful and specific?"
      - type: cost
        threshold: 0.01
Braintrust (SaaS): Dataset management, experiment tracking, team collaboration. Worth it if you're running evals regularly across a team.
RAGAS (Python): Purpose-built for RAG evaluation. Rich set of RAG-specific metrics out of the box.
| Tool | Type | Strengths | Weaknesses |
|---|---|---|---|
| promptfoo | Open source CLI | Fast setup, great for CI | Basic dataset management |
| Braintrust | SaaS | Team collab, experiment tracking | Paid |
| RAGAS | Open source lib | RAG-specific metrics | Python only |
6. CI Integration
An eval that doesn't run automatically is an eval you'll forget to run.
# .github/workflows/eval.yml
name: LLM Eval
on:
  pull_request:
    paths:
      - "src/lib/prompts/**"
      - "src/app/api/chat/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Fail on threshold breach
        run: node scripts/check-eval-thresholds.js
Set score thresholds in your eval script and call process.exit(1) if any metric drops below its threshold. This makes eval failures block the PR merge, just like test failures do.
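A sketch of what that threshold check could look like (the metric names, threshold values, and results-file path are illustrative assumptions, not a fixed convention):

```typescript
// Compare eval results against minimum scores; any breach should block the merge.
const thresholds: Record<string, number> = {
  faithfulness: 0.8,
  answerRelevance: 0.85,
};

function findBreaches(
  scores: Record<string, number>,
  mins: Record<string, number>
): string[] {
  return Object.entries(mins)
    .filter(([metric, min]) => (scores[metric] ?? 0) < min)
    .map(([metric, min]) => `${metric}: ${scores[metric] ?? 0} < ${min}`);
}

// In the actual script, load the results file and exit non-zero on any breach:
// const scores = JSON.parse(fs.readFileSync("eval-results.json", "utf8"));
// const breaches = findBreaches(scores, thresholds);
// if (breaches.length > 0) { console.error(breaches.join("\n")); process.exit(1); }
```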
7. Production Monitoring
CI eval is the pre-deploy quality gate. But quality can drift after deployment too — new user patterns emerge, models get updated, prompt injections get attempted.
The practical approach: log all AI interactions. Sample 5-10% for heavier LLM-as-judge evaluation. Track user ratings (thumbs up/down) as a real signal. Alert when rolling average scores drop below baseline.
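The sampling and alerting pieces are simple to sketch (the 7% sample rate, window size, and baseline below are illustrative assumptions; tune them to your traffic):

```typescript
// Decide whether to run the expensive LLM-as-judge eval on this interaction.
function shouldSample(rate: number = 0.07): boolean {
  return Math.random() < rate;
}

// Alert when the rolling average of recent scores drops below the baseline.
function shouldAlert(
  recentScores: number[],
  windowSize: number,
  baseline: number
): boolean {
  if (recentScores.length < windowSize) return false; // not enough data yet
  const window = recentScores.slice(-windowSize);
  const avg = window.reduce((s, v) => s + v, 0) / windowSize;
  return avg < baseline;
}
```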
Epilogue: Eval Is the Test Suite for AI Features
Running an AI feature without eval is the same as deploying code without tests. Fast to ship initially, but eventually degrades into a quality and trust problem.
You don't need a perfect eval system from day one. Start simple:
- 10-20 of your most critical cases
- Simple automatic checks first (contains, regex)
- Gradually add LLM-as-judge and semantic similarity
- Wire it into CI so it runs on every PR
The goal: replace "this seems fine" with "the data shows this is fine." That's what eval gives you.