LLM Evaluation: How to Measure the Quality of AI Features
Prologue: "It Seemed Fine When I Tested It"
When I first shipped an AI feature to production, my quality standard was:
"I tried it a few times and it looked good."
In retrospect that's terrifying. It was the equivalent of deploying code with no unit tests. I'd only tested the happy path, and had no coverage for the weird queries and edge cases that real users bring.
Sure enough, strange responses started appearing in production. Confident hallucinations on certain question patterns. Sudden language switching. Complete disregard for user context.
That's when it clicked: AI features need tests too. Just not the same kind as regular software tests.
That's where LLM Evaluation (Eval) comes in.
1. What Is Eval?
LLM Eval is a methodology for systematically measuring the output quality of AI models or AI-powered features.
The fundamental difference from regular testing:
| | Unit Tests | LLM Eval |
|---|---|---|
| Input | Fixed values | Diverse natural language |
| Output verification | Exact value match (===) | Quality/relevance/accuracy judgment |
| Determinism | Same input = same output | Same input ≠ same output |
| Failure criteria | Clear pass/fail | Spectrum (0.0 - 1.0) |
The key shift: instead of "right or wrong," you're measuring "how good."
2. Three Types of Eval
Automatic Evaluation
Code-automated checks. Fastest and easiest to integrate into CI.
// Exact match — for classification, structured extraction
const exactMatch = (output: string, expected: string) =>
  output.trim() === expected.trim() ? 1.0 : 0.0;

// Contains check — for required information presence
const containsCheck = (output: string, required: string[]) => {
  const matched = required.filter((p) =>
    output.toLowerCase().includes(p.toLowerCase())
  );
  return matched.length / required.length;
};

// Semantic similarity — cosine similarity of embeddings, for meaning-level comparison
import { embed } from "ai";
import { openai } from "@ai-sdk/openai";

async function semanticSimilarity(a: string, b: string): Promise<number> {
  const [embedA, embedB] = await Promise.all([
    embed({ model: openai.embedding("text-embedding-3-small"), value: a }),
    embed({ model: openai.embedding("text-embedding-3-small"), value: b }),
  ]);
  const dot = embedA.embedding.reduce((s, v, i) => s + v * embedB.embedding[i], 0);
  const magA = Math.sqrt(embedA.embedding.reduce((s, v) => s + v * v, 0));
  const magB = Math.sqrt(embedB.embedding.reduce((s, v) => s + v * v, 0));
  return dot / (magA * magB);
}
Human Evaluation
Most accurate, but slow and expensive. Use it to set quality baselines or evaluate subjective criteria (creativity, cultural appropriateness, domain-specific accuracy).
Practical tip: use automatic eval to filter first, then apply human eval only to ambiguous or high-stakes cases.
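That triage can be a few lines of code. A minimal sketch (the threshold values and the "high stakes" flag here are illustrative assumptions, not recommendations):

```typescript
// Route each case: cheap automatic score first, humans only where it matters.
type Route = "auto-pass" | "auto-fail" | "human-review";

function triage(autoScore: number, highStakes: boolean): Route {
  if (highStakes) return "human-review";    // always escalate high-stakes cases
  if (autoScore >= 0.9) return "auto-pass"; // clearly good: skip human review
  if (autoScore <= 0.3) return "auto-fail"; // clearly bad: skip human review
  return "human-review";                    // ambiguous middle band goes to humans
}
```

The middle band is where human judgment pays for itself; widen or narrow it based on how much reviewer time you have.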
LLM-as-Judge
An LLM evaluates another LLM's output. Automated, yet closer to human-quality judgment than simple heuristics.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function llmJudge(
  input: string,
  output: string,
  criteria: { name: string; description: string }[]
): Promise<Record<string, number>> {
  const response = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 512,
    messages: [{
      role: "user",
      content: `Evaluate this AI response on the given criteria.
Input: ${input}
Output: ${output}
Criteria: ${criteria.map(c => `- ${c.name}: ${c.description} (1-5 scale)`).join("\n")}
Respond with JSON: { "scores": { "criterionName": score } }`,
    }],
  });
  const text = response.content[0].type === "text" ? response.content[0].text : "{}";
  const match = text.match(/\{[\s\S]*\}/); // extract the first JSON object in the reply
  return JSON.parse(match?.[0] ?? "{}").scores ?? {};
}
Watch out for judge model bias: models tend to favor their own stylistic patterns. Use multiple judge models and average the scores when accuracy matters.
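Combining judges can be a plain per-criterion average, sketched below (equal weighting is an assumption; you may prefer to weight judges by how well they track your human baseline):

```typescript
// Average per-criterion scores across judge models to dampen single-judge bias.
function averageJudgeScores(
  runs: Record<string, number>[]
): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const run of runs) {
    for (const [criterion, score] of Object.entries(run)) {
      sums[criterion] ??= { total: 0, count: 0 };
      sums[criterion].total += score;
      sums[criterion].count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([c, { total, count }]) => [c, total / count])
  );
}
```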
3. Key Metrics
For RAG systems:
- Faithfulness: Does the response stay within the provided context? (No hallucinations)
- Answer Relevance: Does the response actually answer the question?
- Context Precision: What fraction of the retrieved documents were actually relevant to answering the question? (Measures retrieval quality, not generation)
For general generation:
- BLEU/ROUGE for translation and summarization
- LLM-as-judge quality scores for open-ended responses
- Task-specific heuristics (format checks, required field presence)
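As an example of a task-specific heuristic, here is a hypothetical format check for a feature that must emit JSON: it verifies the output parses and scores how many required fields are present.

```typescript
// Returns 0.0-1.0: does the output parse as JSON and contain the required fields?
function jsonFormatCheck(output: string, requiredFields: string[]): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return 0.0; // not valid JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) return 0.0;
  const present = requiredFields.filter((f) => f in (parsed as object));
  return present.length / requiredFields.length;
}
```

Checks like this cost nothing to run, so they belong in the fast automatic tier even when you also use an LLM judge.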
4. Building an Eval Dataset
A good eval starts with a good dataset. Dataset construction is half the work.
interface EvalCase {
  id: string;
  input: string;
  expected?: string;
  criteria: string[];
  tags: string[];
  difficulty: "easy" | "medium" | "hard" | "adversarial";
}

// Recommended composition
const composition = {
  happyPath: 0.4,   // Normal use cases
  edgeCases: 0.3,   // Empty input, very long input, etc.
  adversarial: 0.2, // Deliberately tricky inputs
  regression: 0.1,  // Past failure cases
};
Source 1: Production logs — Real user queries are the most representative. Prioritize low-rated interactions.
Source 2: Synthetic generation — Use an LLM to generate diverse test cases across difficulty levels. Fast way to get broad coverage.
5. Eval Tools
promptfoo (open source CLI): Quick setup, easy CI integration, great for comparing models and prompts side-by-side.
# promptfooconfig.yaml
providers:
  - anthropic:claude-opus-4-5
  - openai:gpt-4o
tests:
  - vars:
      question: "What's your refund policy?"
    assert:
      - type: contains
        value: "refund"
      - type: llm-rubric
        value: "Is the response helpful and specific?"
      - type: cost
        threshold: 0.01
Braintrust (SaaS): Dataset management, experiment tracking, team collaboration. Worth it if you're running evals regularly across a team.
RAGAS (Python): Purpose-built for RAG evaluation. Rich set of RAG-specific metrics out of the box.
| Tool | Type | Strengths | Weaknesses |
|---|---|---|---|
| promptfoo | Open source CLI | Fast setup, great for CI | Basic dataset management |
| Braintrust | SaaS | Team collab, experiment tracking | Paid |
| RAGAS | Open source lib | RAG-specific metrics | Python only |
6. CI Integration
An eval that doesn't run automatically is an eval you'll forget to run.
# .github/workflows/eval.yml
name: LLM Eval
on:
  pull_request:
    paths:
      - "src/lib/prompts/**"
      - "src/app/api/chat/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Fail on threshold breach
        run: node scripts/check-eval-thresholds.js
Set score thresholds in your eval script and call process.exit(1) if any metric drops below its threshold. This makes eval failures block the PR merge, just like test failures do.
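A sketch of what that threshold check could look like (the metric names, threshold values, and results-file path are illustrative assumptions, not a fixed convention):

```typescript
// Compare eval results against minimum scores; any breach should block the merge.
const thresholds: Record<string, number> = {
  faithfulness: 0.8,
  answerRelevance: 0.85,
};

function findBreaches(
  scores: Record<string, number>,
  mins: Record<string, number>
): string[] {
  return Object.entries(mins)
    .filter(([metric, min]) => (scores[metric] ?? 0) < min)
    .map(([metric, min]) => `${metric}: ${scores[metric] ?? 0} < ${min}`);
}

// In the actual script, load the results file and exit non-zero on any breach:
// const scores = JSON.parse(fs.readFileSync("eval-results.json", "utf8"));
// const breaches = findBreaches(scores, thresholds);
// if (breaches.length > 0) { console.error(breaches.join("\n")); process.exit(1); }
```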
7. Production Monitoring
CI eval is the pre-deploy quality gate. But quality can drift after deployment too — new user patterns emerge, models get updated, prompt injections get attempted.
The practical approach: log all AI interactions. Sample 5-10% for heavier LLM-as-judge evaluation. Track user ratings (thumbs up/down) as a real signal. Alert when rolling average scores drop below baseline.
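The sampling and alerting pieces are simple to sketch (the 7% sample rate, window size, and baseline below are illustrative assumptions; tune them to your traffic):

```typescript
// Decide whether to run the expensive LLM-as-judge eval on this interaction.
function shouldSample(rate: number = 0.07): boolean {
  return Math.random() < rate;
}

// Alert when the rolling average of recent scores drops below the baseline.
function shouldAlert(
  recentScores: number[],
  windowSize: number,
  baseline: number
): boolean {
  if (recentScores.length < windowSize) return false; // not enough data yet
  const window = recentScores.slice(-windowSize);
  const avg = window.reduce((s, v) => s + v, 0) / windowSize;
  return avg < baseline;
}
```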
Epilogue: Eval Is the Test Suite for AI Features
Running an AI feature without eval is the same as deploying code without tests. Fast to ship initially, but eventually degrades into a quality and trust problem.
You don't need a perfect eval system from day one. Start simple:
- 10-20 of your most critical cases
- Simple automatic checks first (contains, regex)
- Gradually add LLM-as-judge and semantic similarity
- Wire it into CI so it runs on every PR
The goal: replace "this seems fine" with "the data shows this is fine." That's what eval gives you.