M·07 · AI ENGINEERING · 2026.03.16 · 4 MIN READ

LLM Evaluation (Eval): How to Measure the Quality of AI Features

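Shipping an AI feature to production and stopping at "it seems to work..." is not good enough. This post covers the types of LLM evals and their metrics, how to build an evaluation dataset in practice, and how to wire evals into CI.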

codemapo

INTERDISCIPLINARY DEV · SEOUL

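Prologue: "I tried it a few times and it worked fine"

When I first shipped an AI feature to production, this was my quality bar.

"I tried it a few times and the output looked good."

Looking back, that is horrifying. It was the equivalent of deploying code without a single unit test. The cases I had tried were all happy path; the odd queries and edge cases that real users type in were not covered at all.

Sure enough, strange responses started showing up in production. On certain question patterns the AI stated wrong information with full confidence, suddenly answered in a different language, or ignored the user's context entirely.

That is when it clicked. AI features need tests too, just not the same kind as ordinary software tests.

That was the start of **LLM Evaluation (Eval)**.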

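1. What Is an Eval?

LLM eval is a methodology for systematically measuring the output quality of an AI model or an AI feature.

It differs from ordinary software testing in fundamental ways.

	Traditional test	LLM Eval
Input	Fixed values	Varied natural-language input
Output check	Exact value comparison (`===`)	Judgment of quality / relevance / accuracy
Determinism	Same input = same output	Same input ≠ same output
Failure criterion	Clear-cut (pass/fail)	A spectrum (0.0 ~ 1.0)

The point is to measure **"how good is it"** rather than "right or wrong".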
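Because the same input can produce different outputs, one way to turn a 0.0 to 1.0 spectrum into a decision is to sample the same case several times and aggregate. The sketch below is only an illustration: `aggregateScores` and the 0.7 threshold are assumptions, not part of any library.

```typescript
// Hypothetical helper: summarize repeated scores for one eval case.
interface ScoreSummary {
  mean: number;     // average score across samples
  min: number;      // worst sample, useful for spotting flaky cases
  passRate: number; // share of samples at or above the threshold
}

function aggregateScores(scores: number[], threshold = 0.7): ScoreSummary {
  if (scores.length === 0) {
    throw new Error("aggregateScores needs at least one score");
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const min = Math.min(...scores);
  const passRate = scores.filter((s) => s >= threshold).length / scores.length;
  return { mean, min, passRate };
}

// Four samples of the same prompt, scored by any 0-1 scorer
const sampleSummary = aggregateScores([1.0, 0.5, 1.0, 0.5]);
console.log(sampleSummary); // { mean: 0.75, min: 0.5, passRate: 0.5 }
```

A case with a decent mean but a low pass rate is unstable, which a single run would hide.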

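When You Need Evals

- Validating quality before launching a new AI feature
- Checking whether a prompt change improves or degrades performance
- Comparing before and after a model change (e.g. moving from gpt-4o to gpt-4o-mini to cut costs)
- Monitoring AI response quality in production
- Regression testing: confirming that a fix did not break other cases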

2. Three Types of Eval

2-1. Automatic Evaluation

Evaluation automated in code. It is the fastest type and the easiest to attach to CI.

Exact Match

function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1.0 : 0.0;
}

// Use cases: classification, code generation, structured data extraction
const llmOutput = "positive"; // stand-in for the model's actual output
const result = exactMatch(
  llmOutput,
  "positive" // expected
);

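Contains Check

function containsCheck(output: string, requiredPhrases: string[]): number {
  // With no required phrases there is nothing to miss; this also avoids dividing by zero
  if (requiredPhrases.length === 0) return 1.0;
  const matched = requiredPhrases.filter((phrase) =>
    output.toLowerCase().includes(phrase.toLowerCase())
  );
  return matched.length / requiredPhrases.length;
}

// Use case: checking that specific information appears in the response
const score = containsCheck(
  "The current temperature in Seoul is 22°C and it is sunny.",
  ["temperature", "22°C", "sunny"]
);
// → 1.0 (all phrases present)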

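Regex Pattern Matching

function regexMatch(output: string, pattern: RegExp): number {
  return pattern.test(output) ? 1.0 : 0.0;
}

// Use cases: checking response format (JSON-like shape, dates, emails, etc.)
// Note: the first pattern only checks that the text is wrapped in braces; it does not prove the JSON parses
const looksLikeJsonObject = regexMatch(output, /^\{.*\}$/s);
const hasDate = regexMatch(output, /\d{4}-\d{2}-\d{2}/);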
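A regex can only check the shape of the text. When the format being verified is JSON, parsing it is a stricter check. The sketch below is one possible scorer; the name `jsonObjectScore` and the `requiredKeys` parameter are assumptions made for illustration.

```typescript
// Hypothetical scorer: 1.0 if the output parses as a JSON object containing all required keys.
function jsonObjectScore(output: string, requiredKeys: string[] = []): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output.trim());
  } catch {
    return 0.0; // not valid JSON at all
  }
  // JSON.parse also accepts numbers, strings, arrays and null; require a plain object
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return 0.0;
  }
  const obj = parsed as Record<string, unknown>;
  return requiredKeys.every((key) => key in obj) ? 1.0 : 0.0;
}

console.log(jsonObjectScore('{"sentiment": "positive"}', ["sentiment"])); // 1
console.log(jsonObjectScore("{sentiment: positive}"));                    // 0 (braces, but not JSON)
```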

Embedding Similarity (Semantic Similarity)

import { openai } from "@ai-sdk/openai";
import { embed } from "ai";

async function semanticSimilarity(
  output: string,
  reference: string
): Promise<number> {
  const [outputEmbed, referenceEmbed] = await Promise.all([
    embed({ model: openai.embedding("text-embedding-3-small"), value: output }),
    embed({
      model: openai.embedding("text-embedding-3-small"),
      value: reference,
    }),
  ]);

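  // Compute cosine similarity
  const dot = outputEmbed.embedding.reduce(
    (sum, val, i) => sum + val * referenceEmbed.embedding[i],
    0
  );
  const magA = Math.sqrt(
    outputEmbed.embedding.reduce((sum, val) => sum + val * val, 0)
  );
  const magB = Math.sqrt(
    referenceEmbed.embedding.reduce((sum, val) => sum + val * val, 0)
  );

  // A zero-length vector has no direction; treat it as no similarity instead of returning NaN
  if (magA === 0 || magB === 0) return 0;
  return dot / (magA * magB);
}

// Use case: checking whether two answers mean roughly the same thing
const similarity = await semanticSimilarity(
  "A puppy is a person's friend",
  "Dogs are animals that are close to humans"
);
// → a score near 1 means semantically similar; the exact value depends on the embedding model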

2-2. Human Evaluation

The most accurate type, but it costs the most time and money. Real users or domain experts rate the AI's responses.

interface HumanEvalTask {
  id: string;
  input: string;
  output: string;
  criteria: string[];
  evaluatorId: string;
}

interface HumanEvalResult {
  taskId: string;
  scores: Record<string, number>; // criterion → score (1-5)
  comments: string;
  evaluatorId: string;
  timestamp: Date;
}

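// Use cases:
// - Subjective qualities such as creativity, humor, and cultural appropriateness
// - Accuracy that requires domain expertise (medical, legal, financial)
// - Setting the initial quality bar for a new feature

Practical tip: it is hard to apply human evaluation to the whole dataset. It is more efficient to filter with automatic evaluation first and reserve human evaluation for ambiguous or important cases.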
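The filtering step in that tip can be sketched as a simple triage function. The score bands (0.4 and 0.8), the `critical` flag, and all names here are illustrative assumptions, not recommended values.

```typescript
// Hypothetical triage: route cases to human review based on an automatic score.
type Route = "auto_pass" | "auto_fail" | "human_review";

interface ScoredCase {
  id: string;
  autoScore: number;  // 0-1, from any automatic scorer
  critical?: boolean; // e.g. medical or legal content
}

function triage(c: ScoredCase, low = 0.4, high = 0.8): Route {
  if (c.critical) return "human_review"; // important cases always get a human
  if (c.autoScore >= high) return "auto_pass";
  if (c.autoScore < low) return "auto_fail";
  return "human_review"; // ambiguous middle band
}

const scoredCases: ScoredCase[] = [
  { id: "a", autoScore: 0.95 },
  { id: "b", autoScore: 0.6 },
  { id: "c", autoScore: 0.1 },
  { id: "d", autoScore: 0.99, critical: true },
];
const reviewQueue = scoredCases
  .filter((c) => triage(c) === "human_review")
  .map((c) => c.id);
console.log(reviewQueue); // ["b", "d"]
```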

2-3. LLM-as-Judge

An LLM evaluates another LLM's response. It is automated, yet can get close to the quality of human evaluation.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface EvalCriteria {
  name: string;
  description: string;
  scale: string; // "1-5" or "0-1"
}

async function llmJudge(
  input: string,
  output: string,
  criteria: EvalCriteria[]
): Promise<Record<string, number>> {
  const criteriaText = criteria
    .map((c) => `- ${c.name}: ${c.description} (scale: ${c.scale})`)
    .join("\n");

  const prompt = `You are an expert at evaluating the quality of AI responses.

Evaluate the following input and AI response.

[Input]
${input}

[AI response]
${output}

[Criteria]
${criteriaText}

For each criterion, return a score and a brief reason as JSON:
{
  "scores": {
    "criterion_name": score,
    ...
  },
  "reasoning": {
    "criterion_name": "reason",
    ...
  }
}`;

  const response = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "{}";

  try {
    const jsonMatch = text.match(/\{[\s\S]*\}/);
    const parsed = JSON.parse(jsonMatch?.[0] ?? "{}");
    return parsed.scores ?? {};
  } catch {
    return {};
  }
}

// Usage example
const scores = await llmJudge(
  "How do I sort a list in Python?",
  "To sort a list in Python, you can use the `sorted()` function or the `.sort()` method...",
  [
    {
      name: "accuracy",
      description: "Is it technically accurate?",
      scale: "1-5",
    },
    {
      name: "completeness",
      description: "Does it fully answer the question?",
      scale: "1-5",
    },
    {
      name: "clarity",
      description: "Is it explained in an easy-to-understand way?",
      scale: "1-5",
    },
  ]
);

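Caveats of LLM-as-Judge:

- The judge model's biases affect the results
- Using a judge from the same vendor as the model under test can introduce bias
- Judges tend to "prefer their own style" (for example, GPT giving higher scores to GPT-style responses)
- It helps to evaluate with several models and average the scores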
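Averaging across judges can be as simple as the sketch below. It assumes each judge returns per-criterion scores on a 1-5 scale, as in the `llmJudge` example above; `averageJudges` and `normalize1to5` are hypothetical helper names.

```typescript
// Hypothetical helper: combine per-criterion scores from several judge models.
type JudgeScores = Record<string, number>; // criterion → score on a 1-5 scale

// Map a 1-5 score onto 0-1 so it can sit next to other 0-1 metrics
function normalize1to5(score: number): number {
  return (score - 1) / 4;
}

function averageJudges(judges: JudgeScores[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const judgeScores of judges) {
    for (const [criterion, score] of Object.entries(judgeScores)) {
      const entry = sums[criterion] ?? (sums[criterion] = { total: 0, count: 0 });
      entry.total += normalize1to5(score);
      entry.count += 1;
    }
  }
  // A judge that returned nothing for a criterion is simply left out of that mean
  const averaged: Record<string, number> = {};
  for (const [criterion, { total, count }] of Object.entries(sums)) {
    averaged[criterion] = total / count;
  }
  return averaged;
}

const combined = averageJudges([
  { accuracy: 5, clarity: 3 }, // judge A
  { accuracy: 3, clarity: 3 }, // judge B
]);
console.log(combined); // { accuracy: 0.75, clarity: 0.5 }
```

A plain mean is the simplest choice; a median or a disagreement flag may suit cases where one judge is an outlier.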

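3. Core Evaluation Metrics

Metrics for RAG (Retrieval-Augmented Generation) Systems

Three metrics matter most when evaluating a RAG system:

Faithfulness: Does the AI's response stay faithful to the provided context (the retrieved documents)? Does it avoid making up content that is not in the context?

Context: "The area of Seoul is 605.2 km²."
Question: "What is the area of Seoul?"
Good response: "The area of Seoul is 605.2 km²."  → Faithfulness: 1.0
Bad response: "The area of Seoul is about 600 km², and its population is 10 million."
  → Faithfulness: 0.5 (the population is not in the context)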
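The 0.5 in the example can be read as "one of two claims is supported by the context". A minimal sketch of that arithmetic follows. In practice an LLM usually extracts the claims and judges support; here both are passed in, and `containsFact` is a deliberately naive stand-in. All names are assumptions.

```typescript
// Hypothetical sketch: faithfulness = supported claims / total claims.
type SupportChecker = (claim: string, context: string) => boolean;

function faithfulness(
  claims: string[],
  context: string,
  isSupported: SupportChecker
): number {
  if (claims.length === 0) return 1.0; // nothing asserted, nothing unsupported
  const supported = claims.filter((claim) => isSupported(claim, context));
  return supported.length / claims.length;
}

// Naive stand-in checker: a claim counts as supported if its key fact appears verbatim
const containsFact: SupportChecker = (claim, context) => context.includes(claim);

const seoulContext = "The area of Seoul is 605.2 km².";
console.log(faithfulness(["605.2 km²"], seoulContext, containsFact));               // 1
console.log(faithfulness(["605.2 km²", "10 million"], seoulContext, containsFact)); // 0.5
```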

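Answer Relevance: Does the response actually answer the question?

Question: "What is the difference between a Python list and a tuple?"
Bad response: "Python is a language created by Guido van Rossum. In 1991..."
  → Relevance: 0.1 (does not answer the question)
Good response: "A list is mutable, while a tuple is immutable..."
  → Relevance: 0.95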

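Context Precision: The share of retrieved documents that are actually relevant to the answer. Too much unnecessary context becomes noise. Some libraries, RAGAS among them, also take rank into account, so relevant documents near the top score higher.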
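Both readings of context precision can be sketched in a few lines. The input is a list of relevance flags in rank order; how those flags are produced (human labels or an LLM judge) is outside this sketch, and the function names are assumptions.

```typescript
// Hypothetical sketch of two ways to score retrieved context.
// relevant[i] is true when the i-th retrieved document (in rank order) was useful for the answer.

// Plain precision: share of retrieved documents that were relevant
function contextPrecision(relevant: boolean[]): number {
  if (relevant.length === 0) return 0;
  return relevant.filter(Boolean).length / relevant.length;
}

// Rank-aware variant: mean of precision@k over the ranks that hold a relevant document,
// so useful documents near the top score higher than the same documents near the bottom
function rankAwarePrecision(relevant: boolean[]): number {
  let hits = 0;
  let sum = 0;
  relevant.forEach((isRelevant, i) => {
    if (isRelevant) {
      hits += 1;
      sum += hits / (i + 1); // precision@k at this rank
    }
  });
  return hits === 0 ? 0 : sum / hits;
}

console.log(contextPrecision([true, false, false, true]));   // 0.5
console.log(rankAwarePrecision([true, true, false, false])); // 1
console.log(rankAwarePrecision([false, false, true, true])); // ≈ 0.417
```

The last two calls retrieve the same two relevant documents; only their rank differs, which is what the rank-aware score captures.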

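General Text Generation Metrics

// BLEU score - used for translation, summarization, and similar tasks
// Measures similarity through n-gram overlap
function bleuScore(hypothesis: string, references: string[]): number {
  // For a real implementation, using a library is recommended
  // npm install natural
  return 0; // placeholder
}

// ROUGE score - measures summary quality
// Recall-based (how much of the reference text is covered)
interface RougeScore {
  rouge1: number; // word-level (unigram) overlap
  rouge2: number; // bigram overlap
  rougeL: number; // longest common subsequence
}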
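To make "unigram overlap" concrete, here is a minimal ROUGE-1 sketch against a single reference. It tokenizes on whitespace only and skips stemming, so treat it as an illustration of the arithmetic and not a replacement for an established implementation.

```typescript
// Hypothetical minimal ROUGE-1: unigram overlap between a candidate and one reference.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

function rouge1(candidate: string, reference: string) {
  const candTokens = tokenize(candidate);
  const refTokens = tokenize(reference);

  // Count reference tokens, then consume them as candidate tokens match (clipped overlap)
  const refCounts = new Map<string, number>();
  for (const t of refTokens) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let overlap = 0;
  for (const t of candTokens) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap += 1;
      refCounts.set(t, remaining - 1);
    }
  }

  const recall = refTokens.length > 0 ? overlap / refTokens.length : 0;
  const precision = candTokens.length > 0 ? overlap / candTokens.length : 0;
  const f1 =
    recall + precision > 0 ? (2 * recall * precision) / (recall + precision) : 0;
  return { recall, precision, f1 };
}

console.log(rouge1("the cat sat", "the cat sat on the mat"));
// recall 0.5 (3 of 6 reference tokens covered), precision 1
```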

In practice, a combination of LLM-as-Judge and automated heuristics is often more useful than BLEU/ROUGE.

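4. Building an Eval Dataset

A good eval starts with a good dataset. Building the dataset is half of the total work.

Dataset Design Principles

interface EvalCase {
  id: string;
  // Input
  input: string | Record<string, unknown>;
  // Expected output (an exact value or a reference answer)
  expected?: string;
  // Evaluation criteria
  criteria: string[];
  // Metadata
  tags: string[];
  // "adversarial" is included because the synthetic generator in this section emits it
  difficulty: "easy" | "medium" | "hard" | "adversarial";
  category: string;
}

// Example composition ratios for a good dataset
const datasetComposition = {
  happyPath: 0.4,     // normal usage
  edgeCases: 0.3,     // boundary cases (empty input, very long input, etc.)
  adversarial: 0.2,   // deliberately tricky cases
  regression: 0.1,    // cases that caused problems in the past
};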
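Ratios like these become actionable once they are turned into case counts for a target dataset size. The helper below is a hypothetical sketch; how rounding leftovers are assigned is an arbitrary choice.

```typescript
// Hypothetical helper: turn target ratios into case counts for a dataset of a given size.
function planCounts(
  ratios: Record<string, number>,
  total: number
): Record<string, number> {
  const entries = Object.entries(ratios);
  const counts: Record<string, number> = {};
  let assigned = 0;
  for (const [bucket, ratio] of entries) {
    // The small epsilon guards against floating-point results like 14.999999999999998
    counts[bucket] = Math.floor(ratio * total + 1e-9);
    assigned += counts[bucket];
  }
  // Hand any cases lost to rounding to the first bucket so the counts sum to total
  if (entries.length > 0) counts[entries[0][0]] += total - assigned;
  return counts;
}

console.log(
  planCounts({ happyPath: 0.4, edgeCases: 0.3, adversarial: 0.2, regression: 0.1 }, 50)
);
// { happyPath: 20, edgeCases: 15, adversarial: 10, regression: 5 }
```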

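How to Collect the Dataset

1. Extract from Production Logs

Real user queries are the most realistic source.

// Extract eval cases from production logs
import { createClient } from "@supabase/supabase-js";

async function extractEvalCasesFromLogs(
  supabase: ReturnType<typeof createClient>,
  limit = 100
) {
  // Prioritize responses with the lowest feedback ratings
  const { data, error } = await supabase
    .from("ai_interactions")
    .select("input, output, user_rating, user_feedback")
    .not("user_rating", "is", null)
    .order("user_rating", { ascending: true })
    .limit(limit);

  if (error) throw error;

  // Fall back to an empty list so callers always get an array
  return (data ?? []).map((row) => ({
    id: crypto.randomUUID(),
    input: row.input,
    // The logged response is a baseline to compare against, not a gold answer;
    // for low-rated rows a corrected expected output still has to be written
    expected: row.output,
    criteria: ["accuracy", "helpfulness"],
    tags: ["production", row.user_rating <= 2 ? "negative" : "positive"],
    difficulty: "medium" as const,
    category: "production",
  }));
}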

2. Generate Synthetic Data

Use an LLM to generate a variety of test cases automatically.

async function generateSyntheticCases(
  topic: string,
  count = 20
): Promise<EvalCase[]> {
  const client = new Anthropic();

  const response = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: `Generate ${count} varied questions for testing an AI system on the topic "${topic}".

Requirements:
- Difficulty mix: about 40% easy, 25% medium, 25% hard, 10% adversarial
- For each question, include the key points of an ideal answer
- Output as a JSON array

Format:
[
  {
    "question": "question text",
    "key_points": ["key point 1", "key point 2"],
    "difficulty": "easy|medium|hard|adversarial"
  }
]`,
      },
    ],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "[]";

  try {
    const jsonMatch = text.match(/\[[\s\S]*\]/);
    const cases = JSON.parse(jsonMatch?.[0] ?? "[]");

    return cases.map(
      (c: { question: string; key_points: string[]; difficulty: string }) => ({
        id: crypto.randomUUID(),
        input: c.question,
        criteria: c.key_points,
        tags: ["synthetic", topic],
        difficulty: c.difficulty as EvalCase["difficulty"],
        category: topic,
      })
    );
  } catch {
    return [];
  }
}
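Model-generated JSON is not guaranteed to follow the requested format, so it can be worth validating each item before it enters the dataset. The sketch below assumes the JSON shape requested in the prompt above; `isValidRawCase` and `cleanCases` are hypothetical names.

```typescript
// Hypothetical validator for model-generated cases; the shape follows the prompt's JSON format.
interface RawCase {
  question: string;
  key_points: string[];
  difficulty: string;
}

const DIFFICULTIES = ["easy", "medium", "hard", "adversarial"];

function isValidRawCase(value: unknown): value is RawCase {
  if (typeof value !== "object" || value === null) return false;
  const { question, key_points, difficulty } = value as Record<string, unknown>;
  return (
    typeof question === "string" &&
    question.trim().length > 0 &&
    Array.isArray(key_points) &&
    key_points.every((p) => typeof p === "string") &&
    typeof difficulty === "string" &&
    DIFFICULTIES.includes(difficulty)
  );
}

// Keep valid cases and drop duplicates by normalized question text
function cleanCases(raw: unknown[]): RawCase[] {
  const seen = new Set<string>();
  const cleaned: RawCase[] = [];
  for (const item of raw) {
    if (!isValidRawCase(item)) continue;
    const key = item.question.trim().toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    cleaned.push(item);
  }
  return cleaned;
}

const cleanedCases = cleanCases([
  { question: "What is a tuple?", key_points: ["immutable"], difficulty: "easy" },
  { question: "what is a tuple? ", key_points: ["immutable"], difficulty: "easy" }, // duplicate
  { question: "Explain GIL", key_points: "threads", difficulty: "hard" },           // key_points not an array
  { question: "Trick question", key_points: [], difficulty: "impossible" },         // unknown difficulty
]);
console.log(cleanedCases.length); // 1
```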

5. Eval Tools

promptfoo

An open-source LLM evaluation framework. It is easy to run from the CLI.

npm install -g promptfoo

# promptfooconfig.yaml
description: "Customer support chatbot eval"

prompts:
  - "You are a customer support agent for {{company}}.\n\n{{question}}"

providers:
  - id: anthropic:claude-opus-4-5
    config:
      temperature: 0.3
  - id: openai:gpt-4o
    config:
      temperature: 0.3

tests:
  - vars:
      company: "TechStartup"
      question: "What is your refund policy?"
    assert:
      - type: contains
        value: "refund"
      - type: llm-rubric
        value: "Is the response friendly and does it give specific information?"
      - type: cost
        threshold: 0.01  # at most $0.01

  - vars:
      company: "TechStartup"
      question: "What's your mom up to?"  # adversarial
    assert:
      - type: llm-rubric
        value: "Does the response politely decline the inappropriate question?"

  - vars:
      company: "TechStartup"
      question: ""  # empty input
    assert:
      - type: llm-rubric
        value: "Does the response give a guidance message for the empty input?"

# Run
promptfoo eval

# View results (web UI)
promptfoo view

Braintrust

A more powerful eval platform. It supports dataset management, experiment tracking, and team collaboration.

import { Eval, Score } from "braintrust";

// Braintrust eval definition
// callYourAI and callLLMJudge are placeholders for your own functions
const result = await Eval("customer-support-bot", {
  data: () => [
    {
      input: { question: "What is your refund policy?", company: "TechStartup" },
      expected: "Refunds are available within 14 days of purchase",
      metadata: { category: "refund", difficulty: "easy" },
    },
    // ... more test cases
  ],

  task: async (input) => {
    // Call the actual AI feature
    const response = await callYourAI(input.question, input.company);
    return response;
  },

  scores: [
    // Custom inline scorer (a simple keyword check)
    (args) =>
      ({
        name: "Contains refund info",
        score: args.output.toLowerCase().includes("refund") ? 1 : 0,
      }) as Score,

    // LLM-as-judge
    async (args) => {
      const judgeScore = await callLLMJudge(args.input, args.output);
      return { name: "Quality", score: judgeScore } as Score;
    },
  ],
});

console.log(result.summary);

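RAGAS (RAG-Specific)

An evaluation library built specifically for RAG systems.

# Python example (RAGAS lives mainly in the Python ecosystem)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

data = {
    "question": ["What is the area of Seoul?", "What is Python?"],
    "answer": ["It is 605.2 km².", "Python is a programming language."],
    "contexts": [
        ["The area of Seoul is 605.2 km². Its population is about 9.5 million."],
        ["Python is an interpreted language created by Guido van Rossum in 1991."],
    ],
    # Column names depend on the RAGAS version; newer releases expect "ground_truth" (one string per row)
    "ground_truths": [["605.2 km²"], ["Python is a programming language"]],
}

dataset = Dataset.from_dict(data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

print(result)
# Example output shape (values are illustrative):
# {'faithfulness': 0.95, 'answer_relevancy': 0.89, 'context_precision': 0.91}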

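Tool Comparison

Tool	Type	Strengths	Weaknesses
promptfoo	Open-source CLI	Quick setup, easy CI integration	Basic dataset management
Braintrust	SaaS	Team collaboration, experiment tracking, UI	Paid
RAGAS	Open-source library	RAG-specific, rich set of metrics	Python only
LangSmith	SaaS	LangChain integration	Tied to LangChain
Weave (W&B)	SaaS	Integrates with ML experiment tracking	Complex setup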

6. Integrating Evals into CI

If evals are not part of the development pipeline, they end up as something you tried once and forgot. Attach them to CI so they run automatically on every PR.

GitHub Actions Integration

# .github/workflows/eval.yml
name: LLM Eval

on:
  pull_request:
    paths:
      - "src/lib/prompts/**"
      - "src/app/api/chat/**"
  push:
    branches: [main]
  schedule:
    - cron: "0 9 * * 1"  # every Monday at 09:00 UTC (GitHub cron runs in UTC), to catch model drift

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed by the PR comment step
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          
      - name: Install dependencies
        run: npm ci
        
      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npm run eval
        
      - name: Check eval thresholds
        run: |
          node scripts/check-eval-thresholds.js
          
      - name: Comment results on PR
        # always() lets the comment post even when an earlier eval step failed
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json', 'utf8'));
            
            const comment = `## Eval Results
            
            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            ${results.metrics.map(m => 
              `| ${m.name} | ${m.score.toFixed(2)} | ${m.threshold} | ${m.score >= m.threshold ? '✅' : '❌'} |`
            ).join('\n')}
            
            Overall: ${results.passed ? '✅ PASSED' : '❌ FAILED'}`;
            
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
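The workflow calls `scripts/check-eval-thresholds.js`, which the post does not show. Below is a sketch of what such a check might contain, written in TypeScript. It assumes the report exposes `metrics[]` with `name`, `score`, and `threshold`, the same fields the PR comment step reads; everything else here is hypothetical.

```typescript
// Hypothetical sketch of a threshold check over the metrics in eval-results.json.
interface MetricResult {
  name: string;
  score: number;
  threshold: number;
}

function findFailures(metrics: MetricResult[]): MetricResult[] {
  return metrics.filter((m) => m.score < m.threshold);
}

function formatFailure(m: MetricResult): string {
  return `${m.name}: ${m.score.toFixed(2)} is below threshold ${m.threshold}`;
}

// In CI this array would come from the parsed eval-results.json file
const sampleMetrics: MetricResult[] = [
  { name: "safety", score: 0.97, threshold: 0.95 },
  { name: "completeness", score: 0.62, threshold: 0.7 },
];

const failedMetrics = findFailures(sampleMetrics);
console.log(failedMetrics.map(formatFailure));
// ["completeness: 0.62 is below threshold 0.7"]
// A real script would exit with a non-zero code here when failedMetrics is not empty,
// which is what makes the CI step fail.
```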

Eval Runner Script

// scripts/run-eval.ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import * as fs from "fs/promises";

interface EvalResult {
  caseId: string;
  input: string;
  output: string;
  scores: Record<string, number>;
  passed: boolean;
}

interface EvalReport {
  timestamp: string;
  totalCases: number;
  passedCases: number;
  metrics: { name: string; score: number; threshold: number }[];
  passed: boolean;
  results: EvalResult[];
}

// Evaluation thresholds
// accuracy and helpfulness are not scored below; they only apply once a scorer
// (for example LLM-as-judge) fills them in
const THRESHOLDS = {
  accuracy: 0.8,
  helpfulness: 0.75,
  safety: 0.95, // keep the bar high for safety
};

async function runEval(): Promise<void> {
  // Load test cases
  const testCases = JSON.parse(
    await fs.readFile("eval/test-cases.json", "utf-8")
  );

  const results: EvalResult[] = [];

  for (const testCase of testCases) {
    // Call the model. generateText waits for the full response, which suits a batch eval
    let text = "";
    try {
      const response = await generateText({
        model: anthropic("claude-opus-4-5"),
        messages: [{ role: "user", content: testCase.input }],
      });
      text = response.text;
    } catch (err) {
      // A failed call is scored as an empty response instead of aborting the whole run
      console.error(`Case ${testCase.id} failed:`, err);
    }

    // Automatic scoring
    const scores: Record<string, number> = {};

    // 1. Contains check
    if (testCase.requiredPhrases) {
      const contained = testCase.requiredPhrases.filter((phrase: string) =>
        text.toLowerCase().includes(phrase.toLowerCase())
      );
      scores.completeness =
        contained.length / testCase.requiredPhrases.length;
    }

    // 2. Blocked-term check (safety). This is a crude keyword filter:
    // it flags any mention of these terms, including harmless ones
    const blockedTerms = ["personal information", "password", "credit card"];
    const lowered = text.toLowerCase();
    scores.safety = blockedTerms.some((term) => lowered.includes(term)) ? 0 : 1;

    // 3. Length check (penalize responses that are too short or too long)
    const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
    if (wordCount < 10) scores.length_quality = 0.3;
    else if (wordCount > 500) scores.length_quality = 0.7;
    else scores.length_quality = 1.0;

    // A case passes when every thresholded metric that was scored meets its threshold
    const passed = Object.entries(THRESHOLDS).every(
      ([metric, threshold]) =>
        scores[metric] === undefined || scores[metric] >= threshold
    );

    results.push({
      caseId: testCase.id,
      input: testCase.input,
      output: text,
      scores,
      passed,
    });
  }

  // Build the report
  const passedCount = results.filter((r) => r.passed).length;
  const allScores: Record<string, number[]> = {};

  for (const result of results) {
    for (const [metric, score] of Object.entries(result.scores)) {
      if (!allScores[metric]) allScores[metric] = [];
      allScores[metric].push(score);
    }
  }

  const avgScores = Object.entries(allScores).map(([name, scores]) => ({
    name,
    score: scores.reduce((a, b) => a + b, 0) / scores.length,
    threshold: THRESHOLDS[name as keyof typeof THRESHOLDS] ?? 0.7,
  }));

  const report: EvalReport = {
    timestamp: new Date().toISOString(),
    totalCases: results.length,
    passedCases: passedCount,
    metrics: avgScores,
    passed: avgScores.every((m) => m.score >= m.threshold),
    results,
  };

  await fs.writeFile("eval-results.json", JSON.stringify(report, null, 2));

  console.log(`\nEval results: ${passedCount}/${results.length} passed`);
  for (const metric of avgScores) {
    const status = metric.score >= metric.threshold ? "✅" : "❌";
    console.log(
      `  ${status} ${metric.name}: ${metric.score.toFixed(3)} (threshold: ${metric.threshold})`
    );
  }

  if (!report.passed) {
    process.exit(1); // mark the CI step as failed
  }
}

// A crash must also fail CI; logging alone would leave the exit code at 0
runEval().catch((err) => {
  console.error(err);
  process.exit(1);
});

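7. Production Monitoring

CI evals are a quality gate before deployment. Monitoring has to continue after the feature is in production, because model drift, new user patterns, and prompt injection attempts all show up over time.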

// Logging and sampling production AI responses
import { createClient } from "@supabase/supabase-js";

interface AIInteractionLog {
  id: string;
  sessionId: string;
  userId?: string;
  input: string;
  output: string;
  model: string;
  latencyMs: number;
  tokenCount: number;
  timestamp: Date;
  // Automatic evaluation results
  autoEvalScores?: Record<string, number>;
  // User feedback (rating from 1 to 5)
  userRating?: 1 | 2 | 3 | 4 | 5;
  userFeedback?: string;
}

async function logAndEvaluate(
  supabase: ReturnType<typeof createClient>,
  log: Omit<AIInteractionLog, "id" | "autoEvalScores">
): Promise<void> {
  // Fast automatic checks (keep them lightweight in production)
  const autoEvalScores: Record<string, number> = {
    // Length-based quality heuristic
    length_quality: log.output.length > 50 && log.output.length < 2000 ? 1 : 0.5,
    // Response time (slow responses are a UX problem)
    latency_ok: log.latencyMs < 3000 ? 1 : log.latencyMs < 5000 ? 0.5 : 0,
  };

  const id = crypto.randomUUID();

  await supabase.from("ai_interaction_logs").insert({
    ...log,
    id,
    auto_eval_scores: autoEvalScores,
    timestamp: log.timestamp.toISOString(),
  });

  // Sample only 10% for the more expensive LLM-as-judge pass
  if (Math.random() < 0.1) {
    // Run asynchronously so the user-facing response is not delayed
    setTimeout(async () => {
      try {
        // Note: the judge scores on a 1-5 scale, unlike the 0-1 heuristics above
        const deepScores = await llmJudge(log.input, log.output, [
          { name: "quality", description: "Response quality", scale: "1-5" },
        ]);

        await supabase
          .from("ai_interaction_logs")
          .update({ auto_eval_scores: { ...autoEvalScores, ...deepScores } })
          .eq("id", id);
      } catch (err) {
        // Without this, a judge or database failure becomes an unhandled rejection
        console.error("Deep eval failed:", err);
      }
    }, 0);
  }
}
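Logged scores only help if something looks at them. One simple way to watch for drift is to compare a recent window of scores against a baseline window. The sketch below is an assumption-laden illustration: `detectDrift`, the 0.05 tolerance, and the window choice are all hypothetical.

```typescript
// Hypothetical drift check: compare the recent mean of a logged score with a baseline mean.
interface DriftReport {
  baselineMean: number;
  recentMean: number;
  drop: number;     // baselineMean - recentMean
  drifted: boolean; // true when the drop exceeds the tolerance
}

function meanOf(values: number[]): number {
  if (values.length === 0) throw new Error("mean of empty array");
  return values.reduce((a, b) => a + b, 0) / values.length;
}

function detectDrift(
  baseline: number[],
  recent: number[],
  tolerance = 0.05
): DriftReport {
  const baselineMean = meanOf(baseline);
  const recentMean = meanOf(recent);
  const drop = baselineMean - recentMean;
  return { baselineMean, recentMean, drop, drifted: drop > tolerance };
}

// Scores sampled from last month's logs vs. this week's
const driftReport = detectDrift([0.9, 0.8, 0.9, 0.8], [0.7, 0.8, 0.7, 0.8]);
console.log(driftReport.drifted); // true (the mean fell by about 0.1)
```

A fixed tolerance is the simplest option; with enough samples, a statistical test on the two windows would be less sensitive to noise.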

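Epilogue: Eval Is Testing for AI Development

Running an AI feature without evals is like deploying code without tests. You can ship faster in the short term, but in the long run it leads to degraded quality and lost user trust.

You do not need a perfect eval system from day one. Start simple:

- Start with the 10-20 most important cases
- Begin with simple checks that are easy to automate (contains, regex)
- Gradually add LLM-as-judge and semantic similarity
- Hook it into CI so it runs on every PR

Turning "this is probably good enough" into "this data proves it": that is the value evals provide.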

#LLM Evaluation #AI Quality #Evals #AI Engineering

← 목록으로 돌아가기

M·07AI ENGINEERING2026.03.164 MIN READ

LLM 평가(Eval): AI 기능의 품질을 어떻게 측정할까

LLM Evaluation: How to Measure the Quality of AI Features

codemapo

INTERDISCIPLINARY DEV · SEOUL

Prologue: "그냥 테스트해보니까 잘 되던데요"

AI 기능을 처음 프로덕션에 올렸을 때, 품질 기준이 이랬다.

"내가 몇 번 써봤는데 잘 나오더라고요."

그때 깨달았다. AI 기능에도 테스트가 필요하다. 다만 일반적인 소프트웨어 테스트와는 다른 방식으로.

그게 **LLM Evaluation(Eval)**의 시작이었다.

1. Eval이란 무엇인가

LLM Eval은 AI 모델이나 AI 기능의 출력 품질을 체계적으로 측정하는 방법론이다.

일반 소프트웨어 테스트와 근본적인 차이가 있다.

	일반 테스트	LLM Eval
입력	고정된 값	다양한 자연어 입력
출력 검증	정확한 값 비교 (`===`)	품질/관련성/정확성 판단
결정론성	동일 입력 = 동일 출력	동일 입력 ≠ 동일 출력
실패 기준	명확 (pass/fail)	스펙트럼 (0.0 ~ 1.0)

핵심은 "맞다/틀리다"가 아니라 **"얼마나 좋은가"**를 측정한다는 것이다.

Eval이 필요한 순간들

새 AI 기능 출시 전 품질 검증
프롬프트 변경이 성능을 올리는지/내리는지 확인
모델 버전 업그레이드 전후 비교 (gpt-4o → gpt-4o-mini 비용 절감 시)
프로덕션에서 AI 응답 품질 모니터링
회귀 테스트 — 수정이 다른 케이스를 망치지 않았는지 확인

2. Eval의 세 가지 유형

2-1. 자동 평가 (Automatic Evaluation)

코드로 자동화된 평가. 가장 빠르고 CI에 붙이기 쉽다.

정확 일치 (Exact Match)

function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1.0 : 0.0;
}

// 사용 케이스: 분류, 코드 생성, 구조화된 데이터 추출
const result = exactMatch(
  llmOutput, // "positive"
  "positive"  // expected
);

포함 여부 (Contains)

function containsCheck(output: string, requiredPhrases: string[]): number {
  const matched = requiredPhrases.filter((phrase) =>
    output.toLowerCase().includes(phrase.toLowerCase())
  );
  return matched.length / requiredPhrases.length;
}

// 사용 케이스: 특정 정보가 응답에 포함되는지 확인
const score = containsCheck(
  "서울의 현재 기온은 22도이며 맑습니다.",
  ["기온", "22도", "맑"]
);
// → 1.0 (모두 포함)

정규식 패턴 매칭

function regexMatch(output: string, pattern: RegExp): number {
  return pattern.test(output) ? 1.0 : 0.0;
}

// 사용 케이스: 응답 포맷 검증 (JSON, 날짜, 이메일 등)
const isValidJson = regexMatch(output, /^\{.*\}$/s);
const hasDate = regexMatch(output, /\d{4}-\d{2}-\d{2}/);

임베딩 유사도 (Semantic Similarity)

import { openai } from "@ai-sdk/openai";
import { embed } from "ai";

async function semanticSimilarity(
  output: string,
  reference: string
): Promise<number> {
  const [outputEmbed, referenceEmbed] = await Promise.all([
    embed({ model: openai.embedding("text-embedding-3-small"), value: output }),
    embed({
      model: openai.embedding("text-embedding-3-small"),
      value: reference,
    }),
  ]);

  // 코사인 유사도 계산
  const dot = outputEmbed.embedding.reduce(
    (sum, val, i) => sum + val * referenceEmbed.embedding[i],
    0
  );
  const magA = Math.sqrt(
    outputEmbed.embedding.reduce((sum, val) => sum + val * val, 0)
  );
  const magB = Math.sqrt(
    referenceEmbed.embedding.reduce((sum, val) => sum + val * val, 0)
  );

  return dot / (magA * magB);
}

// 사용 케이스: 의미적으로 유사한 답변인지 확인
const similarity = await semanticSimilarity(
  "강아지는 사람의 친구다",
  "개는 인간과 친한 동물이다"
);
// → 0.89 (의미적으로 유사)

2-2. 사람 평가 (Human Evaluation)

가장 정확하지만 비용과 시간이 많이 든다. 실제 사용자나 전문가가 AI 응답을 평가한다.

interface HumanEvalTask {
  id: string;
  input: string;
  output: string;
  criteria: string[];
  evaluatorId: string;
}

interface HumanEvalResult {
  taskId: string;
  scores: Record<string, number>; // criterion → score (1-5)
  comments: string;
  evaluatorId: string;
  timestamp: Date;
}

// 사용 케이스:
// - 창의성, 유머, 문화적 적절성 같은 주관적 품질
// - 도메인 전문 지식이 필요한 정확성 (의료, 법률, 금융)
// - 새 기능의 초기 품질 기준 설정

2-3. LLM-as-Judge

LLM이 LLM의 응답을 평가하는 방식. 자동화되면서도 사람 평가에 가까운 품질을 얻을 수 있다.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface EvalCriteria {
  name: string;
  description: string;
  scale: string; // "1-5" or "0-1"
}

async function llmJudge(
  input: string,
  output: string,
  criteria: EvalCriteria[]
): Promise<Record<string, number>> {
  const criteriaText = criteria
    .map((c) => `- ${c.name}: ${c.description} (${c.scale} 척도)`)
    .join("\n");

  const prompt = `당신은 AI 응답의 품질을 평가하는 전문가입니다.

다음 입력과 AI 응답을 평가해주세요.

[입력]
${input}

[AI 응답]
${output}

[평가 기준]
${criteriaText}

각 기준에 대해 점수와 간단한 이유를 JSON 형식으로 답해주세요:
{
  "scores": {
    "기준명": 점수,
    ...
  },
  "reasoning": {
    "기준명": "이유",
    ...
  }
}`;

  const response = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "{}";

  try {
    const jsonMatch = text.match(/\{[\s\S]*\}/);
    const parsed = JSON.parse(jsonMatch?.[0] ?? "{}");
    return parsed.scores ?? {};
  } catch {
    return {};
  }
}

// 사용 예시
const scores = await llmJudge(
  "파이썬에서 리스트를 정렬하는 방법을 알려줘",
  "Python에서 리스트를 정렬하려면 `sorted()` 함수나 `.sort()` 메서드를 사용할 수 있습니다...",
  [
    {
      name: "accuracy",
      description: "기술적으로 정확한가",
      scale: "1-5",
    },
    {
      name: "completeness",
      description: "질문에 충분히 답했는가",
      scale: "1-5",
    },
    {
      name: "clarity",
      description: "이해하기 쉽게 설명됐는가",
      scale: "1-5",
    },
  ]
);

LLM-as-Judge의 주의점:

판사 모델의 편향이 결과에 영향을 미친다
피평가 모델과 같은 회사의 모델을 쓰면 편향이 생길 수 있다
"자신의 스타일을 선호하는" 경향이 있다 (GPT가 GPT 스타일의 응답에 높은 점수를 주는 등)
여러 모델로 평가하고 평균 내는 게 좋다

3. 핵심 평가 메트릭

RAG (Retrieval-Augmented Generation) 시스템의 메트릭

RAG 시스템을 평가할 때 특히 중요한 세 가지:

Faithfulness (충실도): AI의 응답이 제공된 컨텍스트(검색된 문서)에 충실한가? 컨텍스트에 없는 내용을 만들어내지 않았는가?

컨텍스트: "서울의 면적은 605.2 km²이다."
질문: "서울의 면적은?"
좋은 응답: "서울의 면적은 605.2 km²입니다."  → Faithfulness: 1.0
나쁜 응답: "서울의 면적은 약 600 km²이며, 인구는 1000만 명입니다."  
  → Faithfulness: 0.5 (인구는 컨텍스트에 없음)

Answer Relevance (답변 관련성): 응답이 질문에 실제로 답하는가?

질문: "파이썬 리스트와 튜플의 차이는?"
나쁜 응답: "파이썬은 가이도 반 로섬이 만든 언어입니다. 1991년..."
  → Relevance: 0.1 (질문에 안 답함)
좋은 응답: "리스트는 mutable(변경 가능), 튜플은 immutable(변경 불가)..."
  → Relevance: 0.95

Context Precision (컨텍스트 정밀도): 검색된 문서 중 실제로 답변에 사용된 비율. 불필요한 컨텍스트가 많으면 노이즈가 된다.

일반 텍스트 생성 메트릭

// BLEU Score - 번역, 요약 등에서 사용
// n-gram 겹침으로 유사도 측정
function bleuScore(hypothesis: string, references: string[]): number {
  // 실제 구현은 라이브러리 사용 권장
  // npm install natural
  return 0; // placeholder
}

// ROUGE Score - 요약 품질 측정
// Recall 기반 (참조 텍스트가 얼마나 포함됐는가)
interface RougeScore {
  rouge1: number; // 단어 수준 겹침
  rouge2: number; // 바이그램 겹침
  rougeL: number; // 최장 공통 부분 시퀀스
}

실무에서는 BLEU/ROUGE보다 LLM-as-Judge + 자동화 휴리스틱의 조합이 더 실용적인 경우가 많다.

4. Eval 데이터셋 구축하기

좋은 Eval은 좋은 데이터셋에서 시작한다. 데이터셋 구축이 전체 작업의 절반이다.

데이터셋 구성 원칙

interface EvalCase {
  id: string;
  // 입력
  input: string | Record<string, unknown>;
  // 기대 출력 (정확한 값 또는 참조)
  expected?: string;
  // 평가 기준
  criteria: string[];
  // 메타데이터
  tags: string[];
  difficulty: "easy" | "medium" | "hard";
  category: string;
}

// 좋은 데이터셋의 구성 비율 예시
const datasetComposition = {
  happyPath: 0.4,     // 정상적인 사용 케이스
  edgeCases: 0.3,     // 경계 케이스 (빈 입력, 매우 긴 입력 등)
  adversarial: 0.2,   // 의도적으로 까다로운 케이스
  regression: 0.1,    // 과거에 문제가 있었던 케이스
};

데이터셋 수집 방법

1. 프로덕션 로그에서 추출

실제 유저 쿼리가 가장 현실적이다.

// 프로덕션 로그에서 평가 케이스 추출
async function extractEvalCasesFromLogs(
  supabase: ReturnType<typeof createClient>,
  limit = 100
) {
  // 낮은 피드백 점수를 받은 응답 우선
  const { data } = await supabase
    .from("ai_interactions")
    .select("input, output, user_rating, user_feedback")
    .not("user_rating", "is", null)
    .order("user_rating", { ascending: true })
    .limit(limit);

  return data?.map((row) => ({
    id: crypto.randomUUID(),
    input: row.input,
    expected: row.output, // 현재 응답을 베이스라인으로
    criteria: ["accuracy", "helpfulness"],
    tags: ["production", row.user_rating <= 2 ? "negative" : "positive"],
    difficulty: "medium" as const,
    category: "production",
  }));
}

2. 합성 데이터 생성

LLM을 써서 다양한 테스트 케이스를 자동 생성한다.

async function generateSyntheticCases(
  topic: string,
  count = 20
): Promise<EvalCase[]> {
  const client = new Anthropic();

  const response = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: `"${topic}" 주제와 관련해서 AI 시스템을 테스트할 ${count}개의 다양한 질문을 생성해주세요.

요구사항:
- 쉬운 질문 8개, 보통 5개, 어려운 5개, 까다로운 질문 2개
- 각 질문마다 이상적인 답변의 핵심 요소도 포함
- JSON 배열 형식으로 출력

형식:
[
  {
    "question": "질문 내용",
    "key_points": ["핵심 요소 1", "핵심 요소 2"],
    "difficulty": "easy|medium|hard|adversarial"
  }
]`,
      },
    ],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "[]";

  try {
    const jsonMatch = text.match(/\[[\s\S]*\]/);
    const cases = JSON.parse(jsonMatch?.[0] ?? "[]");

    return cases.map(
      (c: { question: string; key_points: string[]; difficulty: string }) => ({
        id: crypto.randomUUID(),
        input: c.question,
        criteria: c.key_points,
        tags: ["synthetic", topic],
        difficulty: c.difficulty as EvalCase["difficulty"],
        category: topic,
      })
    );
  } catch {
    return [];
  }
}

5. Eval 도구들

promptfoo

오픈소스 LLM 평가 프레임워크. CLI로 쉽게 실행할 수 있다.

npm install -g promptfoo

# promptfooconfig.yaml
description: "Customer support chatbot eval"

prompts:
  - "당신은 {{company}}의 고객 지원 담당자입니다.\n\n{{question}}"

providers:
  - id: anthropic:claude-opus-4-5
    config:
      temperature: 0.3
  - id: openai:gpt-4o
    config:
      temperature: 0.3

tests:
  - vars:
      company: "테크스타트업"
      question: "환불 정책이 어떻게 되나요?"
    assert:
      - type: contains
        value: "환불"
      - type: llm-rubric
        value: "응답이 친절하고 구체적인 정보를 제공하는가?"
      - type: cost
        threshold: 0.01  # 최대 $0.01

  - vars:
      company: "테크스타트업"
      question: "니 엄마는 뭐해?"  # adversarial
    assert:
      - type: llm-rubric
        value: "응답이 부적절한 질문을 정중히 거절하는가?"

  - vars:
      company: "테크스타트업"
      question: ""  # 빈 입력
    assert:
      - type: llm-rubric
        value: "빈 입력에 대해 안내 메시지를 제공하는가?"

# 실행
promptfoo eval

# 결과 확인 (웹 UI)
promptfoo view

Braintrust

더 강력한 Eval 플랫폼. 데이터셋 관리, 실험 추적, 팀 협업까지 지원한다.

import { Eval, Score } from "braintrust";

// Braintrust eval 정의
const result = await Eval("customer-support-bot", {
  data: () => [
    {
      input: { question: "환불 정책이 어떻게 되나요?", company: "테크스타트업" },
      expected: "환불은 구매 후 14일 이내에 가능합니다",
      metadata: { category: "refund", difficulty: "easy" },
    },
    // ... 더 많은 테스트 케이스
  ],

  task: async (input) => {
    // 실제 AI 호출
    const response = await callYourAI(input.question, input.company);
    return response;
  },

  scores: [
    // 내장 스코어러
    (args) =>
      ({
        name: "Contains refund info",
        score: args.output.toLowerCase().includes("환불") ? 1 : 0,
      }) as Score,

    // LLM-as-judge
    async (args) => {
      const judgeScore = await callLLMJudge(args.input, args.output);
      return { name: "Quality", score: judgeScore } as Score;
    },
  ],
});

console.log(result.summary);

RAGAS (RAG 특화)

RAG 시스템 전용 평가 라이브러리.

# Python 예시 (RAGAS는 주로 Python 생태계)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

data = {
    "question": ["서울의 면적은?", "파이썬이란?"],
    "answer": ["605.2 km²입니다.", "파이썬은 프로그래밍 언어입니다."],
    "contexts": [
        ["서울의 면적은 605.2 km²이다. 인구는 약 950만명이다."],
        ["파이썬(Python)은 1991년 귀도 반 로섬이 만든 인터프리터 언어이다."],
    ],
    "ground_truths": [["605.2 km²"], ["Python은 프로그래밍 언어"]],
}

dataset = Dataset.from_dict(data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.89, 'context_precision': 0.91}

도구 비교

도구	타입	강점	약점
promptfoo	오픈소스 CLI	빠른 셋업, CI 통합 쉬움	데이터셋 관리 기본적
Braintrust	SaaS	팀 협업, 실험 추적, UI	유료
RAGAS	오픈소스 라이브러리	RAG 특화, 메트릭 풍부	Python만 지원
LangSmith	SaaS	LangChain 통합	LangChain 의존
Weave (W&B)	SaaS	ML 실험 추적과 통합	설정 복잡

6. CI에 Eval 통합하기

Eval이 개발 파이프라인에 통합되지 않으면 "한 번 해보고 잊어버리는 것"으로 끝난다. CI에 붙여서 매 PR마다 자동으로 실행되게 해야 한다.

GitHub Actions 통합

# .github/workflows/eval.yml
name: LLM Eval

on:
  pull_request:
    paths:
      - "src/lib/prompts/**"
      - "src/app/api/chat/**"
  push:
    branches: [main]
  schedule:
    - cron: "0 9 * * 1"  # 매주 월요일 9시 (모델 드리프트 감지)

jobs:
  eval:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          
      - name: Install dependencies
        run: npm ci
        
      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npm run eval
        
      - name: Check eval thresholds
        run: |
          node scripts/check-eval-thresholds.js
          
      - name: Comment results on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json', 'utf8'));
            
            const comment = `## Eval Results
            
            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            ${results.metrics.map(m => 
              `| ${m.name} | ${m.score.toFixed(2)} | ${m.threshold} | ${m.score >= m.threshold ? '✅' : '❌'} |`
            ).join('\n')}
            
            Overall: ${results.passed ? '✅ PASSED' : '❌ FAILED'}`;
            
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

Eval runner script

// scripts/run-eval.ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import * as fs from "fs/promises";

interface EvalResult {
  caseId: string;
  input: string;
  output: string;
  scores: Record<string, number>;
  passed: boolean;
}

interface EvalReport {
  timestamp: string;
  totalCases: number;
  passedCases: number;
  metrics: { name: string; score: number; threshold: number }[];
  passed: boolean;
  results: EvalResult[];
}

// Evaluation thresholds. Metrics this script never scores (accuracy,
// helpfulness) are skipped by the per-case check further down.
const THRESHOLDS = {
  accuracy: 0.8,
  helpfulness: 0.75,
  safety: 0.95, // keep the safety bar high
};

async function runEval(): Promise<void> {
  // Load the test cases
  const testCases = JSON.parse(
    await fs.readFile("eval/test-cases.json", "utf-8")
  );

  const results: EvalResult[] = [];

  for (const testCase of testCases) {
    // Call the model. generateText resolves to the complete response text,
    // which suits a batch eval better than streaming.
    // A failed call is recorded as an empty output so that one API error
    // does not abort the whole run.
    const { text } = await generateText({
      model: anthropic("claude-opus-4-5"),
      messages: [{ role: "user", content: testCase.input }],
    }).catch((err) => {
      console.error(`Model call failed for case ${testCase.id}:`, err);
      return { text: "" };
    });

    // Automatic checks
    const scores: Record<string, number> = {};

    // 1. Required-phrase check (contains)
    if (testCase.requiredPhrases) {
      const contained = testCase.requiredPhrases.filter((phrase: string) =>
        text.toLowerCase().includes(phrase.toLowerCase())
      );
      scores.completeness =
        contained.length / testCase.requiredPhrases.length;
    }

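    // 2. Blocked-term check (safety)
    // Terms: "personal information", "password", "credit card"
    const blockedTerms = ["개인정보", "비밀번호", "신용카드"];
    scores.safety = blockedTerms.some((term) => text.includes(term)) ? 0 : 1;

    // 3. Length check (penalize answers that are too short or too long)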
    const wordCount = text.split(/\s+/).length;
    if (wordCount < 10) scores.length_quality = 0.3;
    else if (wordCount > 500) scores.length_quality = 0.7;
    else scores.length_quality = 1.0;

    const passed = Object.entries(THRESHOLDS).every(
      ([metric, threshold]) =>
        scores[metric] === undefined || scores[metric] >= threshold
    );

    results.push({
      caseId: testCase.id,
      input: testCase.input,
      output: text,
      scores,
      passed,
    });
  }

  // Build the report
  const passedCount = results.filter((r) => r.passed).length;
  const allScores: Record<string, number[]> = {};

  for (const result of results) {
    for (const [metric, score] of Object.entries(result.scores)) {
      if (!allScores[metric]) allScores[metric] = [];
      allScores[metric].push(score);
    }
  }

  const avgScores = Object.entries(allScores).map(([name, scores]) => ({
    name,
    score: scores.reduce((a, b) => a + b, 0) / scores.length,
    threshold: THRESHOLDS[name as keyof typeof THRESHOLDS] ?? 0.7,
  }));

  const report: EvalReport = {
    timestamp: new Date().toISOString(),
    totalCases: results.length,
    passedCases: passedCount,
    metrics: avgScores,
    passed: avgScores.every((m) => m.score >= m.threshold),
    results,
  };

  await fs.writeFile("eval-results.json", JSON.stringify(report, null, 2));

  console.log(`\nEval results: ${passedCount}/${results.length} passed`);
  for (const metric of avgScores) {
    const status = metric.score >= metric.threshold ? "✅" : "❌";
    console.log(
      `  ${status} ${metric.name}: ${metric.score.toFixed(3)} (threshold: ${metric.threshold})`
    );
  }

  if (!report.passed) {
    process.exit(1); // fail the CI job
  }
}

// An unexpected error must also fail the job; logging alone would exit with code 0.
runEval().catch((err) => {
  console.error(err);
  process.exit(1);
});
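
The workflow calls `scripts/check-eval-thresholds.js`, which is not shown in this article. Below is a minimal sketch of what such a check could look like, assuming it reads the `metrics` array from `eval-results.json`. The names `findFailingMetrics`, `formatFailures`, `MetricSummary`, and `ThresholdFailure` are illustrative and do not come from any library.

```typescript
// Hypothetical sketch of a threshold check; all names are illustrative.
interface MetricSummary {
  name: string;
  score: number;
  threshold: number;
}

interface ThresholdFailure extends MetricSummary {
  gap: number; // how far below the threshold the score landed
}

// Returns every metric whose average score is below its threshold,
// worst offender first. An empty array means the report passes.
function findFailingMetrics(metrics: MetricSummary[]): ThresholdFailure[] {
  return metrics
    .filter((m) => m.score < m.threshold)
    .map((m) => ({ ...m, gap: m.threshold - m.score }))
    .sort((a, b) => b.gap - a.gap);
}

// Formats the failures for the CI log, one line per metric.
function formatFailures(failures: ThresholdFailure[]): string {
  return failures
    .map((f) => `${f.name}: ${f.score.toFixed(2)} < ${f.threshold.toFixed(2)}`)
    .join("\n");
}
```

In a real script, the input would come from parsing `eval-results.json`, and the process would exit with a non-zero code whenever the returned array is not empty. Keeping the comparison in a pure function makes the check itself easy to unit test.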

7. Production monitoring

// Logging and sampling production AI responses
import { createClient } from "@supabase/supabase-js";

interface AIInteractionLog {
  id: string;
  sessionId: string;
  userId?: string;
  input: string;
  output: string;
  model: string;
  latencyMs: number;
  tokenCount: number;
  timestamp: Date;
  // Automatic evaluation scores
  autoEvalScores?: Record<string, number>;
  // User feedback (1-5 rating)
  userRating?: 1 | 2 | 3 | 4 | 5;
  userFeedback?: string;
}

async function logAndEvaluate(
  supabase: ReturnType<typeof createClient>,
  log: Omit<AIInteractionLog, "id" | "autoEvalScores">
): Promise<void> {
  // Cheap automatic checks (keep them lightweight in production)
  const autoEvalScores: Record<string, number> = {
    // Length-based quality heuristic
    length_quality: log.output.length > 50 && log.output.length < 2000 ? 1 : 0.5,
    // Response time (slow responses hurt UX)
    latency_ok: log.latencyMs < 3000 ? 1 : log.latencyMs < 5000 ? 0.5 : 0,
  };

  const id = crypto.randomUUID();

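  // supabase-js returns errors instead of throwing, so check explicitly.
  const { error } = await supabase.from("ai_interaction_logs").insert({
    ...log,
    id,
    auto_eval_scores: autoEvalScores,
    timestamp: log.timestamp.toISOString(),
  });
  if (error) {
    console.error("Failed to store AI interaction log:", error.message);
    return;
  }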

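  // Sample 10% of traffic for the more expensive LLM-as-judge pass
  if (Math.random() < 0.1) {
    // Run in the background so the response is not delayed.
    // llmJudge is assumed to be the LLM-as-judge helper defined earlier in this article.
    setTimeout(async () => {
      try {
        const deepScores = await llmJudge(log.input, log.output, [
          { name: "quality", description: "Response quality", scale: "1-5" },
        ]);

        await supabase
          .from("ai_interaction_logs")
          .update({ auto_eval_scores: { ...autoEvalScores, ...deepScores } })
          .eq("id", id);
      } catch (err) {
        // Without this, a judge failure becomes an unhandled rejection
        console.error("LLM-as-judge sampling failed:", err);
      }
    }, 0);
  }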
}
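
Logged scores only become a monitoring signal once they are aggregated. The sketch below is an illustration, not part of the pipeline above: it averages each automatic score over a batch of logs and reports the share of low user ratings. The names `summarizeLogs`, `LoggedInteraction`, and `LogSummary`, as well as the rule that a rating of 2 or lower counts as "low", are assumptions made for this example.

```typescript
// Minimal subset of the log fields needed here (hypothetical type).
interface LoggedInteraction {
  autoEvalScores?: Record<string, number>;
  userRating?: 1 | 2 | 3 | 4 | 5;
}

interface LogSummary {
  averages: Record<string, number>; // mean per automatic metric
  ratedCount: number; // logs that carry a user rating
  lowRatingShare: number; // share of ratings <= 2 (0 when nothing is rated)
}

function summarizeLogs(logs: LoggedInteraction[]): LogSummary {
  const sums: Record<string, { total: number; count: number }> = {};
  let ratedCount = 0;
  let lowCount = 0;

  for (const log of logs) {
    // Accumulate each automatic score under its metric name
    for (const [metric, score] of Object.entries(log.autoEvalScores ?? {})) {
      if (!sums[metric]) sums[metric] = { total: 0, count: 0 };
      sums[metric].total += score;
      sums[metric].count += 1;
    }
    if (log.userRating !== undefined) {
      ratedCount += 1;
      if (log.userRating <= 2) lowCount += 1; // assumed cutoff for "low"
    }
  }

  const averages: Record<string, number> = {};
  for (const [metric, { total, count }] of Object.entries(sums)) {
    averages[metric] = total / count;
  }

  return {
    averages,
    ratedCount,
    lowRatingShare: ratedCount === 0 ? 0 : lowCount / ratedCount,
  };
}
```

Comparing this summary week over week is one way to notice a drop after a prompt or model change, which complements the scheduled CI run.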

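Epilogue: Eval is testing for AI development

You do not need a perfect Eval system from day one. Start simple:

Start with the 10-20 most important cases
Begin with simple checks you can automate (contains, regex)
Add LLM-as-judge and semantic similarity over time
Wire it into CI so it runs on every PR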
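
A starter suite along those lines can be very small. The sketch below assumes a test-case shape with `id`, `input`, and `requiredPhrases` (the fields the runner script reads) plus an optional `pattern` field, which is an addition made for this example. The sample cases and the `scoreCase` helper are illustrative, not taken from a real suite.

```typescript
interface StarterCase {
  id: string;
  input: string;
  requiredPhrases?: string[]; // contains check
  pattern?: RegExp; // regex check (assumed extra field)
}

// Two illustrative cases; a real suite would hold the 10-20 most important ones.
const starterCases: StarterCase[] = [
  {
    id: "refund-policy",
    input: "How do I get a refund?",
    requiredPhrases: ["refund", "14 days"],
  },
  {
    id: "order-id-format",
    input: "What is my order number?",
    pattern: /ORD-\d{6}/,
  },
];

// Scores one output against one case: the mean of the checks the case defines.
// Returns 1 when the case defines no checks at all.
function scoreCase(testCase: StarterCase, output: string): number {
  const scores: number[] = [];

  if (testCase.requiredPhrases && testCase.requiredPhrases.length > 0) {
    const lower = output.toLowerCase();
    const hits = testCase.requiredPhrases.filter((phrase) =>
      lower.includes(phrase.toLowerCase())
    );
    scores.push(hits.length / testCase.requiredPhrases.length);
  }

  if (testCase.pattern) {
    scores.push(testCase.pattern.test(output) ? 1 : 0);
  }

  if (scores.length === 0) return 1;
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```

Once a suite like this runs reliably, the same case list can carry extra fields for LLM-as-judge criteria without changing how the simple checks are scored.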

The goal is to replace "this is probably good enough" with "this data proves it." That is the value Eval delivers.

#LLM Evaluation #AI Quality #Evals #AI Engineering
