LLM APIs in Practice: Building Features with OpenAI and Anthropic

When I first integrated an LLM API, my token costs hit $50 in a week. Every time a user uploaded a long document, GPT-4 was reading the entire thing. I didn't know about prompt caching, wasn't using streaming, and users stared at blank screens for 10 seconds. That's when I realized: LLM APIs aren't like REST APIs.

Here's what I learned using OpenAI and Anthropic APIs in production. Token cost optimization, streaming responses, structured output, function calling. The practical tips that aren't in the documentation.

What Actually Happens in an API Call

LLM API calls look like HTTP requests, but something completely different happens inside. A normal REST API makes one DB query and you're done. An LLM generates tokens one by one through hundreds of billions of parameters.

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function summarizeText(text: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are an assistant that provides concise summaries.'
      },
      {
        role: 'user',
        content: `Summarize this in 3 sentences:\n\n${text}`
      }
    ],
    temperature: 0.3,
    max_tokens: 200
  });

  return response.choices[0].message.content;
}

The problem is that the await resolves only after the entire response has been generated. When a user uploaded a 5000-character document, they waited about 15 seconds for the model to produce 200 tokens. This was my first summarization feature. Users watched the loading spinner and left.

The key is streaming. LLMs generate tokens one by one, but we don't need to wait for all of them. Like streaming a movie, show them as they're generated.

async function summarizeTextStreaming(text: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are an assistant that provides concise summaries.'
      },
      {
        role: 'user',
        content: `Summarize this in 3 sentences:\n\n${text}`
      }
    ],
    temperature: 0.3,
    max_tokens: 200,
    stream: true
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    if (content) {
      process.stdout.write(content); // or send to client
    }
  }
}

The perceived speed changes completely. The first token arrives in 1-2 seconds, and users watch the text typing out. This is why ChatGPT uses streaming.
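
In practice you usually want both: push each piece to the UI as it arrives, and still keep the full text for logging or caching. Here is a minimal sketch of that pattern. The `collectStream` helper and the trimmed-down `DeltaChunk` type are hypothetical names for illustration, and the type only mirrors the fields the loop above reads.

```typescript
// Simplified chunk shape: only the fields read by the streaming loop.
// (Assumption: real SDK chunks carry more fields; this sketch ignores them.)
interface DeltaChunk {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Hypothetical helper: forwards each text delta to `onDelta` as it arrives
// and returns the concatenated text once the stream ends.
async function collectStream(
  stream: AsyncIterable<DeltaChunk>,
  onDelta: (text: string) => void
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content ?? '';
    if (content) {
      onDelta(content); // e.g. write to an SSE or WebSocket connection
      full += content;
    }
  }
  return full;
}
```

Because the helper only depends on an async iterable, it can be exercised with a fake generator in tests, without calling the API.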

Preventing Token Cost Explosions

The biggest trap with LLM APIs is token costs. For GPT-4o, input tokens cost $2.5 per million and output tokens cost $10 per million. I didn't think much of it at first, but as users grew, costs climbed fast.
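
To make those per-million prices concrete, here is a rough back-of-the-envelope sketch. The `Pricing` type and `estimateCostUSD` helper are hypothetical, the prices are the ones quoted above (check current pricing before relying on them), and the traffic numbers are made up for illustration.

```typescript
// Prices in USD per one million tokens, as quoted in this article.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const GPT_4O_PRICING: Pricing = { inputPerMillion: 2.5, outputPerMillion: 10 };

// Hypothetical helper: cost of a single request in USD.
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  pricing: Pricing
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMillion +
    (outputTokens / 1_000_000) * pricing.outputPerMillion
  );
}

// Illustrative scenario: a 20,000-token document summarized into 200 tokens,
// 1,000 times a day.
const perRequest = estimateCostUSD(20_000, 200, GPT_4O_PRICING);
const perDay = perRequest * 1_000;
console.log(perRequest.toFixed(4), perDay.toFixed(2));
```

In this scenario each request costs about 5 cents, almost all of it input, which works out to roughly $52 a day. The long document, not the short summary, is what drives the bill.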

First realization: tokens aren't words. In English, 1 word ≈ 1.3 tokens, but in Korean, 1 character can cost 2-3 tokens, depending on the model's tokenizer. "안녕하세요" (hello) is 5 characters but can come out at over 10 tokens. Korean services can end up paying 2-3x more in token costs than English ones.

Second realization: model selection determines 80% of costs. GPT-4 is powerful but expensive. Simple classification or summarization works fine with GPT-4o-mini. I used GPT-4 for everything, then split models by feature and cut costs by 70%.

const MODEL_SELECTION = {
  // Tasks requiring complex reasoning
  complex: 'gpt-4o',
  // Simple classification, summarization
  simple: 'gpt-4o-mini',
  // Batch processing
  batch: 'gpt-4o-mini'
};

async function classifyFeedback(text: string) {
  // Simple tasks like sentiment classification work fine with mini
  const response = await openai.chat.completions.create({
    model: MODEL_SELECTION.simple,
    messages: [
      {
        role: 'system',
        content: 'Classify user feedback as positive/negative/neutral.'
      },
      {
        role: 'user',
        content: text
      }
    ],
    temperature: 0, // Classification needs consistency
    max_tokens: 10  // One word is enough
  });

  return response.choices[0].message.content;
}

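Third realization: prompt caching is a game changer. Anthropic's Claude API lets you mark long system prompts or documents for caching. Writing to the cache costs slightly more than normal input, but from the second call onward the cached tokens are billed at about 10% of the normal input price, a 90% drop. The cached prefix has to meet a minimum length (1024 tokens for Claude 3.5 Sonnet), so very short prompts are not cached at all.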

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function analyzeCodeWithCache(code: string) {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: 'You are a code review expert. Follow these guidelines...',
        cache_control: { type: 'ephemeral' } // Cache for 5 minutes
      }
    ],
    messages: [
      {
        role: 'user',
        content: `Review this code:\n\n${code}`
      }
    ]
  });

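  // content is a union of block types; only text blocks carry .text
  const first = response.content[0];
  return first?.type === 'text' ? first.text : '';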
}

If your system prompt is 2000 tokens, without caching you pay for 2000 input tokens on every request. With caching, the first request pays a small premium to write the cache, then later requests within the cache lifetime pay the equivalent of only about 200. Essential for services that send the same long prefix over and over.
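
The caching arithmetic can be sketched as follows. The 1.25x write and 0.1x read multipliers are assumptions for this sketch based on how Anthropic has priced caching, so confirm them against current pricing. The helper names are hypothetical.

```typescript
// Assumed multipliers relative to the base input price: writing a prefix to
// the cache costs 1.25x, reading it back costs 0.1x.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Billable input-token equivalents for `requests` calls that share one
// cached prefix of `prefixTokens`, all made within the cache lifetime.
function cachedPrefixTokens(prefixTokens: number, requests: number): number {
  if (requests <= 0) return 0;
  const firstCall = prefixTokens * CACHE_WRITE_MULTIPLIER;
  const laterCalls = prefixTokens * CACHE_READ_MULTIPLIER * (requests - 1);
  return firstCall + laterCalls;
}

// The same prefix billed at full price on every call.
function uncachedPrefixTokens(prefixTokens: number, requests: number): number {
  return prefixTokens * Math.max(requests, 0);
}

// Illustrative scenario: a 2,000-token prefix reused across 100 calls.
const withCache = cachedPrefixTokens(2_000, 100);
const withoutCache = uncachedPrefixTokens(2_000, 100);
console.log({ withCache, withoutCache });
```

Under these assumptions, 100 calls sharing a 2,000-token prefix are billed for roughly 22,300 token-equivalents instead of 200,000. A single call is slightly more expensive with caching than without, so it only pays off when the prefix really is reused before the cache expires.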

What Temperature and Max Tokens Really Mean

At first I didn't know what temperature was, so I just left it at 0.7. Results varied every time and I was confused. Temperature isn't "creativity," it's randomness in the probability distribution.

When selecting the next token, LLMs create a probability distribution. Like "hello" followed by "there" 80%, "friend" 15%, "darkness" 5%. Temperature determines how flat to make this distribution.

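temperature = 0: Always picks the highest probability token. Results are nearly deterministic, though providers don't guarantee identical output on every call.
temperature = 1: Samples from the probability distribution as-is. Varied results.
temperature = 2: Flattens the distribution for more randomness. This is OpenAI's maximum; Anthropic's range stops at 1.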
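
To see the flattening in numbers, here is a small sketch that rescales the example distribution above. Dividing logits by T before softmax is equivalent to raising each probability to the power 1/T and renormalizing, which is what the hypothetical `applyTemperature` helper does. This illustrates the math only; it is not how any particular provider implements sampling.

```typescript
// Rescales a next-token distribution by temperature.
// Equivalent to dividing logits by T before softmax: p_i^(1/T), renormalized.
function applyTemperature(probs: number[], temperature: number): number[] {
  if (temperature === 0) {
    // Limit case: all probability mass on the most likely token.
    const best = probs.indexOf(Math.max(...probs));
    return probs.map((_, i) => (i === best ? 1 : 0));
  }
  const scaled = probs.map(p => Math.pow(p, 1 / temperature));
  const total = scaled.reduce((sum, p) => sum + p, 0);
  return scaled.map(p => p / total);
}

// The example distribution from the text: "there", "friend", "darkness".
const base = [0.8, 0.15, 0.05];
for (const t of [0, 0.5, 1, 2]) {
  console.log(t, applyTemperature(base, t).map(p => p.toFixed(3)));
}
```

At temperature 0.5 the 80% token climbs to about 96%, and at temperature 2 it falls to roughly 59% while "darkness" rises from 5% to about 15%. That is the whole effect: the same ranking, with more or less weight on the long tail.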

// Data extraction/classification - consistency matters
async function extractStructuredData(text: string) {
  return await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0, // As consistent as the API allows
    messages: [...]
  });
}

// Creative content - variety matters
async function generateBlogIdeas(topic: string) {
  return await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.9, // Diverse ideas
    messages: [...]
  });
}

max_tokens is your safety net against cost explosions. At first I didn't set it, and GPT-4o generated a 5000-token essay, costing me 5 cents per request. Now I always set max_tokens. Keep in mind that it is a hard cutoff: output that hits the limit is simply cut off mid-sentence.
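
Since max_tokens stops generation wherever the limit falls, it helps to detect when that happened. OpenAI reports a finish_reason of 'length' in that case. The sketch below uses a hypothetical `readCompletion` helper and a trimmed-down `CompletionLike` type that only covers the fields it reads.

```typescript
// Minimal shape of the fields this check reads from a chat completion.
interface CompletionLike {
  choices: Array<{
    finish_reason: string | null;
    message: { content: string | null };
  }>;
}

// Returns the text plus a flag telling the caller the output hit max_tokens.
function readCompletion(response: CompletionLike): { text: string; truncated: boolean } {
  const choice = response.choices[0];
  return {
    text: choice?.message.content ?? '',
    // 'length' means generation stopped at the token limit, not naturally.
    truncated: choice?.finish_reason === 'length',
  };
}
```

What to do with a truncated result depends on the feature: a summary might be shown with an ellipsis, while truncated JSON should be treated as a failure and retried with a higher limit.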

Structured Output: JSON Mode and Function Calling

LLM responses are text. Parsing "This is positive" is a nightmare. Typos, line breaks, different phrasings... I wanted JSON.

First attempt: "Respond in JSON" in the prompt

// ❌ Unreliable
const badResponse = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: 'Analyze this review and respond in JSON: {"sentiment": "positive/negative", "score": 1-5}'
    }
  ]
});
// Result: "Sure! Here's the analysis:\n```json\n{...}\n```"

GPT-4 wraps it in code blocks or adds explanations. Parsing became a mess.

Solution: JSON Mode (OpenAI) or Tool Use (Anthropic)

// OpenAI JSON Mode
async function analyzeSentiment(review: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: 'Respond only in JSON format.'
      },
      {
        role: 'user',
        content: `Analyze review (sentiment: positive/negative/neutral, score: 1-5, summary: string):\n${review}`
      }
    ]
  });

  // content can be null (e.g. on a refusal), so fall back to an empty object
  return JSON.parse(response.choices[0].message.content ?? '{}');
}

// Anthropic Tool Use (more powerful)
async function extractProductInfo(description: string) {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    tools: [
      {
        name: 'save_product',
        description: 'Save product information',
        input_schema: {
          type: 'object',
          properties: {
            name: { type: 'string', description: 'Product name' },
            price: { type: 'number', description: 'Price' },
            category: { type: 'string', description: 'Category' },
            features: { type: 'array', items: { type: 'string' } }
          },
          required: ['name', 'price', 'category']
        }
      }
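    ],
    // Force the model to call this tool instead of answering in prose
    tool_choice: { type: 'tool', name: 'save_product' },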
    messages: [
      {
        role: 'user',
        content: `Extract product info:\n${description}`
      }
    ]
  });

  const toolUse = response.content.find(block => block.type === 'tool_use');
  return toolUse?.input;
}

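Tool use is Anthropic's version of function calling. The LLM returns structured data in the form of "calling" a function, so the result arrives as an already-parsed object instead of text you have to scrape. The model follows the schema closely, which makes it more stable than plain JSON mode in practice, but the schema is not strictly enforced, so validate the result before using it.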
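
Whatever guarantees the API gives, a runtime check before trusting the extracted object is cheap insurance. Below is a minimal hand-written type guard for the save_product schema above. The `ProductInfo` and `isProductInfo` names are hypothetical, and a schema library would do the same job with less code.

```typescript
// Mirrors the input_schema of the save_product tool.
interface ProductInfo {
  name: string;
  price: number;
  category: string;
  features?: string[];
}

// Narrowing check: true only if every required field has the right type.
function isProductInfo(value: unknown): value is ProductInfo {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const featuresOk =
    v.features === undefined ||
    (Array.isArray(v.features) && v.features.every(f => typeof f === 'string'));
  return (
    typeof v.name === 'string' &&
    typeof v.price === 'number' &&
    Number.isFinite(v.price) &&
    typeof v.category === 'string' &&
    featuresOk
  );
}
```

Used on the value returned by extractProductInfo, the guard turns "probably the right shape" into something the compiler and the rest of the code can rely on. A failed check is a good trigger for a retry.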

Error Handling: Rate Limits, Timeouts, Retries

LLM APIs fail. Often. You hit rate limits, servers slow down, networks drop. At first I just used a single try-catch, but it kept breaking in production.

Core strategy: Exponential Backoff + Retry

async function callLLMWithRetry<T>(
  apiCall: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await apiCall();
    } catch (error: any) {
      const isLastRetry = i === maxRetries - 1;

      // Rate limit - wait and retry
      if (error?.status === 429) {
        if (isLastRetry) throw error;
        const waitTime = Math.pow(2, i) * 1000; // 1s, 2s, 4s
        console.log(`Rate limited. Waiting ${waitTime}ms...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));
        continue;
      }

      // Server error (500-599) - retry
      if (error?.status >= 500 && error?.status < 600) {
        if (isLastRetry) throw error;
        await new Promise(resolve => setTimeout(resolve, 1000));
        continue;
      }

      // Client error (400-499) - don't retry
      throw error;
    }
  }

  throw new Error('Max retries exceeded');
}

// Usage
const result = await callLLMWithRetry(() =>
  openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [...]
  })
);
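
One refinement worth considering: when many requests get rate limited at the same moment, fixed waits of 1s, 2s, 4s make them all retry at the same moment too. Adding random jitter spreads them out. The sketch below is one common variant ("full jitter") with a cap. The `backoffDelayMs` name and default values are assumptions for illustration.

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based).
// Exponential growth capped at `capMs`, then "full jitter": a uniform random
// value in [0, cappedDelay) so that many clients do not retry in lockstep.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random // injectable for tests
): number {
  const exponential = baseMs * Math.pow(2, attempt);
  const capped = Math.min(exponential, capMs);
  return Math.floor(random() * capped);
}
```

In callLLMWithRetry this would replace the Math.pow(2, i) * 1000 expression with backoffDelayMs(i). The cap keeps a long retry chain from sleeping for minutes.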

Timeouts are essential too. GPT-4 sometimes takes over 30 seconds. Users won't wait that long.

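async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), ms);
  });
  try {
    // Note: this stops waiting, but does not cancel the underlying request
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer once the race settles
  }
}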

// 10 second limit
const result = await withTimeout(
  openai.chat.completions.create({...}),
  10000
);
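
A race-based timeout stops the caller from waiting, but the request itself keeps running in the background, and you may still be billed for it. Passing an AbortSignal lets the request actually stop. Here is a minimal sketch; the `withAbortTimeout` name is hypothetical.

```typescript
// Hypothetical helper: runs `task` with an AbortSignal that fires after `ms`.
// The task is told to stop, and the timer is always cleared afterwards.
async function withAbortTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

As far as I know, the OpenAI Node SDK accepts a signal in the per-request options (the second argument to create) and also has its own timeout option, so the task could be as small as signal => openai.chat.completions.create(params, { signal }). Verify both against the SDK version you are running.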

OpenAI vs Anthropic: Real-World Comparison

I've used both APIs. Each has clear pros and cons.

OpenAI Pros:

GPT-4o is fast and cheap (input $2.5/1M, output $10/1M)
JSON mode is simple
Multimodal APIs like Whisper, DALL-E are integrated
Rich documentation

OpenAI Cons:

Prompt caching is automatic, but the discount is smaller (around 50% on cached input for GPT-4o, versus about 90% on Anthropic)
128k context window, smaller than Claude
Strict rate limits (especially free tier)

Anthropic Pros:

Claude 3.5 Sonnet's coding/reasoning beats GPT-4o
Prompt caching cuts the cost of cached input tokens by about 90%
200k context window
Powerful tool use (more stable than JSON mode)

Anthropic Cons:

More expensive (input $3/1M, output $15/1M)
Streaming implementation more complex than OpenAI
Smaller ecosystem (fewer plugins, tools)

My choice:

Simple classification/summarization: OpenAI GPT-4o-mini
Complex reasoning/coding: Anthropic Claude 3.5 Sonnet
Features with many repeated calls: Anthropic (for prompt caching)
Batch processing: OpenAI (for price)

Real Example: Smart FAQ Auto-Generation

Theory applied to practice. I built a feature that collects customer inquiries and automatically generates FAQs.

interface FAQ {
  question: string;
  answer: string;
  category: string;
  priority: number;
}

async function generateFAQ(customerQueries: string[]): Promise<FAQ[]> {
  // Step 1: Classify queries (use cheap model)
  const categorized = await Promise.all(
    customerQueries.map(query =>
      callLLMWithRetry(() =>
        openai.chat.completions.create({
          model: 'gpt-4o-mini',
          temperature: 0,
          response_format: { type: 'json_object' },
          messages: [
            {
              role: 'system',
              content: 'Respond in JSON: {category: string, priority: number}'
            },
            {
              role: 'user',
              content: `Classify this inquiry:\n${query}`
            }
          ]
        })
      )
    )
  );

  // Step 2: Group by category
  const grouped = categorized.reduce((acc, item, index) => {
    const data = JSON.parse(item.choices[0].message.content ?? '{}'); // content can be null
    const category = data.category;
    if (!acc[category]) acc[category] = [];
    acc[category].push({ query: customerQueries[index], ...data });
    return acc;
  }, {} as Record<string, any[]>);

  // Step 3: Generate FAQs (powerful model + caching)
  const faqs: FAQ[] = [];

  for (const [category, queries] of Object.entries(grouped)) {
    const response = await anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 2000,
      system: [
        {
          type: 'text',
          text: 'You are an expert at analyzing customer inquiries and writing clear FAQs. Follow these principles:\n1. Questions should be clear and specific\n2. Answers should be concise, 2-3 sentences\n3. Explain technical terms simply',
          cache_control: { type: 'ephemeral' } // Cache system prompt (only takes effect once the prompt meets the model's minimum cacheable length)
        }
      ],
      messages: [
        {
          role: 'user',
          content: `Category: ${category}\nInquiries:\n${queries.map(q => q.query).join('\n')}\n\nGenerate top 5 FAQs as JSON array.`
        }
      ]
    });

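    // content is a union of block types; only text blocks carry .text
    const block = response.content[0];
    const categoryFAQs = JSON.parse(block?.type === 'text' ? block.text : '[]');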
    faqs.push(...categoryFAQs);
  }

  return faqs;
}

This code incorporates all the practical tips:

Model selection: mini for simple classification, Sonnet for complex generation
Parallel processing: Promise.all for concurrent classification
Error handling: retry logic for stability
Cost optimization: system prompt caching reduces repeated call costs
Structured output: JSON responses that parse directly, with no text scraping

At first I ran everything on GPT-4 and paid 20 cents per request. After this change, it's down to 5 cents, and responses are about 3x faster.
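
One caveat with the classification step: Promise.all fires every request at once, so a few hundred inquiries can run straight into the rate limit. A small concurrency limiter keeps the parallelism without the burst. This is a sketch with a hypothetical `mapWithConcurrency` helper, not something from either SDK.

```typescript
// Hypothetical helper: like Promise.all(items.map(fn)) but with at most
// `limit` calls in flight. Results keep the input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until none are left.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // safe: no await between the check and the increment
      results[index] = await fn(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

In generateFAQ, the first step would become mapWithConcurrency(customerQueries, 5, query => callLLMWithRetry(...)), where 5 is an arbitrary starting point to tune against your own rate limits.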

Cost Management Strategies

For startups, LLM API costs matter. They grow with every user and every request. Here are my strategies.

1. Reduce Input Tokens

Split long documents into chunks, send only needed parts
Keep system prompts as short as possible
Use prompt caching
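
Chunking can start out very simple. The sketch below splits by character count with an optional overlap. The `chunkText` name is hypothetical, and characters are only a rough proxy for tokens, so a production version would more likely split on paragraphs or count tokens with the model's tokenizer.

```typescript
// Hypothetical helper: splits text into chunks of at most `maxChars`
// characters. `overlap` characters are repeated between neighbouring chunks
// so that context straddling a boundary is not lost entirely.
function chunkText(text: string, maxChars: number, overlap = 0): string[] {
  if (maxChars <= 0) throw new Error('maxChars must be positive');
  if (overlap < 0 || overlap >= maxChars) {
    throw new Error('overlap must be in [0, maxChars)');
  }
  const chunks: string[] = [];
  const step = maxChars - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Once the document is in chunks, you can send only the ones relevant to the question instead of the whole thing.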

2. Limit Output Tokens

Always set max_tokens
Use instructions like "briefly", "in 3 sentences" to control output length

3. Model Mix

Process 80% with mini/cheap models
Use premium models only for 20%

4. Caching Strategy

Cache identical requests in DB (especially classification, translation)
Redis with 1-hour cache cut API calls by 60%

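import { createHash } from 'node:crypto';
import Redis from 'ioredis'; // assumption: an ioredis client, which exposes setex

const redis = new Redis(); // defaults to localhost:6379

// Stable, fixed-length cache key component for arbitrary text
const hashText = (text: string) =>
  createHash('sha256').update(text).digest('hex');

async function cachedLLMCall(
  key: string,
  apiCall: () => Promise<string>
): Promise<string> {
  const cached = await redis.get(key);
  if (cached) return cached;

  const result = await apiCall();
  await redis.setex(key, 3600, result); // 1 hour cache
  return result;
}

// Usage (analyzeSentiment returns an object, so serialize it for Redis)
const sentiment = JSON.parse(
  await cachedLLMCall(
    `sentiment:${hashText(review)}`,
    async () => JSON.stringify(await analyzeSentiment(review))
  )
);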

5. Usage Monitoring

Log all API calls
Track token usage by user and feature
Alert on anomalies (sudden 10x increase)
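
A first version of this monitoring does not need much. The sketch below keeps per-user, per-feature token totals in memory and flags a spike against a baseline. The `UsageTracker` and `isAnomalous` names are hypothetical, and a real service would persist the totals and compute the baseline from history.

```typescript
interface UsageRecord {
  userId: string;
  feature: string;
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical in-memory tracker keyed by user and feature.
class UsageTracker {
  private totals = new Map<string, number>();

  // Adds one call's tokens to the running total for its user and feature.
  record(r: UsageRecord): void {
    const key = `${r.userId}:${r.feature}`;
    const tokens = r.inputTokens + r.outputTokens;
    this.totals.set(key, (this.totals.get(key) ?? 0) + tokens);
  }

  total(userId: string, feature: string): number {
    return this.totals.get(`${userId}:${feature}`) ?? 0;
  }
}

// Flags a spike when today's usage is at least `factor` times the baseline.
function isAnomalous(todayTokens: number, baselineTokens: number, factor = 10): boolean {
  return baselineTokens > 0 && todayTokens >= baselineTokens * factor;
}
```

Both APIs return token counts in the usage field of each response, which is the natural source for inputTokens and outputTokens here.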

With this management, monthly LLM costs stay under $200. At 1000 MAU, that's 20 cents per user.

Conclusion: LLM APIs Are Different

You can't use LLM APIs like regular REST APIs. Every token costs money, responses are slow, and results are probabilistic. But when used properly, they can cut development time by a factor of 10.

Key lessons:

Streaming is essential - time to first token determines UX
Model selection is 80% of costs - don't run everything on GPT-4
Prompt caching is a game changer - use Anthropic for many repeated calls
Use structured output - avoid parsing hell with JSON mode or tool use
Errors always happen - can't go to production without retry logic

The overwhelm I felt when first integrating LLM APIs disappeared once I understood "this is a different kind of API." Now I build classification, summarization, extraction, and generation features in hours. 50 lines of code is enough.

LLM APIs aren't magic, they're tools. Understand their characteristics, manage costs, prepare for errors, and they become powerful weapons for startups.
