When I first integrated an LLM API, my token costs hit $50 in a week. Every time a user uploaded a long document, GPT-4 was reading the entire thing. I didn't know about prompt caching, wasn't using streaming, and users stared at blank screens for 10 seconds. That's when I realized: LLM APIs aren't like REST APIs.
Here's what I learned using the OpenAI and Anthropic APIs in production: token cost optimization, streaming responses, structured output, and function calling, plus the practical tips the documentation doesn't spell out.
What Actually Happens in an API Call
LLM API calls look like HTTP requests, but something completely different happens inside. A normal REST API makes one DB query and you're done. An LLM generates tokens one by one through hundreds of billions of parameters.
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
async function summarizeText(text: string) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: 'You are an assistant that provides concise summaries.'
},
{
role: 'user',
content: `Summarize this in 3 sentences:\n\n${text}`
}
],
temperature: 0.3,
max_tokens: 200
});
return response.choices[0].message.content;
}
The problem is the await: it resolves only after the entire response has been generated. When a user uploads a 5000-character document, they wait 15 seconds while GPT-4 generates 200 tokens. This was my first summarization feature. Users watched the loading spinner and left.
The key is streaming. LLMs generate tokens one by one, but we don't need to wait for all of them. Like streaming a movie, show them as they're generated.
async function summarizeTextStreaming(text: string) {
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: 'You are an assistant that provides concise summaries.'
},
{
role: 'user',
content: `Summarize this in 3 sentences:\n\n${text}`
}
],
temperature: 0.3,
max_tokens: 200,
stream: true
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || '';
if (content) {
process.stdout.write(content); // or send to client
}
}
}
The perceived speed changes completely. The first token arrives in 1-2 seconds, and users watch the text typing out. This is why ChatGPT uses streaming.
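If you want to both forward tokens to the client and keep the full text (for logging or caching), a small collector helps. This is a minimal sketch, not part of either SDK: `collectStream`, `ChunkLike`, and `fakeStream` are hypothetical names, and `ChunkLike` only mirrors the fields read in the loop above.

```typescript
// Only the fields the streaming loop actually reads from each chunk.
interface ChunkLike {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Forward each piece of text as it arrives, keep the full text, and
// record how long the first non-empty token took.
async function collectStream(
  stream: AsyncIterable<ChunkLike>,
  onToken: (text: string) => void
): Promise<{ text: string; firstTokenMs: number | null }> {
  const start = Date.now();
  let firstTokenMs: number | null = null;
  let text = '';
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content ?? '';
    if (!content) continue; // skip empty deltas (role or finish chunks)
    if (firstTokenMs === null) firstTokenMs = Date.now() - start;
    text += content;
    onToken(content);
  }
  return { text, firstTokenMs };
}

// Stand-in stream for local testing; a real call would pass the SDK stream.
async function* fakeStream(parts: string[]): AsyncGenerator<ChunkLike> {
  for (const part of parts) {
    yield { choices: [{ delta: { content: part } }] };
  }
}
```

Logging `firstTokenMs` per request is a cheap way to watch the number users actually feel, as opposed to total completion time.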
Preventing Token Cost Explosions
The biggest trap with LLM APIs is token costs. For GPT-4o, input tokens cost $2.50 per million and output tokens cost $10 per million. I didn't think much of it at first, but as users grew, costs climbed fast.
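To make that pricing concrete, here is a minimal sketch of the arithmetic. The prices are the per-million figures quoted above and will change over time, so treat the constants as assumptions; `Pricing`, `GPT_4O_PRICING`, and `estimateCostUSD` are names made up for illustration.

```typescript
// Per-million-token prices in USD. Values are the ones quoted in this post;
// check the provider's pricing page before relying on them.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const GPT_4O_PRICING: Pricing = { inputPerMillion: 2.5, outputPerMillion: 10 };

// Hypothetical helper: cost of one request given its token counts.
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  pricing: Pricing
): number {
  return (
    (inputTokens * pricing.inputPerMillion +
      outputTokens * pricing.outputPerMillion) /
    1_000_000
  );
}

// A 20,000-token document summarized into 200 tokens, 1,000 times a day.
const perRequest = estimateCostUSD(20_000, 200, GPT_4O_PRICING);
const perDay = perRequest * 1_000;
console.log(perRequest.toFixed(4), perDay.toFixed(2));
```

In this example the input side dominates: the 200-token summary is a rounding error next to the 20,000-token document that gets re-read on every call.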
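First realization: tokens aren't words. In English, 1 word ≈ 1.3 tokens, but in Korean a single character can take 2-3 tokens. "안녕하세요" (hello) is 5 characters but can come out to over 10 tokens. Exact counts depend on the tokenizer, and newer models encode Korean more efficiently, but Korean services can still pay 2-3x more in token costs than English ones.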
Second realization: model selection determines 80% of costs. GPT-4 is powerful but expensive. Simple classification or summarization works fine with GPT-4o-mini. I used GPT-4 for everything, then split models by feature and cut costs by 70%.
const MODEL_SELECTION = {
// Tasks requiring complex reasoning
complex: 'gpt-4o',
// Simple classification, summarization
simple: 'gpt-4o-mini',
// Batch processing
batch: 'gpt-4o-mini'
};
async function classifyFeedback(text: string) {
// Simple tasks like sentiment classification work fine with mini
const response = await openai.chat.completions.create({
model: MODEL_SELECTION.simple,
messages: [
{
role: 'system',
content: 'Classify user feedback as positive/negative/neutral.'
},
{
role: 'user',
content: text
}
],
temperature: 0, // Classification needs consistency
max_tokens: 10 // One word is enough
});
return response.choices[0].message.content;
}
Third realization: prompt caching is a game changer. Anthropic's Claude API lets you mark long system prompts or documents as cacheable, and on later calls within the cache lifetime the cached input tokens are billed at roughly 10% of the normal price.
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
async function analyzeCodeWithCache(code: string) {
const response = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
system: [
{
type: 'text',
text: 'You are a code review expert. Follow these guidelines...',
cache_control: { type: 'ephemeral' } // Cache for 5 minutes
}
],
messages: [
{
role: 'user',
content: `Review this code:\n\n${code}`
}
]
});
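// content is a list of typed blocks; only text blocks carry a .text field
const first = response.content[0];
return first?.type === 'text' ? first.text : '';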
}
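If your system prompt is 2,000 tokens, without caching you pay for 2,000 input tokens on every request. With caching, the first request pays a small premium to write the cache (about 1.25x), and requests that hit the cache within its 5-minute lifetime pay the equivalent of about 200 tokens. Two caveats: caching only kicks in above a minimum prompt length (1,024 tokens for Claude 3.5 Sonnet), so the one-line system prompt in the sample stands in for a much longer one, and the lifetime is refreshed each time the cache is hit. Essential for services that send the same long prefix over and over.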
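A rough way to see when caching pays off is to compare billed token-equivalents with and without it. The sketch below assumes Anthropic's published multipliers for the 5-minute cache as I understand them (1.25x to write, 0.1x to read) and assumes every later call arrives before the cache expires. The function names are hypothetical.

```typescript
// Assumed multipliers relative to the normal input-token price:
// writing to the cache costs 1.25x, reading from it costs 0.1x.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Token-equivalents billed for the cached prefix over `requests` calls,
// assuming every call after the first lands inside the cache lifetime.
function cachedPrefixCost(prefixTokens: number, requests: number): number {
  if (requests <= 0) return 0;
  const firstCall = prefixTokens * CACHE_WRITE_MULTIPLIER;
  const laterCalls = (requests - 1) * prefixTokens * CACHE_READ_MULTIPLIER;
  return firstCall + laterCalls;
}

// Same prefix sent at full price on every call.
function uncachedPrefixCost(prefixTokens: number, requests: number): number {
  return Math.max(requests, 0) * prefixTokens;
}

const withCache = cachedPrefixCost(2_000, 10);
const withoutCache = uncachedPrefixCost(2_000, 10);
console.log({ withCache, withoutCache });
```

With these assumptions caching is slightly more expensive for a single call and cheaper from the second call onward, so it only helps when the same prefix really is reused within a few minutes.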
What Temperature and Max Tokens Really Mean
At first I didn't know what temperature was, so I just left it at 0.7. Results varied every time and I was confused. Temperature isn't "creativity," it's randomness in the probability distribution.
When selecting the next token, LLMs create a probability distribution. Like "hello" followed by "there" 80%, "friend" 15%, "darkness" 5%. Temperature determines how flat to make this distribution.
- temperature = 0: Always picks the highest probability token. Results are close to deterministic, though hosted APIs can still vary slightly between runs.
- temperature = 1: Samples from the probability distribution as-is. Varied results.
- temperature = 2: Flattens the distribution for more randomness.
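The effect is easy to see on the toy distribution above. This is an illustrative sketch of the math only, not how any provider implements sampling: dividing logits by T before softmax is equivalent to raising each probability to the power 1/T and renormalizing, which is what the hypothetical `applyTemperature` below does.

```typescript
// Rescale a probability distribution by temperature.
// Equivalent to dividing the logits by T before softmax: p_i^(1/T), renormalized.
function applyTemperature(probs: number[], temperature: number): number[] {
  if (temperature === 0) {
    // Limit case: all mass on the most likely token (greedy decoding).
    const best = probs.indexOf(Math.max(...probs));
    return probs.map((_, i) => (i === best ? 1 : 0));
  }
  const scaled = probs.map(p => Math.pow(p, 1 / temperature));
  const total = scaled.reduce((sum, p) => sum + p, 0);
  return scaled.map(p => p / total);
}

// The "hello" example: there 80%, friend 15%, darkness 5%.
const next = [0.8, 0.15, 0.05];
for (const t of [0, 0.5, 1, 2]) {
  console.log(t, applyTemperature(next, t).map(p => p.toFixed(3)));
}
```

Below 1 the top token pulls further ahead; above 1 the unlikely tokens gain ground, which is where the "varied" results come from.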
// Data extraction/classification - consistency matters
async function extractStructuredData(text: string) {
return await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0, // Most consistent output
messages: [...]
});
}
// Creative content - variety matters
async function generateBlogIdeas(topic: string) {
return await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.9, // Diverse ideas
messages: [...]
});
}
max_tokens is your safety net against cost explosions. At first I didn't set it, and GPT-4 generated a 5000-token essay, costing me 5 cents per request. Now I always set max_tokens.
Structured Output: JSON Mode and Function Calling
LLM responses are text. Parsing "This is positive" is a nightmare. Typos, line breaks, different phrasings... I wanted JSON.
First attempt: "Respond in JSON" in the prompt
// ❌ Unreliable
const badResponse = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: 'Analyze this review and respond in JSON: {"sentiment": "positive/negative", "score": 1-5}'
}
]
});
// Result: "Sure! Here's the analysis:\n```json\n{...}\n```"
GPT-4 wraps it in code blocks or adds explanations. Parsing became a mess.
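If you are stuck with free-text replies (an older model, or a provider without a JSON mode), a defensive parser is a reasonable stopgap. The sketch below is one way to do it, with a hypothetical `extractJson` helper: it looks inside a fenced block if there is one, then takes everything from the first `{` to the last `}`. It only handles a single top-level object.

```typescript
// Built with repeat() so this snippet contains no literal triple backticks.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE);

// Hypothetical helper: pull the first JSON object out of a chatty reply.
// Returns null instead of throwing when nothing parseable is found.
function extractJson(reply: string): unknown {
  const fenced = reply.match(FENCED_BLOCK);
  const candidate = fenced?.[1] ?? reply;
  const start = candidate.indexOf('{');
  const end = candidate.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(candidate.slice(start, end + 1));
  } catch {
    return null;
  }
}

// The kind of reply shown above: a greeting, then JSON inside a code fence.
const reply =
  "Sure! Here's the analysis:\n" +
  FENCE + 'json\n{"sentiment": "positive", "score": 4}\n' + FENCE;
console.log(extractJson(reply));
```

This recovers the common cases, but it is still guessing at the model's formatting.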
Solution: JSON Mode (OpenAI) or Tool Use (Anthropic)
// OpenAI JSON Mode
async function analyzeSentiment(review: string) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
response_format: { type: 'json_object' },
messages: [
{
role: 'system',
content: 'Respond only in JSON format.'
},
{
role: 'user',
content: `Analyze review (sentiment: positive/negative/neutral, score: 1-5, summary: string):\n${review}`
}
]
});
// content is typed as string | null, so fall back to an empty object
return JSON.parse(response.choices[0].message.content ?? '{}');
}
// Anthropic Tool Use (more powerful)
async function extractProductInfo(description: string) {
const response = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
tools: [
{
name: 'save_product',
description: 'Save product information',
input_schema: {
type: 'object',
properties: {
name: { type: 'string', description: 'Product name' },
price: { type: 'number', description: 'Price' },
category: { type: 'string', description: 'Category' },
features: { type: 'array', items: { type: 'string' } }
},
required: ['name', 'price', 'category']
}
}
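],
// Force a save_product call so the model can't answer in prose instead
tool_choice: { type: 'tool', name: 'save_product' },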
messages: [
{
role: 'user',
content: `Extract product info:\n${description}`
}
]
});
const toolUse = response.content.find(block => block.type === 'tool_use');
return toolUse?.input;
}
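Tool use is Anthropic's take on function calling. The LLM returns structured data in the form of "calling" a function, with arguments shaped by the JSON schema you supply. In my experience that makes it far more stable than JSON mode, which only guarantees syntactically valid JSON, not the fields you asked for. The schema is not strictly enforced, though, so validate the input before trusting it.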
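Whichever route produces the object, a cheap runtime check before it reaches your database is worth having. Below is a hand-written type guard mirroring the `save_product` schema above. `ProductInfo` and `isProductInfo` are names introduced here for illustration, and a schema validation library would do the same job with less code.

```typescript
interface ProductInfo {
  name: string;
  price: number;
  category: string;
  features?: string[];
}

// Type guard mirroring the `save_product` schema: name, price and
// category are required, features is an optional array of strings.
function isProductInfo(value: unknown): value is ProductInfo {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const featuresOk =
    v.features === undefined ||
    (Array.isArray(v.features) && v.features.every(f => typeof f === 'string'));
  return (
    typeof v.name === 'string' &&
    typeof v.price === 'number' &&
    Number.isFinite(v.price) &&
    typeof v.category === 'string' &&
    featuresOk
  );
}

// Usage: check whatever came back in the tool_use block before trusting it.
const candidate: unknown = { name: 'Desk lamp', price: 29.9, category: 'lighting' };
if (isProductInfo(candidate)) {
  console.log(candidate.name, candidate.price);
}
```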
Error Handling: Rate Limits, Timeouts, Retries
LLM APIs fail. Often. You hit rate limits, servers slow down, networks drop. At first I just used a single try-catch, but it kept breaking in production.
Core strategy: Exponential Backoff + Retry
async function callLLMWithRetry<T>(
apiCall: () => Promise<T>,
maxRetries = 3
): Promise<T> {
for (let i = 0; i < maxRetries; i++) {
try {
return await apiCall();
} catch (error: any) {
const isLastRetry = i === maxRetries - 1;
// Rate limit - wait and retry
if (error?.status === 429) {
if (isLastRetry) throw error;
const waitTime = Math.pow(2, i) * 1000; // 1s, 2s, 4s
console.log(`Rate limited. Waiting ${waitTime}ms...`);
await new Promise(resolve => setTimeout(resolve, waitTime));
continue;
}
// Server error (500-599) - retry
if (error?.status >= 500 && error?.status < 600) {
if (isLastRetry) throw error;
await new Promise(resolve => setTimeout(resolve, 1000));
continue;
}
// Client error (400-499) - don't retry
throw error;
}
}
throw new Error('Max retries exceeded');
}
// Usage
const result = await callLLMWithRetry(() =>
openai.chat.completions.create({
model: 'gpt-4o',
messages: [...]
})
);
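One refinement that is not in the function above: when many requests hit a rate limit at once, fixed 1s/2s/4s waits make them all retry at the same moment. Adding random jitter spreads them out. This is a sketch under that assumption; `backoffDelayMs` and `retryDelayMs` are hypothetical helpers, and the Retry-After handling only covers the seconds form of the header.

```typescript
// Delay before retry number `attempt` (0-based), using "full jitter":
// pick a random point between 0 and the capped exponential delay.
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random
): number {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exponential);
}

// Prefer the server's hint when a 429 response carries a Retry-After header
// given in seconds; fall back to jittered backoff otherwise.
function retryDelayMs(attempt: number, retryAfterHeader?: string | null): number {
  const seconds = Number(retryAfterHeader);
  if (retryAfterHeader && Number.isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  return backoffDelayMs(attempt);
}
```

To use it, replace the `Math.pow(2, i) * 1000` wait with `backoffDelayMs(i)`.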
Timeouts are essential too. GPT-4 sometimes takes over 30 seconds. Users won't wait that long.
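async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // stop the timer once either side settles
  }
}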
// 10 second limit
const result = await withTimeout(
openai.chat.completions.create({...}),
10000
);
OpenAI vs Anthropic: Real-World Comparison
I've used both APIs. Each has clear pros and cons.
OpenAI Pros:
- GPT-4o is fast and cheap (input $2.5/1M, output $10/1M)
- JSON mode is simple
- Multimodal APIs like Whisper, DALL-E are integrated
- Rich documentation
OpenAI Cons:
- Prompt caching is automatic, but the discount on cached input is smaller than Anthropic's (limits cost optimization)
- 128k context window, smaller than Claude
- Strict rate limits (especially free tier)
Anthropic Pros:
- Claude 3.5 Sonnet's coding/reasoning beats GPT-4o
- Prompt caching cuts repeated call costs by 90%
- 200k context window
- Powerful tool use (more stable than JSON mode)
Anthropic Cons:
- More expensive (input $3/1M, output $15/1M)
- Streaming implementation more complex than OpenAI
- Smaller ecosystem (fewer plugins, tools)
My choice:
- Simple classification/summarization: OpenAI GPT-4o-mini
- Complex reasoning/coding: Anthropic Claude 3.5 Sonnet
- Features with many repeated calls: Anthropic (for prompt caching)
- Batch processing: OpenAI (for price)
Real Example: Smart FAQ Auto-Generation
Theory applied to practice. I built a feature that collects customer inquiries and automatically generates FAQs.
interface FAQ {
question: string;
answer: string;
category: string;
priority: number;
}
async function generateFAQ(customerQueries: string[]): Promise<FAQ[]> {
// Step 1: Classify queries (use cheap model)
const categorized = await Promise.all(
customerQueries.map(query =>
callLLMWithRetry(() =>
openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
response_format: { type: 'json_object' },
messages: [
{
role: 'system',
content: 'Respond in JSON: {category: string, priority: number}'
},
{
role: 'user',
content: `Classify this inquiry:\n${query}`
}
]
})
)
)
);
// Step 2: Group by category
const grouped = categorized.reduce((acc, item, index) => {
const data = JSON.parse(item.choices[0].message.content ?? '{}'); // content can be null
const category = data.category;
if (!acc[category]) acc[category] = [];
acc[category].push({ query: customerQueries[index], ...data });
return acc;
}, {} as Record<string, any[]>);
// Step 3: Generate FAQs (powerful model + caching)
const faqs: FAQ[] = [];
for (const [category, queries] of Object.entries(grouped)) {
const response = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 2000,
system: [
{
type: 'text',
text: 'You are an expert at analyzing customer inquiries and writing clear FAQs. Follow these principles:\n1. Questions should be clear and specific\n2. Answers should be concise, 2-3 sentences\n3. Explain technical terms simply',
cache_control: { type: 'ephemeral' } // Cache system prompt
}
],
messages: [
{
role: 'user',
content: `Category: ${category}\nInquiries:\n${queries.map(q => q.query).join('\n')}\n\nGenerate top 5 FAQs as JSON array.`
}
]
});
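// Plain-text JSON can arrive wrapped in prose or code fences; tool use is the sturdier option
const first = response.content[0];
const categoryFAQs: FAQ[] = JSON.parse(first?.type === 'text' ? first.text : '[]');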
faqs.push(...categoryFAQs);
}
return faqs;
}
This code incorporates all the practical tips:
- Model selection: mini for simple classification, Sonnet for complex generation
- Parallel processing: Promise.all for concurrent classification
- Error handling: retry logic for stability
- Cost optimization: system prompt caching reduces repeated call costs (once the prompt is long enough to qualify for caching)
- Structured output: JSON that parses straight into objects, with no scraping of free text
At first I ran everything on GPT-4 and paid 20 cents per request. After this change, it's down to 5 cents, and it runs 3x faster.
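One caveat on the parallel step: `Promise.all` over every query fires all requests at once, which is an easy way to hit the rate limits discussed earlier when the batch is large. A small concurrency limiter keeps the parallelism while capping how many calls are in flight. `mapWithConcurrency` is a hypothetical helper sketched here, not something from either SDK.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once.
// Results keep the input order, like Promise.all.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

In `generateFAQ`, step 1 could become `mapWithConcurrency(customerQueries, 5, query => callLLMWithRetry(...))`. The limit of 5 is an arbitrary starting point to tune against your own rate limits.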
Cost Management Strategies
For startups, LLM API costs matter. They grow with every user and every request, usually faster than you expect. Here are my strategies.
1. Reduce Input Tokens
- Split long documents into chunks, send only needed parts
- Keep system prompts as short as possible
- Use prompt caching
2. Limit Output Tokens
- Always set max_tokens
- Use instructions like "briefly", "in 3 sentences" to control output length
3. Model Mix
- Process 80% with mini/cheap models
- Use premium models only for 20%
4. Caching Strategy
- Cache identical requests in DB (especially classification, translation)
- Redis with 1-hour cache cut API calls by 60%
async function cachedLLMCall(
key: string,
apiCall: () => Promise<string>
): Promise<string> {
const cached = await redis.get(key);
if (cached) return cached;
const result = await apiCall();
await redis.setex(key, 3600, result); // 1 hour cache
return result;
}
// Usage
const sentiment = await cachedLLMCall(
`sentiment:${hashText(review)}`,
() => analyzeSentiment(review)
);
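The `hashText` helper in that snippet isn't defined in this post, so here is one plausible implementation using Node's built-in crypto module, plus a key builder. Both are assumptions for illustration: the whitespace normalization and the idea of putting the model and a prompt version into the key are my additions, not part of the original setup.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical implementation of the `hashText` helper used above.
// Whitespace is collapsed so trivially different inputs share a cache entry.
function hashText(text: string): string {
  const normalized = text.trim().replace(/\s+/g, ' ');
  return createHash('sha256').update(normalized, 'utf8').digest('hex');
}

// Include the model and a prompt version in the key so that changing either
// one does not serve stale answers. Both values here are illustrative.
function cacheKey(
  feature: string,
  model: string,
  promptVersion: string,
  text: string
): string {
  return `${feature}:${model}:${promptVersion}:${hashText(text)}`;
}

console.log(cacheKey('sentiment', 'gpt-4o-mini', 'v1', 'Great product!'));
```

Hashing keeps keys short and avoids putting user text into Redis key names. Bumping the version string is a simple way to invalidate everything after a prompt change.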
5. Usage Monitoring
- Log all API calls
- Track token usage by user and feature
- Alert on anomalies (sudden 10x increase)
With this management, monthly LLM costs stay under $200. At 1000 MAU, that's 20 cents per user.
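For the monitoring point, even an in-memory tracker shows the shape of the idea: record tokens per user and feature, and flag a call that uses far more than the running average. This is a minimal sketch with hypothetical names. A real service would persist the numbers and alert through its metrics stack.

```typescript
// In-memory sketch; a real service would persist this in a database or
// metrics system. All names here are hypothetical.
class UsageTracker {
  private totals = new Map<string, { tokens: number; calls: number }>();

  // Record one call and report whether it looks anomalous, i.e. it used
  // more than `factor` times the running average for the same user+feature.
  record(userId: string, feature: string, tokens: number, factor = 10): boolean {
    const key = `${userId}:${feature}`;
    const entry = this.totals.get(key) ?? { tokens: 0, calls: 0 };
    const average = entry.calls > 0 ? entry.tokens / entry.calls : null;
    const anomalous = average !== null && tokens > average * factor;
    this.totals.set(key, { tokens: entry.tokens + tokens, calls: entry.calls + 1 });
    return anomalous;
  }

  totalTokens(userId: string, feature: string): number {
    return this.totals.get(`${userId}:${feature}`)?.tokens ?? 0;
  }
}

const tracker = new UsageTracker();
tracker.record('user-1', 'summarize', 400);
tracker.record('user-1', 'summarize', 600);
if (tracker.record('user-1', 'summarize', 8000)) {
  console.warn('Token usage spike for user-1/summarize');
}
```

Both SDKs return token counts on each response, so the `tokens` argument can come straight from the response's usage field rather than from an estimate.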
Conclusion: LLM APIs Are Different
You can't use LLM APIs like regular REST APIs. Every token costs money, responses are slow, and results are probabilistic. But used properly, they can make development up to 10x faster.
Key lessons:
- Streaming is essential - time to first token determines UX
- Model selection is 80% of costs - don't run everything on GPT-4
- Prompt caching is a game changer - use Anthropic for many repeated calls
- Use structured output - avoid parsing hell with JSON mode or tool use
- Errors always happen - can't go to production without retry logic
The overwhelm I felt when first integrating LLM APIs disappeared once I understood "this is a different kind of API." Now I build classification, summarization, extraction, and generation features in hours. 50 lines of code is enough.
LLM APIs aren't magic, they're tools. Understand their characteristics, manage costs, prepare for errors, and they become powerful weapons for startups.