
LLM APIs in Practice: Building Features with OpenAI and Anthropic
My first LLM API integration brought token cost explosions, latency issues, and hallucinations. Here's what I learned building real features.

When I first integrated an LLM API, my token costs hit $50 in a week. Every time a user uploaded a long document, GPT-4 was reading the entire thing. I didn't know about prompt caching, wasn't using streaming, and users stared at blank screens for 10 seconds. That's when I realized: LLM APIs aren't like REST APIs.
Here's what I learned using OpenAI and Anthropic APIs in production. Token cost optimization, streaming responses, structured output, function calling. The practical tips that aren't in the documentation.
LLM API calls look like HTTP requests, but something completely different happens inside. A normal REST API makes one DB query and you're done. An LLM generates tokens one by one through hundreds of billions of parameters.
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function summarizeText(text: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are an assistant that provides concise summaries.'
      },
      {
        role: 'user',
        content: `Summarize this in 3 sentences:\n\n${text}`
      }
    ],
    temperature: 0.3,
    max_tokens: 200
  });
  return response.choices[0].message.content;
}
The problem is the await. When a user uploads a 5000-character document, they wait 15 seconds while GPT-4 generates 200 tokens. This was my first summarization feature. Users watched the loading spinner and left.
The key is streaming. LLMs generate tokens one by one, but we don't need to wait for all of them. Like streaming a movie, show them as they're generated.
async function summarizeTextStreaming(text: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are an assistant that provides concise summaries.'
      },
      {
        role: 'user',
        content: `Summarize this in 3 sentences:\n\n${text}`
      }
    ],
    temperature: 0.3,
    max_tokens: 200,
    stream: true
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    if (content) {
      process.stdout.write(content); // or send to client
    }
  }
}
The perceived speed changes completely. The first token arrives in 1-2 seconds, and users watch the text typing out. This is why ChatGPT uses streaming.
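To get that effect in a browser, the chunks have to be forwarded to the client as they arrive. Here's a minimal sketch using Server-Sent Events with Express; the framework choice, route name, and request shape are my assumptions, not part of the original feature.

import express from 'express';
import OpenAI from 'openai';

const app = express();
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
app.use(express.json());

// Hypothetical SSE endpoint: forwards tokens to the browser as they arrive
app.post('/api/summarize', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    stream: true,
    messages: [
      { role: 'user', content: `Summarize this in 3 sentences:\n\n${req.body.text}` }
    ]
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    if (content) {
      // SSE frames look like "data: ...\n\n"; the browser reads them with fetch + a ReadableStream reader
      res.write(`data: ${JSON.stringify(content)}\n\n`);
    }
  }
  res.write('data: [DONE]\n\n');
  res.end();
});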
The biggest trap with LLM APIs is token costs. For GPT-4o, input tokens cost $2.50 per million and output tokens $10 per million. I didn't think much of it at first, but as users grew, costs skyrocketed.
First realization: tokens aren't words. In English, 1 word ≈ 1.3 tokens, but in Korean, 1 character ≈ 2-3 tokens. "안녕하세요" (hello) is 5 characters but over 10 tokens. Korean services pay 2-3x more in token costs than English ones.
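You can check this yourself by counting tokens locally before sending anything. A rough sketch, assuming the js-tiktoken package and that your installed version knows the gpt-4o encoding; exact counts depend on the tokenizer version.

import { encodingForModel } from 'js-tiktoken';

// Assumption: this js-tiktoken version maps 'gpt-4o' to its o200k_base encoding
const enc = encodingForModel('gpt-4o');

// Counts tokens the way the model's tokenizer does, not by words or characters
function countTokens(text: string): number {
  return enc.encode(text).length;
}

console.log(countTokens('Hello, how are you?'));      // a handful of tokens
console.log(countTokens('안녕하세요, 잘 지내세요?')); // noticeably more tokens per character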
Second realization: model selection determines 80% of costs. GPT-4 is powerful but expensive. Simple classification or summarization works fine with GPT-4o-mini. I used GPT-4 for everything, then split models by feature and cut costs by 70%.
const MODEL_SELECTION = {
  // Tasks requiring complex reasoning
  complex: 'gpt-4o',
  // Simple classification, summarization
  simple: 'gpt-4o-mini',
  // Batch processing
  batch: 'gpt-4o-mini'
};

async function classifyFeedback(text: string) {
  // Simple tasks like sentiment classification work fine with mini
  const response = await openai.chat.completions.create({
    model: MODEL_SELECTION.simple,
    messages: [
      {
        role: 'system',
        content: 'Classify user feedback as positive/negative/neutral.'
      },
      {
        role: 'user',
        content: text
      }
    ],
    temperature: 0, // Classification needs consistency
    max_tokens: 10 // One word is enough
  });
  return response.choices[0].message.content;
}
Third realization: Prompt Caching is a game changer. Anthropic's Claude API supports prompt caching. Cache long system prompts or documents, and from the second call onward, cached token costs drop by 90%.
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function analyzeCodeWithCache(code: string) {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: 'You are a code review expert. Follow these guidelines...',
        cache_control: { type: 'ephemeral' } // Cache for 5 minutes
      }
    ],
    messages: [
      {
        role: 'user',
        content: `Review this code:\n\n${code}`
      }
    ]
  });
  return response.content[0].text;
}
If your system prompt is 1000 tokens, without caching you pay for those 1000 input tokens on every request. With caching, the first request costs slightly more (cache writes are billed at 25% above the base input rate), but every request after that reads the cache at 10% of the base rate, roughly 100 tokens' worth. Essential for any service that sends the same long prompt over and over.
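Here's the arithmetic for Claude 3.5 Sonnet ($3 per million input tokens, cache writes at $3.75, cache reads at $0.30 per million), as a back-of-the-envelope sketch rather than an exact bill:

// Back-of-the-envelope cost of a 1000-token system prompt over 100 requests
const INPUT = 3.00 / 1_000_000;       // $ per input token (Claude 3.5 Sonnet base rate)
const CACHE_WRITE = 3.75 / 1_000_000; // 25% surcharge on the first (caching) request
const CACHE_READ = 0.30 / 1_000_000;  // 90% discount on every cache hit

const promptTokens = 1000;
const requests = 100;

const withoutCache = promptTokens * requests * INPUT;                         // ~$0.30
const withCache = promptTokens * (CACHE_WRITE + (requests - 1) * CACHE_READ); // ~$0.033

console.log({ withoutCache, withCache }); // roughly a 9x saving on the cached portion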
At first I didn't know what temperature was, so I just left it at 0.7. Results varied every time, and I was confused. Temperature isn't "creativity"; it controls how much randomness goes into sampling from the token probability distribution.
When selecting the next token, LLMs create a probability distribution. Like "hello" followed by "there" 80%, "friend" 15%, "darkness" 5%. Temperature determines how flat to make this distribution.
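A toy sketch of what that means numerically (no API call, just the softmax-with-temperature formula that sampling is based on; the logits are made up to match the 80/15/5 example):

// softmax(logits / T): low T sharpens the distribution, high T flattens it
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map(l => l / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map(s => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Hypothetical logits for "there", "friend", "darkness" after "hello"
const logits = [4.0, 2.3, 1.2];
console.log(softmaxWithTemperature(logits, 0.2)); // ≈ [0.9998, ...]: almost always "there"
console.log(softmaxWithTemperature(logits, 1.0)); // ≈ [0.80, 0.15, 0.05]: the original split
console.log(softmaxWithTemperature(logits, 2.0)); // flatter: "darkness" shows up far more often

In practice this boils down to two settings I reach for constantly: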
// Data extraction/classification - consistency matters
async function extractStructuredData(text: string) {
  return await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0, // Same result every time
    messages: [...]
  });
}

// Creative content - variety matters
async function generateBlogIdeas(topic: string) {
  return await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.9, // Diverse ideas
    messages: [...]
  });
}
max_tokens is your safety net against cost explosions. At first I didn't set it, and GPT-4 generated a 5000-token essay, costing me 5 cents per request. Now I always set max_tokens.
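One thing to watch: when the cap is hit, the response is simply cut off. The chat completions response reports this via finish_reason, so you can detect truncation instead of silently returning half an answer. The handling below is just one possible policy:

async function summarizeWithCap(text: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    max_tokens: 200, // hard cap on output spend
    messages: [{ role: 'user', content: `Summarize this in 3 sentences:\n\n${text}` }]
  });

  const choice = response.choices[0];
  if (choice.finish_reason === 'length') {
    // The model hit the cap mid-answer; retry with a higher limit or ask for a shorter summary
    console.warn('Response truncated by max_tokens');
  }
  return choice.message.content;
}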
LLM responses are text. Parsing "This is positive" is a nightmare. Typos, line breaks, different phrasings... I wanted JSON.
First attempt: "Respond in JSON" in the prompt.

// ❌ Unreliable
const badResponse = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: 'Analyze this review and respond in JSON: {"sentiment": "positive/negative", "score": 1-5}'
    }
  ]
});
// Result: "Sure! Here's the analysis:\n```json\n{...}\n```"
GPT-4 wraps it in code blocks or adds explanations. Parsing became a mess.
Solution: JSON Mode (OpenAI) or Tool Use (Anthropic).

// OpenAI JSON Mode
async function analyzeSentiment(review: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: 'Respond only in JSON format.'
      },
      {
        role: 'user',
        content: `Analyze review (sentiment: positive/negative/neutral, score: 1-5, summary: string):\n${review}`
      }
    ]
  });
  return JSON.parse(response.choices[0].message.content);
}

// Anthropic Tool Use (more powerful)
async function extractProductInfo(description: string) {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    tools: [
      {
        name: 'save_product',
        description: 'Save product information',
        input_schema: {
          type: 'object',
          properties: {
            name: { type: 'string', description: 'Product name' },
            price: { type: 'number', description: 'Price' },
            category: { type: 'string', description: 'Category' },
            features: { type: 'array', items: { type: 'string' } }
          },
          required: ['name', 'price', 'category']
        }
      }
    ],
    messages: [
      {
        role: 'user',
        content: `Extract product info:\n${description}`
      }
    ]
  });
  const toolUse = response.content.find(block => block.type === 'tool_use');
  return toolUse?.input;
}
Tool use is the evolution of function calling: the LLM returns structured data in the form of "calling" a function you define. Because the schema constrains the output shape, it's far more stable than JSON mode.
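The schema steers the model, but I still validate the result before trusting it downstream. A sketch using zod (my choice; any validator works, and the ProductInfo shape simply mirrors the tool schema above):

import { z } from 'zod';

// Mirrors the save_product input_schema defined above
const ProductInfo = z.object({
  name: z.string(),
  price: z.number(),
  category: z.string(),
  features: z.array(z.string()).optional()
});

type ProductInfo = z.infer<typeof ProductInfo>;

async function extractProductInfoValidated(description: string): Promise<ProductInfo> {
  const raw = await extractProductInfo(description);
  // Throws (or use .safeParse) if the model returned something that doesn't match the schema
  return ProductInfo.parse(raw);
}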
LLM APIs fail. Often. You hit rate limits, servers slow down, networks drop. At first I just used a single try-catch, but it kept breaking in production.
Core strategy: Exponential Backoff + Retry.

async function callLLMWithRetry<T>(
  apiCall: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await apiCall();
    } catch (error: any) {
      const isLastRetry = i === maxRetries - 1;
      // Rate limit - wait and retry
      if (error?.status === 429) {
        if (isLastRetry) throw error;
        const waitTime = Math.pow(2, i) * 1000; // 1s, 2s, 4s
        console.log(`Rate limited. Waiting ${waitTime}ms...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));
        continue;
      }
      // Server error (500-599) - retry
      if (error?.status >= 500 && error?.status < 600) {
        if (isLastRetry) throw error;
        await new Promise(resolve => setTimeout(resolve, 1000));
        continue;
      }
      // Client error (400-499) - don't retry
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}

// Usage
const result = await callLLMWithRetry(() =>
  openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [...]
  })
);
Timeouts are essential too. GPT-4 sometimes takes over 30 seconds. Users won't wait that long.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), ms)
  );
  return Promise.race([promise, timeout]);
}

// 10 second limit
const result = await withTimeout(
  openai.chat.completions.create({...}),
  10000
);
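Worth knowing: the official openai Node SDK can handle both of these for you. As far as I can tell it accepts timeout and maxRetries options on the client; double-check against the SDK version you're running.

// Client-level defaults: give up after 10s and retry failed requests twice
const openaiWithLimits = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 10_000, // ms
  maxRetries: 2
});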
I've used both APIs in production, and each has clear strengths, so the FAQ feature below actually uses them together.
Theory applied to practice: I built a feature that collects customer inquiries and automatically generates FAQs.
interface FAQ {
  question: string;
  answer: string;
  category: string;
  priority: number;
}

async function generateFAQ(customerQueries: string[]): Promise<FAQ[]> {
  // Step 1: Classify queries (use cheap model)
  const categorized = await Promise.all(
    customerQueries.map(query =>
      callLLMWithRetry(() =>
        openai.chat.completions.create({
          model: 'gpt-4o-mini',
          temperature: 0,
          response_format: { type: 'json_object' },
          messages: [
            {
              role: 'system',
              content: 'Respond in JSON: {category: string, priority: number}'
            },
            {
              role: 'user',
              content: `Classify this inquiry:\n${query}`
            }
          ]
        })
      )
    )
  );

  // Step 2: Group by category
  const grouped = categorized.reduce((acc, item, index) => {
    const data = JSON.parse(item.choices[0].message.content);
    const category = data.category;
    if (!acc[category]) acc[category] = [];
    acc[category].push({ query: customerQueries[index], ...data });
    return acc;
  }, {} as Record<string, any[]>);

  // Step 3: Generate FAQs (powerful model + caching)
  const faqs: FAQ[] = [];
  for (const [category, queries] of Object.entries(grouped)) {
    const response = await anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 2000,
      system: [
        {
          type: 'text',
          text: 'You are an expert at analyzing customer inquiries and writing clear FAQs. Follow these principles:\n1. Questions should be clear and specific\n2. Answers should be concise, 2-3 sentences\n3. Explain technical terms simply',
          cache_control: { type: 'ephemeral' } // Cache system prompt
        }
      ],
      messages: [
        {
          role: 'user',
          content: `Category: ${category}\nInquiries:\n${queries.map(q => q.query).join('\n')}\n\nGenerate top 5 FAQs as JSON array.`
        }
      ]
    });
    const categoryFAQs = JSON.parse(response.content[0].text);
    faqs.push(...categoryFAQs);
  }
  return faqs;
}
This code incorporates all the practical tips: a cheap model (gpt-4o-mini) for the simple classification step, retries with exponential backoff around every call, temperature 0 and JSON mode for structured classification output, and a cached system prompt on the Claude side for FAQ generation.
At first I ran everything on GPT-4 and paid 20 cents per request. After this change, it's down to 5 cents, and the whole pipeline runs about 3x faster.
For startups, LLM API costs matter; they grow right along with your user count. Here are my strategies.
1. Reduce Input Tokens

The cheapest call is the one you skip entirely: cache results for identical inputs.

// `redis` is a node-redis/ioredis client and `hashText` a content-hash helper, both set up elsewhere
async function cachedLLMCall(
  key: string,
  apiCall: () => Promise<string>
): Promise<string> {
  const cached = await redis.get(key);
  if (cached) return cached;

  const result = await apiCall();
  await redis.setex(key, 3600, result); // 1 hour cache
  return result;
}

// Usage
const sentiment = await cachedLLMCall(
  `sentiment:${hashText(review)}`,
  () => analyzeSentiment(review)
);
5. Usage Monitoring
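Every non-streaming response already tells you what it cost in tokens via the usage field. A minimal sketch of logging it per feature; the feature label and log format are my own placeholders:

// Log token usage per feature so cost spikes are visible before the invoice arrives
async function classifyWithTracking(text: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0,
    max_tokens: 10,
    messages: [{ role: 'user', content: `Classify this feedback: ${text}` }]
  });

  const usage = response.usage; // { prompt_tokens, completion_tokens, total_tokens }
  console.log(JSON.stringify({
    feature: 'feedback-classification', // my own label, not something the API provides
    model: 'gpt-4o-mini',
    promptTokens: usage?.prompt_tokens,
    completionTokens: usage?.completion_tokens,
    totalTokens: usage?.total_tokens
  }));

  return response.choices[0].message.content;
}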
With this management, monthly LLM costs stay under $200. At 1000 MAU, that's 20 cents per user.
You can't use LLM APIs like regular REST APIs. Every token costs money, responses are slow, and results are probabilistic. But used properly, they cut development time for whole categories of features by an order of magnitude.
Key lessons: stream responses so users see output immediately, pick the cheapest model that handles each task, cache long prompts and repeated requests, set temperature and max_tokens deliberately, use JSON mode or tool use for structured output, and wrap every call in retries and timeouts.
The overwhelm I felt when first integrating LLM APIs disappeared once I understood "this is a different kind of API." Now I build classification, summarization, extraction, and generation features in hours. 50 lines of code is enough.
LLM APIs aren't magic, they're tools. Understand their characteristics, manage costs, prepare for errors, and they become powerful weapons for startups.