Practical Prompt Engineering: Getting Structured Output
1. Prologue — "Why did it return something broken when I asked for JSON?"
The first time you wire an LLM into production, you hit this quickly.
Prompt: "Analyze this review and return JSON.
sentiment must be one of: positive/negative/neutral."
LLM Response:
"Of course! Here is the analysis in JSON format:
```json
{
"sentiment": "POSITIVE", ← should be 'positive', not 'POSITIVE'
"score": "high", ← should be a number, not a string
"issues": null ← should be [] when empty
}
```
The overall sentiment is quite positive!" ← extra text after the JSON
JSON.parse() blows up in production. Field names come back different sometimes. Arrays arrive as null. This non-determinism is what makes devs hesitant to use LLMs for real production features (vs. quick demos).
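A common first-aid measure is a defensive parser that salvages JSON from a chatty response. The `extractJson` helper below is a hypothetical sketch, not a library function: it prefers a fenced block if one exists, otherwise grabs the outermost braces.

```typescript
// Hypothetical helper: salvage a JSON object from a chatty LLM response
// by stripping markdown fences and any surrounding prose.
function extractJson(raw: string): unknown {
  // Prefer the contents of a ```json ... ``` fence if one exists
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = fenced ? fenced[1] : raw;
  // Otherwise take the outermost {...} span
  const start = candidate.indexOf('{');
  const end = candidate.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in response');
  }
  return JSON.parse(candidate.slice(start, end + 1));
}

const messy =
  'Of course! Here is the analysis:\n```json\n{"sentiment": "positive"}\n```\nHope that helps!';
const parsed = extractJson(messy) as { sentiment: string };
console.log(parsed.sentiment); // "positive"
```

It rescues some responses, but it's triage, not a cure.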
This post is about fixing that.
2. Understanding the Role Structure
Modern LLM APIs organize conversations into three roles:
| Role | Purpose | Written by |
|---|---|---|
| system | Sets model behavior and persona | Developer |
| user | User input, questions, requests | User or developer |
| assistant | Previous model responses (used in few-shot) | Model or developer |
The Importance of System Prompts
The system prompt is the operating manual you give the LLM — "here's who you are, here's how you behave." Enforcing output format here dramatically improves consistency.
const systemPrompt = `
You are a sentiment analysis specialist for user reviews.
## Output Rules (MANDATORY)
- Output ONLY valid JSON. Include no other text whatsoever.
- Do NOT use JSON code blocks (\`\`\`json ... \`\`\`)
- Do NOT use markdown
- Do NOT add explanatory text
## JSON Schema
{
"sentiment": "positive" | "negative" | "neutral",
"confidence": number between 0.0 and 1.0,
"key_phrases": string[] (max 3 items),
"issues": string[] (empty array [] if none)
}
`;
This improves consistency, but still can't guarantee 100% compliance. The stronger methods are below.
3. Few-shot Prompting
Showing the model examples of what you want is far more effective than describing it.
const messages = [
{
role: "system" as const,
content: "You are a review sentiment analyst. Output JSON only."
},
// Example 1: positive case
{
role: "user" as const,
content: "Review: This product is amazing! Fast shipping and great quality."
},
{
role: "assistant" as const,
content: JSON.stringify({
sentiment: "positive",
confidence: 0.95,
key_phrases: ["amazing product", "fast shipping", "great quality"],
issues: []
})
},
// Example 2: negative case
{
role: "user" as const,
content: "Review: The packaging was terrible and the product arrived scratched."
},
{
role: "assistant" as const,
content: JSON.stringify({
sentiment: "negative",
confidence: 0.88,
key_phrases: ["terrible packaging", "product scratched"],
issues: ["poor packaging quality", "product damage"]
})
},
// Example 3: edge case — neutral
{
role: "user" as const,
content: "Review: It's okay I guess. Nothing special, nothing bad."
},
{
role: "assistant" as const,
content: JSON.stringify({
sentiment: "neutral",
confidence: 0.72,
key_phrases: ["okay", "nothing special"],
issues: []
})
},
// Actual input
{
role: "user" as const,
content: `Review: ${userReview}`
}
];
The key to good few-shot: diversity of examples. Only showing happy paths fails at edge cases. Include negative, neutral, and cases with issues.
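One way to keep that diversity honest is to store the examples as data and generate the message array from them; coverage of edge cases then becomes easy to audit. The `Example` type and `buildMessages` below are illustrative names, not SDK APIs.

```typescript
// Few-shot examples as data: add an edge case by adding a row, not by
// hand-editing a message array.
type Sentiment = 'positive' | 'negative' | 'neutral';

interface Example {
  review: string;
  label: { sentiment: Sentiment; confidence: number; key_phrases: string[]; issues: string[] };
}

const examples: Example[] = [
  { review: 'Amazing! Fast shipping.', label: { sentiment: 'positive', confidence: 0.95, key_phrases: ['amazing', 'fast shipping'], issues: [] } },
  { review: 'Arrived scratched.', label: { sentiment: 'negative', confidence: 0.88, key_phrases: ['scratched'], issues: ['product damage'] } },
  { review: "It's okay I guess.", label: { sentiment: 'neutral', confidence: 0.7, key_phrases: ['okay'], issues: [] } },
];

function buildMessages(review: string) {
  return [
    { role: 'system' as const, content: 'You are a review sentiment analyst. Output JSON only.' },
    // Each example becomes a user/assistant pair
    ...examples.flatMap((ex) => [
      { role: 'user' as const, content: `Review: ${ex.review}` },
      { role: 'assistant' as const, content: JSON.stringify(ex.label) },
    ]),
    { role: 'user' as const, content: `Review: ${review}` },
  ];
}

const messages = buildMessages('Great value for money.');
console.log(messages.length); // 8 (1 system + 3 user/assistant pairs + 1 final user)
```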
4. Chain-of-Thought — Give the Model Space to Think
CoT prompts the model to reason step-by-step before producing a final answer. For complex analysis or judgment tasks, accuracy improves substantially.
// Without CoT (simple classification)
const withoutCoT = `
Determine if this contract is valid.
Contract: "${contractText}"
Output: {"valid": boolean, "reason": string}
`;
// With CoT (step-by-step reasoning)
const withCoT = `
Analyze this contract's validity through these steps:
1. Parties: Are both parties clearly identified?
2. Purpose: Is the contract's purpose clear?
3. Obligations: Are rights and duties specified?
4. Legal requirements: Are signatures, dates, and legal formalities present?
5. Final judgment: Synthesize the above analysis.
Document your reasoning in the "analysis" field, then provide "valid" and "reason".
{
"analysis": {
"parties": "string",
"purpose": "string",
"obligations": "string",
"legal_requirements": "string"
},
"valid": boolean,
"reason": "string"
}
Contract: "${contractText}"
`;
The key: include the reasoning trace in the JSON output. This forces the model to actually work through the steps before concluding — and the trace stays in the response for debugging.
When Is CoT Needed?
| Task Type | CoT Necessity | Examples |
|---|---|---|
| Simple classification | Low | Sentiment labeling |
| Information extraction | Low | Entity extraction |
| Complex reasoning | High | Contract analysis, code review |
| Numerical calculation | High | Cost estimation, formula application |
| Multi-step judgment | High | Medical triage, legal review |
5. JSON Mode and Function Calling
Prompt engineering alone can't guarantee structured output 100% of the time. We need stronger mechanisms.
JSON Mode (OpenAI)
Pass response_format: { type: "json_object" } and the model is guaranteed to return syntactically valid JSON. (OpenAI also requires that the word "JSON" appear somewhere in your messages when this mode is on.)
import OpenAI from 'openai';
const openai = new OpenAI();
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
response_format: { type: 'json_object' }, // JSON mode
messages: [
{
role: 'system',
content: 'Analyze sentiment and return JSON. sentiment must be one of: positive/negative/neutral.'
},
{
role: 'user',
content: `Review: ${userReview}`
}
]
});
// JSON.parse() won't throw on malformed JSON (though a response cut off by max_tokens can still break)
const result = JSON.parse(response.choices[0].message.content!);
Caveat: valid JSON is guaranteed, but the schema (field names, types, value constraints) is not.
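If you want a runtime check on the schema without pulling in a dependency, a hand-rolled type guard works; section 6 shows the more ergonomic Zod version. The `SentimentResult` shape and `isSentimentResult` below are a sketch matching the schema used throughout this post.

```typescript
// Dependency-free runtime guard for the shape JSON mode alone won't enforce.
interface SentimentResult {
  sentiment: 'positive' | 'negative' | 'neutral';
  confidence: number;
  key_phrases: string[];
  issues: string[];
}

function isSentimentResult(value: unknown): value is SentimentResult {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.sentiment === 'string' &&
    ['positive', 'negative', 'neutral'].includes(v.sentiment) &&
    typeof v.confidence === 'number' &&
    v.confidence >= 0 && v.confidence <= 1 &&
    Array.isArray(v.key_phrases) && v.key_phrases.every((p) => typeof p === 'string') &&
    Array.isArray(v.issues) && v.issues.every((i) => typeof i === 'string')
  );
}

const fromJsonMode: unknown = JSON.parse(
  '{"sentiment":"positive","confidence":0.9,"key_phrases":["great"],"issues":[]}'
);
if (!isSentimentResult(fromJsonMode)) throw new Error('Schema violation: retry or fall back');
console.log(fromJsonMode.sentiment); // safely typed from here on
```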
Structured Outputs (OpenAI)
Specify a JSON Schema explicitly and the model returns output matching that schema exactly.
const response = await openai.chat.completions.create({
model: 'gpt-4o-2024-08-06', // Structured Outputs supported model
response_format: {
type: 'json_schema',
json_schema: {
name: 'sentiment_analysis',
strict: true,
schema: {
type: 'object',
properties: {
sentiment: {
type: 'string',
enum: ['positive', 'negative', 'neutral'] // constrained values!
},
confidence: {
type: 'number',
minimum: 0,
maximum: 1
},
key_phrases: {
type: 'array',
items: { type: 'string' },
maxItems: 3
},
issues: {
type: 'array',
items: { type: 'string' }
}
},
required: ['sentiment', 'confidence', 'key_phrases', 'issues'],
additionalProperties: false
}
}
},
messages: [...]
});
Function Calling
Function Calling is marketed as "the model can call functions" — but the real power is that it's the strongest way to get structured output.
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
tools: [
{
type: 'function',
function: {
name: 'analyze_sentiment',
description: 'Analyzes the sentiment of a review text',
parameters: {
type: 'object',
properties: {
sentiment: {
type: 'string',
enum: ['positive', 'negative', 'neutral'],
},
confidence: {
type: 'number',
description: 'Classification confidence (0.0 to 1.0)'
},
key_phrases: {
type: 'array',
items: { type: 'string' },
description: 'Key phrases (max 3)'
},
issues: {
type: 'array',
items: { type: 'string' },
description: 'List of identified issues'
}
},
required: ['sentiment', 'confidence', 'key_phrases', 'issues']
}
}
}
],
tool_choice: { type: 'function', function: { name: 'analyze_sentiment' } },
messages: [
{ role: 'system', content: 'You are a review sentiment analyst.' },
{ role: 'user', content: `Review: ${userReview}` }
]
});
const toolCall = response.choices[0].message.tool_calls?.[0];
if (toolCall) {
const result = JSON.parse(toolCall.function.arguments);
console.log(result.sentiment); // 'positive' | 'negative' | 'neutral' (a hard guarantee only with strict: true on the function definition)
}
6. Type-Safe Output with Zod + AI SDK
Combine the Vercel AI SDK (ai package) with Zod to get full TypeScript type safety.
import { openai } from '@ai-sdk/openai';
import { generateObject } from 'ai';
import { z } from 'zod';
// Define Zod schema
const SentimentSchema = z.object({
sentiment: z.enum(['positive', 'negative', 'neutral']),
confidence: z.number().min(0).max(1),
key_phrases: z.array(z.string()).max(3),
issues: z.array(z.string()),
});
// Type inference
type SentimentResult = z.infer<typeof SentimentSchema>;
// {
// sentiment: "positive" | "negative" | "neutral";
// confidence: number;
// key_phrases: string[];
// issues: string[];
// }
async function analyzeSentiment(review: string): Promise<SentimentResult> {
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: SentimentSchema,
prompt: `Analyze the sentiment of this review: "${review}"`,
system: 'You are a review sentiment analysis specialist.'
});
// object is auto-inferred as SentimentResult
// Zod validation runs automatically
return object;
}
const result = await analyzeSentiment("Great product, fast shipping!");
console.log(result.sentiment); // TypeScript knows this is 'positive' | 'negative' | 'neutral'
generateObject uses Function Calling or Structured Outputs internally, then runs Zod validation on top. Type errors get caught at compile time, not runtime.
Complex Nested Schemas
const ProductExtractionSchema = z.object({
products: z.array(z.object({
name: z.string(),
category: z.enum(['electronics', 'clothing', 'food', 'other']),
price: z.number().positive().optional(),
attributes: z.record(z.string()), // dynamic key-value
})),
total_count: z.number().int().nonnegative(),
confidence: z.number().min(0).max(1),
});
const { object } = await generateObject({
model: openai('gpt-4o'),
schema: ProductExtractionSchema,
prompt: `Extract product info from this shopping list: "${shoppingList}"`,
});
// object.products is Array<{name: string; category: ...}> — fully typed
7. Common Failure Patterns and Fixes
Failure 1: Model ignores format instructions
Symptom: Asked for JSON only, got explanatory text attached
Cause: Weak system prompt, ambiguous instructions
Fix:
1. Explicitly enumerate prohibited behaviors in system prompt
2. Show desired format with few-shot examples
3. Use JSON mode or Function Calling
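These fixes compose into a retry loop: call, validate, and on failure re-prompt with the error fed back. The sketch below is generic; `callModel` and `validate` are placeholders for your own client wrapper and parser, not real SDK functions.

```typescript
// Generic retry wrapper: call the model, validate the output, and re-prompt
// on failure so the next attempt can self-correct.
async function withRetry<T>(
  callModel: (extraInstruction: string) => Promise<string>,
  validate: (raw: string) => T | null,
  maxAttempts = 3,
): Promise<T> {
  let instruction = '';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(instruction);
    const result = validate(raw);
    if (result !== null) return result;
    // Feed the failure back into the next attempt
    instruction = 'Previous response was not valid JSON matching the schema. Output ONLY the JSON object.';
  }
  throw new Error(`No valid response after ${maxAttempts} attempts`);
}

// Demo with a fake model that fails once, then complies:
let calls = 0;
const fakeModel = async (_extra: string) =>
  ++calls === 1 ? 'Sure! Here you go: {"ok": true}' : '{"ok": true}';
const parseJson = (raw: string) => {
  try { return JSON.parse(raw) as { ok: boolean }; } catch { return null; }
};
withRetry(fakeModel, parseJson).then((r) => console.log(r.ok, calls)); // true 2
```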
Failure 2: Inconsistent enum values
Symptom: "positive", "POSITIVE", "Positive", "pos" all appear
Cause: Possible values not clearly constrained in prompt
Fix:
- Use enum in JSON schema
- Use exactly the same values in few-shot examples
- Post-process to normalize (toLowerCase, etc.)
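The post-processing fix can be as simple as lowercasing plus an alias table. `normalizeSentiment` and its `ALIASES` map are illustrative; extend the aliases with whatever variants you actually see in your logs.

```typescript
// Normalizer for drifting enum values: fold case, map known aliases,
// and fail loudly on anything unrecognized.
const SENTIMENTS = ['positive', 'negative', 'neutral'] as const;
type Sentiment = (typeof SENTIMENTS)[number];

const ALIASES: Record<string, Sentiment> = { pos: 'positive', neg: 'negative', neu: 'neutral' };

function normalizeSentiment(raw: string): Sentiment {
  const lowered = raw.trim().toLowerCase();
  if ((SENTIMENTS as readonly string[]).includes(lowered)) return lowered as Sentiment;
  if (lowered in ALIASES) return ALIASES[lowered];
  throw new Error(`Unrecognized sentiment value: "${raw}"`);
}

console.log(normalizeSentiment('POSITIVE')); // "positive"
console.log(normalizeSentiment(' pos '));    // "positive"
```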
Failure 3: null vs empty array for optional fields
Symptom: issues is sometimes null, sometimes []
Cause: Model decides how to handle empty case on its own
Fix:
- Zod schema: z.array(z.string()).default([])
- Prompt: "Use empty array [] when there are no issues"
- Show empty array case explicitly in few-shot examples
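A dependency-free version of the same belt-and-braces idea: coerce nullish array fields to `[]` right after parsing, whatever the model emitted. `coerceArray` is a hypothetical helper.

```typescript
// Coerce null/undefined array fields to [] so downstream code can always
// iterate without null checks.
function coerceArray<T>(value: T[] | null | undefined): T[] {
  return value ?? [];
}

const fromModel = JSON.parse('{"sentiment":"positive","issues":null}');
const issues = coerceArray<string>(fromModel.issues);
console.log(issues.length); // 0
```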
Failure 4: Missing fields in nested objects
Symptom: Some fields are missing or renamed in nested objects
Cause: Deep nesting is hard for models to follow perfectly
Fix:
- Flatten the schema wherever possible
- Use nesting only when necessary
- Enumerate required fields explicitly
- Use Function Calling strict mode
Failure 5: Mixed number types
Symptom: Price comes as "50000" (string) or "50,000" (with comma)
Cause: Models tend to represent numbers as formatted text
Fix:
- Prompt: "Numbers must be integers or floats with no comma separators"
- Zod: z.number() (auto-validates type)
- Post-process: parseFloat(String(value).replace(/,/g, ''))
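The post-processing one-liner above, wrapped into a defensive helper (`parseNumeric` is our own name): accept real numbers and comma-formatted numeric strings, reject everything else.

```typescript
// Accepts a number, a plain numeric string, or a comma-grouped numeric
// string; throws on anything that isn't a usable number.
function parseNumeric(value: unknown): number {
  if (typeof value === 'number' && Number.isFinite(value)) return value;
  if (typeof value === 'string') {
    const cleaned = value.replace(/,/g, '').trim();
    const parsed = Number(cleaned); // Number() rejects trailing garbage, unlike parseFloat()
    if (cleaned !== '' && Number.isFinite(parsed)) return parsed;
  }
  throw new Error(`Not a usable number: ${JSON.stringify(value)}`);
}

console.log(parseNumeric(50000));    // 50000
console.log(parseNumeric('50,000')); // 50000
console.log(parseNumeric('49.99'));  // 49.99
```

Using `Number()` rather than `parseFloat()` is deliberate: `parseFloat('49.99abc')` silently returns 49.99, while `Number('49.99abc')` is NaN.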
8. Temperature and Top-p Tuning
For structured output, consistency matters more than creativity.
Temperature
Controls the "randomness" of model output.
| Temperature | Behavior | Best Use Case |
|---|---|---|
| 0.0 | Always picks most probable token | Structured output, classification, extraction |
| 0.3–0.5 | Slight variation | Summarization, Q&A |
| 0.7–1.0 | Creative | Writing, brainstorming |
| 1.0+ | Very creative / unstable | Experimental |
For structured output: use temperature = 0, or at most 0.1.
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
schema: SentimentSchema,
temperature: 0, // maximum consistency
prompt: `Analyze: "${review}"`,
});
Top-p (Nucleus Sampling)
Top-p samples only from tokens whose cumulative probability reaches p. Similar effect to temperature but different mechanism. For structured output: leave top-p alone, just lower temperature. Changing both simultaneously is rarely necessary and makes behavior harder to predict.
9. Prompt Version Control
Prompts are code. Version control them.
// prompts/sentiment-analysis/v1.ts
export const SENTIMENT_ANALYSIS_PROMPT = {
version: '1.0.0',
system: `You are a review sentiment analyst...`,
description: 'Initial version — basic sentiment classification',
createdAt: '2026-01-01',
};
// prompts/sentiment-analysis/v2.ts
export const SENTIMENT_ANALYSIS_PROMPT_V2 = {
version: '2.0.0',
system: `You are a review sentiment analyst.
You also analyze sentiment intensity...`,
description: 'v2 — added intensity field',
createdAt: '2026-03-01',
};
Maintain a test suite for prompt performance:
const testCases = [
{
input: "This is an amazing product!",
expected: { sentiment: "positive" },
},
{
input: "The packaging was terrible.",
expected: { sentiment: "negative" },
},
{
input: "It's just okay, nothing special.",
expected: { sentiment: "neutral" },
},
];
describe('Sentiment Analysis Prompt', () => {
it.each(testCases)('correctly classifies "$input"', async ({ input, expected }) => {
const result = await analyzeSentiment(input);
expect(result.sentiment).toBe(expected.sentiment);
});
});
Run this test suite every time you modify a prompt to catch regressions.
10. Conclusion
Tiered strategy for getting structured output:
- Prompt level (quick start): Format specification in system prompt + few-shot examples
- JSON mode (guarantee valid JSON):
response_format: { type: "json_object" }
- Function Calling / Structured Outputs (guarantee schema): Recommended for production
- Zod + AI SDK (TypeScript type safety): Final form for TypeScript codebases
One principle to internalize: the more ambiguous the prompt, the more freely the model interprets it. When you need structured output, eliminate ambiguity ruthlessly and enforce the contract with technical mechanisms.
Prompt engineering isn't casting magic incantations; it's specification writing. Like a good product spec, the more precisely you document edge cases, expected formats, and prohibited behaviors, the more reliably the LLM behaves.