OpenTelemetry: Finding Bottlenecks with Distributed Tracing
You get a report: a specific API is occasionally taking 2–3 seconds. You check the logs — no errors, just "slow". With 5 microservices chained together, even figuring out where it's slow is painful. That's what life without observability looks like.
OpenTelemetry (OTel) tackles this problem head-on. Once set up properly, you can see exactly which service, which function, and how many milliseconds each piece took.
The Three Pillars of Observability
Three types of data help you understand a system:
| Type | Description | Example |
|---|---|---|
| Logs | Event records at a point in time | "User 42 logged in at 14:32" |
| Metrics | Aggregated numeric system state | http_requests_total = 1024 |
| Traces | Full request flow with durations | A → B → C, each leg in ms |
Logs tell you what happened but not why it's slow. Metrics tell you that it's slow but not where it's slow. Only traces reveal "the DB query in the A→B leg ate 800ms."
OpenTelemetry unifies all three under one SDK, one standard.
What Is OpenTelemetry?
A CNCF project born in 2019 from the merger of OpenCensus and OpenTracing. The core value is vendor-neutral instrumentation.
[ Your App ] → [ OTel SDK ] → [ OTel Collector ] → [ Jaeger / Tempo / Datadog / ... ]
Instrument once with OTel, and you can swap backends (Jaeger, Grafana Tempo, Honeycomb, Datadog) without touching app code. That's the real win.
Core Components
- API: Interface definitions for instrumentation
- SDK: Implementation of the API (data collection/processing)
- Collector: Agent/gateway that receives, processes, and forwards data
- OTLP: OpenTelemetry Protocol — the standard wire format
Understanding Spans and Traces
The basic unit of tracing is a Span.
Span = a record of "what work was done and how long it took"
Multiple Spans connected in parent-child relationships form a Trace.
Trace ID: abc-123
│
├─ Span: POST /api/orders (342ms total)
│ ├─ Span: validateRequest (3ms)
│ ├─ Span: getUserFromDB (45ms)
│ ├─ Span: checkInventory → InventoryService (210ms) ← here's the slowdown
│ │ ├─ Span: Redis cache lookup (2ms) - MISS
│ │ └─ Span: PostgreSQL query (205ms)
│ └─ Span: createOrderRecord (84ms)
You can immediately see that checkInventory took 210ms, and inside it, a Redis cache miss led to a 205ms PostgreSQL query. This picture is impossible to construct from logs alone.
Span Attributes
A Span carries rich metadata:
{
traceId: "abc123...", // Identifies the whole trace
spanId: "def456...", // Identifies this span
parentSpanId: "bcd789...", // Parent span (null if root)
name: "GET /api/users/:id", // Operation name
startTime: 1711001234567, // Nanosecond timestamp
endTime: 1711001234812,
attributes: {
"http.method": "GET",
"http.url": "/api/users/42",
"http.status_code": 200,
"db.system": "postgresql",
"db.statement": "SELECT ...",
},
events: [
{ name: "cache.miss", timestamp: ... }
],
status: { code: "OK" }
}
Auto-Instrumentation for Node.js
One of OTel's strengths is auto-instrumentation. It instruments Express, HTTP, PostgreSQL, Redis, and more automatically — no code changes needed.
Install Packages
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-otlp-http \
@opentelemetry/resources \
@opentelemetry/semantic-conventions
Write tracing.ts
// src/tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";
const exporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
});
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: "order-service",
[SEMRESATTRS_SERVICE_VERSION]: "1.0.0",
"deployment.environment": process.env.NODE_ENV ?? "development",
}),
traceExporter: exporter,
instrumentations: [
getNodeAutoInstrumentations({
"@opentelemetry/instrumentation-http": {
ignoreIncomingRequestHook: (req) => req.url === "/health",
},
"@opentelemetry/instrumentation-express": { enabled: true },
"@opentelemetry/instrumentation-pg": { enabled: true },
"@opentelemetry/instrumentation-redis": { enabled: true },
}),
],
});
sdk.start();
console.log("OpenTelemetry SDK started");
process.on("SIGTERM", () => {
sdk.shutdown().then(() => process.exit(0));
});
Load It First
// src/index.ts — must come before any other imports
import "./tracing";
import express from "express";
import { orderRouter } from "./routes/orders";
const app = express();
app.use(express.json());
app.use("/api/orders", orderRouter);
app.listen(3000);
Auto-instrumentation alone captures Express routes, DB queries, and outbound HTTP calls as spans.
Manual Instrumentation: Tracing Business Logic
Auto-instrumentation covers infrastructure-level spans. To find bottlenecks inside your business logic, add spans manually.
// src/services/order.service.ts
import { trace, context, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service", "1.0.0");
export async function processOrder(orderId: string, userId: string) {
return tracer.startActiveSpan("processOrder", async (span) => {
span.setAttributes({
"order.id": orderId,
"user.id": userId,
});
try {
const user = await validateAndFetchUser(userId);
const items = await fetchOrderItems(orderId);
const result = await chargePayment(user, items);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.recordException(error as Error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
throw error;
} finally {
span.end(); // always call end()
}
});
}
async function validateAndFetchUser(userId: string) {
return tracer.startActiveSpan("validateAndFetchUser", async (span) => {
span.setAttribute("user.id", userId);
try {
const cached = await redis.get(`user:${userId}`);
if (cached) {
span.addEvent("cache.hit");
return JSON.parse(cached);
}
span.addEvent("cache.miss");
const user = await db.query("SELECT * FROM users WHERE id = $1", [userId]);
if (!user) {
span.setStatus({ code: SpanStatusCode.ERROR, message: "User not found" });
throw new Error(`User ${userId} not found`);
}
await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
return user;
} finally {
span.end();
}
});
}
Context Propagation
To stitch traces across microservices, you need context propagation — injecting traceparent into HTTP headers.
import { propagation, context } from "@opentelemetry/api";
import axios from "axios";
async function callInventoryService(itemId: string) {
return tracer.startActiveSpan("callInventoryService", async (span) => {
span.setAttribute("item.id", itemId);
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);
// headers now contains 'traceparent' and 'tracestate'
try {
const response = await axios.get(
`http://inventory-service/items/${itemId}`,
{ headers }
);
span.setAttribute("http.status_code", response.status);
return response.data;
} finally {
span.end();
}
});
}
The receiving service, if auto-instrumented, extracts the context from headers automatically. This connects spans across service boundaries into a single trace.
Exporting to Jaeger
The fastest way to visualize traces is Jaeger. Spin it up with Docker in 30 seconds.
# docker-compose.yml
version: "3.8"
services:
jaeger:
image: jaegertracing/all-in-one:1.55
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
order-service:
build: .
ports:
- "3000:3000"
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
depends_on:
- jaeger
docker-compose up -d
# Open http://localhost:16686 for the Jaeger UI
Grafana Tempo + OTel Collector
In production, a common setup is OTel Collector as gateway, storing to Grafana Tempo.
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318"
processors:
batch:
timeout: 1s
send_batch_size: 1024
attributes:
actions:
- key: "db.statement"
action: delete # strip sensitive SQL
exporters:
otlp/tempo:
endpoint: "tempo:4317"
tls:
insecure: true
prometheus:
endpoint: "0.0.0.0:8889"
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, attributes]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
The Collector lets you swap backends, fan out to multiple backends, and filter sensitive data — all without touching app code.
A Real Debugging Example
Symptom: POST /api/checkout averages 1.8 seconds. SLA is 500ms.
Trace analysis:
POST /api/checkout (1823ms)
├─ parseAndValidateRequest (8ms)
├─ fetchUserProfile (12ms)
├─ fetchCartItems (15ms)
├─ validateInventory (1650ms) ← !!!
│ ├─ item:1001 check (11ms)
│ ├─ item:1002 check (14ms)
│ ├─ item:1003 check (1580ms) ← this item is the problem
│ │ ├─ cache lookup (2ms) - MISS
│ │ └─ db query (1578ms) ← full table scan!
│ └─ item:1004 check (12ms)
└─ calculateTotal (3ms)
The DB query for item:1003 took 1578ms. Check the query:
-- Problematic query (no index on item_id)
SELECT * FROM inventory WHERE item_id = '1003' AND warehouse_id = 'WH-42';
-- Fix: add a composite index
CREATE INDEX idx_inventory_item_warehouse
ON inventory(item_id, warehouse_id);
After adding the index: same query runs in 3ms. Total checkout time drops to 248ms.
Without traces, you'd know "the DB is slow." With traces, you know exactly which item's query is the culprit.
Best Practices
Use specific span names
// Bad
tracer.startActiveSpan("query", ...)
// Good
tracer.startActiveSpan("db.users.findById", ...)
Always record errors
catch (error) {
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR });
throw error; // still rethrow
}
Watch out for sensitive data
// Use parameterized form, not actual values
span.setAttribute("db.statement", "SELECT * FROM users WHERE id = $1");
Configure sampling for high traffic
import { TraceIdRatioBased } from "@opentelemetry/sdk-trace-base";
const sdk = new NodeSDK({
sampler: new TraceIdRatioBased(0.1), // sample 10%
// ...
});
Wrap-Up
Recommended rollout order:
- Write
tracing.ts+ install auto-instrumentation packages - Run Jaeger locally, verify traces appear
- Add manual spans to critical business logic
- Introduce OTel Collector for production
- Wire up Grafana Tempo + Grafana dashboards
The difference between grepping logs and looking at traces when investigating a slowdown is enormous. Set it up once and the investment pays back many times over.