
OpenTelemetry: Finding Bottlenecks with Distributed Tracing
How do you find where requests slow down in a microservices system? Learn how OpenTelemetry ties logs, metrics, and traces together to pinpoint bottlenecks with real code examples.

You get a report: a specific API is occasionally taking 2–3 seconds. You check the logs — no errors, just "slow". With 5 microservices chained together, even figuring out where it's slow is painful. That's what life without observability looks like.
OpenTelemetry (OTel) tackles this problem head-on. Once set up properly, you can see exactly which service, which function, and how many milliseconds each piece took.
Three types of data help you understand a system:
| Type | Description | Example |
|---|---|---|
| Logs | Event records at a point in time | "User 42 logged in at 14:32" |
| Metrics | Aggregated numeric system state | http_requests_total = 1024 |
| Traces | Full request flow with durations | A → B → C, each leg in ms |
Logs tell you what happened but not why it's slow. Metrics tell you that it's slow but not where it's slow. Only traces reveal "the DB query in the A→B leg ate 800ms."
OpenTelemetry unifies all three under one SDK, one standard.
OpenTelemetry is a CNCF project born in 2019 from the merger of OpenCensus and OpenTracing. Its core value is vendor-neutral instrumentation.
```
[ Your App ] → [ OTel SDK ] → [ OTel Collector ] → [ Jaeger / Tempo / Datadog / ... ]
```
Instrument once with OTel, and you can swap backends (Jaeger, Grafana Tempo, Honeycomb, Datadog) without touching app code. That's the real win.
The basic unit of tracing is a Span.
Span = a record of "what work was done and how long it took"
Multiple Spans connected in parent-child relationships form a Trace.
```
Trace ID: abc-123
│
├─ Span: POST /api/orders (342ms total)
│   ├─ Span: validateRequest (3ms)
│   ├─ Span: getUserFromDB (45ms)
│   ├─ Span: checkInventory → InventoryService (210ms)  ← here's the slowdown
│   │   ├─ Span: Redis cache lookup (2ms) - MISS
│   │   └─ Span: PostgreSQL query (205ms)
│   └─ Span: createOrderRecord (84ms)
```
You can immediately see that checkInventory took 210ms, and inside it, a Redis cache miss led to a 205ms PostgreSQL query. This picture is impossible to construct from logs alone.
A Span carries rich metadata:
```jsonc
{
  traceId: "abc123...",         // identifies the whole trace
  spanId: "def456...",          // identifies this span
  parentSpanId: "bcd789...",    // parent span (null if root)
  name: "GET /api/users/:id",   // operation name
  startTime: 1711001234567,     // nanosecond timestamp
  endTime: 1711001234812,
  attributes: {
    "http.method": "GET",
    "http.url": "/api/users/42",
    "http.status_code": 200,
    "db.system": "postgresql",
    "db.statement": "SELECT ...",
  },
  events: [
    { name: "cache.miss", timestamp: ... }
  ],
  status: { code: "OK" }
}
```
One of OTel's strengths is auto-instrumentation. It instruments Express, HTTP, PostgreSQL, Redis, and more automatically — no code changes needed.
```bash
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
```
```typescript
// src/tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";

const exporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: "order-service",
    [SEMRESATTRS_SERVICE_VERSION]: "1.0.0",
    "deployment.environment": process.env.NODE_ENV ?? "development",
  }),
  traceExporter: exporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      // Skip noisy health-check spans
      "@opentelemetry/instrumentation-http": {
        ignoreIncomingRequestHook: (req) => req.url === "/health",
      },
      "@opentelemetry/instrumentation-express": { enabled: true },
      "@opentelemetry/instrumentation-pg": { enabled: true },
      "@opentelemetry/instrumentation-redis": { enabled: true },
    }),
  ],
});

sdk.start();
console.log("OpenTelemetry SDK started");

// Flush pending spans before the process exits
process.on("SIGTERM", () => {
  sdk.shutdown().then(() => process.exit(0));
});
```
```typescript
// src/index.ts — must come before any other imports
import "./tracing";
import express from "express";
import { orderRouter } from "./routes/orders";

const app = express();
app.use(express.json());
app.use("/api/orders", orderRouter);
app.listen(3000);
```
Auto-instrumentation alone captures Express routes, DB queries, and outbound HTTP calls as spans.
Auto-instrumentation covers infrastructure-level spans. To find bottlenecks inside your business logic, add spans manually.
```typescript
// src/services/order.service.ts
// `redis` and `db` are assumed to be clients configured elsewhere in the service.
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service", "1.0.0");

export async function processOrder(orderId: string, userId: string) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttributes({
      "order.id": orderId,
      "user.id": userId,
    });
    try {
      const user = await validateAndFetchUser(userId);
      const items = await fetchOrderItems(orderId);
      const result = await chargePayment(user, items);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end(); // always call end()
    }
  });
}

async function validateAndFetchUser(userId: string) {
  return tracer.startActiveSpan("validateAndFetchUser", async (span) => {
    span.setAttribute("user.id", userId);
    try {
      const cached = await redis.get(`user:${userId}`);
      if (cached) {
        span.addEvent("cache.hit");
        return JSON.parse(cached);
      }
      span.addEvent("cache.miss");
      const user = await db.query("SELECT * FROM users WHERE id = $1", [userId]);
      if (!user) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: "User not found" });
        throw new Error(`User ${userId} not found`);
      }
      await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
      return user;
    } finally {
      span.end();
    }
  });
}
```
To stitch traces across microservices, you need context propagation — injecting traceparent into HTTP headers.
```typescript
import { propagation, context, trace } from "@opentelemetry/api";
import axios from "axios";

const tracer = trace.getTracer("order-service");

async function callInventoryService(itemId: string) {
  return tracer.startActiveSpan("callInventoryService", async (span) => {
    span.setAttribute("item.id", itemId);
    const headers: Record<string, string> = {};
    propagation.inject(context.active(), headers);
    // headers now contains 'traceparent' and 'tracestate'
    try {
      const response = await axios.get(
        `http://inventory-service/items/${itemId}`,
        { headers }
      );
      span.setAttribute("http.status_code", response.status);
      return response.data;
    } finally {
      span.end();
    }
  });
}
```
The receiving service, if auto-instrumented, extracts the context from headers automatically. This connects spans across service boundaries into a single trace.
The fastest way to visualize traces is Jaeger. Spin it up with Docker in 30 seconds.
```yaml
# docker-compose.yml
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  order-service:
    build: .
    ports:
      - "3000:3000"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
    depends_on:
      - jaeger
```

```bash
docker-compose up -d
# Open http://localhost:16686 for the Jaeger UI
```
In production, a common setup is OTel Collector as gateway, storing to Grafana Tempo.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  attributes:
    actions:
      - key: "db.statement"
        action: delete   # strip sensitive SQL

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
The Collector lets you swap backends, fan out to multiple backends, and filter sensitive data — all without touching app code.
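The Collector can also make smarter sampling decisions than the SDK, because it sees complete traces before deciding. The tail_sampling processor (shipped in the collector-contrib distribution, not the core image) can keep every error and every slow trace while sampling only a fraction of the healthy ones. A sketch; the policy names and thresholds are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # wait for the full trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Policies are ORed: a trace is kept if any policy matches, so errors and slow requests always survive while routine traffic is sampled at 10%.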
Symptom: POST /api/checkout averages 1.8 seconds. SLA is 500ms.
Trace analysis:
```
POST /api/checkout (1823ms)
├─ parseAndValidateRequest (8ms)
├─ fetchUserProfile (12ms)
├─ fetchCartItems (15ms)
├─ validateInventory (1650ms)        ← !!!
│   ├─ item:1001 check (11ms)
│   ├─ item:1002 check (14ms)
│   ├─ item:1003 check (1580ms)      ← this item is the problem
│   │   ├─ cache lookup (2ms) - MISS
│   │   └─ db query (1578ms)         ← full table scan!
│   └─ item:1004 check (12ms)
└─ calculateTotal (3ms)
```
The DB query for item:1003 took 1578ms. Check the query:
```sql
-- Problematic query (no index on item_id)
SELECT * FROM inventory WHERE item_id = '1003' AND warehouse_id = 'WH-42';

-- Fix: add a composite index
CREATE INDEX idx_inventory_item_warehouse
  ON inventory(item_id, warehouse_id);
```
After adding the index: same query runs in 3ms. Total checkout time drops to 248ms.
Without traces, you'd know "the DB is slow." With traces, you know exactly which item's query is the culprit.
Use specific span names
```typescript
// Bad
tracer.startActiveSpan("query", ...)

// Good
tracer.startActiveSpan("db.users.findById", ...)
```
Always record errors
```typescript
catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error; // still rethrow
}
```
Watch out for sensitive data
```typescript
// Use parameterized form, not actual values
span.setAttribute("db.statement", "SELECT * FROM users WHERE id = $1");
```
Configure sampling for high traffic
```typescript
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  sampler: new TraceIdRatioBasedSampler(0.1), // sample 10% of traces
  // ...
});
```
Recommended rollout order: create tracing.ts and install the auto-instrumentation packages first, then stand up a local Jaeger to see your traces, then add manual spans around business logic, and finally introduce an OTel Collector once you need filtering or multiple backends.

The difference between grepping logs and looking at traces when investigating a slowdown is enormous. Set it up once and the investment pays back many times over.