E·86PRACTICAL DEV2026.03.213 MIN READ

OpenTelemetry: Finding Bottlenecks with Distributed Tracing

OpenTelemetry: 분산 추적으로 병목 찾기

How do you find where requests slow down in a microservices system? Learn how OpenTelemetry ties logs, metrics, and traces together to pinpoint bottlenecks with real code examples.

codemapo

INTERDISCIPLINARY DEV · SEOUL

OpenTelemetry: Finding Bottlenecks with Distributed Tracing

You get a report: a specific API is occasionally taking 2–3 seconds. You check the logs — no errors, just "slow". With 5 microservices chained together, even figuring out where it's slow is painful. That's what life without observability looks like.

OpenTelemetry (OTel) tackles this problem head-on. Once set up properly, you can see exactly which service, which function, and how many milliseconds each piece took.

The Three Pillars of Observability

Three types of data help you understand a system:

Type	Description	Example
Logs	Event records at a point in time	`"User 42 logged in at 14:32"`
Metrics	Aggregated numeric system state	`http_requests_total = 1024`
Traces	Full request flow with durations	A → B → C, each leg in ms

Logs tell you what happened but not why it's slow. Metrics tell you that it's slow but not where it's slow. Only traces reveal "the DB query in the A→B leg ate 800ms."

OpenTelemetry unifies all three under one SDK, one standard.

What Is OpenTelemetry?

A CNCF project born in 2019 from the merger of OpenCensus and OpenTracing. The core value is vendor-neutral instrumentation.

[ Your App ] → [ OTel SDK ] → [ OTel Collector ] → [ Jaeger / Tempo / Datadog / ... ]

Instrument once with OTel, and you can swap backends (Jaeger, Grafana Tempo, Honeycomb, Datadog) without touching app code. That's the real win.

Core Components

API: Interface definitions for instrumentation
SDK: Implementation of the API (data collection/processing)
Collector: Agent/gateway that receives, processes, and forwards data
OTLP: OpenTelemetry Protocol — the standard wire format

Understanding Spans and Traces

The basic unit of tracing is a Span.

Span = a record of "what work was done and how long it took"

Multiple Spans connected in parent-child relationships form a Trace.

Trace ID: abc-123
│
├─ Span: POST /api/orders (342ms total)
│   ├─ Span: validateRequest (3ms)
│   ├─ Span: getUserFromDB (45ms)
│   ├─ Span: checkInventory → InventoryService (210ms)  ← here's the slowdown
│   │   ├─ Span: Redis cache lookup (2ms) - MISS
│   │   └─ Span: PostgreSQL query (205ms)
│   └─ Span: createOrderRecord (84ms)

You can immediately see that checkInventory took 210ms, and inside it, a Redis cache miss led to a 205ms PostgreSQL query. This picture is impossible to construct from logs alone.

Span Attributes

A Span carries rich metadata:

{
  traceId: "abc123...",         // Identifies the whole trace
  spanId: "def456...",          // Identifies this span
  parentSpanId: "bcd789...",    // Parent span (null if root)
  name: "GET /api/users/:id",   // Operation name
  startTime: 1711001234567,     // Nanosecond timestamp
  endTime: 1711001234812,
  attributes: {
    "http.method": "GET",
    "http.url": "/api/users/42",
    "http.status_code": 200,
    "db.system": "postgresql",
    "db.statement": "SELECT ...",
  },
  events: [
    { name: "cache.miss", timestamp: ... }
  ],
  status: { code: "OK" }
}

Auto-Instrumentation for Node.js

One of OTel's strengths is auto-instrumentation. It instruments Express, HTTP, PostgreSQL, Redis, and more automatically — no code changes needed.

Install Packages

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Write tracing.ts

// src/tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";

const exporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: "order-service",
    [SEMRESATTRS_SERVICE_VERSION]: "1.0.0",
    "deployment.environment": process.env.NODE_ENV ?? "development",
  }),
  traceExporter: exporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-http": {
        ignoreIncomingRequestHook: (req) => req.url === "/health",
      },
      "@opentelemetry/instrumentation-express": { enabled: true },
      "@opentelemetry/instrumentation-pg": { enabled: true },
      "@opentelemetry/instrumentation-redis": { enabled: true },
    }),
  ],
});

sdk.start();
console.log("OpenTelemetry SDK started");

process.on("SIGTERM", () => {
  sdk.shutdown().then(() => process.exit(0));
});

Load It First

// src/index.ts — must come before any other imports
import "./tracing";

import express from "express";
import { orderRouter } from "./routes/orders";

const app = express();
app.use(express.json());
app.use("/api/orders", orderRouter);

app.listen(3000);

Auto-instrumentation alone captures Express routes, DB queries, and outbound HTTP calls as spans.

Manual Instrumentation: Tracing Business Logic

Auto-instrumentation covers infrastructure-level spans. To find bottlenecks inside your business logic, add spans manually.

// src/services/order.service.ts
import { trace, context, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service", "1.0.0");

export async function processOrder(orderId: string, userId: string) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttributes({
      "order.id": orderId,
      "user.id": userId,
    });

    try {
      const user = await validateAndFetchUser(userId);
      const items = await fetchOrderItems(orderId);
      const result = await chargePayment(user, items);

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end(); // always call end()
    }
  });
}

async function validateAndFetchUser(userId: string) {
  return tracer.startActiveSpan("validateAndFetchUser", async (span) => {
    span.setAttribute("user.id", userId);
    try {
      const cached = await redis.get(`user:${userId}`);
      if (cached) {
        span.addEvent("cache.hit");
        return JSON.parse(cached);
      }

      span.addEvent("cache.miss");
      const user = await db.query("SELECT * FROM users WHERE id = $1", [userId]);

      if (!user) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: "User not found" });
        throw new Error(`User ${userId} not found`);
      }

      await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
      return user;
    } finally {
      span.end();
    }
  });
}

Context Propagation

To stitch traces across microservices, you need context propagation — injecting traceparent into HTTP headers.

import { propagation, context } from "@opentelemetry/api";
import axios from "axios";

async function callInventoryService(itemId: string) {
  return tracer.startActiveSpan("callInventoryService", async (span) => {
    span.setAttribute("item.id", itemId);

    const headers: Record<string, string> = {};
    propagation.inject(context.active(), headers);
    // headers now contains 'traceparent' and 'tracestate'

    try {
      const response = await axios.get(
        `http://inventory-service/items/${itemId}`,
        { headers }
      );
      span.setAttribute("http.status_code", response.status);
      return response.data;
    } finally {
      span.end();
    }
  });
}

The receiving service, if auto-instrumented, extracts the context from headers automatically. This connects spans across service boundaries into a single trace.

Exporting to Jaeger

The fastest way to visualize traces is Jaeger. Spin it up with Docker in 30 seconds.

# docker-compose.yml
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  order-service:
    build: .
    ports:
      - "3000:3000"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
    depends_on:
      - jaeger

docker-compose up -d
# Open http://localhost:16686 for the Jaeger UI

Grafana Tempo + OTel Collector

In production, a common setup is OTel Collector as gateway, storing to Grafana Tempo.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  attributes:
    actions:
      - key: "db.statement"
        action: delete  # strip sensitive SQL

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

The Collector lets you swap backends, fan out to multiple backends, and filter sensitive data — all without touching app code.

A Real Debugging Example

Symptom: POST /api/checkout averages 1.8 seconds. SLA is 500ms.

Trace analysis:

POST /api/checkout (1823ms)
├─ parseAndValidateRequest (8ms)
├─ fetchUserProfile (12ms)
├─ fetchCartItems (15ms)
├─ validateInventory (1650ms)  ← !!!
│   ├─ item:1001 check (11ms)
│   ├─ item:1002 check (14ms)
│   ├─ item:1003 check (1580ms)  ← this item is the problem
│   │   ├─ cache lookup (2ms) - MISS
│   │   └─ db query (1578ms)   ← full table scan!
│   └─ item:1004 check (12ms)
└─ calculateTotal (3ms)

The DB query for item:1003 took 1578ms. Check the query:

-- Problematic query (no index on item_id)
SELECT * FROM inventory WHERE item_id = '1003' AND warehouse_id = 'WH-42';

-- Fix: add a composite index
CREATE INDEX idx_inventory_item_warehouse
  ON inventory(item_id, warehouse_id);

After adding the index: same query runs in 3ms. Total checkout time drops to 248ms.

Without traces, you'd know "the DB is slow." With traces, you know exactly which item's query is the culprit.

Best Practices

Use specific span names

// Bad
tracer.startActiveSpan("query", ...)

// Good
tracer.startActiveSpan("db.users.findById", ...)

Always record errors

catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error; // still rethrow
}

Watch out for sensitive data

// Use parameterized form, not actual values
span.setAttribute("db.statement", "SELECT * FROM users WHERE id = $1");

Configure sampling for high traffic

import { TraceIdRatioBased } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  sampler: new TraceIdRatioBased(0.1), // sample 10%
  // ...
});

Wrap-Up

Recommended rollout order:

Write tracing.ts + install auto-instrumentation packages
Run Jaeger locally, verify traces appear
Add manual spans to critical business logic
Introduce OTel Collector for production
Wire up Grafana Tempo + Grafana dashboards

The difference between grepping logs and looking at traces when investigating a slowdown is enormous. Set it up once and the investment pays back many times over.

#OpenTelemetry#Observability#Distributed Tracing#Monitoring

← Back to List

E·86PRACTICAL DEV2026.03.213 MIN READ

OpenTelemetry: Finding Bottlenecks with Distributed Tracing

OpenTelemetry: 분산 추적으로 병목 찾기

How do you find where requests slow down in a microservices system? Learn how OpenTelemetry ties logs, metrics, and traces together to pinpoint bottlenecks with real code examples.

codemapo

INTERDISCIPLINARY DEV · SEOUL

OpenTelemetry: Finding Bottlenecks with Distributed Tracing

OpenTelemetry (OTel) tackles this problem head-on. Once set up properly, you can see exactly which service, which function, and how many milliseconds each piece took.

The Three Pillars of Observability

Three types of data help you understand a system:

Type	Description	Example
Logs	Event records at a point in time	`"User 42 logged in at 14:32"`
Metrics	Aggregated numeric system state	`http_requests_total = 1024`
Traces	Full request flow with durations	A → B → C, each leg in ms

Logs tell you what happened but not why it's slow. Metrics tell you that it's slow but not where it's slow. Only traces reveal "the DB query in the A→B leg ate 800ms."

OpenTelemetry unifies all three under one SDK, one standard.

What Is OpenTelemetry?

A CNCF project born in 2019 from the merger of OpenCensus and OpenTracing. The core value is vendor-neutral instrumentation.

[ Your App ] → [ OTel SDK ] → [ OTel Collector ] → [ Jaeger / Tempo / Datadog / ... ]

Instrument once with OTel, and you can swap backends (Jaeger, Grafana Tempo, Honeycomb, Datadog) without touching app code. That's the real win.

Core Components

API: Interface definitions for instrumentation
SDK: Implementation of the API (data collection/processing)
Collector: Agent/gateway that receives, processes, and forwards data
OTLP: OpenTelemetry Protocol — the standard wire format

Understanding Spans and Traces

The basic unit of tracing is a Span.

Span = a record of "what work was done and how long it took"

Multiple Spans connected in parent-child relationships form a Trace.

Trace ID: abc-123
│
├─ Span: POST /api/orders (342ms total)
│   ├─ Span: validateRequest (3ms)
│   ├─ Span: getUserFromDB (45ms)
│   ├─ Span: checkInventory → InventoryService (210ms)  ← here's the slowdown
│   │   ├─ Span: Redis cache lookup (2ms) - MISS
│   │   └─ Span: PostgreSQL query (205ms)
│   └─ Span: createOrderRecord (84ms)

You can immediately see that checkInventory took 210ms, and inside it, a Redis cache miss led to a 205ms PostgreSQL query. This picture is impossible to construct from logs alone.

Span Attributes

A Span carries rich metadata:

{
  traceId: "abc123...",         // Identifies the whole trace
  spanId: "def456...",          // Identifies this span
  parentSpanId: "bcd789...",    // Parent span (null if root)
  name: "GET /api/users/:id",   // Operation name
  startTime: 1711001234567,     // Nanosecond timestamp
  endTime: 1711001234812,
  attributes: {
    "http.method": "GET",
    "http.url": "/api/users/42",
    "http.status_code": 200,
    "db.system": "postgresql",
    "db.statement": "SELECT ...",
  },
  events: [
    { name: "cache.miss", timestamp: ... }
  ],
  status: { code: "OK" }
}

Auto-Instrumentation for Node.js

One of OTel's strengths is auto-instrumentation. It instruments Express, HTTP, PostgreSQL, Redis, and more automatically — no code changes needed.

Install Packages

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Write tracing.ts

// src/tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";

const exporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: "order-service",
    [SEMRESATTRS_SERVICE_VERSION]: "1.0.0",
    "deployment.environment": process.env.NODE_ENV ?? "development",
  }),
  traceExporter: exporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-http": {
        ignoreIncomingRequestHook: (req) => req.url === "/health",
      },
      "@opentelemetry/instrumentation-express": { enabled: true },
      "@opentelemetry/instrumentation-pg": { enabled: true },
      "@opentelemetry/instrumentation-redis": { enabled: true },
    }),
  ],
});

sdk.start();
console.log("OpenTelemetry SDK started");

process.on("SIGTERM", () => {
  sdk.shutdown().then(() => process.exit(0));
});

Load It First

// src/index.ts — must come before any other imports
import "./tracing";

import express from "express";
import { orderRouter } from "./routes/orders";

const app = express();
app.use(express.json());
app.use("/api/orders", orderRouter);

app.listen(3000);

Auto-instrumentation alone captures Express routes, DB queries, and outbound HTTP calls as spans.

Manual Instrumentation: Tracing Business Logic

Auto-instrumentation covers infrastructure-level spans. To find bottlenecks inside your business logic, add spans manually.

// src/services/order.service.ts
import { trace, context, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service", "1.0.0");

export async function processOrder(orderId: string, userId: string) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttributes({
      "order.id": orderId,
      "user.id": userId,
    });

    try {
      const user = await validateAndFetchUser(userId);
      const items = await fetchOrderItems(orderId);
      const result = await chargePayment(user, items);

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end(); // always call end()
    }
  });
}

async function validateAndFetchUser(userId: string) {
  return tracer.startActiveSpan("validateAndFetchUser", async (span) => {
    span.setAttribute("user.id", userId);
    try {
      const cached = await redis.get(`user:${userId}`);
      if (cached) {
        span.addEvent("cache.hit");
        return JSON.parse(cached);
      }

      span.addEvent("cache.miss");
      const user = await db.query("SELECT * FROM users WHERE id = $1", [userId]);

      if (!user) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: "User not found" });
        throw new Error(`User ${userId} not found`);
      }

      await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
      return user;
    } finally {
      span.end();
    }
  });
}

Context Propagation

To stitch traces across microservices, you need context propagation — injecting traceparent into HTTP headers.

import { propagation, context } from "@opentelemetry/api";
import axios from "axios";

async function callInventoryService(itemId: string) {
  return tracer.startActiveSpan("callInventoryService", async (span) => {
    span.setAttribute("item.id", itemId);

    const headers: Record<string, string> = {};
    propagation.inject(context.active(), headers);
    // headers now contains 'traceparent' and 'tracestate'

    try {
      const response = await axios.get(
        `http://inventory-service/items/${itemId}`,
        { headers }
      );
      span.setAttribute("http.status_code", response.status);
      return response.data;
    } finally {
      span.end();
    }
  });
}

The receiving service, if auto-instrumented, extracts the context from headers automatically. This connects spans across service boundaries into a single trace.

Exporting to Jaeger

The fastest way to visualize traces is Jaeger. Spin it up with Docker in 30 seconds.

# docker-compose.yml
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  order-service:
    build: .
    ports:
      - "3000:3000"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
    depends_on:
      - jaeger

docker-compose up -d
# Open http://localhost:16686 for the Jaeger UI

Grafana Tempo + OTel Collector

In production, a common setup is OTel Collector as gateway, storing to Grafana Tempo.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  attributes:
    actions:
      - key: "db.statement"
        action: delete  # strip sensitive SQL

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

The Collector lets you swap backends, fan out to multiple backends, and filter sensitive data — all without touching app code.

A Real Debugging Example

Symptom: POST /api/checkout averages 1.8 seconds. SLA is 500ms.

Trace analysis:

POST /api/checkout (1823ms)
├─ parseAndValidateRequest (8ms)
├─ fetchUserProfile (12ms)
├─ fetchCartItems (15ms)
├─ validateInventory (1650ms)  ← !!!
│   ├─ item:1001 check (11ms)
│   ├─ item:1002 check (14ms)
│   ├─ item:1003 check (1580ms)  ← this item is the problem
│   │   ├─ cache lookup (2ms) - MISS
│   │   └─ db query (1578ms)   ← full table scan!
│   └─ item:1004 check (12ms)
└─ calculateTotal (3ms)

The DB query for item:1003 took 1578ms. Check the query:

-- Problematic query (no index on item_id)
SELECT * FROM inventory WHERE item_id = '1003' AND warehouse_id = 'WH-42';

-- Fix: add a composite index
CREATE INDEX idx_inventory_item_warehouse
  ON inventory(item_id, warehouse_id);

After adding the index: same query runs in 3ms. Total checkout time drops to 248ms.

Without traces, you'd know "the DB is slow." With traces, you know exactly which item's query is the culprit.

Best Practices

Use specific span names

// Bad
tracer.startActiveSpan("query", ...)

// Good
tracer.startActiveSpan("db.users.findById", ...)

Always record errors

catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error; // still rethrow
}

Watch out for sensitive data

// Use parameterized form, not actual values
span.setAttribute("db.statement", "SELECT * FROM users WHERE id = $1");

Configure sampling for high traffic

import { TraceIdRatioBased } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  sampler: new TraceIdRatioBased(0.1), // sample 10%
  // ...
});

Wrap-Up

Recommended rollout order:

Write tracing.ts + install auto-instrumentation packages
Run Jaeger locally, verify traces appear
Add manual spans to critical business logic
Introduce OTel Collector for production
Wire up Grafana Tempo + Grafana dashboards

The difference between grepping logs and looking at traces when investigating a slowdown is enormous. Set it up once and the investment pays back many times over.

#OpenTelemetry#Observability#Distributed Tracing#Monitoring

← Back to List