Observability in Production: Metrics, Traces, and Logs That Actually Matter
Production systems fail. That’s not pessimism—it’s reality. The question isn’t whether your cloud-native applications will encounter issues, but whether you’ll be able to diagnose and fix them before they impact users. This is where observability becomes critical, moving beyond simple monitoring to provide deep insights into system behavior.
Observability in software systems is fundamentally about understanding internal state through external outputs. Unlike traditional monitoring that tells you what happened, observability helps you understand why it happened. For teams running microservices on Kubernetes, this distinction can mean the difference between a five-minute fix and a three-hour war room session.
The Three Pillars: More Than Marketing Buzzwords
The industry has standardized on three pillars of observability: metrics, logs, and traces. But as the Kubernetes documentation notes, these aren’t just categories—they’re complementary data sources that together provide a complete picture of system health.
Metrics give you the quantitative data: response times, error rates, resource utilization. They’re your system’s vital signs.
Logs provide the qualitative context: what happened, when it happened, and often why it happened. They’re your system’s diary.
Traces show you the journey: how a request flows through your distributed system, where it slows down, and where it fails. They’re your system’s GPS.
Here’s what makes this powerful: each pillar compensates for the others’ weaknesses. Metrics are efficient but lack context. Logs provide context but can be overwhelming. Traces show relationships but generate massive data volumes.
Metrics That Actually Drive Decisions
Most teams collect too many metrics and act on too few. The key is identifying metrics that directly correlate with user experience and business outcomes.
The Four Golden Signals
Start with Google’s Four Golden Signals, adapted for your specific context:
- Latency: How long requests take to complete
- Traffic: How many requests you’re handling
- Errors: How many requests are failing
- Saturation: How close to capacity your resources are
For a Kubernetes-based e-commerce API, this might look like:
# Prometheus recording rules example
groups:
- name: ecommerce_sli
  rules:
  - record: http_request_duration_seconds:rate5m
    expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
  - record: http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (service, method)
  - record: http_requests_errors:rate5m
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  - record: pod_cpu_utilization
    # quota is per scheduling period, so normalize by the period to get a percentage
    expr: rate(container_cpu_usage_seconds_total[5m]) / (container_spec_cpu_quota / container_spec_cpu_period) * 100
Business-Specific Metrics
Beyond infrastructure metrics, track what matters to your business:
- Conversion funnel metrics: Cart additions, checkout completions, payment successes
- Feature adoption: New feature usage rates, user engagement depth
- Revenue impact: Transaction volumes, average order values, failed payment rates
The goal is to create a direct line from technical metrics to business impact. When CPU utilization spikes, you should immediately know whether it’s affecting checkout completion rates.
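One way to make that line concrete is to count business events with the same labeling discipline you use for infrastructure metrics. A minimal in-process sketch (a real service would use a metrics client such as prom-client; `checkout_total` and its labels are hypothetical names):

```javascript
// Sketch: a tiny labeled counter that renders Prometheus text exposition
// format, standing in for a real metrics client.
class EventCounter {
  constructor(name) {
    this.name = name;
    this.series = new Map(); // label-set JSON -> count
  }
  inc(labels) {
    const key = JSON.stringify(labels);
    this.series.set(key, (this.series.get(key) || 0) + 1);
  }
  // e.g. checkout_total{status="failed",payment_method="credit_card"} 1
  expose() {
    return [...this.series.entries()]
      .map(([key, value]) => {
        const labels = Object.entries(JSON.parse(key))
          .map(([k, v]) => `${k}="${v}"`)
          .join(',');
        return `${this.name}{${labels}} ${value}`;
      })
      .join('\n');
  }
}

const checkouts = new EventCounter('checkout_total');
checkouts.inc({ status: 'completed', payment_method: 'credit_card' });
checkouts.inc({ status: 'failed', payment_method: 'credit_card' });
```

Because the business counter carries the same `service`-style labels as your infrastructure metrics, a CPU spike and a dip in `checkout_total{status="completed"}` can be plotted side by side.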
Structured Logging: Your Debug Lifeline
According to Stack Overflow’s analysis, developer productivity increases significantly when engineers can jump directly to the root cause instead of hunting across multiple systems. Structured logging is fundamental to this capability.
JSON All the Things
Structured logs in JSON format enable powerful querying and correlation:
// Bad: Unstructured logging
console.log("User john_doe failed to complete checkout for order 12345");
// Good: Structured logging
logger.info({
  event: "checkout_failed",
  user_id: "john_doe",
  order_id: "12345",
  error_code: "PAYMENT_DECLINED",
  payment_method: "credit_card",
  cart_value: 89.99,
  session_id: "sess_abc123",
  trace_id: "trace_xyz789"
});
Context Propagation
Every log entry should include enough context to understand the request flow:
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    // printf produces the final JSON output, so a separate json() formatter
    // is unnecessary (the last formatter in the chain wins)
    winston.format.printf(({ timestamp, level, message, ...meta }) => {
      return JSON.stringify({
        timestamp,
        level,
        message,
        service: process.env.SERVICE_NAME,
        version: process.env.SERVICE_VERSION,
        trace_id: meta.trace_id,
        span_id: meta.span_id,
        user_id: meta.user_id,
        ...meta
      });
    })
  ),
  transports: [new winston.transports.Console()]
});
Log Levels That Make Sense
Use log levels strategically:
- ERROR: Something broke that requires immediate attention
- WARN: Something unexpected happened but the system recovered
- INFO: Normal business operations (user actions, external API calls)
- DEBUG: Detailed information for troubleshooting (disabled in production)
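As a sketch of how a level threshold keeps DEBUG out of production output (the `LOG_LEVEL` variable and the tiny logger interface are assumptions for illustration, not a specific library's API):

```javascript
// Sketch: level-aware logger where messages below the threshold are dropped
// before any serialization cost is paid.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

function createLogger(minLevel) {
  const threshold = LEVELS[minLevel];
  const log = (level) => (fields) =>
    LEVELS[level] <= threshold
      ? JSON.stringify({ level, ...fields })
      : null; // below threshold: dropped
  return {
    error: log('error'),
    warn: log('warn'),
    info: log('info'),
    debug: log('debug')
  };
}

// With LOG_LEVEL=info in production, debug() is a no-op while error() always logs.
const logger = createLogger(process.env.LOG_LEVEL || 'info');
```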
Distributed Tracing: Following the Breadcrumbs
In microservices architectures, a single user request might touch dozens of services. Distributed tracing connects these interactions, showing you exactly where requests slow down or fail.
OpenTelemetry Implementation
OpenTelemetry has become the standard for instrumentation. Here’s how to add tracing to a Node.js service:
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': {
      requestHook: (span, request) => {
        span.setAttributes({
          'user.id': request.headers['x-user-id'],
          'request.size': request.headers['content-length']
        });
      }
    }
  })]
});

sdk.start();
Custom Spans for Business Logic
Auto-instrumentation covers HTTP calls and database queries, but add custom spans for critical business operations:
const opentelemetry = require('@opentelemetry/api');

async function processPayment(orderId, paymentDetails) {
  const tracer = opentelemetry.trace.getTracer('payment-service');

  return tracer.startActiveSpan('process_payment', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'payment.method': paymentDetails.method,
      'payment.amount': paymentDetails.amount
    });

    try {
      const result = await paymentGateway.charge(paymentDetails);
      span.setAttributes({
        'payment.status': result.status,
        'payment.transaction_id': result.transactionId
      });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
Kubernetes-Specific Observability
Kubernetes observability requires understanding both application behavior and cluster health. The platform’s dynamic nature—pods starting, stopping, and moving—adds complexity that traditional monitoring approaches can’t handle.
Pod and Node Metrics
Monitor resource utilization and availability:
# Prometheus scrape config for Kubernetes metrics
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
Service Mesh Integration
If you’re using a service mesh like Istio, leverage its built-in observability:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metrics
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:          # overrides is a sibling of providers, not a separate list item
    - match:
        metric: ALL_METRICS
      tagOverrides:
        request_id:
          operation: UPSERT
          value: "%{REQUEST_ID}"
Application Performance in Kubernetes Context
Traditional APM tools often miss Kubernetes-specific context. Ensure your observability stack correlates application performance with:
- Pod restarts and scheduling events
- Resource limits and requests
- Network policies and service mesh configuration
- ConfigMap and Secret changes
Building Effective Dashboards
Dashboards should tell a story, not just display data. Structure them around user journeys and system flows.
The Inverted Pyramid Approach
Start with high-level business metrics, then drill down:
- Business KPIs: Revenue, conversion rates, user satisfaction
- Service-level indicators: Request rates, error rates, latencies
- Infrastructure metrics: CPU, memory, network, storage
- Detailed diagnostics: Individual service performance, database queries
Alert Fatigue is Real
As Splunk notes, the goal is reducing time to resolution, not increasing alert volume. Design alerts that require action:
# Good: Actionable alert
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 2m
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}"
    runbook_url: "https://wiki.company.com/runbooks/high-error-rate"

# Bad: Noisy alert
- alert: HighCPU
  expr: cpu_usage > 80
  for: 30s
Observability-Driven Development
Wikipedia describes observability-driven development as shipping features with custom instrumentation. This means thinking about observability during development, not after deployment.
Instrumentation as Code
Include observability requirements in your definition of done:
- Every new API endpoint includes metrics, logging, and tracing
- Business logic includes custom spans for critical operations
- Error handling includes structured error logging
- Feature flags include adoption and performance metrics
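One way to make that definition of done cheap to meet is a wrapper that gives every new handler metrics, timing, and structured error logging by default. A sketch with illustrative names (`metrics.record` and the logger interface are assumptions, not a specific framework's API):

```javascript
// Sketch: wrap an endpoint handler so instrumentation is the default,
// not an afterthought. metrics and logger are injected dependencies.
function instrumented(name, handler, { metrics, logger }) {
  return async (req) => {
    const start = process.hrtime.bigint();
    try {
      const result = await handler(req);
      metrics.record(name, 'ok');          // success counter
      return result;
    } catch (err) {
      metrics.record(name, 'error');       // error counter
      logger.error({ event: 'handler_failed', handler: name, error: err.message });
      throw err;
    } finally {
      // Latency is recorded whether the handler succeeded or failed.
      const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
      logger.info({ event: 'handler_completed', handler: name, duration_ms: durationMs });
    }
  };
}
```

Registering a route then becomes `instrumented('checkout', checkoutHandler, deps)`, and a handler with no metrics or logs simply cannot be shipped.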
Testing Observability
Test your observability instrumentation like any other code:
describe('Payment Processing', () => {
  it('should create trace spans for payment operations', async () => {
    const mockTracer = new MockTracer();

    await processPayment('order-123', paymentDetails);

    const spans = mockTracer.report().spans;
    expect(spans).toHaveLength(3); // payment validation, gateway call, result processing
    expect(spans[0].operationName).toBe('validate_payment');
    expect(spans[1].operationName).toBe('gateway_charge');
    expect(spans[2].operationName).toBe('process_result');
  });
});
The Economics of Observability
Observability isn’t free. Data ingestion, storage, and processing costs can quickly spiral out of control. Budget for roughly 5-15% of your infrastructure costs for observability tooling.
Sampling Strategies
For high-traffic services, implement intelligent sampling:
const { SamplingDecision } = require('@opentelemetry/sdk-trace-base');

const sampler = {
  shouldSample: (context, traceId, spanName, spanKind, attributes) => {
    // Always sample errors
    if (attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLE };
    }
    // Deterministically sample ~1% of successful requests: hash the hex
    // trace ID into a number instead of indexing into the string
    if (parseInt(traceId.slice(-4), 16) % 100 === 0) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLE };
    }
    return { decision: SamplingDecision.NOT_RECORD };
  }
};
Data Retention Policies
Implement tiered retention:
- High-resolution metrics: 7 days
- Medium-resolution metrics: 30 days
- Low-resolution metrics: 1 year
- Traces: 7 days (with longer retention for errors)
- Logs: 30 days (with longer retention for errors and security events)
Making Observability Actionable
The best observability setup is useless if it doesn’t drive better outcomes. Microsoft’s research on AI systems emphasizes that observability is foundational for operational control in production systems.
Runbooks and Automation
Every alert should link to a runbook that explains:
- What the alert means
- How to investigate the issue
- Common causes and solutions
- When to escalate
Better yet, automate responses where possible:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: auto-remediation
spec:
  entrypoint: high-error-rate-response
  templates:
  - name: high-error-rate-response
    steps:
    - - name: scale-up
        template: scale-deployment
        arguments:
          parameters:
          - name: deployment
            value: "{{workflow.parameters.deployment}}"
          - name: replicas
            # arithmetic needs an Argo expression tag ({{= }}), not a plain template tag
            value: "{{=asInt(workflow.parameters['current-replicas']) * 2}}"
Continuous Improvement
Use observability data to drive architectural decisions:
- Identify services that would benefit from caching
- Spot database queries that need optimization
- Find microservices boundaries that create unnecessary latency
- Discover features that aren’t being used and can be deprecated
The Path Forward
Observability isn’t a destination—it’s a capability that evolves with your systems. Start with the basics: structured logging, key metrics, and distributed tracing for critical paths. Build dashboards that tell stories, not just display data. Create alerts that drive action, not noise.
Most importantly, make observability a team responsibility, not just an operations concern. When developers can quickly understand how their code behaves in production, everyone wins: faster debugging, fewer outages, and systems that actually scale with confidence.
The complexity of cloud-native systems isn’t going away. But with proper observability, that complexity becomes manageable, debuggable, and ultimately, a competitive advantage.