Telemetry
Metrics, traces, and monitoring - understanding what your application is doing in production.
Beyond Logging#
Logs tell you what happened. Telemetry tells you:
- How often things happen (metrics)
- How long things take (traces)
- What's connected to what (distributed tracing)
Together, they give you complete visibility into your application.
The Three Pillars of Observability#
1. Logs#
What happened, in detail.
"User 123 logged in at 14:23:45"
"Order 456 failed: payment declined"
2. Metrics#
Numbers over time.
requests_total: 1,234,567
response_time_p99: 245ms
error_rate: 0.1%
active_users: 523
3. Traces#
Request flow across services.
Request → API Gateway (2ms) → Auth Service (15ms) → User Service (8ms) → Database (25ms)
Simple Metrics with prom-client#
npm install prom-client
// src/utils/metrics.js
import client from 'prom-client';
// Collect default metrics (CPU, memory, etc.)
client.collectDefaultMetrics();
// Custom metrics
export const httpRequestsTotal = new client.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'path', 'status'],
});
export const httpRequestDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'path', 'status'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});
export const activeConnections = new client.Gauge({
name: 'active_connections',
help: 'Number of active connections',
});
// Export the registry for the /metrics endpoint
export { client };
Metrics Types#
Counter - Only goes up
const loginAttempts = new client.Counter({
name: 'login_attempts_total',
help: 'Total login attempts',
labelNames: ['success'],
});
loginAttempts.inc({ success: 'true' });
loginAttempts.inc({ success: 'false' });
Gauge - Can go up or down
const activeUsers = new client.Gauge({
name: 'active_users',
help: 'Currently active users',
});
activeUsers.inc(); // User connected
activeUsers.dec(); // User disconnected
activeUsers.set(42); // Set directly
Histogram - Distribution of values
const responseTimes = new client.Histogram({
name: 'response_time_seconds',
help: 'Response time distribution',
buckets: [0.1, 0.5, 1, 2, 5],
});
responseTimes.observe(0.234); // Record a value
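Under the hood, a Prometheus histogram keeps only cumulative bucket counters plus a running sum and count — individual observations are not stored. A minimal plain-JavaScript sketch of what `observe()` actually records:

```javascript
// Sketch of what a Prometheus histogram records: each bucket counts
// observations less than or equal to its bound (le), so bucket
// counters are cumulative.
const buckets = [0.1, 0.5, 1, 2, 5];
const counts = new Map(buckets.map((b) => [b, 0]));
let sum = 0;
let count = 0;

function observe(value) {
  sum += value;
  count += 1;
  for (const b of buckets) {
    if (value <= b) counts.set(b, counts.get(b) + 1);
  }
}

observe(0.234);
observe(0.8);
observe(3.1);

console.log(counts.get(0.5)); // 1  (only 0.234 is <= 0.5)
console.log(counts.get(5));   // 3  (all observations are <= 5)
console.log(count);           // 3
```

This is why bucket boundaries must be chosen up front: values are collapsed into counters at observation time, and quantiles are later estimated from those counters (e.g. with PromQL's `histogram_quantile`).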
Metrics Middleware#
// src/middleware/metrics.js
import { httpRequestsTotal, httpRequestDuration } from '../utils/metrics.js';
export function metricsMiddleware(req, res, next) {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const labels = {
method: req.method,
path: req.route?.path || req.path,
status: res.statusCode,
};
httpRequestsTotal.inc(labels);
httpRequestDuration.observe(labels, duration);
});
next();
}
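One caveat with the `path` label: when `req.route` is unset, the middleware falls back to the raw URL, and raw URLs containing user IDs or UUIDs create a new time series per distinct value — enough to overwhelm Prometheus. A sketch of normalizing paths before using them as label values (the patterns are illustrative, not exhaustive):

```javascript
// High-cardinality label values bloat Prometheus. Collapse IDs in raw
// paths into placeholders so each route produces one time series.
// Note: UUIDs must be replaced before bare numbers, or the numeric
// pattern would eat the leading digits of a UUID segment.
function normalizePath(path) {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/:uuid')
    .replace(/\/\d+/g, '/:id');
}

console.log(normalizePath('/api/users/123/orders/456'));
// /api/users/:id/orders/:id
```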
Expose Metrics Endpoint#
// src/app.js
import { client } from './utils/metrics.js';
import { metricsMiddleware } from './middleware/metrics.js';
app.use(metricsMiddleware);
// Prometheus scrapes this endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', client.register.contentType);
res.send(await client.register.metrics());
});
Output:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 1234
http_requests_total{method="POST",path="/api/orders",status="201"} 567
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/api/users",status="200",le="0.1"} 1000
http_request_duration_seconds_bucket{method="GET",path="/api/users",status="200",le="0.5"} 1200
Business Metrics#
Track what matters to your business:
// src/utils/metrics.js
export const ordersCreated = new client.Counter({
name: 'orders_created_total',
help: 'Total orders created',
labelNames: ['status'],
});
export const orderValue = new client.Histogram({
name: 'order_value_dollars',
help: 'Order value distribution',
buckets: [10, 50, 100, 250, 500, 1000],
});
export const paymentFailures = new client.Counter({
name: 'payment_failures_total',
help: 'Payment failures',
labelNames: ['reason'],
});
Use in services:
// src/services/orders.js
import { ordersCreated, orderValue } from '../utils/metrics.js';
export async function createOrder(userId, items) {
const order = await Order.create({ ... });
ordersCreated.inc({ status: 'success' });
orderValue.observe(order.total);
return order;
}
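The `paymentFailures` counter takes a `reason` label, and label values should stay bounded — Prometheus creates one time series per distinct value. A sketch of collapsing arbitrary error codes into a fixed set before incrementing (the reason names here are illustrative):

```javascript
// Keep the 'reason' label bounded: unknown error codes collapse to
// 'other' instead of creating a new time series per code.
const KNOWN_REASONS = new Set(['card_declined', 'insufficient_funds', 'expired_card']);

function failureReason(error) {
  return KNOWN_REASONS.has(error.code) ? error.code : 'other';
}

// e.g. paymentFailures.inc({ reason: failureReason(error) });
console.log(failureReason({ code: 'card_declined' })); // card_declined
console.log(failureReason({ code: 'ECONNRESET' }));    // other
```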
Health Checks#
Beyond simple "200 OK":
// src/routes/health.js
import mongoose from 'mongoose';
import { redis } from '../config/redis.js';
export async function healthCheck(req, res) {
const health = {
status: 'ok',
timestamp: new Date().toISOString(),
uptime: process.uptime(),
checks: {},
};
// Check MongoDB
try {
await mongoose.connection.db.admin().ping();
health.checks.mongodb = { status: 'ok' };
} catch (error) {
health.checks.mongodb = { status: 'error', message: error.message };
health.status = 'degraded';
}
// Check Redis
try {
await redis.ping();
health.checks.redis = { status: 'ok' };
} catch (error) {
health.checks.redis = { status: 'error', message: error.message };
health.status = 'degraded';
}
// Memory check
const memUsage = process.memoryUsage();
health.checks.memory = {
heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + 'MB',
heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + 'MB',
};
const statusCode = health.status === 'ok' ? 200 : 503;
res.status(statusCode).json(health);
}
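One gotcha: a hung dependency makes the health endpoint itself hang, which an orchestrator reads as a dead instance. Bounding each ping with a timeout is a common guard; a sketch using `Promise.race`:

```javascript
// Reject if a health-check ping takes longer than `ms` milliseconds,
// so one hung dependency can't stall the whole /health response.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it doesn't keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// usage: await withTimeout(mongoose.connection.db.admin().ping(), 1000)
```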
Kubernetes Probes#
// Liveness - is the app running?
app.get('/health/live', (req, res) => {
res.json({ status: 'ok' });
});
// Readiness - is the app ready to serve traffic?
app.get('/health/ready', async (req, res) => {
try {
await mongoose.connection.db.admin().ping();
await redis.ping();
res.json({ status: 'ok' });
} catch {
res.status(503).json({ status: 'not ready' });
}
});
Distributed Tracing with OpenTelemetry#
For microservices, trace requests across services:
npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
// src/tracing.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
serviceName: 'user-api',
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_URL || 'http://localhost:4318/v1/traces',
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
Load it first:
// src/index.js
import './tracing.js'; // Must be first!
import { app } from './app.js';
// ...
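Traces span services because context travels with the request: OpenTelemetry's HTTP instrumentation injects a W3C `traceparent` header, and the next service continues the same trace instead of starting a new one. The SDK handles this for you; a purely illustrative sketch of what that header contains:

```javascript
// W3C traceparent format: version-traceid-spanid-flags
// e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
function parseTraceparent(header) {
  const m = /^(\d{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // Bit 0 of the flags byte marks the trace as sampled.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(ctx.sampled); // true
```

Every service in the chain logs spans under the same trace ID, which is what lets the tracing backend stitch the `API Gateway → Auth Service → User Service → Database` picture back together.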
Simple APM Without Infrastructure#
If you don't want to run Prometheus/Grafana:
Logging-Based Metrics#
// Log metrics periodically
import { logger } from './utils/logger.js';
setInterval(() => {
const memUsage = process.memoryUsage();
logger.info({
type: 'metrics',
memory: {
heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024),
heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024),
},
uptime: process.uptime(),
}, 'System metrics');
}, 60000); // Every minute
Request Timing Logs#
// Already have this from request logging middleware
// Just ensure you log duration
req.log.info({
duration: Date.now() - start,
statusCode: res.statusCode,
}, 'Request completed');
Simple SaaS Options#
No infrastructure to manage:
- Datadog - Full APM suite
- New Relic - Metrics, traces, logs
- Sentry - Error tracking + performance
- Better Stack (Logtail) - Logs + uptime
What to Monitor#
System Metrics#
- CPU usage
- Memory usage
- Disk space
- Network I/O
Application Metrics#
- Request rate (requests/second)
- Error rate (errors/requests)
- Response time (p50, p95, p99)
- Active connections
Business Metrics#
- User signups
- Orders created
- Revenue
- Feature usage
Dependencies#
- Database response time
- Cache hit rate
- External API latency
- Queue depth
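The percentiles above are worth understanding concretely: p99 is the value below which 99% of samples fall, which is why averages hide tail latency. A sketch using the nearest-rank method on raw samples:

```javascript
// Nearest-rank percentile: sort samples, take the value at the
// ceil(p% * n)-th position. Averages hide tail latency — this sample
// set averages ~212ms, but one request in ten takes nearly a second.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const durationsMs = [120, 95, 240, 110, 105, 980, 130, 101, 115, 125];
console.log(percentile(durationsMs, 50)); // 115
console.log(percentile(durationsMs, 99)); // 980
```

In production you rarely compute this by hand — Prometheus estimates it from histogram buckets — but it explains why dashboards track p50, p95, and p99 side by side rather than a single mean.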
Key Takeaways#
- Logs, metrics, traces - Different tools for different insights
- Use Prometheus metrics - Standard format, works everywhere
- Track business metrics - Not just technical health
- Health checks matter - Kubernetes needs them
- Start simple - Logging + basic metrics, add tracing later
The Minimum#
At minimum:
- Request logging with duration
- Health check endpoint
- Error tracking (even just logs)
Add Prometheus metrics when you need dashboards. Add tracing when you have multiple services.