Introduction
As organizations adopt microservice architectures, the complexity of their systems increases dramatically. Instead of monitoring a single monolithic application, teams now need to track the health and performance of dozens or even hundreds of distributed services. This shift has made observability not just nice-to-have but essential for operating reliable systems.
Observability refers to the ability to understand the internal state of a system based on its external outputs. In the context of microservices, this means having visibility into what's happening within and between services, being able to identify issues quickly, and understanding the impact of changes or failures.
Over the past year, I've worked extensively on improving observability in Go-based microservice architectures. In this article, I'll share practical approaches for implementing the three pillars of observability—structured logging, metrics, and distributed tracing—in Go services, along with strategies for creating effective dashboards and alerts.
The Three Pillars of Observability
Observability is typically implemented through three complementary approaches:
- Structured Logging: Detailed records of discrete events that occur within a service
- Metrics: Aggregated numerical measurements of system behavior over time
- Distributed Tracing: End-to-end tracking of requests as they travel through multiple services
Each approach has its strengths and weaknesses, and together they provide a comprehensive view of your system.
Structured Logging in Go
Traditional logging often consists of simple text messages that are difficult to parse and analyze at scale. Structured logging addresses this by representing log entries as structured data (typically JSON) with a consistent schema.
Choosing a Logging Library
Several excellent structured logging libraries are available for Go:
- Zerolog: Focuses on zero-allocation JSON logging for high performance
- Zap: Offers both a high-performance core and a more user-friendly sugared logger
- Logrus: One of the most widely-used structured logging libraries for Go
For new projects, I recommend either Zerolog or Zap for their performance characteristics. Here's how to set up Zerolog:
import ( "os" "github.com/rs/zerolog" "github.com/rs/zerolog/log" )
func initLogger() { // Set global log level zerolog.SetGlobalLevel(zerolog.InfoLevel)
// Enable development mode in non-production environments
if os.Getenv("ENVIRONMENT") != "production" {
log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stdout})
}
}
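Once the logger is configured, each call emits a single JSON object per line that log aggregators can index and query by field. A minimal sketch of what that looks like (the exact field order and time format depend on your zerolog configuration):

log.Info().
    Str("order_id", "ord_123").
    Int("items", 3).
    Msg("Processing order")

// Example output in JSON mode (illustrative):
// {"level":"info","order_id":"ord_123","items":3,"time":"2024-05-01T10:30:00Z","message":"Processing order"}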
Contextual Logging
The real power of structured logging comes from adding context to your log entries:
func processOrder(ctx context.Context, order *Order) error {
    logger := log.With().
        Str("order_id", order.ID).
        Str("user_id", order.UserID).
        Float64("amount", order.TotalAmount).
        Logger()

    logger.Info().Msg("Processing order")

    // Business logic...

    if err := validatePayment(order); err != nil {
        logger.Error().Err(err).Msg("Payment validation failed")
        return err
    }

    logger.Info().Msg("Order processed successfully")
    return nil
}
Request-Scoped Logging
In HTTP services, it's valuable to include request-specific information in all logs:
func loggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Generate a request ID if not present
        requestID := r.Header.Get("X-Request-ID")
        if requestID == "" {
            requestID = uuid.New().String()
        }

        // Create a request-scoped logger
        logger := log.With().
            Str("request_id", requestID).
            Str("method", r.Method).
            Str("path", r.URL.Path).
            Str("remote_addr", r.RemoteAddr).
            Logger()

        // Store the logger in the request context
        ctx := logger.WithContext(r.Context())

        // Call the next handler with the updated context
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}
// In your handlers, retrieve the logger from the context
func handleGetUser(w http.ResponseWriter, r *http.Request) {
    logger := log.Ctx(r.Context())

    userID := chi.URLParam(r, "id")
    logger.Info().Str("user_id", userID).Msg("Getting user")

    // Handler logic...
}
Standard Log Fields
Consistency is crucial for structured logging. Define standard fields to be used across all services:
const (
    // Standard field names
    FieldRequestID   = "request_id"
    FieldServiceName = "service"
    FieldEnvironment = "environment"
    FieldUserID      = "user_id"
    FieldTraceID     = "trace_id"
    FieldSpanID      = "span_id"
    FieldStatusCode  = "status_code"
    FieldError       = "error"
    FieldDuration    = "duration_ms"
    FieldMessage     = "message"
)

// Initialize the global logger with service information
func initServiceLogger(serviceName, environment string) {
    log.Logger = log.With().
        Str(FieldServiceName, serviceName).
        Str(FieldEnvironment, environment).
        Logger()
}
Logging Sensitive Information
Be cautious about logging sensitive information like passwords, tokens, or personally identifiable information (PII):
type User struct {
    ID        string `json:"id"`
    Email     string `json:"email"`
    Password  string `json:"-"` // Tagged to exclude from JSON
    AuthToken string `json:"-"` // Tagged to exclude from JSON
}
// Safe logging: implement zerolog.LogObjectMarshaler so only safe fields are emitted
func (u *User) MarshalZerologObject(e *zerolog.Event) {
    e.Str("id", u.ID).
        Str("email", maskEmail(u.Email)) // Use helper to mask email
}
func maskEmail(email string) string {
    parts := strings.Split(email, "@")
    if len(parts) != 2 {
        return "invalid-email"
    }

    username := parts[0]
    domain := parts[1]

    switch {
    case len(username) == 0:
        return "***@" + domain
    case len(username) <= 2:
        return username[0:1] + "***@" + domain
    default:
        return username[0:2] + "***@" + domain
    }
}
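With a marshaler like the one above, the struct can be attached to log entries without leaking raw PII. A brief usage sketch (the auditLogin function is illustrative; Object is part of the zerolog event API):

func auditLogin(u *User) {
    // Object invokes MarshalZerologObject, so only the masked fields are written.
    log.Info().Object("user", u).Msg("User logged in")
}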
Metrics with Prometheus
Metrics provide aggregated numerical data about your system's behavior over time. They're excellent for dashboards, alerting, and understanding trends.
Setting Up Prometheus in Go
The official Prometheus client library for Go makes it easy to instrument your code:
import ( "net/http" "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promauto" "github.com/prometheus/client_golang/prometheus/promhttp" )
// Define metrics var ( httpRequestsTotal = promauto.NewCounterVec( prometheus.CounterOpts{ Name: "http_requests_total", Help: "Total number of HTTP requests", }, []string{"method", "endpoint", "status"}, )
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
activeRequests = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "http_active_requests",
Help: "Number of active HTTP requests",
},
)
databaseConnectionsOpen = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "database_connections_open",
Help: "Number of open database connections",
},
)
)
// Setup the metrics endpoint func setupMetrics() { http.Handle("/metrics", promhttp.Handler()) }
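Note that setupMetrics only registers the handler on the default mux; the endpoint still has to be served. A common pattern is to expose metrics on a dedicated port, separate from application traffic (a sketch; the port number is arbitrary):

func main() {
    setupMetrics()

    // Serve /metrics on its own port so it isn't exposed on the public listener.
    go func() {
        if err := http.ListenAndServe(":9091", nil); err != nil {
            log.Fatal().Err(err).Msg("Metrics server failed")
        }
    }()

    // Start the application server...
}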
Instrumenting HTTP Handlers
Create middleware to collect metrics for all HTTP requests:
func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        endpoint := r.URL.Path

        // Increment active requests
        activeRequests.Inc()
        defer activeRequests.Dec()

        // Track request duration
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, endpoint))
        defer timer.ObserveDuration()

        // Use a response writer wrapper to capture the status code
        wrapper := newResponseWriter(w)

        // Call the next handler
        next.ServeHTTP(wrapper, r)

        // Record request completion
        httpRequestsTotal.WithLabelValues(
            r.Method,
            endpoint,
            fmt.Sprintf("%d", wrapper.statusCode),
        ).Inc()
    })
}

// ResponseWriter wrapper to capture status code
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func newResponseWriter(w http.ResponseWriter) *responseWriter {
    return &responseWriter{w, http.StatusOK}
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}
Custom Business Metrics
Beyond basic infrastructure metrics, define custom metrics for important business operations:
var (
    ordersProcessed = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "orders_processed_total",
            Help: "Total number of processed orders",
        },
        []string{"status"},
    )

    orderValueSum = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "order_value_total",
            Help: "Total value of processed orders",
        },
        []string{"status"},
    )

    paymentProcessingDuration = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "payment_processing_duration_seconds",
            Help:    "Payment processing duration in seconds",
            Buckets: prometheus.LinearBuckets(0.1, 0.1, 10), // 0.1s to 1.0s
        },
    )
)

func processOrder(order *Order) error {
    timer := prometheus.NewTimer(paymentProcessingDuration)
    defer timer.ObserveDuration()

    err := processPayment(order)

    status := "success"
    if err != nil {
        status = "failure"
    }

    ordersProcessed.WithLabelValues(status).Inc()
    orderValueSum.WithLabelValues(status).Add(order.TotalAmount)

    return err
}
Database Metrics
Track database performance to identify bottlenecks:
import ( "database/sql" "github.com/prometheus/client_golang/prometheus" "github.com/jmoiron/sqlx" )
func instrumentDB(db *sql.DB) { // Report database stats periodically go func() { for { stats := db.Stats()
databaseConnectionsOpen.Set(float64(stats.OpenConnections))
// Add more metrics for other stats as needed
// - stats.InUse
// - stats.Idle
// - stats.WaitCount
// - stats.WaitDuration
// - stats.MaxIdleClosed
// - stats.MaxLifetimeClosed
time.Sleep(10 * time.Second)
}
}()
}
Distributed Tracing with OpenTelemetry
Distributed tracing tracks requests as they flow through multiple services, providing crucial context for debugging and understanding system behavior.
Setting Up OpenTelemetry
OpenTelemetry is the emerging standard for distributed tracing. It supports multiple backends including Jaeger, Zipkin, and cloud-native solutions:
import ( "context" "log" "os"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/exporters/jaeger"
"go.opentelemetry.io/otel/sdk/resource"
"go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
func initTracer(serviceName string) (*trace.TracerProvider, error) { // Create Jaeger exporter exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(os.Getenv("JAEGER_ENDPOINT")))) if err != nil { return nil, err }
// Create trace provider with the exporter
tp := trace.NewTracerProvider(
trace.WithBatcher(exp),
trace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String(serviceName),
attribute.String("environment", os.Getenv("ENVIRONMENT")),
)),
)
// Set the global trace provider
otel.SetTracerProvider(tp)
return tp, nil
}
func main() { tp, err := initTracer("user-service") if err != nil { log.Fatalf("Failed to initialize tracer: %v", err) } defer tp.Shutdown(context.Background())
// Rest of your application...
}
HTTP Middleware for Tracing
Add middleware to automatically create spans for incoming HTTP requests:
import ( "net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/trace"
)
func tracingMiddleware(next http.Handler) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { // Extract trace context from the incoming request propagator := otel.GetTextMapPropagator() ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
// Create a span for this request
tracer := otel.Tracer("http")
ctx, span := tracer.Start(ctx, r.URL.Path, trace.WithSpanKind(trace.SpanKindServer))
defer span.End()
// Add common attributes
span.SetAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.String()),
attribute.String("http.user_agent", r.UserAgent()),
)
// Store trace and span IDs in request-scoped logger
traceID := span.SpanContext().TraceID().String()
spanID := span.SpanContext().SpanID().String()
logger := log.Ctx(r.Context()).With().
Str("trace_id", traceID).
Str("span_id", spanID).
Logger()
ctx = logger.WithContext(ctx)
// Call the next handler with the updated context
next.ServeHTTP(w, r.WithContext(ctx))
})
}
Tracing HTTP Clients
Propagate trace context in outgoing HTTP requests:
import ( "context" "net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/trace"
)
func tracingTransport(base http.RoundTripper) http.RoundTripper { return traceTransport{base: base} }
type traceTransport struct { base http.RoundTripper }
func (t traceTransport) RoundTrip(req *http.Request) (*http.Response, error) { ctx := req.Context()
tracer := otel.Tracer("http-client")
url := req.URL.String()
ctx, span := tracer.Start(ctx, "HTTP "+req.Method, trace.WithSpanKind(trace.SpanKindClient))
defer span.End()
// Add span attributes
span.SetAttributes(
attribute.String("http.method", req.Method),
attribute.String("http.url", url),
)
// Inject trace context into request headers
propagator := otel.GetTextMapPropagator()
propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))
// Execute the request
resp, err := t.base.RoundTrip(req)
if err != nil {
span.RecordError(err)
return resp, err
}
// Add response attributes
span.SetAttributes(
attribute.Int("http.status_code", resp.StatusCode),
)
return resp, err
}
// Use the transport in your HTTP client func createTracingClient() *http.Client { return &http.Client{ Transport: tracingTransport(http.DefaultTransport), } }
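For the client span to be parented under the current trace, outgoing requests must carry the caller's context, so build them with http.NewRequestWithContext. A short usage sketch (the inventory-service URL is illustrative):

func fetchInventory(ctx context.Context, client *http.Client) (*http.Response, error) {
    // The request inherits the caller's trace context via ctx.
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://inventory-service/items", nil)
    if err != nil {
        return nil, err
    }
    return client.Do(req)
}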
Tracing Database Operations
Add tracing to database queries to identify slow operations:
import ( "context" "database/sql"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/trace"
)
func GetUserByID(ctx context.Context, db *sql.DB, id string) (*User, error) { tracer := otel.Tracer("database") ctx, span := tracer.Start(ctx, "GetUserByID", trace.WithSpanKind(trace.SpanKindClient)) defer span.End()
span.SetAttributes(
attribute.String("db.operation", "query"),
attribute.String("db.statement", "SELECT * FROM users WHERE id = ?"),
attribute.String("db.user_id", id),
)
var user User
err := db.QueryRowContext(ctx, "SELECT id, name, email FROM users WHERE id = ?", id).
Scan(&user.ID, &user.Name, &user.Email)
if err != nil {
span.RecordError(err)
return nil, err
}
return &user, nil
}
Integrating the Pillars
The real power of observability comes from integrating logs, metrics, and traces:
Correlation with Request ID
Use a consistent request ID across all three pillars:
func handleRequest(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    requestID := getRequestID(ctx)

    // For logging
    logger := log.Ctx(ctx).With().Str("request_id", requestID).Logger()

    // For metrics (beware: per-request label values create unbounded cardinality
    // in Prometheus, so in practice keep request IDs in logs and traces, or use exemplars)
    httpRequestsWithID.WithLabelValues(requestID).Inc()

    // For tracing
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(attribute.String("request_id", requestID))

    // Process the request...
}
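The getRequestID helper used above isn't shown in this article; here is a minimal sketch that stores and retrieves the ID through the request context (the key type and helper names are assumptions, not a standard API):

type ctxKey string

const requestIDKey ctxKey = "request_id"

// withRequestID stores the request ID in the context (call it from the logging middleware).
func withRequestID(ctx context.Context, id string) context.Context {
    return context.WithValue(ctx, requestIDKey, id)
}

// getRequestID returns the request ID, or "" if none was set.
func getRequestID(ctx context.Context) string {
    if id, ok := ctx.Value(requestIDKey).(string); ok {
        return id
    }
    return ""
}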
Correlating Logs with Traces
Include trace and span IDs in logs:
func processOrder(ctx context.Context, order *Order) error {
    span := trace.SpanFromContext(ctx)
    traceID := span.SpanContext().TraceID().String()
    spanID := span.SpanContext().SpanID().String()

    logger := log.Ctx(ctx).With().
        Str("trace_id", traceID).
        Str("span_id", spanID).
        Str("order_id", order.ID).
        Logger()

    logger.Info().Msg("Processing order")

    // Business logic...

    return nil
}
Recording Metrics in Spans
Add key metrics as span attributes:
func processPayment(ctx context.Context, payment *Payment) error {
    tracer := otel.Tracer("payment")
    ctx, span := tracer.Start(ctx, "ProcessPayment")
    defer span.End()

    startTime := time.Now()

    // Process payment...

    // Record duration as span attributes
    duration := time.Since(startTime)
    span.SetAttributes(
        attribute.Float64("payment.amount", payment.Amount),
        attribute.String("payment.method", payment.Method),
        attribute.Int64("payment.duration_ms", duration.Milliseconds()),
    )

    // Also record as a metric
    paymentProcessingDuration.Observe(duration.Seconds())

    return nil
}
Effective Dashboards and Alerts
Observability data is only valuable if it helps you understand your system and detect issues quickly.
Creating Effective Dashboards
Design dashboards that tell a story about your system:
- Service Overview Dashboard:
  - Request rate, error rate, and latency (RED metrics)
  - Active instances and health status
  - Resource utilization (CPU, memory, network)
- Business Metrics Dashboard:
  - Orders processed per minute
  - Conversion rates
  - Revenue metrics
  - User activity
- Dependency Health Dashboard:
  - Database connection pool status
  - External API latency and error rates
  - Message queue depth and processing rate
Setting Up Meaningful Alerts
Define alerts that detect actual problems without creating alert fatigue:
- Golden Signals Alerts:
  - High error rate (e.g., > 1% errors for 5 minutes)
  - High latency (e.g., p95 latency > 500ms for 5 minutes)
  - Traffic drop/spike (e.g., 50% change from baseline)
  - Saturation (e.g., memory usage > 85% for 10 minutes)
- Business Alerts:
  - Order processing failures above threshold
  - Payment processing success rate below threshold
  - Critical user journey completion rate drop
Alert Response Procedures
For each alert, define a clear response procedure:
- What to check first: Logs, traces, metrics, recent deployments
- Who to contact: Primary on-call, backup, domain experts
- Remediation steps: Common fixes, rollback procedures
- Escalation path: When and how to escalate issues
Real-World Example: Troubleshooting with Observability
Let's walk through a real example of how integrated observability can help troubleshoot an issue:
The Problem
Users report intermittent timeouts when placing orders.
Investigation with Observability
- Start with Metrics:
  - Dashboard shows increased p95 latency in the order service
  - Payment service shows normal metrics
  - Database connection pool is near capacity
- Examine Logs:
  - Filter logs for errors related to order processing
  - Find entries showing database query timeouts
  - Extract trace IDs from error logs
- Analyze Traces:
  - Look at traces for slow requests
  - Discover that a query for product inventory is taking > 1s
  - Spans show the database as the bottleneck
- Root Cause:
  - Missing index on the product inventory table
  - High traffic causing table scans instead of index lookups
Resolution
- Add the missing index
- Optimize the query
- Increase database connection pool capacity
- Add caching for frequently accessed inventory data
Without integrated observability, this issue could have taken hours or days to diagnose. With proper instrumentation, it was resolved in minutes.
Implementing Observability Across Services
For consistent observability across your microservice architecture, consider these approaches:
Shared Libraries
Create shared libraries for standardized instrumentation:
// pkg/observability/observability.go
package observability

import (
    "context"
    "net/http"

    "github.com/rs/zerolog"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Config holds configuration for all observability components
type Config struct {
    ServiceName    string
    Environment    string
    LogLevel       zerolog.Level
    JaegerEndpoint string
    PrometheusPort string
}

// Service provides access to all observability components
type Service struct {
    Logger         zerolog.Logger
    TracerProvider *sdktrace.TracerProvider
    HTTPMiddleware func(http.Handler) http.Handler
    Cleanup        func(context.Context) error
}

// New creates a fully configured observability service
func New(cfg Config) (*Service, error) {
    // Initialize logger
    logger := initLogger(cfg)

    // Initialize tracer
    tp, err := initTracer(cfg)
    if err != nil {
        return nil, err
    }

    // Initialize metrics
    initMetrics(cfg)

    // Create combined middleware
    middleware := chainMiddleware(
        loggingMiddleware(logger),
        tracingMiddleware(),
        metricsMiddleware(),
    )

    // Create cleanup function
    cleanup := func(ctx context.Context) error {
        return tp.Shutdown(ctx)
    }

    return &Service{
        Logger:         logger,
        TracerProvider: tp,
        HTTPMiddleware: middleware,
        Cleanup:        cleanup,
    }, nil
}
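The chainMiddleware helper referenced above isn't defined in this article; a minimal sketch that would live in the same package, composing middleware so that the first argument wraps outermost:

// chainMiddleware composes middleware; the first entry becomes the outermost wrapper.
func chainMiddleware(mw ...func(http.Handler) http.Handler) func(http.Handler) http.Handler {
    return func(final http.Handler) http.Handler {
        h := final
        for i := len(mw) - 1; i >= 0; i-- {
            h = mw[i](h)
        }
        return h
    }
}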
Service Mesh Approach
For larger deployments, a service mesh like Istio can provide consistent observability without code changes:
- Automatic Tracing: Service mesh proxies automatically generate and propagate trace headers
- Metrics Collection: Detailed traffic metrics without manual instrumentation
- Uniform Telemetry: Consistent observability across services regardless of language
Conclusion
Building proper observability into Go microservices is essential for operating reliable systems at scale. By implementing structured logging, metrics, and distributed tracing, you can gain deep visibility into your services and quickly diagnose issues when they arise.
Key takeaways from this article:
- Use structured logging with contextual information to make logs searchable and analyzable
- Implement metrics for both technical and business operations to understand system behavior
- Add distributed tracing to follow requests across service boundaries
- Integrate all three pillars for a complete observability solution
- Design effective dashboards and alerts to detect and diagnose issues quickly
Remember that observability is not just about tooling—it's about building a culture where teams value visibility and invest in the instrumentation needed to understand their systems.
In future articles, I'll explore advanced observability topics including anomaly detection, SLO monitoring, and implementing observability in serverless and event-driven architectures.
About the author: I'm a software engineer with experience in systems programming and distributed systems. Over the past years, I've been designing and implementing Go microservices with a focus on reliability, performance, and observability.