OpenTelemetry Part 3 - Observability
In this third part of the OpenTelemetry series, we dive into observability—the ability to monitor, track, and analyze Logs, Metrics, and Traces from distributed systems. We'll explore open-source tools and third-party libraries that help you visualize these three telemetry pillars in a consistent way. Our goal is to create a unified observability solution using mostly free and open-source software. While some premium tools offer extra features and support, we'll focus on building a cost-effective solution that can run on-premise.
Demo - Solution
In part 2 of this series, we created a microservices architecture using OpenTelemetry. It featured a simple web shop where users could place orders. We also implemented a dummy service that updated product prices periodically. If you followed along with the Docker setup in Part 2, you're ready to generate Logs, Metrics, and Traces for this observability demonstration.
Create an order
To create an order in our web shop, you need a JSON payload with the products you want to order. The products are seeded in the Postgres Product_Database during initialization.
Here is the sample data for our products:
Product id | Name | Description | Price | Stock |
---|---|---|---|---|
P001 | Laptop | High-performance laptop with SSD | 1200.00 | 0 |
P002 | Smartphone | Latest model with advanced features | 800.00 | 100 |
P003 | Smart TV | 4K Ultra HD with Smart Hub | 1500.00 | 30 |
P004 | Digital Camera | Professional DSLR camera | 900.00 | 20 |
P005 | Wireless Headphones | Noise-canceling technology | 150.00 | 80 |
P006 | Fitness Tracker | Track your health and activities | 80.00 | 120 |
P007 | Gaming Console | Next-gen gaming experience | 400.00 | 0 |
P008 | Coffee Maker | Espresso and cappuccino machine | 200.00 | 60 |
P009 | External Hard Drive | 2TB storage capacity | 100.00 | 25 |
P010 | Wireless Router | High-speed internet connectivity | 70.00 | 0 |
P011 | Desk Chair | Ergonomic design for comfort | 120.00 | 35 |
P012 | Portable Speaker | Bluetooth-enabled for on-the-go music | 50.00 | 70 |
P013 | Cookware Set | Non-stick pots and pans | 150.00 | 15 |
P014 | Backpack | Water-resistant and spacious | 40.00 | 90 |
P015 | Printer | Color inkjet printer with wireless printing | 80.00 | 30 |
P016 | Smart Thermostat | Energy-efficient temperature control | 100.00 | 40 |
P017 | Yoga Mat | High-density foam for comfortable workouts | 20.00 | 65 |
P018 | Blender | Powerful blender for smoothies and shakes | 60.00 | 25 |
P019 | Security Camera | HD camera with motion detection | 120.00 | 0 |
P020 | LED Desk Lamp | Adjustable brightness for study or work | 30.00 | 0 |
Here is the JSON payload to create an order:
{
  "Articles": [
    {
      "Id": "P001",
      "Price": "1200"
    },
    {
      "Id": "P002",
      "Price": "699"
    },
    {
      "Id": "P003",
      "Price": "1500"
    },
    {
      "Id": "P099",
      "Price": "1"
    }
  ]
}
This example has some intentional issues: P001 is out of stock, P002 has an incorrect price, and P099 doesn't exist in our product list. This simulates real-world scenarios where stock levels change and prices fluctuate.
To create an order, send this JSON payload as an HTTP POST request to http://localhost:8080/order.
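If you want to try it from the command line, a curl call along these lines should do the trick (the endpoint and payload are the ones shown above; adjust the port if your Docker setup maps it differently):

```bash
curl -X POST http://localhost:8080/order \
  -H "Content-Type: application/json" \
  -d '{
        "Articles": [
          { "Id": "P001", "Price": "1200" },
          { "Id": "P002", "Price": "699" },
          { "Id": "P003", "Price": "1500" },
          { "Id": "P099", "Price": "1" }
        ]
      }'
```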
You'll get a response like this:
{
  "order": {
    "notification": {
      "notificationsSent": true
    },
    "order": {
      "id": "OR1712081210523"
    },
    "payment": {
      "paymentSucceeded": true
    },
    "product": {
      "orderedProducts": [
        { "id": "P002", "price": 800.0 },
        { "id": "P003", "price": 1500.0 }
      ],
      "missingInventory": [
        { "id": "P001", "price": 1200.0 },
        { "id": "P099", "price": 1 }
      ]
    },
    "shipping": {
      "trackingNumber": "TR303589374SE",
      "estimatedDelivery": "2024-04-05T18:06:50.8414918Z"
    }
  }
}
The response shows that P002's price was corrected, while P001 and P099 were not included in the order and show up under missingInventory instead. You also see the order ID, payment status, and estimated delivery information.
From this response, we can tell that several services were involved, but it's unclear what happened behind the scenes. This is where observability tools come in, helping us trace the request's journey, measure performance, and monitor system health.
Open Source
Let's look at open-source tools that can help us achieve observability in our development environment.
Jaeger
Jaeger is an open-source distributed tracing system that helps you monitor microservices. It tracks the flow of requests, identifies performance bottlenecks, and helps you understand service dependencies. Although Jaeger has production-level features, it’s commonly used in development and testing environments.
Features
Distributed Tracing: Jaeger allows you to trace requests across different microservices, providing a detailed view of their execution flow.
Span and Trace Visualizations: It offers graphical representations of traces and spans, enabling easier identification of performance bottlenecks.
Integration with OpenTelemetry: Jaeger is one of the primary backends supported by OpenTelemetry, making it simple to integrate into existing applications.
Sampling: Jaeger supports various sampling strategies to manage the volume of collected traces.
Use Cases
Root Cause Analysis: Find out where errors occur and understand their propagation across services.
Performance Optimization: Identify slow-performing services and optimize them.
Service Dependencies: Understand how services interact with each other.
Setup and Configuration
To use Jaeger, you'll need to set up a Jaeger backend and a storage system, like Elasticsearch or Cassandra. You can run Jaeger in Docker using the "all-in-one" configuration, which includes all necessary components but doesn't persist data after you stop the container.
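As a rough sketch, the all-in-one container can be started like this; the ports are the defaults Jaeger documents for the UI and OTLP ingestion, and depending on your Jaeger version the COLLECTOR_OTLP_ENABLED variable may already be the default:

```bash
# Jaeger UI on 16686, OTLP ingestion on 4317 (gRPC) and 4318 (HTTP)
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```

Once it's running, the UI is available at http://localhost:16686, and your services or collector can export traces to it over OTLP.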
For our example, let's look at how Jaeger traces our order creation request. This trace provides a visual representation of the request's journey across multiple services.
Traces
With Jaeger, you can visualize how a request moves through your system. Here's an example from our order creation scenario:
This timeline graph shows all the services involved in the order process. It provides a detailed view of each step along the way. If you expand a specific span, you get even more information about what happened at that point in the process.
For example, you might see a producer and subscriber responsible for sending an email notification to the customer. You'll also find other details like tags, attributes, and the service version. The great thing about OpenTelemetry is that you can customize it to fit your needs, allowing you to focus on the most important parts of your application's flow.
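As an illustration of that kind of customization, here is a minimal sketch of how a custom span with extra tags could be added in a .NET service using System.Diagnostics.ActivitySource; the source name, method, and tag keys are made up for this example and are not taken from the demo solution:

```csharp
using System.Diagnostics;

public class OrderSpanExample
{
    // The source name must match what is registered with OpenTelemetry at startup.
    private static readonly ActivitySource Source = new("Shop.OrderService");

    public void PlaceOrder(string orderId, int articleCount)
    {
        // StartActivity returns null when nothing (such as the OpenTelemetry SDK) is listening.
        using var activity = Source.StartActivity("PlaceOrder");
        activity?.SetTag("order.id", orderId);
        activity?.SetTag("order.articleCount", articleCount);

        // ... reserve stock, take payment, book shipping ...
    }
}
```

Tags added this way show up on the span in Jaeger, which makes it easy to search for a specific order across all services.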
Prometheus
Prometheus is an open-source monitoring and alerting toolkit primarily used for capturing time-series data, such as metrics, from various systems and applications. It is widely adopted for its robust querying language, PromQL, and its seamless integration with OpenTelemetry.
Features
Time-Series Data Collection: Prometheus collects metrics at regular intervals, allowing for time-series analysis.
PromQL Query Language: A powerful query language for querying metrics data and creating complex expressions.
Alerting: Supports alerting rules to trigger notifications when specific conditions are met.
Service Discovery: Automatically discovers targets for metric collection in dynamic environments.
Integration with Grafana: Prometheus works well with Grafana for visualization.
Use Cases
Infrastructure Monitoring: Track resource usage and performance across various components.
Application Metrics: Monitor application-specific metrics such as response times, error rates, and throughput.
Alerting: Set up alerts to notify you when metrics exceed defined thresholds.
Setup and Configuration
Prometheus can be set up as a standalone server, collecting metrics from various exporters or OpenTelemetry instrumentation. Once installed, you can configure scrape jobs to define which services to monitor. Visualization and analysis can be done using Grafana or the built-in Prometheus dashboard.
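A minimal scrape configuration could look like the sketch below. The target assumes the OpenTelemetry Collector exposes metrics through its Prometheus exporter on port 8889, which is a common choice in collector examples but not a given; point the job at whatever endpoints your setup actually exposes:

```yaml
# prometheus.yml - minimal example
global:
  scrape_interval: 15s            # how often Prometheus pulls metrics

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]   # Prometheus endpoint exposed by the collector
```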
Metrics
Prometheus allows you to monitor a variety of metrics generated by your services, infrastructure, and OpenTelemetry collector. For example, you can check how much memory different services are using in near real-time. You can also set up alerts to notify you when certain metrics hit specific thresholds, giving you a heads-up before small issues become big problems.
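To make that concrete, here are a couple of hedged PromQL examples; the metric names are assumptions based on the OpenTelemetry .NET runtime instrumentation conventions and will differ depending on the packages and exporter versions you run:

```promql
# Hypothetical metric name: GC heap size per service instance
process_runtime_dotnet_gc_heap_size_bytes

# Alert-style expression: true whenever the heap grows past 500 MB
process_runtime_dotnet_gc_heap_size_bytes > 500 * 1024 * 1024
```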
Seq
Seq is a log aggregation and visualization platform designed for structured logs. It offers powerful search capabilities and is commonly used for analyzing logs from .NET and other structured logging systems. While not strictly open-source, Seq has a community edition that can be used in smaller-scale environments without a subscription fee.
Features
Structured Log Aggregation: Seq collects and stores structured logs for analysis.
Advanced Search and Filtering: Allows you to search logs based on specific fields and conditions.
Correlation: Supports correlation of logs across different sources to understand request flow.
Alerting and Notifications: Set up alerts based on log patterns or specific events.
Integrations: Works with popular logging frameworks like Serilog, NLog, and OpenTelemetry.
Use Cases
Log Analysis: Search and analyze application logs to understand behavior and detect anomalies.
Troubleshooting: Investigate specific log entries and correlate them with system events.
Alerting and Notifications: Create alerts based on specific log patterns or events.
Setup and Configuration
Seq can be set up on-premise or in the cloud. After installation, configure your application to send structured logs to Seq using appropriate libraries or OpenTelemetry instrumentation. Use the Seq web interface to visualize and search through logs.
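For local development, a sketch of running Seq in Docker looks like this; the datalust/seq image and the EULA variable follow Datalust's own documentation, while the host port 5341 is just a common convention:

```bash
# Seq UI and ingestion are served on port 80 inside the container
docker run -d --name seq \
  -e ACCEPT_EULA=Y \
  -p 5341:80 \
  datalust/seq:latest
```

Applications can then ship logs to http://localhost:5341, either through a Serilog sink or via OpenTelemetry, depending on what your Seq version supports.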
Logs
Seq allows you to view logs from all your systems in one place. This is powerful because you can follow a log trace from start to finish without having to check multiple databases or files. It helps you quickly find where things went wrong without the usual hassle of cross-referencing different sources.
With Seq, you also get a better overview of potential security issues like intrusions or other types of system attacks.
Dashboards and alerts in Seq make it easy to spot problems at a glance, giving you an early warning when something's not right.
Zipkin
Zipkin is an open-source distributed tracing system designed to trace requests across microservices. It provides similar functionality to Jaeger, with a focus on ease of use and integration with existing applications. Zipkin has been around for ages, and I first played with it back when service meshes like Istio were all the buzz.
Zipkin's features are similar to Jaeger's, so it's a matter of preference when choosing between the two. However, Zipkin is known for its simplicity, making it ideal for quick setup and basic distributed tracing.
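That simplicity shows in the setup; a single container is enough to try it out, with 9411 being Zipkin's documented default port:

```bash
# Zipkin UI and API on http://localhost:9411
docker run -d --name zipkin -p 9411:9411 openzipkin/zipkin
```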
Premium Tools
While open-source tools can handle most observability needs, some premium tools offer additional features and support. These tools often require a subscription or one-time purchase. Here are some well-known premium tools for observability:
Honeycomb.io: A cloud-based observability platform designed for distributed tracing and high-cardinality data analysis.
Azure Monitor: Part of Microsoft Azure's cloud services, offering observability tools for metrics, logs, and traces.
AWS Distro for OpenTelemetry: Amazon's supported OpenTelemetry distribution for collecting telemetry and sending it to AWS monitoring services.
Splunk: A comprehensive platform for searching, monitoring, and analyzing machine-generated data.
SigNoz: An open-source observability platform that supports metrics, traces, and logs.
I've experimented with SigNoz to showcase an alternative toolset. SigNoz has a Docker-based setup that allows you to integrate your services with minimal configuration. You can find an example setup on their GitHub repository.
I won't include images here because the output looks similar to what we've already discussed, but with a different user interface.
Take a look at my sample here, where I integrated my own services and got it running on Windows.
Wrapping up
Observability is crucial for understanding, troubleshooting, and improving microservices-based architectures. With OpenTelemetry and tools like Jaeger, Prometheus, Seq, and Zipkin, you have a range of open-source solutions to choose from. If you're interested in premium services, options like Honeycomb, Azure Monitor, and Splunk offer extra features and support, though they usually require a subscription.
When setting up your observability stack, make sure you implement OpenTelemetry across all the systems you want to monitor. If one part of your architecture lacks OpenTelemetry, it creates a gap in your monitoring, breaking the chain of spans and traces. This can leave you with incomplete data and make troubleshooting more difficult.
To avoid these gaps, start with a simple setup and ensure all core services are instrumented. Once you have a stable baseline, you can expand to more complex configurations like child spans and custom metrics. This way, you'll get the most out of your OpenTelemetry implementation and maintain a consistent view of your system's health.
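To make that baseline concrete, here is a rough sketch of what the OpenTelemetry bootstrap for a single ASP.NET Core service can look like, using the OpenTelemetry.Extensions.Hosting package together with the ASP.NET Core, HttpClient, and runtime instrumentation packages and the OTLP exporter. The service name and the choice of instrumentations are placeholders, not a description of the demo solution from part 2:

```csharp
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("order-service"))   // placeholder service name
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter())                                   // defaults to http://localhost:4317
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter());

builder.Logging.AddOpenTelemetry(options => options.AddOtlpExporter());

var app = builder.Build();
app.Run();
```

Repeat the same pattern in every service, and the spans, metrics, and logs will share the trace context needed for the end-to-end view described above.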
Hopefully this series has been rewarding for you. Please reach out on my LinkedIn or via johanol.com/contact.
As always, all the code is up on GitHub.