Observability

Runbooks and Dashboards: Scrapbooking for Engineers

It seems like every company these days has two things they turn to when first hitting an incident: a) is there an Alert Reference / Runbook / Whatever (the actual term depends on who you talk to), and b) is there a dashboard that can tell me exactly what’s wrong? In this post, I’m going to cover why these tools are not only unhelpful but actively harmful to your incident response. Let’s get into it. ...

I Don't Think ElasticSearch Is A Good Logging System

I’ve been a heavy user of ElasticSearch for coming up on seven years now. During that time I’ve used it for a few main use cases: a search engine, an APM solution (after New Relic started being stupidly expensive), a backend for Jaeger, and a log storage system. In all of those use cases I’ve really pushed ElasticSearch to its limits, with hundreds of terabytes of data across dozens of machines and tens of thousands of shards, and in all that time I’ve found that it really only works well for one of those situations. Particularly with Elastic’s push towards being anti-user, I wanted to question whether storing log data is a good use case for ElasticSearch and suggest some better options. ...

You Should Use Structured Logging

Logging (i.e. exporting text data from your application) is one of the very first things that any budding programmer learns to do. Who among us hasn’t started learning a programming language with a:
    10 PRINT "Hello World!"
    20 END
That’s a log! Not a good log, but a log nonetheless. And that is why the importance of logging in a system’s observability is so often overlooked - everyone can log, but it’s hard to log well. Here I want to talk about logging, in particular structured logging, and justify my opinion: if you’re not using structured logging, your logs are not actually useful. ...
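To make "structured logging" concrete, here is a minimal sketch using only Python's standard library. It is one possible illustration, not the approach the full post necessarily takes: the `JsonFormatter` class, the `checkout` logger name, the `fields` attribute, and the example order data are all made up for this sketch.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical attribute name chosen for this sketch;
        # it carries the structured key/value pairs attached by the caller.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)

# The event name stays constant; the variable parts travel as fields
# instead of being interpolated into the message string.
log.info(
    "payment failed",
    extra={"fields": {"order_id": 1234, "reason": "card_declined"}},
)

line = stream.getvalue().strip()
parsed = json.loads(line)  # any consumer can parse the line back into fields
print(line)
```

The point of the design is that `order_id` and `reason` can be queried as fields downstream, rather than being fished out of free text with a regular expression.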

Are Grafana DSLs Actually Useful?

Grafana is one of the most common tools, if not the most common, for visualizing telemetry data from your applications. I myself use Grafana both at work and for my own personal projects to plot all sorts of data, from my Plex server usage stats to my house’s power usage. Generally I use the WYSIWYG interface for building my dashboards, but in modern software development the push seems to be to GitOps everything, so I went looking for solutions - here’s what I found. Note that this isn’t a debate on whether dashboards are actually useful; that’s a topic for another day. ...

I Think Prometheus Is Impossible for FAAS Applications

I recently tried to orchestrate Prometheus metrics in a Function-As-A-Service (FAAS) application, which turned out to be a bit of a harrowing experience. Here’s what I learned.
Prometheus in a FAAS World
In a “standard” architecture, you have a long-running service running on some machine somewhere. That service exposes an HTTP(S) endpoint that Prometheus discovers (through some service discovery mechanism) and periodically sends GET requests to, parsing the metrics that your application generates. This “pull based” model relies on several properties of this architecture - applications live long enough to coincide with when Prometheus decides to scrape them, and to be able to count things on their own. ...
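The pull model described above can be sketched with nothing but the standard library. This is a toy stand-in, not Prometheus itself or its client library: the `app_requests_total` metric, the `COUNTERS` dict, and the `scrape` helper are hypothetical names, and the "scraper" here only understands the simplest `name value` lines of the text exposition format.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counter state. It only exists while the process is alive,
# which is exactly the property the pull model depends on.
COUNTERS = {"app_requests_total": 0}


def render_metrics() -> str:
    """Render the counters as simple text exposition lines."""
    lines = []
    for name, value in sorted(COUNTERS.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def scrape(url: str) -> dict:
    """Stand-in for the scraper: GET the endpoint and parse 'name value' lines."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


# The "long-running service": port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/metrics"

COUNTERS["app_requests_total"] += 3
first = scrape(url)   # the scraper sees the running total so far
COUNTERS["app_requests_total"] += 2
second = scrape(url)  # and the accumulated total on the next scrape

server.shutdown()
server.server_close()
```

Because the counter lives in process memory, a function instance that starts, handles one request, and exits before the next scrape never gets a chance to report anything.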

The Tragedy of the Platform Commons

In many modern system architectures, there exist some common building blocks between systems. We generally call these the “Platform” (see Facebook’s “Platform Engineer” role), which makes a nice analogy, as these components are supposed to form a solid base for you to build a system on top of. But how should we manage these common pieces? Therein lies an interesting question, one that I offer some thoughts on here.
Organisational Models
There are three main organisational models I have seen in the wild, each with its benefits and drawbacks. ...