Runbooks and Dashboards: Scrapbooking for Engineers

It seems like every company these days has two things they turn to when first hitting an incident: a) an Alert Reference / Runbook / Whatever (the actual term depends on who you talk to), and b) a dashboard that can tell them exactly what’s wrong. In this post, I’m going to cover why these tools are not only unhelpful, but actively harmful to your incident response....

July 18, 2022 · 7 min

I Don't Think ElasticSearch Is A Good Logging System

I’ve been a heavy user of ElasticSearch for coming up on 7 years now. During that time I’ve used it for a few main use cases: a search engine, an APM solution (after NewRelic started being stupidly expensive), a backend for Jaeger, and a log storage system. In all of those use cases I’ve really pushed ElasticSearch to its limits, with hundreds of terabytes of data across dozens of machines and tens of thousands of shards, and in all that time I’ve found that it really only works well for one of those situations....

September 28, 2021 · 5 min

You Should Use Structured Logging

Logging (i.e. exporting text data from your application) is one of the very first things that any budding programmer learns to do. Who among us hasn’t started learning a programming language with a:
    10 PRINT "Hello World!"
    20 END
That’s a log! Not a good log, but a log nonetheless. And that is why the importance of logging in a system’s observability is so often overlooked: everyone can log, but it’s hard to log well....

September 28, 2021 · 4 min

Are Grafana DSLs Actually Useful?

Grafana is one of the most common tools, if not the most common, for visualizing telemetry data from your applications. I myself use Grafana both at work and for my own personal projects to plot all sorts of data, from my Plex server usage stats to my house’s power usage. Generally I use the WYSIWYG interface for building my dashboards, but in modern software development the push seems to be to GitOps everything, so I went looking for solutions - here’s what I found....

August 23, 2021 · 3 min

I Think Prometheus Is Impossible for FAAS Applications

I recently tried to orchestrate Prometheus metrics in a Function-As-A-Service (FAAS) application, which turned out to be a bit of a harrowing experience. Here’s what I learned.
Prometheus in a FAAS World
In a “standard” architecture, you have a long-running service running on some machine somewhere. That service exposes an HTTP(S) endpoint, which Prometheus discovers (through some service discovery mechanism) and periodically sends GET requests to, parsing the metrics that your application generates....
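
The pull model described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the setup from the post: the metric name `myapp_requests_total`, the `COUNTERS` dictionary, the `render_metrics` and `serve` helpers, and port 8000 are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these as it works.
COUNTERS = {"myapp_requests_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    # One metric line per row, each terminated by a newline.
    return "".join(line + "\n" for line in lines)


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers the periodic GET requests that a scraper sends to /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Blocks for the lifetime of the process, waiting to be scraped.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint and blocks. Note that the sketch only works because the process stays alive between scrapes, which is exactly the long-running service assumption of the “standard” architecture.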

August 1, 2021 · 5 min

The Tragedy of the Platform Commons

In many modern system architectures, there exist some common building blocks between systems. We generally call these the “Platform” (see Facebook’s “Platform Engineer” role), which makes a nice analogy, as these components are supposed to form a solid base for you to build a system on top of. But how should we manage these common pieces? Therein lies an interesting question, on which I offer some thoughts here.
Organisational Models
There are three main organisational models I have seen in the wild, each with their benefits and drawbacks....

November 30, 2020 · 3 min