It’s been just over a year since I left my 7 and a bit year stint at Cloudflare, and I’ve had a lot of time to mull over the experience. I have … a lot of thoughts, and I figured the best way to get them out is to write them down.
Just upfront: These are my own lived experiences, but I’m not going to touch on the controversial stuff - Daily Stormer, Kiwi Farms, etc. Everything here is my own opinion, and doesn’t reflect anything official at Cloudflare. If you’re here looking for juicy goss, this ain’t it. With that out of the way:
How I got into Cloudflare
Around the middle of 2017, I was looking to move to the UK. A lot of New Zealanders do - enough that we have a phrase for it: The Big OE. I had originally planned to move with my employer at the time (Xero), but didn’t for two reasons:
- They wouldn’t let me relocate
- Even if they did, I’d have to live in Milton Keynes 🤮
That left me a bit frantic, looking for a way to not starve to death on the streets of London. From my emails at the time, I put in about 10 different applications before scrolling through the Cloudflare dashboard one day to debug my email server and realising: hey, Cloudflare is a tech company! Interestingly, I originally applied to be a Systems Engineer on the DNS team (I deeply admire Ólafur Guðmundsson, having met him a few times prior at various conferences); however, I never heard back about that application. Instead, I somehow got routed into the SRE queue (presumably because my CV has “SRE” in it). My first interview was a technical one with James O’Gorman at around 9PM (for some reason I misremembered this as being much later; either way, such are the perils of interviewing for a job on the other side of the world). I don’t remember much beyond flubbing it, and walking away to pour myself a commiseration glass of Amrut.
Nevertheless, they called me back for an on-site interview. I remember Catherine (my recruiter) asking if I wanted them to fly me in, which was very generous, but because I had business to finish off in NZ and had already booked my flights for a week and a bit later, this would have meant NZ -> UK -> NZ -> China -> UK within two weeks, and that is frankly too much time on a plane.
So, two and a bit weeks later, after visiting my brother in China, I ended up in London. I landed on a balmy 30°C day in my thick NZ winter jacket, having been warned that even in summer the UK is very cold. I was staying in Canary Wharf at the time, but I remember scoping out the Cloudflare office in Lavington Street an entire week before the interview so that I knew how the tube worked and exactly where I was going.
The day of my interview I arrived way too early (probably out of nerves), and was stashed in one of the “fishbowl” meeting rooms of Cloudflare’s Lavington Street office (Babbage, for those who know the layout), where everyone can gawk at you. Similarly to the phone interview, I don’t remember much about it, but a few things do stick out: being so nervous that I said VNets were Layer 3 (a critical mistake for someone ostensibly with a network engineering degree), and casually mentioning the “China Network”, having just been in China, and getting laughed at at lunch. Again assuming that I’d flubbed it, I left the office pretty dejected and ended up on the Southwark Station steps organising another commiseration drink with a friend of mine (seeing a pattern?). Against all odds though, while sitting on those steps I received an email from Michael Daly saying they wanted to continue with me, and two weeks later I was signing a contract. I am infinitely grateful to everyone involved in that interview process for taking a chance on a naive little NZer out in the world - you all literally changed my life.
My first few months
When I started at Cloudflare, we had one monolithic “SRE” team and a separate “PlatOps” team. Roughly, these handled “the edge” (Cloudflare’s datacenters around the world) and “core” (our big central Single Point Of Failure in Portland) respectively. As Cloudflare’s product offerings grew, the SRE team in particular was tasked with understanding and maintaining an increasingly large catalogue of stuff. This made onboarding as a new SRE a monumental task, and I’m not sure anyone on the team fully understood everything. About 4 or so months in, to combat burnout among our more senior staff, among other things, we split those teams into “Edge Operations”, “Edge Platform”, “Core Operations”, and “Core Platform”. The Operations teams would handle the day-to-day stuff (including on-call), and the Platform teams would handle more long-term projects. To this day, I’m not sure we got this right. Cloudflare has a tendency to run teams very lean, which is fine when it comes to standard engineering work (and, I would say, is actually one of Cloudflare’s strengths), but when you’re an on-call team, running lean has a tendency to burn people out very quickly.

Having been thrown into the fledgling Core Ops team (as one of 3 folks), I found myself having to come up to speed on an entirely different tech stack, having only just learnt the edge. Thankfully my first project here was rather simple: migrating “RespectTables”, our on-call-managing chat bot, over to Google Chat as part of an (in my opinion) ill-conceived and rushed migration away from our self-hosted HipChat server (amusingly, that HipChat server stayed alive for several years afterwards, and served as a fun hangout spot when Google Chat went down). RespectTables is now taught during orientation, having expanded to cover nearly every facet of day-to-day work for everyone at Cloudflare, from engineering to sales and everything in between. I’m very proud of that.
Onto Observability
As a bit of a defence mechanism against the vast amount of things we needed to learn, I decided when I joined Core Operations that I was going to focus on one area of Core and get really good at it. Picking up from Matt Bostock (another engineer I admire greatly, who sadly left to join Google not long after I joined), I decided that area was going to be our Observability infrastructure. Cloudflare’s Observability was pretty simple at the time - Bostock had moved us over to this weird new thing called Prometheus, off of our terrible-to-manage OpenTSDB (and resulting HBase, and thus Hadoop) infrastructure, and we had a pretty large (soon to become giant) ELK stack for storing logs.

It was the ELK stack that I started with. In particular, Elasticsearch has only limited support for arbitrary fields in a document: once an index’s mapping grows beyond a certain number of fields, it simply stops accepting logs that introduce new ones. This, coupled with the fact that dynamically mapped fields are tremendously inefficient, led me to one of my most regretted, yet necessary, decisions: we implemented a strict schema for logs in ES. This solved the initial problem - our Elasticsearch no longer fell over every few weeks - but it introduced a learning curve for new engineers that still exists to this day. This taught me an interesting lesson - everything is a trade-off. In solving one problem, I introduced a little bit of friction, and when that little bit of friction is multiplied by every engineer, suddenly that’s a lot of friction.
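To make “strict schema” concrete, here’s a minimal sketch of the kind of mapping I mean. The index name and field list are purely illustrative (not Cloudflare’s actual schema), and it assumes a local Elasticsearch on port 9200: with `"dynamic": "strict"`, Elasticsearch rejects any document carrying a field the schema doesn’t know about, instead of silently growing the mapping until it hits the total-fields limit and starts refusing logs.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// A strict mapping: documents containing fields not listed here are
	// rejected outright, rather than growing the index mapping until it
	// hits index.mapping.total_fields.limit and logs stop being indexed.
	mapping := []byte(`{
	  "mappings": {
	    "dynamic": "strict",
	    "properties": {
	      "@timestamp": {"type": "date"},
	      "host":       {"type": "keyword"},
	      "service":    {"type": "keyword"},
	      "level":      {"type": "keyword"},
	      "message":    {"type": "text"}
	    }
	  }
	}`)

	// Create the (hypothetical) "logs-example" index with that mapping.
	req, err := http.NewRequest(http.MethodPut, "http://localhost:9200/logs-example", bytes.NewReader(mapping))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("create index:", resp.Status)
}
```

Enforcing something like this is exactly where the friction comes from: every new field has to be added to the schema deliberately, rather than just showing up in a log line.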
Around the start of the pandemic, I started pushing for splitting out a dedicated Observability team - our offering had expanded to include tracing and profiling, and it was getting to be far too much to manage within a sub-team of a sub-team. Frustratingly, we ended up with two Observability teams - one based out of Core (which had been unified from Core Ops and Core Platform), and one based out of Edge Ops. These two teams worked at odds with one another until around mid-2021, with the “Edge Observability” team mostly handling the local Prometheus servers in each datacenter, and the “Core Observability” team handling the logging pipeline.

Around that time, I decided to move to Australia, partly to be a bit closer to my family, and partly to escape London: after my original manager Michael left, the remaining management structure in London had started to become quite overbearing. I remember sitting in a quarantine hotel (man, the pandemic was a weird time) meeting our new Platform VP Jeremy and expressing my woes. He agreed that a dedicated team would be a good idea, but there was still the issue of unifying the disparate teams. This was made a little easier when my manager at the time, Robbie, ended up leaving, and so we formed our team under the newly hired head of Edge SRE in APAC. We pulled in both APAC SRE teams to form a properly sized Observability Team right off the bat (avoiding the “lean team” issue above).
The Observability Team
From then on, I tech-led our Observability team, designing a lot of the overall strategy and coming up with moonshot ideas for where to go next.
One of the first things I did was take a trip to the US and sit down with our engineering teams to work out where the Observability offerings started to fall down. The major outcome of those meetings was an acknowledgment of the push for “Cloudflare on Cloudflare” - building more and more Cloudflare products on top of Cloudflare’s ever-growing developer suite. In particular, teams couldn’t use our existing Observability tooling from a Cloudflare Worker, so they had resorted to numerous hacks, mostly involving a Prometheus Push Gateway. This spawned our first real offering as an Observability team - “wshim”, a proxy that runs on Cloudflare’s machines, accepts fetch requests from Workers containing Prometheus metrics, logs, and traces, and forwards them into the downstream telemetry systems. I ended up open-sourcing some of this as the “gravel gateway”, and gave a talk about it at KubeCon which got an excellent response.
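To give a flavour of the metrics half of that problem - and to be clear, this is a toy sketch, not the actual wshim or gravel gateway code; the endpoint names and behaviour here are my own assumptions - a Worker can’t be scraped, so it pushes Prometheus text-format metrics to a local proxy over HTTP, and the proxy re-exposes them on /metrics for the node’s Prometheus to scrape:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"strings"
	"sync"
)

// store keeps the most recent text-format exposition pushed per job.
type store struct {
	mu   sync.Mutex
	jobs map[string]string
}

func main() {
	s := &store{jobs: map[string]string{}}

	// Workers POST Prometheus text-format metrics to /push/<job>.
	http.HandleFunc("/push/", func(w http.ResponseWriter, r *http.Request) {
		job := strings.TrimPrefix(r.URL.Path, "/push/")
		body, err := io.ReadAll(r.Body)
		if err != nil || job == "" {
			http.Error(w, "bad push", http.StatusBadRequest)
			return
		}
		s.mu.Lock()
		s.jobs[job] = string(body)
		s.mu.Unlock()
		w.WriteHeader(http.StatusAccepted)
	})

	// The node's Prometheus scrapes the combined payload from /metrics.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		s.mu.Lock()
		defer s.mu.Unlock()
		for _, text := range s.jobs {
			io.WriteString(w, text+"\n")
		}
	})

	log.Fatal(http.ListenAndServe(":9091", nil))
}
```

The hard part, and what the gravel gateway actually focuses on, is aggregation: many short-lived Worker invocations push at once, so counters need to be summed rather than overwritten - something this sketch conveniently skips.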
As Cloudflare continued to grow, another issue we saw was our Alertmanager cluster falling over relatively frequently. It turns out it’s not really designed for several thousand machines all sending alerts at once when something falls over. At the time, Alertmanager upstream was really reluctant to add any sort of business logic to Alertmanager, partially due to a lack of resources, and partially due to the weird pseudo-corporate feel that the Alertmanager repo has. The recommendation from upstream was to implement a reverse proxy in front of Alertmanager that enforces whatever business logic you want, but nothing like that existed at the time. To fill that gap we came up with Alertmanager Bouncer, a rule-enforcing reverse proxy that my colleague Dmitri gave a lightning talk on at PromCon. After a lot of tweaking to get the settings right, the Bouncer became one of our best tools for keeping our alerting infrastructure alive.
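As a rough sketch of what a rule-enforcing reverse proxy looks like (the specific rule below is invented for illustration and isn’t necessarily one of the Bouncer’s actual rules), assuming Alertmanager is listening on localhost:9093: intercept silence creation on the v2 API, reject anything that would match every alert, and pass everything else through untouched.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers []matcher `json:"matchers"`
}

// tooBroad flags silences with no matchers, or only catch-all regexes.
func tooBroad(s silence) bool {
	for _, m := range s.Matchers {
		if !(m.IsRegex && (m.Value == ".*" || m.Value == ".+")) {
			return false
		}
	}
	return true
}

func main() {
	upstream, err := url.Parse("http://localhost:9093") // the real Alertmanager
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPost && r.URL.Path == "/api/v2/silences" {
			body, err := io.ReadAll(r.Body)
			if err != nil {
				http.Error(w, "bad request", http.StatusBadRequest)
				return
			}
			r.Body = io.NopCloser(bytes.NewReader(body)) // restore body for the proxy

			var s silence
			if err := json.Unmarshal(body, &s); err != nil || tooBroad(s) {
				http.Error(w, "refusing overly broad silence", http.StatusBadRequest)
				return
			}
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":9094", nil))
}
```

The nice property of this shape is that Alertmanager itself stays vanilla: all the opinionated policy lives in a thin layer you control, which is exactly what upstream was suggesting.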
One interesting night in 2021, a relative of mine informed me that their Minecraft server provider, Hypixel, had sent them a really strange email about a new vulnerability that they needed to upgrade for. I noted that it seemed like a rather interesting vulnerability affecting a Java library that some of our components (mainly our ELK stack) used, so I cut an internal VULN ticket and then clocked off for the day. At about 9pm, I got one of the most terrifying voicemails I’ve ever received (and that I still have saved to this day): John Graham-Cumming (our CTO), calling to say “Colin, can you please get online? We need to chat”. Coming to terms with the idea that I was about to get fired, I logged on and got pulled into the incident response for the as-yet-unnamed Log4Shell vulnerability. I pointed out my worry that user-supplied data could flow through the logging pipeline until it hit a Java component, and was immediately told to shut down the entire logging pipeline. Fair enough. It took a few hours, but eventually one of my colleagues, Pradeep, came up with a nice hack to disable the vulnerable feature in our systems, which I’m pretty sure is still in place to this day, even though we’ve long since upgraded.
A strange thing we discovered during that day, though: there were a few logs in our logging pipeline that were gibberish. Like, random binary gibberish. If you were a Cloudflare engineer at a certain time, random binary gibberish might give you flashbacks to CloudBleed, where Cloudflare leaked random memory across the web, so naturally this made a few hairs stand on end. We never did figure out where that gibberish was coming from, but it pointed some fingers at Syslog-NG, our daemon for collecting logs from each machine, which is notably written in C. This isn’t to bash Syslog-NG - it’s actually exceedingly good software - but it started to draw unwanted attention. Thankfully, OpenTelemetry Logs had just had its initial specification written, and so we started our first foray into OpenTelemetry at Cloudflare, using the OpenTelemetry Collector to replace Syslog-NG. We eventually documented this on the Cloudflare blog, but at the time, this was a huge bet on a relatively unproven technology. I’m glad it worked out though, as it started my relationship with the OTel community that continues to this day.
In my final days at Cloudflare, my work mostly focused on experimentation. Our ELK stack was bursting at the seams, so we experimented with shipping our logs to a new backend and proxying Kibana queries to it (building on the work of Uber, which was continued at Quesma before they sold it to Hydrolix and pivoted to AI slop). We also experimented with shipping Prometheus metrics to the same backend as an alternative to Thanos, our long-term metrics solution. That was an experiment I’m glad didn’t work out, as it allowed our newest hire, Micheal, to do it so much better in recent months with his work writing Prometheus blocks as Parquet.
Overall, I’d like to think I did a good job here - I know I made some bad calls over the years, but I’m happy with where we ended up, and having kept in contact with the engineers I left behind, I’m so excited for what they’re doing, and where they continue to push the boundaries in the Observability space.
Leaving
Finally, in September 2024, it came time for me to leave Cloudflare and move back to the UK for personal reasons. It was, and will remain, one of the hardest decisions I’ve ever had to make: leaving behind the Observability Team, and all the fantastic engineers we cultivated in it.
My time at Cloudflare definitely taught me a lot, especially my time in Australia where for a long time I was the only engineer in a sea of sales folks. Often it seems there is a rift between engineering and sales, but seeing the other side of the business so closely opened my eyes a lot.
At the same time, the engineering challenges at Cloudflare are unlike those at any other tech company, both in terms of scale and in the unique engineering culture that Cloudflare has cultivated. Working in that culture was the experience of a lifetime, and it taught me so much about working with people, big data systems, and of course Observability - lessons I’m continuing to apply at DuckDuckGo. I’ll never forget the hard lessons of trade-offs, of not being able to please everyone or solve for every variable, and of the importance of pushing forward when you see a problem that needs solving. There are so many open problems in our world today, just looking for someone to push on them.
I'm on BlueSky: @colindou.ch. Come yell at me!