We’re happy to finally write about our latest larger-scale open source project: Crane.
We use Rancher for container orchestration extensively at Kiwi.com, and saw a need to streamline our release process. Even though Rancher’s sweet web UI makes the manual upgrade procedure at least bearable, there are still plenty of things to manage around announcing and recording deployments. Crane is our attempt at automating as much of this as possible.
However, we see Crane not just as a tool for saving time on repetitive tasks, but as a tool that became an integral part of our culture out of merit. This is due to the way it brings transparency into day-to-day operations and streamlines the flow of information around them. So, here’s how it does all that!
Crane handles all the basics you’d expect when reading the ‘to upgrade services in Rancher’ part of the pitch. Namely, you add it as the last stage in your GitLab CI config, tell it the name of your Rancher service, and it deploys your crisp, freshly built Docker image to that service. You also have the option of fine-tuning exactly how it should do this: start the new image before stopping the old one, wait 10 seconds after launching each new container, and that sort of thing. But that’s not the real magic…
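As a sketch, that CI setup could look something like the following. Note that the stage layout, image name, and variable names here are illustrative guesses, not Crane’s actual configuration keys — the README documents the real ones:

```yaml
# .gitlab-ci.yml (sketch -- the names below are illustrative, not Crane's real keys)
stages:
  - build
  - deploy

deploy:
  stage: deploy
  image: kiwicom/crane          # hypothetical image name
  script:
    - crane                     # reads its settings from CI variables
  variables:
    CRANE_SERVICE: my-service   # hypothetical variable name
  only:
    - master
```

The point is simply that Crane runs as the last stage of the pipeline, after the Docker image has been built and pushed.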
…the real magic is in the hooks (or plugins) system. Crane can be extended with all sorts of functionality to be executed based on how the release goes. Here are the ones we use at Kiwi.com, which come bundled with the public release of Crane.
- Sentry: This is one of the simple ones. Deployments are reported to Sentry, our error tracker, in accordance with their release tracking support. With knowledge of our deployments, we can identify regressions way faster and easier, as we get a nice page specifically for newly appeared errors from the latest deployment. Since Crane reports information about the deployed commits, it’s even easier to understand how a certain failure case was introduced.
- Datadog: Our ops dashboards are kept on Datadog. Other than collecting time series metrics, they also support recording events, which is what Crane sends. These events can then be overlaid on top of charts to easily identify if a change in numbers correlates with a release. Not only that, but we also get a nice, structured way to store and explore data about our deployments.
- Generic Webhook: We wanted to take things a bit further though. Our analytics team wanted to keep deployment datapoints right next to other, possibly correlated data. This plugin exports events as simple JSON objects via HTTP. The team built an API which listens for these webhooks, saves them into their data warehouse and does some other analytics magic too.
- Slack: Yeah, this is the big one. In fact, it’s so big, I’m just gonna give it a whole new heading.
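To make the generic webhook idea concrete, here is a minimal sketch of what such a deployment event could look like as JSON. The field names are hypothetical — Crane’s actual webhook schema may differ; this just illustrates exporting deployments as plain JSON objects over HTTP:

```python
import json

def build_deployment_event(service, environment, status, commit_sha, deployer):
    """Build a deployment event payload.

    NOTE: the field names here are hypothetical, for illustration only --
    they are not Crane's actual webhook schema.
    """
    return {
        "service": service,
        "environment": environment,
        "status": status,
        "commit": commit_sha,
        "deployer": deployer,
    }

# Serialize the event exactly as it would be POSTed to the listening API.
body = json.dumps(build_deployment_event(
    "booking-api", "production", "success", "f7174c3", "alice",
))
```

A receiving service only has to parse the JSON body and write it to the warehouse, which is why this plugin is so easy to build on.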
Here, just have a glance at this beauty:
You’ve got every important thing about the release right here: the commit messages of the changes and their authors, which helps everyone keep up with exactly what we’re releasing. The message also tells you who is running the release, so you know who to ping about any issues you notice.
The coolest of all is the Environment section, which updates in real time with exactly what the status of the release is — showing a cute little spinner gif if it’s still in progress, and a ❌ or ✅ emoji afterwards. Having multiple environments lets us track the rollout stages; in the screenshot above, ‘canary’ corresponds to 5% of live traffic, and ‘production’ is the rest of it. The thread also contains (timestamped) messages with updates on the status of the release.
We also have a few handy links included which give you quick access to things like new Sentry errors and Datadog dashboards. There is a warning that the branch being deployed is not master, so the code didn’t go through the full review and acceptance process, and is therefore riskier than usual.
What’s not immediately obvious, but is a huge benefit of these automated announcements, is that you get a predictable, central place for discussion of the release. If there’s something wrong, everyone knows where to read and write about it.
Finally, a somewhat unexpected benefit: publishing the commit log to the entire company on Slack drives engineers to write more descriptive and readable commit messages, which is always a worthwhile investment.
“The most ridiculous, stupid, and effective hack I’ve ever written”
To conclude this announcement, I’d like to digress from my ‘sales pitch’ and take just a moment to mention a silly little line of code from Crane. The Slack message updates are executed in multiple, totally independent CI jobs and sandboxes, so these processes somehow need to share state with each other.
It’s not really desirable to maintain a database (and set up the database connection URL in every single project’s variables) just to be able to find the Slack message ID that corresponds to a given deployment ID. While trying to figure out how to store this data with the least effort, a crazy thought came to mind — we’re already definitely guaranteed to be connected to one external dependency: Slack.
So, is there a way to store this data in Slack? As it turns out, yes! But it feels, really, really dirty. Slack supports some basic message formatting, including links with custom link text like this:
<https://kiwi.com|Whee, cheap flights!>. Messing around with the API showed that the link text can actually be whitespace-only, effectively making the link invisible in the message, like so:
<https://kiwi.com| >. And if the link is invisible, we can put anything in the URL to hide it in a message. In our case, Crane embeds the deployment ID (constructed from the commit hashes) in the URL, like this:
<https://f7174c3a3f83f915b17302b0091ce1cd9295e9fc8a9c17e6c87bcc2b1bfec4c618931285dda0b0d.com| >, which is included in the release announcements. Now, if you request the latest 100 messages from the #releases channel, you basically get the mapping from message IDs to deployment IDs, without having to maintain any sort of data store!
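The trick can be sketched in a few lines of Python. This is an illustration of the idea, not Crane’s actual implementation — the helper names and the exact URL shape are made up for the example:

```python
import re

def hide_deployment_id(deployment_id):
    """Encode a deployment ID as an invisible Slack link.

    Slack renders <url|text> as a link with custom text; with
    whitespace-only link text, the link disappears from the rendered
    message, but the URL survives in the raw message payload.
    """
    return "<https://{}.com| >".format(deployment_id)

def find_deployment_id(raw_message_text):
    """Recover the deployment ID from a message's raw text, or None."""
    match = re.search(r"<https://([0-9a-f]+)\.com\| >", raw_message_text)
    return match.group(1) if match else None

# The invisible marker gets appended to the release announcement...
marker = hide_deployment_id("f7174c3a3f83")
announcement = "Deploying booking-api to production " + marker

# ...and later CI jobs scan recent channel messages for their deployment ID.
recovered = find_deployment_id(announcement)  # -> "f7174c3a3f83"
```

Each CI job can then fetch the channel’s recent message history, run the second function over it, and find the Slack message it should update — Slack itself acts as the shared data store.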
This solution does introduce a tiny race condition, but after 7 months of internal use of Crane for managing our releases, we still haven’t encountered any bugs due to this. Amazing, and horrifying at the same time.
So if all of this sounds like something you’d want to use, be our guest! The README tells you exactly how to get started. If your service is already running in Rancher, and you’re using Docker images tagged with a commit hash (as you should!) then it should be possible to get Crane up and running within 5 minutes. Good luck!
Thanks are in order for the efforts of Simone E, the author of Crane version 0, and Jakub Sobotka, who designed the logo for us.
It also feels appropriate to mention a similar project you might want to check out, called rancher-gitlab-deploy.