We’ve been using Looker as a self-service and reporting tool at Kiwi.com for more than three years. Onboarding this tool has been an interesting journey marked by many realizations, some of them too late. Many of them could have been avoided if we had known then what we know today.
That’s why we decided to record our journey and help others out there avoid our mistakes.
Three Years Ago
Our story begins in 2019. Back then, we were using several reporting tools for analytical data simultaneously: GoodData, Retool, Datadog, and Google Data Studio. Reports were built by over 30 analysts and an unknown number of managers and developers, and consumed by over 300 users. Such a setup was unmanageable and opaque, and it led to many duplications and biased measures that affected KPIs, since anybody could create anything.
As Kiwi.com was growing, we needed a single source of truth, so that’s when we decided to use Looker.
There were many reporting tools on the market, so why did we choose Looker? The first reason was that Looker is not only a reporting tool but also offers self-service exploration features. Our analysts were overwhelmed by repeated questions from business data consumers. Why spend analysts’ precious time on easy, repetitive tasks when people can find the answers themselves, and instead invest that time in something more advanced?
Another reason was that Looker doesn’t store a copy of your data. It queries your database instance directly, so you keep all the data under your control and always have immediate access to its latest state.
Once we decided to start onboarding Looker, we realized we had nobody on the team experienced with this tool. Setting up the whole instance required knowledge and somebody’s full-time attention. But we successfully ignored this fact and started doing everything on our own from scratch, without any Looker experience or best practices… as a side project.
How and Why (not) to Set up Things
Before going through the individual areas of our Looker setup, it’s important to mention that we are talking about a mid-size IT company with more than 1,500 employees around the world. Each area below is described briefly. The goal is not to cover every issue in detail, but rather to give readers an overview of what may or may not be the wrong approach.
We created one repository for all analytics data. At the beginning, it seemed like a good idea. We expected that having everything in one place would make it easy to access other teams’ data, reduce duplication, and encourage the sharing of basic datasets. Unfortunately, that wasn’t the case, and it turned out to be a mistake.
After three years, we ended up with an endless number of duplicated views and a highly confusing folder structure. We are currently migrating to a hub-and-spoke model, where each data team has its own repository (a spoke) and general models used by multiple teams are defined in the hub.
Most of the data in the database we work with is structured in data models as a star schema following Kimball’s methodology, meaning we have fact and dimension tables. We decided to copy those tables 1:1 into Looker: one fact table became one view, and the same for each dimension table.
But you know, the devil is in the details, and after creating a couple of explores, we were forced to duplicate dimension views on purpose. Why? Because referencing one dimension view from two fact views can create unexpected relationships once you join those fact views together within a single explore. In the end, we had to constantly watch out for accidentally created, unwanted relationships. To avoid that, we started creating duplicates with various creative prefixes and suffixes, and that’s a straight road to a mess.
Another issue is that referencing two different columns of one fact view against the same dimension view creates a relationship where those two columns are treated as one. A typical example: we have one dimension table for countries, but two columns in the fact table, the source and the destination.
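One LookML pattern for the source/destination case, as an alternative to physically duplicating the view file, is to join the same dimension view twice under different aliases with the `from:` parameter. A minimal sketch with hypothetical view and field names:

```lookml
# Hypothetical explore: the countries dimension view is joined twice,
# once per role, so source and destination stay independent.
explore: bookings {
  join: source_country {
    from: countries
    sql_on: ${bookings.source_country_id} = ${source_country.country_id} ;;
    relationship: many_to_one
  }
  join: destination_country {
    from: countries
    sql_on: ${bookings.destination_country_id} = ${destination_country.country_id} ;;
    relationship: many_to_one
  }
}
```

Each alias appears as its own set of fields in the explore, so filtering on the source country no longer silently constrains the destination.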
The solution? We have recently decided to create a dedicated denormalized table for each fact table in the database. Denormalized tables give us a tidier folder structure and require significantly fewer joins in explore definitions.
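With a denormalized table, the explore can shrink to almost nothing, since the dimension attributes are already flattened into the fact table. A sketch with hypothetical table and field names:

```lookml
# Hypothetical view on a denormalized bookings table:
# dimension attributes are pre-joined in the warehouse,
# so the explore needs no joins at all.
view: bookings_denorm {
  sql_table_name: analytics.bookings_denorm ;;

  dimension: booking_id {
    primary_key: yes
  }
  dimension: source_country_name {}
  dimension: destination_country_name {}
  measure: total_bookings {
    type: count
  }
}

explore: bookings_denorm {}
```

The trade-off is extra storage and an upstream transformation step, but the LookML stays trivially readable.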
We created a huge number of models, because we created one model per explore. Why? Because we didn’t understand that a model is mainly a combination of a connection and a data area, and should serve as a roof for multiple explores.
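In practice, the “roof” idea means one model file per connection and data area, with all of that area’s explores declared inside it. A sketch with hypothetical connection, file, and explore names:

```lookml
# bookings.model.lkml — one model covering a whole data area,
# acting as a roof for several explores on the same connection.
connection: "analytics_warehouse"
include: "/views/**/*.view.lkml"

explore: bookings {}
explore: refunds {}
explore: payments {}
```

One model per explore, by contrast, multiplies connection settings, include lists, and access grants for no benefit.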
Previously I mentioned that most of the data we work with is structured in data models. Unfortunately, there were cases where the source data had no, or only minor, transformations applied, and we did all the transformations inside Looker. Even though we knew it was not a good idea, we always said, “It’s a temporary solution; once we have time we will fix it”, but you probably know how it goes with “temporary” solutions. Heavy transformations inside Looker should be an exception, not a habit: they have a significant negative impact on Looker performance.
Having over 20 Looker developers can be overwhelming, and it’s easy to lose track. That’s why we decided to have ambassadors per data area. An ambassador is a Looker developer from an analytics team who understands the data and is willing to help colleagues learn Looker development and keep the data and its structure tidy. We hold a meetup for ambassadors once a month to discuss the current state of our Looker instance and share experiences. We believe it’s a good way to keep all teams in the same boat.
At the beginning, we had only one user role and one viewer role. Then we felt it wasn’t enough and created too many roles: one viewer and one user per data area. We ended up with a confusing structure, and our users didn’t know which access rights to ask for. So later we created a general user and viewer role covering the majority of models, plus a few special user/viewer roles for the sensitive data of our HR and Finance teams.
For us, the main goal of using Looker is to let people self-serve their needs and create content on their own. Creating dashboards shouldn’t be reserved for analysts. Project managers or developers may also have needs and legitimate ideas for insightful dashboards, and if they are capable of creating dashboards on their own, we should let them.
Of course, letting users create dashboards can bring up a lot of obstacles: invalid or biased conclusions from misused explores, misunderstood data, out-of-date dashboards, and so on. Still, we let users create their own dashboards and share them widely with colleagues. When did we find out we had a problem? When we kept receiving Slack questions about popular dashboards sitting in personal folders, dashboards that had never been validated and were sometimes obsolete or performing very badly. At that point, we decided to remove the All Users group from all personal folders. So if people want to share their dashboards more widely, they have to cooperate with an analytics team on moving them to a supervised shared folder.
Communication Across the Company
Setting up Slack channels dedicated to Looker was a good idea. We created one for questions from our users (#plz-looker) and one for Looker developers, where we share news, ideas, and knowledge (#looker-devs).
Since we had only one project, we also had only one, really messy, repository. We made merge requests mandatory, with approval from at least one person. And that’s all we did. It took us almost three years to realize we needed tests in the pipeline, because otherwise people won’t always follow best practices. We used one of the open-source LookML linters, so it actually didn’t require a lot of work on our side.
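As an illustration of how little work this takes, here is a minimal sketch of a merge-request lint stage, assuming GitLab CI and the open-source LAMS linter (`@looker/look-at-me-sideways`); the job name and image are placeholders:

```yaml
# .gitlab-ci.yml fragment (hypothetical) — lint LookML on every merge request
lookml-lint:
  image: node:18
  script:
    - npx @looker/look-at-me-sideways --reporting=no
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```

With a stage like this, style and modeling rules are enforced automatically instead of relying on reviewers to catch them.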
System Activity is a wonderful way to check the performance of explores, dashboards, and schedules, to find broken content, and much more. Unfortunately, we still don’t have any automatic alerts for broken content, so we only check it manually from time to time. We then write a Slack message that something is wrong, ask people to fix it, and keep asking in private messages until it’s finally fixed. This approach is time-consuming and doesn’t cover all cases, so it’s not very effective either. Hopefully, one day we will come up with an automated solution.
Surprisingly, we did manage to automate one part of the System Activity checks: handling unused content. We have a bot which automatically moves content that has been unused for more than 90 days to the trash. It helps us easily get rid of outdated content.
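The selection logic of such a bot is simple; here is a sketch under the assumption that you can fetch each piece of content’s last-viewed timestamp (e.g. from System Activity via the Looker API). The function and variable names are hypothetical, and the actual move-to-trash step would be a separate API call:

```python
from datetime import datetime, timedelta

STALE_DAYS = 90  # content unused this long is a candidate for trash


def is_stale(last_viewed_at, now=None):
    """Return True if content has not been viewed in the last STALE_DAYS days."""
    now = now or datetime.utcnow()
    if last_viewed_at is None:  # never viewed at all
        return True
    return (now - last_viewed_at) > timedelta(days=STALE_DAYS)


def stale_dashboards(dashboards, now=None):
    """Filter (id, last_viewed_at) pairs down to the stale IDs."""
    return [d_id for d_id, last in dashboards if is_stale(last, now)]
```

The returned IDs would then be passed to whatever API call your bot uses to move content to the trash folder.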
Onboarding a new tool is always a journey with a lot of ups and downs. Our lessons learned:
- Decentralize the repository if more teams are involved
- Don’t copy tables 1:1 from the database to Looker. One big table (OBT) is more transparent than a star-schema table architecture in a Looker repository
- A model is a roof for multiple explores based on a connection and a data area. Keep it simple.
- Building transformations inside Looker on raw tables should be an exception
- Make sure that every analytics team participates in Looker development and organize periodic meetings with their representatives
- Keep access management simple and transparent. If possible, require special rights only for sensitive data.
- People shouldn’t create and keep production and popular dashboards in their personal folders
- Be available for users through a Slack channel dedicated to Looker
- People sometimes forget best practices when developing LookML models. Tests in a GitLab pipeline enforce them for you.
- Regularly check System Activity dashboards and automate alerts based on them. It helps keep your instance cleaner.
- Delete unused content automatically