Production experience with Google Data Catalogue at Kiwi.com
Nowadays when data is spread across various systems of heterogeneous technology stacks, the catalog of a company’s data assets becomes crucial. The right choice of a data catalogue has to go hand in hand with the right usage of a tool and with understanding its limitations.
A data catalog can serve various needs of users like data scientists, data analysts, data designers, and others. Based on the survey and interviews we identified that the most crucial aspects for internal users in Kiwi are:
- Trust that data are curated and of known data quality
- Data discovery allowing for quick and easy discovery of data sets
- Data lineage for an understanding of where data are coming from and what happens to them
Several months ago we at Kiwi decided to use Google Data Catalogue (GDC). This blog aims to explain how we mapped these requirements to GDC functionality, what are the features we benefit from most and what are the limitations we hit.
Google Data Catalog in Real Life
Shekhar Bapat product manager of Google Data Catalog in 2019 said:
“Google has been dealing with a large number of datasets and felt the need to build something for itself. And we are in the third iteration of this internal data catalog that handles more than a million objects and it’s used by more than 30 000 active Googlers.”
Eventually, Google decided to make the data catalog available to GCP customers as a fully managed service. GDC comes with reasonable pay-as-you-go pricing and well-written documentation. These are the featurs we find the most interesting:
- BigQuery and PubSub assets are discovered and ingested into the catalog automatically
- Tag templates which allow for complex tag structures
- Powerful search
No matter what are the coolest features of a tool, a data catalog is useful only if it contains relevant entries. To populate the GDC with BigQuery and PubSub assets is a piece of cake since they appear there automatically, but what about the other assets?
We scrape and ingest metadata from sources like:
- 100+ Postgres databases
- Apache Airflow — workflow management platform used for ETLs
- Looker — a reporting tool
By doing this we not only get a list of company data assets but also a 360 view on what are the source tables, which Airflow dags do ETL over them, what are the output tables from transformations, which Looker explores are fed by what BigQuery or Postgres tables and so on. These additional lineage metadata are then stored in custom tags assigned to data catalog entries.
Having all the relevant entries in the catalog and having them easily searchable, in the case of BigQuery datasets also easily explorable, is covering the need for data discovery capability. Capturing a 360 view of how the assets relate to each other is a step towards data lineage. But what about the trust?
Data quality is identified by users as the most important aspect. We have developed an internal tool to execute user-defined business data quality checks as well as technical ones (consistency, freshness, timelines, …). For the latter, the data catalog again comes to the game, since we use it to retrieve the list of data assets for which data quality checks are generated.
Full Recall is Not Possible
Soon after we started using GDC in production we noticed that sometimes when a table is dropped from the postgres database, it still remains in the catalog. Our data catalog synchronizer job was doing the following steps:
- connect to the source database and scrapes metadata like a database schema, tables, columns in case of database, in case of Airflow parse the dag bag
- prepare entries for the data catalog
- search catalog for entries and deletes obsolete ones, it means those which are in the catalog, but not in the prepared entries
- ingest the entries to catalog by updating existing ones and inserting new ones
It turned out that the issue is in the third step: search for entries, because as stated the documentation :
“…Data Catalog search queries don’t guarantee full recall. Results that match your query might not be returned, even in subsequent result pages. Additionally, returned (and not returned) results can vary if you repeat search queries.”
In the concept of information retrieval, recall is understood as the number of entries retrieved compared to all the relevant entries in the set. GDC comes with two methods to retrieve entries, search and list entries. Their recall is significantly different. The following comparison of retrieved entries comes from testing both methods against the instance of the data catalog which was previously populated on purpose with entries such that both entrygroup and system are set to value my_database_no1
The list entries method seems to be of significantly higher recall, but is it a full recall? The explanation from Google support team comes in clear words:
“Data Catalog does not guarantee full recall in the mentioned methods below as well. For a search request that has >1000 results, full recall is not guaranteed.
The fact that GDC does not guarantee full recall implies that:
A data catalog is not the optimal place to store and retrieve the whole set of entries per system, meaning it’s great for queries like: “find all the BigQuery tables in bookings data set which contain seat number”, but not so much for “give me all the Postgres tables”
Cannot List BigQuery Entries
As demonstrated in the previous section there is a list method, which retrieves entries belonging to the particular entry group. Entry groups are custom entities created by the user. Unfortunately there is no way to list data catalog assets of BigQuery or PubSub, as they do not belong to entry group. To put it in simple words, if we would like to get all BigQuery entries by calling https://datacatalog.googleapis.com/v1/projects/myproject/location/europe-west1/entryGroups/@bigquery/ , it will not work.
Length of Tags is Limited
Every entry can be assigned with tags, even better tags can be structured to follow a schema defined by tag templates. We make use of tags to assign a semantic meaning, data lineage, and other additional information to data assets.
The issue is that the tag can hold only 2000 bytes. Why would anybody need more than 2000 bytes? As already said, we are storing the lineage information. Let’s say table_a is used in transformations performed by Airflow dags, so that it is input to dag_1, dag_2, dag_3 and output from dag_x, dag_y, dag_z, … In some cases the string of dag names is so long that we can’t capture the whole information in a single tag.
This blog summed up the benefits and limitations from the production experience with Google Data Catalogue.
Great thing is that there are no worries about the number of available licenses — anybody who has access to the GCP project can use the Data Catalog. A fee is paid per number of calls.
Very convenient is the native integration with BigQuery and PubSub. It would be even better if the Google Cloud Storage filesets were included in the catalog out of the box. At this point they can be linked to the entry group, but this requires a user action.
Due to missing functionality for full recall, we needed to patch the way we synchronize data catalog entries and redesign our approach for further integrations from batch scraping towards event-driven.
A similar consideration applies to the way we generate automatic data quality checks. The understanding that a Google Data Catalog is not a 100% reliable list of data assets leads us to rethink the structure of entries and the generation of data quality rules.
The tagging functionality is nice, with ready-to-use tag templates and with the possibility to define own ones. The downside is the tag size — once again it turns out that it’s a matter of good design to understand the limitations and use the tool in accordance with them.
Looking for a new adventure in Engineering? Check our jobs.kiwi.com! ????