Apr 20, 2021

CatBoost C++ library

In this article, you will find information about our implementation of the CatBoost C++ library, the reasons why we wrote it, and how to optimize C++ computations based on memory reorganization and vectorization instructions.

Kiwi.com open sourced a new CatBoost applying library written in C++ with reasonably good optimizations and without dependencies. It means you can use this library in your projects to apply CatBoost models without huge catboost official library dependency. Moreover, it is possible to integrate this library into your build process.


In 2009 Yandex introduced the MatrixNet algorithm and started using it for ranking web pages. The algorithm itself is a gradient boosting on decision trees. This implementation’s main advantage is that it is a great deal easier to tune parameters for MatrixNet than for XGBoost or other concurrent implementations.

Anyway, until the year 2017, nobody but Yandex and CERN could use MatrixNet because it is a proprietary product. But time had changed in 2017, and Yandex released the CatBoost. CatBoost is a very similar algorithm and implementation, but it has support for categorical features and is an open source library so that you can use it in your products. At that time, I was working in Yandex and could say that it was a great success for the company. CatBoost is supported by Apple CoreML, and lots of companies are using this library for Machine Learning tasks.

It is not a surprise that when it came to solving prediction problems in Kiwi.com, where I’ve been working as a C++ developer, I’ve decided to use the CatBoost model as a baseline. And finally, it’s finished as the best ML algorithm in the long list of competitors. But here was a problem: it is not easy to integrate the original CatBoost library into your project if you are not using ya.make (and you don’t). I’d like to have open source ya.make; it is a great build system with a nice command line interface, but we are living in the real world. So, I had two options:

  1. To take Yandex “balancer” project and extract CMake related code and then implement build scripts for CatBoost.
  2. Implement CatBoost library itself.

Yes, I know that it is not a good solution to implement everything myself, but it is not hard to do, so I decided to implement CatBoost.


Gradient boosting on decision trees is a very simple thing in terms of applying a model. But let’s start with some simple things. Given we have some features space and we have some train dataset:

We want to construct approximation function f(x) in the form:

where g(x) is computed using the following algorithm. For each g, we select some number of features (let’s say k) and prepare a vector of border values. Then we compare each of the selected features with the border and return the value from a table with prepared values.

Training of this model is a very tricky process, but applying is simple. So let’s get started.

Naive implementation

We should have some data structure for a single tree, and then we can write some simple functions. For example:

struct Tree {
std::vector<float> borders;
std::vector<int> indexes;
std::vector<double> values; double apply(const std::vector<float>& x) const {
unsigned idx = 0;
unsigned one = 1;
for (size_t i = 0; i < borders.size(); ++i) {
if (x[indexes[i]] > borders[i])
idx |= one;
one <<= 1;
return values[idx];

The full model is a vector of such trees. It is a very obvious solution, but it has a lot of disadvantages. And the main one is the most problematic. Memory access is not cache friendly if we use such a structure.

Here I want to write several words about the memory model of the current CPUs. Typically CPU has several caching layers, and we can use values from the L0 cache immediately. Still, we need to stop execution and wait until data is loaded to the cache for reading data from RAM. This happens because memory works on different frequencies, and those frequencies are much lower than CPU frequency. So we can’t just load data from RAM; we need to wait. And one of the consequences of this is that it makes no sense to load only one word of data, so the CPU loads some amount of bytes named page. So, if we have sequential data in the RAM, we will have better performance.

Memory optimization

So, to do it better, we need to write borders and indexes into the same array. That’s easy to do, so let’s do it:

struct Tree {
struct SimpleCompare {
float border;
int index; bool operator()(const std::vector<float>& x) const {
return x[index] > border;
std::vector<SimpleCompare> cmps;
std::vector<double> values;
double apply(const std::vector<float>& x) const {
unsigned idx = 0;
unsigned one = 1; for (size_t i = 0; i < cmps.size(); ++i) {
if (cmps[i](x))
idx |= one;
one <<= 1;
return values[idx];

Even this version works much better, but it is not all. It is better to write all the SimpleCompare structures from all the trees into the same array and use only one vector to access them. And it is possible to write all values in the same array. In such a case, we will have only a number of tree levels in the structure and will operate sequentially in terms of memory. This behavior allows us to be faster than the CatBoost model compiled into C++ code. And I think that it is not too bad even now!

But here we come to the next problem. We are loading border and index values for each sample we want to predict the output. It is not effective, and we can do it better.

From scalars to vectors

To be even faster, we need to use SIMD (Single Instruction Multiple Data) instructions. SIMD was introduced in the early 1970s for supercomputers and later has been implemented in lots of CPUs. For the Intel x86 series, we can say that all CPUs starting from the Pentium MMX have support for some SIMD instructions set.

What is SIMD exactly for? It is very simple technology: this set of instructions allows to load several (typically 2, 4, 8, 16) values on registers and operate them simultaneously. For example:

float x[4], y[4], z[4];
for (int i = 0; i < 4; ++i)
z[i] = x[i] + y[i];

This code could be implemented using only one instruction, and it will be executed in a very short time (depending on CPU). So we can compare 4 values, add 4 values, and process values using 4-groups.

You can look into details in the catboost-cxx repository, but here I want to describe the thing in general.

We have two ways to parallelize computations on the decision trees. The first one is to load values by 4 throw layers (vertical parallelization), and the second one is to load 4 trees simultaneously and apply them in parallel. The second way is very simple, and all we need is to replace scalars with SSE registers. The only tricky part is:

if (cmp)
idx |= one;

should be replaced with

idx |= (one & cmp);

But when it comes to buckets, we can see that Yandex implementation is a great deal faster. And to be comparable, we need to be a little more parallel. So I’m going to another axis in parallel, and this axis is a group of examples. We can load 4 values from 4 examples at once and compare them with vector borders. It allows our library to be only 2 times slower on batches but 2 times faster on single values. Final results could be found on the catboost-cxx results page.

Some results:

Results table

Limitations and future

Now the library does not support multitarget models. We will implement them in the future. Also, we don’t support categorical features yet. The performance could be better, and I’ll try to make it even faster.

Everyone is welcomed to submit issues and open pull requests on GitHub.

Join us

Searching for a new adventure in Engineering? Check our open positions.

Featured articles
Generating SwiftUI snapshot tests with Swift macros
Don’t Fix Bad Data, Do This Instead