Introduction to Data-Centric Query Compilation
How modern databases transform SQL into efficient imperative code.
In 2011, Thomas Neumann published the paper “Efficiently Compiling Efficient Query Plans for Modern Hardware”, which introduced the notion of “data-centric” query compilation. So the ideas presented here are not new at all. But since I’m currently working on implementing them and there aren’t many resources on the topic besides research papers and source code, I decided to write this article.
How traditional query execution engines work
I will quickly go over the Volcano execution model. Almost all traditional and modern systems use it or some variation of it, like the vectorized model. But I won’t spend too much time on it, since plenty of information is available online. I also assume that you are already familiar with the basic operators of relational algebra.
So let’s say we are running the simple query SELECT a * 2 FROM R WHERE a > 5. The resulting query plan will look
somewhat like this:
The plan consists of a projection, a selection and a table scan. Each operator implements a next() function which is responsible for producing a single tuple. To start, you call next() on the root operator (the projection in this case). It in turn calls next() on the selection. The selection operator calls next() on the table scan, which outputs the next tuple of R. The selection node repeatedly calls next() on the table scan until it finds a tuple that satisfies a > 5. It then passes that tuple up to the projection, which finally returns it back to you.
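For concreteness, here is a minimal sketch of what this looks like in Java. The Operator interface, the Tuple record and the concrete classes are my own illustrative names, not taken from any real system:

```java
import java.util.Iterator;
import java.util.List;

// A minimal sketch of the Volcano model for SELECT a * 2 FROM R WHERE a > 5.
interface Operator {
    Tuple next(); // returns null once the input is exhausted
}

record Tuple(int a) {}

class TableScan implements Operator {
    private final Iterator<Tuple> it;
    TableScan(List<Tuple> table) { this.it = table.iterator(); }
    public Tuple next() { return it.hasNext() ? it.next() : null; }
}

class Selection implements Operator {
    private final Operator child;
    Selection(Operator child) { this.child = child; }
    public Tuple next() {
        // keep pulling from the child until a tuple satisfies a > 5
        Tuple t;
        while ((t = child.next()) != null) {
            if (t.a() > 5) return t;
        }
        return null;
    }
}

class Projection implements Operator {
    private final Operator child;
    Projection(Operator child) { this.child = child; }
    public Tuple next() {
        Tuple t = child.next();
        return t == null ? null : new Tuple(t.a() * 2); // compute a * 2
    }
}
```

Note how every single result tuple travels through a chain of virtual next() calls.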
This approach is pretty intuitive. It directly works with the query plan representation that database systems use anyway. It’s operator-centric. But it’s also pretty inefficient: we had to perform at least two function calls to produce a single tuple for the simplest possible query. next() is usually implemented as a virtual function, which adds the overhead of dynamic dispatch on top of that. The vectorized model improves on this by passing batches between the operators instead of single tuples, but it still suffers from the overhead of materializing each intermediate batch.
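A quick sketch of the vectorized idea (names are mine again): a single call now filters a whole batch of values, so the dispatch cost is paid once per batch rather than once per tuple, but the filtered batch has to be materialized before the next operator can run.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a vectorized selection: operators exchange batches, not tuples.
class VectorizedSelection {
    static List<Integer> filterGreaterThan(List<Integer> batch, int threshold) {
        List<Integer> out = new ArrayList<>();
        for (int a : batch) {
            if (a > threshold) out.add(a);
        }
        return out; // intermediate result, materialized for the next operator
    }
}
```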
The Data-Centric model
Alright, centering query execution around the operators is inefficient. How can we do better? If we were to implement the query logic by hand, we would probably come up with something like the following pseudo-Java:
for (Tuple t : R) {
    if (t.a > 5) {
        Tuple projected = new Tuple(t.a * 2);
        emit(projected);
    }
}

We can no longer see the relational operators, only a flat loop. The code is no longer operator-centric, but instead it’s data-centric. Furthermore, a tuple can stay in CPU registers or fast caches for the entire execution of the pipeline, which makes it super fast.
Code Generation
To generate this code, we need to implement two functions on every operator: produce() generates the code that produces tuples, and consume() generates the code that acts on the tuples once they are produced. Usually the parent operator calls produce() on its child, and the child calls consume() on its parent once it has generated the producing code. Note that produce()/consume() do not exist in the generated code but only during the code-generation phase. Let’s look at pseudocode for our example:
void Projection::produce() {
    child.produce();
}

void Projection::consume() {
    code += "Tuple projected = new Tuple(${project_fields()});";
    parent.consume();
}

void Selection::produce() {
    child.produce();
}

void Selection::consume() {
    code += "if (${translate_predicate()}) {";
    parent.consume();
    code += "}";
}

void TableScan::produce() {
    code += "for (Tuple t : ${table}) {";
    parent.consume();
    code += "}";
}

We start out by calling produce() on the Projection operator. It just forwards the call to the Selection which
forwards it to the Table Scan. The Table Scan emits a for-loop that iterates over all tuples. To fill the loop body, it calls the consume() function of its parent, the Selection. Note that the Table Scan does not need a consume() since its only job is to produce tuples. Selection::consume() then generates an if-statement that filters for the tuples satisfying the predicate. Inside that if-statement, Projection::consume() is called, which finally applies the projection.
You can mentally go through the code and verify that it produces pretty much the same code as what we handwrote previously. That’s pretty neat, isn’t it? And the same idea can be applied to all other operators to build a general-purpose execution engine.
Note that I left out some implementation details to focus on the main idea. For a practical implementation you would wrap the root operator in another operator which takes care of materializing and emitting the result. You would also have to do some bookkeeping in order to avoid variable name clashes and so forth. You also wouldn’t allocate a new tuple inside the hot loop in a real system but only use primitive types to store the projected values.
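To make the idea concrete, here is a small runnable sketch of the translation above in plain Java. The generated “code” is just a string; the class names are my own, and I inline emit() at the root instead of wrapping it in a dedicated output operator, as a real system would:

```java
// Minimal produce/consume code generator for
// SELECT a * 2 FROM R WHERE a > 5 (illustrative names, not a real system).
abstract class CodeGenOperator {
    CodeGenOperator parent;
    final StringBuilder code;
    CodeGenOperator(StringBuilder code) { this.code = code; }
    abstract void produce();
    void consume() {} // leaf operators never receive tuples
}

class ScanGen extends CodeGenOperator {
    final String table;
    ScanGen(StringBuilder code, String table) { super(code); this.table = table; }
    void produce() {
        code.append("for (Tuple t : ").append(table).append(") {\n");
        parent.consume(); // fill the loop body
        code.append("}\n");
    }
}

class SelectGen extends CodeGenOperator {
    final CodeGenOperator child;
    final String predicate;
    SelectGen(StringBuilder code, CodeGenOperator child, String predicate) {
        super(code); this.child = child; this.predicate = predicate;
        child.parent = this;
    }
    void produce() { child.produce(); }
    void consume() {
        code.append("if (").append(predicate).append(") {\n");
        parent.consume(); // fill the if-body
        code.append("}\n");
    }
}

class ProjectGen extends CodeGenOperator {
    final CodeGenOperator child;
    final String expression;
    ProjectGen(StringBuilder code, CodeGenOperator child, String expression) {
        super(code); this.child = child; this.expression = expression;
        child.parent = this;
    }
    void produce() { child.produce(); }
    void consume() {
        // root of the plan: emit the projected tuple directly
        code.append("emit(new Tuple(").append(expression).append("));\n");
    }
}
```

Calling produce() on the ProjectGen root yields exactly the flat loop we wrote by hand earlier.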
Breaking the pipeline
The example we looked at consisted of a single pipeline. This is why the generated code was only a single loop. But not
all queries are this simple. Consider: SELECT b, SUM(a) FROM R WHERE a > 5 GROUP BY b HAVING SUM(a) > 10 and its plan:
The execution of this query is split into two separate pipelines. Why is that? When performing a hash aggregation, we first have to go over all tuples to aggregate them into a hash table, and only afterwards can we iterate over all groups in that hash table to evaluate the HAVING clause. This means that our single for-loop won’t cut it anymore.

On a side note: did you notice the orientation of the arrows changing in the second plan? This is because the model we use is push-based: operators push tuples to their parents, as opposed to the Volcano model, where operators pull tuples from their children.
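Written out by hand, the target code for this query consists of exactly these two loops. The sketch below (with made-up names) shows both pipelines; the hash table is the materialization boundary between them:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hand-written target code for
// SELECT b, SUM(a) FROM R WHERE a > 5 GROUP BY b HAVING SUM(a) > 10.
class TwoPipelineAggregation {
    record Row(int a, int b) {}
    record Group(int b, int sum) {}

    static List<Group> run(List<Row> R) {
        // Pipeline #1: scan R, filter a > 5, aggregate SUM(a) per group b
        Map<Integer, Integer> aggregationMap = new HashMap<>();
        for (Row t : R) {
            if (t.a() > 5) {
                aggregationMap.merge(t.b(), t.a(), Integer::sum);
            }
        }
        // Pipeline #2: scan the hash table, apply HAVING SUM(a) > 10
        List<Group> result = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : aggregationMap.entrySet()) {
            if (e.getValue() > 10) {
                result.add(new Group(e.getKey(), e.getValue()));
            }
        }
        return result;
    }
}
```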
Luckily, our produce/consume architecture supports pipeline breakers. Here is the code for the Aggregation:
void Aggregation::produce() {
    child.produce();
    code += "for (Entry e : aggregationMap) {";
    code += "    Tuple group = new Tuple(e.key, e.value);";
    parent.consume();
    code += "}";
}

void Aggregation::consume() {
    for (Aggregate aggregate : aggregates) {
        code += "int currentSum = aggregationMap.getOrDefault(${aggregationKey}, 0);";
        code += "aggregationMap.put(${aggregationKey}, currentSum + ${aggregate.expression});";
    }
}

The produce code is a bit more involved than before. But only a little bit. We still call produce() on our child which
will generate the for-loop over R and the if-filter. Inside of that, Aggregation::consume() is called. Here we have to fill the aggregation hash map. We go over all the aggregates in our query (only SUM(a) in our case) and generate code that updates the aggregate in the hash map. The code only shows the logic for sum aggregation, but other kinds work similarly. Once the call to child.produce() returns, we have generated a loop that goes over all tuples satisfying a > 5 and aggregates them into a hash map. That wraps up Pipeline #1.
Then we start a new pipeline by iterating over the aggregation map. Only now do we call consume() on our parent to generate the filter for our HAVING clause.
You should go ahead and verify that calling produce() on the root Selection node indeed generates the correct code. Of course, the code shown here is far from usable. But getting it to run is a fun exercise that I encourage you to try. Alternatively, think about how you would implement joins using the produce/consume model (consider which input side of a hash join breaks the pipeline and why).
Trade-Offs
The code we generated is pretty much as fast as it gets, since it matches the code you would write by hand. So why do most modern databases still use the vectorized model? There are some trade-offs one must consider:
Code-generating databases like Apache Spark transpile the query into some high-level language, like we did, and then compile it using an embedded compiler like Janino. But this compilation step is pretty expensive. For long-running queries this is not an issue, as the compilation time is easily amortized, but for small queries it can cause significant latency spikes. Umbra works around this by first emitting x86 machine code for a query and adaptively switching to LLVM-optimized code if the query is still running once that finishes compiling.
In addition to that, building a production-ready query compiler is a major feat of engineering. It requires very deep knowledge of the architecture you are compiling for. It’s like writing a database and a compiler at once. Debugging the generated code can also be extremely tedious.