Sometimes, I don’t know why I do things. This is one of those times. A few months ago, Python 3.13 got its JIT engine, built with a new JIT compiler construction technique (copy-and-patch, cf. the research paper). After reading the paper, I was sold, and I just had to try it out with PostgreSQL. And what a fun ride it’s been so far. This blog post will not cover everything, and I prefer other communication channels for that, but I would like to introduce pg-copyjit, the newest and shiniest way to crash and segfault speed up your PostgreSQL server.
Before going any further, a mandatory warning: all code produced here is experimental. Please. I want to hear feedback from you, like “oh, it’s fun”, “oh, I got this performance boost”, “hey, maybe this could be done”, but not “hey, your extension cost me hours of downtime on my business-critical application”. Anyway, in its current state it is for experienced hackers only, and I hope you know better than to trust experimental code with a production server.
In the beginning, there was no JIT, and then came the LLVM JIT compiler
In a PostgreSQL release a long time ago, in a galaxy far far away, Andres Freund introduced the PostgreSQL world to the magics of JIT compilation, using LLVM. They married and there was much rejoicing. Alas, darkness there was in the shining castle, for LLVM is a very, very demanding husband.
LLVM is a great compilation framework. Its optimizer produces very good and efficient code, and Andres went further than anybody else would have thought possible in order to squeeze the last microsecond of performance out of his JIT compiler. It is great work, and I can’t express my love for the madness that this kind of dedication to performance is. But LLVM has a huge drawback: it is not built for JIT compilation. At least not the way PostgreSQL uses it: the LLVM optimizer is extremely expensive, but not using it can be worse than no compilation at all. And in order to compile only the right stuff, the queries that will actually benefit from the performance boost, the usual query cost estimation is used. And that is the PostgreSQL drawback that makes the whole thing nearly impossible: costs in PostgreSQL are not designed to mean anything. They are meant to be compared to each other, but say nothing about real execution time. A query costing 100 may run in 1 second, while another costing 1000 may run in 100 milliseconds. It’s not a bug, it’s a design decision. That is why a lot of people (including me) end up turning off the JIT compiler: most if not all queries on my production systems will not gain enough from the performance boost to compensate for the LLVM optimizer cost. If I can run the query 10 ms faster but it took 50 ms to optimize it, it is a pure loss.
There is one way to make the LLVM JIT compiler more usable, but I fear it will take years to be implemented: being able to cache and reuse compiled queries. I will not dig further into that topic in this post, but trust me, it is not going to be a small feat to achieve.
And in 2021, copy-and-patch was described…
So, what can we do? We want fast-enough code generated the fastest way possible. Fast-enough code means at least a bit faster than the current interpreter… But writing a compiler is painful, and writing several code generators (for different ISAs, for instance) is even worse…
This is where the copy-and-patch innovation comes into play and saves the day.
With copy-and-patch, you write stencils in C. These stencils are functions with holes, and they are compiled by your usual clang compiler (gcc support pending, too complicated to explain here). Then, when you want to compile something, you stitch stencils together, fill in the gaps, and jump straight into your brand new “compiled” function.
And that’s it. That is the magic of copy-and-patch. You simply copy the stencils into a new memory area, patch the holes, and voilà.
Of course, you can go further. You can figure out which computations can be done at compilation time, you can split loops into several stencils to unroll them, you can merge several stencils together to optimize them in a single pass (creating a kind of meta-stencil…)
This paper caught the eye of the Faster CPython team, they implemented it in CPython 3.13, and that is when more people (including me) noticed it.
Bringing copy-and-patch to PostgreSQL
So, what does it take to build a new JIT engine in PostgreSQL? Hopefully not that much, otherwise I would likely not be blogging about this.
When JIT compilation was introduced, it was suggested on hackers to make LLVM a plugin, allowing future extensions to bring other JIT compilers. Back then, I was somewhat skeptical of this idea (but never expressed that opinion, I did not want to be proven wrong later), and it turns out I proved myself wrong… The interface is really simple: your .so only needs to provide a single _PG_jit_provider_init function, which initializes three callbacks, named compile_expr, release_context and reset_after_error. The main one is obviously compile_expr. You get one ExprState* parameter, a pointer to an expression, made of opcodes. Then it is “only” a matter of compiling the opcodes together in any way you want, marking the generated code as executable, and changing the evalfunc to point to this code instead of the PostgreSQL interpreter. This is easy, and you get an automatic fallback to the PostgreSQL interpreter in case you come across any opcode you have not implemented yet.
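In sketch form, a provider skeleton looks roughly like this. The callback names match PostgreSQL’s real API, but the struct and typedefs below are simplified stand-ins for the declarations in PostgreSQL’s jit.h so that the sketch stands alone; do not treat them as the exact upstream definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for PostgreSQL's jit.h declarations. */
typedef struct ExprState ExprState;
typedef struct JitContext JitContext;

typedef struct JitProviderCallbacks
{
    void (*reset_after_error) (void);
    void (*release_context) (JitContext *context);
    bool (*compile_expr) (ExprState *state);
} JitProviderCallbacks;

/* Returning false means "not compiled", and PostgreSQL falls back
   to its interpreter: the safest possible starting point. */
static bool
copyjit_compile_expr(ExprState *state)
{
    (void) state;
    return false;
}

static void copyjit_reset_after_error(void) {}
static void copyjit_release_context(JitContext *context) { (void) context; }

/* The one symbol the .so must export. */
void
_PG_jit_provider_init(JitProviderCallbacks *cb)
{
    cb->reset_after_error = copyjit_reset_after_error;
    cb->release_context = copyjit_release_context;
    cb->compile_expr = copyjit_compile_expr;
}
```

A provider that always returns false from compile_expr is already a valid (if useless) JIT plugin; everything interesting happens once compile_expr starts producing code and returning true.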
The copy-and-patch algorithm (implemented with only a few small optimizations so far) is so simple I can explain it right here. For each opcode, the compiler looks into the stencil collection. If the opcode has a stencil, the stencil is appended to the generated code. Otherwise, the compilation stops and the PostgreSQL interpreter will kick in. After appending a stencil, each of its holes is patched with the required value.
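That loop can be sketched in a few lines. The opcode numbers, stencil bytes and function names below are hypothetical illustrations, not pg-copyjit’s actual tables, and hole patching is elided to keep the shape of the algorithm visible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcodes and stencil bytes, for illustration only. */
enum { OP_CONST = 0, OP_DONE = 1, OP_UNSUPPORTED = 2 };

typedef struct Stencil { const unsigned char *code; size_t len; } Stencil;

static const unsigned char code_const[] = {0x01, 0x02};
static const unsigned char code_done[]  = {0x03};

static const Stencil stencils[] = {
    [OP_CONST] = { code_const, sizeof(code_const) },
    [OP_DONE]  = { code_done,  sizeof(code_done)  },
};

/* Append one stencil per opcode; on the first opcode without a
   stencil, give up so the regular interpreter handles the expression. */
static bool
compile_opcodes(const int *ops, size_t nops,
                unsigned char *out, size_t *outlen)
{
    size_t pos = 0;
    for (size_t i = 0; i < nops; i++)
    {
        if (ops[i] < 0 || ops[i] >= OP_UNSUPPORTED)
            return false;   /* fall back to the interpreter */
        memcpy(out + pos, stencils[ops[i]].code, stencils[ops[i]].len);
        /* ...here the real compiler patches this stencil's holes... */
        pos += stencils[ops[i]].len;
    }
    *outlen = pos;
    return true;
}
```

The fallback is what makes the whole approach incremental: any opcode without a stencil simply means “not compiled today”, never a wrong answer.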
For instance, let’s consider this basic, unoptimized stencil for the opcode CONST.
Datum stencil_EEOP_CONST(struct ExprState *expression, struct ExprContext *econtext, bool *isNull)
{
    *op.resnull = op.d.constval.isnull;
    *op.resvalue = op.d.constval.value;
    NEXT_OP();
}
op is declared as extern ExprEvalStep op; (and NEXT_OP is a bit harder to explain, I won’t dig into it here). When building this into a single .o file, the compiler leaves a hole in the assembly code where the address of op must be inserted (using a relocation). When the stencil collection is built, this information is saved and used by the JIT compiler to fill in the address of the current opcode structure in order to get working code.
The build process for the stencils is really fun; not complicated, but fun. The first step is to compile the stencils into a single .o file, and then extract the assembly code and relocations from this .o file into usable C structures that the JIT compiler will link against.
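The shape of that extracted data is roughly the following. The field and type names are mine, invented for illustration; the structures actually generated by the extraction script will differ, but the idea is the same: each stencil is a byte array plus a list of holes, and a hole is an offset plus the symbol whose address must be written there.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical shapes for the generated stencil tables. */
typedef struct Hole
{
    size_t      offset;  /* where in the stencil bytes to write   */
    const char *symbol;  /* which address goes there, e.g. "op"   */
} Hole;

typedef struct StencilDef
{
    const unsigned char *code;
    size_t               code_len;
    const Hole          *holes;
    size_t               nholes;
} StencilDef;

/* Patch one hole: write a pointer-sized address at the recorded
   offset, exactly what the linker would have done via the relocation. */
static void
patch_hole(unsigned char *code, const Hole *hole, void *address)
{
    memcpy(code + hole->offset, &address, sizeof(address));
}
```

At compile time the JIT walks each appended stencil’s hole list and calls something like patch_hole with the address of the current ExprEvalStep, which is how the extern op reference from the stencil source ends up pointing at real data.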
And that’s about all there is to it.
At first, I was extracting the assembly code manually. That way, I managed to get the three opcodes needed for SELECT 42; to work. And there was much rejoicing. After this first proof of concept (and, I guess, some puzzled looks a few days ago at PgDay.Paris when people saw me happy about being able to run SELECT 42, that may have sounded strange), I wrote a DirtyPython (unofficial variant) script to automate the assembly code extraction, and in a few hours I implemented function calls, single-table queries, more complex data types, and introduced a few optimizations…
Current state
It works on my laptop with PostgreSQL 16. It should be fine with older releases. It only supports AMD64, because that is what I have and I cannot target everything at once. Later I will add ARM64, and I would love to find some time to add support for some exotic targets like POWER64 or S390x (these may require some compiler patches, sadly, and access to such computers, nudge nudge wink wink)…
Performance-wise, well, keeping in mind that I have spent almost no time optimizing it, the results are great. Code generation is done in a few hundred microseconds, making it usable even for fast queries, where LLVM is simply out of the game. On a simple SELECT 42; query, running without JIT takes 0.3 ms, with copyjit it takes 0.6 ms, LLVM without optimizations goes up to 1.6 ms, and optimizing LLVM requires 6.6 ms. Sure, LLVM can produce really fast code, but the whole idea here is to quickly generate fast-enough code, so comparing the two tools does not make much sense.
But still, you are all waiting for a benchmark, so here we go, benchmarking two queries on a simple non-indexed 90k-row table. This benchmark was done on a laptop, and my trust in such a benchmark setup is moderate at best; a proper benchmark will be done later on a desktop computer without any kind of thermal envelope shenanigans. And I have not optimized my compiler; it is still quite naive, and there are a lot of things that can and must be done.
Query | Min/max (ms) | Median (ms) / stddev
select * from b; — no JIT | 10.340/14.046 | 10.652/0.515
select * from b; — JIT | 10.326/14.613 | 10.614/0.780
select i, j from b where i < 10; — no JIT | 3.348/4.070 | 3.733/0.073
select i, j from b where i < 10; — JIT | 3.210/4.701 | 3.519/0.107
As you can see, even in the current unfinished state, as soon as there is CPU work to do (here, the where clause), performance relative to the interpreter improves. That is entirely logical, and the important point here is that even though the JIT is an additional, somewhat time-consuming step, it takes so little time that even these queries can run a few percent faster.
Note that even though I have implemented only a small handful of opcodes, I can run any query on my server: the JIT engine will simply complain loudly about it and let the interpreter run the query…
For the more curious, the code is dumped on github. I say dumped because I focus only on the code, not on the clarity of my git history nor on wrapping it in a nice paper with flying colors and pretty flowers; that is what you do when the code is done, and this one is not yet… If you want to build it, the build-stencils.sh file must be run manually first. Again, I do not document it yet because I simply cannot provide any support for the code in its current state.
TODO…
This is a proof of concept. I have not worked on making it easy to build, or on making it possible to package… The build scripts are Debian- and PostgreSQL-16-specific. And, well, to be honest, at this point I don’t care much and it does not bother me; my focus is on implementing more opcodes and looking for optimizations.
I really hope I will reach a point where I can safely package this and deploy it on my production servers. That way, I will keep using the LLVM JIT on the server that benefits from it (a GIS server where the queries are worth the optimization) and use this JIT on my web-application databases, where fast query time is a must and the LLVM optimizations end up being counter-productive.
I am also dead serious about porting this to other architectures. I love the old days of Alpha, Itanium, Sparc, M68k and various other architectures. I am not going to use such a system, but I miss the variety, and I really don’t want to be part of the monoculture problem here.
Thanks
First, huge thanks to my current day-job employer, Entr’ouvert. We are a small French SaaS company, free-software focused, and my colleagues simply let me toy with this between tickets and other DBA or sysadmin tasks.
I would like to thank my DBA friends for supporting me and motivating me to do this (I won’t give their names, they know who they are). BTW: use PoWA, great software, tell your friends…
Also, a quick question: it was suggested that I go to PGConf.dev to present this, but it is too late for the agenda, and since I live in France I did not intend to travel there. If you think it is important or worth it, please, please say so (comments below, or my email is p@this.domain); otherwise, see you at future European PG events 🙂