2021-09-23
Markus Demleitner
Mild warning: This is exclusively technobabble mainly addressing DaCHS
deployers. If you're an astronomer (or yet something else), you're of
course still welcome to enjoy it, but don't complain if you're bored.
My development machine has been on Debian bullseye for a while, which
means I've been running Postgres 13 for the past few months. Compared to
Postgres 11, Postgres 13 is a lot more optimistic about doing
Just-In-Time (JIT) compilation, and that's the beginning of this story.
This JIT thing in plain language means that Postgres writes small
programmes to compute query results, compiles them to machine code,
and executes that rather than running the query plan in some sort of
interpreter. This at first sounds like a great idea that should speed up
large queries quite a bit. But for one, query time is often bounded not
so much by CPU but by I/O, and the sort of analysis that happens for JIT
compilation is not free. Not at all.
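To see whether your server will consider JIT at all, and from which
estimated query cost onwards it does, you can simply ask it in psql;
this should work on any Postgres 11 or later:
SHOW jit;
SHOW jit_above_cost;
SHOW jit_optimize_above_cost;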
I noticed the cost of all that when a query in the regression test suite
I'm running before every commit to DaCHS started to fail occasionally.
That test executes:
SELECT TOP 1 obs_publisher_did
FROM ivoa.obscore
WHERE distance(s_ra, s_dec, 83.8,-5.4)<0.2
and then asserts that the result comes in within 10 seconds. The purpose of
this particular regression test is to make
sure all sizable tables in the obscore view have a usable spatial index
on the production system. On the development system, there really aren't
any tables in obscore that would be slow even when seqscanned.
How on earth could this query be slow then?
The natural reaction in such a situation is to use EXPLAIN in psql. In
this case, there is some non-trivial rewriting of the query going on
between ADQL and postgres, which means you cannot just paste the ADQL to
Postgres. To figure out the query that DaCHS actually executes, I picked
the translated query from the VOTable returned from a successful request
(look for the sql_query INFO; that's a DaCHS extension, so that
trick won't work for other TAP servers), ran the psql gavo that DaCHS
operators are probably used to, and then typed:
EXPLAIN SELECT obs_publisher_did
FROM ivoa.obscore
WHERE q3c_join(83.8, - 5.4, s_ra, s_dec, 0.2) LIMIT 1;
to it. The result was inconspicuous; a few seqscans here and there, but
the total cost estimate was “0.00..7.12”, which in physical units works
out to “basically nothing”, many orders of magnitude away from the 10
seconds I occasionally saw in the regression tests.
Well, when a query plan doesn't match your expectations, the next thing
to do is EXPLAIN ANALYZE. With that, Postgres executes the plan it has
made and then compares its estimates to what the cost turned out to be;
this, by the way, is also a good way to find out when you should raise
the statistics target of one or more of your columns (see Element
Column in the
DaCHS reference for details).
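If you want to experiment with statistics targets outside of DaCHS, the
plain-Postgres way is a one-off ALTER TABLE followed by an ANALYZE; this
is just a sketch, with made-up table and column names:
ALTER TABLE myschema.mytable ALTER COLUMN s_ra SET STATISTICS 1000;
ANALYZE myschema.mytable;
But back to the EXPLAIN ANALYZE at hand.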
For me, the output looked something like this:
Limit (cost=10000000000.00..10156565675.53 rows=1 width=57) (actual time=6206.883..6206.899 rows=1 loops=1)
[...]
Planning Time: 22.174 ms
JIT:
Functions: 130
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 55.404 ms, Inlining 107.280 ms, Optimization 3479.626 ms, Emission 2601.411 ms, Total 6243.721 ms
Execution Time: 6263.243 ms
Ok, I'm lying a bit; there is a reason other than the ANALYZE for why
the cost estimate exploded from 7.12 to 10156565675.53. I'll confess to
it in the appendix to this post.
The main point, however, is: the execution time now is of the order that
I'm expecting (the database is rather busy during a regression test, so
those 6 seconds can easily become double that then). Interestingly,
essentially all the execution time went into “Optimization” and
“Emission”. Until yesterday, I'd never seen a thing like that in
Postgres query plans.
That is because here the JIT is at work, and that was at least a lot
less likely in Postgres 11. Now, estimating 10 Gigapennies of execution
cost up front, Postgres 13 figured some extra time for writing and
compiling a little programme would be well spent. Of course, that estimate is
badly off, and the right thing to do is to fix the reason for the bad
estimate. See the appendix for why I don't just yet.
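By the way, if you want to quantify what JIT costs for a given query on
your own box, you can switch it off for just one transaction and compare;
none of this is DaCHS-specific:
BEGIN;
SET LOCAL jit = off;
EXPLAIN ANALYZE SELECT obs_publisher_did
FROM ivoa.obscore
WHERE q3c_join(83.8, -5.4, s_ra, s_dec, 0.2) LIMIT 1;
ROLLBACK;
Run the same EXPLAIN ANALYZE without the SET LOCAL, and the difference in
execution time is (roughly) what JIT is costing you here.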
That my obscore view has 32 tables contributing to it, giving its
definition a whopping 1280 lines, probably does not help. But since query
plans in the presence of Q3C and pgsphere are usually badly off anyway,
it is probably wise to discourage Postgres a bit from using JIT
compilation with DaCHS' workloads in your configuration, too, at least
if you are running TAP services (you should), and ideally before you
upgrade to Postgres 13. To do that, add a:
jit_above_cost = 20000000000
(or so; perhaps you can set your limit a good deal lower) to your
postgresql.conf. On Debian boxes, that file is in
/etc/postgresql/13/main/ (obviously, change the 13 if you have a
different version). You need to restart postgres to make this take
effect.
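If you would rather not edit the file by hand, ALTER SYSTEM from a
superuser psql session does the same job; it writes to
postgresql.auto.conf, which Postgres reads on top of postgresql.conf,
and it takes effect once you have restarted (or at least reloaded) the
server. Either way, a quick SHOW afterwards tells you whether the new
limit is actually live:
ALTER SYSTEM SET jit_above_cost = 20000000000;
SHOW jit_above_cost;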
While I was in that file, I thought I'd share what other configuration I
have in there, because it is likely you can speed up your data centre
quite a bit by judicious tuning (a quick way to check what your server
currently uses follows after the list). The following settings aren't
particularly well thought out, but I claim they are not unreasonable for
a 64 GB machine that runs as a dedicated server; that last point is also
behind the first configuration item, because for two-server operation
you have to set
- listen_addresses = '*' – only then can you talk to postgres from
another machine (disregarding hacks like ssh tunnels that may even
work as last-resort options). Of course, this may mean your postgres
port is visible to the internet, which means you ought to understand
what pg_hba.conf is before configuring that. Other configuration I'm
doing includes:
- max_connections = 200 – I actually ran out of connections once;
DaCHS itself is now a bit more parsimonious with them, but if you have
enough RAM, it still doesn't hurt to be generous here.
- timezone = UTC – TIMESTAMPs suck, because it is hard to compute
with them, they are a pain when plotting, there are time zones, and they
generally are a Babylonian mess (as evinced by base-60 numbers). But you
can't always escape timestamps, and if you somehow manage to create them
“with time zone”, telling the server to do UTC helps limit their damage
radius.
- shared_buffers = 15GB – the Postgres documentation says 25% of the
RAM is a good default for shared_buffers, so that's roughly what I went
for here. Note that the kernel usually limits how much shared memory
processes are allowed to allocate, and you will have to adjust those
limits for this to take effect. On Debian, the postgresql-common package
installs a file /etc/sysctl.d/30-postgresql-shm.conf for easy
adjusting of the limits.
- temp_buffers = 100MB – that one gives buffers for temporary
tables, and raising it helps TAP uploads (which use those, at least for
now). Since our TAP uploads tend to be large as temporary tables go, it
pays to set aside a good deal of memory for them. Now that I look at
this again and think about what people upload into my data centre: I
think I could even raise that a bit more.
- work_mem = 64MB – this one is for doing joins and the like (which
includes cross-matches), and again these tend to be larger in astronomy
than in many other disciplines, where matching tens of millions of rows
against billions would already count as Big Data. Hence, postgres'
default of 4 MB is quite certainly going to cause a lot of unnecessary
disk activity.
That said, DaCHS could be a bit smarter here and raise work_mem itself
when running TAP jobs (or perhaps only TAP jobs that actually do joins).
Note that a single query can use up many times work_mem, which means you
shouldn't choose this too high, either. One thing I'd like to look into
one day is the hash_mem_multiplier (cf. a bit further down in the
Postgres docs on resource limits).
If you do research in that direction with astronomy workloads, please
let me know.
- maintenance_work_mem = 2048MB – this is relevant to keep VACUUM
runs fast, which become necessary as rows are added to or replaced in
the database. I have some relatively large tables that regularly see
deletes (e.g., the relational registry), and hence I want smooth
vacuuming. If you don't have large tables that regularly change, you
probably don't need to bother with maintenance_work_mem.
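As promised above, here is a quick way to see where your server
currently stands on these knobs before you touch anything; pg_settings
is plain Postgres and has nothing DaCHS-specific about it:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'temp_buffers', 'work_mem',
  'maintenance_work_mem', 'max_connections', 'jit_above_cost');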
If you have additional (or contradicting) advice on Postgres
configuration for DaCHS: Please let us know, preferably on the
dachs-support mailing list (see DaCHS support).
Appendix: As I said: I was lying above. The original with-JIT plan
was just fine. The horrible, cost-10-Giga plan was only chosen when I
did the SET enable_seqscan=false. Why would I do a thing like that,
forcing Postgres in the wrong direction? Well, DaCHS' TAP executor makes
the same setting. And why does it do that to Postgres? That's a long
story closely related to the Q3C and pgsphere troubles I've mentioned
above – and for which there's now finally hope: See q3c issue #30 if you're curious.
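If you want to reproduce the effect yourself, the whole experiment fits
into one transaction, so nothing sticks; the query is the translated one
from above:
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN SELECT obs_publisher_did
FROM ivoa.obscore
WHERE q3c_join(83.8, -5.4, s_ra, s_dec, 0.2) LIMIT 1;
ROLLBACK;
The conspicuous 10000000000.00 in the resulting startup cost is Postgres'
way of saying “I had to pick a plan type you told me to avoid”, which is
also, presumably, what pushed the estimate over the JIT thresholds here.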