Artikel mit Tag Nerdstuff:

  • The Case of the Disappearing Bits

    [number line with location markers]

    Every green line in this image stands for a value exactly representable in a floating point value of finite size. As you see, it's a white area out there [source]

    While I was preparing the publication of Coryn Bailer-Jones' distance estimations based on Gaia eDR3 (to be released about tomorrow), Coryn noticed I was swallowing digits from his numbers. My usual reaction of “aw, these are meaningless anyway because your errors are at least an order of magnitude higher” didn't work this time, because Gaia is such an incredible machine that some of the values really have six significant decimal digits. For an astronomical distance! If I had a time machine, I'd go back to F.W. Bessel right away to make him pale in envy.

    I'm storing these distances as PostgreSQL REALs, so these six digits are perilously close the seven decimal digits that the 23 bits of mantissa of single precision IEEE 754 floats are usually translated to. Suddenly, being cavalier with the last few bits of the mantissa isn't just a venial sin. It will lose science.

    So, I went hunting for the bits, going from parsing (in this case C's sscanf) through my serialisation into Postgres binary copy material (DaCHS operators: this is using a booster) to pulling the material out of the database again. And there I hit it: the bits disappeared between copying them in and retrieving them from the database.

    Wow. Turns out: It's a feature. And one I should have been aware of in that Postgres' docs have a prominent warning box where it explains its floating point types: Without setting extra-float-digits it will cut off bits. And it's done this ever since the dawn of DaCHS (in postgres terms, version 8.2 or so).

    Sure enough (edited for brevity):

    gavo=$ select r_med_geo from gedr3dist.main
    gavo-$ where source_id=563018673253120;
        1430.9
    
    gavo=$ set extra_float_digits=3;
    gavo=$ select r_med_geo from gedr3dist.main
    gavo-$ where source_id=563018673253120;
     1430.90332
    

    Starting with its database schema 26 (which is the second part of the output of dachs --version), DaCHS will configure its database roles always have extra_float_digits 3; operators beware: this may break your regression tests after the next upgrade.

    If you want to configure your non-DaCHS role, too, all it takes is:

    alter role (you) set extra_float_digits=3,
    

    You could also make the entire database or even the entire cluster behave like that; but then losing these bits isn't always a bad idea: It really makes the floats prettier while most of the time not losing significant data. It's just when you want to preserve the floats as you get them – and with science data, that's mostly a good idea – that we just can't really afford that prettyness.

    Update (2021-04-22): It turns out that this was already wrong (for some meaning of wrong) when I wrote this. Since PostgreSQL 12, Postgres uses shortest-precise by default (and whenever extra_float_digits is positive). The official documentation has a nice summary of the problem and the way post-12 postgres addresses it. So: expect your float-literal-comparing regression tests to break after the upgrade to bullseye.

  • Parallel Queries

    Image: Plot of run times

    An experiment with parallel querying of PPMX, going from single-threaded execution to using seven workers.

    Let me start this post with a TL;DR for

    scientists:Large analysis queries (like those that contain a GROUP BY clause) profit a lot from parallel execution, and you needn't do a thing for that.
    DaCHS operators:
     When you have large tables, Postgres 11 together with the next DaCHS release may speed up your responses quite dramatically in some cases.

    So, here's the story –

    I've finally overcome my stretch trauma and upgraded the Heidelberg data center's database server to Debian buster. With that, I got Postgres 11, and I finally bothered to look into what it takes to enable parallel execution of database queries.

    Turns out: My Postgres started to do parallel execution right away, but just in case, I went for the following lines in postgresql.conf:

    max_parallel_workers_per_gather = 4
    max_worker_processes = 10
    max_parallel_workers = 10
    

    Don't quote me on this – I frankly admit I haven't really developed a feeling for the consequences of max_parallel_workers_per_gather and instead just did some experiments while the box was loaded otherwise, determining where raising that number has a diminishing return (see below for more on this).

    The max_worker_processes thing, on the other hand, is an educated guess: on my data center, there's essentially never more than one person at a time who's running “interesting”, long-running queries (i.e., async), and that person should get the majority of the execution units (the box has 8 physical CPUs that look like 16 cores due to hyperthreading) because all other operations are just peanuts in comparison. I'll gladly accept advice to the effect that that guess isn't that educated after all.

    Of course, that wasn't nearly enough. You see, since TAP queries can return rather large result sets – on the GAVO data center, the match limit is 16 million rows, which for a moderate row size of 2 kB already translates to 32 GB of memory use if pulled in at once, half the physical memory of that box –, DaCHS uses cursors (if you're a psycopg2 person: named cursors) to stream results and write them out to disk as they come in.

    Sadly, postgres won't do parallel plans if it thinks people will discard a large part of the result anyway, and it thinks that if you're coming through a cursor. So, in SVN revision 7370 of DaCHS (and I'm not sure if I'll release that in this form), I'm introducing a horrible hack that, right now, just checks if there's a literal “group” in the query and doesn't use a cursor if so. The logic is, roughly: With GROUP, the result set probably isn't all that large, so streaming isn't that important. At the same time, this type of query is probably going to profit from parallel execution much more than your boring sequential scan.

    This gives rather impressive speed gains. Consider this example (of course, it's selected to be extreme):

    import contextlib
    import pyvo
    import time
    
    @contextlib.contextmanager
    def timeit(activity):
      start_time = time.time()
      yield
      end_time = time.time()
      print("Time spent on {}: {} s".format(activity, end_time-start_time))
    
    
    svc = pyvo.tap.TAPService("http://dc.g-vo.org/tap")
    with timeit("Cold (?) run"):
      svc.run_sync("select round(Rmag) as bin, count(*) as n"
        " from ppmx.data group by bin")
    with timeit("Warm run"):
      svc.run_sync("select round(Rmag) as bin, count(*) as n"
        " from ppmx.data group by bin")
    

    (if you run it yourself and you get warnings about VOTable versions from astropy, ignore them; I'm right and astropy is wrong).

    Before enabling parallel execution, this was 14.5 seconds on a warm run, after, it was 2.5 seconds. That's an almost than a 6-fold speedup. Nice!

    Indeed, that holds beyond toy examples. The showcase Gaia density plot:

    SELECT
            count(*) AS obs,
            source_id/140737488355328 AS hpx
    FROM gaia.dr2light
    GROUP BY hpx
    

    (the long odd number is 235416-6, which turns source_ids into level 6-HEALPixes as per Gaia footnote id; please note that Postgres right now isn't smart enough to parallelise ivo_healpix), which traditionally ran for about an hour is now done in less than 10 minutes.

    In case you'd like to try things out on your postgres, here's what I've done to establish the max_parallel_workers_per_gather value above.

    1. Find a table with a few 1e7 rows. Think of a query that will return a small result set in order to not confuse the measurements by excessive client I/O. In my case, that's a magnitude histogram, and the query would be:

      select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;

      Run this query once so the data is in the disk cache (the query is “warm”).

    2. Establish a non-parallel baseline. That's easy to do:

      set max_parallel_workers_per_gather=0;
      
    3. Then run:

      explain analyze select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;
      

      You should see a simple query plan with the runtime for the non-parallel execution – in my case, a bit more than 12 seconds.

    4. Then raise the number of max_parallel_workers_per_gatherer successively. Make sure the query plan has lines of the form “Workers Planned” or so. You should see that the execution time falls with the number of workers you give it, up to the value of max_worker_processes – or until postgres decides your table is too small to warrant further parallelisation, which for my settings happened at 7.

    Note, though, that in realistic, more complex queries, there will probably be multiple operations that will profit from parallelisation in a single query. So, if in this trivial example you can go to 15 gatherers and still see an improvement, this could actually make things slower for complex queries. But as I said above: I have no instinct yet for how things will actually work out. If you have experiences to share: I'm sure I'm not the only person on dachs-users who't be interested.

Page 1 / 1