• Migrating Away From Wordpress

    Since 2016, this blog had been served through a Wordpress instance at the Astrophysical Institute Potsdam (AIP) – thanks again to our colleagues there for maintaining the platform over all these years.

    But since it now seems as if this blog is something that might last a long time (by Web standards), we have decided that we should leave PHP behind and look for something properly version-controllable, something that can simply live somewhere on a web server with essentially zero maintenance. Hence, we have moved the content to pelican – which has a clean Debian package, is written in Python, and does not need any active components of its own.
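    In case you're wondering what “essentially zero maintenance” looks like in practice: with a static generator, the whole publication step boils down to something like this sketch (the settings file, output path, and server name are placeholders):

    pelican content -s pelicanconf.py -o output
    rsync -av output/ blog-server:/var/www/blog/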

    As an extra bonus, the blog posts are now authored in reStructuredText, which happens to be what DaCHS' documentation is written in, and what you can use to author metadata for DaCHS resources. If you want, you can now check out the source code for the articles (sorry, it's still subversion; one of these days I'll find something fancier than naked git but lighter than gitlab, and then I'll move GAVO's VCS to git).

    As expected, porting the theme (which I only did rather half-heartedly, so things are a bit less pretty now) and getting the figures right was what caused the bulk of the work. On the plus side, I have also greatly cleaned up categories and tags. Still, it's quite likely we messed something up. If you find anything broken here, please let us know: https://www.g-vo.org/pmwiki/About/Impressum lists the main ways through which you can reach us.

    With that: Subscribe to our Atom feed!

  • Taming the Postgres JIT

    Mild warning: This is exclusively technobabble mainly addressing DaCHS deployers. If you're an astronomer (or yet something else), you're of course still welcome to enjoy it, but don't complain if you're bored.

    My development machine has been on Debian bullseye for a while, which means I've been running Postgres 13 for the past few months. Compared to Postgres 11, 13 is a lot more optimistic about doing Just-In-Time (JIT) compilation, and that's the beginning of this story.

    In plain language, this JIT thing means that Postgres writes small programmes to compute query results, compiles them to machine code, and executes that rather than running the query plan in some sort of interpreter. At first, this sounds like a great idea that should speed up large queries quite a bit. But for one, query time is often bounded not so much by CPU as by I/O, and the sort of analysis that happens for JIT compilation is not free. Not at all.

    I noticed this when a query in the regression test suite I run before every commit to DaCHS started to fail occasionally. That test executes:

    SELECT TOP 1 obs_publisher_did
    FROM ivoa.obscore
    WHERE distance(s_ra, s_dec, 83.8,-5.4)<0.2
    

    and then asserts that the result comes in within 10 seconds. The purpose of this particular regression test is to make sure all sizable tables in the obscore view have a usable spatial index on the production system. On the development system, there really aren't any tables in obscore that would be slow even when seqscanned.
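    By the way, if you want to check by hand whether your tables carry such an index, a look into Postgres' pg_indexes view does the trick. This is just a sketch; it relies on the fact that q3c spatial indexes are functional indexes over q3c_ang2ipix, which is what we grep for:

    -- list everything that looks like a q3c spatial index; obscore
    -- contributors missing from this list are seqscan candidates
    SELECT schemaname, tablename, indexdef
    FROM pg_indexes
    WHERE indexdef LIKE '%q3c_ang2ipix%';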

    How on earth could this query be slow then?

    The natural reaction in such a situation is to use EXPLAIN in psql. In this case, there is some non-trivial rewriting of the query going on between ADQL and Postgres, which means you cannot just paste the ADQL to Postgres. To figure out the query that DaCHS actually executes, I picked the translated query from the VOTable returned by a successful request (look for the sql_query INFO; that's a DaCHS extension, so this trick won't work for other TAP servers), ran the psql gavo DaCHS operators are probably used to, and then typed:

    EXPLAIN SELECT obs_publisher_did
    FROM ivoa.obscore
    WHERE  q3c_join(83.8, - 5.4, s_ra, s_dec, 0.2) LIMIT 1;
    

    to it. The result was inconspicuous; a few seqscans here and there, but the total cost estimate was “0.00..7.12”, which in physical units works out to “basically nothing”, many orders of magnitude away from the 10 seconds I occasionally saw in the regression tests.

    Well, when a query plan doesn't match your expectations, the next thing to do is EXPLAIN ANALYZE. With that, Postgres executes the plan it has made and then compares its estimates to what the cost turned out to be; this, by the way, is also a good way to find out when you should raise the statistics target of one or more of your columns (see Element Column in the DaCHS reference for details).
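    If you want to experiment with statistics targets outside of DaCHS, that is stock Postgres; the table and column names in this sketch are, of course, placeholders:

    -- give the planner finer-grained statistics for one column,
    -- then re-collect them
    ALTER TABLE mysch.mytable ALTER COLUMN s_ra SET STATISTICS 1000;
    ANALYZE mysch.mytable;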

    For me, the EXPLAIN ANALYZE output looked something like this:

    Limit  (cost=10000000000.00..10156565675.53 rows=1 width=57) (actual time=6206.883..6206.899 rows=1 loops=1)
    [...]
     Planning Time: 22.174 ms
     JIT:
       Functions: 130
       Options: Inlining true, Optimization true, Expressions true, Deforming true
       Timing: Generation 55.404 ms, Inlining 107.280 ms, Optimization 3479.626 ms, Emission 2601.411 ms, Total 6243.721 ms
     Execution Time: 6263.243 ms
    

    Ok, I'm lying a bit; there is a reason beyond the analyze why the cost estimate exploded from 7.12 to 10156565675.53. I'll confess it in the appendix to this post.

    The main point, however, is: the execution time now is of the order that I'm expecting (the database is rather busy during a regression test, so those 6 seconds can easily become double that then). Interestingly, essentially all the execution time went into “Optimization” and “Emission”. Until yesterday, I'd never seen a thing like that in Postgres query plans.

    That is because here the JIT is at work, and that was at least a lot less likely in Postgres 11. Now, estimating 10 Gigapennies as execution cost up front, Postgres 13 thought some extra time for writing and compiling a little programme was well spent. Of course, that estimate is badly off, and the right thing to do is to fix the reason for the bad estimate. See the appendix for why I don't just yet.

    That my obscore view has 32 tables contributing to it, giving its definition a whopping 1280 lines, probably does not help. But in particular since the query plans in the presence of Q3C and pgsphere are still usually badly off, it might be wise – if you're running TAP services (you should) and before you upgrade to Postgres 13 – to discourage Postgres a bit from using JIT compilation on DaCHS' workloads in your configuration. To do that, add a line like:

    jit_above_cost = 20000000000
    

    (or so; perhaps you can set your limit a good deal lower) to your postgresql.conf. On Debian boxes, that file is in /etc/postgresql/13/main/ (obviously, change the 13 if you have a different version). You need to restart postgres to make this take effect.
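    Incidentally, before touching postgresql.conf you can experiment with these knobs per session; jit, jit_above_cost, and friends are normal run-time parameters, so in psql something like the following lets you watch the plans flip (a sketch, re-using the query from above):

    SET jit = off;                     -- switch JIT off wholesale, or
    SET jit_above_cost = 20000000000;  -- just strongly discourage it
    EXPLAIN ANALYZE SELECT obs_publisher_did
    FROM ivoa.obscore
    WHERE q3c_join(83.8, -5.4, s_ra, s_dec, 0.2) LIMIT 1;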

    While I was in that file, I thought I could share what other configuration I have in there, because it is likely you can speed up your data centre quite a bit by judicious tuning. The following settings aren't particularly well thought out, but I claim they are not unreasonable for a 64 GB machine that runs as a dedicated server; that last point also explains the first configuration item, as for two-server operation you have to set:

    • listen_addresses = '*' – only then can you talk to postgres from another machine (disregarding hacks like ssh tunnels that may even work as last-resort options). Of course, this may mean your postgres port is visible to the internet, which means you ought to understand what pg_hba.conf is before configuring that. Other configuration I'm doing includes
    • max_connections = 200 – I actually ran out of connections once; DaCHS itself is now a bit more parsimonious with them, but if you have enough RAM, it still doesn't hurt to be generous here.
    • timezone = UTC – TIMESTAMPs suck: they are hard to compute with, they are a pain when plotting, there are time zones, and they generally are a Babylonian mess (as evinced by base-60 numbers). But you can't always escape timestamps, and if you somehow manage to create them “with time zone”, telling the server to use UTC helps limit their damage radius.
    • shared_buffers = 15GB – the Postgres documentation says 25% of the RAM is a good default for shared_buffers, so that's roughly what I went for here. Note that the kernel usually limits how much shared memory processes are allowed to allocate, and you will have to adjust those limits for this to take effect. On Debian, the postgresql-common package installs a file /etc/sysctl.d/30-postgresql-shm.conf for easy adjusting of the limits.
    • temp_buffers = 100MB – that one gives buffers for temporary tables, and raising it helps TAP uploads (which use those, at least for now). Since our TAP uploads tend to be large as temporary tables go, it pays to set aside a couple of megabytes for them. Now that I look at this again and think about what people upload into my data centre: I think I could even raise that a bit more.
    • work_mem = 64MB – this one is for doing joins and the like (which includes cross-matches), and again these tend to be larger in Astronomy than in many other disciplines, where matching tens-of-millions against billions would count as Big Data. Hence, postgres' default of 4 MB is quite certainly going to be causing a lot of unnecessary disk activity. That said, DaCHS could be a bit smarter here and raise work_mem itself when running TAP jobs (or perhaps only TAP jobs that actually do joins). Note that a single query can use up many times work_mem, which means you shouldn't choose this too high, either. One thing I'd like to look into one day is the hash_mem_multiplier (cf. a bit down on Postgres docs on resource limits). If you do research in that direction with astronomy workloads, please let me know.
    • maintenance_work_mem = 2048MB – this is relevant to keep VACUUM runs fast, which become necessary as rows are added to or replaced in the database. I have some relatively large tables that regularly see deletes (e.g., the relational registry), and hence I want smooth vacuuming. If you don't have large tables that regularly change, you probably don't need to bother with maintenance_work_mem.
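    Pulled together, the postgresql.conf fragment this list discusses would read like this (again: for a dedicated 64 GB machine in two-server operation; season the values to taste):

    listen_addresses = '*'
    max_connections = 200
    timezone = UTC
    shared_buffers = 15GB
    temp_buffers = 100MB
    work_mem = 64MB
    maintenance_work_mem = 2048MB
    jit_above_cost = 20000000000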

    If you have additional (or contradicting) advice on Postgres configuration for DaCHS: Please let us know, preferably on the dachs-support mailing list (see DaCHS support).

    Appendix: As I said, I was lying above. The original with-JIT plan was just fine. The horrible, cost-10-Giga plan was only chosen when I did SET enable_seqscan=false. Why would I do a thing like that, forcing Postgres in the wrong direction? Well, DaCHS' TAP executor makes the same setting. And why does it do that to Postgres? That's a long story closely related to the Q3C and pgsphere troubles I've mentioned above – and for which there's now finally hope: see q3c issue #30 if you're curious.

  • Query the Registry with WIRR

    Search windows of VODesktop and WIRR

    Pixels from venerable VODesktop and WIRR: it's supposed to be about the same thing, except WIRR uses and exposes the latest Registry standards (and then some tech that's not standard yet).

    When the VO was young, there was a programme called VODesktop that had a very nice interface for searching the Registry. Also, it would run queries against the services discovered, giving nice all-VO querying that few modern clients do quite as elegantly. Regrettably, when the UK's AstroGrid project was de-funded, VODesktop's development ceased in 2010.

    In 2012, it had become clear that nobody would step up to continue it, and I wanted to at least provide a replacement for the Registry interface part. In consequence, Florian Rothmaier and I wrote the Web Interface to the Relational Registry, or WIRR for short; this lets you build Registry queries in your Web Browser in an interface inspired by VODesktop (which, I'm told, in turn was inspired by early iTunes).

    WIRR's sweet spot is between the Registry interfaces in the usual clients (TOPCAT, Aladin: these try to hide the gory details of where their service lists come from and hence are limited in what interaction they allow) and using a TAP client to write and execute RegTAP queries (where there are no limitations beyond the protocol's, but it's tedious unless you happen to know the RegTAP standard by heart).
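    To illustrate the tedium WIRR saves you: here is roughly what you would have to type into a TAP client to find TAP services whose subjects mention quasars – a sketch along the lines of RegTAP 1.1, using its rr.capability, rr.interface, and rr.res_subject tables:

    SELECT ivoid, access_url
    FROM rr.capability
      NATURAL JOIN rr.interface
      NATURAL JOIN rr.res_subject
    WHERE standard_id='ivo://ivoa.net/std/tap'
      AND intf_role='std'
      AND res_subject LIKE '%quasar%'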

    In contrast to its model VODesktop, WIRR cannot run queries against the services discovered using it. But you can transfer the services you have found to clients via SAMP (TOPCAT can handle the relevant MTypes, but I'm frankly not sure what else can). Apart from that, an obvious use for WIRR is the queries one needs in VO curation. For instance, I keep linking to it when sending people canned registry queries, as in the section on claiming an authority in the DaCHS Tutorial.

    Given that both Javascript and the Registry have evolved a lot in the past decade, WIRR had been in need of a major redecoration for some time, and in early July, I found some time to do it. The central result is that the code is now halfway modern, strict Javascript; let's see how many web browsers are still around that can't execute this.

    On the surface, much less has changed, but there is some news I'd consider noteworthy and that might help your data discovery-fu:

    • Since I've added some constraint types, the constraint type selector is now a hierarchical box, sporting what I think are or should be the most common constraint types (full text, service type, and UAT term) on level 0 and then having “Blind Discovery”, “Finer Grained”, and “Special Effects” as pop-ups; all this so we obey Miller's Rule of Seven.
    • Rather than explain the constraints on a second, separate page, there are now brief help texts coming with each constraint.
    • You can now match against UAT concepts, and there is a completing input box for them; in case you're wondering what this is about, see this post from last February. And yes, next time I'll play with WIRR I'll probably include SemBaReBro here.
    • When constraining by column UCD, you can now choose from UCDs found in the registry (the “Pick one” button).
    • You can now constrain by spatial, temporal, and spectral coverage, though that's still a gamble because not many (or, actually, very few in the case of temporal and spectral) operators care to declare their services' coverage. When they don't, you won't see their resources with such blind discovery constraints. For some background on this, check Space and Time not lost on the Registry on this blog.
    • There is now a “SQL” button with successful searches that lets you retrieve the SQL executed for the particular constraint. While that query does not immediately execute on RegTAP services (it's Postgres' SQL rather than ADQL), it ought to give you a head start when transplanting your Registry query into, say, a pyVO-based script (see the sketch after this list).
    • You can now use your browser's back and forward buttons (or, in my case, key bindings) to navigate in your query history.
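    As promised in the “SQL” button item: once you have such a query, moving it into a script is not a big deal. Here is a sketch of what the pyVO side might look like (the endpoint is GAVO's RegTAP service; ivo_hasword is one of RegTAP's user defined functions):

    import pyvo

    # run a RegTAP query for resources mentioning quasars and print matches
    regtap = pyvo.dal.TAPService("http://reg.g-vo.org/tap")
    matches = regtap.run_sync("""
        SELECT ivoid, res_title
        FROM rr.resource
        WHERE 1=ivo_hasword(res_description, 'quasar')""")
    for match in matches:
        print(match["ivoid"], match["res_title"])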

    What this still doesn't do: Work without Javascript. That's a bit of a disgrace, since after the last changes it would actually be reasonable to provide non-javascript fallbacks for some of the basic functionality (of course, no SAMP at all then…). I'll do it the first time someone asks. Promised.

    A document that now needs at least slight updates because things have moved about a bit is the data discovery use case Florian wrote back then. The updates absolutely necessary are not terribly involved, but I would like to use the opportunity to add a bit more spice to the tutorial. If you have ideas: I'm all ears.

    Oh, and before I close: you can still run VODesktop; kudos to the maintainers of the JVM for that. But it's nevertheless not really usable any more, which perhaps isn't too surprising for a client built on top of experimental online services ten years ago. For one, its TAP client speaks pre-release versions of both TAP and ADQL, so those won't work on modern TAP services (and the ancient ones have vanished). Worse, it needed to use a non-standard extension of RegTAP's predecessor (for those old enough to remember: it used XQuery), and none of the modern searchable registries understands that any more.

    Which is a pity, really. It's been a fine programme. It just was a few years early: by 2012, everything it needed had been defined in nice, stable standards that are still around and probably will be for another decade at least.

  • DaCHS 2.4 is out: Blind discovery, pretty datalink, and more

    DaCHS screenshots and logo

    DaCHS 2.4: automatic ranges (with registry support!), pretty datalink (with vocabulary support!). And then the usual bunch of improvements (hopefully!).

    I have released DaCHS 2.4 today, and as usual for stable releases, I would like to have something like a commented changelog here so DaCHS deployers perhaps look forward to upgrading – which would be good, because there are far too many outdated DaCHSes out there.

    Among the more notable changes in version 2.4 are:

    Blind discovery overhaul. If you've been following my requests to include coverage metadata three years ago, you have probably felt that the way DaCHS started to hack your RDs to include the metadata it had obtained from the data was a bit odd. Well, it was. DaCHS no longer does that when running dachs limits. While you can still do manual overrides, all the statistics gathered by DaCHS are now kept in the database and injected into DaCHS' internal idea of your RDs at loading time.

    I have not only changed this because the old way really sucked; it was also necessary because I wanted to have per-column metadata routinely, and since in advanced DaCHS there often are no XML literals for columns (because of active tags), there wouldn't be a place within the RD itself to keep information like a column's minimum, maximum, median, or “2σ range”. A longer treatment of where this is going is given in the IVOA note Blind Discovery 2: Advanced Column Statistics that Grégory and I have recently uploaded.

    For you, it's easy: just run dachs limits q once you're happy with your data, or perhaps once a month for living data, and leave the rest to DaCHS. A fringe benefit: in browser forms, the value ranges of the various numeric constraints now appear as placeholders (that's the screenshot on the left in the title picture).

    There is a slight downside: As part of this overhaul, DaCHS is now computing the coverage of SIAP and SSAP services based on the footprints of the products as MOCs. While that gives much more precise service footprints, it only works with bleeding-edge pgsphere as delivered in Debian bullseye – or from our Debian repository. If you want to build this from source, you need to get credativ's pgsphere fork for now.

    Generate column elements: If you have tables with many columns, even just lexically entering the <column> elements becomes a strain. That is particularly annoying if there already is a halfway machine-readable representation of that data.

    To alleviate that, very early in the development of DaCHS, I had the gavo mkrd subcommand that you could feed FITS images or VOTables to get template RDs. For a number of reasons, that never worked well enough to make me like or advertise it, and I eventually ended up writing dachs start instead, which is something I like and advertise for general usage.

    However, what that doesn't do is come up with the column declarations. To make good on this, there is now a dachs gencol command that will, from a FITS binary table, a VOTable, or a VizieR-style byte-by-byte description, generate columns with as much metadata as it can fathom. Paste that into the output of dachs start, and, depending on your input format, you should have a quick start on a fairly full-featured data collection (also note there's dachs adm suggestucds for another command that may help quickly generate rich metadata).
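    In practice, that workflow might look like the following sketch; the file name is a placeholder, and the exact arguments are assumptions of mine – check dachs gencol --help before copying this:

    dachs start scs             # template RD for, say, a catalogue service
    dachs gencol mydata.fits    # emit <column> elements with guessed metadata
    # paste gencol's output into the template's table element; afterwards,
    # dachs adm suggestucds can help fill in missing UCDs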

    This currently doesn't work for products (i.e., tables of spectra, images, and the like); at least for FITS arrays, I suppose turning their non-obvious header cards into columns might save some work. Let's see: your feedback is welcome.

    Refurbished Datalink XSLT: Since the dawn of datalink, DaCHS has delivered Datalink documents with XSLT stylesheets in order to have nicely formatted pages rather than wild XML when web browsers chance upon datalink documents. I have overhauled the Javascript part of this (which, I have to admit, is what makes it pretty). For one, the spatial cutout now works again, and it's modeless (no clicking “edit” any more before you can drag cutout vertices). I'm also using the datalink/core vocabulary to furnish link groups with proper titles and descriptions, and to have them sorted into a proper result tree. I've talked about it at the Interop, and I've prepared a showcase of various datalink documents in the Heidelberg data centre.

    Update to DaCHS 2.4 and you'll get the same thing for your datalinks.

    Non-product datalinks: When writing a datalink service, you have to first come up with a descriptor generator. DaCHS will provide a simple one for you (or perhaps a bit more complex ones for FITS images or spectra) – but all of these assume that whatever the datalink ID parameter references is in DaCHS' product table. It turned out that in many interesting cases – for instance, attaching time series to object catalogues – that is not the case, and then you had to write rather obscure code to keep DaCHS from poking around in the product table.

    No longer: There is now the //datalink#fromtable descriptor generator. Just fill in which column contains the identifier and the name of the table containing that column and you're (basically) done. Your descriptor will then have a metadata attribute containing the relevant row – along with everything else DaCHS expects from a datalink descriptor.
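    In an RD, using it might look roughly like this; fair warning: the bind keys below are illustrative guesses of mine, so look up //datalink#fromtable in the DaCHS reference for the authoritative parameter names:

    <service id="dl" allowed="dlget,dlmeta">
      <datalinkCore>
        <descriptorGenerator procDef="//datalink#fromtable">
          <!-- placeholders: the column holding the datalink ID and
               the table to look it up in -->
          <bind key="idColumn">"target_id"</bind>
          <bind key="tableName">"mysch.objects"</bind>
        </descriptorGenerator>
        <!-- metaMakers and data functions as in any other datalink core -->
      </datalinkCore>
    </service>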

    gavo_specconv: That's a longer story covered previously on this blog.

    Index declaration in views: Saying on which columns a database index exists allows users to write smart queries, and DaCHS uses such information internally when rewriting geometrical expressions from ADQL to whatever is in use in the actual database. Hence, making sure these indexes are properly declared is important. But at the same time it's difficult for views, because postgres doesn't let you have indexes on views (for good reasons). Still, queries against views will (usually) use indexes of their underlying tables, and hence those should be declared in the corresponding metadata.

    This is tedious in general. DaCHS now helps you with the //procs#declare-indexes-from stream. Essentially, it will compare the columns in the view with the ones from the source tables and then guess which view columns correspond to indexed columns from the source tables; using that, it adds indexed flags to some view columns.
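    For the record, STREAMs like this are replayed with DaCHS' active tags, so within the view's table definition, using it should look roughly like the line below – with the caveat that I am not showing any parameters here, and whether the stream needs some (say, the names of the source tables) is something to check in the DaCHS reference:

    <FEED source="//procs#declare-indexes-from"/>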

    If all this is too weird for you: thanks to declare-indexes-from, the index declaration now happens automatically in the modern way of building SSAP services, the //ssap#view mixin. Hence, chances are you won't even see this particular STREAM but just notice its beneficial consequences.

    Sunsetting resources: I've been fiddling off and on with a smart way to pull resources I no longer want to maintain while still leaving a tombstone. I had to re-visit this problem recently because I dropped the Gaia DR1 table from my Heidelberg data centre. So, how do I explain to people why the thing that's been there no longer is?

    In general, this is a rather intractable problem; for instance, it's very hard to do something sensible with the TAP_SCHEMA entries or the VOSI tables endpoints for the tables that went away. Pure web pages, on the other hand, can be adorned with helpful info. To enable that, there is now the superseded meta item, which you define in the RD that once held the resources. For Gaia DR1, here's what I used:

    <meta name="superseded" format="rst">
      We do not publish Gaia DR1 data here any more.
      If you actually need DR1 data, refer to the
      full Gaia mirrors, for instance `the one at
      ARI`_.  Otherwise, please use more recent data
      releases, for instance `eDR3`_.
    
      .. _the one at ARI: http://gaia.ari.uni-heidelberg.de
      .. _eDR3: /browse/gaia/q3
    </meta>
    

    Root page template: I slightly streamlined the default root page template, in particular dropping the "i" and "Q" icons for going to the metadata and querying the service. If you have overridden the root template, you may want to see if you want to merge the changes.

    As usual, there are many more small repairs and additions, but most of these are either very minor or rather technical. One last thing, though: DaCHS now works with Python 3.8 (3.7 will continue to be supported for a few years at least, earlier 3.x never was), which is going to be the python3 in Debian bullseye. Bullseye itself will only have DaCHS 2.3 (with the Python 3.8 fixes backported), though. Once bullseye has become stable, we will look into putting DaCHS 2.4 into the backports.

  • GAVO at the Northern Spring Interop 2021

    As usual in May, the people making the Virtual Observatory happen meet for their Interoperability Conference, better known as the Interop – where “meet” still has to be taken with a generous helping of salt (more on this near the end of this post). As has become customary on this blog, let me briefly discuss contributions with a significant involvement of GAVO.

    A major thing from my perspective actually happened in the run-up: The IVOA executive committee (“Exec”) approved Version 2.0 of Vocabularies in the VO, a standard saying how hierarchical word lists (“vocabularies”) can be managed, disseminated, and consumed within the VO. Developing the main ideas, from sufficiently restricting RDF to coming up with desise (which makes complicated things possible with surprisingly little code), and trying things out on our growing number of vocabularies took up quite a bit of my standards time in the last 20 months or so – and I'm fairly happy with the outcome, which I celebrated with a brief talk on programming with IVOA semantics during Wednesday morning's semantics session.

    In that session I gave a second, more discussion-oriented talk, probing how to formalise data product types – which is surprisingly involved, even with the relatively straightforward use case “figure out a programme to handle the data”: What's a spectrum? Well, something that maps a spectral coordinate to... hm. Is it still a spectrum if there are multiple sorts of values (perhaps flux, magnitude, and polarisation)? If we allow, in effect, tuples, why not whole images, which would make spectral cubes spectra – but of course few client programmes that deal with spectra do anything useful with cubes, so clearly such a definition would kill our use case. And what about slit spectra, mapping a spatial coordinate to spectra?

    All this of course is reminiscent of the classical problems of semantics: An elephant is a big animal with a trunk. But when an elephant loses its trunk in an accident: does it stop being an elephant? So, much of the art here is finding the sweet spot of usability between strict and formal semantics (that will never fit the real world) and just tossing around loosely defined strings (that will simply not be machine-readable). After the session, I came up with the 2021-05-26 draft of product-type. If you read this a few years down the road, it might be interesting to compare with what product-type is today. I'm curious myself.

    Later on Wednesday CET, I did a shameless plug for my Datalink-transforming XSLT (apologies for a github link, but I'm fishing for PRs here; if you use DaCHS, you'll get the updated stuff with version 2.4, due soon). The core of this dates back to the dawn of datalink, but with a new graphical cutout code and in particular vocabulary-based tree-ification of the result rows, I figured it's time to remind the operators of datalink services it's still out there for them to take up. Perhaps more than from the slides, you can see what I am after here by just trying the Datalink examples I've collected for this talk and comparing document source, the appearance without Javascript (pure XSLT) and the appearance with Javascript (I'm a bit ashamed I'm relying so heavily on it, but much of this really can only be done client-side).

    Quite a bit after midnight my time (still Thursday UTC), Mark Taylor talked about Software Identification, something I've been working on with him recently. It is one of those things that are short and trivial but that, when unregulated, just don't work; in this case, it's servers and clients saying what they are when they speak HTTP. I stumbled into the problem while trying to locate severely outdated DaCHS installations – so, in a way, I put effort into the Note Mark was talking about (and which I have just uploaded to the IVOA Document Repository) as a sort of penance.

    While I was already asleep when Mark gave his talk, I was back at the Interop on Friday morning CEST, when Hendrik Heinl talked about the LOFAR TAP service (which, I'm proud to say, runs on top of DaCHS); this was mainly live operations in TOPCAT (which is why there are no exciting slides), but Hendrik also used a pyVO script doing cutouts in an (optical) mosaic of the Fornax cluster built on top of – and that's the main point – Datalink and SODA. Working this out with Hendrik made me realise the documentation of Datalink in pyVO really needs… love. Or, better, work.

    Later on Friday, there was the Registry session, where I gave brief (and somewhat cramped) talks on advanced column metadata (which is intended to one day let you query the registry for things like “roughly complete to 18 mag” or “having objects out to redshift 4”) and on how to put VODataService 1.2 coverage into RegTAP – I expect you'll read more on both topics on this blog as they mature to a level at which they can leave the Registry nerd circles.

    And now, at about 10 pm on Friday, the meeting is slowly winding down; beyond all the talks (which were, regrettably for a free software spirit like me, on zoom), the real bonus was that there was a gather.town attached to the conference. Now, that's a closed, proprietary, non-self-hostable platform too, and so I have every reason to grumble. But: for the first time since February 2020 it felt like a conference, with the most useful action happening outside of the lecture halls, from trying to reach consensus on VEP-006 to teaching DaCHS datalink service declaration to learning about working with visibilities coming from VLBI (where it's even more difficult than with the big antenna arrays). So… this one time, I've made my peace with proprietary platforms.

    Apropos of “say no to platforms” (in this case, Slack): due to the recent troubles with freenode, the last week, in addition to the Interop, saw the GAVO IRC channel move to libera.chat (where it's still #gavo). So, for instant messaging us now that the Interop is (in effect) over: come there.
