• We'd still have IDL

    I am newly appointed as a member of the topic group for Federated Infrastructures of DIG-UM (that's an acronym for Digital Transformation in the Research on Universe and Matter), a “bottom-up organization for synergetic research on the digital transformation” (as it says in their Guidelines) in the fields covered by what the German Ministry for Research (BMBF) funds as part of its “Erforschung von Universum und Materie” (ErUM) programme. Since GAVO's work has largely been funded through that programme and its predecessors, I feel obliged to overcome my natural aversion against committee work in this case.

    The first thing I am trying to do in that function is explain the VO to our partners, which come from different branches of physics ranging from astroparticle physiscs (where I still feel relatively at home, though I haven't quite got around to figuring out root, a programme and format that's really common there) to accelerator physics to the Komitee Forschung mit nuklearen Sonden und Ionenstrahlen (KFSI), where people are probing into solid state matter using positron beams, which to me sounds (a) cool and (b) as if you'd better have your 511 keV-protective suit on when visiting them.

    A part of this was summarising what I think are the VO's most difficult challenges at this point. Probably the most pressing of those is the problem that we now routinely have data that is so large that moving it around in full is not a good idea. Now, for large catalogues, I think TAP and ADQL are a good basis for giving people tools for remote analysis, so there I'd say all that is needed is detail work.

    In contrast, for collections of array-like (images, say, but what I'm saying would also apply for things like a bulk analysis of a big collection of spectra) data, we do not have anything remotely comparable; the best you can do is make a remote cutout if you're lucky and your operator has implemented SODA. Doing something like “give me all spectra that have a strong Hα feature”, for instance, requires you to download all spectra, or at least the lines in question.

    Most data providers at this point respond to this challenge is to give their users jupyter hubs next to the data, which boils down to letting people write and execute Python scripts on the data providers' boxes from within a web browser. Admittedly, this works rather nicely for the moment, but I consider this a massive regression over the current VO, for at least the following reasons:

    • Lock-in: You cannot in general transport the jupyter notebooks you write from one provider to the next, because the execution environments are massively different (Python and package versions, package availability, data access).
    • Ephemeral: You probably will not even be able to execute the notebook reliably after the next update of the provider's platform: Python evolves relatively quickly, and many of the packages evolve even faster.
    • Undiscoverable: Nobody currently as figured out how these things could sensibly be registered such that you could ask: “Give me all execution environments I can use on data from ivo://dc.g-vo.org/tap.” Not that many are trying, given all the other problems.
    • Browser-based: Web browsers are probably the most broken and least sustainable element in current computing; if you've ever tried to tweak one of the “major browsers” to your liking, you probably know what I mean. With jupyter hubs, not only do I have to work through one of these horrible “major browsers”, the data providers also control what code is being executed in it. If they don't let me edit in vi, I can't edit in vi. Full stop[1].
    • Central control: More generally, with the current VO and its API endpoints, users get to choose what tools they use. If you'd like to use the APIs from lua or Haskell or want to cobble together stilts and shell script, go ahead. Yes, there is some initial effort to parse VOTable and perhaps support the more subtle aspects of TAP, but that's still not unreasonable. With the “platforms”, it is up to the service operators what tools they let you use.

    As a big fan of Python, I'm happy this platform thing happened exactly in the moment when Python was all the fashion (at least in Astronomy). But Python certainly isn't the end of history. People will think of smarter things (arguably, they already have), and very certainly the expectation that one tool fits all is very wrong.

    All that went through my head this morning when riding to work. And then a slogan crossed my mind that I liked so much for bringing the Platform Problem to a point that I wrote this entire post so I could publish it:

    If science platforms had come around 15 years ago, we'd all still be stuck with IDL.

    [1]Ok, there's greasemonkey-like hacks, but that's really to fragile to seriously consider.
  • The 2021 Southern Spring Interop

    A Venn diagram of product types that just doesn't work.

    A contribution for the ”things that didn't work out” (“Arbeiten, die zu keiner Lösung geführt haben”) section in our reports to BMBF: an attempt to systematise product types at the last Interop. I've made a new proposal at this Interop, and there is reason to hope it will fare better.

    Last night, the second IVOA Interop conference of 2021 came to an end; I'm calling it ”southern spring” because notionally, it happened in Cape Town, back to back with this year's ADASS. In reality, it was again an online event, and so, in keeping up with the tradition established in the pandemic times, the closing event was around midnight CET. I cannot say I will miss these late-night events, although I would not go as far as some people at the conference who quipped they'd prefer the airport security checks to having to sit through another zoom marathon.

    My contributions at this interop again had a clear focus on semantics, for instance with my public confession that my attempt to systematise “product types” at the last interop was entirely misguided; trying to force concepts like “time series”, “spectrum“ or “image” into a tree does not lead to anything that actually works for what this is intended to do, that is, helping people find the sort of data they are after for a particular purpose, or helping clients route data products to other clients better suited to process them. I will now try a restart using SKOS, a plan that was met with a lot more agreement than that previous attempt. Some entertainment at the side was provided by the realisation that a “time-image cube“ is normally called a movie. Next time I'll take in moving pictures, I'll find out what people say when I claim to investigate a time cube.

    Another talk that took up a topic from the last Interop's Semantics session was about making an IVOA vocabulary of object types based on the work done within the CDS over the last 40 year or so. This certainly is just the beginning of a longer effort, not the least because the current concepts severely fall short in the area of the solar system. But it's a start, and there's plenty of time to elaborate this before it will go through a review, presumably with the next version of Obscore.

    Also semantics-related, but over in the session of the Operations interest group, Mark Taylor reported on his activities to evaluate the standards adherence of semantics information in published tables. This activity is what had triggered me to make DaCHS validate UCDs assigned to columns in summer, something that I expect will result in quite few diagnostics when DaCHS operators upgrade to DaCHS 2.5 (expected for November). But that's fine: making it more likely that computers will actually recognise a, say, error in proper motion for what it is is undoubtedly a good thing. I'm therefore glad that there is almost a million “good” UCDs out there and a lot fewer somehow “bad”. I had expected much worse after my realisation that my own annotations left a lot to be desired in summer. By now, the only bad UCDs I'm still pushing out are the ones mandated by SSAP and SLAP. The contradictions between those standards and UCD are going to be addressed with Errata in the coming months.

    My talk in the third Apps session on Thursday afternoon still had some relationship with Semantics; it was a quick show and tell on the enhancements to WIRR I had reported on here in July, and it in particular showcased obtaining UCD constraints by full-text searching the rr.table_column table in my RegTAP service and the selection through UAT concepts. Satisfyingly in some way, it were these topics that people took up in the discussion after the talk. Less satisfyingly, people playing with the thing afterwards turned up something that has the alarming taste of a bug in the new MOC operations in pgsphere. Ouch.

    This segues into the realm of Registry, where there was no actual session but a rather well-attended side meeting in the gathertown instance we could take over from ADASS (that, incidentally, was substantially better attended than during the previous meetings). There, I mainly presented (and explained) my proposed changes to pyVO's registry interface currently living in a private branch in my fork on github. I will write a bit more on that around the time I will turn that into a PR.

    Another outcome of this was that there was some interest to turn the note on documents in the Registry – which is what feeds VOTT – into either an endorsed note or perhaps a Recommendation of the Registry WG.

    My fourth “proper” (in the rather twisted sense of: in a zoom session) talk was an attempt to finally do something about the problems pointed out in my caproles note lamenting that our current service registration patterns are fundamentally flawed. It proposed some ways to to get VOSI availability fixed, and the outcome was that we probably will drop what we currently require in that field, not the least because these requirements are cheerily ignored by 98% of the resources in the Registry.

    Those were again three fairly long days, usually starting with sessions around 7:00 CET and ending with sessions around midnight. Which is clearly not healthy. But on the other hand, it somehow does convey a physical sense of the global nature of the Virtual Observatory, on which people in many, many time zones work. And that, I have to say, still is something I do appreciate.

  • Migrating Away From Wordpress

    Since 2016, this blog was served through a Wordpress instance at the Astrophysical Institute Potsdam AIP – thanks again to our colleagues there for maintaining the platform over all these years.

    But since it now seems as if this is something that might last a long time (by Web standards), we have decided that we should leave PHP behind and look for something properly version controllable, and something that can simply live somewhere on a web server with essentially zero maintenance. Hence, we have moved the content to pelican – which has a clean Debian package, is written in Python, and does not need any active components of its own.

    As an extra bonus, the blog posts are now authored in ReStructuredText, which happens to be what DaCHS' documentation is written in, and what you can use to author metadata for DaCHS resources. If you want, you can now check out the source code for the articles (sorry, it's still subversion; one of these days I'll find something fancier than naked git but lighter than gitlab, and then I'll move GAVO's VCS to git).

    As expected, porting the theme (which I only did rather half-heartedly, so things are a bit less pretty now) and getting the figures right was what caused the bulk of the work. On the plus side, I have also greatly cleaned up categories and tags. Still, it's quite likely we messed something up. If you find anything broken here, please let us know: https://www.g-vo.org/pmwiki/About/Impressum lists the main ways through which you can reach us.

    With that: Subscribe to our Atom feed!

  • Taming the Postgres JIT

    Mild warning: This is exclusively technobabble mainly addressing DaCHS deployers. If you're an astronomer (or yet something else), you're of course still welcome to enjoy it, but don't complain if you're bored.

    My development machine as been on Debian bullseye for a while, which means I've been running Postgres 13 for the past few months. Against Postgres 11, 13 is a lot more optimistic when doing Just-In-Time (JIT) compilation, and that's the beginning of this story.

    This JIT thing in plain language means that Postgres is writing small programmes to compute query results, then compiles them to machine code and executes that rather than running the query plan in some sort of interpreter. This at first sounds like a great idea that should speed up large queries quite a bit. But for one, query time is often bounded not so much by CPU but by I/O, and the sort of analysis that happens for JIT compilation is not free. Not at all.

    I noticed that when a query in the regression test suite I'm running before every commit to DaCHS started to occasionally fail. That test executes:

    SELECT TOP 1 obs_publisher_did
    FROM ivoa.obscore
    WHERE distance(s_ra, s_dec, 83.8,-5.4)<0.2
    

    and then asserts that the result is in within 10 seconds. The purpose of this particular regression test is to make sure all sizable tables in the obscore view have a usable spatial index on the production system. On the development system, there really aren't any tables in obscore that would be slow even when seqscanned.

    How on earth could this query be slow then?

    The natural reaction in such a situation to use EXPLAIN in psql. In this case, there is some non-trivial rewriting of the query going on between ADQL and postgres, which means you cannot just paste the ADQL to Postgres. To figure out the query that DaCHS actually executes, I picked the translated query from the VOTable returned from a successful request (look for the sql_query INFO; that's a DaCHS extension, so that trick won't work for other TAP servers), ran the psql gavo DaCHS operators are probably used to, and then typed:

    EXPLAIN SELECT obs_publisher_did
    FROM ivoa.obscore
    WHERE  q3c_join(83.8, - 5.4, s_ra, s_dec, 0.2) LIMIT 1;
    

    to it. The result was inconspicuous; a few seqscans here and there, but the total cost estimate was “0.00..7.12”, which in physical units works out to “basically nothing”, many orders of magnitude away from the 10 seconds I occasionally saw in the regression tests.

    Well, when a query plan doesn't match your expectations, the next thing to do is EXPLAIN ANALYZE. With that, Postgres executes the plan it has made and then compares its estimates to what the cost turned out to be; this, by the way, is also a good way to find out when you should raise the statistics target of one or more of your columns (see Element Column in the DaCHS reference for details).

    For me, the output looked something like this:

    Limit  (cost=10000000000.00..10156565675.53 rows=1 width=57) (actual time=6206.883..6206.899 rows=1 loops=1)
    [...]
     Planning Time: 22.174 ms
     JIT:
       Functions: 130
       Options: Inlining true, Optimization true, Expressions true, Deforming true
       Timing: Generation 55.404 ms, Inlining 107.280 ms, Optimization 3479.626 ms, Emission 2601.411 ms, Total 6243.721 ms
     Execution Time: 6263.243 ms
    

    Ok, I'm lying a bit; there is another reason than just the analyze for why the cost estimate exploded from 7.12 to 10156565675.53. I'll confess in the appendix to this post.

    The main point, however, is: the execution time now is of the order that I'm expecting (the database is rather busy during a regression test, so those 6 seconds can easily become double that then). Interestingly, essentially all the execution time went into “Optimization” and “Emission”. Until yesterday, I'd never seen a thing like that in Postgres query plans.

    That is because here the JIT is at work, and that was at least a lot less likely in Postgres 11. Now, estimating 10 Gigapennies as execution cost up front, Postgres 13 thought some extra time for writing and compiling a little programme is well spent. Of course, that estimate is badly off, and the right thing to do is to fix the reason for the bad estimate. See the appendix for why I don't just yet.

    That my obscore view has 32 tables contributing to it, giving its definition a whopping 1280 lines, probably does not help. But in particular since the query plans in the presence of Q3C and pgsphere still are usually badly off, it might be wise to discourage Postgres a bit from using JIT compilation with DaCHS' workloads in your configuration if you're running TAP services (you should) and before you upgrade to Postgres 13. To do that, add a:

    jit_above_cost = 20000000000
    

    (or so; perhaps you can set your limit a good deal lower) to your postgresql.conf. On Debian boxes, that file is in /etc/postgresql/13/main/ (obviously, change the 13 if you have a different version). You need to restart postgres to make this take effect.

    While I was in that file, I thought I can share what other configuration I have in there, because it is likely you can speed up your data centre quite a bit by judicious tuning. The following settings aren't particularly well thought out, but I claim they are not unreasonable for a 64 GB machine that runs as a dedicated server; that last thing also causes the first configuration item, as for two-server operation, you have to set

    • listen_addresses = '*' – only then can you talk to postgres from another machine (disregarding hacks like ssh tunnels that may even work as last-resort options). Of course, this may mean your postgres port is visible to the internet, which means you ought to understand what pg_hba.conf is before configuring that. Other configuration I'm doing includes
    • max_connections = 200 – I actually ran out of connections once; DaCHS itself is now a bit more parsimonious with them, but if you have enough RAM, it still doesn't hurt to be generous here.
    • localtime = UTC – TIMESTAMPs suck, because it is hard to compute with them, are a pain when plotting, there are time zones, and they generally are a Babylonian mess (as evinced by base-60 numbers). But you can't always escape timestamps, and if you somehow manage to create them “with time zone”, telling the server to do UTC helps limit their damage radius.
    • shared_buffers = 15GB – the Postgres documentation says 25% of the RAM is a good default for shared_buffers, so that's roughly what I went for here. Note that the kernel usually limits how much shared memory processes are allowed to allocate, and you will have to adjust those limits for this to take effect. On Debian, the postgresql-common package installs a file /etc/sysctl.d/30-postgresql-shm.conf for easy adjusting of the limits.
    • temp_buffers = 100MB – that one gives buffers for temporary tables, and raising it helps TAP uploads (which use those, at least for now). Since our TAP uploads tend to be large as temporary tables go, it pays to set aside a couple of megabytes for them. Now that I look at this again and think about what people upload into my data centre: I think I could even raise that a bit more.
    • work_mem = 64MB – this one is for doing joins and the like (which includes cross-matches), and again these tend to be larger in Astronomy than in many other disciplines, where matching tens-of-millions against billions would count as Big Data. Hence, postgres' default of 4 MB is quite certainly going to be causing a lot of unnecessary disk activity. That said, DaCHS could be a bit smarter here and raise work_mem itself when running TAP jobs (or perhaps only TAP jobs that actually do joins). Note that a single query can use up many times work_mem, which means you shouldn't choose this too high, either. One thing I'd like to look into one day is the hash_mem_multiplier (cf. a bit down on Postgres docs on resource limits). If you do research in that direction with astronomy workloads, please let me know.
    • maintenance_work_mem = 2048MB – this is relevant to keep VACUUM runs fast, which become necessary as rows are added to or replaced in the database. I have some relatively large tables that regularly see deletes (e.g., the relational registry), and hence I want smooth vacuuming. If you don't have large tables that regularly change, you probably don't need to bother with maintenance_work_mem.

    If you have additional (or contradicting) advice on Postgres configuration for DaCHS: Please let us know, preferably on the dachs-support mailing list (see DaCHS support).

    Appendix: As I said: I was lying above. The original with-JIT plan was just fine. The horrible, cost 100 Giga, plan was only chosen when I did the SET enable_seqscan=false. Why would I do a thing like that, forcing Postgres in the wrong direction? Well, DaCHS' TAP executor makes the same setting. And why does it do that to Postgres? That's a long story closely related to the Q3C and pgsphere troubles I've mentioned above – and for which there's now finally hope: See q3c issue #30 if you're curious.

  • Query the Registry with WIRR

    Search windows of VODesktop and WIRR

    Pixels from venerable VODesktop and WIRR: it's supposed to be about the same thing, except WIRR uses and exposes the latest Registry standards (and then some tech that's not standard yet).

    When the VO was young, there was a programme called VODesktop that had a very nice interface for searching the Registry. Also, it would run queries against the services discovered, giving nice all-VO querying that few modern clients do quite as elegantly. Regrettably, when the astrogrid UK project was de-funded, VODesktop's development ceased in 2010.

    In 2012, it had become clear that nobody would step up to continue it, and I wanted to at least provide a replacement for the Registry interface part. In consequence, Florian Rothmaier and I wrote the Web Interface to the Relational Registry, or WIRR for short; this lets you build Registry queries in your Web Browser in an interface inspired by VODesktop (which, I'm told, in turn was inspired by early iTunes).

    WIRR's sweet spot is between the Registry interfaces in the usual clients (TOPCAT, Aladin: these try to hide the gory details of where their service lists come from and hence are limited in what interaction they allow) and using a TAP client to write and execute RegTAP queries (where there are no limitations beyond the protocol's, but it's tedious unless you happen to know the RegTAP standard by heart).

    In contrast to its model VODesktop, WIRR cannot run any queries against the services discovered using it. But you can transfer the services you have found to clients via SAMP (TOPCAT can handle the relevant MTypes, but I'm frankly not sure what else). Apart from that, an obvious use for WIRR are the queries one needs in VO curation. For instance, I keep linking to it when sending people canned registry queries, as in the section on claiming an authority in the DaCHS Tutorial.

    Given that both Javascript and the Registry have evolved a lot in the past decade, WIRR was in need of a major redecoration for some time now, and in early July, I found some time to do it. The central result is that the code is now halfway modern, strict Javascript; let's see how many web browsers still run that can't execute this.

    On the surface, much less has changed, but there are some news I'd consider noteworthy and that might help your data discovery-fu:

    • Since I've added some constraint types, the constraint type selector is now a hierarchical box, sporting what I think are or should be the most common constraint types (full text, service type and UAT term) on level 0 and then having “Blind Discovery“, “Finer Grained“, and “Special Effects“ as pop-ups; all this so we obey Miller's Rule of Seven.
    • Rather than explain the constraints on a second, separate page, there are now brief help texts coming with each constaint.
    • You can now match against UAT concepts, and there is a completing input box for them; in case you're wondering what this is about, see this post from last February. And yes, next time I'll play with WIRR I'll probably include SemBaReBro here.
    • When constraining by column UCD, you can now choose from UCDs found in the registry (the “Pick one“ button).
    • You can now constrain by spatial, temporal, and spectral coverage, though that's still a gamble because not many (or, actually, very few in the case of temporal and spectral) operators care to declare their services' coverage. When they don't, you won't see their resources with such blind discovery constraints. For some background on this, check Space and Time not lost on the Registry on this blog.
    • There is now a „SQL“ button with successful searches that lets you retrieve the SQL executed for the particular constraint. While that query does not immediately execute on RegTAP services (it's Postgres' SQL rather than ADQL), it ought to give you a head start when transplanting your Registry query into, say, a pyVO-based script.
    • You can now use your browser's back and forward buttons (or, in my case. key bindings) to navigate in your query history.

    What this still doesn't do: Work without Javascript. That's a bit of a disgrace, since after the last changes it would actually be reasonable to provide non-javascript fallbacks for some of the basic functionality (of course, no SAMP at all then…). I'll do it the first time someone asks. Promised.

    A document that now needs at least slight updates because things have moved about a bit is the data discovery use case Florian wrote back then. The updates absolutely necessary are not terribly involved, but I would like to use the opportunity to add a bit more spice to the tutorial. If you have ideas: I'm all ears.

    Oh, and before I close: you can still run VODesktop; kudos to the maintainers of the JVM for that. But it's nevertheless not really usable any more, which perhaps isn't too surprising for a client built on top of experimental online services ten years ago. For one, its TAP client speaks pre-release versions of both TAP and ADQL, so those won't work on modern TAP services (and the ancient ones have vanished). Worse, it needed to use a non-standard extension of RegTAP's predecessor (for those old enough to remember: it used XQuery), and none of the modern searchable registries understands that any more.

    Which is a pity, really. It's been a fine programme. It just was a few years early: By 2012, everything it needed has been defined in nice, stable standards that are still around and probably will be for another decade at least.

« Page 8 / 21 »