Posts with the Tag DaCHS:

  • Small Change, Big Win

    Screenshot with the Erratum content (2 lines) highlighted

    That's SCS 1.03 Erratum 2 rendered in my browser with a bit of image processing to celebrate that there's one painful VO legacy less on this world.

    PSA: what follows is VO lore that may be entertaining but will not help you use or publish astronomical data.

    Today, I've made a very small commit to my VO publication package DaCHS (revision 8452):

    --- gavo/web/vodal.py (revision 8451)
    +++ gavo/web/vodal.py (working copy)
    @@ -260,7 +260,6 @@
            version = "1.0"
            parameterStyle = "dali"
            standardId = "ivo://ivoa.net/std/ConeSearch"
    -     defaultOutputFormat = "votable1.1"
    

    One deleted line, small cause, huge effect.

    This story starts with the oldest „operational“ VO standard, Simple Cone Search, which was formally published in 2008 but really got its current shape a lot earlier.

    I've not been there back then, but I think the authors expected that clients would be parsing the VOTables that the services were returning using something called XML binding. That, well, was a technique where code was generated from an XML schema, and only instance documents conforming to that exact schema could be parsed with that code.

    That is of course the opposite of the golden rule of interoperability (“be strict in what you produce and lenient in what you accept”) and thus would have been a terrible implementation choice for interoperable clients (and I believe nobody ever tried it). But somehow – or that is my explanation – the XML binding reasoning translated into the requirement that SCS services could only return VOTable 1.0 or VOTable 1.1, and that made it into the standard. It was hence the law. And that it DaCHS had to keep alive VOTable 1.1 for writing (which the above commit of course doesn't remove, but I can remove it now any time I feel like it). And that it couldn't do a lot of useful things that required features not present in VOTable 1.1.

    Nobody dared to touch the problem for about a decade, as it was actually unclear whether some ancient code might still be doing useful work with SCS and XML binding. And I shouldn't be scoulding them after I have recently broken ESO examples under the assumption that “aw, nobody's gonna do this“. Then, starting about five years ago, we had a couple of discussions at various conferences about how we might bring SCS into the present VO (where it, it has to be said, sticks out a bit for several other reasons, too, like its funky error reporting and the funny UCDs it uses). But these weren't easy: What exactly are we allowed to break within a minor version under the above assumption (“aw, nobody… “)? If we do a major version, how do we plan for co-existence for two parallel major version?

    Well: For the version restriction, in the end a simple Erratum was enough. On January 26, 2022, the IVOA Technical Coordination Group accepted SCS 1.03 Erratum 2. And now I can return whatever VOTable version suits me. Phewy.

    I can now have GROUPs in GROUPs (which I need to annotate photometry), I can finally return tables with my old proposal for STC in VOTable in SCS results (where they would have mattered most – not that anyone cares any more, as that ship has sailed somewhere completely different).

    Hey, I can have xtypes. Doesn't mean anything to you? Well, try this: In TOPCAT, open VO/Cone Search. Type “Constellations” and select the “cslt cone“ service. Run a query for some part of the sky, with a size of a few 10s of degrees. Open a sky plot, and in there, do Layers → Add Area Control, and in that control select the table you have just pulled in. Presto: You'll see the constellation boundaries without further configuration, and that's because TOPCAT has the xtype to figure out that the odd numbers it sees are really the vertex coordinates of a spherical polygon in DALI serialisation.

    Not a big deal, you say? Perhaps. But lots of small deals accumulated make the difference between what you can do and what you cannot, in particular across services (which is what the VO is about).

    Removing the erroneous constraint on VOTable versions in SCS opened the standard up for quite a few small deals. Thanks, TCG!

  • DaCHS 2.5: Check your UCDs

    DaCHS logo on top of a map of UCDs

    In the background of the DaCHS 2.5 release picture: UCDs grabbed from the Registry. The factual background: DaCHS 2.5 will now moan at you when you invent or mistype UCDs

    This afternoon, I have released DaCHS 2.5. As usual, I will discuss the more important changes in a blog post – this one.

    A change many of you will not like too much is that DaCHS now validates UCDs you give it, and it will warn you when you do not follow the UCD rules. This may seem like nit-picking, but as blind discovery is on the verge of becoming usable in the VO, making sure these strings actually are what they should be is becoming operationally important: If I want to find resources that give errors for their photometry, I have to know whether it's stat.error;phot.mag.b or phot.mag.b;stat.error, or else I will miss half the resources out there.

    So, I'm sorry if DaCHS starts complaining about half of your RDs after you update, but it's for a good cause. And don't feel bad about the complaints: DaCHS complained about close to half of my RDs after I had put in that feature.

    By the way, this comes as part of a larger effort on the side of the Operations IG to improve the validity of UCDs and units in the VO, an effort that has unearthed bugs in the SSAP and SLAP specifications in that they require UCDs forbidden by the UCD standard. DaCHS 2.5 still follows SSAP and SLAP, and hence external tools like stilts will protest because of bad UCDs even if DaCHS is happy. Errata for the specifications are being worked on, and once they are accepted, DaCHS and stilts will finally agree on UCD validity, or so I hope.

    Code-wise, a much more intrusive change was that asynchronous services (in particular, async TAP) now use the same formalism for parsing parameters as their synchronous counterparts. It may seem odd that that hasn't been the case up to now, but there were good reasons for that; for instance, with async, people can post incomplete parameter sets that would be rejected by normal sync processing.

    Unless you are running User UWS services, you should not notice anything. If you do run User UWS services, please contact me before upgrading. I would like to work with you on how these should look like in the future.

    Another change that might break your services is that DaCHS now actually complies to VOUnits, which has always forbidden whitespace of all kinds in unit strings. DaCHS, on the other hand, has foolishly encouraged putting whitespace between scale factors and pure units, as in 1e-10 m. That's not interoperable, and hence DaCHS now rejects such units. This may lead to hidden failures when dachs val doesn't notice something is a unit, and things only break during execution. I'm aware of one place where that's relevant: spectral cutout services that need to know the spectral unit If you're running those, make double sure that the spectralUnit in the SSAP mixin does not contain any whitespace. It's 0.1nm according to VOUnits, not 0.1 nm.

    An update that should silently make your services more compliant is that DaCHS' representation of EPN-TAP is updated to what is currently under IVOA review. After you upgrade, DaCHS will try to update your EPN tables' metadata, which in turn should make stilts taplint a lot happier. It will also make DaCHS pass on the new, IVOA table utype to the Registry, which is how people should in the future find EPN-TAP data.

    DaCHS now also contains some code that may help you import data from HDF5 files. For one, there is the HDF5 grammar, which rather directly pulls data from HDF5s written by astropy or vaex. But, really: HDF5 is a rather low-level format not particularly well suited for relational data, and it is virtually impossible to write generic code for doing something sensible with it. The two flavours DaCHS supports have very little in common, and it is therefore almost certain that if you have HDF5s coming from somewhere else, hdf5Grammar will not understand them. Still, let us know what you've got, we may be able to put support for it in.

    Hdf5grammar is written in Python, and thus imports perhaps a few thousand rows per second. For Gigarow-sized data collections, that's nowhere near fast enough, and hence for vaex-written HDF5s, there is booster support. As before, if you have bulk data in HDF5 that you want to put into a database and that was not written by vaex, let us know and we'll see what we can do.

    A surprisingly minor change enabled DaCHS to deal with materialised views, database views that are turned into actual tables by postgres. See the corresponding section in the tutorial for how you can use them. We do not have any materialised views in our Heidelberg data center yet. So, if you use them and notice something is clunky, your feedback is particularly appreciated.

    There are many smaller changes and improvements; let me mention what the changelog euphemistically calls ”better systemd integration”, which really means that so far systemctl restart dachs simply didn't do anything at all. Apologies. And shame on everyone who was bewildered but failed to report this to dachs-support.

    Also, you can use float arrays in boosters now, and DaCHS' ADQL has just leared about COALESCE. That's a SQL feature that lets you deal sensibly with NULLs in some cases: COALESCE(arg1, arg2, ...) will return the first non-NULL argument it encounters. That may sound like a slightly exotic function. Until you need it, at which point you wonder how ADQL could reach its ripe age without COALESCE.

    Finally, let me mention something that is not part of the release, though it is DaCHS-related and is new since the last release: I have cleaned up the access log processing machinery we have used in Heidelberg in the past 15 years or so, and I have packaged it up for general consumption. It is, of course, a DaCHS RD that you can just check out and use in your own DaCHS installation if you have to keep access logs and want to do that with at least some basic respect for your user's rights. See http://docs.g-vo.org/DaCHS/tutorial.html#access-logs for details.

  • Taming the Postgres JIT

    Mild warning: This is exclusively technobabble mainly addressing DaCHS deployers. If you're an astronomer (or yet something else), you're of course still welcome to enjoy it, but don't complain if you're bored.

    My development machine as been on Debian bullseye for a while, which means I've been running Postgres 13 for the past few months. Against Postgres 11, 13 is a lot more optimistic when doing Just-In-Time (JIT) compilation, and that's the beginning of this story.

    This JIT thing in plain language means that Postgres is writing small programmes to compute query results, then compiles them to machine code and executes that rather than running the query plan in some sort of interpreter. This at first sounds like a great idea that should speed up large queries quite a bit. But for one, query time is often bounded not so much by CPU but by I/O, and the sort of analysis that happens for JIT compilation is not free. Not at all.

    I noticed that when a query in the regression test suite I'm running before every commit to DaCHS started to occasionally fail. That test executes:

    SELECT TOP 1 obs_publisher_did
    FROM ivoa.obscore
    WHERE distance(s_ra, s_dec, 83.8,-5.4)<0.2
    

    and then asserts that the result is in within 10 seconds. The purpose of this particular regression test is to make sure all sizable tables in the obscore view have a usable spatial index on the production system. On the development system, there really aren't any tables in obscore that would be slow even when seqscanned.

    How on earth could this query be slow then?

    The natural reaction in such a situation to use EXPLAIN in psql. In this case, there is some non-trivial rewriting of the query going on between ADQL and postgres, which means you cannot just paste the ADQL to Postgres. To figure out the query that DaCHS actually executes, I picked the translated query from the VOTable returned from a successful request (look for the sql_query INFO; that's a DaCHS extension, so that trick won't work for other TAP servers), ran the psql gavo DaCHS operators are probably used to, and then typed:

    EXPLAIN SELECT obs_publisher_did
    FROM ivoa.obscore
    WHERE  q3c_join(83.8, - 5.4, s_ra, s_dec, 0.2) LIMIT 1;
    

    to it. The result was inconspicuous; a few seqscans here and there, but the total cost estimate was “0.00..7.12”, which in physical units works out to “basically nothing”, many orders of magnitude away from the 10 seconds I occasionally saw in the regression tests.

    Well, when a query plan doesn't match your expectations, the next thing to do is EXPLAIN ANALYZE. With that, Postgres executes the plan it has made and then compares its estimates to what the cost turned out to be; this, by the way, is also a good way to find out when you should raise the statistics target of one or more of your columns (see Element Column in the DaCHS reference for details).

    For me, the output looked something like this:

    Limit  (cost=10000000000.00..10156565675.53 rows=1 width=57) (actual time=6206.883..6206.899 rows=1 loops=1)
    [...]
     Planning Time: 22.174 ms
     JIT:
       Functions: 130
       Options: Inlining true, Optimization true, Expressions true, Deforming true
       Timing: Generation 55.404 ms, Inlining 107.280 ms, Optimization 3479.626 ms, Emission 2601.411 ms, Total 6243.721 ms
     Execution Time: 6263.243 ms
    

    Ok, I'm lying a bit; there is another reason than just the analyze for why the cost estimate exploded from 7.12 to 10156565675.53. I'll confess in the appendix to this post.

    The main point, however, is: the execution time now is of the order that I'm expecting (the database is rather busy during a regression test, so those 6 seconds can easily become double that then). Interestingly, essentially all the execution time went into “Optimization” and “Emission”. Until yesterday, I'd never seen a thing like that in Postgres query plans.

    That is because here the JIT is at work, and that was at least a lot less likely in Postgres 11. Now, estimating 10 Gigapennies as execution cost up front, Postgres 13 thought some extra time for writing and compiling a little programme is well spent. Of course, that estimate is badly off, and the right thing to do is to fix the reason for the bad estimate. See the appendix for why I don't just yet.

    That my obscore view has 32 tables contributing to it, giving its definition a whopping 1280 lines, probably does not help. But in particular since the query plans in the presence of Q3C and pgsphere still are usually badly off, it might be wise to discourage Postgres a bit from using JIT compilation with DaCHS' workloads in your configuration if you're running TAP services (you should) and before you upgrade to Postgres 13. To do that, add a:

    jit_above_cost = 20000000000
    

    (or so; perhaps you can set your limit a good deal lower) to your postgresql.conf. On Debian boxes, that file is in /etc/postgresql/13/main/ (obviously, change the 13 if you have a different version). You need to restart postgres to make this take effect.

    While I was in that file, I thought I can share what other configuration I have in there, because it is likely you can speed up your data centre quite a bit by judicious tuning. The following settings aren't particularly well thought out, but I claim they are not unreasonable for a 64 GB machine that runs as a dedicated server; that last thing also causes the first configuration item, as for two-server operation, you have to set

    • listen_addresses = '*' – only then can you talk to postgres from another machine (disregarding hacks like ssh tunnels that may even work as last-resort options). Of course, this may mean your postgres port is visible to the internet, which means you ought to understand what pg_hba.conf is before configuring that. Other configuration I'm doing includes
    • max_connections = 200 – I actually ran out of connections once; DaCHS itself is now a bit more parsimonious with them, but if you have enough RAM, it still doesn't hurt to be generous here.
    • localtime = UTC – TIMESTAMPs suck, because it is hard to compute with them, are a pain when plotting, there are time zones, and they generally are a Babylonian mess (as evinced by base-60 numbers). But you can't always escape timestamps, and if you somehow manage to create them “with time zone”, telling the server to do UTC helps limit their damage radius.
    • shared_buffers = 15GB – the Postgres documentation says 25% of the RAM is a good default for shared_buffers, so that's roughly what I went for here. Note that the kernel usually limits how much shared memory processes are allowed to allocate, and you will have to adjust those limits for this to take effect. On Debian, the postgresql-common package installs a file /etc/sysctl.d/30-postgresql-shm.conf for easy adjusting of the limits.
    • temp_buffers = 100MB – that one gives buffers for temporary tables, and raising it helps TAP uploads (which use those, at least for now). Since our TAP uploads tend to be large as temporary tables go, it pays to set aside a couple of megabytes for them. Now that I look at this again and think about what people upload into my data centre: I think I could even raise that a bit more.
    • work_mem = 64MB – this one is for doing joins and the like (which includes cross-matches), and again these tend to be larger in Astronomy than in many other disciplines, where matching tens-of-millions against billions would count as Big Data. Hence, postgres' default of 4 MB is quite certainly going to be causing a lot of unnecessary disk activity. That said, DaCHS could be a bit smarter here and raise work_mem itself when running TAP jobs (or perhaps only TAP jobs that actually do joins). Note that a single query can use up many times work_mem, which means you shouldn't choose this too high, either. One thing I'd like to look into one day is the hash_mem_multiplier (cf. a bit down on Postgres docs on resource limits). If you do research in that direction with astronomy workloads, please let me know.
    • maintenance_work_mem = 2048MB – this is relevant to keep VACUUM runs fast, which become necessary as rows are added to or replaced in the database. I have some relatively large tables that regularly see deletes (e.g., the relational registry), and hence I want smooth vacuuming. If you don't have large tables that regularly change, you probably don't need to bother with maintenance_work_mem.

    If you have additional (or contradicting) advice on Postgres configuration for DaCHS: Please let us know, preferably on the dachs-support mailing list (see DaCHS support).

    Appendix: As I said: I was lying above. The original with-JIT plan was just fine. The horrible, cost 100 Giga, plan was only chosen when I did the SET enable_seqscan=false. Why would I do a thing like that, forcing Postgres in the wrong direction? Well, DaCHS' TAP executor makes the same setting. And why does it do that to Postgres? That's a long story closely related to the Q3C and pgsphere troubles I've mentioned above – and for which there's now finally hope: See q3c issue #30 if you're curious.

  • DaCHS 2.4 is out: Blind discovery, pretty datalink, and more

    DaCHS screenshots and logo

    DaCHS 2.4: automatic ranges (with registry support!), pretty datalink (with vocabulary support!). And then the usual bunch of improvements (hopefully!).

    I have released DaCHS 2.4 today, and as usual for stable releases, I would like to have something like a commented changelog here so DaCHS deployers perhaps look forward to upgrading – which would be good, because there are far too many outdated DaCHSes out there.

    Among the more notable changes in version 2.4 are:

    Blind discovery overhaul. If you've been following my requests to include coverage metadata three years ago, you have probably felt that the way DaCHS started to hack your RDs to include the metadata it had obtained from the data was a bit odd. Well, it was. DaCHS no longer does that when running dachs limits. While you can still do manual overrides, all the statistics gathered by DaCHS is now kept in the database and injected into the DaCHS' internal idea of your RDs at loading time.

    I have not only changed this because the old way really sucked; it was also necessary because I wanted to have per-column metadata routinely, and since in advanced DaCHS there often are no XML literals for columns (because of active tags), there wouldn't be a place to keep information like what a column is minimally, maximally, in median, or as a “2σ range“ within the RD itself. A longer treatment of where this is going is given in the IVOA note Blind Discovery 2: Advanced Column Statistics that Grégory and I have recently uploaded.

    For you, it's easy: Just run dachs limits q once you're happy with your data, or perhaps once a month for living data, and leave the rest to DaCHS. A fringe benefit: in browser froms, there are now value ranges of the various numeric constraints as placeholders (that's the screenshot on the left in the title picture).

    There is a slight downside: As part of this overhaul, DaCHS is now computing the coverage of SIAP and SSAP services based on the footprints of the products as MOCs. While that gives much more precise service footprints, it only works with bleeding-edge pgsphere as delivered in Debian bullseye – or from our Debian repository. If you want to build this from source, you need to get credativ's pgsphere fork for now.

    Generate column elements: If you have tables with many columns, even just lexically entering the <column> elements becomes straining. That is particularly annoying if there already is a halfway machine-readable representation of that data.

    To alleviate that, very early in the development of DaCHS, I had the gavo mkrd subcommand that you could feed FITS images or VOTables to get template RDs. For a number of reasons, that never worked well enough to make me like or advertise it, and I eventually ended up writing dachs start instead, which is something I like and advertise for general usage.

    However, what that doesn't do is come up with the column declarations. To make good on this, there is now a dachs gencol command that will, from a FITS binary table, a VOTable, or a VizieR-style byte-by-byte description, generate columns with as much metadata as it can fathom. Paste that into the output of dachs start, and, depending on your input format, you should have a quick start on a fairly full-featured data collection (also note there's dachs adm suggestucds for another command that may help quickly generate rich metadata).

    This currently doesn't work for products (i.e., tables of spectra, images, and the like); at least for FITS arrays, I suppose turning their non-obvious header cards into columns might save some work. Let's see: your feedback is welcome.

    Refurbished Datalink XSLT: Since the dawn of datalink, DaCHS has delivered Datalink documents with XSLT stylesheets in order to have nicely formatted pages rather than wild XML when web browsers chance on datalink documents. I have overhauled the Javascript part of this (which, I have to admit, is what makes it pretty). For one, the spatial cutout now works again, and it's modeless (no clicking “edit“ any more before you can drag cutout vertices). I'm also using the datalink/core vocabulary to furnish link groups with proper titles and descriptions, and to have them sorted in in a proper result tree. I've talked about it at the interop, and I've prepared a showcase of various datalink documents in the Heidelberg data centre.

    Update to DaCHS 2.4 and you'll get the same thing for your datalinks.

    Non-product datalinks: When writing a datalink service, you have to first come up with a descriptor generator. DaCHS will provide a simple one for you (or perhaps a bit more complex ones for FITS images or spectra) – but all of these assume that whatever the datalink ID parameter references is in DaCHS' product table. It turned out that in many interesting cases – for instance, attaching time series to object catalogues – that is not the case, and then you had to write rather obscure code to keep DaCHS from poking around in the product table.

    No longer: There is now the //datalink#fromtable descriptor generator. Just fill in which column contains the identifier and the name of the table containing that column and you're (basically) done. Your descriptor will then have a metadata attribute containing the relevant row – along with everything else DaCHS expects from a datalink descriptor.

    gavo_specconv: That's a longer story covered previously on this blog.

    Index declaration in views: Saying on which columns a database index exists allows users to write smart queries, and DaCHS uses such information internally when rewriting geometrical expressions from ADQL to whatever is in use in the actual database. Hence, making sure these indexes are properly declared is important. But at the same time it's difficult for views, because postgres doesn't let you have indexes on views (for good reasons). Still, queries against views will (usually) use indexes of their underlying tables, and hence those should be declared in the corresponding metadata.

    This is tedious in general. DaCHS now helps you with the //procs#declare-indexes-from stream. Essentially, it will compare the columns in the view with the ones from the source tables and then guess which view columns correspond to indexed columns from the source tables; using that, it adds indexed flags to some view columns.

    If all this is too weird for you: Thanks to declare-indexes-from, the index declaration now automatically happens in the modern way to build SSAP services, the //ssap#view mixin. Hence, chances are you won't even see this particular STREAM but just notice its beneficial consequences.

    Sunsetting resources: I've been fiddling off and on with a smart way to pull resources I no longer want to maintain while still leaving a tombstone. I had to re-visit this problem recently because I dropped the Gaia DR1 table from my Heidelberg data centre. So, how do I explain to people why the thing that's been there no longer is?

    In general, this is a rather untractable problem; for instance, it's very hard to do something sensible with the TAP_SCHEMA entries or the VOSI tables endpoints for the tables that went away. Pure web pages, on the other hand, can be adorned with helpful info. To enable that, there is now the superseded meta item, which you define in the RD that once held the resources. For Gaia DR1, here's what I used:

    <meta name="superseded" format="rst">
      We do not publish Gaia DR1 data here any more.
      If you actually need DR1 data, refer to the
      full Gaia mirrors, for instance `the one at
      ARI`_.  Otherwise, please use more recent data
      releases, for instance `eDR3`_.
    
      .. _the one at ARI: http://gaia.ari.uni-heidelberg.de
      .. _eDR3: /browse/gaia/q3
    </meta>
    

    Root page template: I slightly streamlined the default root page template, in particular dropping the "i" and "Q" icons for going to the metadata and querying the service. If you have overridden the root template, you may want to see if you want to merge the changes.

    As usual, there are many more small repairs and additions, but most of these are either very minor or rather technical. One last thing, though: DaCHS now works with Python 3.8 (3.7 will continue to be supported for a few years at least, earlier 3.x never was), which is going to be the python3 in Debian bullseye. Bullseye itself will only have DaCHS 2.3 (with the Python 3.8 fixes backported), though. Once bullseye has become stable, we will look into putting DaCHS 2.4 into the backports.

  • Spectral Units in ADQL

    math formulae.

    In case you find the piece of Python given below too hard to read: It's just this table of conversion expressions between the different SI units we are dealing with here.

    Astronomers these days work all along the electromagnetic spectrum (and beyond, of course). Depending on where they observe, they will have very different instrumentation, and hence some see their messengers very naturally as waves, others quite as naturally as particles, others just as electrons flowing out of a CCD that is sitting behind a filter.

    In consequence, when people say where in the spectrum they are, they use very different notions. A radio astronomer will say “I'm observing at 21 cm” or “at 50 GHz“. There's an entire field named after a wavelength, “submillimeter“, and blueward of that people give their bands in micrometers. Optical astronomers can't be cured of their Ångström habit. Going still more high-energy, after an island of nanometers in the UV you end up in the realm of keV in X-ray, and then MeV, GeV, TeV and even EeV.

    However, there is just one VO (or at least that's where we want to go). Historically, the VO has had a slant towards optical astronomy, which gives us the legacy of having wavelengths in far too many places, including Obscore. Retrospectively, this was an unfortunate choice not only because it makes us look optical bigots, but in particular because in contrast to energy and, by ν = E/h, frequency, messenger wavelength depends on the medium you work in, and I shudder to think how many wavelengths in my data center actually are air wavelengths rather than vacuum wavelengths. Also, as you go beyond photons, energy really is the only thing that reasonably characterises all messengers alike (well, even that still isn't quite settled for gravitational waves as long as we're not done with a quantum theory of gravitation).

    Well – the wavelength milk is spilled. Still, the VO has been boldly expanding its reach beyond the optical and infrared windows (recently, with neutrinos and gravitational waves, not to mention EPN-TAP's in-situ measurements in the solar system, even beyond the electromagnetic spectrum). Which means we will have to accomodate the various customs regarding spectral units described above. Where there are “thick” user interfaces, these can care about that. For instance, my datalink XSLT and javascript lets people constrain spectral cutouts (along BAND) in a variety of units (Example).

    But what if the UI is as shallow as it is in ADQL, where you deal with whatever is in the underlying database tables? This has come up again at last week's EuroVO Technology Forum in virtual Strasbourg in the context of making Obscore more attractive to radio astronomers. And thus I've sat down and taught DaCHS a new user defined function to address just that.

    Up front: When you read this in 2022 or beyond and everything has panned out, the function might be called ivo_specconv already, and perhaps the arguments have changed slightly. I hope I'll remember to update this post accordingly. If not, please poke me to do so.

    The function I'm proposing is, mainly, gavo_specconv(expr, target_unit). All it does is convert the SQL expression expr to the (spectral) target_unit if it knows how to do that (i.e., if the expression's unit and the target unit are spectral units properly written in VOUnit) and raise an error otherwise.

    So, you can now post:

    SELECT TOP 5 gavo_specconv(em_min, 'GHz') AS nu
    FROM ivoa.obscore
    WHERE gavo_specconv((em_min+em_max)/2, 'GHz')
        BETWEEN 1 AND 2
      AND obs_collection='VLBA LH sources'
    

    to the TAP service at http://dc.g-vo.org/tap. You will get your result in GHz, and you write your constraint in GHz, too. Oh, and see below on the ugly constraint on obs_collection.

    Similarly, an X-ray astronomer would say, perhaps:

    SELECT TOP 5 access_url, gavo_specconv(em_min, 'keV') AS energy
    FROM ivoa.obscore
    WHERE gavo_specconv((em_min+em_max)/2, 'keV')
      BETWEEN 0.5 AND 2
      AND obs_collection='RASS'
    

    This works because the ADQL translator can figure out the unit of its first argument. But, perhaps regrettably, ADQL has no notion of literals with units, and so there is no way to meaningfully say the equivalent of gavo_specconv(656, 'Hz') to get Hα in Hz, and you will receive a (hopefully helpful) error message if you try that.

    However, this functionality is highly desirable not the least because the queries above are fairly inefficient. That's why I added the funny constraints on the collection: without them, the queries will take perhaps half a minute and thus require async operation on my box.

    The (fundamental) reason for that is that postgres is not smart enough to work out it could be using an index on em_min and em_max if it sees something like nu between 3e8/em_min and 3e7/em_max by re-writing the constraint into 3e8/nu between em_min and em_max (and think really hard about whether this is equivalent in the presence of NULLs). To be sure, I will not teach that to my translation layer either. Not using indexes, however, is a recipe for slow queries when the obscore table you query has about 85 million rows (hi there in 2050: yes, that was a sizable table in our day).

    To let users fix what's too hard for postgres (or, for that matter, the translation engine when it cannot figure out units), there is a second form of gavo_specconv that takes a third argument: gavo_specconv(expr, unit_of_expr, target_unit). With that, you can write queries like:

    SELECT TOP 5 gavo_specconv(em_min, 'Angstrom') AS nu
    FROM ivoa.obscore
    WHERE gavo_specconv(5000, 'Angstrom', 'm')
      BETWEEN em_min AND em_max
    

    and hope the planner will use indexes. Full disclosure: Right now, I don't have indexes on the spectral limits of all tables contributing to my obscore table, so this particular query only looks fast because it's easy to find five datasets covering 500 nm – but that's an oversight I'll fix soon.

    Of course, to make this functionality useful in practice, it needs to be available on all obscore services (say) – only then can people run all-VO obscore searches without the optical bias. The next step (before Bambi-eyeing the TAP implementors) therefore would be to get it into the catalogue of ADQL user defined functions.

    For this, one would need to specify a bit more carefully what units must minimally be supported. In DaCHS, I have built this on a full implementation of VOUnits, which means you can query using attoparsecs of wavelength and get your result in dekaerg (which is a microjoule: 1 daerg = 1 uJ in VOUnits – don't you just love this?):

    SELECT gavo_specconv(
      (spectral_start+spectral_end)/2, 'daerg')
      AS energy
    FROM rr.stc_spectral
    WHERE gavo_specconv(0.0002, 'apc', 'J')
      BETWEEN spectral_start AND spectral_end
    

    (stop computing: an attoparsec is about 3 cm). This, incidentally, queries the draft RegTAP extension for the VODataService 1.2 coverage in space, time, and spectrum, which is another reason I'm proposing this function: I'm not quite sure how well my rationale that using Joules of energy is equally inconvenient for all communities will be generally received. The real rationale – that Joule is the SI unit for energy – I don't dare bring forward in the first place.

    Playing with wavelengths in AU (you can do that, too; note, though, that VOUnit forbids prefixes on AU, so don't even try mAU) is perhaps entertaining in a slightly twisted way, but admittedly poses a bit of a challenge in implementation when one does not have full VOUnits available. I'm currently thinking that m, nm, Angstrom, MHz, GHz, keV and MeV (ach! No Joule! But no erg, either!) plus whatever spectral units are in use in the local tables would about cover our use cases. But I'd be curious what other people think.

    Since I found the implementation of this a bit more challenging than I had at first expected, let me say a few words on how the underlying code works; I guess you can stop reading here unless you are planning to implement something like this.

    The fundamental trouble is that spectral conversions are non-linear. That means that what I do for ADQL's IN_UNIT – just compute a conversion factor and then multiply that to whatever expression is in its first argument – will not work. Instead, one has to write a new expression. And building these expressions becomes involved because there are thousands of possible combinations of input and output units.

    What I ended up doing is adopting standard (i.e., SI) units for energy (J), wavelength (m), and frequency (Hz) as common bases, and then first convert the source and target units to the applicable standard unit. This entails trying to convert each input unit to each standard unit until a conversion actually works, which in DaCHS' Python looks like this:

    def toStdUnit(fromUnit):
        for stdUnit in ["J", "Hz", "m"]:
            try:
                 factor = base.computeConversionFactor(
                     fromUnit, stdUnit)
            except base.IncompatibleUnits:
                continue
            return stdUnit, factor
    
        raise common.UfuncError(
            f"specconv: {fromUnit} is not a spectral unit understood here")
    

    The VOUnits code is hidden away in base.computeConversionFactor, which raises an IncompatibleUnits when a conversion is impossible; hence, in the end, as a by-product this function also determines what kind of spectral value (energy, frequency, or wavelength) I am dealing with.

    That accomplished, all I need to do is look up the conversions between the basic units, which can be done in a single dictionary mapping pairs of standard units to the conversion expression templates. I have not tried to make these templates particularly pretty, but if you squint, you can still, I hope, figure out this is actually what the opening image shows:

    SPEC_CONVERSION = {
        ("J", "m"): "h*c/(({expr})*{f})",
        ("J", "Hz"): "({expr})*{f}/h",
        ("J", "J"): "({expr})*{f}",
        ("Hz", "m"): "c/({expr})/{f}",
        ("Hz", "Hz"): "{f}*({expr})",
        ("Hz", "J"): "h*{f}*({expr})",
        ("m", "m"): "{f}*({expr})",
        ("m", "Hz"): "c/({expr})/{f}",
        ("m", "J"): "h*c/({expr})/{f}",}
    

    expr is (conceptually) replaced by the first argument of the UDF, and f is the conversion factor between the input unit and the unit expr is in. Note that thankfully, no additive operators are involved and thus all this is numerically well-conditioned. Hence, I can afford not attempting to simplify any of the expressions involved.

    The rest is essentially book-keeping, where I'm using the ADQL parser to turn the expression into a tree fragment and then fiddling in the tree fragment for expr into that. The result then replaces the UDF function call in the syntax tree. You can review all this in context in DaCHS' ufunctions.py, starting at the definition of toStdUnit.

    Sure: this is no Turing award material. But perhaps these notes are useful when people want to put this kind of thing into their ADQL engines. Which I'd consider a Really Good Thing™.

« Page 2 / 6 »