• Gaia DR3 XP Spectra: All Sampled

    Lots of blue crosses and a few red squares plotted over a sky photograph of a star cluster

    Around this time of the year in the northern hemisphere, you can spot the h and χ Persei double star cluster with the naked eye. One part of it, NGC 884, is shown here with LAMOST DR6 low-resolution spectra (red squares) and Gaia DR3 XP spectra (blue crosses) overplotted. Given that LAMOST already is one of the largest collections of spectra on the planet, you can see that there really are a lot of those XP spectra.

    When Gaia DR3 was released in June, I was somewhat disappointed when I realised what it is that they delivered as the BP/RP (or XP for short) spectra. You see, I had expected to see something rather similar to what I have in DFBS: structurally, arrays of a few dozen spectral points, mapping wavelengths to some sort of measure of the flux.

    What really came were, mainly, “continuous spectra”, that is, coefficients of Gauss-Hermite polynomials. You can fetch them from the gaiadr3.xp_continuous_mean_spectrum table at the ARI-Gaia TAP service; the blue part of the spectrum of the star DR3 4295806720 looks like this in there:

    102.93398893929992, -12.336921213781045, -2.668856168170544, -0.12631176306793765, -0.9347021092539146, 0.05636787290132809, [...]

    No common spectral client can plot this. The Gaia DPAC has helpfully provided a Python library called GaiaXPy to turn these into “proper” spectra. Shortly after the data release, my plan has thus been to turn all these spectra into their “sampled” form using GaiaXPy and then re-publish them, both through SSAP for ad-hoc discovery and through TAP for (potentially) global analysis.

    Alas, for objects too faint to make it into DR3's xp_sampled_mean_spectrum table (that's 35 million spectra already turned to wavelength-flux pairs by DPAC), the spectra generated in this way looked fairly awful, with lots of very artificial-looking wiggles (“ringing”, if you will). After a bit of deliberation, I realised that when the errors are given on the Hermite coefficients, once you compute the samples, these errors will be liberally distributed among the output samples. In other words, the error on the samples will be grossly correlated over arbitrary distances; at least I am fairly helpless when trying to separate signal from artefact in these beasts.

    Bummer. Well, fortunately, Rene Andrae from “up the mountain” (i.e., the MPI for Astronomy) has worked out a reasonably elegant way to get more conventional spectra understandable to mere humans. Basically, you compute n distinct “realisations” of the error model given by the table of the continuous spectra and average over them. The more samples you take, the less correlated your spectral points and their errors will be and the less confusing the signal will be. The service docs for gaia/s3 give the math.
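
    To make the idea a bit more tangible, here is a toy sketch in Python. It is not what the service actually runs: the basis (plain orthonormal Hermite functions), the assumption of independent Gaussian errors on the coefficients, and all function names are mine, chosen only to keep the example self-contained. The real error model has correlated coefficients, and the service docs have the actual math.

```python
import math
import random
import statistics


def hermite_functions(x, n_funcs):
    """Return the first n_funcs orthonormal Hermite functions at x.

    This is a toy basis for illustration, not the real XP basis.
    """
    values = [math.pi ** -0.25 * math.exp(-x * x / 2)]
    if n_funcs > 1:
        values.append(math.sqrt(2) * x * values[0])
    # three-term recurrence for the higher orders
    for n in range(2, n_funcs):
        values.append(
            math.sqrt(2 / n) * x * values[n - 1]
            - math.sqrt((n - 1) / n) * values[n - 2])
    return values[:n_funcs]


def sample_spectrum(coeffs, grid):
    """Evaluate the expansion given by coeffs at each point of grid."""
    return [
        sum(c * b for c, b in zip(coeffs, hermite_functions(x, len(coeffs))))
        for x in grid]


def averaged_spectrum(coeffs, coeff_errors, grid, n_realisations=10, rng=None):
    """Average n_realisations sampled spectra drawn from the error model.

    Assumption: independent Gaussian errors on the coefficients.
    Returns (flux, flux_error), each with one value per grid point.
    """
    rng = rng or random.Random()
    realisations = []
    for _ in range(n_realisations):
        perturbed = [rng.gauss(c, e) for c, e in zip(coeffs, coeff_errors)]
        realisations.append(sample_spectrum(perturbed, grid))
    per_point = list(zip(*realisations))
    flux = [statistics.fmean(p) for p in per_point]
    flux_error = [statistics.stdev(p) for p in per_point]
    return flux, flux_error
```

    The scatter between the realisations is used as the error of each output point here; that is a design choice of this sketch and needs at least two realisations.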

    Doing this on more than 200 million spectra is quite an effort, though, and so after some experimentation I decided to settle on 10 realisations per spectrum and have relatively wide bins (10 nm) over just the optical part of the spectrum (400 through 800 nm). The BP and RP passbands are a bit wider, and there is probably signal blotted out by the wide bins; I will probably be addressing this for DR4, except if these spectra become the smash hit they deserve to be.
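
    In case you want the resulting spectral axis in a script: 10 nm steps from 400 through 800 nm work out to 41 points, which matches the sequence(41, 400, 10) expression in the footnote on the spectral axis. A minimal Python sketch (the function name is made up for this example):

```python
def spectral_axis_nm(start=400, stop=800, step=10):
    """Return the sample wavelengths in nm, both ends included."""
    n_points = (stop - start) // step + 1
    return [start + i * step for i in range(n_points)]


axis_nm = spectral_axis_nm()
# 1 nm is 10 Angstrom
axis_angstrom = [10 * w for w in axis_nm]
print(len(axis_nm), axis_nm[0], axis_nm[-1])  # prints: 41 400 800
```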

    The result of this procedure is now available through an SSAP service that should show up in the VO Registry by the time the first of you read this; the Aladin image above gives you an impression of the density of results here – and don't forget: the spectra with the blue crosses are all reasonably well flux-calibrated.

    The data is also available on the TAP service http://dc.g-vo.org/tap, which opens up many interesting possibilities. Let me mention two here.

    Comparison with LAMOST

    I was rather nervous whether what I had done resulted in anything that bore even a fleeting resemblance to reality, and so about the first thing I tried was to compare my new data with what LAMOST has.

    That is a nice exercise for TAP and ADQL. Let's first match spectra from the two surveys, which luckily are on the same server, saving us some cross-server uploads. I am selecting a minimum of data, just the position and the two access URLs, and I let DaCHS' MAXREC kick in so I'm just retrieving 20000 of the millions of result records:

    SELECT a.ssa_location, a.accref, b.accref
    FROM
      gdr3spec.ssameta AS a
      JOIN lamost6.ssa_lrs AS b
      ON DISTANCE(a.ssa_location, b.ssa_location)<0.001
    

    (this is using the DISTANCE(.,.)<radius idiom that we will be migrating towards in ADQL 2.1 instead of the dreaded 1=CONTAINS(POINT, CIRCLE) thing everyone has loathed in ADQL 2.0).
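
    If you would rather script this than use TOPCAT, the query can go to the service's synchronous TAP endpoint. The following Python sketch only builds the request URL and does not touch the network; REQUEST, LANG, QUERY, and MAXREC are standard TAP parameters, while the function name and the choice to pass MAXREC explicitly (rather than relying on the server default as I did above) are assumptions of this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TAP_URL = "http://dc.g-vo.org/tap"

MATCH_QUERY = """
SELECT a.ssa_location, a.accref, b.accref
FROM
  gdr3spec.ssameta AS a
  JOIN lamost6.ssa_lrs AS b
  ON DISTANCE(a.ssa_location, b.ssa_location)<0.001
"""


def build_sync_url(base_url, adql, maxrec=None):
    """Return the URL of a synchronous TAP query for the ADQL in adql."""
    params = {
        "REQUEST": "doQuery",
        "LANG": "ADQL",
        "QUERY": adql.strip(),
    }
    if maxrec is not None:
        # ask the server to truncate the result after maxrec rows
        params["MAXREC"] = str(maxrec)
    return base_url.rstrip("/") + "/sync?" + urlencode(params)


url = build_sync_url(TAP_URL, MATCH_QUERY, maxrec=20000)
print(url)
```

    Any HTTP client can then retrieve that URL; the response is a VOTable unless you ask for something else.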

    Using the nifty activation actions, you can now tell TOPCAT to open the two spectra next to each other when you click on a row or a point in a sky plot. To reproduce,

    1. Make a sky plot. TOPCAT doesn't yet pick up the POINT in ssa_location, so you have to configure the Lon and Lat fields yourself to ssa_location[0] and ssa_location[1].
    2. Open the activation actions, either from the button bar or from the Views menu.
    3. In there, select Plot Table, make sure it says accref in Table Location and then check Plot Table in the Actions pane. When you now click on a point in the sky plot, you should see a spectrum pop up, except it is plotted with dots, which most people consider inappropriate for spectra. Use the Form tab in the plot window to style it a bit more spectrum-like (I recommend looking into Line and XYError).
    4. But how do you now add the LAMOST plot? I don't think TOPCAT's activation actions let you plot right into the plane plot you just configured. But you can add a second Plot Table action from the Actions menu in the window with the activation actions. As before, configure this new item, except this one needs to plot accref_ (which is what DaCHS has called the access reference for LAMOST to keep the names unique).
    5. As for Gaia, configure the plot to look good as a spectrum. In order to make the two spectra optically comparable, under Axes set the range to 4000 to 8000 Angstrom manually here.

    You can now click on points in your sky plot and, after a second or so, see the corresponding spectra next to each other (if you place the two plot windows that way).

    If you try this, you will (hopefully) see that major features of spectra are nicely reproduced, such as with these, I guess, molecular bands:

    Two line plots next to each other, the right one showing more features.  The left one roughly follows the major wiggles, though.

    As you probably have guessed, the extremely low-resolution Gaia XP spectrum is left, LAMOST's (somewhat higher-resolution) low-resolution spectrum is right.

    This also works with absorption in the blue, as in this example:

    Two line plots next to each other, the right one showing a lot of relatively sharp absorption lines, which the left one does not have.  A few major bumps are present in both, and the general shape coincides nicely, except perhaps at the blue edge.

    In case of doubt, I have to say I'd probably trust Gaia's calibration around 400 nm better than LAMOST's. But that's mere guesswork.

    For fainter objects, you will see remnants of the systematic wiggles from the Hermite polynomials:

    Two line plots next to each other.  Both are relatively noisy, in particular on the blue edge.  The left one also seems to have a rather regular oscillation at the blue edge.

    Anyway, if you keep an eye on the errors, you can probably even work with spectra from the fainter objects:

    Two line plots next to each other.  The left one has fairly strong ringing which is not present in the right one, but it mainly stays within the error bars.  The total flux of this star is at least a factor of 10 less than for the prettier examples above.

    Mass Retrieval of Spectra

    One nice thing about the short spectra is that you can fetch many of them in one go and in very little time. For instance, to retrieve particularly red objects from the Gaia catalogue of Nearby Stars (also on the GAVO server) with spectra, say:

    SELECT
      source_id, ra, dec, parallax, phot_g_mean_mag,
      phot_bp_mean_mag, phot_rp_mean_mag, ruwe, adoptedrv,
      flux, flux_error
    FROM gcns.main
    JOIN gdr3spec.spectra
    USING (source_id)
    WHERE phot_rp_mean_mag<phot_bp_mean_mag-4
    

    [in case you wonder how I quickly got the column names enumerated here: do control-clicks into the Columns pane in TOPCAT's TAP window and then use the Cols button]. For when you do not have Gaia DR3 source_id-s in your source table, there is also gdr3spec.withpos against which you can do more conventional positional crossmatches.

    Within a few seconds, you can retrieve more than 4000 spectra in this way. You can now do whatever analysis you want on these spectra. Or, well, just plot them. The following procedure for that latter task uses TOPCAT features only available in the next release, due before mid-October[1].

    First, make a colour-magnitude diagram (CMD) from this table as usual (e.g., BP-RP vs G). Then, open another plane plot and

    1. Layers -> Add XYArray Control
    2. Configure the XYArray to plot from the table you just fetched, have nothing in X Values[2] and flux in Y Values.
    3. Under Axes, configure Y Log in order to better show the 4253 spectra at one time.
    4. Throw away or at least uncheck all other layers in the plot.
    5. In order to let TOPCAT highlight the spectrum of the activated source, in the Subsets pane check the Activated subset (that's the bleeding-edge functionality you will not have in older TOPCATs) and give it a sufficiently bright colour.

    With that, you can now click around in your CMD and immediately see that source's spectrum in the context of all the others, like this:

    An animation of someone selecting various points in a CMD and having simultaneous spectra plotted.

    These spectra have also inspired me to design and implement a vector extension for ADQL, which lets you do even more interesting things with these spectra. More on this… soon.

    [1] The Activated subset is only available in TOPCAT versions later than 4.8-7 (released in October 2022).
    [2] These should be the spectral points; DaCHS does not deliver them with this query because I am a coward. I think I will find my courage relatively soon and then fix this. Once that has happened, you can select param$spectral as X values. [Update: Mark Taylor remarks that by writing sequence(41, 400, 10) in bleeding-edge TOPCATs and add(multiply(10,sequence(41)),400) before that, you can add a proper spectral axis until then]
  • Find a Dust-Free Window Using ADQL

    Five sky images, all of them showing star clusters

    Five of the seven patches of the sky that Bayestar 17 considers least obscured by dust in Aladin's WISE color HiPSes. There clearly is a pattern here. This post is about how you'll find these (and the credible ones, too).

    The upcoming AG-Tagung in Bremen will have another puzzler, and while concocting the problem I needed to find a spot on the sky where there is very little interstellar extinction. What looks like a quick query turned out to require a few ADQL tricks that I thought I might show in this little post; they will come in handy in many situations.

    First, I needed to find data on where on the sky there is dust. Had I not known about the extinction maps I've blogged about in 2018, I would probably have looked for extinction maps in the Registry, which might have led me to the Bayestar 17 map on my service eventually, too. The way it was, I immediately fired up TOPCAT and pointed it to the TAP service at http://dc.g-vo.org/tap (the “GAVO DC TAP” of the TAP service list) and went to the column metadata of the prdust.map_union table.

    Browsing the descriptions, the relevant columns here are healpix (which will give me the position) and best_fit. That latter thing is an array of reddening E(B − V) (i.e., higher values mean more dust) per distance bin, where the bins are 0.5 mag of distance modulus wide. I decided I'd settle for bin 20, corresponding to a kiloparsec. Dust further away than that will not trouble me much in the puzzler.
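
    As a reminder of how distance modulus and distance hang together, here is a small Python sketch of the standard relation d = 10^(μ/5 + 1) pc. How the array index maps to a distance modulus is defined in the table metadata and is not assumed here; the function name is made up.

```python
def distance_pc(distance_modulus):
    """Return the distance in parsec for a distance modulus m - M."""
    return 10 ** (distance_modulus / 5 + 1)


# a distance modulus of 10 is one kiloparsec
print(distance_pc(10))
# each 0.5 mag step in distance modulus scales the distance by 10**0.1
print(distance_pc(10.5) / distance_pc(10))
```

    So bins that are 0.5 mag wide in distance modulus grow by about 26% in distance from one bin to the next.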

    Finding the healpixes in the rows with the smallest best_fit[20] should be easy; it is a minor variant of a classic from the ADQL course:

    SELECT TOP 20 healpix
    FROM prdust.map_union
    ORDER BY best_fit[20] ASC
    

    Except that my box replies with an error message reading “Expected end of text, found '[' (at char 61), (line:3, col:18)”.

    Huh? Well… if you look, then the problem is where I ask to sort by an array element. And indeed, it turns out that DaCHS, the software driving this site, will not let you sort by array elements yet. This is arguably a bug, and in all likelihood I will have fixed it by the time you read this. But there is a technique to defeat this and similar cases that every astronomer should know about: subqueries, which turn any query into something you can work with as if it were a table. In this case:

    SELECT TOP 30 healpix, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    – the “AS q” gives the “virtual” table resulting from the query a name. It is mandatory here. Do not be tempted to leave out the “AS”: that this is even legal is one of the major blunders of the SQL standard.
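
    The subquery pattern is not specific to ADQL, and you can try it without any server at all. The following Python sketch uses the sqlite3 module from the standard library with a made-up three-row table; the column names bin_a and bin_b and all values are toy data, and since SQLite has no TOP, it uses LIMIT instead.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE map_union (healpix INTEGER, bin_a REAL, bin_b REAL)")
# toy rows, not actual Bayestar values
con.executemany(
    "INSERT INTO map_union VALUES (?, ?, ?)",
    [(1, 0.30, 0.02), (2, 0.10, 0.05), (3, 0.20, 0.01)])

# the inner query names the value we care about; the outer one can then
# sort by that name as if it were an ordinary column
rows = con.execute("""
    SELECT healpix, extinction
    FROM (
      SELECT healpix, bin_b AS extinction
      FROM map_union) AS q
    ORDER BY extinction ASC
    LIMIT 2
""").fetchall()
print(rows)  # prints: [(3, 0.01), (1, 0.02)]
```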

    The result is looking good:

    # healpix extinction
    1021402 0.00479
    1021403 0.0068
    418619  0.00707
    ...
    

    – so, we have the healpixes for which the extinction works out to be minimal. It is also reassuring that the two healpixes with the clearest sky (by this metric) are next to each other – where there are clear skies, it's likely that there are more clear skies nearby.

    But then… where exactly are these patches? The column description says “The healpix (in galactic l, b) for which this data applies. This is of the order given in the hpx_order column”. Hm.

    To go from HEALPix to positions, there is the ivo_healpix_center user defined function (UDF) on many ADQL services; it is part of the IVOA's UDF catalogue, so whenever you see it, it will do the same thing. And where would you see it? Well, in TOPCAT, UDFs show up in the Service tab with a signature and a short description. In this case:

    ivo_healpix_center(hpxOrder INTEGER, hpxIndex BIGINT) -> POINT
    
      returns a POINT corresponding to the center of the healpix with the
      given index at the given order.
    

    With this, we can change our query to spit out positions rather than indices:

    SELECT TOP 30 ivo_healpix_center(hpx_order, healpix) AS pos, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    The result is:

    # pos                                    extinction
    "(42.27822580645164, 78.65148926014334)" 0.00479
    "(42.44939271255061, 78.6973986631694)"  0.0068
    "(58.97460937500027, 40.86635677386179)" 0.00707
    ...
    

    That's my positions all right, but they are still in galactic coordinates. That may be fine for many applications, but I'd like to have them in ICRS. Transforming them takes another UDF; this one is not yet standardised and hence has a gavo_ prefix (which means you will only find it on reasonably new services driven by DaCHS).

    On services that have that UDF (and the GAVO DC TAP certainly is one of them), you can write:

    SELECT TOP 30
      gavo_transform('GALACTIC', 'ICRS',
        ivo_healpix_center(hpx_order, healpix)) AS pos,
      extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    That results in:

    # pos                                    extinction
    "(205.6104289782676, 28.392541949473785)" 0.00479
    "(205.55600830161907, 28.42330388161418)" 0.0068
    "(250.47595812552925, 36.43011215633786)" 0.00707
    "(166.10872483007287, 21.232866316024364)" 0.00714
    "(259.3314211312357, 43.09275090468469)" 0.00742
    "(114.66957763676628, 21.603135736808532)" 0.00787
    "(229.69174233173712, 2.0244022486718793)" 0.00793
    "(214.85349325052758, 33.6802370378023)" 0.00804
    "(204.8352084989552, 36.95716352922782)" 0.00806
    "(215.95667870050661, 36.559656879148044)" 0.00839
    "(229.66068062277128, 2.142516479012763)" 0.0084
    "(219.72263539838667, 58.371829835018424)" 0.00844
    ...
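
    If you are on a service without such a UDF, the transformation is easy enough to do client-side. Here is a Python sketch using the rotation matrix between ICRS and Galactic coordinates as given in the Hipparcos and Gaia documentation; I do not claim that this is what gavo_transform does internally, and the function name is made up.

```python
import math

# Rotation taking ICRS unit vectors to Galactic ones; being a rotation,
# its transpose is its inverse.
ICRS_TO_GALACTIC = (
    (-0.0548755604162154, -0.8734370902348850, -0.4838350155487132),
    (+0.4941094278755837, -0.4448296299600112, +0.7469822444972189),
    (-0.8676661490190047, -0.1980763734312015, +0.4559837761750669),
)


def galactic_to_icrs(l_deg, b_deg):
    """Return (ra, dec) in degrees for galactic l, b in degrees."""
    l, b = math.radians(l_deg), math.radians(b_deg)
    gal = (
        math.cos(b) * math.cos(l),
        math.cos(b) * math.sin(l),
        math.sin(b))
    # multiply with the transposed matrix to go from Galactic to ICRS
    x, y, z = (
        sum(ICRS_TO_GALACTIC[i][j] * gal[i] for i in range(3))
        for j in range(3))
    ra = math.degrees(math.atan2(y, x)) % 360
    # clamp against rounding before taking the arc sine
    dec = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return ra, dec


print(galactic_to_icrs(42.27822580645164, 78.65148926014334))
```

    For the first galactic position from the previous result, this should come out within a small fraction of a degree of the first ICRS row above.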
    

    If you have followed along, you now have a table of the 30 least reddened patches in the sky according to Bayestar17. And you are probably as curious to see them as I was. That curiosity made me start Aladin and select WISE colour imagery, reckoning dust (at the right temperature) would be more conspicuous in WISE's wavelengths than in, say, DSS.

    I then did Views -> Activation Actions and wanted to check “Send Sky Coordinates” to make Aladin show the sky at the position of my patches. This is usually preconfigured by TOPCAT to just work when tables have positions. Alas: at least in versions up to 4.8, TOPCAT does not know about points (in the ADQL sense) when making clever guesses there.

    But there is a workaround: Select “Send Sky Coordinates” in the Activation Actions window and then type pos[0] in “RA Column”, and pos[1] in “Dec Column” – this works because under the hood, VOTable points are just 2-arrays. That done, you can check the activation action.

    After these preparations, when you click through the first few results, you will find objects like those in the opening image (and also a few fairly empty fields). Stellar clusters are relatively rare on the sky, so their prevalence in these patches quite clearly shows that Bayestar's model has a bit of a fixation about them that's certainly not related to dust.

    Which goes to serve as another example of Demleitner's law 567: “In any table, the instances with the most extreme values are broken with a likelihood of 0.567”.

  • What's new in DaCHS 2.6

    Rainbowy image with a DaCHS logo

    The transitions of four-times ionised Technetium, with the energies of the lower and upper states on the two axes and the colour a measure of the frequency of the emitted light. Well: DaCHS 2.6 has preliminary support for LineTAP.

    After six months of development, I have just released DaCHS 2.6. This blog post is the traditional discussion of major news for operators of DaCHS-based services. Also have a look at the changelog, which has finally made it to the Debian package; if you installed from package, you can now read it using zless /usr/share/doc/python3-gavo/changelog.gz.

    This post's title picture alludes to LineTAP, an upcoming standard for disseminating data on spectral lines intended to obviate SLAP and play nicely with VAMDC. The standard so far only exists as a rather preliminary draft, but there should be a working draft soon-ish. If you have line data to publish or can get your hands on some, consider trying //linetap#table-0 (the “-0” suggests that there will be changes, but I'd hope not terribly many).

    Quite a few changes resulted from a seemingly minor user request: “How do I put a form interface in front of my EPN-TAP table?” I rather foolishly chose to use the obscore table as an example, which was about the worst choice I could have made, as ivoa.obscore is a view in DaCHS (which means, for instance, that you can't simply add indexes), and a rather large one in Heidelberg at that (more than 80 Megarows, which means that without indexes, interactive services are impossible).

    The first change in that direction was supporting form conditions over pairs of columns; you need that whenever your table has intervals in column pairs, as for instance em_min/em_max in obscore. With the new code, when users write something like 8000 .. 10000, you can instruct DaCHS to translate that into SQL computing whether or not the intervals overlap.
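
    The overlap test itself is simple: two closed intervals overlap exactly when each one starts before the other one ends. The following Python sketch shows how a range expression could be turned into such a condition. It is an illustration, not DaCHS' actual code; the function names and the pyformat-style placeholders are assumptions, and unit conversion is left out entirely.

```python
def parse_range(expr):
    """Parse an expression like '8000 .. 10000' into (lo, hi) floats."""
    lo, sep, hi = expr.partition("..")
    if not sep:
        raise ValueError("expected something like '8000 .. 10000'")
    lo, hi = float(lo), float(hi)
    if lo > hi:
        raise ValueError("lower limit exceeds upper limit")
    return lo, hi


def overlap_condition(expr, min_col, max_col):
    """Return (sql, parameters) selecting rows whose interval
    [min_col, max_col] overlaps the range given in expr."""
    lo, hi = parse_range(expr)
    sql = f"{min_col} <= %(hi)s AND {max_col} >= %(lo)s"
    return sql, {"lo": lo, "hi": hi}


def intervals_overlap(a_lo, a_hi, b_lo, b_hi):
    """The same test in plain Python, for checking the logic."""
    return a_lo <= b_hi and a_hi >= b_lo


print(overlap_condition("8000 .. 10000", "em_min", "em_max"))
```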

    The spectral queries from that form still timed out, even after I had made sure there were indexes on the larger contributing tables' spectral columns. The reason for that was that the obscore mixin cast the spectral coordinates to double precision[1], and even if there is an index on a real-valued my_col, a condition like:

    my_col::double precision < 4
    

    will not use the index (unless it were over the cast expression, of course). I have hence shortened a few obscore columns (specifically, s_fov, s_resolution, em_min, em_max, em_res_power, and s_pixel_scale) to real; that's what they are in SSAP, and for now I cannot see a case where these would need to be double precision in a discovery protocol.

    Having this service reminded me that registering obscore as an independent resource (rather than just as a table in a tap service's tableset) was something I've been wanting to tackle for quite a time now. This needs proper metadata, in particular coverage metadata. Determining the coverage of obscore is now possible (run dachs limits //obscore), and using codeItems (more or less explicitly), you can inject that metadata where you need it.

    The cover story (“use case,” if you will) underlying this form-based service on top of obscore that started all that was that it was supposed to be friendly to optical astronomers, who by and large are still stuck with Ångström (that is, 10⁻¹⁰ m), and hence I wanted to write the spectral information in Ångström, too. In this case, the old displayUnit display hint would have done (because Obscore uses wavelengths, too), but by the time I noticed that, I had already written a spectralUnit display hint. With that, you can write something like:

    <column name="e_min"
      unit="J"
      description="Lower energy in the spectrum"
      displayHint="spectralUnit=Angstrom"/>
    

    This would convert e_min to Ångström when written to HTML table (but not otherwise, following the assumption that non-HTML data will be consumed by machines that have no use for legacy units).
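
    For the record, the physics behind such a conversion is λ = hc/E. A minimal Python sketch (not DaCHS' implementation; the function name is made up):

```python
PLANCK_H = 6.62607015e-34  # J s, exact in the 2019 SI
LIGHT_C = 299792458.0      # m/s, exact


def energy_to_angstrom(energy_joule):
    """Return the wavelength in Angstrom of a photon of the given energy."""
    if energy_joule <= 0:
        raise ValueError("photon energies must be positive")
    wavelength_m = PLANCK_H * LIGHT_C / energy_joule
    return wavelength_m * 1e10


# a photon with a wavelength of 500 nm, i.e., 5000 Angstrom
print(energy_to_angstrom(PLANCK_H * LIGHT_C / 5e-7))
```

    Note that the relation is inverse: the lower limit of an energy interval turns into the upper limit of the corresponding wavelength interval.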

    Talking about HTML: If your root template is derived from root-tree.html (it is not unless you made it so), you have to apply a minor update to it; locate the tmpl_resDetails “script” (it's actually some HTML) in /var/gavo/web/templates/root.html. In there, there's a $description, which for the javascript templater that interprets this thing means “insert the content of the description field, properly escaping it”. Since 2.6, however, DaCHS produces these descriptions in HTML. That's progress, since these descriptions often contain links or other formatting. But it means that you have to tell the templater to not escape things: Just write $!description instead.

    There are a few new things you can do in RDs. First, there are relocatable RDs: It is now recommended to have resdir="." in the opening resource (and dachs start's templates are nudging you to do that). Without that, the resource directory defaults to inputsDir/<schema>, which breaks as soon as you need to rename that directory. Now: renaming resource directories is never easy in DaCHS (for instance, because they are reflected in URLs). But for instance with mirrors, or when forking a resource, such renames happen, and relocatable RDs make that a lot simpler. You can obtain the current value of the resource directory from the new \resdir macro.

    Then, by popular request, you can now have index options. If you look at the documentation for create index in the postgres docs, you will notice that there are quite a few things you can do to an index. Acquainting DaCHS' index element with all of these seemed wrong to me, in particular because most of these things are only interesting in rather special circumstances beyond DaCHS' control. Instead, you can now add option elements to an index to change its behaviour, each of which can reflect some postgres configuration item. DaCHS will order your fragments so the resulting command fits Postgres' grammar.
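
    To illustrate what ordering the fragments means: in Postgres' CREATE INDEX, a WITH clause has to come before TABLESPACE, which in turn has to come before WHERE. The following Python sketch sorts option fragments accordingly. It is my illustration of the idea rather than DaCHS' code, it only knows these three clauses, and the index and table names are made up.

```python
# trailing clauses of CREATE INDEX in the order Postgres expects them
CLAUSE_ORDER = ("WITH", "TABLESPACE", "WHERE")


def clause_rank(fragment):
    """Return the position of fragment's leading keyword in CLAUSE_ORDER."""
    text = fragment.strip().upper()
    for rank, keyword in enumerate(CLAUSE_ORDER):
        if text.startswith(keyword):
            return rank
    raise ValueError(f"unsupported index option: {fragment!r}")


def create_index_sql(index_name, table_name, columns, options=()):
    """Assemble a CREATE INDEX statement with options in grammar order."""
    parts = [
        f"CREATE INDEX {index_name} ON {table_name} ({', '.join(columns)})"]
    parts.extend(sorted((o.strip() for o in options), key=clause_rank))
    return " ".join(parts)


print(create_index_sql(
    "mytable_foo", "mytable", ["foo"],
    ["TABLESPACE fast", "WITH (fillfactor=100)"]))
```

    Regardless of the order in which the options are given, the WITH clause ends up in front of the TABLESPACE clause.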

    Since this is somewhat low-level, I recommend isolating the details in userconfig. For instance, you could add streams there saying:

    <STREAM id="staticindex">
      <doc>For indexes on tables that never change, save about 10% storage
      by feeding this.</doc>
      <option>WITH (fillfactor=100)</option>
    </STREAM>
    
    <STREAM id="onfastdisk">
      <doc>FEED this into an index to let it live on a fast disk</doc>
      <option>TABLESPACE fast</option>
    </STREAM>
    

    (the second stream assumes you have set up such a tablespace). You could then configure your indexes like this:

    <index columns="foo">
      <FEED source="%#staticindex"/>
      <FEED source="%#onfastdisk"/>
    </index>
    

    A feature I have put in mainly because of, say, due diligence is that you can now store the administrator password as a hash in /etc/gavo.rc. This has the advantage that people that get to read your configuration cannot (reasonably) become administrators on DaCHS' web interface; I'd consider the hash strong enough that you could put that into version control. Of course, that administrator can't do all that much in the first place.

    The drawback of hashing the admin password is that then DaCHS itself cannot use the password to authenticate against a running server. That is not a disaster, but it will keep it from automatically discarding the root page on changes and automatically clearing a few caches when you import a resource.

    As usual, there are many other changes; let me mention

    • the modern VOTables from SCS I have celebrated here before,
    • the makeIAUId(prefix, long, lat) rowmaker function that makes creating IAU-compliant identifiers a bit simpler,
    • a function utils.formatFloat that may be helpful when producing human-readable floating-point numbers (it's not in gavo.api yet, but I think it will migrate there),
    • the statistics property on columns that you can set to enumerate on TEXT-typed columns to make DaCHS collect preliminary statistics on those (more on that in a later post),
    • the -d option to dachs limits to dump the column statistics DaCHS has gathered (see the DaCHS 2.4 announcement for more on these stats), and
    • that the maximum order of a MOC is now given in ASCII-MOCs DaCHS produces.

    With this: If you have GAVO's repository enabled, you will get DaCHS 2.6 with the next apt upgrade. I will also try to get it into the Debian backports, and if I manage that, you will read about it on this blog.

    [1]

    In case you wonder why it did that: The obscore mixin basically fills out templates like:

    CAST(\em_min AS real) AS em_min,
    CAST(\em_max AS real) AS em_max,
    

    where the macro replacements are taken from whatever you give in the mixin's parameters. Now, if \em_min happens to work out to NULL, Postgres just picks any old type (text, IIRC) for the corresponding column. That is not a problem until the result of that table definition is UNION-ed together with another table where \em_min is a proper floating point number: Postgres will then complain about incompatible types in a union. To avoid that, I must give a type to anything contributing to the obscore view.

  • It's Interop Time Again

    A slide with lots of XML on it

    A little ego booster in DAL I: Baptiste and Chloé discuss a feature for incremental harvesting of remote databases using odbcGrammar that I have implanted into DaCHS late last year.

    This morning at seven CEST the first Interop of this year started: It's time again for everyone involved in the VO to come together, tell each other what happened since the last Interop and plan for the next steps. The meeting is purely digital again, and again the schedule is a bit crazy in order to evenly spread time pains across the globe: there are sessions in the relatively early morning CET, in the late afternoon, and fairly late at night.

    Fairly late at night (by my standards) is now, when I'm listening to the talks in a session of the Data Access Layer working group trying to work out how to do multiple cutouts in one request using SODA, something I've been rather skeptical about while we were coming up with the spec in the mid-2010s: Going from “single value” to “sequence” generally complicates matters by something like an order of magnitude, and with HTTP 1.1 – which lets you run multiple requests in a single connection – doing multiple requests is cheap.

    In contrast, SODA doesn't really say what a service should do if, say, there are multiple positions in a cutout request: should the regions be merged (that's what DaCHS does)? Should multiple images come back? If so, how: in a tar, in a multi-extension FITS, in some other way? What happens if you give both multiple positional and spectral ranges: should there be one result per element of the cartesian product? And if it works that way: should clients have a chance to figure out what combination of parameters produced which result dataset?
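
    To see how quickly this gets out of hand, here is a Python sketch of the cartesian-product reading, with a label attached to each combination so a client could tell the results apart. The parameter values and the label scheme are invented for this example; nothing here is prescribed by SODA.

```python
from itertools import product


def enumerate_cutouts(circles, bands):
    """Return one labelled cutout specification per (circle, band) pair."""
    return [
        {"label": f"cutout-{index}", "CIRCLE": circle, "BAND": band}
        for index, (circle, band) in enumerate(product(circles, bands))]


# invented example values: three positions, two spectral ranges
circles = ["10.0 20.0 0.1", "11.0 20.0 0.1", "12.0 20.0 0.1"]
bands = ["4e-7 5e-7", "6e-7 7e-7"]

cutouts = enumerate_cutouts(circles, bands)
print(len(cutouts))  # prints: 6
for cutout in cutouts:
    print(cutout)
```

    Each additional multi-valued parameter multiplies the number of result datasets again.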

    In all that mess, it's gratifying to see my compromise proposal from way back when – if we do multi-cutout, let's do it by uploading a table specifying one cutout, including a label, per row – being floated again. But very frankly: My vote would still be to deprecate repeated POS, CIRCLE, BAND, and friends in SODA: requests are cheap these days.

    Oh, and while I'm confessing emotions of perhaps not entirely unselfish gratification: I still rejoice when I see DaCHS applications discussed in public, as Chloé and Baptiste did in their talk.

    Update at 2022-04-27, Morning

    The “virtual” Interop may not be quite as exciting as the real thing, but at least the jetlag is back.

    Yesterday at midnight I gave a talk on requirements and validators, which really was an elaboration of some of the ideas I developed on this blog a month ago. If I may say so myself, I've grown fond of the classification of MUST-s into, in the end, items the machines need, items the users need, admonishments for implementors, and items that we believe the future may need. I'm sure there are more, but even for these I found it remarkable that the less will immediately break if someone violates a piece of a spec, the more important validation becomes. This again is one of these thoughts that feel as if someone probably has pondered them a lot more deeply before…

    I also was really happy about Mark's pitch for validating specifications themselves that kept me awake until one a.m. CEST. In my authoring system ivoatex, I've introduced a hook to allow for a test target, and Mark kindly supported that effort by adding an xsdvalidate subcommand to the excellent stilts. The ivoatex documentation then grew some advice on what and how to test; in case you're writing or maintaining IVOA specs: do have a look. Mark's talk has a few great examples where spec-time validation would have saved a lot of effort and embarrassment.

    Only six hours later, I was back in <expletive deleted> zoom to listen to the Grid session, which again featured Mark, apparently unfazed by the lack of sleep, talking about (potentially) federated authentication outside of the browser (which is something I really want for persistent TAP uploads).

    And then there was the joint time domain/radio session. The slides are not yet there, but once they are, do yourself a favour and at least look at the beautiful images Dougal showed – Radio by now can make about as pretty pictures as Optical – and Alan's talk with the hypnotic sensitivity maps that again showed that low-frequency radio astronomy, seen from outside, is even more of an arcane art than is its high-frequency sibling.

    Update at 2022-04-27, late evening

    For me, this Interop has a strong proper motion slant. In this afternoon's Apps session, I tried to sell an extension to COOSYS I've wanted for a long time, just enough to do epoch propagation.

    You see, ever since my first serious contribution to the VO standards universe, the proposal on doing STC annotation in VOTable in 2010, failed miserably because almost nobody took it up, I have struggled to still somehow get enough annotation added to VOTables to let clients apply proper motions automatically.

    Given there are now data models for Coordinates and what we call Measurements (which roughly is errors and, well, a bit of physics) on the way, I figured this might be a good time to finally fix the COOSYS VOTable element. For one, data centers will revisit the STC annotation anyway if the models and the VOTable data model annotation pass the reviews, and producing an improved COOSYS would then almost come for free.

    But I can't lie: after the experiences of the past I'd also love to have a fallback position in case we spend another ten years on data models and annotations without getting anywhere. 25 years after the VO's birth epoch (if you will) of J2000.0, many stars have already moved on the order of an arcsecond from where our first big catalogues saw them, and so we can ill afford to wait these extra ten years.

    Not surprisingly, the proposal resulted in quite a bit of pushback, perhaps even a bit more than I had expected. Well: I should have given this talk years ago.

    The proper motion topic will come back tomorrow in the second DAL session, when I will talk about ADQL user defined functions to do epoch propagation. This talk will feature one of the prettier plots I've produced in the last few months:

    Three traces of points on a sphere

    What happens if you propagate positions when all you have are proper motions (i.e., no parallaxes and no distances) and you do that naively (blue), in the tangential plane (red), and under the assumption of a purely tangential motion. The lecture notes tell you how to come up with the data plotted here.
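
    To make the difference between such schemes a bit more tangible, here is a minimal Python sketch. It is not taken from the lecture notes; the function names are made up for illustration, and whether the second function corresponds exactly to one of the traces in the plot is an assumption. The first function simply adds proper motion times elapsed time to the angles. The second shifts the unit vector within the plane tangent to the sphere at the starting position and projects the result back onto the sphere.

```python
import math


def propagate_naive(ra, dec, pm_ra_cosdec, pm_dec, dt):
    """Add proper motion times time directly to the angles (all in radians)."""
    return ra + pm_ra_cosdec * dt / math.cos(dec), dec + pm_dec * dt


def propagate_tangent_plane(ra, dec, pm_ra_cosdec, pm_dec, dt):
    """Shift the unit vector within the tangent plane, then project back."""
    sin_ra, cos_ra = math.sin(ra), math.cos(ra)
    sin_dec, cos_dec = math.sin(dec), math.cos(dec)
    # Displacement along the local east and north directions, in radians.
    east, north = pm_ra_cosdec * dt, pm_dec * dt
    x = cos_dec * cos_ra - east * sin_ra - north * sin_dec * cos_ra
    y = cos_dec * sin_ra + east * cos_ra - north * sin_dec * sin_ra
    z = sin_dec + north * cos_dec
    # atan2 takes care of the normalisation of (x, y, z).
    return math.atan2(y, x) % (2 * math.pi), math.atan2(z, math.hypot(x, y))


# A made-up star 0.1 degrees from the pole, moving north by 0.02 degrees
# per year (absurdly fast, but it makes the effect visible), after 10 years.
start = (0.0, math.radians(89.9), 0.0, math.radians(0.02), 10.0)
near_pole_naive = propagate_naive(*start)
near_pole_tangent = propagate_tangent_plane(*start)
```

    With these toy numbers, the naive version returns a declination beyond 90 degrees, whereas the vector version moves the star across the pole, so it ends up at a right ascension of 180 degrees with a perfectly valid declination.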

    I think I can safely predict you will read more about some of these UDFs on this very blog later this year.

    Update 2022-04-28, late evening

    Today felt the most conferency so far for this Interop, and perhaps for any “virtual conference” I've attended. I believe there's a technical reason for that. After the second proper motion-flavoured talk I've just mentioned – that was still using, sigh, zoom –, things mostly happened in gathertown, a platform where you can actually walk around and stand together, and where you don't always have to talk on stage as in zoom. Fervently believing in the mantra of “protocols, not platforms” (of course: this is the VO), I shouldn't be saying this, but: I actually like gathertown.

    And so I guess we made quite a bit of progress in little side meetings and a hackathon on things like LineTAP (which, I hope, will bring all the rich data on spectral lines from VAMDC to the VO); how to let people have continuous integration checks against their Jupyter notebooks to notice in time when we're breaking something (my recent brown-bag pyvo bug that has somewhat started this was actually mentioned as a positive example in a talk (slide 19); and: it turned out I'm not the only notebook skeptic on this planet!); how we ought to define “facility” and “instrument” in Obscore and the Registry (and, probably particularly insidiously, in SSAP, where what's called “facility” there should probably be what's called “instrument” elsewhere – sigh), a topic we had already touched on yesterday, which in turn has resulted in Tamara's mail; an interesting service DaCHS operators want to run that would return PDF files as what DaCHS calls a “product” (which would normally be a thing like a FITS file); and then some more, including, of course, idle chatting.

    That was almost as good as an actual meeting.

    Update 2022-04-29, afternoon

    This morning, I chaired a nice and lively Semantics session, where I talked about the move of our Vocabulary maintenance to github. That particular thing did not elicit a lot of comments, not even when I extended an invitation to perhaps amend Vocabularies in the VO 2 in other ways. I'll take that as some sort of reassurance that I did a reasonably good job designing that thing, although I cannot entirely rule out that people just did not have enough time to find the warts.

    One thing I will call out at tonight's closing plenary is Stéphane's talk on vocabularies in EPN-TAP. The way he was looking at the various word lists involved in that standard, looking at what “just works”, where the concepts are probably too special to worry about, and then the clumsy space in between – where there are or should be vocabularies that almost, but not quite fit – was exemplary. I'm looking forward to followups on the mailing lists, trying to work out where we can perhaps align different concept hierarchies so we spare implementors duplicate efforts. And figuring out where that's impossible, too expensive, or in other ways undesirable, and where the problems are. I suppose there's a lot to be learned from that.

    Another high point was the identification of Wikidata as a valuable resource for the never-ending story of creating identifiers for instruments and facilities in Baptiste's talk. There is some special gratification in making our activities matter beyond the VO, link our resources with the wider RDF world – and hack SPARQL.

    What's left for me is the Registry session, where I will briefly report, in particular, on my most recent effort of getting rid of my venerable GloTS service by adding a table of TAP-queriable tables to RegTAP. Let's see what people say – but in the end the challenge will be to convince the other operators of RegTAP services to take up the proposed changes. The central challenge there is that part of it is built on MOCs, and while the ESAC registry is built on Postgres that can already be taught to deal with them, the one at MAST is based on SQLServer, which, I think, cannot yet. Let's see.

    Another thing I'm looking forward to is Hendrik's pitch for registering tutorials and similar educational material. I'd really like to see more stuff on VOTT, which is fed from such registrations.

    Update 2022-04-29, late evening

    Interops for me always have something of an ego trip when I see traces of my activities in other people's work. And I've just discovered such a trace in a place I had not expected it: Gilles' talk on extra metadata in service responses, where he showed metadata DaCHS returns with its TAP responses. This was in this morning's session of the Data Curation and Preservation interest group that, I have to admit, I skipped in favour of a proper breakfast without a screen in front of me.

    And he touched a topic that's dear to my heart, too. Really, I've been struggling to give applications enough metadata such that they can simply spit out a bunch of BibTeX for the sources used in a particular VO workflow for quite a while. In typical DaCHS responses, you will find a bibcode and often a link to BibTeX (example), and at least the container element I got standardised in DALI 1.1. Let's see what else we can specify so that machines can reliably extract such information: Authors? Technical contact addresses? Date and time of production (could be very relevant for evolving data)? Full provenance? Well: If you've ever missed some piece of metadata, this would be a good time to bring it up.

    All that's left now is the reports of the Working Groups (which will be another midnight talk for me) and a bit of farewell ceremony. After that, I'll go to sleep, and so that's it for my Interop reporting.

  • Requirements and Validators

    Content Warning: this is mainly VO lore. I am not claiming any immediate applicability to the use or publication of astronomical data.

    This morning, I set out to reply to a mail by Mark Taylor and noticed after a while that I was writing a philosophical piece on how to write standards – and how not to – that I may want to refer to again later. So, I'll make this a blog post.

    The story started when the excellent stilts taplint during my monthly validation routine produced an error when exercising my data centre's TAP endpoint:

    I-OBS-QSUB-5 Submitting query: SELECT TOP 1 obs_id FROM ivoa.ObsCore WHERE obs_id IS NULL
    E-OBS-QERR-1 TAP query failed [Service error: "Field query: Query timed out (took too long).
    

    What happened is that stilts tried to ascertain that all rows in my obscore table satisfy the standard's requirement that the obs_id column is non-NULL (see page 20). This made Postgres – the database system actually executing the queries – run what is known as a sequential scan through the tables involved in obscore; the reason underlying this bad judgement is a bit involved and has to do with the fact that in DaCHS, ivoa.obscore is a view composed of many tables. I will spare you the details, but the net effect of that is that it is not easy to tell Postgres that rows with obs_id NULL, if they exist at all, will be few and far between.
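
    To illustrate the kind of check involved, here is a minimal sketch using Python's built-in sqlite3 module instead of Postgres and TAP. The two feeder tables and the view are toy stand-ins I made up for this purpose (the real ivoa.obscore in DaCHS is assembled from many more tables), and SQLite writes ADQL's TOP 1 as LIMIT 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical per-collection tables feeding the view.
    CREATE TABLE images (obs_id TEXT, access_url TEXT);
    CREATE TABLE spectra (obs_id TEXT, access_url TEXT);
    INSERT INTO images VALUES ('img-1', 'http://example.org/img-1');
    INSERT INTO spectra VALUES ('spec-1', 'http://example.org/spec-1');
    -- A toy obscore view composed of the tables above.
    CREATE VIEW obscore AS
        SELECT obs_id, access_url FROM images
        UNION ALL
        SELECT obs_id, access_url FROM spectra;
""")

# The same question the validator asks: is there any row with a NULL obs_id?
CHECK = "SELECT obs_id FROM obscore WHERE obs_id IS NULL LIMIT 1"


def has_null_obs_id(connection):
    """Return True if at least one row of the view has a NULL obs_id."""
    return connection.execute(CHECK).fetchone() is not None


clean = has_null_obs_id(conn)    # no offending row yet
conn.execute("INSERT INTO spectra VALUES (NULL, 'http://example.org/spec-2')")
dirty = has_null_obs_id(conn)    # now the check finds one
```

    Note that in the compliant case there is no matching row at which the database could stop early. Unless it has an index or some other knowledge about where NULLs might be, it has to look at every row of every contributing table before it can report that there is none.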

    By now, the number of data sets in my obscore table approaches 100'000'000, and fetching all that data simply takes time, more time than a synchronous query has on my site[1].

    Granted, I could fix that by adding indexes on the columns involved, but since these come from several dozen tables, that would be quite a bit of work for both me and the computer. Is that work worth it? Well, it certainly is if otherwise I'm breaking the standard, but since it is a serious amount of work, I am tempted to wonder: does the requirement actually make sense? And this leads to the question:

    Why do we require things in standards?

    In the end, there is just one reason to require something in a standard: Without the requirement, something important breaks. When one thinks about this a bit more deeply, one can distinguish two somewhat finer classes of requirements.

    (a) “Internal requirements”. These are rules imposed so machines can do their job. The most obvious examples here are requirements on how to write things. For instance, if a client writes an interval as lower/upper and the service expects lower upper, it just won't work. Hence, a standard has to say “The separator in intervals MUST be whitespace” (or whatever).
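
    As a minimal sketch of why this kind of rule cannot be left to goodwill, consider a hypothetical service-side parser written against the “whitespace” rule. The function names are invented for this illustration and are not taken from any VO library.

```python
def parse_interval(literal):
    """Parse an interval written as "lower upper", separated by whitespace."""
    parts = literal.split()
    if len(parts) != 2:
        raise ValueError(
            f"expected two whitespace-separated numbers, got {literal!r}")
    lower, upper = (float(part) for part in parts)
    return lower, upper


def try_parse(literal):
    """Return the parsed interval, or None if the literal is rejected."""
    try:
        return parse_interval(literal)
    except ValueError:
        return None


ok = try_parse("0.5 1.5")        # follows the rule the service expects
broken = try_parse("0.5/1.5")    # a client that chose a different separator
```

    The second call fails although the client sent a perfectly sensible interval. Only a MUST in the specification tells both sides which spelling to use.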

    There are more subtle requirements in that department. For instance, many tables need a primary key because other tables may want to refer to them. For Obscore, this becomes relevant just about now, when we think about having extensions for it. Those would add specific metadata for, say, radio or gamma observations. We will probably create them by adding per-extension tables holding a foreign key into ivoa.obscore. This is nice because then you can write something like:

    SELECT ...
    FROM ivoa.obscore
    JOIN ivoa.obs_visibility
      USING (obs_publisher_did)
    WHERE (some visibility-specific constraint)
    

    – and almost everything just works without further thought or effort: No plethora of columns that are NULL in ivoa.obscore for anything that is not a visibility, and no manual filtering out of non-visibilities either: JOIN does it all nicely for you. Isn't relational algebra great?

    But this only is possible if obs_publisher_did (well: it's not certain yet whether that actually will be obscore's designated primary key, but bear with me there) really is non-NULL, and if there are no two rows with the same publisher DID (which are the general criteria to make something a primary key in a relation). Hence, these two constraints are something we simply MUST (pun intended) require.
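
    Here is a minimal sketch of both points, again with Python's built-in sqlite3 module. The tables are toy stand-ins without the ivoa schema, the column uv_max is purely hypothetical, and, as said above, whether obs_publisher_did will actually become the key is open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obscore (
        -- SQLite needs the explicit NOT NULL on a non-integer primary key.
        obs_publisher_did TEXT PRIMARY KEY NOT NULL,
        dataproduct_type TEXT
    );
    CREATE TABLE obs_visibility (
        obs_publisher_did TEXT NOT NULL
            REFERENCES obscore (obs_publisher_did),
        uv_max REAL  -- hypothetical visibility-specific column
    );
    INSERT INTO obscore VALUES
        ('ivo://example/a', 'image'),
        ('ivo://example/b', 'visibility'),
        ('ivo://example/c', 'visibility');
    INSERT INTO obs_visibility VALUES
        ('ivo://example/b', 50.0),
        ('ivo://example/c', 500.0);
""")

# The JOIN drops the image without any explicit filtering.
rows = conn.execute("""
    SELECT obs_publisher_did, dataproduct_type, uv_max
    FROM obscore
    JOIN obs_visibility USING (obs_publisher_did)
    WHERE uv_max > 100
    ORDER BY obs_publisher_did
""").fetchall()


def is_rejected(statement):
    """Return True if the database refuses the statement as a key violation."""
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_rejected = is_rejected(
    "INSERT INTO obscore VALUES ('ivo://example/a', 'spectrum')")
null_rejected = is_rejected(
    "INSERT INTO obscore VALUES (NULL, 'spectrum')")
```

    In this toy setup, rows only contains the one visibility passing the constraint, and the database itself refuses both the duplicate and the NULL identifier. A standard has no database engine to enforce this across services, which is why it has to spell out the two constraints as requirements.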

    (b) “Functional requirements”. These are requirements resulting from considerations of the use of the standard. I have just encountered a nice example when working on LineTAP, a future standard on how to access data about spectral lines. An important use case there is that the client displays the lines on top of a spectrum, and it will want to put something next to the lines so the user has at least a first indication just what would cause the line to show up. That it can only do if the service provides it with a plausible label – asking clients to invent a label based on the data it has is likely to produce very unsatisfying results, as no machine is smart enough to figure out nice, idiomatic strings like “21 cm HI” or “Hα”. Hence, we simply have to require that each row in such a LineTAP table has a title (technically: the corresponding column has a non-NULL constraint).

    Going back to the obs_id example, it does not seem there is a strong case to invoke either (a) or (b) – since the column explicitly has no uniqueness requirement, it will not work as a primary key, and users will probably only want to use it for “grouped” data, where multiple artefacts belong to one “observation”. For data sets not within such groups, there really is no application for obs_id I can see. Of course, I may be missing something, which is why I asked around on the mailing lists.

    If we figure out nothing breaks when we remove the requirement, then we should drop it: Every requirement causes some overhead in implementation and validation. In the present case, the implementation overhead would be all the indexes on the various obs_id columns, which I would not otherwise need. The validation overhead is the extra queries that taplint needs to do. Having overhead for no benefit (in terms of things not breaking) goes against sensible parsimony in what we ask our adopters to do (and I'll officially admit here that we do ask quite a bit already).

    … and why do we validate them?

    In the mail I have cited above, Mark has kindly offered to just not run the query in the validation suite, and all this philosophy was really intended to lead up to a “thanks, but no thanks”.

    That is because, first of all, requirements that are not checked by a machine are requirements that are not met. You see, what we do is hard. Sure, there are harder problems in computing, but globally distributed information systems run by only loosely connected parties are rather non-trivial. People writing code to solve non-trivial problems will get it wrong.

    The common way to deal with this fact is to test with one client and call it a day when that client seems to work for whatever was chosen as a test case. To mention a non-VO standard where this implement-to-the-client method failed horribly and continues to fail horribly: ACPI, the part of the firmware that's supposed to make, for instance, suspend-to-RAM something one doesn't have to think about. Vendors usually stop developing their ACPI code when the current version of Windows does not fail horribly with their implementation. A paper in the proceedings of the 2007 Linux symposium discusses some of the consequences in the least offensive way conceivable – and in a way that I, as a VO developer running quite a few Linux boxes, can very much relate to.

    The bottom line is that if an unmet requirement breaks things and validators do not check for that requirement, then services will work to some degree with a certain client and break as soon as people switch to a different client (or perhaps only try to be smart). That's in stark contrast to one of my main selling points when I do VO teaching: “Hey, you can prototype with TOPCAT, and when you've figured out things, just switch to pyVO so you can scale, automate, and make your work reproducible”.

    So, let's try to avoid unvalidated requirements.

    Instead, let's have as few requirements as we can while covering the use cases we envision. And then let's have great validators that make sure these requirements are met by the services (or instance documents, or whatever it may be). Such validators not only help making the VO an effective environment that's fun to work with. They also give service operators – like… me – a peace of mind that nothing else can provide.

    [1]I keep a rather tight limit on the sync queries because the system also answers registry discovery queries, and these should be reasonably snappy. If I let long sync queries run, it is very easy to overload the system by accident. If I don't, people who want to run long queries can move to async. There, jobs are queued and only let in one or two at a time. That will not (usually) overload anything.
