A Grey Eminence of a Standard

[Screenshot: graphs and numbers]
Examples for extra metadata: extended column descriptions on the web pages accompanying the ARI-Gaia TAP service.

Last friday, I’ve uploaded a first working draft of VODataService 1.2 to the IVOA documents repository. That’s the first major step in updating a standard, and it’s an invitation to everyone to have a look and comment.

Foof, you might say, what do I care? I’ve not even heard of that standard.

Well, but you’ve probably used it. VODataService is (among several other things) the standard that governs how a TAP service tells clients (TOPCAT, say) what tables it has and what’s inside of them. So, if you see in TOPCAT that there is a column named ang_error with a unit of deg, a UCD of stat.error;pos and the meaning “1 σ confidence radius of the position”, that most likely came in a document standardised by VODataService.

The question of what (TAP) services can tell clients about their table set is one major open point: Do we want additional metadata there? This article’s image, for inspiration, shows a screenshot of extended metadata Grégory delivers to browsers on his ARI-Gaia service; among this are minima, maxima, means, standard deviations, quartiles, and fill factors (i.e., how many of the columns are NULL). He even shows histograms of the values’ distributions and HEALPix maps showing how (the means of) the values vary on the sky. Another example of extended metadata could be footnotes as you will find them on many of my resources’ reference URLs (example; footnotes are, unsurprisingly, near the foot of that page).

We could define interoperable means to communicate information like this. The question is: does the added value justify the complication in implementation? This is where it would be great if you weighed in, in particular if you are a “mere” TAP user: Are there any such pieces of metadata you’ve always wanted to see in your TAP interfaces? Oh, and metadata of course can also be added to tables rather than columns. The current draft already lets services communicate the number of rows in each table – is there more “simple”, table-specific metadata of this sort?

VODataService furthermore deals with several other topics; for instance, the STC in the registry business I’ve blogged about in February is going to be standardised here (update on this: spectral coverage is no longer in wavelength but in energy). Other changes are rather more technical in nature, like several new resource types that will improve the discovery of tables and other such resources, or a careful adjustment of some features to keep them in line with TAP evolution.

But don’t let the technicalities scare you away – just have a peek, and if you have thoughts on any of the VODataService topics: I’m just a mail away.

Space and Time not lost on the Registry

Histogram: observation dates of an image service

A histogram of times for which the Palomar-Leiden service has images: That’s temporal service coverage right there.

If you are an astronomer and you’ve ever tried looking for data in the Virtual Observatory Registry, chances are you have wondered “Why can’t I enter my position here?” Or perhaps “So, I’m looking for images in [NIII] – where would I go?”

Both of these are examples for the use of Space-Time Coordinates (STC) in data discovery – yes, spectral coordinates count as STC, too, and I could make an argument for it. But this post is about something else: None of this has worked in the Registry up to now.

It’s time to mend this blatant omission. To take the next steps, after a bit of discussion on some of the IVOA’s mailing lists, I have posted an IVOA note proposing exactly those last Thursday. It is, perhaps with a bit of over-confidence, called A Roadmap for Space-Time Discovery in the VO Registry. And I’d much appreciate feedback, in particular if you are a VO user and have ideas on what you’d like to do with such a facility.

In this post, I’d like to give a very quick run-down on what is in it for (1) VO users, (2) service operators in general, and (3) service operators who happen to run DaCHS.

First, users. We already are pretty good on spatial coverage (for about 13000 of almost 20000 resources), so it might be worth experimenting with that. For now, the corresponding table is only available on the RegTAP mirror at http://dc.g-vo.org/tap. There, you can try queries like

select ivoid from
rr.table_column
natural join rr.stc_spatial
where
  1=contains(gavo_simbadpoint('HDF'), coverage)
  and ucd like 'phot.flux;em.radio%'

to find – in this case – services that have radio fluxes in the area of the Hubble Deep Field. If these lines scare you or you don’t know what to do with the stupid ivoids, check the previous post on this blog – it explains a bit more about RegTAP and why you might care.

Similarly cool things will, hopefully, some day be possible in spectrum and time. For instance, if you were interested in SII fluxes in the crab nebula in the early sixties, you could, some day, write

SELECT ivoid FROM
rr.stc_temporal
NATURAL JOIN rr.stc_spectral
NATURAL JOIN rr.stc_spatial
WHERE
  1=CONTAINS(gavo_simbadpoint('M1'), coverage)
  AND 1=ivo_interval_overlaps(
    6.69e-7, 6.75e-7, 
    wavelength_start, wavelength_end)
  AND 1=ivo_interval_overlaps(
    36900, 38800,
    time_start, time_end)

As you can see, the spectral coordiate will, following (admittedly broken) VO convention, be given in meters of vacuum wavelength, and time in MJD. In particular the thing with the wavelength isn’t quite settled yet – personally, I’d much rather have energy there. For one, it’s independent of the embedding medium, but much more excitingly, it even remains somewhat sensible when you go to non-electromagnetic messengers.

A pattern I’m trying to establish is the use of the user-defined function ivo_interval_overlaps, also defined in the Note. This is intended to allow robust query patterns in the presence of two intrinsically interval-valued things: The service’s coverage and the part of the spectrum you’re interested in, say. With the proposed pattern, either of these can degenerate to a single point and things still work. Things only break when both the service and you figure that “Aw, Hα is just 656.3 nm” and one of you omits a digit or adds one.

But that’s academic at this point, because really few resources define their coverage in time and and spectrum. Try it yourself:

SELECT COUNT(*) FROM (
  SELECT DISTINCT ivoid FROM rr.stc_temporal) AS q

(the subquery with the DISTINCT is necessary because a single resource can have multiple rows for time and spectrum when there’s multiple distinct intervals – think observation campaigns). If this gives you more than a few dozen rows when you read this, I strongly suspect it’s no longer 2018.

To improve this situation, the service operators need to provide the information on the coverage in their resource records. Indeed, the registry schemas already have the notion of a coverage, and the Note, in its core, simply proposes to add three elements to the coverage element of VODataService 1.1. Two of these new elements – the coverage in time and space – are simple floating-point intervals and can be repeated in order to allow non-contiguous coverage. The third element, the spatial coverage, uses a nifty data structure called a MOC, which expands to “HEALPix Multi-Order Coverage map” and is the main reason why I claim we can now pull off STC in the Registry: MOCs let databases and other programs easily and quickly manipulate areas on the sphere. Without MOCs, that’s a pain.

So, if you have registry records somewhere, please add the elements as soon as you can – if you don’t know how to make a MOC: CDS’ Aladin is there to help. In the end, your coverage elements should look somewhat like this:

<coverage>
  <spatial>3/336,338,450-451,651-652,659,662-663 
    4/1816,1818-1819,1822-1823,1829,1840-1841</spatial>
  <temporal>37190 37250</temporal>
  <temporal>54776 54802</temporal>
  <spectral>3.3e-07 6.6e-07</spectral>
  <spectral>2.0e-05 3.5e-06</spectral>
  <waveband>Optical</waveband>
  <waveband>Infrared</waveband>
</coverage>

The waveband elements are remainders from VODataService 1.1. They are still in use (prominently, for one, in SPLAT), and it’s certainly still a good idea to keep giving them for the forseeable future. You can also see how you would represent multiple observing campaigns and different spectral ranges.

Finally, if you’re running DaCHS and you’re using it to generate registry records (and there’s almost no excuse for not doing so), you can simply write a coverage element into your RD starting with DaCHS 1.2 (or, if you run betas, 1.1.1, which is already available). You’ll find lots of examples at the usual place. As a relatively interesting example, the resource descriptor of plts. It has this:


  <updater spaceTable="data" spectralTable="data" mocOrder="4"/>
  <spectral>3.3e-07 6.6e-07</spectral>
  <temporal>37190 37250</temporal>
  <temporal>38776 38802</temporal>
  <temporal>41022 41107</temporal>
  <temporal>41387 41409</temporal>
  <temporal>41936 41979</temporal>
  <temporal>43416 43454</temporal>
  <spatial>3/282,410 4/40,323,326,329,332,387,390,396,648-650,1083,1085,1087,1101-1103,1123,1125,1132-1134,1136,1138-1139,1144,1146-1147,1173-1175,1216-1217,1220,1223,1229,1231,1235-1236,1238,1240,1597,1599,1614,1634,1636,1728,1730,1737,1739-1740,1765-1766,1784,1786,2803,2807,2809,2812</spatial>
</coverage>

This particular service archives plate scans from the Palomar-Leiden Trojan surveys; these were looking for Trojan asteroids (of Jupiter) using the Palomar 122 cm Schmidt and were conducted in several shortish campaigns between 1960 and 1977 (incidentally, if you’re looking for things near the Ecliptic, this stuff might still hold valuable insights for you). Because the fill factor for the whole time period is rather small, I manually extracted the time coverage; for that, I ran select dateobs from plts.data via TAP and made the histogram plot above. Zooming in a bit, I read off the limits in TOPCAT’s coordinate display.

The other coverages, however, were put in automatically by DaCHS. That’s what the updater element does: for each axis, you can say where DaCHS should look, and it will then fill in the appropriate data from what it guesses gives the relevant coordiantes – that’s straightforward for standard tables like the ones behind SSAP and SIAP services (or obscore tables, for that matter), perhaps a bit more involved otherwise. To say “just do it for all axis”, give the updater a single sourceTable attribute.

Finally, in this case I’m overriding mocOrder, the order down to which DaCHS tries to resolve spatial features. I’m doing this here because in determining the coverage of image services DaCHS right now only considers the centers of the images, and that’s severely underestimating the coverage here, where the data products are the beautiful large Schmidt plates. Hence, I’m lowering the resolution from the default 6 (about one degree linearly) to still give some approximation to the actual data coverage. We’ll fix the underlying deficit as soon as pgsphere, the postgres extension which is actually dealing with all the MOCs, has support for turning circles and polygons into MOCs.

When you have defined an updater, just run dachs limits q.rd, and DaCHS will carefully (preserving your indentation) re-write the RD to contain what DaCHS has worked out from your table (but careful: it will overwrite what was previously there; so, make sure you only ask DaCHS to only deal with axes you’re not dealing with manually).

If you feel like writing code discovering holes in the intervals, ideally already in the database: that would be great, because the tighter the intervals defined, the fewer false positives people will have in data discovery.

The take-away for DaCHS operators is:

  1. Add STC coverage to your resources as soon as you’ve updated to DaCHS 1.2
  2. If you don’t have to have the tightest coverage declaration conceivable, all you have to do to have that is add
      <coverage>
        <updater sourceTable="my_table"/>
      </coverage>

    to your RD (where my_table is the id of your service’s “main” table) and then run dachs limits q.rd

  3. For special effects and further information, see Coverage Metadata in the DaCHS reference documentation
  4. If you have a nice postgres function that splits a simple coverage interval up so the filling factor of a set of new intervals increases (or know a nice, database-compatible algorithm to do so) – please let me know.

Register your stuff with purx!

TOPCAT screenshot
If you open the TAP dialog of TOPCAT, what you see is Registry content.

The VO Registry lets people find astronomical resources (which is jargon for “dataset, service, or stuff“). Currently, most of its users don’t even notice they’re using the Registry, as when TOPCAT just magically lists what TAP services are available (image above) – but there are also interfaces that let you directly interact with the registry, for instance GAVO’s WIRR service or ESAVO’s Registry Search.

Arguably, the usefulness of the Registry scales with its completeness. With sufficient completeness, the domain-specific, structured metadata will also make it interesting for generic discovery of astronomical data; in a quip, looking for UCDs in google will never work quite well – and without that, it’s hard to find things with queries like „radio fluxes of early-type stars”.

Either way: If you have a data set or a service dealing with astronomy, it’d be great if you could register it. To do this, so far you either had to set up a publishing registry, which is nontrivial even if you have a software that natively speaks a protocol called OAI-PMH (DaCHS does, but most other publishing suites don’t) or you could use one of two web interfaces to define your resource (notes for a talk on this I gave in 2016).

Neither of these options is really attractive if you publish only a few resources (so the overhead of running a publishing registry looks excessive) that change now and then (so using a web browser to update the resource records again and again is tedious). Therefore, GAVO has developed purx, the publishing registry proxy. We’ve officially announced it during the recent Southern Spring Interop in Santiago de Chile (Program), and the lecture notes for that talk are probably a good introduction to what this is about.

If you’re running VO services and have not registered them so far, you probably want to read both these notes and the service documentation. If, on the other hand, you just have a web-published directory of files or a browser-based service, you probably can skip even that. Just grab a sample record (use the one for a simple browser service in both cases) and adapt it to what’s fitting for your website. Then put the resulting file online somewhere and paste the URL of that location on purx’ enrollment service. In case you’re uncertain about some of the terms in the record, perhaps our crib sheet for metadata we ask our data providers for will be helpful.

There’s really no excuse any more for not being in the Registry!