GAVO Blog: Virtual Observatory Matters Home

DaCHS 2.4 is out: Blind discovery, pretty datalink, and more

2021-06-09 Markus Demleitner

DaCHS 2.4: automatic ranges (with registry support!), pretty datalink (with vocabulary support!). And then the usual bunch of improvements (hopefully!).

I have released DaCHS 2.4 today, and as usual for stable releases, I would like to have something like a commented changelog here so DaCHS deployers perhaps look forward to upgrading – which would be good, because there are far too many outdated DaCHSes out there.

Among the more notable changes in version 2.4 are:

Blind discovery overhaul. If you've been following my requests to include coverage metadata three years ago, you have probably felt that the way DaCHS started to hack your RDs to include the metadata it had obtained from the data was a bit odd. Well, it was. DaCHS no longer does that when running dachs limits. While you can still do manual overrides, all the statistics gathered by DaCHS is now kept in the database and injected into the DaCHS' internal idea of your RDs at loading time.

I have not only changed this because the old way really sucked; it was also necessary because I wanted to have per-column metadata routinely, and since in advanced DaCHS there often are no XML literals for columns (because of active tags), there wouldn't be a place to keep information like what a column is minimally, maximally, in median, or as a “2σ range“ within the RD itself. A longer treatment of where this is going is given in the IVOA note Blind Discovery 2: Advanced Column Statistics that Grégory and I have recently uploaded.

For you, it's easy: Just run dachs limits q once you're happy with your data, or perhaps once a month for living data, and leave the rest to DaCHS. A fringe benefit: in browser froms, there are now value ranges of the various numeric constraints as placeholders (that's the screenshot on the left in the title picture).

There is a slight downside: As part of this overhaul, DaCHS is now computing the coverage of SIAP and SSAP services based on the footprints of the products as MOCs. While that gives much more precise service footprints, it only works with bleeding-edge pgsphere as delivered in Debian bullseye – or from our Debian repository. If you want to build this from source, you need to get credativ's pgsphere fork for now.

Generate column elements: If you have tables with many columns, even just lexically entering the <column> elements becomes straining. That is particularly annoying if there already is a halfway machine-readable representation of that data.

To alleviate that, very early in the development of DaCHS, I had the gavo mkrd subcommand that you could feed FITS images or VOTables to get template RDs. For a number of reasons, that never worked well enough to make me like or advertise it, and I eventually ended up writing dachs start instead, which is something I like and advertise for general usage.

However, what that doesn't do is come up with the column declarations. To make good on this, there is now a dachs gencol command that will, from a FITS binary table, a VOTable, or a VizieR-style byte-by-byte description, generate columns with as much metadata as it can fathom. Paste that into the output of dachs start, and, depending on your input format, you should have a quick start on a fairly full-featured data collection (also note there's dachs adm suggestucds for another command that may help quickly generate rich metadata).

This currently doesn't work for products (i.e., tables of spectra, images, and the like); at least for FITS arrays, I suppose turning their non-obvious header cards into columns might save some work. Let's see: your feedback is welcome.

Refurbished Datalink XSLT: Since the dawn of datalink, DaCHS has delivered Datalink documents with XSLT stylesheets in order to have nicely formatted pages rather than wild XML when web browsers chance on datalink documents. I have overhauled the Javascript part of this (which, I have to admit, is what makes it pretty). For one, the spatial cutout now works again, and it's modeless (no clicking “edit“ any more before you can drag cutout vertices). I'm also using the datalink/core vocabulary to furnish link groups with proper titles and descriptions, and to have them sorted in in a proper result tree. I've talked about it at the interop, and I've prepared a showcase of various datalink documents in the Heidelberg data centre.

Update to DaCHS 2.4 and you'll get the same thing for your datalinks.

Non-product datalinks: When writing a datalink service, you have to first come up with a descriptor generator. DaCHS will provide a simple one for you (or perhaps a bit more complex ones for FITS images or spectra) – but all of these assume that whatever the datalink ID parameter references is in DaCHS' product table. It turned out that in many interesting cases – for instance, attaching time series to object catalogues – that is not the case, and then you had to write rather obscure code to keep DaCHS from poking around in the product table.

No longer: There is now the //datalink#fromtable descriptor generator. Just fill in which column contains the identifier and the name of the table containing that column and you're (basically) done. Your descriptor will then have a metadata attribute containing the relevant row – along with everything else DaCHS expects from a datalink descriptor.

gavo_specconv: That's a longer story covered previously on this blog.

Index declaration in views: Saying on which columns a database index exists allows users to write smart queries, and DaCHS uses such information internally when rewriting geometrical expressions from ADQL to whatever is in use in the actual database. Hence, making sure these indexes are properly declared is important. But at the same time it's difficult for views, because postgres doesn't let you have indexes on views (for good reasons). Still, queries against views will (usually) use indexes of their underlying tables, and hence those should be declared in the corresponding metadata.

This is tedious in general. DaCHS now helps you with the //procs#declare-indexes-from stream. Essentially, it will compare the columns in the view with the ones from the source tables and then guess which view columns correspond to indexed columns from the source tables; using that, it adds indexed flags to some view columns.

If all this is too weird for you: Thanks to declare-indexes-from, the index declaration now automatically happens in the modern way to build SSAP services, the //ssap#view mixin. Hence, chances are you won't even see this particular STREAM but just notice its beneficial consequences.

Sunsetting resources: I've been fiddling off and on with a smart way to pull resources I no longer want to maintain while still leaving a tombstone. I had to re-visit this problem recently because I dropped the Gaia DR1 table from my Heidelberg data centre. So, how do I explain to people why the thing that's been there no longer is?

In general, this is a rather untractable problem; for instance, it's very hard to do something sensible with the TAP_SCHEMA entries or the VOSI tables endpoints for the tables that went away. Pure web pages, on the other hand, can be adorned with helpful info. To enable that, there is now the superseded meta item, which you define in the RD that once held the resources. For Gaia DR1, here's what I used:
```
<meta name="superseded" format="rst">
  We do not publish Gaia DR1 data here any more.
  If you actually need DR1 data, refer to the
  full Gaia mirrors, for instance `the one at
  ARI`_.  Otherwise, please use more recent data
  releases, for instance `eDR3`_.

  .. _the one at ARI: http://gaia.ari.uni-heidelberg.de
  .. _eDR3: /browse/gaia/q3
</meta>
```
Root page template: I slightly streamlined the default root page template, in particular dropping the "i" and "Q" icons for going to the metadata and querying the service. If you have overridden the root template, you may want to see if you want to merge the changes.

As usual, there are many more small repairs and additions, but most of these are either very minor or rather technical. One last thing, though: DaCHS now works with Python 3.8 (3.7 will continue to be supported for a few years at least, earlier 3.x never was), which is going to be the python3 in Debian bullseye. Bullseye itself will only have DaCHS 2.3 (with the Python 3.8 fixes backported), though. Once bullseye has become stable, we will look into putting DaCHS 2.4 into the backports.

Category: Software
GAVO at the Northern Spring Interop 2021

2021-05-28 Markus Demleitner

As usual in May, the people making the Virtual Observatory happen meet for their Interoperability Conference, better known as the Interop – where “meet” still has to be taken with a generous helping of salt (more on this near the end of this post). As has become customary on this blog, let me briefly discuss contributions with a significant involvement of GAVO.

A major thing from my perspective actually happened in the run-up: The IVOA executive committee (“Exec“) approved Version 2.0 of Vocabularies in the VO, a standard saying how hierarchical word lists (“vocabularies“) can be managed, disseminated, and consumed within the VO. Developing the main ideas from sufficiently restricting RDF to coming up with desise (which makes complicated things possible with surprisingly little code), and trying things out on our growing number of vocabularies took up quite a bit of my standards time in the last 20 months or so – and I'm fairly happy with the outcome, which I celebrated with a brief talk on programming with IVOA semantics during Wednesday morning's semantics session.

In that session I gave a second, more discussion-oriented, talk, probing how to formalise data product types – which is surprisingly involved, even with the relatively straightforward use case “figure out a programme to handle the data“: What's a spectrum? Well, something that maps a spectral coordinate to... hm. Is it still a spectrum if there's multiple sorts values (perhaps flux, magnitude, and polarisation)? If we allow, in effect, tuples, why not whole images, which would make spectral cubes spectra – but of course few client programmes that deal with spectra do anything useful with cubes, so clearly such a definition would kill our use case. And what about slit spectra, mapping a spatial coordinat to spectra?

All this of course is reminiscent of the classical problems of semantics: An elephant is a big animal with a trunk. But when an elephant loses its trunk in an accident: does it stop being an elephant? So, much of the art here is finding the sweet spot of usability between strict and formal semantics (that will never fit the real world) and just tossing around loosely defined strings (that will simply not be machine-readable). After the session, I came up with the 2021-05-26 draft of product-type. If you read this a few years down the road, it might be interesting to compare with what product-type is today. I'm curious myself.

Later on Wednesday CET, I did a shameless plug for my Datalink-transforming XSLT (apologies for a github link, but I'm fishing for PRs here; if you use DaCHS, you'll get the updated stuff with version 2.4, due soon). The core of this dates back to the dawn of datalink, but with a new graphical cutout code and in particular vocabulary-based tree-ification of the result rows, I figured it's time to remind the operators of datalink services it's still out there for them to take up. Perhaps more than from the slides, you can see what I am after here by just trying the Datalink examples I've collected for this talk and comparing document source, the appearance without Javascript (pure XSLT) and the appearance with Javascript (I'm a bit ashamed I'm relying so heavily on it, but much of this really can only be done client-side).

Quite a bit after midnight my time (still Thursday UTC), Mark Taylor talked about Software Identification, something I've been working on with him recently. It's is one of the things that is short and trivial but that, when unregulated, just doesn't work; in this case it's servers and clients saying what they are when they speak HTTP. I stumbled into the problem while trying to locate severely outdated DaCHS installations – so, I a way I put effort into the Note Mark was talking about (and which I have just uploaded to the IVOA Document Repository) as a sort of penance.

While I was already asleep when Mark gave his talk, I was back at the Interop Friday morning CEST, when Hendrik Heinl talked about the LOFAR TAP service (which, I'm proud to say, runs on top of DaCHS); this was mainly live operations in TOPCAT (which is why there's no exciting slides), but Hendrik used a pyVO script doing cutouts in an (optical) mosaic of the Fornax cluster built on top of – and that's the main point – Datalink and SODA. Working this out with Hendrik made me realise the documentation of Datalink in pyVO really needs… love. Or, better, work.

Later on Friday, there was the Registry session, where I gave brief (and somewhat cramped) talks on advanced column metadata (which is intended to one day let you query the registry for things like “roughly complete to 18 mag” or “having objects out to redshift 4“) and how to put VODataService 1.2 coverage into RegTAP – I expect you'll read more on both topics on this blog as they mature to a level at which this can leave the Registry nerd circles.

And now, about 10 pm on Friday, the meeting is slowly winding down; beyond all the talks (which were, regrettably for a free software spirit like me, on zoom), the real bonus was that there was a gather.town attached to the conference. Now, that's a closed, proprietary, non-self-hostable platform, too, and so I have all reason to grumble. But: for the first time since February 2020 it felt like a conference, with the most useful action happening outside of the lecture halls, from trying to reach consensus on VEP-006 to teaching DaCHS datalink service declaration to learning about working with visibilities coming from VLBI (where it's even more difficult than it is with the big antenna arrays). So… this one time I've made my peace with proprietary platforms.

A propos of “say no to platforms“ (in this case, slack): Due to the recent troubles with freenode, in addition to the Interop last week saw the the GAVO IRC channel move to libera.chat (where it's still #gavo). So, for instant messaging us now that the Interop is (in effect) over: Come there.

Category: Meetings
Spectral Units in ADQL

2021-04-23 Markus Demleitner

In case you find the piece of Python given below too hard to read: It's just this table of conversion expressions between the different SI units we are dealing with here.

Astronomers these days work all along the electromagnetic spectrum (and beyond, of course). Depending on where they observe, they will have very different instrumentation, and hence some see their messengers very naturally as waves, others quite as naturally as particles, others just as electrons flowing out of a CCD that is sitting behind a filter.

In consequence, when people say where in the spectrum they are, they use very different notions. A radio astronomer will say “I'm observing at 21 cm” or “at 50 GHz“. There's an entire field named after a wavelength, “submillimeter“, and blueward of that people give their bands in micrometers. Optical astronomers can't be cured of their Ångström habit. Going still more high-energy, after an island of nanometers in the UV you end up in the realm of keV in X-ray, and then MeV, GeV, TeV and even EeV.

However, there is just one VO (or at least that's where we want to go). Historically, the VO has had a slant towards optical astronomy, which gives us the legacy of having wavelengths in far too many places, including Obscore. Retrospectively, this was an unfortunate choice not only because it makes us look optical bigots, but in particular because in contrast to energy and, by ν = E/h, frequency, messenger wavelength depends on the medium you work in, and I shudder to think how many wavelengths in my data center actually are air wavelengths rather than vacuum wavelengths. Also, as you go beyond photons, energy really is the only thing that reasonably characterises all messengers alike (well, even that still isn't quite settled for gravitational waves as long as we're not done with a quantum theory of gravitation).

Well – the wavelength milk is spilled. Still, the VO has been boldly expanding its reach beyond the optical and infrared windows (recently, with neutrinos and gravitational waves, not to mention EPN-TAP's in-situ measurements in the solar system, even beyond the electromagnetic spectrum). Which means we will have to accomodate the various customs regarding spectral units described above. Where there are “thick” user interfaces, these can care about that. For instance, my datalink XSLT and javascript lets people constrain spectral cutouts (along BAND) in a variety of units (Example).

But what if the UI is as shallow as it is in ADQL, where you deal with whatever is in the underlying database tables? This has come up again at last week's EuroVO Technology Forum in virtual Strasbourg in the context of making Obscore more attractive to radio astronomers. And thus I've sat down and taught DaCHS a new user defined function to address just that.

Up front: When you read this in 2022 or beyond and everything has panned out, the function might be called ivo_specconv already, and perhaps the arguments have changed slightly. I hope I'll remember to update this post accordingly. If not, please poke me to do so.

The function I'm proposing is, mainly, gavo_specconv(expr, target_unit). All it does is convert the SQL expression expr to the (spectral) target_unit if it knows how to do that (i.e., if the expression's unit and the target unit are spectral units properly written in VOUnit) and raise an error otherwise.

So, you can now post:
```
SELECT TOP 5 gavo_specconv(em_min, 'GHz') AS nu
FROM ivoa.obscore
WHERE gavo_specconv((em_min+em_max)/2, 'GHz')
    BETWEEN 1 AND 2
  AND obs_collection='VLBA LH sources'
```
to the TAP service at http://dc.g-vo.org/tap. You will get your result in GHz, and you write your constraint in GHz, too. Oh, and see below on the ugly constraint on obs_collection.

Similarly, an X-ray astronomer would say, perhaps:
```
SELECT TOP 5 access_url, gavo_specconv(em_min, 'keV') AS energy
FROM ivoa.obscore
WHERE gavo_specconv((em_min+em_max)/2, 'keV')
  BETWEEN 0.5 AND 2
  AND obs_collection='RASS'
```
This works because the ADQL translator can figure out the unit of its first argument. But, perhaps regrettably, ADQL has no notion of literals with units, and so there is no way to meaningfully say the equivalent of gavo_specconv(656, 'Hz') to get Hα in Hz, and you will receive a (hopefully helpful) error message if you try that.

However, this functionality is highly desirable not the least because the queries above are fairly inefficient. That's why I added the funny constraints on the collection: without them, the queries will take perhaps half a minute and thus require async operation on my box.

The (fundamental) reason for that is that postgres is not smart enough to work out it could be using an index on em_min and em_max if it sees something like nu between 3e8/em_min and 3e7/em_max by re-writing the constraint into 3e8/nu between em_min and em_max (and think really hard about whether this is equivalent in the presence of NULLs). To be sure, I will not teach that to my translation layer either. Not using indexes, however, is a recipe for slow queries when the obscore table you query has about 85 million rows (hi there in 2050: yes, that was a sizable table in our day).

To let users fix what's too hard for postgres (or, for that matter, the translation engine when it cannot figure out units), there is a second form of gavo_specconv that takes a third argument: gavo_specconv(expr, unit_of_expr, target_unit). With that, you can write queries like:
```
SELECT TOP 5 gavo_specconv(em_min, 'Angstrom') AS nu
FROM ivoa.obscore
WHERE gavo_specconv(5000, 'Angstrom', 'm')
  BETWEEN em_min AND em_max
```
and hope the planner will use indexes. Full disclosure: Right now, I don't have indexes on the spectral limits of all tables contributing to my obscore table, so this particular query only looks fast because it's easy to find five datasets covering 500 nm – but that's an oversight I'll fix soon.

Of course, to make this functionality useful in practice, it needs to be available on all obscore services (say) – only then can people run all-VO obscore searches without the optical bias. The next step (before Bambi-eyeing the TAP implementors) therefore would be to get it into the catalogue of ADQL user defined functions.

For this, one would need to specify a bit more carefully what units must minimally be supported. In DaCHS, I have built this on a full implementation of VOUnits, which means you can query using attoparsecs of wavelength and get your result in dekaerg (which is a microjoule: 1 daerg = 1 uJ in VOUnits – don't you just love this?):
```
SELECT gavo_specconv(
  (spectral_start+spectral_end)/2, 'daerg')
  AS energy
FROM rr.stc_spectral
WHERE gavo_specconv(0.0002, 'apc', 'J')
  BETWEEN spectral_start AND spectral_end
```
(stop computing: an attoparsec is about 3 cm). This, incidentally, queries the draft RegTAP extension for the VODataService 1.2 coverage in space, time, and spectrum, which is another reason I'm proposing this function: I'm not quite sure how well my rationale that using Joules of energy is equally inconvenient for all communities will be generally received. The real rationale – that Joule is the SI unit for energy – I don't dare bring forward in the first place.

Playing with wavelengths in AU (you can do that, too; note, though, that VOUnit forbids prefixes on AU, so don't even try mAU) is perhaps entertaining in a slightly twisted way, but admittedly poses a bit of a challenge in implementation when one does not have full VOUnits available. I'm currently thinking that m, nm, Angstrom, MHz, GHz, keV and MeV (ach! No Joule! But no erg, either!) plus whatever spectral units are in use in the local tables would about cover our use cases. But I'd be curious what other people think.

Since I found the implementation of this a bit more challenging than I had at first expected, let me say a few words on how the underlying code works; I guess you can stop reading here unless you are planning to implement something like this.

The fundamental trouble is that spectral conversions are non-linear. That means that what I do for ADQL's IN_UNIT – just compute a conversion factor and then multiply that to whatever expression is in its first argument – will not work. Instead, one has to write a new expression. And building these expressions becomes involved because there are thousands of possible combinations of input and output units.

What I ended up doing is adopting standard (i.e., SI) units for energy (J), wavelength (m), and frequency (Hz) as common bases, and then first convert the source and target units to the applicable standard unit. This entails trying to convert each input unit to each standard unit until a conversion actually works, which in DaCHS' Python looks like this:
```
def toStdUnit(fromUnit):
    for stdUnit in ["J", "Hz", "m"]:
        try:
             factor = base.computeConversionFactor(
                 fromUnit, stdUnit)
        except base.IncompatibleUnits:
            continue
        return stdUnit, factor

    raise common.UfuncError(
        f"specconv: {fromUnit} is not a spectral unit understood here")
```
The VOUnits code is hidden away in base.computeConversionFactor, which raises an IncompatibleUnits when a conversion is impossible; hence, in the end, as a by-product this function also determines what kind of spectral value (energy, frequency, or wavelength) I am dealing with.

That accomplished, all I need to do is look up the conversions between the basic units, which can be done in a single dictionary mapping pairs of standard units to the conversion expression templates. I have not tried to make these templates particularly pretty, but if you squint, you can still, I hope, figure out this is actually what the opening image shows:
```
SPEC_CONVERSION = {
    ("J", "m"): "h*c/(({expr})*{f})",
    ("J", "Hz"): "({expr})*{f}/h",
    ("J", "J"): "({expr})*{f}",
    ("Hz", "m"): "c/({expr})/{f}",
    ("Hz", "Hz"): "{f}*({expr})",
    ("Hz", "J"): "h*{f}*({expr})",
    ("m", "m"): "{f}*({expr})",
    ("m", "Hz"): "c/({expr})/{f}",
    ("m", "J"): "h*c/({expr})/{f}",}
```
expr is (conceptually) replaced by the first argument of the UDF, and f is the conversion factor between the input unit and the unit expr is in. Note that thankfully, no additive operators are involved and thus all this is numerically well-conditioned. Hence, I can afford not attempting to simplify any of the expressions involved.

The rest is essentially book-keeping, where I'm using the ADQL parser to turn the expression into a tree fragment and then fiddling in the tree fragment for expr into that. The result then replaces the UDF function call in the syntax tree. You can review all this in context in DaCHS' ufunctions.py, starting at the definition of toStdUnit.

Sure: this is no Turing award material. But perhaps these notes are useful when people want to put this kind of thing into their ADQL engines. Which I'd consider a Really Good Thing™.

Category: Demo
Tangible Astronomy and Movies with TOPCAT

2021-03-31 Markus Demleitner

This March, I've put up two new VO resources (that's jargon for “table or service or whatever”) that, I think, fit quite well what I like to call tangible astronomy: things you can readily relate to what you see when you step out at night. And, since I'm a professing astronomy nerd, that's always nicely gratifying.

The two resources are the Constellations as Polygons and the Gaia eDR3 catalogue of nearby stars (GCNS).

Constellations

On the constellations, you might rightfully say that's really far from science. But then they do help getting an idea where something is, and when and from where you might see something. I've hence wanted for a long time to re-publish the Davenhall Constellation Boundary Data as proper, ADQL-queriable polygons, and figuring out where the loneliest star in the sky (and Voyager 1) were finally made me do it.

Taurus in the GCNS density plot: with constellations!

So, since early March there's the cstl.geo table on the TAP service at https://dc.g-vo.org/tap with the constallation polygons in its p column. Which, for starters, means it's trivial to overplot constallation boundaries in your favourite VO clients now, as in the plot above. To make it, I've just done a boring SELECT * FROM cstl.geo, did the background (a plain HEALPix density plot of GCNS) and, clicked Layers → Add Area Control and selected the cstl.geo table.

If you want to identify constellations by clicking, while in the area control, choose “add central” from the Forms menu in the Form tab; that's what I did in the figure above to ensure that what we're looking at here is the Hyades and hence Taurus. Admittedly: these “centres“ are – as in the catalogue – just the means of the vertices rather than the centres of mass of the polygon (which are hard to compute). Oh, and: there is also the AreaLabel in the Forms menu, for when you need the identification more than the table highlighting (be sure to use a center anchor here).

Note that TOPCAT's polygon plot at this point is not really geared towards large polygons (which the constellations are) right now. At the time of writing, the documentation has: “Areas specified in this way are generally intended for displaying relatively small shapes such as instrument footprints. Larger areas may also be specified, but there may be issues with use.” That you'll see at the edges of the sky plots – but keeping that in mind I'd say this is a fun and potentially very useful feature.

What's a bit worse: You cannot turn the constellation polygons into MOCs yet, because the MOC library currently running within our database will not touch non-convex polygons. We're working on getting that fixed.
Nearby Stars

Similarly tangible in my book is the GCNS: nearby stars I always find romantic.

Let's look at the 100 nearest stars, and let's add spectral types from Henry Draper (cf. my post on Annie Cannon's catalogue) as well as the constellation name:
```
WITH nearest AS (
SELECT TOP 100
  a.source_id,
  a.ra, a.dec,
  phot_g_mean_mag,
  dist_50,
  spectral
FROM gcns.main AS a
LEFT OUTER JOIN hdgaia.main AS b
  ON (b.source_id_dr3=a.source_id)
ORDER BY dist_50 ASC)
SELECT nearest.*, name
FROM nearest
JOIN cstl.geo AS g
  ON (1=CONTAINS(
    POINT(nearest.ra, nearest.dec),
    p))
```
Note how I'm using CONTAINS with the polygon in the constellations table here; that's the usage I've had in mind for this table (and it's particularly handy with table uploads).

That I have a Common Table Expression (“WITH”) here is due to SQL planner confusion (I'll post something about that real soon now): With the WITH, the machine first selects the nearest 100 rows and then does the (relatively costly) spatial match, without it, the machine (somewhat surprisingly) did the geometric match first. This particular confusion looks fixable, but for now I'd ask you for forgiveness for the hack – and the technique is often useful anyway.

If you inspect the result, you will notice that Proxima Cen is right there, but α Cen is missing; without having properly investigated matters, I'd say it's just too bright for the current Gaia data reduction (and quite possibly even for future Gaia analysis).

Most of the objects on that list that have made it into the HD (i.e., have a spectral type here) are K dwarfs – which is an interesting conspiracy between the limits of the HD (the late red and old white dwarfs are too weak for it) and the limits of Gaia (the few earlier stars within 6 parsec – which includes such luminaries as Sirius at a bit more than 2.5 pc – are just too bright for where Gaia data reduction is now).
Animation

Another fairly tangible thing in the GCNS is the space velcity, given in km/s in the three dimensions U, V, and W. That is, of course, an invitation to look for stellar streams, as, within the relatively small portion of the Milky Way the GCNS looks at, stars on similar orbits will exhibit similar space motions.

Considering the velocity dispersion within a stellar stream will be a few km/s, let's have the database bin the data. Even though this data is small enough to conveniently handle locally, this kind of remote analysis is half of what TAP is really great at (the other half being the ability to just jump right into a new dataset). You can group by multiple things at the same time:
```
SELECT
  COUNT(*) AS n,
  ROUND(uvel_50/5)*5 AS ubin,
  ROUND(vvel_50/5)*5 AS vbin,
  ROUND(wvel_50/5)*5 AS wbin
FROM gcns.main
GROUP BY ubin, vbin, wbin
```
Note that this (truly) 3D histogram only represents a small minority of the GCNS objects – you need radial velocities for space motion, and these are precious even in the Gaia age.

What really surprised me is how clumpy this distribution is – are we sure we already know all stellar streams in the solar neighbourhood? Watch for yourself (if your browser can't play webm, complain to your vendor):

[Update (2021-04-01): Mark Taylor points out that the “flashes” you sometimes see when the grid is aligned with the viewing axes (and the general appearance) could be improved by just pulling all non-NULL UVW values out of the table and using a density plot (perhaps shading=density densemap=inferno densefunc=linear). That is quite certainly true, but it would of course defeat the purpose of having on-server aggregation. Which, again, isn't all that critical for this dataset, so doing the prettier plot actually is a valuable exercise for the reader]

How did I make this video? Well, I started with a Cube Plot in TOPCAT as usual, configuring weighted plotting with n as its weight and played around a bit with scaling out a few outliers. And then I saved the table (to zw.vot), hit “STILTS“ in the plot window and saved the text from there to a text file, zw.sh. I had to change the ``in`` clause in the script to make it look like this:
```
#!/bin/sh
stilts plot2cube \
 xpix=887 ypix=431 \
 xlabel='ubin / km/s' ylabel='vbin / km/s' \
 zlabel='wbin / km/s' \
 xmin=-184.5 xmax=49.5 ymin=-77.6 ymax=57.6 \
 zmin=-119.1 zmax=94.1 phi=-84.27 theta=90.35 \
  psi=-62.21 \
 auxmin=1 auxmax=53.6 \
 auxvisible=true auxlabel=n \
 legend=true \
 layer=Mark \
    in=zw.vot \
    x=ubin y=vbin z=wbin weight=n \
    shading=weighted size=2 color=blue
```
– and presto, sh zw.sh would produce the plot I just had in TOPCAT. This makes a difference because now I can animate this.

In his documentation, Mark already has a few hints on how to build animations; here are a few more ideas on how to organise this. For instance, if, as I want here, you want to animate more than one variable, stilts tloop may become a bit unwieldy. Here's how to give the camera angles in python:
```
import sys
from astropy import table
import numpy

angles = numpy.array(
  [float(a) for a in range(0, 360)])
table.Table([
    angles,
    40+30*numpy.cos((angles+57)*numpy.pi/180)],
  names=("psi", "theta")).write(
    sys.stdout, format="votable")
```
– the only thing to watch out for is that the names match the names of the arguments in stilts that you want to animate (and yes, the creation of angles will make numpy afficionados shudder – but I wasn't sure if I might want to have somewhat more complex logic there).

[Update (2021-04-01): Mark Taylor points out that all that Python could simply be replaced with a straightforward piece of stilts using the new loop table scheme in stilts, where you would simply put:
```
animate=:loop:0,360,0.5
acmd='addcol phi $1'
acmd='addcol theta 40+30*cosDeg($1+57)'
```
into the plot2cube command line – and you wouldn't even need the shell pipeline.]

What's left to do is basically the shell script that TOPCAT wrote for me above. In the script below I'm using a little convenience hack to let me quickly switch between screen output and file output: I'm defining a shell variable OUTPUT, and when I un-comment the second OUTPUT, stilts renders to the screen. The other changes versus what TOPCAT gave me are de-dented (and I've deleted the theta and psi parameters from the command line, as I'm now filling them from the little python script):
```
OUTPUT="omode=out out=pre-movie.png"
#OUTPUT=omode=swing

python3 camera.py |\
stilts plot2cube \
   xpix=500 ypix=500 \
   xlabel='ubin / km/s' ylabel='vbin / km/s' \
   zlabel='wbin / km/s' \
   xmin=-184.5 xmax=49.5 ymin=-77.6 ymax=57.6 \
   zmin=-119.1 zmax=94.1 \
   auxmin=1 auxmax=53.6 \
phi=8 \
animate=- \
afmt=votable \
$OUTPUT \
   layer=Mark \
      in=zw.vot \
      x=ubin y=vbin z=wbin weight=n \
      shading=weighted size=4 color=blue

# render to movie with something like
# ffmpeg -i "pre-movie-%03d.png" -framerate 15 -pix_fmt yuv420p /stream-movie.webm
# (the yuv420p incantation is so real-world
# web browsers properly will not go psychedelic
# with the colours)
```
The comment at the end says how to make a proper movie out of the PNGs this produces, using ffmpeg (packaged with every self-respecting distribution these days) and yielding a webm. Yes, going for mpeg x264 might be a lot faster for you as it's a lot more likely to have hardware support, but everything around mpeg is so patent-infested that for the sake of your first-born's soul you probably should steer clear of it.

Movies are fun in webm, too.
Category: Data
Semantics, Cross-Discipline Discovery, and Down-To-Earth Code

2021-02-18 Markus Demleitner

A tiny piece of the Unified Astronomy Thesaurus as viewed by Sembarebro – the IVOA logos sit on terms that have VO resoures on them.

Sometimes people ask me (in particular when I'm wearing my hat as the current chair of the IVOA Semantics working group) “well, what's this semantics thing good for?“ There are many answers, but here's one that nicely meshes with my pet subject data discovery: You want hierarchical, agreed-upon word lists to bridge discipline gaps.

This story starts with B2FIND, a cross-disciplinary metadata aggregator for science data run within the framework of the European Open Science Cloud (EOSC). GAVO (or, more precisely, Heidelberg University's Astronomy) is involved in the EOSC via the ESCAPE project, and so I have had the pleasure of interacting with B2FIND for a while now. In particular, they are harvesting the metadata records of the Virtual Observatory Registry from us.

This of course requires a bit of mapping, because the VO's metadata formats (VOResource, VODataService, and several extensions; see 2014A&C.....7..101D to learn more) are far too fine-grained for the wider scientific public. Not even our good friends from high-energy physics would appreciate being served links to, say, TAP endpoints (yet!). So, on our end we're mapping to the Datacite metadata kernel, which from VOResource is just a piece of XSL away (plus some perhaps debatable conventions).

But there's more to this mapping, such as vocabularies of subject keywords. You might argue that in the age of rapid full text searches, keywords are dead. I would beg to disagree. For example, with good, hierarchical keyword systems you can, among many other useful things, offer topical browsing of metadata repositories. While it might not quite qualify as “useful” yet, the SemBaReBro registry browser I've hacked together late last year would be an example for such facilities – and might become part of our WIRR Registry searching tool one day.

On the topic of subject keywords VOResource says that resources in the VO should be using the Unified Astronomy Thesaurus, specifically in its IVOA incarnation (not quite true yet, but true enough by blog standards). While few do, I've done a mapping of existing keywords in the VO to UAT concepts, which is what's behind SemBaReBro. So: most VO resources now have UAT concepts.

However, these include concepts like AM Canum Venaticorum Stars, which outside of rather specialised circles of astronomers few people will ever have heard about (which, don't get me wrong, I personally regret – they're funky star systems). Hence, B2FIND does not bother with those.

When we discussed the subject mapping for B2FIND, we thought using the UAT's top-level concepts might be a good start. However, at that point no VO resources at all actually used these, and, indeed, within astronomy that generally wouldn't make a lot of sense, because they are to unspecific to help much within the discipline. I postponed and then forgot about the problem – when the keywords of the resources weren't even from UAT, solving the granularity mismatch just wasn't humanly possible.

That was the state of affairs until last Tuesday, when I had a mumble session with B2FIND folks and the topic came up again. And now, thanks partly to the new desise format proposed in the current Vocabularies in the VO 2 draft, things fell nicely into place: Hey, I have UAT concepts, and mapping these to the top-level terms isn't hard either any more.

So, B2FIND gets the toplevel keywords they've been expecting all the time starting today. Yes: This isn't a panacea suddenly solving all the problems of cross-discipline data discovery, not the least because it's harder than one might think to imagine how such a thing would look like in practice. But given the complexities involved I was positively surprised how easy this particular part of the equation was.

From here on, there's a bit of tech babble I intend to re-use in the RFC of Vocabularies in the VO 2; don't feel bad if you skip it.

The first step was to make the mapping from UAT terms to the toplevel terms. The interesting part of the source I'm linking to here is:
```
def get_roots_for(term, uat_terms):
  roots, seen = set(), set()

  def follow(t):
    wider = uat_terms[t]["wider"]
    if not wider:
      if not t in ROOT_TERMS:
        raise Exception(
          f"{t} found as a top-level term")
      roots.add(t)
    else:
      seen.add(t)
      for wider in uat_terms[t]["wider"]:
        follow(wider)

  follow(term)
  return roots
```
There, uat_terms is essentially just a json-decode of what you get from the vocabulary URI if you ask for desise (see the draft spec linked to above for the technicalities). That's really it, and it even defends against cycles in the concept graph (which are legal by SKOS but shouldn't happen in the UAT) and detached terms (i.e., ones that are not rooted in the top-level terms). For what it does, I claim that's remarkably compact code.

Once I had that, I needed to get the UAT-mapped subject keywords for the records I'm serving to datacite and fiddle the corresponding roots back in. That's technically a bit more involved because I am producing the datacite records on the fly from the XML representation for VOResource records that I keep in the database, and there's a bit of namespace magic involved (full code). Plus, the UAT-mapped keywords are only kept in the database, not in the metadata records.

Still, the core operation here is relatively straightforward. Consider:
```
def addUATToplevels(dataciteTree):
  # dataciteTree is an (lxml) ElementTree for the
  # result of the XSL transformation.  That's all
  # I have, and thus I first have to fiddle out
  # the identifier we are talking about
  ivoid =  dataciteTree.xpath(
      "//d:alternateIdentifier["
      "@alternateIdentifierType='ivoid']",
      namespaces={"d": DATACITE_NS}
    )[0].text.lower()
  # The .lower() is necessary because ivoids
  # unfortunately are case-insensitive, and RegTAP
  # normalises them to lowercase to retain sanity.

  # Now pull the UAT-mapped subject keywords from
  # our RegTAP extension (getTableConn is
  # DaCHS-internal API, but there's no magic in
  # there, it's just connection pooling with
  # guarantees against connections  idle in
  # transaction).
  with base.getTableConn() as conn:
    subjects = set(r[0] for r in
      conn.query("SELECT uat_concept"
        " FROM rr.subject_uat"
        " WHERE ivoid=%(ivoid)s", locals()))

  # This is the mapping itself: we do
  # roots-subjects to avoid adding
  # root terms that are already in
  # the record itself.  UAT_TOPLEVELS is the result
  # of the root finding discussed above.
  for term in subjects:
    root = UAT_TOPLEVELS[term]
    newRoots |= (root-subjects)

  # And finally fiddle in any new root terms found
  # into the datacite tree
  if newRoots:
    subjects = dataciteTree.xpath(
      "//d:subjects",
      namespaces={"d": DATACITE_NS})[0]
    for root in newRoots:
      newSubject = etree.SubElement(subjects,
        f"{{{DATACITE_NS}}}subject")
      newSubject.text = root
```
Apart from the technicalities I'd again say that's pretty satisfying code.

And these two pieces of code are really all I had to do to map between the vocabularies of different granularities – which I claim will probably be the norm as metadata flows between disciplines.

It's great to see the pieces of a fairly comples puzzle fall into place like that.

Category: Operations

« Page 10 / 22 »

Constellations

Nearby Stars

Animation