Doing Large-Scale ADQL Queries
You can do many interesting things with TAP and ADQL using queries that return just a few thousand rows after a few seconds. Most examples you will find in tutorials are of that type, and when the right indexes exist on the queried tables, the scope of, let's say, casual ADQL goes far beyond toy examples.
Actually, arranging things such that you only fetch the data you need for the analysis at hand – and that often is not much more than the couple of kilobytes that go into a plot or a regression or whatever – is a big reason why TAP and ADQL were invented in the first place.
But there are times when the right indexes are not in place, or when you absolutely have to do something for almost everything in a large table. Database folks call that a sequential scan, or seqscan for short. For larger tables (to give an order of magnitude: beyond 10⁷ rows in my data centre, but that obviously depends), this means you have to allow for longer run times. There are even times when you need to fetch large portions of such a large table; then you will probably also run into hard match limits, because there is just no way to retrieve your full result set in one go.
This post is about ways to deal with such situations. But let me say right away that having to go down these paths (in particular the partitioning we will get to towards the end of the post) may be a sign that you should re-think what you are doing; below, I briefly give pointers on that, too.
Raising the Match Limit
Most TAP services will not let you retrieve arbitrarily many rows in one go. Mine, for instance, at this point will snip results off at 20'000 rows by default, mainly to protect you and your network connection against being swamped by huge results you did not expect.
You can, and frequently will have to (even for an all-sky level 6 HEALPix map, for instance, as that will retrieve 49'152 rows), raise that match limit. In TOPCAT, that is done through a little combo box above the query input (you can enter custom values if you want):
If you are somewhat confident that you know what you are doing, there is nothing wrong with picking the maximum limit right away. On the other hand, if you are not prepared to do something sensible with, say, two million rows, then perhaps put in a smaller limit just to be sure.
In pyVO, which we will be using in the rest of this post, this is the maxrec argument to run_sync and its sibling methods on TAPService.
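For instance, assuming the GAVO endpoint used throughout this post, a minimal sketch of raising the limit from a script might look like this (the TOP clause and the table are just stand-ins for your actual query):

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# without maxrec, this would be cut off at the service's default limit
result = svc.run_sync(
    "SELECT TOP 100000 source_id FROM gaia.dr3lite",
    maxrec=100000)
print(len(result))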
Giving the Query More Time
When dealing with non-trivial queries on large tables, you will often also have to give the query some extra time. On my service, for instance, you only have a few seconds of CPU time when your client uses TAP's synchronous mode (i.e., when it calls the TAPService.run_sync method). If your query needs more time, you will have to go async. In the simplest case, all that takes is writing run_async rather than run_sync (below, we will use a somewhat more involved API; find out more about this in our pyVO course).
In async mode, you have two hours on my box at this point; this kind of time limit is, I think, fairly typical. If even that is not enough, you can ask for more time by changing the job's execution_duration parameter (before submitting it to the database engine; you cannot change the execution duration of a running job, sorry).
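To illustrate just that simplest case before we move to the more involved API below, here is a minimal sketch; the query is a stand-in for your own ADQL:

import pyvo

QUERY = "SELECT TOP 10 source_id FROM gaia.dr3lite"  # stand-in query

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# run_async submits an async job, waits for it, and fetches the result
result = svc.run_async(QUERY)
print(result)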
Let us take the example of a colour-magnitude diagram for stars in Gaia DR3 with distances of about 300 pc according to Bailer-Jones et al (2021); to make things a bit more entertaining, we want to load the result in TOPCAT without first downloading it locally; instead, we will transmit the result's URI directly to TOPCAT[1], which means that your code does not have to parse and re-package the (potentially large) data.
On the first reading, focus on the main function, though; the SAMP fun is for later:
import time
import pyvo

QUERY = """
SELECT
    source_id, phot_g_mean_mag, pseudocolour,
    pseudocolour_error, phot_g_mean_flux_over_error
FROM gedr3dist.litewithdist
WHERE
    r_med_photogeo between 290 and 310
    AND ruwe<1.4
    AND pseudocolour BETWEEN 1.0 AND 1.8
"""


def send_table_url_to_topcat(conn, table_url):
    client_id = pyvo.samp.find_client_id(conn, "topcat")
    message = {
        "samp.mtype": "table.load.votable",
        "samp.params": {
            "url": table_url,
            "name": "TAP result",}
    }
    conn.notify(client_id, message)


def main():
    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    job = svc.submit_job(QUERY, maxrec=3000)
    try:
        job.execution_duration = 10000  # that's 10000 seconds
        job.run()
        job.wait()
        assert job.phase == "COMPLETED"

        with pyvo.samp.connection(addr="127.0.0.1") as conn:
            send_table_url_to_topcat(conn, job.result_uri)
    finally:
        job.delete()


if __name__ == "__main__":
    main()
As written, this will be fast thanks to maxrec=3000, and you wouldn't really have to bother with async just yet. The result looks nicely familiar, which means that in that distance range, the Bailer-Jones distances are pretty good:
Now raise the match limit to 30000, and you will already need async. Here is what the result looks like:
Ha! Numbers matter: at least we are seeing a nice giant branch now! And of course the dot colours do not represent the colours of the stars with the respective pseudocolour; the directions of blue and red are ok, but most of what you are seeing here will look rather ruddy in reality.
You will not really need to change execution_duration here, nor will you need it even when setting maxrec=1000000 (or anything more, for that matter, as the full result set size is 330'545), as that ends up finishing within something like ten minutes. Incidentally, the result for the entire 300 pc shell, now as a saner density plot, looks like this:
Ha! Numbers matter even more. There is now even a (to me surprisingly clear) horizontal branch in the plot.
Planning for Large Result Sets? Get in Contact!
Note that if you were after a global colour-magnitude diagram like the one I have just shown, you should probably do server-side aggregation (that is: compute the densities in a few hundred or thousand bins on the server and only retrieve those; see the sketch below) rather than load ever larger result sets and have the aggregation performed by TOPCAT. More generally, it usually pays to try and optimise ADQL queries that are slow and have huge result sets before fiddling with async and, even more so, with partitioning.
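To make that concrete, here is a hedged sketch of what such an aggregation could look like for the colour-magnitude diagram above; the bin widths are arbitrary choices for illustration:

import pyvo

# compute per-bin counts on the server instead of fetching all rows
AGG_QUERY = """
SELECT
  ROUND(pseudocolour/0.02)*0.02 AS colour_bin,
  ROUND(phot_g_mean_mag/0.05)*0.05 AS mag_bin,
  COUNT(*) AS ct
FROM gedr3dist.litewithdist
WHERE r_med_photogeo BETWEEN 290 AND 310
  AND ruwe<1.4
GROUP BY colour_bin, mag_bin
"""

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
binned = svc.run_async(AGG_QUERY, maxrec=1000000).to_table()

Plotting ct over colour_bin and mag_bin then gives a density plot without ever moving hundreds of thousands of rows over the network.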
Most operators will be happy to help you do that; you will find some contact information in TOPCAT's service tab, for instance. In pyVO, you could use the get_contact method of the objects you get back from the Registry API[2]:
>>> pyvo.registry.search(ivoid="ivo://org.gavo.dc/tap")[0].get_contact()
'GAVO Data Centre Team (+49 6221 54 1837) <gavo@ari.uni-heidelberg.de>'
That said: sometimes neither optimisation nor server-side aggregation will do it: you just have to pull more rows than the service's match limit. You see, most servers will not let you pull billions of rows in one go. Mine, for instance, will cap maxrec at 16'000'000. If you need to pull more than that, you have to chunk up your query such that you can process the whole sky (or whatever else huge thing makes the table large) in manageable pieces. That is called partitioning.
Uniform-Length Partitions
To partition a table, you first need something to partition on. In database lingo, a good thing to partition on is called a primary key, typically a reasonably short string or, even better, an integer that maps injectively to the rows (i.e., no two rows have the same key). Let's keep Gaia as an example: the primary key designed for it is the source_id.
In the simplest case, you can “uniformly” partition between 0 and the largest source_id, which you will find by querying for the maximum:
SELECT max(source_id) FROM gaia.dr3lite
This should be fast. If it is not, then there is likely no sufficiently capable index on the column you picked, and hence your choice of the primary key probably is not a good one. This would be another reason to turn to the service's contact address as above.
In the present case, the query is fast and yields 6917528997577384320. With that number, you can write a program like this to split up your problem into N_PART sub-problems:
import pyvo

MAX_ID, N_PART = 6917528997577384320+1, 100
partition_limits = [(MAX_ID//N_PART)*i
    for i in range(N_PART+1)]

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
main_query = "SELECT count(*) FROM ({part}) AS q"

for lower, upper in zip(partition_limits[:-1], partition_limits[1:]):
    result = svc.run_sync(main_query.format(part=
        "SELECT * FROM gaia.dr3lite"
        " WHERE source_id BETWEEN {} and {} ".format(lower, upper-1)))
    print(result)
Exercise: Can you see why the +1 is necessary in the MAX_ID assignment?
This range trick will obviously not work when the primary key is a string; I would probably partition by first letter(s) in that case.
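If you do end up in that situation, a minimal sketch might look like this; the table and column names (my_cat.main, obj_name) are made up for illustration, and single-letter prefixes may still give you very uneven chunks:

import string

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")

# one partition per first character; go to two-letter prefixes if
# single letters still give oversized chunks
for prefix in string.ascii_uppercase:
    result = svc.run_sync(
        "SELECT count(*) FROM my_cat.main"
        " WHERE obj_name LIKE '{}%'".format(prefix))
    print(prefix, result)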
Equal-Size Partitions
However, this is not the end of the story. Gaia's (well thought-out) enumeration scheme to a large degree reflects sky positions; so, by the way, do the IAU conventions for object designations. Since most astronomical objects are distributed highly unevenly on the sky, creating partitions of equal size in identifier space will yield chunks of dramatically different sizes (a factor of 100 is not uncommon) in all-sky surveys.
In the rather common case that you need a guaranteed maximum result size per partition, you will therefore have to use two passes: first figure out the distribution of objects, then compute the desired partition limits from that.
Here is an example of how one might go about this:
from astropy import table
import pyvo

MAX_ID, ROW_TARGET = 6917528997577384320+1, 10000000

ENDPOINT = "http://dc.g-vo.org/tap"

# the 10000 is just the number of bins to use; make it too small, and
# your initial bins may already overflow ROW_TARGET
ID_DIVISOR = MAX_ID/10000

DISTRIBUTION_QUERY = f"""
select round(source_id/{ID_DIVISOR}) as bin, count(*) as ct
from gaia.dr3lite
group by bin
"""


def get_bin_sizes():
    """returns an ordered sequence of (bin_center, num_objects) rows.
    """
    # since the partitioning query already is expensive, cache it,
    # and use the cache if it's there.
    try:
        with open("partitions.vot", "rb") as f:
            tbl = table.Table.read(f)
    except IOError:
        # Fetch from source; takes about 1 hour
        print("Fetching partitions from source; this will take a while"
            " (provide partitions.vot to avoid re-querying)")
        svc = pyvo.dal.TAPService(ENDPOINT)
        res = svc.run_async(DISTRIBUTION_QUERY, maxrec=1000000)
        tbl = res.table
        with open("partitions.vot", "wb") as f:
            tbl.write(output=f, format="votable")

    res = [(row["bin"], row["ct"]) for row in tbl]
    res.sort()
    return res


def get_partition_limits(bin_sizes):
    """returns a list of limits of source_id ranges exhausting the whole
    catalogue.

    bin_sizes is what get_bin_sizes returns (and it must be sorted by
    bin center).
    """
    limits, cur_count = [0], 0
    for bin_center, bin_count in bin_sizes:
        if cur_count+bin_count>ROW_TARGET:
            limits.append(int(bin_center*ID_DIVISOR-ID_DIVISOR/2))
            cur_count = 0
        cur_count += bin_count
    limits.append(MAX_ID)
    return limits


def get_data_for(svc, query, low, high):
    """returns a TAP result for the (simple) query in the partition
    between low and high.

    query needs to query the ``sample`` table.
    """
    job = svc.submit_job("WITH sample AS "
        "(SELECT * FROM gaia.dr3lite"
        " WHERE source_id BETWEEN {} and {}) ".format(low, high)
        +query, maxrec=ROW_TARGET)
    try:
        job.run()
        job.wait()
        return job.fetch_result()
    finally:
        job.delete()


def main():
    svc = pyvo.dal.TAPService(ENDPOINT)
    limits = get_partition_limits(get_bin_sizes())
    for ct, (low, high) in enumerate(zip(limits[:-1], limits[1:])):
        print("{}/{}".format(ct, len(limits)))
        res = get_data_for(svc, <a query over a table sample>, low, high-1)
        # do your thing here
But let me stress again: If you think you need partitioning, you are probably doing it wrong. One last time: If in any sort of doubt, try the services' contact addresses.
[1] Of course, if you are doing long-running queries, you will probably postpone the deletion of the job until you are sure you have the result wherever you want it. Me, I'd probably print the result URL (for when something goes wrong in SAMP or in TOPCAT) and a curl command line to delete the job when done. Oh, and perhaps a reminder that one ought to execute that curl command line once the data is saved.
[2] Exposing the contact information in the service objects themselves would be a nice little project if you are looking for contributions you could make to pyVO; you would probably do a natural join between the rr.interface and the rr.res_role tables and thus go from the access URL (you generally don't have the ivoid in pyVO service objects) to the contact role; a sketch of such a query follows below.
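For what it is worth, here is a rough sketch of such a query against a RegTAP service; the column names follow my reading of RegTAP, and the exact access URL string (trailing slash, http vs. https) may need adjusting:

import pyvo

# any RegTAP endpoint will do; this one is the GAVO data centre's
REGTAP = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
contacts = REGTAP.run_sync("""
    SELECT role_name, email
    FROM rr.res_role
      NATURAL JOIN rr.interface
    WHERE access_url='http://dc.g-vo.org/tap'
      AND base_role='contact'
    """)
print(contacts)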