Semantics, Cross-Discipline Discovery, and Down-To-Earth Code

Boxes-and-arrows view of the UAT
A tiny piece of the Unified Astronomy Thesaurus as viewed by Sembarebro – the IVOA logos sit on terms that have VO resoures on them.

Sometimes people ask me (in particular when I’m wearing my hat as the current chair of the IVOA Semantics working group) “well, what’s this semantics thing good for?“ There are many answers, but here’s one that nicely meshes with my pet subject data discovery: You want hierarchical, agreed-upon word lists to bridge discipline gaps.

This story starts with B2FIND, a cross-disciplinary metadata aggregator for science data run within the framework of the European Open Science Cloud (EOSC). GAVO (or, more precisely, Heidelberg University’s Astronomy) is involved in the EOSC via the ESCAPE project, and so I have had the pleasure of interacting with B2FIND for a while now. In particular, they are harvesting the metadata records of the Virtual Observatory Registry from us.

This of course requires a bit of mapping, because the VO’s metadata formats (VOResource, VODataService, and several extensions; see 2014A&C…..7..101D to learn more) are far too fine-grained for the wider scientific public. Not even our good friends from high-energy physics would appreciate being served links to, say, TAP endpoints (yet!). So, on our end we’re mapping to the Datacite metadata kernel, which from VOResource is just a piece of XSL away (plus some perhaps debatable conventions).

But there’s more to this mapping, such as vocabularies of subject keywords. You might argue that in the age of rapid full text searches, keywords are dead. I would beg to disagree. For example, with good, hierarchical keyword systems you can, among many other useful things, offer topical browsing of metadata repositories. While it might not quite qualify as “useful” yet, the SemBaReBro registry browser I’ve hacked together late last year would be an example for such facilities – and might become part of our WIRR Registry searching tool one day.

On the topic of subject keywords VOResource says that resources in the VO should be using the Unified Astronomy Thesaurus, specifically in its IVOA incarnation (not quite true yet, but true enough by blog standards). While few do, I’ve done a mapping of existing keywords in the VO to UAT concepts, which is what’s behind SemBaReBro. So: most VO resources now have UAT concepts.

However, these include concepts like AM Canum Venaticorum Stars, which outside of rather specialised circles of astronomers few people will ever have heard about (which, don’t get me wrong, I personally regret – they’re funky star systems). Hence, B2FIND does not bother with those.

When we discussed the subject mapping for B2FIND, we thought using the UAT’s top-level concepts might be a good start. However, at that point no VO resources at all actually used these, and, indeed, within astronomy that generally wouldn’t make a lot of sense, because they are to unspecific to help much within the discipline. I postponed and then forgot about the problem – when the keywords of the resources weren’t even from UAT, solving the granularity mismatch just wasn’t humanly possible.

That was the state of affairs until last Tuesday, when I had a mumble session with B2FIND folks and the topic came up again. And now, thanks partly to the new desise format proposed in the current Vocabularies in the VO 2 draft, things fell nicely into place: Hey, I have UAT concepts, and mapping these to the top-level terms isn’t hard either any more.

So, B2FIND gets the toplevel keywords they’ve been expecting all the time starting today. Yes: This isn’t a panacea suddenly solving all the problems of cross-discipline data discovery, not the least because it’s harder than one might think to imagine how such a thing would look like in practice. But given the complexities involved I was positively surprised how easy this particular part of the equation was.

From here on, there’s a bit of tech babble I intend to re-use in the RFC of Vocabularies in the VO 2; don’t feel bad if you skip it.

The first step was to make the mapping from UAT terms to the toplevel terms. The interesting part of the source I’m linking to here is:

def get_roots_for(term, uat_terms):
  roots, seen = set(), set()

  def follow(t):
    wider = uat_terms[t]["wider"]
    if not wider:
      if not t in ROOT_TERMS:
        raise Exception(
          f"{t} found as a top-level term")
      roots.add(t)
    else:
      seen.add(t)
      for wider in uat_terms[t]["wider"]:
        follow(wider)
  
  follow(term)
  return roots

There, uat_terms is essentially just a json-decode of what you get from the vocabulary URI if you ask for desise (see the draft spec linked to above for the technicalities). That’s really it, and it even defends against cycles in the concept graph (which are legal by SKOS but shouldn’t happen in the UAT) and detached terms (i.e., ones that are not rooted in the top-level terms). For what it does, I claim that’s remarkably compact code.

Once I had that, I needed to get the UAT-mapped subject keywords for the records I’m serving to datacite and fiddle the corresponding roots back in. That’s technically a bit more involved because I am producing the datacite records on the fly from the XML representation for VOResource records that I keep in the database, and there’s a bit of namespace magic involved (full code). Plus, the UAT-mapped keywords are only kept in the database, not in the metadata records.

Still, the core operation here is relatively straightforward. Consider:

def addUATToplevels(dataciteTree):
  # dataciteTree is an (lxml) ElementTree for the 
  # result of the XSL transformation.  That's all 
  # I have, and thus I first have to fiddle out 
  # the identifier we are talking about
  ivoid =  dataciteTree.xpath(
      "//d:alternateIdentifier["
      "@alternateIdentifierType='ivoid']", 
      namespaces={"d": DATACITE_NS}
    )[0].text.lower()
  # The .lower() is necessary because ivoids 
  # unfortunately are case-insensitive, and RegTAP 
  # normalises them to lowercase to retain sanity.

  # Now pull the UAT-mapped subject keywords from 
  # our RegTAP extension (getTableConn is 
  # DaCHS-internal API, but there's no magic in 
  # there, it's just connection pooling with 
  # guarantees against connections  idle in 
  # transaction).
  with base.getTableConn() as conn:
    subjects = set(r[0] for r in 
      conn.query("SELECT uat_concept"
        " FROM rr.subject_uat"
        " WHERE ivoid=%(ivoid)s", locals()))

  # This is the mapping itself: we do 
  # roots-subjects to avoid adding  
  # root terms that are already in 
  # the record itself.  UAT_TOPLEVELS is the result
  # of the root finding discussed above.
  newRoots = set()
  for term in subjects:
    root = UAT_TOPLEVELS[term]
    newRoots |= (root-subjects)

  # And finally fiddle in any new root terms found 
  # into the datacite tree
  if newRoots:
    subjects = dataciteTree.xpath(
      "//d:subjects", 
      namespaces={"d": DATACITE_NS})[0]
    for root in newRoots:
      newSubject = etree.SubElement(subjects, 
        f"{{{DATACITE_NS}}}subject")
      newSubject.text = root

Apart from the technicalities I’d again say that’s pretty satisfying code.

And these two pieces of code are really all I had to do to map between the vocabularies of different granularities – which I claim will probably be the norm as metadata flows between disciplines.

It’s great to see the pieces of a fairly comples puzzle fall into place like that.

Sofa instead of Granada

[Screenshot from an online talk]
Gesticulating wildly to a computer is what happens in an online conference. To me, at least. Let’s hope nobody watched me through the window.

It was already in the wee hours of Friday last week (CET) when the second “virtual Interop” had its rather unceremonious closing ceremony. Its predecessor in May had about it an air of a state of emergency. For instance, all sessions were monothematic. That was nice on the one hand, because a relatively large part of the time was available for discussion – which, really, is what the Interops are about. But then Interops are also about noticing what everyone else in the Virtual Observatory is cooking up, for which the short-ish talks we usually have at Interops work really well.

In contrast to that first Corona Interop, this second one, replacing what would have taken place in Granada, Spain, had a much more conventional format, which again accomodated many talks. But of course, this made one feel the lack of possibilities to quickly hash out a problem during a coffee break or in a spontaneous splinter quite a bit more.

Be that as it may, I would like to give you some insights on what I’m currently up to at the IVOA level; I am grateful for any feedback you can give on any of these topics.

Given that I currently chair the Semantics Working group, there was a natural focus on topics around vocabularies, and I gave two talks in that department. The one in DAL (DAL is the working group that builds the actual access protocols such as TAP or SIAP) was mainly on Datalink-related aspects of my Vocabularies in the VO 2 draft (VocInVO2), which in particular was an opportunity to thank everyone involved in the Vocabulary Enhancement Proposals we have been running this last year (all of which were about Datalink and hence closely tied to DAL). One thing I was asking for was reviews on a github pull request that would make the bysemantics method of Datalink accesses semantics-aware; basically, as intended by the original Datalink authors, when asking for #calibration links, this will also return, say, #bias links. If you can spare a moment for this: Please do!

Another thing I tried to raise some interest for is the proposed vocabulary of product types; this, I think, should eventually define what people may put into the dataproduct_type column of Obscore results, and there are related uses in Datalink and, believe it or not, the registration of SSAP (spectral) services. A question Alberto raised while I was discussing that made me realise I forgot to mention another vocabularies-related development relevant for DAL: I’ve put the gavo_vocmatch ADQL user-defined function into DaCHS. It lets you match something against a term or its narrower terms, referencing an IVOA vocabulary. For instance, if we had different sorts of time series (which, of course, would be odd for obscore that has the o_ucd column for this kind of thing), you could, using ADQL, still get all time series by querying

SELECT TOP 5 * 
FROM ivoa.obscore
WHERE
  1=gavo_vocmatch(
    ’product-type’, 
    ’timeseries’, 
    dataproduct_type)

Here, the first argument is the vocabulary name (whatever is after the http://www.ivoa.net/rdf in the vocabulary URL), the second the “root” term, and the third the column to match against. Since postgres, for now, isn’t aware of IVOA vocabularies, the second argument must be a literal string rather than, say, an expression involving columns.

I gave a second semantics-related talk in the Registry session. That had its focus on the Unified Astronomy Thesaurus (UAT), from which people should pick the subject keywords in the VO Registry (actually, they should pick from its representation at http://www.ivoa.net/rdf/uat). I’ll probably blog about that a little more some other time. For now, let me recommend a little UAT-based game on my Semantics Based Registry Browser sembarebro: Choose two terms that are pretty far apart (like, perhaps, ionized-coma-gases and cosmic-background-radiation) and then try to join the two sub-graphs. Warning: This may waste your time. But it will acquaint you with the UAT, which may be a good thing.

In that second talk, I also mentioned a second draft vocabulary I’ve put up in the past six months, http://www.ivoa.net/rdf/messenger. This builds upon the terms for VODataService’s waveband element, which enumerated certain flavours of photons (like Radio, Optical, or X-ray). Now that we explore other messengers as well and have more and more solar system resources in the Registry, I’m arguing we ought to open up things by making “Photon” explicit in there and then adding Neutrinos and, later, other messengers. I’ve received a certain amount of pushback there on mixing the electromagnetic spectrum with particle types; on the other hand, the hierarchical nature of our vocabularies would, I think, let us smartly get away with that.

Speaking about solar system resources, I’m also listed as an author on Stéphane Erard’s talk on EPN-TAP and EPNCore v2.0, probably due to my involvement in finally bringing EPN-TAP into the IVOA document repository. I’ve already talked about that in a 2017 post on this blog – and again, if you’re interested in solar system data, this would be a good time to review the EPN-TAP working draft.

Talking about things regluar readers of this blog will have heard of: September’s Crazy Shapes post I’ve referenced in a talk on MOCs in pgsphere, together with a fervent appeal to data centers to become involved in pgsphere maintenance.

And then there was my colleague Margarida’s talk on LineTAP, a proposal to obsolete the little-used SLA protocol (which lets people search for spectral lines) with something combining the much more successful VAMDC with our beloved TAP. Me, I’m in this because I’d like to bring TOSS data closer to VAMDC – but also because having competing infrastructures for the same thing sucks.

And finally, I gave a talk I’ve called Data Model Posture Review in a session of the Data Models working group; I was somewhat worried that given its rather skeptical outlook it wouldn’t be really well-received. But in fact quite a few people shared my main conclusions – and perhaps it was another step towards resolving my decade-old spot of pain: that the VO still doesn’t offer tech to reliably bring two catalogues to the same epoch without human intervention.

With this number of talks I’ve been involved in, I’m essentially back to the level of a normal Interop. Which means I’ve been fairly knocked-out on Friday. And I can’t lie: I still regret I didn’t get to spend a few more warm days in Granada. Corona begone!