Posts with the Tag Registry:

  • Registry: A Janitor Speaks Out

    I sometimes claim the reason I like working on the VO Registry is that I am a librarian at heart. Perhaps there is some truth to that, in that ugly metadata does make me unhappy – but beyond that, it also makes the Virtual Observatory look or even work a good deal worse than it should.

    Given that, in this post I'm afraid I will sound more like a grumpy janitor than a wise librarian, but let me still attempt to contribute to better metadata by pointing out a few things to watch out for when writing a resource record. People consuming resource records (i.e., VO-using astronomers) are welcome here, too: when you encounter antipatterns mentioned here, a polite complaint to the service publisher is entirely a good thing.

    Note that I am using real metadata found in the registry – in case you recognise some of own records, do not feel reprimanded individually. Most of the problems I discuss here are really common at this point, and thus if I picked your metadata, that was mere bad luck. I actually picked some of my own occasionally (but duly fixed the problem then).

    Missing Coverage

    Since VODataService 1.2, you can say what part of the sky, spectrum, and time your resource covers. That is incredibly useful metadata in practice. Spatial coverage, for instance, is used in Aladin like this:

    Screenshot: Resource names in white, orange and green, and a part of the sky (h and χ Persei) next to them

    Green means: these services could have data for the patch of sky shown, orange means don't bother with these, and white means: No idea because the resource does not declare its coverage.

    Similarly, it would be great if researchers or clients could reliably say:

    SELECT * FROM rr.resource JOIN rr.stc_spectral WHERE
      1=ivo_interval_overlaps(spectral_start, spectral_end,
          ivo_specconv(658, 'nm', 'J'), ivo_specconv(654, 'nm', 'J'))
    

    to find resources having data covering the Hα line on the spectral axis. Currently, that's just 2064 resources, and given that Hα sits smack in the middle of the optical window that's an indication that far too few resources say where they are.

    So – add STC coverage to your data today. It's not hard with pymoc or pgsphere and chapter 3.2 of VODataService. DaCHS operators will probably get by just studying the corresponding section of the tutorial.

    Broken Author Names

    On the ADS, last time I had information on that, about 90% of the queries were by author. In the VO registry, by my unscientific estimate, less than 5% of queries constrain authors. Sure, people look for literature and data in different ways and for different purposes, but an important reason for the difference still is that we don't do a good job giving creator/name (which contains the equvialent of the author name).

    The ideal format is to have last name first, then a comma, and then abbreviated initials or full first names, as in von der Heide, J.. Many names in the VO are almost in this format do not have a comma; but the comma makes parsing these names a lot simpler, so please put it in. Of all the forms to write names in, that's most easily constrained without guessing how many first names are where. Remember, there are people out their with names like „Kirsten-Claude Selim de Vaucouleurs-van der Heide Lobos“ (or, for that matter, Uthamadhanapuram Venkatasubbaiyer Swaminatha Iyer), and a computer cannot efficiently decide where the last name starts in first name first order (and conversely, without the comma in last name first order, it has a hard time figuring out where the last name stops). Also, last name first almost always gives a more useful natural sort order.

    Realistically, people will have to live with J. von der Heide, too, so author searches in the VO will have to look like LIKE '%von der Heide%' for some years to come, but let's at least try to improve. And whatever you do, don't do any of (in approximate order of severity):

    • Dump in half an acknowledgement, e.g., under a cooperative agreement with the NSF on behalf of the Gemini partnership: the National Science Foundation (United States), or, about as bad: provided by S. Snowden from data by Dickey and Lockman – that's useless for author searches but invites lots of false positives
    • Dump more than one name into one creator/name element, e.g., Zhuang Z.,Kirby E.N.,Leethochawalit N.,de los Reyes M.A.C. or Voges, W.; Aschenbach, B.; Boller, Th.; (and ~200 more characters) – that's really hard to search and essentially impossible to use for, e.g., author datagraphies
    • Include affiliations (the VO can't properly deal with those yet), e.g., Zub M. (The PLANET Collaboration) or a combination of this and the previous: Zhu W. (The Spitzer team) Dominik M.
    • Forget citation debris, e.g., et al. MNRAS (in press), or, shockingly common: and Scheck M.; of course, entire citations (WALKER I. Astron. J. 106) are inappropriate, too – all of this will prevent the use of meaningful name constraints
    • Give a bibcode: 2014ApJ...787...78M – this likely belongs into content/source
    • Have empty author name elements (as, at this moment, 13 records)
    • Cheat with effectively empty author names: <NOT GIVEN>, or "We forgot to give credit, please complain"
    • Go all uppercase, e.g., ZINNECKER H. – standards-compliant ADQL string comparisons are case-sensitive, and case-folding would require special indexes. Perhaps case-insensitive author matches should be made easier in that van der Waals is probably the same person as Van der Waals, but for now that's not how it works right now. And I don't think that will change any time soon, because if I have learned one thing in my life it is that case insensitivity is almost always evil
    • Have just a first name: walter or W.I. or W-J
    • Combine author lists from different contributing papers: Wright et al.; Griffith, Wright, Burke, Ekers; Griffith, Wright – if you really need to do something like this, merge the two author lists – and then of course use one name per creator element

    In principle, these considerations would apply to contributors, contacts and perhaps publishers, too, but since I don't think people should use these in discovery queries, their format does not matter too much: If they're human-readable, that's enough.

    Fragile Contact Info

    Quite regularly I need to ask people to fix something in their publishing registries, and then it's really useful to have reliable contact information. That's also nice for VO users; pyVO, for instance, has the get_contact method on registry records, and in WIRR, you can pop up contact info on all records:

    a screenshot showing a match in a registry query.  A subwindow is popped up that shows a mail address and a telephone number of a “GAVO Data Center Team“.

    For that to work, personal addresses in the contact information are really dangerous – it is my experience that these break significantly more often than institutional addresses. So, please avoid things like (I'm making all of these up because there may still be folks around harvesting mail addresses to send spam):

    • a.b.miller-parachtnix@gmail.com (well: avoid using gmail.com unconditionally)
    • friederike.student@ari.uni-heidelberg.de

    Rather, create an alias that you can hand on and that perhaps is even a bit speaking. This could be:

    • vo-help@great-telescope.org
    • gavo@ari.uni-heidelberg.de
    • uni-hd-vo@posteo.de (in case your own institution absolutely loathes the idea of addresses not bound to persons)

    Non-machine-readable Subjects

    VOResource 1.1 said that subjects are to be taken “from the UAT” (that's the Unified Astronomy Thesaurus), but failed to say what exactly that means. Since last July, this is properly defined: Use fragment identifiers into http://www.ivoa.net/rdf/uat, that is, something like abell-clusters.

    Having all subject keywords in a predictable format, with useful metadata, and part of a proper hierarchy enables all kinds of cool stuff, and hence it would be great if we could stomp out the following sorts of mispractice in the VO:

    • Multiple things in one subject element: ATLAS DR1, SIAP, Images – have one term per subject element
    • Undefined NULL values: NOT PROVIDED, ??? – if you really cannot find a pertinent term, use astronomical-research (or one of the other top-level terms). If nowhere else, that at least helps when your record moves to interdisciplinary search engines
    • Random free text: optical lines equivalent width catalog – that's multiple terms rolled into one, and the machine will not know what it means
    • Project or instrument names: 6dF Data Release 3 Spectra, COROT N2 – there's the instrument metadata for some uses of that. For the rest, see above on having projects in creator/name.
    • Protocol names: TAP – that's what capabilities are for
    • Service titles: CADC image/cube HiPS service – that's what the title element is for
    • Non-UAT keyword schemes: Galaxy:general – let's not force VO components to learn about multiple keyword systems. If you are missing something from the UAT, tell them about it

    Unfulfilling Resource Descriptions

    Descriptions of VO resources serve a dual purpose: The should give researches a quick idea of what to expect and not expect of a resource, and they should mention all the important buzzwords for the benefit of full-text searches. Hence, if you only have two words as in:

    Survey (LoLSS).

    or have something like a title:

    Convolution of normalized synthetic stellar spectra.

    or use somewhat uncommon abbreviations and technical details that probably will not help much during data discovery:

    USET Group form

    (what group? Does „form“ really mean „web browser-facing“? If so, that's again better expressed through the capabilities), you should work a bit on your description.

    It is usually helpful to start the description with „this service is…“ or something similar. While it's marginally ok to mention terms and conditions like:

    When referencing results from this online catalog, please cite &lt;a href="https://iopscience.iop.org/article/10.384

    further down in the description (the proper place for this kind of thing is the rights element, though), don't discuss stuff like this before you have told people what is in the “online catalog” in the first place. Also: registry records are like e-mail in that you shouldn't use HTML anywhere in registry metadata. If you have to include URLs in text for human consumption, just put them in as text.

    Talking about markup: You cannot rely on any of that in descriptions. Even basic ASCII art (or, well, tables) will always come out bad:

    Only the data from the first catalog that was matched is reported here according to the following priority list. This means for example, if a star was matched with Hipparcos, that information was used while possible other catalog data are not listed here. -------------------------------------------------------- # stars flg catalog -------------------------------------------------------- 53500 0 no catalog match 55549 1 Hipparcos 254 2 Yale Parallax Catalog 1041 3 Finch and Zacharias 2016 (UPM NNNN-NNNN) 1431 4 MEarth parallaxes 402 5 SIMBAD Database (w/parallax) -------------------------------------------------------- 112177 total number stars in catalog -------------------------------------------------------- Not all parallaxes from the...

    (of course, that in this case the newlines and longer sequences of blanks have been normalised to single blanks already in the original resource record makes it particularly certain that the table will come out wrong).

    And where in titles abbreviations are usually a good thing, in particular when you can expect your target audience too look for the abbreviation rather spelled-out names in discovery queries, in descriptions you have space, and hence you normally should explain MCQA as „Monte Carlo Quality Assessment“ in something like the following:

    Herschel sources in Planck fields measured at 350 µm MCQA

    Remember: The people who read your descriptions may come from the future (as in: 25 years from now) or at least may be unfamilar with your project's jargon.

    Lame Relationships

    There are an incredible 136958 relationships in the current VO that have related-to as their relationship type. This is deplorable because the relevant vocabulary, https://www.ivoa.net/rdf/voresource/relationship_type, marks it as deprecated, and that's for a good reason: Just stating “some relationship“ between two resources is rarely useful. Decide what the relationship is and then pick a proper term (or, if that does not exist, prepare a VEP).

    Missing Tablesets

    Tablesets are a VODataService feature giving metadata on the return table (or, in the case of the flexible TAP services, the queried tables). They are really useful if you look for services returning some sort of physics – and if you are running TAP services, they will one day let me shut down the GloTS service that replicates a good deal of registry functionality for no good reason at all.

    So, if you have a catalog service and your registry record ends somewhat like this:

      </capability>
    </ri:Resource>
    

    it is almost certainly missing a tableset (which would normally go after the capabilities; you are probably also missing the sky coverage, though, because that would sit there, too).

    Writing basic tablesets is not hard. In fact, if you are running a TAP service, you have a working tableset on your service's tables endpoint. But even without VOSI tables, making a tableset from the VOTable you return is straightforward – with a few encouraging words, I could be talked to write a few lines of Python that do that.

    I will readily admit that writing good tablesets is more involved, but what is hard about it you should be doing anyway, because it also will improve the VOTables that you write, and hence the usability of your data all around. So, until the end of this post let me look at some common warts of the column metadata in today's VO.

    Deficient Column Descriptions

    Column descriptions like ?, ??, or even ??? are surprisingly common. Please don't do that. If you really have no idea what your upstream has put into a column, admit that, aplogise and try to make your upstream explain.

    And while RA somewhat works among astronomers, a word or two on the reference system (“IRCS”) and an informal provenance (“from PSF fits”) would certainly be much appreciated by your users and might even come handy in discovery.

    Or consider “Age” – this could immediately be improved by revealing just what has aged here and, again, some hint on how the age was estimated (e.g., “obtained from ivo://foo.bar/res” versus “by isochrone fitting”).

    But don't overdo it, either: Do not include entire footnotes in descriptions, because that will lead to many false positives in full text searches (not to mention slow down the Registry as a whole if this became common practice). DaCHS operators: you can have footnotes in your RD by using note meta items; cf. Typed Meta Elements in the DaCHS reference.

    Near the upper limit of what is appropriate in a column description is perhaps something like this:

    The 2.5 percentile of the Log total SFR PDF. This is derived by combining emission line measurements from within the fibre where possible and aperture corrections are done by fitting models ala Gallazzi et al (2005), Salim et al (2007) to the photometry outside the fibre. For those objects where the emission lines within the fibre do not provide an estimate of the SFR, model fits were made to the integrated photometry.

    – but at the same time it illustrates how you can provide a lot of information that helps casual users.

    The position angles I will turn to in a second give another nice example of why human-readable descriptions are so important: There is no reliable convention of the direction and the baseline of these, so stating something like „north over east“ in a description will avoid a lot of head-scratching.

    Column UCDs: Missing, Outdated, or Useless

    A very plausible discovery scenario involves UCDs: „give me resources with (some photometry | redshifts | kinematics | dynamics | positions on earth)“. Hence, make sure your columns' metadata has predictable and halfway correct UCDs.

    Sure, that's not always straightforward (note, by the way, that there is a reasonably simple process to suggest new UCDs), but there's no excuse for there being 117 columns called pa without any UCD, where pos.posAng will almost certainly fit all of them (though, who knows: 30 of these in addition don't even have a description).

    To make sure the UCDs you assign exist, run them through astropy at least once. Do not ignore complaints by astropy; it is actually preferable to have no UCD rather than “??” (which currently a whopping 30342 column sport, in addition to which we have 41 times “???“ and 70 times “????“[1]). Also, resist the temptation to freely invent things, such as the “mjd” UCD I'm seeing on 13 columns. In this particular case, by the way, I give you that saying “this column contains MJDs“ has been a pain in VOTables for a long time, but since version 1.4, TIMESYS lets you do that in a reasonable way.

    Oh, let me qualify the “freely invent“ in the last paragraph: It could be[2] that MJD has actually been part of the original UCDs you may still know from cone search (“POS_EQ_RA”); that people have not updated their metadata from these ancient days is also the reason I'm still seeing 13827 columns with an (invalid) UCD of “error“ in column metadata (and 84 with pos_eq_dec).

    Unrelatedly (though with an undisputable entertainment value): the longest UCD in the current VO is meta.code;phot.flux.density;arith.ratio;em.ir.15-30um;em.radio.750-1500mhz; unless I and astropy are missing something, it's even syntactically correct.

    Bad Units

    While I do not see many discovery scenarios that would make good use of units, do not forget to update your units to VOUnits when you touch up your tablesets. This will let software like astropy do the unit calculus for its users, which is a win overall. It cannot do that if you ignore VOUnits and write, say, ABmag/arcsec2 – the AB part you will have to communicate in the description for now, and exponentiation is ** in VOUnits.

    Recent versions of the stilts validators (votlint, taplint) will complain about bad units. And you can use stilts interactively to figure out whether you got it right:

    $ stilts calc 'vounitStatus("ABmag/arcsec2")'
      BAD_SYNTAX
    $ stilts calc 'vounitStatus("mag/arcsec**2")'
      OK
    

    [In a previous version of this post, I have given a piece of astropy to do unit checking; it turns out that astropy by default is rather forgiving, and you want stilts on your box anyway; why not use it for unit validation? If your stilts says something about “bad expression“ with the command lines above, it's an indication that you should update it.]

    And with this somewhat non-registry topic: Go forth and polish your resource records. Or, as a consumer of such metadata, ask the publishers of bad resource metadata to fix it.

    [1]Remarkably, there are no ????? or even longer sequences of question marks, and even more remarkably, nobody has put in a lonely question mark. If someone versed in cognitive psychology has a plausible interpretation for that fact: can you share it with me?
    [2]Since the original UCDs predate my VO involvement and, for all I know, never were properly standardised, I frankly can't say.
  • Query the Registry with WIRR

    Search windows of VODesktop and WIRR

    Pixels from venerable VODesktop and WIRR: it's supposed to be about the same thing, except WIRR uses and exposes the latest Registry standards (and then some tech that's not standard yet).

    When the VO was young, there was a programme called VODesktop that had a very nice interface for searching the Registry. Also, it would run queries against the services discovered, giving nice all-VO querying that few modern clients do quite as elegantly. Regrettably, when the astrogrid UK project was de-funded, VODesktop's development ceased in 2010.

    In 2012, it had become clear that nobody would step up to continue it, and I wanted to at least provide a replacement for the Registry interface part. In consequence, Florian Rothmaier and I wrote the Web Interface to the Relational Registry, or WIRR for short; this lets you build Registry queries in your Web Browser in an interface inspired by VODesktop (which, I'm told, in turn was inspired by early iTunes).

    WIRR's sweet spot is between the Registry interfaces in the usual clients (TOPCAT, Aladin: these try to hide the gory details of where their service lists come from and hence are limited in what interaction they allow) and using a TAP client to write and execute RegTAP queries (where there are no limitations beyond the protocol's, but it's tedious unless you happen to know the RegTAP standard by heart).

    In contrast to its model VODesktop, WIRR cannot run any queries against the services discovered using it. But you can transfer the services you have found to clients via SAMP (TOPCAT can handle the relevant MTypes, but I'm frankly not sure what else). Apart from that, an obvious use for WIRR are the queries one needs in VO curation. For instance, I keep linking to it when sending people canned registry queries, as in the section on claiming an authority in the DaCHS Tutorial.

    Given that both Javascript and the Registry have evolved a lot in the past decade, WIRR was in need of a major redecoration for some time now, and in early July, I found some time to do it. The central result is that the code is now halfway modern, strict Javascript; let's see how many web browsers still run that can't execute this.

    On the surface, much less has changed, but there are some news I'd consider noteworthy and that might help your data discovery-fu:

    • Since I've added some constraint types, the constraint type selector is now a hierarchical box, sporting what I think are or should be the most common constraint types (full text, service type and UAT term) on level 0 and then having “Blind Discovery“, “Finer Grained“, and “Special Effects“ as pop-ups; all this so we obey Miller's Rule of Seven.
    • Rather than explain the constraints on a second, separate page, there are now brief help texts coming with each constaint.
    • You can now match against UAT concepts, and there is a completing input box for them; in case you're wondering what this is about, see this post from last February. And yes, next time I'll play with WIRR I'll probably include SemBaReBro here.
    • When constraining by column UCD, you can now choose from UCDs found in the registry (the “Pick one“ button).
    • You can now constrain by spatial, temporal, and spectral coverage, though that's still a gamble because not many (or, actually, very few in the case of temporal and spectral) operators care to declare their services' coverage. When they don't, you won't see their resources with such blind discovery constraints. For some background on this, check Space and Time not lost on the Registry on this blog.
    • There is now a „SQL“ button with successful searches that lets you retrieve the SQL executed for the particular constraint. While that query does not immediately execute on RegTAP services (it's Postgres' SQL rather than ADQL), it ought to give you a head start when transplanting your Registry query into, say, a pyVO-based script.
    • You can now use your browser's back and forward buttons (or, in my case. key bindings) to navigate in your query history.

    What this still doesn't do: Work without Javascript. That's a bit of a disgrace, since after the last changes it would actually be reasonable to provide non-javascript fallbacks for some of the basic functionality (of course, no SAMP at all then…). I'll do it the first time someone asks. Promised.

    A document that now needs at least slight updates because things have moved about a bit is the data discovery use case Florian wrote back then. The updates absolutely necessary are not terribly involved, but I would like to use the opportunity to add a bit more spice to the tutorial. If you have ideas: I'm all ears.

    Oh, and before I close: you can still run VODesktop; kudos to the maintainers of the JVM for that. But it's nevertheless not really usable any more, which perhaps isn't too surprising for a client built on top of experimental online services ten years ago. For one, its TAP client speaks pre-release versions of both TAP and ADQL, so those won't work on modern TAP services (and the ancient ones have vanished). Worse, it needed to use a non-standard extension of RegTAP's predecessor (for those old enough to remember: it used XQuery), and none of the modern searchable registries understands that any more.

    Which is a pity, really. It's been a fine programme. It just was a few years early: By 2012, everything it needed has been defined in nice, stable standards that are still around and probably will be for another decade at least.

  • GAVO at the Northern Spring Interop 2021

    As usual in May, the people making the Virtual Observatory happen meet for their Interoperability Conference, better known as the Interop – where “meet” still has to be taken with a generous helping of salt (more on this near the end of this post). As has become customary on this blog, let me briefly discuss contributions with a significant involvement of GAVO.

    A major thing from my perspective actually happened in the run-up: The IVOA executive committee (“Exec“) approved Version 2.0 of Vocabularies in the VO, a standard saying how hierarchical word lists (“vocabularies“) can be managed, disseminated, and consumed within the VO. Developing the main ideas from sufficiently restricting RDF to coming up with desise (which makes complicated things possible with surprisingly little code), and trying things out on our growing number of vocabularies took up quite a bit of my standards time in the last 20 months or so – and I'm fairly happy with the outcome, which I celebrated with a brief talk on programming with IVOA semantics during Wednesday morning's semantics session.

    In that session I gave a second, more discussion-oriented, talk, probing how to formalise data product types – which is surprisingly involved, even with the relatively straightforward use case “figure out a programme to handle the data“: What's a spectrum? Well, something that maps a spectral coordinate to... hm. Is it still a spectrum if there's multiple sorts values (perhaps flux, magnitude, and polarisation)? If we allow, in effect, tuples, why not whole images, which would make spectral cubes spectra – but of course few client programmes that deal with spectra do anything useful with cubes, so clearly such a definition would kill our use case. And what about slit spectra, mapping a spatial coordinat to spectra?

    All this of course is reminiscent of the classical problems of semantics: An elephant is a big animal with a trunk. But when an elephant loses its trunk in an accident: does it stop being an elephant? So, much of the art here is finding the sweet spot of usability between strict and formal semantics (that will never fit the real world) and just tossing around loosely defined strings (that will simply not be machine-readable). After the session, I came up with the 2021-05-26 draft of product-type. If you read this a few years down the road, it might be interesting to compare with what product-type is today. I'm curious myself.

    Later on Wednesday CET, I did a shameless plug for my Datalink-transforming XSLT (apologies for a github link, but I'm fishing for PRs here; if you use DaCHS, you'll get the updated stuff with version 2.4, due soon). The core of this dates back to the dawn of datalink, but with a new graphical cutout code and in particular vocabulary-based tree-ification of the result rows, I figured it's time to remind the operators of datalink services it's still out there for them to take up. Perhaps more than from the slides, you can see what I am after here by just trying the Datalink examples I've collected for this talk and comparing document source, the appearance without Javascript (pure XSLT) and the appearance with Javascript (I'm a bit ashamed I'm relying so heavily on it, but much of this really can only be done client-side).

    Quite a bit after midnight my time (still Thursday UTC), Mark Taylor talked about Software Identification, something I've been working on with him recently. It's is one of the things that is short and trivial but that, when unregulated, just doesn't work; in this case it's servers and clients saying what they are when they speak HTTP. I stumbled into the problem while trying to locate severely outdated DaCHS installations – so, I a way I put effort into the Note Mark was talking about (and which I have just uploaded to the IVOA Document Repository) as a sort of penance.

    While I was already asleep when Mark gave his talk, I was back at the Interop Friday morning CEST, when Hendrik Heinl talked about the LOFAR TAP service (which, I'm proud to say, runs on top of DaCHS); this was mainly live operations in TOPCAT (which is why there's no exciting slides), but Hendrik used a pyVO script doing cutouts in an (optical) mosaic of the Fornax cluster built on top of – and that's the main point – Datalink and SODA. Working this out with Hendrik made me realise the documentation of Datalink in pyVO really needs… love. Or, better, work.

    Later on Friday, there was the Registry session, where I gave brief (and somewhat cramped) talks on advanced column metadata (which is intended to one day let you query the registry for things like “roughly complete to 18 mag” or “having objects out to redshift 4“) and how to put VODataService 1.2 coverage into RegTAP – I expect you'll read more on both topics on this blog as they mature to a level at which this can leave the Registry nerd circles.

    And now, about 10 pm on Friday, the meeting is slowly winding down; beyond all the talks (which were, regrettably for a free software spirit like me, on zoom), the real bonus was that there was a gather.town attached to the conference. Now, that's a closed, proprietary, non-self-hostable platform, too, and so I have all reason to grumble. But: for the first time since February 2020 it felt like a conference, with the most useful action happening outside of the lecture halls, from trying to reach consensus on VEP-006 to teaching DaCHS datalink service declaration to learning about working with visibilities coming from VLBI (where it's even more difficult than it is with the big antenna arrays). So… this one time I've made my peace with proprietary platforms.

    A propos of “say no to platforms“ (in this case, slack): Due to the recent troubles with freenode, in addition to the Interop last week saw the the GAVO IRC channel move to libera.chat (where it's still #gavo). So, for instant messaging us now that the Interop is (in effect) over: Come there.

  • Semantics, Cross-Discipline Discovery, and Down-To-Earth Code

    Boxes-and-arrows view of the UAT

    A tiny piece of the Unified Astronomy Thesaurus as viewed by Sembarebro – the IVOA logos sit on terms that have VO resoures on them.

    Sometimes people ask me (in particular when I'm wearing my hat as the current chair of the IVOA Semantics working group) “well, what's this semantics thing good for?“ There are many answers, but here's one that nicely meshes with my pet subject data discovery: You want hierarchical, agreed-upon word lists to bridge discipline gaps.

    This story starts with B2FIND, a cross-disciplinary metadata aggregator for science data run within the framework of the European Open Science Cloud (EOSC). GAVO (or, more precisely, Heidelberg University's Astronomy) is involved in the EOSC via the ESCAPE project, and so I have had the pleasure of interacting with B2FIND for a while now. In particular, they are harvesting the metadata records of the Virtual Observatory Registry from us.

    This of course requires a bit of mapping, because the VO's metadata formats (VOResource, VODataService, and several extensions; see 2014A&C.....7..101D to learn more) are far too fine-grained for the wider scientific public. Not even our good friends from high-energy physics would appreciate being served links to, say, TAP endpoints (yet!). So, on our end we're mapping to the Datacite metadata kernel, which from VOResource is just a piece of XSL away (plus some perhaps debatable conventions).

    But there's more to this mapping, such as vocabularies of subject keywords. You might argue that in the age of rapid full text searches, keywords are dead. I would beg to disagree. For example, with good, hierarchical keyword systems you can, among many other useful things, offer topical browsing of metadata repositories. While it might not quite qualify as “useful” yet, the SemBaReBro registry browser I've hacked together late last year would be an example for such facilities – and might become part of our WIRR Registry searching tool one day.

    On the topic of subject keywords VOResource says that resources in the VO should be using the Unified Astronomy Thesaurus, specifically in its IVOA incarnation (not quite true yet, but true enough by blog standards). While few do, I've done a mapping of existing keywords in the VO to UAT concepts, which is what's behind SemBaReBro. So: most VO resources now have UAT concepts.

    However, these include concepts like AM Canum Venaticorum Stars, which outside of rather specialised circles of astronomers few people will ever have heard about (which, don't get me wrong, I personally regret – they're funky star systems). Hence, B2FIND does not bother with those.

    When we discussed the subject mapping for B2FIND, we thought using the UAT's top-level concepts might be a good start. However, at that point no VO resources at all actually used these, and, indeed, within astronomy that generally wouldn't make a lot of sense, because they are to unspecific to help much within the discipline. I postponed and then forgot about the problem – when the keywords of the resources weren't even from UAT, solving the granularity mismatch just wasn't humanly possible.

    That was the state of affairs until last Tuesday, when I had a mumble session with B2FIND folks and the topic came up again. And now, thanks partly to the new desise format proposed in the current Vocabularies in the VO 2 draft, things fell nicely into place: Hey, I have UAT concepts, and mapping these to the top-level terms isn't hard either any more.

    So, B2FIND gets the toplevel keywords they've been expecting all the time starting today. Yes: This isn't a panacea suddenly solving all the problems of cross-discipline data discovery, not the least because it's harder than one might think to imagine how such a thing would look like in practice. But given the complexities involved I was positively surprised how easy this particular part of the equation was.

    From here on, there's a bit of tech babble I intend to re-use in the RFC of Vocabularies in the VO 2; don't feel bad if you skip it.

    The first step was to make the mapping from UAT terms to the toplevel terms. The interesting part of the source I'm linking to here is:

    def get_roots_for(term, uat_terms):
      roots, seen = set(), set()
    
      def follow(t):
        wider = uat_terms[t]["wider"]
        if not wider:
          if not t in ROOT_TERMS:
            raise Exception(
              f"{t} found as a top-level term")
          roots.add(t)
        else:
          seen.add(t)
          for wider in uat_terms[t]["wider"]:
            follow(wider)
    
      follow(term)
      return roots
    

    There, uat_terms is essentially just a json-decode of what you get from the vocabulary URI if you ask for desise (see the draft spec linked to above for the technicalities). That's really it, and it even defends against cycles in the concept graph (which are legal by SKOS but shouldn't happen in the UAT) and detached terms (i.e., ones that are not rooted in the top-level terms). For what it does, I claim that's remarkably compact code.

    Once I had that, I needed to get the UAT-mapped subject keywords for the records I'm serving to datacite and fiddle the corresponding roots back in. That's technically a bit more involved because I am producing the datacite records on the fly from the XML representation for VOResource records that I keep in the database, and there's a bit of namespace magic involved (full code). Plus, the UAT-mapped keywords are only kept in the database, not in the metadata records.

    Still, the core operation here is relatively straightforward. Consider:

    def addUATToplevels(dataciteTree):
      # dataciteTree is an (lxml) ElementTree for the
      # result of the XSL transformation.  That's all
      # I have, and thus I first have to fiddle out
      # the identifier we are talking about
      ivoid =  dataciteTree.xpath(
          "//d:alternateIdentifier["
          "@alternateIdentifierType='ivoid']",
          namespaces={"d": DATACITE_NS}
        )[0].text.lower()
      # The .lower() is necessary because ivoids
      # unfortunately are case-insensitive, and RegTAP
      # normalises them to lowercase to retain sanity.
    
      # Now pull the UAT-mapped subject keywords from
      # our RegTAP extension (getTableConn is
      # DaCHS-internal API, but there's no magic in
      # there, it's just connection pooling with
      # guarantees against connections  idle in
      # transaction).
      with base.getTableConn() as conn:
        subjects = set(r[0] for r in
          conn.query("SELECT uat_concept"
            " FROM rr.subject_uat"
            " WHERE ivoid=%(ivoid)s", locals()))
    
      # This is the mapping itself: we do
      # roots-subjects to avoid adding
      # root terms that are already in
      # the record itself.  UAT_TOPLEVELS is the result
      # of the root finding discussed above.
      for term in subjects:
        root = UAT_TOPLEVELS[term]
        newRoots |= (root-subjects)
    
      # And finally fiddle in any new root terms found
      # into the datacite tree
      if newRoots:
        subjects = dataciteTree.xpath(
          "//d:subjects",
          namespaces={"d": DATACITE_NS})[0]
        for root in newRoots:
          newSubject = etree.SubElement(subjects,
            f"{{{DATACITE_NS}}}subject")
          newSubject.text = root
    

    Apart from the technicalities I'd again say that's pretty satisfying code.

    And these two pieces of code are really all I had to do to map between the vocabularies of different granularities – which I claim will probably be the norm as metadata flows between disciplines.

    It's great to see the pieces of a fairly comples puzzle fall into place like that.

  • Crazy Shapes in TAP

    OpenNGC shapes

    A complex shape from OpenNGC: MOCs need not be convex, or simply connected, or anything.

    So far when you did spherical geometry in ADQL, you had points, circles, and polygons as data types, and you could test for intersection and containment as operations. This feature set is a bit unsatisfying because there are no (algebraic) groups in this picture: When you join or intersect two circles, the result only is a circle if one contains the other. With non-intersecting polygons, you will again not have a (simply connected) spherical polygon in the end.

    Enter MOCs (which I've mentioned a few times before on this blog): these are essentially arbitrary shapes on the sky, in practice represented through lists of pixels, cleverly done so they can be sufficiently precise and rather compact at the same time. While MOCs are powerful and surprisingly simple in practice, ADQL doesn't know about them so far, which limits quite a bit what you can do with them. Well, DaCHS would serve them since about 1.3 if you managed to push them into the database, but there were no operations you could do on them.

    Thanks to work done by credativ (who were really nice to work with), funded with some money we had left from our previous e-inf-astro project (BMBF FKZ 05A17VH2) on the pgsphere database extension, this has now changed. At least on the GAVO data center, MOCs are now essentially first-class citizens that you can create, join, and intersect within ADQL, and you can retrieve the results. All operators of DaCHS services are just a few updates away from being able to offer the same.

    So, what can you do? To follow what's below, get a sufficiently new TOPCAT (4.7 will do) and open its TAP client on http://dc.g-vo.org/tap (a.k.a. GAVO DC TAP).

    Basic MOC Operations in TAP

    First, let's make sure you can plot MOCs; run

    SELECT name, deepest_shape
    FROM openngc.shapes
    

    Then do Graphics/Sky Plot, and in the window that pops up then, Layers/Add Area Control. Then select your new table in the Position tab, and finally choose deepest_shape as area (yeah, this could become a bit more automatic and probably will over time). You will then see the footprints of a few NGC objects (OpenNGC's author Mattia Verga hasn't done all yet; he certainly welcomes help on OpenNGC's version control repo), and you can move around in the plot, yielding perhaps something like Fig. 1.

    Now let's color these shapes by object class. If you look, openngc.data has an obj_type column – let's group on it:

    SELECT
      obj_type,
      shape,
      AREA(shape) AS ar
    FROM (
      SELECT obj_type, SUM(deepest_shape) AS shape
      FROM openngc.shapes
      NATURAL JOIN openngc.data
      GROUP BY obj_type) AS q
    

    (the extra subquery is a workaround necessary because the area function wants a geometry or a column reference, and ADQL doesn't allow aggregate functions – like sum – as either of these).

    In the result you will see that so far, contours for about 40 square degrees of star clusters with nebulae have been put in, but only 0.003 square degrees of stellar associations. And you can now plot by the areas covered by the various sorts of objects; in Fig. 2, I've used Subsets/Classify by Column in TOPCAT's Row Subsets to have colours indicate the different object types – a great workaround when one deals with categorial variables in TOPCAT.

    MOCs and JOINs

    Another table that already has MOCs in them is rr.stc_spatial, which has the coverage of VO resources (and is the deeper reason I've been pushing improved MOC support in pgsphere – background); this isn't available for all resources yet , but at least there are about 16000 in already. For instance, here's how to get the coverage of resources talking about planetary nebulae:

    SELECT ivoid, res_title, coverage
    FROM rr.subject_uat
      NATURAL JOIN rr.stc_spatial
      NATURAL JOIN rr.resource
    WHERE uat_concept='planetary-nebulae'
      AND AREA(coverage)<20
    

    (the rr.subject_uat table is a local extension to RegTAP that will be the subject of some future blog post; you could also use rr.res_subject, but because people still use wildly different keyword schemes – if any –, that wouldn't be as much fun). When plotted, that's the left side of Fig. 3. If you do that yourself, you will notice that the resolution here is about one degree, which is a special property of the sort of MOCs I am proposing for the Registry: They are of order 6. Resolution in MOC goes up with order, doubling with every step. Thus MOCs of order 7 have a resolution of about half a degree, MOCs of order 5 a resolution of about two degrees.

    One possible next step is fetch the intersection of each of these coverages with, say, the DFBS (cf. the post on Byurakan spectra). That would look like this:

    SELECT
      ivoid,
      res_title,
      gavo_mocintersect(coverage, dfbscoverage) as ovrlp
    FROM (
      SELECT ivoid, res_title, coverage
      FROM rr.subject_uat
      NATURAL JOIN rr.stc_spatial
      NATURAL JOIN rr.resource
      WHERE uat_concept='planetary-nebulae'
      AND AREA(coverage)<20) AS others
    CROSS JOIN (
      SELECT coverage AS dfbscoverage
      FROM rr.stc_spatial
      WHERE ivoid='ivo://org.gavo.dc/dfbsspec/q/spectra') AS dfbs
    

    (the DFBS' identifier I got with a quick query on WIRR). This uses the gavo_mocintersect user defined function (UDF), which takes two MOCs and returns a MOC of their common pixels. Which is another important part why MOCs are so cool: together with union and intersection, they form groups. It should not come as a surprise that there is also a gavo_mocunion UDF. The sum aggregate function we've used in our grouping above is (conceptually) built on that.

    Planetary Nebula footprint and plate matches

    Fig. 3: Left: The common footprint of VO resources declaring a subject of planetary-nebula (and declaring a footprint). Right bottom: Heidelberg plates intersecting this, and, in blue, level-6 intersections. Above this, an enlarged detail from this plot.

    You can also convert polygons and circles to MOCs using the (still DaCHS-only) MOC constructor. For instance, you could compute the coverage of all resources dealing with planetary nebulae, filtering against obviously over-eager ones by limiting the total area, and then match that against the coverages of images in, say, the Königstuhl plate achives HDAP. Watch this:

    SELECT
      im.*,
      gavo_mocintersect(MOC(6, im.coverage), pn_coverage) as ovrlp
    FROM (
      SELECT SUM(coverage) AS pn_coverage
      FROM rr.subject_uat
      NATURAL JOIN rr.stc_spatial
      WHERE uat_concept='planetary-nebulae'
      AND AREA(coverage)<20) AS c
    JOIN lsw.plates AS im
    ON 1=INTERSECTS(pn_coverage, MOC(6, coverage))
    

    – so, the MOC(order, geo) function should give you a MOC for other geometries. There are limits to this right now because of limitations of the underlying MOC library; in particular, non-convex polygons are not supported right now, and there are precision issue. We hope this will be rectified soon-ish when we base pgsphere's MOC operations on the CDS HEALPix library. Anyway, the result of this is plotted on the right of Fig. 3.

    Open Ends

    In case you have MOCs from the outside, you can also construct MOCs from literals, which happen to be the ASCII MOCs from the standard. This could look like this:

    SELECT TOP 1
      MOC('4/30-33 38 52 7/324-934') AS ar
    FROM tap_schema.tables
    

    For now, you cannot combine MOCs in CONTAINS and INTERSECTS expressions directly; this is mainly because in such an operation, the machine as to decide on the order of the MOC the other geometries are converted to (and computing the predicates between geometry and MOC directly is really painful). This means that if you have a local table with MOCs in a column cmoc that you want to compare against a polygon-valued column coverage in a remote table like this:

    SELECT db.* FROM
      lsw.plates AS db
      JOIN tap_upload.t6
    ON 1=CONTAINS(coverage, cmoc) -- fails!
    

    you will receive a rather scary message of the type “operator does not exist: spoly <@ smoc”. To fix it (until we've worked out how to reasonably let the computer do that), explicitly convert the polygon:

    SELECT db.* FROM
      lsw.plates AS db
      JOIN tap_upload.t6
    ON 1=CONTAINS(MOC(7, coverage), cmoc)
    

    (be stingy when choosing the order here – MOCs that already exist are fast, but making them at high order is expensive).

    Having said all that: what I've written here is bleeding-edge, and it is not standardised yet. I'd wager, though, that we will see MOCs in ADQL relatively soon, and that what we will see will not be too far from this experiment. Well: Some rough edges, I'd hope, will still be smoothed out.

    Getting This on Your Own DaCHS Installation

    If you are running a DaCHS installation, you can contribute to takeup (and if not, you can stop reading here). To do that, you need to upgrade to DaCHS's latest beta (anything newer than 2.1.4 will do) to have the ADQL extension, and, even more importantly, you need to install the postgresql-postgres package from our release repository (that's version 1.1.4 or newer; in a few weeks, getting it from Debian testing would work as well).

    You will probably not get that automatically, because if you followed our normal installation instructions, you will have a package called postgresql-11-pgsphere installed (apologies for this chaos; as ususal, every single step made sense). The upshot is that with our release repo added, sudo apt install postgresql-pgsphere should give you the new code.

    That's not quite enough, though, because you also need to acquaint the database with the new functions. This can only be done with database administrator privileges, which DaCHS by design does not possess. What DaCHS can do is figure out the commands to do that when it is called as dachs upgrade -e. Have a look at the output, and if you are satisfied it is about what to expect, just pipe it into psql as a superuser; in the default installation, dachsroot would be sufficiently privileged. That is:

    dachs upgrade -e | psql gavo   # as dachsroot
    

    If running:

    select top 1 gavo_mocunion(moc('1/3'), moc('2/9'))
    from tap_schema.tables
    

    through your TAP endpoint returns '1/3 2/9', then all is fine. For entertainment, you might also make sure that gavo_mocintersect(moc('1/3'), moc('2/13')) is 2/13 as expected, and that if you intersect with 2/3 you get back an empty string.

    So – let's bring MOCs to ADQL!

Page 1 / 3 »