Artikel mit Tag Data discovery:

  • DaCHS 2.4 is out: Blind discovery, pretty datalink, and more

    DaCHS screenshots and logo

    DaCHS 2.4: automatic ranges (with registry support!), pretty datalink (with vocabulary support!). And then the usual bunch of improvements (hopefully!).

    I have released DaCHS 2.4 today, and as usual for stable releases, I would like to have something like a commented changelog here so DaCHS deployers perhaps look forward to upgrading – which would be good, because there are far too many outdated DaCHSes out there.

    Among the more notable changes in version 2.4 are:

    Blind discovery overhaul. If you've been following my requests to include coverage metadata three years ago, you have probably felt that the way DaCHS started to hack your RDs to include the metadata it had obtained from the data was a bit odd. Well, it was. DaCHS no longer does that when running dachs limits. While you can still do manual overrides, all the statistics gathered by DaCHS is now kept in the database and injected into the DaCHS' internal idea of your RDs at loading time.

    I have not only changed this because the old way really sucked; it was also necessary because I wanted to have per-column metadata routinely, and since in advanced DaCHS there often are no XML literals for columns (because of active tags), there wouldn't be a place to keep information like what a column is minimally, maximally, in median, or as a “2σ range“ within the RD itself. A longer treatment of where this is going is given in the IVOA note Blind Discovery 2: Advanced Column Statistics that Grégory and I have recently uploaded.

    For you, it's easy: Just run dachs limits q once you're happy with your data, or perhaps once a month for living data, and leave the rest to DaCHS. A fringe benefit: in browser froms, there are now value ranges of the various numeric constraints as placeholders (that's the screenshot on the left in the title picture).

    There is a slight downside: As part of this overhaul, DaCHS is now computing the coverage of SIAP and SSAP services based on the footprints of the products as MOCs. While that gives much more precise service footprints, it only works with bleeding-edge pgsphere as delivered in Debian bullseye – or from our Debian repository. If you want to build this from source, you need to get credativ's pgsphere fork for now.

    Generate column elements: If you have tables with many columns, even just lexically entering the <column> elements becomes straining. That is particularly annoying if there already is a halfway machine-readable representation of that data.

    To alleviate that, very early in the development of DaCHS, I had the gavo mkrd subcommand that you could feed FITS images or VOTables to get template RDs. For a number of reasons, that never worked well enough to make me like or advertise it, and I eventually ended up writing dachs start instead, which is something I like and advertise for general usage.

    However, what that doesn't do is come up with the column declarations. To make good on this, there is now a dachs gencol command that will, from a FITS binary table, a VOTable, or a VizieR-style byte-by-byte description, generate columns with as much metadata as it can fathom. Paste that into the output of dachs start, and, depending on your input format, you should have a quick start on a fairly full-featured data collection (also note there's dachs adm suggestucds for another command that may help quickly generate rich metadata).

    This currently doesn't work for products (i.e., tables of spectra, images, and the like); at least for FITS arrays, I suppose turning their non-obvious header cards into columns might save some work. Let's see: your feedback is welcome.

    Refurbished Datalink XSLT: Since the dawn of datalink, DaCHS has delivered Datalink documents with XSLT stylesheets in order to have nicely formatted pages rather than wild XML when web browsers chance on datalink documents. I have overhauled the Javascript part of this (which, I have to admit, is what makes it pretty). For one, the spatial cutout now works again, and it's modeless (no clicking “edit“ any more before you can drag cutout vertices). I'm also using the datalink/core vocabulary to furnish link groups with proper titles and descriptions, and to have them sorted in in a proper result tree. I've talked about it at the interop, and I've prepared a showcase of various datalink documents in the Heidelberg data centre.

    Update to DaCHS 2.4 and you'll get the same thing for your datalinks.

    Non-product datalinks: When writing a datalink service, you have to first come up with a descriptor generator. DaCHS will provide a simple one for you (or perhaps a bit more complex ones for FITS images or spectra) – but all of these assume that whatever the datalink ID parameter references is in DaCHS' product table. It turned out that in many interesting cases – for instance, attaching time series to object catalogues – that is not the case, and then you had to write rather obscure code to keep DaCHS from poking around in the product table.

    No longer: There is now the //datalink#fromtable descriptor generator. Just fill in which column contains the identifier and the name of the table containing that column and you're (basically) done. Your descriptor will then have a metadata attribute containing the relevant row – along with everything else DaCHS expects from a datalink descriptor.

    gavo_specconv: That's a longer story covered previously on this blog.

    Index declaration in views: Saying on which columns a database index exists allows users to write smart queries, and DaCHS uses such information internally when rewriting geometrical expressions from ADQL to whatever is in use in the actual database. Hence, making sure these indexes are properly declared is important. But at the same time it's difficult for views, because postgres doesn't let you have indexes on views (for good reasons). Still, queries against views will (usually) use indexes of their underlying tables, and hence those should be declared in the corresponding metadata.

    This is tedious in general. DaCHS now helps you with the //procs#declare-indexes-from stream. Essentially, it will compare the columns in the view with the ones from the source tables and then guess which view columns correspond to indexed columns from the source tables; using that, it adds indexed flags to some view columns.

    If all this is too weird for you: Thanks to declare-indexes-from, the index declaration now automatically happens in the modern way to build SSAP services, the //ssap#view mixin. Hence, chances are you won't even see this particular STREAM but just notice its beneficial consequences.

    Sunsetting resources: I've been fiddling off and on with a smart way to pull resources I no longer want to maintain while still leaving a tombstone. I had to re-visit this problem recently because I dropped the Gaia DR1 table from my Heidelberg data centre. So, how do I explain to people why the thing that's been there no longer is?

    In general, this is a rather untractable problem; for instance, it's very hard to do something sensible with the TAP_SCHEMA entries or the VOSI tables endpoints for the tables that went away. Pure web pages, on the other hand, can be adorned with helpful info. To enable that, there is now the superseded meta item, which you define in the RD that once held the resources. For Gaia DR1, here's what I used:

    <meta name="superseded" format="rst">
      We do not publish Gaia DR1 data here any more.
      If you actually need DR1 data, refer to the
      full Gaia mirrors, for instance `the one at
      ARI`_.  Otherwise, please use more recent data
      releases, for instance `eDR3`_.
    
      .. _the one at ARI: http://gaia.ari.uni-heidelberg.de
      .. _eDR3: /browse/gaia/q3
    </meta>
    

    Root page template: I slightly streamlined the default root page template, in particular dropping the "i" and "Q" icons for going to the metadata and querying the service. If you have overridden the root template, you may want to see if you want to merge the changes.

    As usual, there are many more small repairs and additions, but most of these are either very minor or rather technical. One last thing, though: DaCHS now works with Python 3.8 (3.7 will continue to be supported for a few years at least, earlier 3.x never was), which is going to be the python3 in Debian bullseye. Bullseye itself will only have DaCHS 2.3 (with the Python 3.8 fixes backported), though. Once bullseye has become stable, we will look into putting DaCHS 2.4 into the backports.

  • Semantics, Cross-Discipline Discovery, and Down-To-Earth Code

    Boxes-and-arrows view of the UAT

    A tiny piece of the Unified Astronomy Thesaurus as viewed by Sembarebro – the IVOA logos sit on terms that have VO resoures on them.

    Sometimes people ask me (in particular when I'm wearing my hat as the current chair of the IVOA Semantics working group) “well, what's this semantics thing good for?“ There are many answers, but here's one that nicely meshes with my pet subject data discovery: You want hierarchical, agreed-upon word lists to bridge discipline gaps.

    This story starts with B2FIND, a cross-disciplinary metadata aggregator for science data run within the framework of the European Open Science Cloud (EOSC). GAVO (or, more precisely, Heidelberg University's Astronomy) is involved in the EOSC via the ESCAPE project, and so I have had the pleasure of interacting with B2FIND for a while now. In particular, they are harvesting the metadata records of the Virtual Observatory Registry from us.

    This of course requires a bit of mapping, because the VO's metadata formats (VOResource, VODataService, and several extensions; see 2014A&C.....7..101D to learn more) are far too fine-grained for the wider scientific public. Not even our good friends from high-energy physics would appreciate being served links to, say, TAP endpoints (yet!). So, on our end we're mapping to the Datacite metadata kernel, which from VOResource is just a piece of XSL away (plus some perhaps debatable conventions).

    But there's more to this mapping, such as vocabularies of subject keywords. You might argue that in the age of rapid full text searches, keywords are dead. I would beg to disagree. For example, with good, hierarchical keyword systems you can, among many other useful things, offer topical browsing of metadata repositories. While it might not quite qualify as “useful” yet, the SemBaReBro registry browser I've hacked together late last year would be an example for such facilities – and might become part of our WIRR Registry searching tool one day.

    On the topic of subject keywords VOResource says that resources in the VO should be using the Unified Astronomy Thesaurus, specifically in its IVOA incarnation (not quite true yet, but true enough by blog standards). While few do, I've done a mapping of existing keywords in the VO to UAT concepts, which is what's behind SemBaReBro. So: most VO resources now have UAT concepts.

    However, these include concepts like AM Canum Venaticorum Stars, which outside of rather specialised circles of astronomers few people will ever have heard about (which, don't get me wrong, I personally regret – they're funky star systems). Hence, B2FIND does not bother with those.

    When we discussed the subject mapping for B2FIND, we thought using the UAT's top-level concepts might be a good start. However, at that point no VO resources at all actually used these, and, indeed, within astronomy that generally wouldn't make a lot of sense, because they are to unspecific to help much within the discipline. I postponed and then forgot about the problem – when the keywords of the resources weren't even from UAT, solving the granularity mismatch just wasn't humanly possible.

    That was the state of affairs until last Tuesday, when I had a mumble session with B2FIND folks and the topic came up again. And now, thanks partly to the new desise format proposed in the current Vocabularies in the VO 2 draft, things fell nicely into place: Hey, I have UAT concepts, and mapping these to the top-level terms isn't hard either any more.

    So, B2FIND gets the toplevel keywords they've been expecting all the time starting today. Yes: This isn't a panacea suddenly solving all the problems of cross-discipline data discovery, not the least because it's harder than one might think to imagine how such a thing would look like in practice. But given the complexities involved I was positively surprised how easy this particular part of the equation was.

    From here on, there's a bit of tech babble I intend to re-use in the RFC of Vocabularies in the VO 2; don't feel bad if you skip it.

    The first step was to make the mapping from UAT terms to the toplevel terms. The interesting part of the source I'm linking to here is:

    def get_roots_for(term, uat_terms):
      roots, seen = set(), set()
    
      def follow(t):
        wider = uat_terms[t]["wider"]
        if not wider:
          if not t in ROOT_TERMS:
            raise Exception(
              f"{t} found as a top-level term")
          roots.add(t)
        else:
          seen.add(t)
          for wider in uat_terms[t]["wider"]:
            follow(wider)
    
      follow(term)
      return roots
    

    There, uat_terms is essentially just a json-decode of what you get from the vocabulary URI if you ask for desise (see the draft spec linked to above for the technicalities). That's really it, and it even defends against cycles in the concept graph (which are legal by SKOS but shouldn't happen in the UAT) and detached terms (i.e., ones that are not rooted in the top-level terms). For what it does, I claim that's remarkably compact code.

    Once I had that, I needed to get the UAT-mapped subject keywords for the records I'm serving to datacite and fiddle the corresponding roots back in. That's technically a bit more involved because I am producing the datacite records on the fly from the XML representation for VOResource records that I keep in the database, and there's a bit of namespace magic involved (full code). Plus, the UAT-mapped keywords are only kept in the database, not in the metadata records.

    Still, the core operation here is relatively straightforward. Consider:

    def addUATToplevels(dataciteTree):
      # dataciteTree is an (lxml) ElementTree for the
      # result of the XSL transformation.  That's all
      # I have, and thus I first have to fiddle out
      # the identifier we are talking about
      ivoid =  dataciteTree.xpath(
          "//d:alternateIdentifier["
          "@alternateIdentifierType='ivoid']",
          namespaces={"d": DATACITE_NS}
        )[0].text.lower()
      # The .lower() is necessary because ivoids
      # unfortunately are case-insensitive, and RegTAP
      # normalises them to lowercase to retain sanity.
    
      # Now pull the UAT-mapped subject keywords from
      # our RegTAP extension (getTableConn is
      # DaCHS-internal API, but there's no magic in
      # there, it's just connection pooling with
      # guarantees against connections  idle in
      # transaction).
      with base.getTableConn() as conn:
        subjects = set(r[0] for r in
          conn.query("SELECT uat_concept"
            " FROM rr.subject_uat"
            " WHERE ivoid=%(ivoid)s", locals()))
    
      # This is the mapping itself: we do
      # roots-subjects to avoid adding
      # root terms that are already in
      # the record itself.  UAT_TOPLEVELS is the result
      # of the root finding discussed above.
      for term in subjects:
        root = UAT_TOPLEVELS[term]
        newRoots |= (root-subjects)
    
      # And finally fiddle in any new root terms found
      # into the datacite tree
      if newRoots:
        subjects = dataciteTree.xpath(
          "//d:subjects",
          namespaces={"d": DATACITE_NS})[0]
        for root in newRoots:
          newSubject = etree.SubElement(subjects,
            f"{{{DATACITE_NS}}}subject")
          newSubject.text = root
    

    Apart from the technicalities I'd again say that's pretty satisfying code.

    And these two pieces of code are really all I had to do to map between the vocabularies of different granularities – which I claim will probably be the norm as metadata flows between disciplines.

    It's great to see the pieces of a fairly comples puzzle fall into place like that.

  • Small Telescopes, Large Surveys

    Image: Blink comparator and survey camera

    Plate technology at Bamberg observatory: a blink comparator with one plate mounted, and a survey camera that was once used at Boyden Station, an astronomer outpost in 60ies South Africa.

    I'm currently at the workshop “Large surveys with small telescopes: past, present, and future” (or Astroplate III for short) in Bamberg, where people are discussing using and re-using the rich heritage of historical observations (hence the “plate” part) as well growing that heritage in the age of large CCDs, fast computers and large disks.

    Using and re-using is of course what the Virtual Observatory is about, and we've been keeping fairly large plate collections in our data center for quite a while (among them the Archives of Landessternwarte Königstuhl or the Palomar-Leiden Trojan surveys, and there is the WFPDB TAP-accessibly). Therefore, people from GAVO Heidelberg have been to all past astroplate conferences.

    For this one, I brought a brand-new tutorial on plate scans in the VO, which, I hope, also works as a general introduction to image discovery in the VO using SIAP, Datalink, and Obscore. If you're doing image stuff now and then, please have a quick look at the thing – I am particularly grateful for hints on what to improve or perhaps particularly obvious use cases for the material discussed.

    Such VO proselytising aside, the conference is discussing the wide variety of creative, low-cost data collectors out there as well as computer-aided re-analysis extracting new knowledge from decades-old data. If I had to choose a single come-to-think-of-it moment, it would be Norbert Zacharias' observation that if you have a well-behaved object and you'd like to know where it was in 1900, it's now more accurate to extrapolate Gaia astrometry to the epoch of observation than to measure it on the plate itself. Which is saying a lot about the amazing feat of engineering that Gaia is.

    This is not, however, an argument for dumping the old data. Usually, it is exactly what is not so well-behaved (like those) that's interesting – both in terms of astrometry and in terms of photometry (for which there's a lot more unruly behaviour in the first place). To figure out how objects don't behave well, and, for objects disguising as well-behaved only on time scales of the (say) Gaia mission, which these are, the key is “old” data. The freshness of which we're discussing this week.

  • A Grey Eminence of a Standard

    [Screenshot: graphs and numbers]

    Examples for extra metadata: extended column descriptions on the web pages accompanying the ARI-Gaia TAP service.

    Last friday, I've uploaded a first working draft of VODataService 1.2 to the IVOA documents repository. That's the first major step in updating a standard, and it's an invitation to everyone to have a look and comment.

    Foof, you might say, what do I care? I've not even heard of that standard.

    Well, but you've probably used it. VODataService is (among several other things) the standard that governs how a TAP service tells clients (TOPCAT, say) what tables it has and what's inside of them. So, if you see in TOPCAT that there is a column named ang_error with a unit of deg, a UCD of stat.error;pos and the meaning “1 σ confidence radius of the position”, that most likely came in a document standardised by VODataService.

    The question of what (TAP) services can tell clients about their table set is one major open point: Do we want additional metadata there? This article's image, for inspiration, shows a screenshot of extended metadata Grégory delivers to browsers on his ARI-Gaia service; among this are minima, maxima, means, standard deviations, quartiles, and fill factors (i.e., how many of the columns are NULL). He even shows histograms of the values' distributions and HEALPix maps showing how (the means of) the values vary on the sky. Another example of extended metadata could be footnotes as you will find them on many of my resources' reference URLs (example; footnotes are, unsurprisingly, near the foot of that page).

    We could define interoperable means to communicate information like this. The question is: does the added value justify the complication in implementation? This is where it would be great if you weighed in, in particular if you are a “mere” TAP user: Are there any such pieces of metadata you've always wanted to see in your TAP interfaces? Oh, and metadata of course can also be added to tables rather than columns. The current draft already lets services communicate the number of rows in each table – is there more “simple”, table-specific metadata of this sort?

    VODataService furthermore deals with several other topics; for instance, the STC in the registry business I've blogged about in February is going to be standardised here (update on this: spectral coverage is no longer in wavelength but in energy). Other changes are rather more technical in nature, like several new resource types that will improve the discovery of tables and other such resources, or a careful adjustment of some features to keep them in line with TAP evolution.

    But don't let the technicalities scare you away – just have a peek, and if you have thoughts on any of the VODataService topics: I'm just a mail away.

  • Say hello to RegTAP

    image: WIRR in the browser

    GAVO's WIRR registry interface in action to find resources with radio parallaxes.

    RegTAP is one of those standards that a scientist will normally not see – it works in the background and makes, for instance, TOPCAT display the Cone Search services matching some key words. And it's behind the services like WIRR, our Web Interface to the Relational Registry (“Relational Registry” being the official name for RegTAP) that lets you do some interesting data discovery beyond what current clients support. In the screenshot above, for instance (try it yourself), I'm looking for cone search services having parallaxes presumably from radio observations. You could now transmit the services you've found to, say, TOPCAT or your own pyvo-based program to start querying them.

    The key point this query is the use of UCDs – these let services declare fairly unambiguously what kind of physics (if you take that word with a grain of salt) they are talking about. In the example, pos.parallax means, well, a parallax, and the percent character is a wildcard (coming not from UCDs, but from ADQL). That wildcard is a good idea here because without it we might miss things like pos.parallax;obs and pos.parallax;stat.fit that people might have used to distinguish “raw” and ”processed” estimates.

    UCDs are great for data discovery. Really.

    Sometimes, however, clicking around in menus just isn't good enough. That's when you want the full power of RegTAP and write your very own queries. The good news: If you know ADQL (and you should!), you're halfway there already.

    Here's one example of direct RegTAP use I came up with the other day. The use case was discovering data collections that give the effective temperatures of components of binary star systems.

    If you check the UCD list, that “physics” translates into data that has columns with UCDs of phys.temperature and meta.code.multip at the same time. To translate that into a RegTAP query, have a look at the tables that make up a RegTAP service: its ”schema”. Section 8 of the standard lists all the tables there are, and there's an ADASS poster that has an image of the schema with the more common columns illustrated. Oh, and if you're new to RegTAP, you're probably better off briefly studying the examples first to get a feeling for how RegTAP is supposed to work.

    You will find that a pair of ivoid – the VO's global resource identifier – and a per-resource table index uniquely identify a table within the entire registry. So, an ADQL query to pick out all tables containing temperatures and component identifiers would look like this:

    SELECT DISTINCT ivoid, table_index
    FROM
    rr.table_column AS t1
    JOIN rr.table_column AS t2
    USING (ivoid, table_index)
    WHERE t1.ucd='phys.temperature'
    AND t2.ucd='meta.code.multip'
    

    – the DISTINCT makes it so even tables that have lots of temperatures or codes only turn up once in our result set, and the somewhat odd self-join of the rr.table_column table with itself lets us say “make sure the two columns are actually in the same table”. Note that you could catch multi-table resources that define the components in one table and the temperatures in another by just joining on ivoid rather than ivoid and table_index.

    You can run this query on any RegTAP endpoint: GAVO operates a small network of mirrors behind http://reg.g-vo.org/tap, there's the ESAC one at http://registry.euro-vo.org/regtap/tap, and STScI runs one at http://vao.stsci.edu/RegTAP/TapService.aspx. Just use your usual TAP client.

    But granted, the result isn't terribly user-friendly: just identifiers and number. We'd at least like to see the names and descriptions of the tables so we know if the data is somehow relevant.

    RegTAP is designed so you can locate the columns you would like to retrieve or constrain and then just NATURAL JOIN everything together. The table_description and table_name columns are in rr.res_table, so all it takes to see them is to take the query above and join its result like this:

    SELECT table_name, table_description
    FROM rr.res_table
    NATURAL JOIN (
      SELECT DISTINCT ivoid, table_index
      FROM
      rr.table_column AS t1
      JOIN rr.table_column AS t2
      USING (ivoid, table_index)
      WHERE t1.ucd='phys.temperature'
      AND t2.ucd='meta.code.multip') as q
    

    If you try this, you'll see that we'd like to get the descriptions of the resources embedding the tables, too in order to get an idea what we can expect from a given data collection. And if we later want to find services exposing the tables (WIRR is nice for that – try the ivoid constraint –, but for this example all resources currently come from VizieR, so you can directly use VizieR's TAP service to interact with the tables), you want the ivoids. Easy: Just join rr.resource and pick columns from there:

    SELECT table_name, table_description, res_description, ivoid
    FROM rr.res_table
    NATURAL JOIN rr.resource
    NATURAL JOIN (
      SELECT DISTINCT ivoid, table_index
      FROM
      rr.table_column AS t1
      JOIN rr.table_column AS t2
      USING (ivoid, table_index)
      WHERE t1.ucd='phys.temperature'
      AND t2.ucd='meta.code.multip') as q
    

    If you've made it this far and know a bit of ADQL, you probably have all it really takes to solve really challenging data discovery problems – as far as Registry metadata reaches, that is, which currently does not include space-time coverage. But stay tuned, more on this soon.

    In case you're looking for a more systematic introduction into the world of the Registry and RegTAP, there are two... ouch. Can I really link to Elsevier papers? Well, here goes: 2014A&C.....7..101D (a.k.a. arXiv:1502.01186 on the Registry as such and 2015A%26C....11...91D (a.k.a. arXiv:1407.3083) mainly on RegTAP.

Page 1 / 1