GAVO Blog: Virtual Observatory Matters: Datalink

Posts with the Tag Datalink:

DaCHS 2.4 is out: Blind discovery, pretty datalink, and more

2021-06-09 Markus Demleitner

DaCHS 2.4: automatic ranges (with registry support!), pretty datalink (with vocabulary support!). And then the usual bunch of improvements (hopefully!).

I have released DaCHS 2.4 today, and as usual for stable releases, I would like to have something like a commented changelog here so DaCHS deployers perhaps look forward to upgrading – which would be good, because there are far too many outdated DaCHSes out there.

Among the more notable changes in version 2.4 are:

Blind discovery overhaul. If you've been following my requests to include coverage metadata three years ago, you have probably felt that the way DaCHS started to hack your RDs to include the metadata it had obtained from the data was a bit odd. Well, it was. DaCHS no longer does that when running dachs limits. While you can still do manual overrides, all the statistics gathered by DaCHS are now kept in the database and injected into DaCHS' internal idea of your RDs at loading time.

I have not only changed this because the old way really sucked; it was also necessary because I wanted to have per-column metadata routinely, and since in advanced DaCHS there often are no XML literals for columns (because of active tags), there wouldn't be a place to keep information like what a column is minimally, maximally, in median, or as a “2σ range“ within the RD itself. A longer treatment of where this is going is given in the IVOA note Blind Discovery 2: Advanced Column Statistics that Grégory and I have recently uploaded.
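
To make the kind of per-column statistics mentioned here a bit more tangible, the following is a minimal sketch in plain Python. It is not what DaCHS does internally; in particular, taking the “2σ range“ to be the 2.5th to 97.5th percentile is an assumption made for this illustration (the note linked above has the actual definitions), and column_stats, percentile, and the sample magnitudes are made up.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted data; q is in [0, 100]."""
    if not sorted_values:
        raise ValueError("no values")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def column_stats(values):
    """Summarise a numeric column, ignoring NULLs (None)."""
    data = sorted(v for v in values if v is not None)
    return {
        "min": data[0],
        "max": data[-1],
        "median": statistics.median(data),
        # Assumption: the "2 sigma range" is the central ~95% of the
        # values, i.e. the 2.5th to 97.5th percentile.
        "range_2sigma": (percentile(data, 2.5), percentile(data, 97.5)),
    }


# Made-up magnitudes with one NULL and one outlier.
mags = [12.1, 13.4, None, 14.0, 15.2, 15.9, 16.3, 17.8, 18.0, 25.0]
stats = column_stats(mags)
print(stats)
```

The point of a percentile-based range next to min and max is that a single outlier (the 25.0 above) stretches the maximum but moves the upper percentile much less.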

For you, it's easy: Just run dachs limits q once you're happy with your data, or perhaps once a month for living data, and leave the rest to DaCHS. A fringe benefit: in browser forms, there are now value ranges of the various numeric constraints as placeholders (that's the screenshot on the left in the title picture).

There is a slight downside: As part of this overhaul, DaCHS is now computing the coverage of SIAP and SSAP services based on the footprints of the products as MOCs. While that gives much more precise service footprints, it only works with bleeding-edge pgsphere as delivered in Debian bullseye – or from our Debian repository. If you want to build this from source, you need to get credativ's pgsphere fork for now.

Generate column elements: If you have tables with many columns, even just lexically entering the <column> elements becomes straining. That is particularly annoying if there already is a halfway machine-readable representation of that data.

To alleviate that, very early in the development of DaCHS, I had the gavo mkrd subcommand that you could feed FITS images or VOTables to get template RDs. For a number of reasons, that never worked well enough to make me like or advertise it, and I eventually ended up writing dachs start instead, which is something I like and advertise for general usage.

However, what that doesn't do is come up with the column declarations. To make good on this, there is now a dachs gencol command that will, from a FITS binary table, a VOTable, or a VizieR-style byte-by-byte description, generate columns with as much metadata as it can fathom. Paste that into the output of dachs start, and, depending on your input format, you should have a quick start on a fairly full-featured data collection (also note there's dachs adm suggestucds for another command that may help quickly generate rich metadata).

This currently doesn't work for products (i.e., tables of spectra, images, and the like); at least for FITS arrays, I suppose turning their non-obvious header cards into columns might save some work. Let's see: your feedback is welcome.

Refurbished Datalink XSLT: Since the dawn of datalink, DaCHS has delivered Datalink documents with XSLT stylesheets in order to have nicely formatted pages rather than wild XML when web browsers chance on datalink documents. I have overhauled the Javascript part of this (which, I have to admit, is what makes it pretty). For one, the spatial cutout now works again, and it's modeless (no clicking “edit“ any more before you can drag cutout vertices). I'm also using the datalink/core vocabulary to furnish link groups with proper titles and descriptions, and to have them sorted in a proper result tree. I've talked about it at the interop, and I've prepared a showcase of various datalink documents in the Heidelberg data centre.

Update to DaCHS 2.4 and you'll get the same thing for your datalinks.
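
For readers wondering what vocabulary-based grouping amounts to, here is a rough Python sketch (the stylesheet itself does this in Javascript, and differently in detail). Everything in it is an assumption for illustration: the dictionary is a hand-written excerpt loosely modelled on the desise JSON layout, the labels are invented, and group_links and top_level_term are hypothetical helpers.

```python
# Hand-written excerpt in the spirit of the desise format; a real client
# would download the current datalink/core vocabulary instead.
VOCAB_EXCERPT = {
    "terms": {
        "preview": {"label": "Preview", "wider": []},
        "preview-image": {"label": "Preview Image", "wider": ["preview"]},
        "calibration": {"label": "Calibration", "wider": []},
        "bias": {"label": "Bias", "wider": ["calibration"]},
        "progenitor": {"label": "Progenitor", "wider": []},
    }
}


def top_level_term(term, terms):
    """Follow 'wider' links upwards until a root term is reached."""
    seen = set()
    while terms.get(term, {}).get("wider") and term not in seen:
        seen.add(term)
        term = terms[term]["wider"][0]
    return term


def group_links(rows, vocab):
    """Group (semantics, access_url) pairs under their root term's label."""
    terms = vocab["terms"]
    groups = {}
    for semantics, url in rows:
        root = top_level_term(semantics.lstrip("#"), terms)
        # Terms missing from the vocabulary are grouped under their own name.
        label = terms.get(root, {}).get("label", root)
        groups.setdefault(label, []).append((semantics, url))
    return groups


rows = [
    ("#preview-image", "http://example.org/p.png"),
    ("#bias", "http://example.org/bias.fits"),
    ("#progenitor", "http://example.org/orig.fits"),
    ("#preview", "http://example.org/p.txt"),
]
grouped = group_links(rows, VOCAB_EXCERPT)
for label, links in grouped.items():
    print(label, links)
```

Because the titles come from the vocabulary rather than from the service, a new term added to datalink/core would be labelled properly without touching the grouping code.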

Non-product datalinks: When writing a datalink service, you have to first come up with a descriptor generator. DaCHS will provide a simple one for you (or perhaps a bit more complex ones for FITS images or spectra) – but all of these assume that whatever the datalink ID parameter references is in DaCHS' product table. It turned out that in many interesting cases – for instance, attaching time series to object catalogues – that is not the case, and then you had to write rather obscure code to keep DaCHS from poking around in the product table.

No longer: There is now the //datalink#fromtable descriptor generator. Just fill in which column contains the identifier and the name of the table containing that column and you're (basically) done. Your descriptor will then have a metadata attribute containing the relevant row – along with everything else DaCHS expects from a datalink descriptor.

gavo_specconv: That's a longer story covered previously on this blog.

Index declaration in views: Saying on which columns a database index exists allows users to write smart queries, and DaCHS uses such information internally when rewriting geometrical expressions from ADQL to whatever is in use in the actual database. Hence, making sure these indexes are properly declared is important. But at the same time it's difficult for views, because postgres doesn't let you have indexes on views (for good reasons). Still, queries against views will (usually) use indexes of their underlying tables, and hence those should be declared in the corresponding metadata.

This is tedious in general. DaCHS now helps you with the //procs#declare-indexes-from stream. Essentially, it will compare the columns in the view with the ones from the source tables and then guess which view columns correspond to indexed columns from the source tables; using that, it adds indexed flags to some view columns.

If all this is too weird for you: Thanks to declare-indexes-from, the index declaration now automatically happens in the modern way to build SSAP services, the //ssap#view mixin. Hence, chances are you won't even see this particular STREAM but just notice its beneficial consequences.
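
If you would like a mental model of the guessing step, the following Python sketch shows one conceivable heuristic: matching view columns to indexed source columns by (case-insensitive) name. This is an assumption for illustration and not the actual logic of the stream; guess_indexed_columns and the table and column names are made up.

```python
def guess_indexed_columns(view_columns, source_tables):
    """Return view column names that match an indexed source column by name.

    source_tables maps a table name to a dict with the keys "columns"
    and "indexed", both lists of column names.
    """
    indexed = set()
    for table in source_tables.values():
        indexed.update(name.lower() for name in table["indexed"])
    return [name for name in view_columns if name.lower() in indexed]


# Hypothetical source table feeding a hypothetical view.
sources = {
    "spectra.raw": {
        "columns": ["ssa_pubdid", "ssa_location", "accref", "snr"],
        "indexed": ["ssa_pubdid", "ssa_location"],
    },
}
view_cols = ["ssa_pubDID", "ssa_location", "ssa_targname", "snr"]
flagged = guess_indexed_columns(view_cols, sources)
print(flagged)
```

A name-based match will obviously miss columns that the view renames or computes, which is one reason why this kind of thing can only ever be a guess.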

Sunsetting resources: I've been fiddling off and on with a smart way to pull resources I no longer want to maintain while still leaving a tombstone. I had to re-visit this problem recently because I dropped the Gaia DR1 table from my Heidelberg data centre. So, how do I explain to people why the thing that's been there no longer is?

In general, this is a rather intractable problem; for instance, it's very hard to do something sensible with the TAP_SCHEMA entries or the VOSI tables endpoints for the tables that went away. Pure web pages, on the other hand, can be adorned with helpful info. To enable that, there is now the superseded meta item, which you define in the RD that once held the resources. For Gaia DR1, here's what I used:
```
<meta name="superseded" format="rst">
  We do not publish Gaia DR1 data here any more.
  If you actually need DR1 data, refer to the
  full Gaia mirrors, for instance `the one at
  ARI`_.  Otherwise, please use more recent data
  releases, for instance `eDR3`_.

  .. _the one at ARI: http://gaia.ari.uni-heidelberg.de
  .. _eDR3: /browse/gaia/q3
</meta>
```
Root page template: I slightly streamlined the default root page template, in particular dropping the "i" and "Q" icons for going to the metadata and querying the service. If you have overridden the root template, you may want to see if you want to merge the changes.

As usual, there are many more small repairs and additions, but most of these are either very minor or rather technical. One last thing, though: DaCHS now works with Python 3.9 (3.7 will continue to be supported for a few years at least, earlier 3.x never was), which is going to be the python3 in Debian bullseye. Bullseye itself will only have DaCHS 2.3 (with the Python 3.9 fixes backported), though. Once bullseye has become stable, we will look into putting DaCHS 2.4 into the backports.

Category: Software
GAVO at the Northern Spring Interop 2021

2021-05-28 Markus Demleitner

As usual in May, the people making the Virtual Observatory happen meet for their Interoperability Conference, better known as the Interop – where “meet” still has to be taken with a generous helping of salt (more on this near the end of this post). As has become customary on this blog, let me briefly discuss contributions with a significant involvement of GAVO.

A major thing from my perspective actually happened in the run-up: The IVOA executive committee (“Exec“) approved Version 2.0 of Vocabularies in the VO, a standard saying how hierarchical word lists (“vocabularies“) can be managed, disseminated, and consumed within the VO. Developing the main ideas from sufficiently restricting RDF to coming up with desise (which makes complicated things possible with surprisingly little code), and trying things out on our growing number of vocabularies took up quite a bit of my standards time in the last 20 months or so – and I'm fairly happy with the outcome, which I celebrated with a brief talk on programming with IVOA semantics during Wednesday morning's semantics session.

In that session I gave a second, more discussion-oriented, talk, probing how to formalise data product types – which is surprisingly involved, even with the relatively straightforward use case “figure out a programme to handle the data“: What's a spectrum? Well, something that maps a spectral coordinate to... hm. Is it still a spectrum if there are multiple sorts of values (perhaps flux, magnitude, and polarisation)? If we allow, in effect, tuples, why not whole images, which would make spectral cubes spectra – but of course few client programmes that deal with spectra do anything useful with cubes, so clearly such a definition would kill our use case. And what about slit spectra, mapping a spatial coordinate to spectra?

All this of course is reminiscent of the classical problems of semantics: An elephant is a big animal with a trunk. But when an elephant loses its trunk in an accident: does it stop being an elephant? So, much of the art here is finding the sweet spot of usability between strict and formal semantics (that will never fit the real world) and just tossing around loosely defined strings (that will simply not be machine-readable). After the session, I came up with the 2021-05-26 draft of product-type. If you read this a few years down the road, it might be interesting to compare with what product-type is today. I'm curious myself.

Later on Wednesday CET, I did a shameless plug for my Datalink-transforming XSLT (apologies for a github link, but I'm fishing for PRs here; if you use DaCHS, you'll get the updated stuff with version 2.4, due soon). The core of this dates back to the dawn of datalink, but with a new graphical cutout code and in particular vocabulary-based tree-ification of the result rows, I figured it's time to remind the operators of datalink services it's still out there for them to take up. Perhaps more than from the slides, you can see what I am after here by just trying the Datalink examples I've collected for this talk and comparing document source, the appearance without Javascript (pure XSLT) and the appearance with Javascript (I'm a bit ashamed I'm relying so heavily on it, but much of this really can only be done client-side).

Quite a bit after midnight my time (still Thursday UTC), Mark Taylor talked about Software Identification, something I've been working on with him recently. It is one of the things that is short and trivial but that, when unregulated, just doesn't work; in this case it's servers and clients saying what they are when they speak HTTP. I stumbled into the problem while trying to locate severely outdated DaCHS installations – so, in a way I put effort into the Note Mark was talking about (and which I have just uploaded to the IVOA Document Repository) as a sort of penance.

While I was already asleep when Mark gave his talk, I was back at the Interop Friday morning CEST, when Hendrik Heinl talked about the LOFAR TAP service (which, I'm proud to say, runs on top of DaCHS); this was mainly live operations in TOPCAT (which is why there's no exciting slides), but Hendrik used a pyVO script doing cutouts in an (optical) mosaic of the Fornax cluster built on top of – and that's the main point – Datalink and SODA. Working this out with Hendrik made me realise the documentation of Datalink in pyVO really needs… love. Or, better, work.

Later on Friday, there was the Registry session, where I gave brief (and somewhat cramped) talks on advanced column metadata (which is intended to one day let you query the registry for things like “roughly complete to 18 mag” or “having objects out to redshift 4“) and how to put VODataService 1.2 coverage into RegTAP – I expect you'll read more on both topics on this blog as they mature to a level at which this can leave the Registry nerd circles.

And now, about 10 pm on Friday, the meeting is slowly winding down; beyond all the talks (which were, regrettably for a free software spirit like me, on zoom), the real bonus was that there was a gather.town attached to the conference. Now, that's a closed, proprietary, non-self-hostable platform, too, and so I have all reason to grumble. But: for the first time since February 2020 it felt like a conference, with the most useful action happening outside of the lecture halls, from trying to reach consensus on VEP-006 to teaching DaCHS datalink service declaration to learning about working with visibilities coming from VLBI (where it's even more difficult than it is with the big antenna arrays). So… this one time I've made my peace with proprietary platforms.

A propos of “say no to platforms“ (in this case, slack): Due to the recent troubles with freenode, last week saw, in addition to the Interop, the GAVO IRC channel move to libera.chat (where it's still #gavo). So, for instant messaging us now that the Interop is (in effect) over: Come there.

Category: Meetings
LAMOST5 meets Datalink

2019-12-11 Markus Demleitner

One of the busiest spectral survey instruments operated right now is the Large Sky Area Multi-Object Fiber Spectrograph Telescope (LAMOST). And its data is in the VO, more or less: DR2 and DR3 have been brought into the VO by our Czech colleagues, but since they currently lack resources to update their services to the latest releases, they have kindly given me their DaCHS resource descriptor, and so I had a head start for publishing DR5 in Heidelberg.

With some minor updates, here it is now: Over nine million medium-resolution spectra covering large parts of the northern sky – the spatial coverage is like this:

There's lots of fun to be had with this; of course, there's an SSA service, so when you point Aladin or Splat at some part of the covered sky and look for spectra, chances are you'll see LAMOST spectra, and when working on some of our tutorials (this one, for example), it happened that LAMOST actually had what I was looking for when writing them.

But I'd like to use the opportunity to mention two other modes of accessing the data.
Tablesample and TOPCAT's Plot Table activation action

Say you'd like to look at spectra of M stars and would like to have some sample from across the sky: fire up TOPCAT, point its TAP client to the GAVO DC TAP service (http://dc.g-vo.org/tap) and run something like:
```
select
  ssa_pubDID, accref, raj2000, dej2000, ssa_targsubclass
from lamost5.data tablesample(1)
where
  ssa_targsubclass like 'M%'
```
This is using the TABLESAMPLE modifier in the from clause, which isn't standard ADQL yet. As mentioned in the DaCHS 1.4 announcement, DaCHS has a prototype implementation of what's been discussed on the IVOA's DAL mailing list: pick a part of a table rather than the full one. It takes a percentage as an argument, and tells the server to choose about this percentage of the table's records using a reasonable and fast heuristic. Note that this won't give you perfect statistical sampling, but if it's not “good enough” for some purpose, I'd like to learn about that purpose.

Drawing a proper statistical sample, on the other hand, would take minutes on the GAVO database server – with tablesample, I had the roughly 6000 spectra the above query returns essentially instantaneously, and from eyeballing a sky plot of them, I'd say their distribution is close enough to that of the full DR5. So: tablesample is your friend.
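
If you would rather script this than click through TOPCAT, the same query can be sent with pyVO. The sketch below only wraps the query from above; sample_query and fetch_sample are hypothetical helper names, and the actual network call is left commented out so the snippet also runs offline.

```python
def sample_query(table, percent, subclass_prefix):
    """Build an ADQL query using the non-standard TABLESAMPLE modifier."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return (
        "select ssa_pubDID, accref, raj2000, dej2000, ssa_targsubclass"
        f" from {table} tablesample({percent})"
        f" where ssa_targsubclass like '{subclass_prefix}%'")


def fetch_sample(query, tap_url="http://dc.g-vo.org/tap"):
    """Run the query synchronously; needs pyvo and network access."""
    import pyvo
    return pyvo.dal.TAPService(tap_url).run_sync(query)


query = sample_query("lamost5.data", 1, "M")
print(query)
# rows = fetch_sample(query)   # uncomment to actually query the service
```

Since tablesample only picks roughly the requested percentage, expect the number of rows to vary between tables and services; and because this simply pastes its arguments into the query, it is for your own scripts only, not for untrusted input.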

For a quick look at the spectra themselves, in TOPCAT click Views/Activation Actions, check “Plot Table” and make sure TOPCAT proposes the accref column as “Table Location” (if you don't see these items, update your TOPCAT – it's worth it). Now click on a row or perhaps a dot on a plot and behold an M spectrum.
Cutouts via Datalink

LAMOST releases spectra in FITS format pretty much like the ones you may know from SDSS. The trick above works because we instead hand out proper, IVOA Spectral Data Model-compliant spectra through SSA and TAP. However, if you need to go back to the original files, you can, using Datalink. If you're unsure what this Datalink thing is: call me vain, but I still like my 2015 ADASS poster explaining that. In TOPCAT, you'd be using the “Invoke Service” activation action to get to the datalinks.

If you have actual work to do, offloading repetitive work to the computer is what you want, and fortunately, pyVO knows about datalink, too. I grant you this is hard to discover so far, and the interface is... a tiny bit clunky. Until some kind soul cleans up the pyVO datalink act, a poster Stefan and I showed at the 2017 ADASS might give you an idea which buttons to press. Or read on and see how things work for LAMOST5.

The shortest way to datalinks is a TAP query that at least retrieves the ssa_pubdid column (that's a must; Datalink can't work without it) and, on the result, run the iter_datalinks method. This returns an object in which you can find the associated data items (in this case, a preview and the original FITS with the #progenitor semantics), plus the cutout service.

Hence, a minimal example for pulling the legacy FITS links out of the first three items in lamost5.data would look like this:
```
import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
for dl in svc.run_sync("select top 3 ssa_pubdid"
        " from lamost5.data").iter_datalinks():
    print(next(dl.bysemantics("#progenitor")
        )["access_url"].decode("ascii"))
```
This is a bit different from listing 2 in the poster linked above because it's python3, so getting the first element from an iterator looks a bit different, and (curse astropy.votable for returning VOTable chars as bytes rather than strings!) you'll want to turn the URL into a proper string manually.
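
Should the libraries on your machine hand back proper strings instead of bytes (an assumption worth checking on your installation), calling decode on the URL will fail. A small defensive helper avoids having to care; as_text is a hypothetical name, not something pyVO or astropy provide.

```python
def as_text(value, encoding="ascii"):
    """Return value as str, decoding it first if it arrived as bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


# Works for both flavours of the same URL.
print(as_text(b"http://example.org/spec.fits"))
print(as_text("http://example.org/spec.fits"))
```

In the loop above, you would then write as_text(next(dl.bysemantics("#progenitor"))["access_url"]) rather than calling decode directly.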

Another, actually more interesting, thing you can do with Datalink is cut out regions of interest. The LAMOST spectra are fairly long (though of course still small by image standards), so if you're only interested in a single line, you can save a bit of storage and bandwidth over blindly pulling the whole thing.

For instance, if you wanted to pull the vicinity of the H and K Fraunhofer lines from the matches in the loop in the snippet above, you could say:
```
from astropy import units as u
proc = next(dl.iter_procs())
cutout = proc.processed(band=(392*u.nm,398*u.nm))
```
And this is what I've done for the decorative left border above: it's the H and K line profiles for 0.1% of the stars LAMOST has classified as G8. Building the image didn't take more than a few seconds (where I'd like the cutouts to be faster by a factor of 10; I guess that's about an afternoon of work for me, so if it'd save you more than that afternoon, poke me to do it).

What's coming back is tables. By the time python has digested these, they're numpy record arrays. Thus, you can immediately bring in your beloved scipy (or whatever). For instance, if for some reason you're convinced that the H and K lines should be fit by identical Gaussians in the boring case and would like to find objects for which that's patently untrue and that hence could be un-boring, here's how you could do that:
```
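import numpy
from astropy import units as u
from scipy.optimize import curve_fit

def spectral_model(wl, c1, c2, depth, width):
    # two Gaussian absorption lines of identical depth and width,
    # centred on c1 and c2, on a continuum normalised to 1
    return (1
        -depth*numpy.exp(-numpy.square(wl-c1)
            /numpy.square(width))
        -depth*numpy.exp(-numpy.square(wl-c2)
            /numpy.square(width)))

# get_profiles is not defined here; see the explanation after the listing
for pubdid, prof in get_profiles(
        "G8", (392*u.nm,398*u.nm), 0.01, 4):
    prof["flux"] /= max(prof["flux"])
    popt, pcov = curve_fit(
        spectral_model, prof["spectral"], prof["flux"],
        sigma=prof["flux_error"],
        p0=[3968, 3934, 1, 1])
    # pcov[3][3] is the variance of the fitted width; a large value
    # means the identical-Gaussians model fits this spectrum badly
    if pcov[3][3]>1:
        break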
```
– where get_profiles is essentially doing the TAP plus datalink routine above, except I'm swallowing spectra with too much noise and I have the function transform the spectral coordinate into the objects' rest frames. If you're curious how I'm doing this just based on the IVOA Spectral Data Model, check the source linked at the bottom of this post.
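
The rest-frame step itself is simple once a redshift or radial velocity is at hand. The sketch below shows the arithmetic only; it is not the code linked at the bottom of this post, and where the velocity comes from is left open. Both function names are made up, and z = v/c is the non-relativistic approximation.

```python
SPEED_OF_LIGHT_KMS = 299792.458


def radial_velocity_to_redshift(v_kms):
    """Non-relativistic approximation z = v/c, fine for stellar velocities."""
    return v_kms / SPEED_OF_LIGHT_KMS


def to_rest_frame(observed_wavelengths, redshift):
    """Shift observed wavelengths into the rest frame: l_rest = l_obs/(1+z)."""
    if redshift <= -1:
        raise ValueError("redshift must be greater than -1")
    return [wl / (1.0 + redshift) for wl in observed_wavelengths]


# A star receding at 100 km/s: observed lines move bluewards again.
rest = to_rest_frame([3935.0, 3970.0], radial_velocity_to_redshift(100.0))
print(rest)
```

For a receding star at 100 km/s, the shift near the H and K lines is a bit more than one Ångström, which is not negligible compared to the line widths being fitted.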

I've just run this, and the first spectrum that the machinery flagged as suspicious was this:

– which doesn't look like I've made a discovery just yet. But that doesn't mean there's not a lot to find within LAMOST5's lines...

To get you up to speed quickly: here's the actual python3 code I ran for the “analysis” and the plot.
Category: Data
DaCHS 1.2 is out

2018-07-17 Markus Demleitner

Today, I have released DaCHS 1.2 – somewhat belatedly perhaps, because I managed to break my collarbone, but here it is. If you've been following this blog, you already know about the headline news: the dachs start command, ADQL 2.1, and early support for STC in the registry.

If you're not yet on DaCHS 1.1, please have a quick look at the corresponding release article. While the upgrade itself should work fine in one go even from older versions, the release notes of course apply cumulatively, and you may still have to do the dist-upgrade to 1.1.

As usual, the generic upgrading instructions are available in the operator's guide (in short: do a dachs val ALL; apt update; apt upgrade). Since I've still encountered DaCHS installations with wrong sources.lists last April: Note again that our repository names have changed in August 2016 – we now have release and beta rather than Debian release names. So, make sure you have something like:
```
deb http://vo.ari.uni-heidelberg.de/debian release main
```
in your /etc/apt/sources.list, not something containing “stable” or the like.

That said, here's the commented changes for 1.2:
- New dachs start command to produce structured templates for certain service types. See Horror Vacui Begone on this blog for the full story.
- Support for ADQL 2.1 (actually, its current proposed recommendation), including almost all of the optional parts (see Speak out on ADQL 2.1 on this blog). While not strictly necessary, it's a good idea to run dachs imp //adql after the upgrade; this will give you some nice new UDFs, in particular gavo_histogram.
- New coverage element (with updaters) to build and declare the space-time-spectral coverage of a resource. It would be great if you could add coverage elements to your resources where it makes sense and re-publish them. This blog post tells you how to do it (you'll have to scroll down a bit).
- There is now odbcGrammar to feed an import from another database. Essentially, you put an ODBC connection string into a file, point your sources element there, and you'll get one rawdict per tuple in a foreign database table. This might be a nice way to publish moderate-size non-postgres tables via DaCHS.
- You can now declare associated datalink services for tables using the _associatedDatalinkSvc meta item. In particular, if you had a datalink property on SSAP services, you should migrate at some point. One advantage: Users will get the datalinks even when querying the tables through TAP. See “Integrating Datalink Services” in the reference documentation for the full story.
- We now force matplotlib to read its configuration from /var/gavo/etc/matplotlibrc; to get a default, just run dachs init again. This is mainly to avoid uncontrolled imports of matplotlibrcs when DaCHS is run under a uid that does other things now and then.
- DaCHS now supports VOSI 1.1; in particular, DaCHS now understands the detail hints and has per-table endpoints, so clients like TOPCAT could avoid reading the full table metadata in one go. Realistically, at least TOPCAT doesn't yet, so this is perhaps less cool than it may sound.
- The indices generated by the ssa mixins are now a bit more sensible considering typical query modes. You probably want to run dachs imp -I on the RDs for your ssap data collections when convenient. If you have larger spectral collections, chances are many queries will be a lot faster.
- ssapCore no longer wantonly adds preview columns. If you have previews with spectra, you probably want to add <property name="previews">auto</property> to your ssapCores. If you don't, the preview column will not be added to SSA responses (right now, few clients evaluate it, but that will hopefully change in the future).
- You can now add a statisticsTarget property to columns; you will want this on largish tables with non-uniformly distributed values to aid the query planner; something like <property key="statisticsTarget">10000</property> within the corresponding column element can go a long way to improve query planning (you need to run gavo imp -m on the RD after the change).
- DaCHS's log now by default does not contain IP addresses, user agents, and referrers any more, which should mostly keep you from processing personal data and thus from having to muck around with the EU GDPR. To get back the previous behaviour, set [web]logFormat in /etc/gavo.rc to combined.
- I fixed some utypes for obscore 1.1. These utypes are useless, so there's nothing you have to do. But then stilts taplint complains about them, and so you may want to run dachs imp -m //obscore.
- As usual, there are many minor bug fixes and improvements (e.g., memmapping FITSes for cutout again, delimited table references in ADQL, new-style tutorial resource records, correct obscore standardId, much saner nD-arrays in VOTables).
Well – enjoy the release, and if something goes wrong with it, be sure to let us know, preferably on the DaCHS-support mailing list.

Category: Software
DaCHS, SODA, and Datalink

2016-12-12 Markus Demleitner

DaCHS, the Data Center Helper Suite, is a comprehensive suite for publishing astronomical data to the Virtual Observatory, supporting most major protocols out there. On Dec 12, GAVO released a new version, 0.9.8. The most notable change is that now SODA is supported as specified in the last IVOA Proposed Recommendation.

This is fairly big news, as SODA is the VO's answer to providing cutout services and the like, which obviously is important with datasets in the Multi-Gigabyte range and for the VO's wider programme of trying to enable users to only download what they need. But even for spectra, which aren't typically terribly large, we have been using SODA; for instance, when you just want to see the development of a single line over time, say, it's nice to not have to bother with the full spectrum. The spectral client SPLAT has been offering such functionality for a couple of years now -- watch out for the scissors icon in discovery results. These indicate SODA support on the respective services.

Another client that will support SODA and its basis Datalink is Aladin – we've seen a promising demo of that during the last Interop in Trieste. Until the clients are there, DaCHS contains a (largely re-usable) stylesheet that generates simple UIs for Datalink documents and SODA services. Some examples:
- Cutout over a califa cube (including links to previous versions)
- Cutout over a plate scan (including links to a photo of the cover page and a grey wedge)
- Various retrieval options for a spectrum (which also links split-order versions of the echelle spectrum)
Note again that all of these are not actually web pages, they're machine-readable metadata collections; if you don't believe it, pull the URLs with curl. To learn more about the combo of Datalink and SODA, check out this ADASS 2015 poster (preferably before even looking at the not terribly readable standards texts).
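
To illustrate the “machine-readable“ part without any network access, here is a small Python sketch that pulls the link rows out of a Datalink-like VOTable using only the standard library. The document in SAMPLE is made up and heavily abridged (real responses have more columns such as service_def, description, or content_type, and may use a binary serialisation, for which a full VOTable parser like astropy's is the better tool); datalink_rows is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily abridged Datalink response for illustration only.
SAMPLE = """<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.3" version="1.3">
  <RESOURCE type="results">
    <TABLE>
      <FIELD name="ID" datatype="char" arraysize="*"/>
      <FIELD name="access_url" datatype="char" arraysize="*"/>
      <FIELD name="semantics" datatype="char" arraysize="*"/>
      <DATA><TABLEDATA>
        <TR><TD>ivo://example/x?1</TD><TD>http://example.org/1.fits</TD><TD>#this</TD></TR>
        <TR><TD>ivo://example/x?1</TD><TD>http://example.org/1.png</TD><TD>#preview</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

NS = {"v": "http://www.ivoa.net/xml/VOTable/v1.3"}


def datalink_rows(votable_text):
    """Return the TABLEDATA rows of the first table as a list of dicts."""
    root = ET.fromstring(votable_text)
    table = root.find(".//v:TABLE", NS)
    names = [field.get("name") for field in table.findall("v:FIELD", NS)]
    return [
        dict(zip(names, (td.text for td in tr.findall("v:TD", NS))))
        for tr in table.findall(".//v:TR", NS)]


links = datalink_rows(SAMPLE)
for link in links:
    print(link["semantics"], link["access_url"])
```

The stylesheet-generated pages are just a browser-side rendering of exactly this kind of table.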

If you're running DaCHS yourself and can't wait to run Datalink and SODA -- here's how to do that.

Category: Demo
