DaCHS 2.11: Persistent TAP Uploads
The traditional autumn release of GAVO's server package DaCHS is somewhat late this year, but not so late that I could not still claim it comes after the interop. So, here it is: DaCHS 2.11 and the traditional what's new post.
But first, while I may have DaCHS operators' attention: If you have always wondered why things in DaCHS are as they are, you will probably enjoy the article Declarative Data Publication with DaCHS, which one day will be in the proceedings of ADASS XXXIV (and before that probably on arXiv). You can read it in a pre-preprint version already now at https://docs.g-vo.org/I301.pdf, and feedback is most welcome.
Persistent TAP Uploads
The potentially most important new feature of DaCHS 2.11 (in my opinion) will not be news to regular readers of this blog: Persistent TAP Uploads.
At this point, no client supports this, and presumably when clients do support it, it will look somewhat different, but if you like the bleeding edge and have users that don't mind an occasional curl or requests call, you would be more than welcome to help try the persistent uploads. As an operator, it should be sufficient to type:
dachs imp //tap_user
To make this more useful, you probably want to hand out proper credentials (make them with dachs adm adduser) to people who want to play with this, and point the interested users to the demo jupyter notebook.
I am of course grateful for any feedback, in particular on how people find ways to use these features to give operators a headache. For instance, I really would like to avoid writing a quota system. But I strongly suspect I will have to…
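For users who want to try this from Python, here is a minimal sketch of what such an upload request could look like, using only the standard library. It assumes that a table is created by PUT-ing a VOTable to a user_tables/<name> resource below the TAP endpoint, with HTTP basic authentication. That path, the server URL, the table name, and the credentials are assumptions for illustration, so check the demo notebook for the interface your server actually offers. The request is only built here, not sent.

```python
import base64
import urllib.request


def build_upload_request(tap_url, table_name, votable_bytes, user, password):
    """returns a urllib Request that would PUT votable_bytes as a
    persistent upload.

    The user_tables/<name> path is an assumption, not something this
    post specifies.
    """
    url = "{}/user_tables/{}".format(tap_url.rstrip("/"), table_name)
    credentials = base64.b64encode(
        "{}:{}".format(user, password).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=votable_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/x-votable+xml",
            "Authorization": "Basic " + credentials,
        })


# Hypothetical server, table name, and credentials
request = build_upload_request(
    "http://localhost:8080/tap", "my_upload",
    b"<VOTABLE/>", "alice", "secret")
print(request.get_method(), request.full_url)
```

Passing the result to urllib.request.urlopen would actually send it; in that case, the payload should of course be a real VOTable rather than the empty placeholder used here.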
On-loaded Execute-s
DaCHS has a built-in cron-type mechanism, the execute Element. So far, you could tell it to run jobs every x seconds or at certain times of the day. That is fine for what this was made for: updates of “living” data. For instance, the RegTAP RD (which is what's behind the Registry service you are probably using if you are reading this) has something like this:
<execute title="harvest RofR" every="40000">
  <job><code>
    execDef.spawnPython("bin/harvestRofR.py")
  </code></job>
</execute>
This will pull in new publishing registries from the Registry of Registries, though that is tangential; the main thing is that some code will run every 40 kiloseconds (or about 11 hours).
Compared to plain cron, the advantage is that DaCHS knows the context (for instance, the RD's resdir does not need to be spelled out in the example call), that you can sync with DaCHS' own facilities, and most of all that everything is in one place and can be moved together. By the way, it is surprisingly simple to run a RegTAP service of your own if you already run DaCHS. Feel free to inquire if you are interested.
In DaCHS 2.11, I extended this facility to include “events” in the life of an RD. The use case seems rather remote from living data: Sometimes you have code you want to share between, say, a datalink service and some ingestion code. Such code is too resource-specific to keep in the local namespace, and putting it there would violate RD locality on top.
So, the functions somehow need to sit on the RD, and something needs to stick them there. To do that, I recommended a rather hacky technique with a LOOP with codeItems in the respective howDoI section. But that was clearly rather odious – and fragile on top because the RD you manipulated was just being parsed (but scroll down in the howDoI and you will still see it).
Now, you can instead tell DaCHS to run your code when the RD has finished loading and everything should be in place. In a recent example I used this to have common functions to fetch photometric points. In an abridged version:
<execute on="loaded" title="define functions"><job>
  <setup imports="h5py, numpy"/>
  <code>
    def get_photpoints(field, quadrant, quadrant_id):
      """returns the photometry points for the specified time series
      from the HDF5 as a numpy array.

      [...]
      """
      dest_path = "data/ROME-FIELD-{:02d}_quad{:d}_photometry.hdf5".format(
        field, quadrant)
      srchdf = h5py.File(rd.getAbsPath(dest_path))
      _, arr = next(iter(srchdf.items()))

      photpoints = arr[quadrant_id-1]
      photpoints = numpy.array(photpoints)
      photpoints[photpoints==0] = numpy.nan
      photpoints[photpoints==-9999.99] = numpy.nan
      return photpoints

    def get_photpoints_for_rome_id(rome_id):
      """as get_photpoints, but taking an integer rome_id.
      """
      field = rome_id//10000000
      quadrant = (rome_id//1000000)%10
      quadrant_id = (rome_id%1000000)
      base.ui.notifyInfo(f"{field} {quadrant} {quadrant_id}")
      return get_photpoints(field, quadrant, quadrant_id)

    rd.get_photpoints = get_photpoints
    rd.get_photpoints_for_rome_id = get_photpoints_for_rome_id
  </code></job></execute>
(full version; if this is asking you to log in, tell your browser not to wantonly switch to https). What is done here in detail again is not terribly relevant: it's the usual messing around with identifiers and paths and more or less broken null values that is a data publisher's everyday lot. The important thing is that with the last two statements, you will see these functions wherever you see the RD, which in RD-near Python code is just about everywhere.
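The identifier arithmetic in get_photpoints_for_rome_id is easy to try without DaCHS or the HDF5 files. The sketch below repeats the decomposition and the path template from the RD code; make_rome_id is a hypothetical inverse added only for illustration and is not part of the RD, and the example id is made up.

```python
def split_rome_id(rome_id):
    """returns (field, quadrant, quadrant_id) for an integer rome_id,
    using the same arithmetic as get_photpoints_for_rome_id.
    """
    field = rome_id // 10000000
    quadrant = (rome_id // 1000000) % 10
    quadrant_id = rome_id % 1000000
    return field, quadrant, quadrant_id


def make_rome_id(field, quadrant, quadrant_id):
    """returns the rome_id for the three components.

    Hypothetical helper, not in the RD; it assumes a one-digit
    quadrant and a quadrant_id below one million.
    """
    return field * 10000000 + quadrant * 1000000 + quadrant_id


def photometry_path(field, quadrant):
    """returns the resdir-relative HDF5 path used in get_photpoints."""
    return "data/ROME-FIELD-{:02d}_quad{:d}_photometry.hdf5".format(
        field, quadrant)


# A made-up id: field 12, quadrant 3, row 456 within the quadrant
example_id = 123000456
components = split_rome_id(example_id)
print(components, photometry_path(*components[:2]))
```

Note that get_photpoints indexes the array with quadrant_id-1, so quadrant ids apparently count from one while the array rows count from zero.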
dachs start taptable
Since 2018, DaCHS has supported kickstarting the authoring of RDs, which is, I claim, the fun part of a data publisher's tasks, through a set of templates mildly customised by the dachs start command. Nobody should start a data publication with an empty editor window any more. Just pass the sort of data you would like to publish and start answering sensible questions. Well, “sort of data” within reason:
$ dachs start list
epntap -- Solar system data via EPN-TAP 2.0
siap -- Image collections via SIAP2 and TAP
scs -- Catalogs via SCS and TAP
ssap+datalink -- Spectra via SSAP and TAP, going through datalink
taptable -- Any sort of data via a plain TAP table
There is a new entry in this list in 2.11: taptable. In both my own work and watching other DaCHS operators, I have noticed that my advice “if you want to TAP-publish any old material, just take the SCS template and remove everything that has scs in it” was not a good one. It is not as simple as that. I hope taptable fits better.
A plan for 2.12 would be to make the ssap+datalink template less of a nightmare. So far, you basically have to fill out the whole thing before you can start experimenting, and that is not right. Being able to work incrementally is a big morale booster.
VOTable 1.5
VOTable 1.5 (at this point still a proposed recommendation) is a rather minor, cleanup-type update to the VO's main table format. Still, DaCHS has to say it is what it is if we want to be able to declare refposition in COOSYS (which we do). Operators should not notice much of this, but it is good to be aware of the change in case there are overeager VOTable parsers out there or in case you have played with DaCHS' MIVOT generator; in 2.10, you could ask it to do its spiel by requesting the format application/x-votable+xml;version=1.5. In 2.11, it's application/x-votable+xml;version=1.6. If you have no idea what I was just saying, relax. If this becomes important, you will meet it somewhere else.
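If you do have scripts that request the MIVOT-annotated format, the change amounts to swapping one string. As a rough sketch, here is how a synchronous TAP query URL with such a format request might be put together. It assumes the format is passed in TAP's RESPONSEFORMAT parameter; the server URL and the query are placeholders, and the version table only records what is stated above.

```python
from urllib.parse import urlencode

# Format strings for the MIVOT generator, per DaCHS version (from the text)
MIVOT_FORMATS = {
    "2.10": "application/x-votable+xml;version=1.5",
    "2.11": "application/x-votable+xml;version=1.6",
}


def build_sync_url(tap_url, adql, dachs_version):
    """returns a TAP sync URL asking for the MIVOT-annotated format.

    Using RESPONSEFORMAT for this is an assumption of this sketch.
    """
    params = {
        "LANG": "ADQL",
        "QUERY": adql,
        "RESPONSEFORMAT": MIVOT_FORMATS[dachs_version],
    }
    # urlencode takes care of escaping ";", "=", "+", and "/"
    return "{}/sync?{}".format(tap_url.rstrip("/"), urlencode(params))


# Placeholder server and query
sync_url = build_sync_url(
    "http://localhost:8080/tap",
    "SELECT TOP 1 * FROM tap_schema.tables",
    "2.11")
print(sync_url)
```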
Minor Changes
That's almost it for the more noteworthy news; as usual, there is a plethora of minor improvements, bug fixes and the like. Let me briefly mention a few of these:
- The ADQL form interface's registry record now includes the site name. In case you are in this list, please say dachs pub //adql after upgrading.
- More visible legal info, temporal, and spatial coverage in table and service infos; one more reason to regularly run dachs limits!
- VOUnit's % is now known to DaCHS (it should have been since about 2.9)
- More vocabulary validation for VOResource generation; so, dachs pub might now complain to you when it previously did not. It is now right and was wrong before.
- If you annotate a column as meta.bib.bibcode, its values will be rendered as ADS links
- The RD info links to resrecs (non-DaCHS resources, essentially), too.
Upgrade As Convenient
As usual, if you have the GAVO repository enabled, the upgrade will happen as part of your normal Debian apt upgrade. Still, if you have not done so recently, have a quick look at upgrading in the tutorial. If, on the other hand, you use the Debian-distributed DaCHS package and you do not need any of the new features, you can let things sit and enjoy the new features after your next dist-upgrade.