Since the Groningen Interop is over, it’s time for a DaCHS release, and so, roughly half a year after the release of DaCHS 1.3, today I’ve pushed DaCHS 1.4 into our Debian repository.
As usual, you should upgrade as soon as you find time to do so, because upgrades become more difficult if they span large version gaps; and one of these days you will need some new feature or run into one of the odd bugs. Upgrading is a good opportunity to also get your DaCHS ready for buster by adding the repos mentioned there.
The list of new features is rather short this time around. Here are some noteworthy ones:
There’s now an XML grammar that can be used when you have to parse smallish snippets of XML as, for instance, in VOEvent.
You can now use TABLESAMPLE(1) after a table specification in DaCHS’ ADQL to tell the database engine to just use 1% of a table for a query. While this isn’t a precise way to sample tables, it’s great when developing queries.
Also among new features I’d like to see in ADQL and have therefore put into DaCHS is GENERATE_SERIES(a,b), which is what is known as a table-generating function in SQL. If you know SDSS CasJobs, you’ll have seen lots of those already. GENERATE_SERIES, however, is really plain: it just spits out a table with a column with integers between a and b. For an example of why one might want to have that, check out the poster I’m linking to in my ADASS report.
If you have an updating data descriptor (usually, because you keep feeding data into a data collection), DaCHS will no longer automatically re-make its dependencies (like, say, views). That’s because that’s not necessary in general, and it’s a pain if every update on an obscore-published table tears down and rebuilds the obscore view. For the rare cases when you do need to rebuild dependencies, there’s now a remakeOnDataChange attribute on data.
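To illustrate what a table-generating function like GENERATE_SERIES is good for, here is a minimal sketch that emulates it in SQLite with a recursive common table expression. Everything in it is an assumption made for illustration: the obs table, its year column, and the inclusive treatment of both bounds (which is how PostgreSQL's generate_series behaves). It is not how DaCHS implements the function. A typical use is counting per bin without losing the empty bins:

```python
import sqlite3

# Hypothetical toy table: one row per observation, with the year it was taken.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (year INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?)", [(2001,), (2001,), (2003,)])

# The recursive CTE stands in for GENERATE_SERIES(:a, :b), bounds inclusive.
# The LEFT JOIN keeps years without observations, which then count as 0.
SERIES_QUERY = """
WITH RECURSIVE series(value) AS (
  SELECT :a
  UNION ALL
  SELECT value + 1 FROM series WHERE value < :b
)
SELECT series.value, COUNT(obs.year)
FROM series LEFT JOIN obs ON obs.year = series.value
WHERE series.value <= :b
GROUP BY series.value
ORDER BY series.value
"""

counts = conn.execute(SERIES_QUERY, {"a": 2000, "b": 2004}).fetchall()
for year, n_obs in counts:
    print(year, n_obs)
```

A plain GROUP BY on obs alone would only report 2001 and 2003; joining against the generated series is what yields rows for the empty years.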
To conclude on a low note: This is probably going to be the last release of DaCHS for python 2. Even though we will have to shed a dependency or two that simply will not be ported to python 3, and even though I’m rather unhappy with a few properties of the python 3 port of twisted, there’s probably no way to escape this, given that Debian is purging out python 2 packages quickly already.
So, when we meet again for the next release, you’ll probably be looking at DaCHS 2.0, and where you have custom code in your RDs, it’s rather likely that you’ll see a minor amount of breakage. I promise I’ll do everything I can to make the migration easy for deployers, but I can’t do higher magic, so: If there’s ever been a time to add regression tests to your RDs, it’s now.
DaCHS is developed on Debian, and Debian is the recommended deployment platform. Hence, a new major release of Debian (where major means for them: We may break stuff) is always a big thing for me. And so it was with the release that came in July, codenamed “buster”. Both on the “big thing” and on the “break” counts. This posting gives DaCHS deployers some background for their buster upgrades. Astronomers not running Debian themselves won’t risk missing anything if they skip this post.
So, after I upgraded, the first thing I noticed was that DaCHS would no longer even start because astropy (which it needs, in particular, because that’s where pyfits sits these days) was gone. Simple explanation: Upstream astropy doesn’t support python2 any more, and so Debian buster only has python3-astropy.
Moving DaCHS to python3, unfortunately, isn’t that easy; a major dependency, nevow (essentially, a web framework), isn’t ported yet, and porting it is a major thing. Believe me, I’ve tried. The nasty thing, in particular, is that twisted, which lies below nevow still, hands up lots of byte strings. And in python3, b"a"!="a". You wouldn’t believe how many interesting bugs that simple truth introduces when you got a library that handed out “just strings” in python2 and now byte strings in python3. Yikes.
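To make the byte string problem concrete, here is a small Python 3 sketch. The handlers dictionary and the find_handler function are made up for illustration and have nothing to do with actual twisted or nevow code; they only show the kind of silent failure that comes from mixing the two string types.

```python
# What a Python 2 era library handed out as "just strings" arrives as bytes
# in Python 3, and bytes never compare equal to text.
raw = b"a"    # as delivered by the lower layer
text = "a"    # as written in application code

print(raw == text)                   # False
print(raw.decode("ascii") == text)   # True

# Typical symptom: a lookup with the wrong string type silently finds nothing.
handlers = {"GET": "render", "POST": "submit"}


def find_handler(method):
    """Return the handler name for an HTTP method given as bytes or str."""
    if isinstance(method, bytes):
        method = method.decode("ascii")
    return handlers.get(method)


print(handlers.get(b"GET"))   # None: the bytes key does not match the str key
print(find_handler(b"GET"))   # render
```

Nothing raises an exception here, which is what makes such bugs tedious: the lookup just comes back empty until someone decodes at the boundary.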
Update (2019-08-28): After quite a bit of experimentation, I finally gave up on providing a python2 version of astropy through release, because for a complicated set of reasons (including numpy declaring a conflict with existing astropys in buster) it is impossible to provide a package that works in buster and doesn’t break stretch. So, for buster only you’ll have to have a second (or, if running beta, third) gavo line in your sources.list (or equivalent):
deb http://vo.ari.uni-heidelberg.de/debian buster-foreports main
The instructions at our APT repository have been updated, so you won’t have to bookmark this particular page.
But that wasn’t the end of it. Buster comes with Postgres 11, which I look forward to in particular because it supports parallel query execution. That could help us quite a bit, given our large catalogs that we quite often want to run sequential scans on. But of course this means upgrading postgres. And attempting to do that on my development machine immediately hit a wall. What’s nice is that the q3c and pgsphere extensions that we’ve had to push out ourselves so far are now part of Debian main. What’s rather fatal is that our pgsphere extensions dealing with HEALPixes and MOCs aren’t part of the buster pgsphere package (the reasons for that are tedious and arcane and have to do with OpenSSL and the GPL).
Also, the pgsphere package coming with buster is called postgres-pgsphere, which is rather unfortunate as it’s missing the version indication. So: If you find it on your system, remove it right away. It will conflict with the one true pgsphere package (postgresql-11-pgsphere). That one you’ll get from us, and it has the HEALPix stuff built in. TL;DR: run apt install postgresql-q3c postgresql-11-pgsphere before following the postgres update recipe linked above.
There’s a bit more to upgrading the database this time. Because of fairly low-level cleanup in Postgres itself, you’re risking index corruption on string indices. Realistically, for almost anything you’ll have, it’s unlikely that you’re affected (it’s essentially about non-ASCII in strings), but then it’s better to be safe than sorry, and hence you should say
reindex database gavo
first thing after you’ve upgraded to Postgres 11 (which you should really do once the box is on buster). Only if you have very large tables might it be worth it to restrict the index regeneration to indices that could actually need it; see the postgres link above for how to do that.
One last thing on Postgres upgrades: I’ve not quite tried to work out why, but probably depending on your /etc/hosts DaCHS on buster is much more likely to connect to your database using IPv6 than it was before. Many older Postgres configurations won’t let you in then. If that happens to you, just edit /etc/postgresql/11/main/pg_hba.conf and add a line
host all all ::1/32 md5
(or something less permissive if you prefer).
The next buster-related shock was when TOPCAT’s TAP uploads stopped working while my regression tests didn’t find anything wrong. After a bit of cursing I eventually figured out that that’s not actually buster’s fault but twisted’s, which in a commit from May 2018 broke chunked uploads (essentially, that’s when you’re not saying up front how large your upload will be). I’ve filed a bug report on twisted, but we can’t really wait until any sort of fix will be ready and have a broken TOPCAT-DaCHS relationship until then, so for now we’re also shipping a fixed twisted package. If you’re running DaCHS without our repository enabled, you will have to patch the twisted code itself. The bug report tells what to do (no warranties, though, because I’m not entirely sure why they changed it in the first place; it’s a very small change, though).
[Update (2019-08-14) scratch the part with the fixed twisted packages. They’re too much trouble on stretch systems. You can keep using them on buster boxes if you want, though. The most recent stable release monkeypatches the problem out of presumably broken twisteds, and so will the next beta.]
I hope you’re not totally discouraged now, because upgrade you should (though perhaps not right before going on vacation) – distribution upgrades are unavoidable if you want to run services for decades, and that’s definitely a goal within the VO. See the Debian release note for Debian’s take on dist upgrades, which arguably is a bit more alarmist than it would need to be; a lean, server-only system typically is really simple to upgrade.
Given the relatively large number of Debian packages we override in buster, I’ll be particularly grateful if you complain early about breakage you observe (ideally use the dachs-support mailing list, but see Support for alternatives), and as usual you are encouraged to try the upgrade first on a development system if you have one. Which you should.
Almost a year has passed since release 1.2 of DaCHS – I’ve let the normal autumn release slip last year because there weren’t so many release-worthy new features in DaCHS at the traditional release time (i.e., after the College Park interop), and also because running betas when you do need a new feature is a fairly stable thing by now.
But here it finally is: Release 1.3 (tarball for the die-hard self-builders; everyone else just switches back to the release branch as necessary and then runs an update/upgrade cycle).
Columns can now be hidden from TAP/ADQL (and other interfaces) by setting hidden="True".
There is now a setting [web]maxSyncUploadSize=500000 (meaning: about 500 kByte) as the default upload limit on sync queries. In compensation, clients uploading too much now receive a more useful error message (except it doesn’t reach TOPCAT users most of the time because it does chunked uploads). To get back the behaviour of 1.2 (which is probably ok if you can live with the occasional resource hog), add maxSyncUploadSize=20000000 to your /etc/gavo.rc.
Adding support for https (certificate reading, certificate updating with letsencrypt, registering alternate endpoints, no WebSAMP with https). See HTTPS in DaCHS on this blog for details.
New source_table and preview columns in obscore. If you’re using the various obscore mixins, this should be automatic. If you have defined views manually, you will have to amend these (and have a broken obscore until a dachs upgrade ran without error).
No longer producing arraysize="1" in VOTables for scalars (except char, for compatibility with a legacy TOPCAT workaround; see VOTable 1.3 Erratum 3 for background information).
Support for draft TIMESYS in VOTable (with STC 2 annotation; ask about details if you’re interested. This is for draft VOTable 1.4 and probably only relevant to you if you’re publishing time series).
You can now add targetType and targetTitle properties to URL-valued columns to help Aladin figure out what to do with URLs (see Datalinks as product URLs in the reference documentation).
New gavo_transform, gavo_ipix, and gavo_urlescale ufuncs for ADQL, fixed gavo_urlescape to have acceptable performance.
//soda#sdm_genDesc now matches accref rather than pubDID by default. If you use Datalink with SSA and have a custom pubDID schema (or no index on accref), add a useAccref="False" to your descriptorGenerator statement.
There is now a --foreground option for dachs serve start. This is mainly to play nice with systemd, and indeed, the Debian package now comes with a systemd unit file. I’m not terribly familiar with systemd, so please keep an eye on DaCHS controlled by systemd and let me know if you see something that’s not as it should be.
Fixes for various bugs (most notable: ” in ADQL, WCS in SIAP cutout products) and many minor improvements. Check out the source tree (still via subversion) and read the changelog if you want to know the whole truth.
On systems running from the Debian package, the update should be automatic with the next system upgrade. However, you’ll be saving yourself quite a bit of headache if you check the health of your installation before the upgrade; see Upgrading DaCHS in the operator’s guide on how to upgrade professionally.
When I started working on the VO in 2007, my colleagues in Garching already had a piece of software that implemented major parts of the simple spectral access protocol (SSAP) that was being developed back then. It would publish spectra in the FITS format by just blindly dumping all header cards into a database table and then defining a view over that “raw” metadata table to make the whole thing match SSAP’s expectations for what the output table should look like. Sometimes you could just map through a header to an SSA column, sometimes you would just convert a unit, sometimes you would have to write fairly complex SQL expressions combining multiple fields.
Back then, I didn’t like it – why have two things (a table and a view) that can break when one (just a table in SSA’s format) would do, too? Also, SSAP has about 50 metadata fields, but lets you put constant values into VOTable PARAMs, which seemed a very reasonable way to attain more compact responses. So, when DaCHS grew SSAP support, I defined a mixin (essentially, a configurable interface definition) that let operators define SSA tables and their constant parameters in a fairly simple fashion and directly produced a table you could base your SSAP service on.
That made assumptions about which pieces of metadata are constant and which are not; for instance, the original mixin (“hcd” for “homogeneous collection”) assumed all spectra in a data collection came from the same instrument and had the same resolution and (what was I thinking?) SNR. Unsurprisingly, that broke fairly soon. So, I added a second mixin (“mixc”) for when different instruments or codes produced the data.
But even that was a headache, at the latest when I started making time series services using SSAP. And I had to fix a few bugs in the mixins themselves in the meantime, which mostly required re-imports of the data in that design. Such re-imports are non-trivial when you have millions of spectra, and they need to happen at software upgrade time or the services would break with the upgrade. Ouch.
It was about mid-2018 when it dawned on me that sometimes it’s better to have two things that can break even if one would do, after all. Specifically, if fixing the one thing is expensive, it’s an excellent idea to put a facade on top of it that’s cheap to change and can already be used to repair most deficiencies. Why re-build the house if a paint job does the trick?
So, I’ve now come up with a view-based solution in DaCHS, too: the //ssap#view mixin. This is a bit less radical than the Garching software of 2007, as it doesn’t dump raw headers but instead lets you do the primary transformations in the RD. But it no longer constrains what pieces of metadata should be constant and which may vary between spectra, and it uses the same names for the same pieces of metadata throughout (which also is a step forward over the old SSAP mixins).
With this, DaCHS operators should no longer use the hcd and mixc mixins for new services. The new technique is already reflected in the respective tutorial chapter, and the SSAP template (you’re using dachs start, aren’t you?) now uses it, too.
If you have a spectra publishing project in your pipeline, this would be the perfect time to upgrade to the DaCHS 1.2.4 beta, which has the new mixin. It would be great if we could iron out remaining wrinkles before the next release makes changes a load on my conscience.
As to migrating existing SSAP services: Well, it would be great if I could drop the old mixins in a couple of years, as they cause quite a bit of ugliness in DaCHS’s built-in //ssap RD. But the migration regrettably isn’t straightforward, so you may want to wait a bit before embarking on that journey (I’ll be happy to help, though).
(Warning: No astronomy-relevant content at all this time).
But on the other hand, there are some scenarios in which HTTPS makes sense, and I can remotely fantasise that some of those may even be relevant to the VO. And people have been asking for HTTPS in DaCHS a number of times, at times even because their administrations urged them to switch. So, here it is, hopefully. Turning it on is reasonably easy when you use Letsencrypt (which in particular entails having ports 80 and 443); the section on Letsencrypt in the operator’s guide tells what to do. In particular don’t forget the cron job, because without it, things would break after three months (when the initial certificate expires).
Things get difficult after that. For one, if your box is known under several names (our data center, for instance, can be reached as any of dc.g-vo.org, vo.uni-hd.de, and dc.zah.uni-heidelberg.de; this of course also includes things like www.example.org and example.org), you’ll now have to tell DaCHS about it in the new [web]alternateHostnames configuration item; for instance, we have
And then the Registry has to know you have https. There’s actually no convention for that in the VO yet. But since I’d really like to have at least fallback interfaces with plain HTTP, we’ll have to come up with something. For now, my plan is to have the alternative protocol (i.e., HTTPS for sites that have an HTTP-serverURL and vice versa) using the brand-new VOResource 1.1 mirrorURLs (in RegTAP 1.1, they are in the mirror_url column of rr.interface). To make DaCHS declare the alternate URLs, set [web]registerAlternative to True.
Another change I’ve introduced for HTTPS is that the default HTML template for the form renderer (i.e., the one people use who come with a browser) now suppresses the SAMP button if the request came in through HTTPS; that’s because WebSAMP doesn’t work with HTTPS and probably never will – at least I can’t see a way to make it happen without totally wrecking what security guarantees HTTPS gives.
All this doesn’t yet cater for the case when you use a reverse proxy to terminate HTTPS. If you are in that situation, please talk to me so we can figure out a sane way for you to explain to DaCHS what to tell the Registry.
Oh, and if you’re a client developer: Our data center is now reachable through HTTPS (at https://dc.g-vo.org), and we already have pushed the records with mirrorURLs declaring HTTPS support to the RegTAP service at dc.g-vo.org (the others will have to wait a bit longer, as we haven’t re-published our registry records yet; it’s all experimental, after all).
If you’re not yet on DaCHS 1.1, please have a quick look at the corresponding release article. While the upgrade itself should work fine in one go even from older versions, the release notes of course apply cumulatively, and you may still have to do the dist-upgrade to 1.1.
As usual, the generic upgrading instructions are available in the operator’s guide (in short: do a dachs val ALL; apt update; apt upgrade). Since I’ve still encountered DaCHS installations with wrong sources.lists last April: Note again that our repository names changed in August 2016 – we now have release and beta rather than Debian release names. So, make sure you have something like
deb http://vo.ari.uni-heidelberg.de/debian release main
in your /etc/apt/sources.list, not something containing “stable” or the like.
That said, here are the commented changes for 1.2:
New dachs start command to produce structured templates for certain service types. See Horror Vacui Begone on this blog for the full story.
Support for ADQL 2.1 (actually, its current proposed recommendation), including almost all of the optional parts (see Speak out on ADQL 2.1 on this blog). While not strictly necessary, it’s a good idea to run dachs imp //adql after the upgrade; this will give you some nice new UDFs, in particular gavo_histogram.
New coverage element (with updaters) to build and declare the space-time-spectral coverage of a resource. It would be great if you could add coverage elements to your resources where it makes sense and re-publish them. This blog post tells you how to do it (you’ll have to scroll down a bit).
There is now odbcGrammar to feed an import from another database. Essentially, you put an ODBC connection string into a file, point your sources element there, and you’ll get one rawdict per tuple in a foreign database table. This might be a nice way to publish moderate-size non-postgres tables via DaCHS.
You can now declare associated datalink services for tables using the _associatedDatalinkSvc meta item. In particular, if you had a datalink property on SSAP services, you should migrate at some point. One advantage: Users will get the datalinks even when querying the tables through TAP. See “Integrating Datalink Services” in the reference documentation for the full story.
We now force matplotlib to read its configuration from /var/gavo/etc/matplotlibrc; to get a default, just run dachs init again. This is mainly to avoid uncontrolled imports of matplotlibrcs when DaCHS is run under a uid that does other things now and then.
DaCHS now supports VOSI 1.1; in particular, DaCHS now understands the detail hints and has per-table endpoints, so clients like TOPCAT could avoid reading the full table metadata in one go. Realistically, at least TOPCAT doesn’t yet, so this is perhaps less cool than it may sound.
The indices generated by the ssa mixins are now a bit more sensible considering typical query modes. You probably want to run dachs imp -I on the RDs for your ssap data collections when convenient. If you have larger spectral collections, chances are many queries will be a lot faster.
ssapCore no longer wantonly adds preview columns. If you have previews with spectra, you probably want to add <property name="previews">auto</property> to your ssapCores. If you don’t, the preview column will not be added to SSA responses (right now, few clients evaluate it, but that will hopefully change in the future).
You can now add a statisticsTarget property to columns; you will want this on largish tables with non-uniformly distributed values to aid the query planner; something like <property key="statisticsTarget">10000</property> within the corresponding column element can go a long way to improve query planning (you need to run gavo imp -m on the RD after the change).
DaCHS’s log now by default does not contain IP addresses, user agents, and referrers any more, which should mostly keep you from processing personal data and thus from having to muck around with the EU GDPR. To get back the previous behaviour, set [web]logFormat in /etc/gavo.rc to combined.
I fixed some utypes for obscore 1.1. These utypes are useless, so there’s nothing you have to do. But then stilts taplint complains about them, and so you may want to run dachs imp -m //obscore.
As usual, there are many minor bug fixes and improvements (e.g., memmapping FITSes for cutout again, delimited table references in ADQL, new-style tutorial resource records, correct obscore standardId, much saner nD-arrays in VOTables).
One of the major usability issues our publishing suite DaCHS has for operators (i.e., people who want to publish data) is the “horror vacui”: How do I start a Resource Descriptor (RD – the file DaCHS interprets to build services)?
I used to recommend starting by having a look at the RDs of our existing services and picking whatever best matches your publication project. But finding a matching service and figuring out what is generic, what’s a special property of the concrete data collection, and what’s a hack that should not be reproduced isn’t straightforward at all, not to mention the fact that some of those RDs have been in maintenance mode for almost 10 years and hence may show deprecated practices.
I had planned to generalise Mikhail’s approach to several types of resources supported by DaCHS, ideally inferring the questions to ask from the built-in documentation of mixins and applys. But during the last year, whenever I felt it would be a good time to tackle that generalisation, I quickly gave up again. It was mostly rather trivial stuff such as how to tell apart repeatable metadata (waveband, say) and non-repeatable metadata (instrument, say). But it was bad enough that I quickly found something else to do each time I got started.
Eventually, I gave up on a menu interface altogether – making it flexible and generatable at the same time seemed a fairly complex problem. But that doesn’t mean I forgot about overcoming the horror vacui thing. So, when forms aren’t flexible enough for data entry, where do you turn? Right! A text editor.
Enter dachs start. That’s a new DaCHS subcommand that gets you started with your RD. For one, you can list the templates available:
$ dachs start list
siap -- Image collections via SIAP1 and TAP
ssap+datalink -- Spectra via SSAP and TAP, going through datalink
epntap -- Solar system data via EPN-TAP 2.0
scs -- Catalogs via SCS and TAP
More templates are planned; siap+datalink, for instance, would cover some frequent use cases. Feel free to mail in requests.
Once you find a suitable template, create your future resource directory, enter it and run dachs start again, this time passing the name of the template you want:
$ mkdir ex_data
$ cd ex_data
$ dachs start scs
$ head -16 q.rd | tail -9
<meta name="title">%title -- not more than a line%</meta>
%this should be a paragraph or two (take care to mention salient terms)%
<!-- Take keywords from
dachs start uses the directory name as the new schema name and then writes a file q.rd (which is the canonical name for the “main” RD in a resource). Within this file, you’ll see things to fill out between pairs of percent signs with short explanations. Where longer explanations are necessary, embedded comments should help.
To give you an idea of the intended use: As a vim user, I’ve put
au BufRead,BufNewFile *.rd imap <F8> <ESC>/%[^%]*%<CR>a
au BufRead,BufNewFile *.rd imap <F9> <ESC>cf%
into my ~/.vimrc. That way, while editing the template into an actual RD, hitting F8 takes me to the next thing to be edited; I can then read the instructions, and when I have made up my mind, I can either delete the template element or hit F9 and replace the explanation text with whatever belongs there.
The command is available starting with the 1.1.3 beta (available now by switching to the beta repo) and will be part of the 1.2 release, planned for early June after the Victoria interop.
If you have a publication project: just try it out and give feedback. Note that the templates haven’t actually been tested yet, and the comments were written by a DaCHS and VO nerd, so they might not always be great either. Thus, when you get stuck: complain early, complain often!
If you’ve always wanted to be part of a standardisation process within the IVOA (and who would not?), the time has rarely been as good as now. Because: We’re updating ADQL! Yes! The ADQL you are writing your queries in will receive a few more language elements, and we’re carefully trying to heal a few things that turned out to be warts. And while some of the changes are as dull and boring as you may expect standards work to be, on some of them you may wish to have a say.
Also, you can try things out – the GAVO data center TAP endpoint at http://dc.g-vo.org/tap already has most of the proposed features, and the new DaCHS beta 1.1.2 (out since last Friday) does, too. So, if you’re running DaCHS yourself, you can start playing after switching to the beta repository.
You’re now supposed to write the standard crossmatch as DISTANCE(ra1, dec1, ra2, dec2)<dist. This replaces the old dance with 1=CONTAINS(POINT(), CIRCLE()) that you’ve probably learned to hate. Finally: Crossmatching without having to resort to TOPCAT’s example menu…
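To show what the DISTANCE condition amounts to, here is a minimal Python sketch. It assumes that all coordinates and the resulting distance are in degrees, and it uses the haversine formula; the function names and the brute-force loop are made up for illustration. A real service would evaluate the condition inside the database, typically with the help of spatial indexes, rather than comparing all pairs.

```python
import math


def angular_distance(ra1, dec1, ra2, dec2):
    """Great-circle distance between two positions, everything in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula; numerically well-behaved for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def crossmatch(left, right, dist):
    """Return index pairs (i, j) whose angular distance is below dist degrees.

    Brute force over all pairs: fine for a demonstration, hopeless for
    real catalogs.
    """
    return [(i, j)
            for i, (ra1, dec1) in enumerate(left)
            for j, (ra2, dec2) in enumerate(right)
            if angular_distance(ra1, dec1, ra2, dec2) < dist]


left = [(10.0, 20.0), (50.0, -5.0)]
right = [(10.0005, 20.0), (200.0, 30.0)]
print(crossmatch(left, right, 0.001))
```

The condition in the list comprehension is the direct analogue of DISTANCE(ra1, dec1, ra2, dec2)<dist in a join.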
ADQL geometries used to require a first argument that would give the reference frame, as in POINT('ICRS', ra, dec). The hope was that services could then automagically make a statement like CONTAINS(point_in_icrs, circle_in_galactic) work as presumably intended. Few services ever did (DaCHS still tries reasonably hard), and when they did, there were all kinds of opaque oddities. One of the most common sources of confusion is the question what a service is supposed to do with POINT('GALACTIC', ra, dec), assuming it knows that ra and dec are in, say, B1950 FK4. Also, is there any expectation that services attempt to do anything beyond a simple rotation (FK4, for instance, rotates noticeably against the ICRS, so proper motions would need to get fixed, too)? In all, the frame as a first argument was ill thought-out, and it’s been deprecated. Simply don’t put in the string-typed first argument any more. POINT(long, lat) does it. True: This, more than ever, calls for an ADQL astrometry library so you can easily convert, at least, between Galactic and ICRS (probably a few more would be useful, too). More on this in some future post.
Services should have CAST now. Sometimes you want to turn a number into a string or a string into a timestamp. In such cases, you can write CAST('1991-02-01', TIMESTAMP) now. The details are not quite, excuse me, cast in stone yet, so if you have a use case for this kind of thing, speak up now. The current draft also calls for a TIMESTAMP(tx) function – but since that’s really not different from CAST(tx, TIMESTAMP), I’m trying to dissuade people from adding it.
Services should have an IN_UNIT function now. That’s a nifty thing in particular when you’re re-using queries on different services. Just write, say, IN_UNIT(pmra, 'deg/yr') and never worry again if it’s arcsec/yr, mas/yr, rad/cy, or whatever. The second argument, by the way, is written according to the Units in the Virtual Observatory standard. It’s an optional feature according to the current standard, so perhaps it’s too early to party, but I’ve found this extremely useful, and so I hope we’ll see widespread adoption.
Services should now have set operations. These are UNION, EXCEPT, and INTERSECT and are useful when you have two queries that result in the same table schema (because they won’t work otherwise). Say you have two complex ways to filter rows from the table source, but you want to process both sorts of results further on – you can then say something like
SELECT <whatever complex> FROM (
  (SELECT a,b,c FROM source
    WHERE <crazy stuff>
    GROUP BY a, b, c)
  UNION
  (SELECT a,b,c FROM source
    WHERE <other crazy stuff>
    GROUP BY a, b, c)) AS q
WHERE <more complex stuff over a, b, and c>
– and similarly, EXCEPT lets you “punch a hole” in a result table. Another interesting use case would be to query many tables on a service like VizieR in one go; that still works if you make sure the tables defined by the sub-queries have the same columns. Given that a lot of cross-table operations actually boil down to JOINs and WHERE clauses, the set operations are used less than one would expect. But if you need them, there’s no real alternative (short of downloading far too much and performing the operation locally, which of course defeats the purpose of TAP).
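For readers who have not used set operations before, the following sketch runs the three of them on a toy table in SQLite, which spells them the same way. The table contents and the two WHERE conditions are invented for illustration; the point is only how the results of two sub-queries with identical columns combine.

```python
import sqlite3

# Toy stand-in for the "source" table of the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 20, 200), (3, 30, 300), (4, 40, 400)])


def combine(operator):
    """Combine two filtered views of source with the given set operator."""
    query = ("SELECT a, b, c FROM source WHERE a <= 3 "
             + operator
             + " SELECT a, b, c FROM source WHERE a >= 2 "
             "ORDER BY a")
    return conn.execute(query).fetchall()


print(combine("UNION"))      # rows matching either condition, no duplicates
print(combine("INTERSECT"))  # rows matching both conditions
print(combine("EXCEPT"))     # rows matching the first but not the second
```

EXCEPT is the “punch a hole” case: the second sub-query defines what gets removed from the first.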
Common table expressions (“WITH”). DaCHS doesn’t do these yet, and it will only pick them up if someone else implements them first. In the way ADQL 2.1 has them (“nonrecursive”), CTEs are little more than syntactic sugar, and I’m not quite sure if the additional implementation complexity is worth it. If you’re curious, check CTEs in the postgres manual. If that makes you drool for WITH in ADQL, let me know. It’ll not be too hard to sway me to put them in.
Bitwise Operations. That’s when integers are treated as bit patterns. If this sounds like nerd stuff to you, well, it happens quite a bit in actual catalogs. See, for instance, Note 3 for the PPMXL. If you wanted to use the flags column described there to exclude PPMXL objects that replaced multiple USNO-B1.0 objects (bit 3), you will right now have to write something like MOD(flags,16)>7. That’s a bit of magic that everyone will have to think about for a while. With bitwise operations, you’ll just write BITWISE_AND(flags,8)=8, which will look familiar to everyone who has used the pattern before (in particular, it’s clear we’re talking about bit 3). There still is discussion whether bitwise operations are common enough to warrant special syntax – the draft currently says the above should be written as flags&8=8 – or whether the functions DaCHS has at the moment (they’re called BITWISE_AND, BITWISE_OR, BITWISE_XOR, and BITWISE_NOT) are good enough.
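If the MOD trick looks suspicious, a few lines of Python are enough to check that both ways of testing bit 3 agree. The function names are made up for illustration, and the check assumes non-negative flag values.

```python
def bit3_set_with_mod(flags):
    """The workaround: MOD(flags, 16) > 7.

    MOD 16 keeps the lowest four bits (values 0..15); of these, exactly
    8..15 have bit 3 set.
    """
    return flags % 16 > 7


def bit3_set_with_and(flags):
    """The bitwise version: BITWISE_AND(flags, 8) = 8."""
    return (flags & 8) == 8


# Compare both tests for every non-negative flag value below 1024.
both_agree = all(
    bit3_set_with_mod(flags) == bit3_set_with_and(flags)
    for flags in range(1024))
print(both_agree)   # True
```

The mask makes the intent visible (8 is two to the power of 3), while the MOD version needs the explanation in the docstring.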
Offset. If you’ve ever done anything with ADQL, you’ll know that SELECT TOP 10 * FROM hipparcos.main ORDER BY parallax DESC will give you the 10 objects with the largest parallaxes. But what if you want the next 10 closest stars? Well, OFFSET to the rescue:
SELECT TOP 10 *
FROM hipparcos.main
ORDER BY parallax DESC
OFFSET 10
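The paging effect of OFFSET can be tried locally with Python's sqlite3 module. Note the assumptions in this sketch: SQLite writes LIMIT n OFFSET m where ADQL writes TOP n ... OFFSET m, and the table main with its 30 fake rows is made up for the example.

```python
import sqlite3

# Toy stand-in for a catalogue: 30 rows with parallax equal to the id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main (hip INTEGER, parallax REAL)")
conn.executemany(
    "INSERT INTO main VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 31)],
)


def page(offset, size=10):
    """Return `size` ids ordered by descending parallax, skipping `offset` rows."""
    rows = conn.execute(
        "SELECT hip FROM main ORDER BY parallax DESC LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()
    return [hip for (hip,) in rows]


first_ten = page(0)   # the ten largest parallaxes
next_ten = page(10)   # the ten after those
print(first_ten, next_ten)
```

Without an ORDER BY, the rows skipped by OFFSET are not well defined, so paging like this only makes sense on sorted results.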
There is another, more sinister, application for OFFSET, which happens to be the actual reason I’ve put it into DaCHS’ ADQL ages ago: Written as OFFSET 0, several databases use it to denote a barrier for the query planner. This is explained to some degree in the classic DaCHS TAP example Crossmatch for a Guide Star – which still mentions the first hack I had built into DaCHS to let query authors rein in overzealous query planners.
LOWER and ILIKE. ADQL has been extremely weak on the side of text processing, so weak indeed that it wasn’t nearly enough to cover the use cases for the registry when it moved to RegTAP. ADQL 2.1 adds two basic features – LOWER, a function that lets people query in a case-insensitive fashion, and ILIKE, an operator that is like LIKE, but again ignores case. While both features are obviously great as soon as people dump any kind of text (think object names) into their databases, I’m not terribly happy with ILIKE, as it does the same as RegTAP’s ivoa_nocasematch user defined function, and it’s always bad when two standards foresee two different mechanisms for the same thing.
Geometry-typed arguments. CIRCLE and POLYGON now accept POINTs in alternative constructor functions. That is, you can now say CIRCLE(POINT(ra, dec), radius) in addition to the traditional CIRCLE(ra, dec, radius). In itself, that’s probably not terribly exciting, but when you have actual POINTs in your database, it’s much more compact to write, say,
(which would return rows for those spectra for which the declared aperture does not contain the declared target). Before, you’d have had to write some fairly ugly expression involving COORD1 and whatnot in order to achieve the same effect.
Boolean expressions. That’s another one that’s still a bit up in the air. First, the rough goal is to allow boolean values in ADQL-accessible tables, which so far have been a hack at best. In the future, you should be able to say WHERE is_broken=True. However, people coming from other languages will find that odd, and indeed, in python I’d cringe at if is_broken==True:. What I’d expect is if is_broken:. Do we want this in ADQL? Currently, it’s in the grammar (more or less like this), but this kind of thing makes it still harder to produce useful syntax error messages. Is it worth it, either way? I’m not sure.
That about concludes my quick review of the new features of ADQL 2.1. If you’d like to know more, the current draft is on the IVOA document repository, and if you can deal with version control (you should!), you can follow the bleeding edge in the ADQL document in Volute. Discussion happens on the DAL mailing list.
Update (2018-04-13): Well, as to the CTEs, I couldn’t resist after all, and they’re in with DaCHS 1.1.3. And I have to say I love them: they weren’t hard to put in, and once they’re there they make so many queries a good deal more readable than before. I’ve even put in a server-defined example for CTEs on the Heidelberg TAP service showcasing a particularly compelling use case.
A histogram of times for which the Palomar-Leiden service has images: That’s temporal service coverage right there.
If you are an astronomer and you’ve ever tried looking for data in the Virtual Observatory Registry, chances are you have wondered “Why can’t I enter my position here?” Or perhaps “So, I’m looking for images in [NIII] – where would I go?”
Both of these are examples for the use of Space-Time Coordinates (STC) in data discovery – yes, spectral coordinates count as STC, too, and I could make an argument for it. But this post is about something else: None of this has worked in the Registry up to now.
It’s time to mend this blatant omission. To take the next steps, after a bit of discussion on some of the IVOA’s mailing lists, I have posted an IVOA note proposing exactly those last Thursday. It is, perhaps with a bit of over-confidence, called A Roadmap for Space-Time Discovery in the VO Registry. And I’d much appreciate feedback, in particular if you are a VO user and have ideas on what you’d like to do with such a facility.
In this post, I’d like to give a very quick run-down on what is in it for (1) VO users, (2) service operators in general, and (3) service operators who happen to run DaCHS.
First, users. We already are pretty good on spatial coverage (for about 13000 of almost 20000 resources), so it might be worth experimenting with that. For now, the corresponding table is only available on the RegTAP mirror at http://dc.g-vo.org/tap. There, you can try queries like
select ivoid from
natural join rr.stc_spatial
and ucd like 'phot.flux;em.radio%'
to find – in this case – services that have radio fluxes in the area of the Hubble Deep Field. If these lines scare you or you don’t know what to do with the stupid ivoids, check the previous post on this blog – it explains a bit more about RegTAP and why you might care.
Similarly cool things will, hopefully, some day be possible in spectrum and time. For instance, if you were interested in SII fluxes in the Crab Nebula in the early sixties, you could, some day, write
SELECT ivoid FROM
NATURAL JOIN rr.stc_spectral
NATURAL JOIN rr.stc_spatial
As you can see, the spectral coordinate will, following (admittedly broken) VO convention, be given in meters of vacuum wavelength, and time in MJD. In particular the thing with the wavelength isn’t quite settled yet – personally, I’d much rather have energy there. For one, it’s independent of the embedding medium, but much more excitingly, it even remains somewhat sensible when you go to non-electromagnetic messengers.
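If you don’t have these units at your fingertips, here is a small conversion sketch. It makes a few simplifying assumptions: dates are naive Python datetimes taken as UTC, leap seconds and time scales are ignored, and the helper names are made up. The value of hc in eV nm is the standard physical constant.

```python
from datetime import datetime

MJD_EPOCH = datetime(1858, 11, 17)  # MJD 0.0 by definition
HC_EV_NM = 1239.841984              # Planck constant times speed of light, in eV nm


def to_mjd(dt):
    """Convert a naive datetime (assumed UTC) to a Modified Julian Date."""
    return (dt - MJD_EPOCH).total_seconds() / 86400.0


def nm_to_m(wavelength_nm):
    """Vacuum wavelength in nanometers to meters."""
    return wavelength_nm * 1e-9


def nm_to_ev(wavelength_nm):
    """Vacuum wavelength in nanometers to photon energy in eV."""
    return HC_EV_NM / wavelength_nm


mjd_1960 = to_mjd(datetime(1960, 1, 1))  # start of the "early sixties"
halpha_m = nm_to_m(656.3)                # H-alpha in meters
halpha_ev = nm_to_ev(656.3)              # H-alpha as an energy
print(mjd_1960, halpha_m, halpha_ev)
```

The energy conversion also shows why energy is attractive as a spectral coordinate: it is a property of the photon and does not depend on the medium the wavelength was measured in.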
A pattern I’m trying to establish is the use of the user-defined function ivo_interval_overlaps, also defined in the Note. This is intended to allow robust query patterns in the presence of two intrinsically interval-valued things: The service’s coverage and the part of the spectrum you’re interested in, say. With the proposed pattern, either of these can degenerate to a single point and things still work. Things only break when both the service and you figure that “Aw, Hα is just 656.3 nm” and one of you omits a digit or adds one.
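The logic behind such an overlap test is simple enough to sketch in Python. This is an illustration only: the argument order (lower and upper bound of the first interval, then of the second) and the 1/0 return convention are assumptions here, and the Note has the authoritative definition of ivo_interval_overlaps.

```python
def interval_overlaps(l1, h1, l2, h2):
    """Return 1 if the closed intervals [l1, h1] and [l2, h2] share a point, else 0."""
    return int(h1 >= l2 and h2 >= l1)


# A service covering 650 to 660 nm (in meters), queried for a single wavelength.
service = (6.50e-7, 6.60e-7)
point_query = (6.563e-7, 6.563e-7)
hit = interval_overlaps(*service, *point_query)

# Both sides degenerate to points that differ in the last digit: no match.
miss = interval_overlaps(6.563e-7, 6.563e-7, 6.5628e-7, 6.5628e-7)
print(hit, miss)
```

Using closed intervals is what lets either side shrink to a single point without the comparison falling apart.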
But that’s academic at this point, because really few resources define their coverage in time and spectrum. Try it yourself:
SELECT COUNT(*) FROM (
SELECT DISTINCT ivoid FROM rr.stc_temporal) AS q
(the subquery with the DISTINCT is necessary because a single resource can have multiple rows for time and spectrum when there are multiple distinct intervals – think observation campaigns). If this gives you a count of more than a few dozen when you read this, I strongly suspect it’s no longer 2018.
To improve this situation, the service operators need to provide the information on the coverage in their resource records. Indeed, the registry schemas already have the notion of a coverage, and the Note, in its core, simply proposes to add three elements to the coverage element of VODataService 1.1. Two of these new elements – the coverage in time and spectrum – are simple floating-point intervals and can be repeated in order to allow non-contiguous coverage. The third element, the spatial coverage, uses a nifty data structure called a MOC, which expands to “HEALPix Multi-Order Coverage map” and is the main reason why I claim we can now pull off STC in the Registry: MOCs let databases and other programs easily and quickly manipulate areas on the sphere. Without MOCs, that’s a pain.
So, if you have registry records somewhere, please add the elements as soon as you can – if you don’t know how to make a MOC: CDS’ Aladin is there to help. In the end, your coverage elements should look somewhat like this:
The waveband elements are remnants from VODataService 1.1. They are still in use (prominently, for one, in SPLAT), and it’s certainly still a good idea to keep giving them for the foreseeable future. You can also see how you would represent multiple observing campaigns and different spectral ranges.
Finally, if you’re running DaCHS and you’re using it to generate registry records (and there’s almost no excuse for not doing so), you can simply write a coverage element into your RD starting with DaCHS 1.2 (or, if you run betas, 1.1.1, which is already available). You’ll find lots of examples at the usual place. A relatively interesting example is the resource descriptor of plts. It has this:
This particular service archives plate scans from the Palomar-Leiden Trojan surveys; these were looking for Trojan asteroids (of Jupiter) using the Palomar 122 cm Schmidt and were conducted in several shortish campaigns between 1960 and 1977 (incidentally, if you’re looking for things near the Ecliptic, this stuff might still hold valuable insights for you). Because the fill factor for the whole time period is rather small, I manually extracted the time coverage; for that, I ran select dateobs from plts.data via TAP and made the histogram plot above. Zooming in a bit, I read off the limits in TOPCAT’s coordinate display.
The other coverages, however, were put in automatically by DaCHS. That’s what the updater element does: for each axis, you can say where DaCHS should look, and it will then fill in the appropriate data from what it guesses gives the relevant coordinates – that’s straightforward for standard tables like the ones behind SSAP and SIAP services (or obscore tables, for that matter), perhaps a bit more involved otherwise. To say “just do it for all axes”, give the updater a single sourceTable attribute.
Finally, in this case I’m overriding mocOrder, the order down to which DaCHS tries to resolve spatial features. I’m doing this here because in determining the coverage of image services DaCHS right now only considers the centers of the images, and that’s severely underestimating the coverage here, where the data products are the beautiful large Schmidt plates. Hence, I’m lowering the resolution from the default 6 (about one degree linearly) to still give some approximation to the actual data coverage. We’ll fix the underlying deficit as soon as pgsphere, the postgres extension which is actually dealing with all the MOCs, has support for turning circles and polygons into MOCs.
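To get a feeling for what a given MOC order means on the sky, you can work out the typical HEALPix cell size yourself. The sketch below uses the standard HEALPix pixel count of 12 times 4 to the power of the order and takes the square root of the cell area as the “linear” size; the function name is made up.

```python
import math

# Area of the full sphere in square degrees (about 41253).
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2


def healpix_cell_size_deg(order):
    """Approximate linear size in degrees of a HEALPix cell at the given order."""
    npix = 12 * 4 ** order  # number of equal-area cells at this order
    return math.sqrt(FULL_SKY_DEG2 / npix)


# Cell sizes for a few orders up to the default of 6.
sizes = {order: healpix_cell_size_deg(order) for order in range(3, 7)}
for order, size in sizes.items():
    print(order, round(size, 2))
```

Each step down in order doubles the linear cell size, so lowering mocOrder by a few steps quickly gets you from about a degree to cells comparable in size to a large Schmidt plate.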
When you have defined an updater, just run dachs limits q.rd, and DaCHS will carefully (preserving your indentation) re-write the RD to contain what DaCHS has worked out from your table (but careful: it will overwrite what was previously there; so, make sure you ask DaCHS to deal only with axes you’re not dealing with manually).
If you feel like writing code discovering holes in the intervals, ideally already in the database: that would be great, because the tighter the intervals defined, the fewer false positives people will have in data discovery.
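To sketch what such hole discovery could look like, here is a naive, pure-Python version that splits a list of observation times wherever the gap to the next value exceeds a threshold. It is not the in-database solution asked for above; the function name, the threshold, and the sample MJDs are all made up for illustration.

```python
def split_at_gaps(values, max_gap):
    """Return (low, high) intervals covering `values`, split where gaps exceed `max_gap`."""
    ordered = sorted(values)
    if not ordered:
        return []
    intervals = []
    low = prev = ordered[0]
    for value in ordered[1:]:
        if value - prev > max_gap:
            # Gap too large: close the current interval and start a new one.
            intervals.append((low, prev))
            low = value
        prev = value
    intervals.append((low, prev))
    return intervals


# Invented observation dates (MJD) forming three short campaigns.
mjds = [37000.1, 37000.4, 37002.0, 38100.5, 38101.0, 43000.2]
campaigns = split_at_gaps(mjds, max_gap=30.0)
print(campaigns)
```

The choice of max_gap trades the number of intervals against the filling factor: a small threshold gives many tight intervals, a large one gives few loose ones.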
The take-away for DaCHS operators is:
Add STC coverage to your resources as soon as you’ve updated to DaCHS 1.2
If you don’t have to have the tightest coverage declaration conceivable, all you have to do to have that is add
to your RD (where my_table is the id of your service’s “main” table) and then run dachs limits q.rd
For special effects and further information, see Coverage Metadata in the DaCHS reference documentation
If you have a nice postgres function that splits a simple coverage interval up so the filling factor of a set of new intervals increases (or know a nice, database-compatible algorithm to do so) – please let me know.
As usual, the general upgrading instructions are available in the operator’s guide (in short: do a dachs val ALL before the Debian upgrade). This time, I’d recommend using the opportunity to upgrade your underlying server to stretch if you haven’t done so already. If you do that, please have a look at hints on postgres upgrades. Stretch comes with postgres 9.6 (jessie: 9.4). Postgres upgrades are generally safe, but please take a dump before migrating anyway.
So, with this out of the way, here’s a short list of the major changes from DaCHS 1.0 to DaCHS 1.1:
DaCHS now officially requires python 2.7. If this really is a problem for you, please shout – it wouldn’t be hard to maintain 2.6 compatibility, but by now we feel there’s no reason to bother any more.
Now supporting TAP 1.1; in particular, TOP n doesn’t trump MAXREC any more, and it doesn’t affect OVERFLOW indication, which may break things that used TOP to override DaCHS’ default TAP match limit of 2000. Also, TAP_SCHEMA is updated (this happens as a side effect of dachs upgrade).
Now serialising spoint, scircle, and friends to DALI 1.1 xtypes (timestamp, point, polygon, circle). Fields explicitly marked with adql:POINT or adql:REGION will still be serialised to STC-S. Do this only if you have no choice (DaCHS has this for obscore and epntap s_region right now).
The output column selection is sanitised. This may make for slight changes in service responses, in particular in VOTable formats. See Output Tables in the reference documentation for details if you think this might hit you.
DaCHS no longer comes with an outdated version of pyparsing and instead uses what’s installed on the system. The Debian package further re-uses additional system resources if available (rjsmin, jquery).
DaCHS now tries a bit harder to come up with sensible names for SODA result files.
map/@source is no longer limited to identifier-like strings; any key that’s in your source is fair game.
Relative imports from custom code (“import foo” in a custom core, for instance, getting res/foo.py) no longer work. See Importing Modules in the reference documentation for details.
This release fixes a severe bug in the creation of obscore metadata from SSAP tables. If you use //obscore#publishSSAPHCD or //obscore#publishSSAPMIXC mixins, update the obscore definitions by running dachs imp -m <rdid>, followed by dachs imp //obscore (the latter is only necessary once at the end).
You can now define a footer.html template that’s added at the foot of the main page content – with a bit of CSS magic, this lets you overwrite almost anything on DaCHS HTML pages.
As always, please complain early if something breaks for you; our regression tests can only cover so much. In particular, our support list is there for you.
Update (2017-12-06): In particular on jessie, you may see that all DaCHS packages are being held back. To resolve this situation, manually say apt-get install python-gavoutils python-gavostc.