DaCHS 2.3 on the way to Debian main

DaCHS, Debian, and 2.3
DaCHS 2.3 will be the first DaCHS officially in Debian.

DaCHS releases usually come around the Interops in (roughly) May and November. Not this one, though, for one pleasant, one unpleasant, and several other reasons.

The unpleasant reason first: The 2.2 release has a fairly severe memory leak in it (resulting, in roundabout ways, from python 3 preserving tracebacks of nested exceptions), which of course really became virulent on my server right over the holidays. If you run a site with just a few gigs of RAM that might be hit by second-rate async clients, this will bite you and you ought to upgrade now (well, you ought to upgrade anyway).

The pleasant reason is that DaCHS has made it into Debian main and thus, unless something disastrous happens, it will be part of the Debian version 11 (“bullseye”). This means that people who do not need to be on the bleeding edge, will not need to monkey around with our repository (and its signing key) any more starting some time in 2021 (or just about now, if they’re running testing). I can’t tell you how gratifying that feels to me. And well, I wanted relatively recent code corresponding to a something on our release branch in bullseye.

One of the other reasons is that stilts’ author Mark Taylor is trying to stomp out TAP services failing his taplint’s validation, and many DaCHS 2.2 services (those that don’t define TAP examples, which of course is a shame anyway) fail with only the (really minor) error E-EXDH-1 (see below).

DaCHS 2.3 has some other noteworthy changes; as usual in minor version steps, my expectation is that none of this will break existing services. Still, you may want to glance over the following list, as there are some behavioural changes nevertheless. In approximate order of the wizardry involved:

  • I’ve long had a bad consciousness because DaCHS has stored cleartext passwords so far. That’s probably not a problem for DaCHS itself (as it does not protect great riches), but people tend to re-use passwords, and I’d have hated to leak passwords that might work elsewhere. Well, no longer: the dc.users table now contains hashed passwords, and the upgrade will hash them. This, in particular, means that you cannot recover them once you have updated (which, of course, is as it should be).
  • The javascript delivered with DaCHS was no longer quite up to date with Debian’s jquery. I have updated it in several ways, and I have restored the functionality of the WebSAMP button in the default response. If you have custom HTML templates containing javascript, you may need to update them to newer jquery, too, specifically,
    • change .unload( to .on("unload", (this happens in the SAMP code in defaultresponse.html, for instance).
    • also in the SAMP code in overridden defaultresponses, change the icon URL to completeURL("/logo_tiny.png") (or whatever) to avoid trouble with https installations.
    • if you compare jquery element names: these are now returned in lower case.

    And yes, WebSAMP now mostly works with HTTPS (which is unrelated to this update, except that DaCHS until 2.2 suppresses the WebSAMP button when it thinks it is delivering through HTTPS).

  • DaCHS now honours upgrade-insecure-requests headers that common web browsers issue and will then redirect them to https when appropriate. So, please don’t forcibly do these redirects any more from reverse proxies – they break, among other things, TAP, and they’re generally just a bad idea.
  • DaCHS now instructs the database to return all bits of floating point numbers. This may break your regression tests, but it’s the right thing to do (blog post on this).
  • Another thing that may break regression tests: TAP results now have column names in the case given in the RD (where previously they were lowercased unless quoted). Let me cite rule 1 of SQL table design: Don’t use mixed-case column names.
  • Wildcards in the directory parts of sources patterns are now expanded, which means that you can write things like <sources pattern="data/202?/*.fits"/>, which previously wouldn’t have done what you might reasonably expect; however, this might in rare cases match additional sources when you re-import data.
  • The examples endpoint now returns a 404 if no examples are defined on a service; this fixes the stilts taplint E-EXA-EXDH-1 error I mentioned above.
  • DaCHS will now refuse to use x-unregistred as an authority when publishing resources or creating publisher DIDs. This is to protect to people who do a lot of imports before settling on their authority; sometimes DaCHS’ fallback null authority got into their databases, which then caused quite a bit of cleanup effort.
  • Because of licensing problems, the Debian package no longer contains the CC logos for the time being. If you want them back, drop appropriate files cc0.png, ccby.png, and ccybysa.png into /var/gavo/web/nv_static/img
  • You can now list modules you want in a procedure application in its setup/@imports attribute. I’ve done this after I had to add code to a proc’s setup just to run an import once too often.
  • simbadinterface’s Sesame now uses the dc.metastore table to cache results rather than files as before. Previous saveNew, id, and debug parameters are no longer supported (the base.caches.getSesame interface is unchanged, so it’s unlikely you’d notice this).
  • table.query() or querier.query() are now seriously deprecated (you may have used them in code embedded in RDs). See Database Queries in the reference documentation for what the recommended query patterns are (and have been for a while). Just one word of warning: table.query would macro-expand its argument, which the connection method obviously cannot. If you depend on that, call table.expand(query) manually first.

With this: Merry upgrading and a happy new year!

The Case of the Disappearing Bits

[number line with location markers]
Every green line in this image stands for a value exactly representable in a floating point value of finite size. As you see, it’s a white area out there [source]

While I was preparing the publication of Coryn Bailer-Jones’ distance estimations based on Gaia eDR3 (to be released about tomorrow), Coryn noticed I was swallowing digits from his numbers. My usual reaction of “aw, these are meaningless anyway because your errors are at least an order of magnitude higher” didn’t work this time, because Gaia is such an incredible machine that some of the values really have six significant decimal digits. For an astronomical distance! If I had a time machine, I’d go back to F.W. Bessel right away to make him pale in envy.

I’m storing these distances as PostgreSQL REALs, so these six digits are perilously close the seven decimal digits that the 23 bits of mantissa of single precision IEEE 754 floats are usually translated to. Suddenly, being cavalier with the last few bits of the mantissa isn’t just a venial sin. It will lose science.

So, I went hunting for the bits, going from (in this case C’s sscanf) through my serialisation into Postgres binary copy material (DaCHS operators: this is using a booster) to pulling the material out of the database again. And there I hit it: the bits disappeared between copying them in and retrieving them from the database.

Wow. Turns out: It’s a feature. And one I should have been aware of in that Postgres’ docs have a prominent warning box where it explains its floating point types: Without setting extra-float-digits it will cut off bits. And it’s done this ever since the dawn of DaCHS (in postgres terms, version 8.2 or so).

Sure enough (edited for brevity):

gavo=$ select r_med_geo from gedr3dist.main 
gavo-$ where source_id=563018673253120;

gavo=$ set extra_float_digits=3;
gavo=$ select r_med_geo from gedr3dist.main 
gavo-$ where source_id=563018673253120;

Starting with its database schema 26 (which is the second part of the output of dachs --version), DaCHS will configure its database roles always have extra_float_digits 3; operators beware: this may break your regression tests after the next upgrade.

If you want to configure your non-DaCHS role, too, all it takes is

  alter role (you) set extra_float_digits=3,

You could also make the entire database or even the entire cluster behave like that; but then losing these bits isn’t always a bad idea: It really makes the floats prettier while most of the time not losing significant data. It’s just when you want to preserve the floats as you get them – and with science data, that’s mostly a good idea – that we just can’t really afford that prettyness.

DaCHS 2.2 is out

[Image: DaCHS "entails" 2.2]
DaCHS 2.2 adds support for what simple semantics we currently do in the VO. Which is a welcome excuse to abuse one of the funny symbols semanticians love so much.

Today, I’ve released DaCHS 2.2, the second stable version of DaCHS running on Python 3. Indeed, we have ironed out a few sore spots that have put that “stable” into question, especially if you didn’t run things on Debian Buster. Mind you, playing it safe and just going for Debian is still recommended: Compared to the Python 2 world, where things largely didn’t break for a decade, the Python 3 universe is still shaking out, and so the versions of dependencies do matter. It’s actually fairly gruesome how badly pyparsing 2.4 will break DaCHS. But that’s for another day.

Despite this piece of fearmongering, it’d be great if you could upgrade your installations if you are running DaCHS, and it’s pretty safe if you’re on Debian buster anyway (and if you’re running Debian in the first place, you should be running buster by now).

Here are the more notable changes in this release:

  • DaCHS can now (relatively easily) write time series in the form of what Ada Nebot’s Time Series Annotation note proposes. See the tutorial chapter on building time series for how to do that in practice. Seriously: If you have time series to publish, by all means try this out. The specification can still be fixed, and so this is the perfect time to find problems with the plan.
  • The 2.2 release contains support for the MOC ADQL functions mentioned in the last post on this blog. Of course, to make them work, you will still have to acquaint your database with the new functionality.
  • DaCHS has learned to use IVOA vocabularies as per the current draft for Vocabularies in the VO 2. The most visible effect for you probably is that DaCHS now warns if your subject keywords are not taken from the Unified Astronomy Thesaurus (UAT) – which they almost certainly are not, because the actual format of these keywords is a bit funky. On the other hand, if you employ the “plain” root page template (see the root template in our templating guide if you are not sure what I am talking about here), you will get nice, human-friendly labels for the computer-friendly terms you ought to put into subjects. In case you don’t bother: Given I’m currently serving as chair of the semantics working group of the IVOA, the whole topic will certainly come up again soon, and at that point I will probably also talk about another semantics-related newcomer to DaCHS, the gavo_vocmatch ADQL UDF.
  • There is a new command dachs datapack for interacting with frictionless data packages. The idea is that you can say dachs datapack create myres/q myres.pack and obtain an archive of all that is necessary to re-create myres on another DaCHS installation, where you would say dachs datapack load myres.pack. Frankly, this isn’t much different from just tarring up the resource directory at this point, except that any cruft that may have accumulated in the directory is skipped and there is a bit of structured metadata. But then interoperability always starts slowly. Note, by the way, that this certainly does not teach DaCHS to do anything sensible with third-party data packages; while I’ve not thought hard about this, as it seems a remote use case, I am pretty sure that even the “tabular data packages” that refine the rough general metadata quite a bit simply have nowhere near enough metadata to create a useful VO resource or TAP table.
  • As part of my never-ending struggle against bitrot (in case you’ve always wondered what “curation” means: that, essentially), I’m running dachs val -vc ALL in my own data center once every month. This used to traverse the file system to locate all RDs defined on a box and then make sure they are still ok and their definitions match the database schema. That behaviour has now changed a bit: It will only check published RDs now. I cannot lie: the main reason for the change is because on my production machine the file system traversal has taken longer and longer as data accumulated. But then beyond that there is much less to worry when unpublished gets a little bit mouldy. To get back the old behaviour of validating all RDs that are reachable by the server, use ALL_RECURSE instead of ALL.
  • DaCHS has traditionally assumed that you are running multiple services on one site, which is why its root page is rendered over a service that exposes metadata on local resources. If that doesn’t quite work for how you use DaCHS – perhaps because you want to have your own custom renderers and data functions on your root page, perhaps because you only have one browser-based service and that should be the root page right away –, you can now override what is shown when people access the root URI of your DaCHS installation by setting the [web]root config item to the path of the resource you want as root (e.g., myres/q/s/fixed when the root page should be made by the fixed renderer on the service s within the RD myres/q).
  • Scripting in DaCHS is a powerful way to execute python or SQL code when certain things happen. That seems an odd thing to want until you need it; then you need it badly. Since DaCHS 2.2, scripts executed before or after the creation of a table, before its deletion, or after its meta data has been updated, can sit on tables (where they have always belonged). Before, they could only be on makes (where they can still sit, but of course they are then only executed if the table is operated through that particular make) and RDs (from where they could be copied). That latter location is now forbidden in order to free up RD scripts for later sanitation. Use STREAM and FEED instead if you really used something like that (and I’d bet you don’t).
  • Minor behavioural changes: (a) Due to a bug, you could write things like <schema foo="bar">my_schema<schema>, i.e., have attributes on attributes written in element form. That is now flagged as an error. Since that attribute was fed to the embedding element, you might need to add it there. (b) If you have custom flot plots in one of your templates (and you don’t if you don’t know what I’m talking about), you now have to set style to Points or Lines where you had usingIndex 0 or 1 before. (c) The sidebar template no longer has links to a privacy policy (that few bothered to fill out). See extra sidebar items in the tutorial on how to get them back or add something else.

The most important change comes last: The default logo DaCHS shows unless you override it is no longer the GAVO logo. That’s, really, been inappropriate from the start. It’s now the DaCHS logo, the thing that’s in this posts’s article image. Which isn’t quite as tasteful as the GAVO one, true. But I trust we’ll all get used to it.

Crazy Shapes in TAP

OpenNGC shapes
A complex shape from OpenNGC: MOCs need not be convex, or simply connected, or anything.

So far when you did spherical geometry in ADQL, you had points, circles, and polygons as data types, and you could test for intersection and containment as operations. This feature set is a bit unsatisfying because there are no (algebraic) groups in this picture: When you join or intersect two circles, the result only is a circle if one contains the other. With non-intersecting polygons, you will again not have a (simply connected) spherical polygon in the end.

Enter MOCs (which I’ve mentioned a few times before on this blog): these are essentially arbitrary shapes on the sky, in practice represented through lists of pixels, cleverly done so they can be sufficiently precise and rather compact at the same time. While MOCs are powerful and surprisingly simple in practice, ADQL doesn’t know about them so far, which limits quite a bit what you can do with them. Well, DaCHS would serve them since about 1.3 if you managed to push them into the database, but there were no operations you could do on them.

Thanks to work done by credativ (who were really nice to work with), funded with some money we had left from our previous e-inf-astro project (BMBF FKZ 05A17VH2) on the pgsphere database extension, this has now changed. At least on the GAVO data center, MOCs are now essentially first-class citizens that you can create, join, and intersect within ADQL, and you can retrieve the results. All operators of DaCHS services are just a few updates away from being able to offer the same.

So, what can you do? To follow what’s below, get a sufficiently new TOPCAT (4.7 will do) and open its TAP client on http://dc.g-vo.org/tap (a.k.a. GAVO DC TAP).

Basic MOC Operations in TAP

First, let’s make sure you can plot MOCs; run

SELECT name, deepest_shape
FROM openngc.shapes

Then do Graphics/Sky Plot, and in the window that pops up then, Layers/Add Area Control. Then select your new table in the Position tab, and finally choose deepest_shape as area (yeah, this could become a bit more automatic and probably will over time). You will then see the footprints of a few NGC objects (OpenNGC’s author Mattia Verga hasn’t done all yet; he certainly welcomes help on OpenNGC’s version control repo), and you can move around in the plot, yielding perhaps something like Fig. 1.

Now let’s color these shapes by object class. If you look, openngc.data has an obj_type column – let’s group on it:

  AREA(shape) AS ar
  SELECT obj_type, SUM(deepest_shape) AS shape
  FROM openngc.shapes
  NATURAL JOIN openngc.data
  GROUP BY obj_type) AS q

(the extra subquery is a workaround necessary because the area function wants a geometry or a column reference, and ADQL doesn’t allow aggregate functions – like sum – as either of these).

Coloured shapes
Fig. 2: OpenNGC shapes grouped and coloured by type.

In the result you will see that so far, contours for about 40 square degrees of star clusters with nebulae have been put in, but only 0.003 square degrees of stellar associations. And you can now plot by the areas covered by the various sorts of objects; in Fig. 2, I’ve used Subsets/Classify by Column in TOPCAT’s Row Subsets to have colours indicate the different object types – a great workaround when one deals with categorial variables in TOPCAT.

MOCs and JOINs

Another table that already has MOCs in them is rr.stc_spatial, which has the coverage of VO resources (and is the deeper reason I’ve been pushing improved MOC support in pgsphere – background); this isn’t available for all resources yet , but at least there are about 16000 in already. For instance, here’s how to get the coverage of resources talking about planetary nebulae:

SELECT ivoid, res_title, coverage
FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  NATURAL JOIN rr.resource
WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20

(the rr.subject_uat table is a local extension to RegTAP that will be the subject of some future blog post; you could also use rr.res_subject, but because people still use wildly different keyword schemes – if any –, that wouldn’t be as much fun). When plotted, that’s the left side of Fig. 3. If you do that yourself, you will notice that the resolution here is about one degree, which is a special property of the sort of MOCs I am proposing for the Registry: They are of order 6. Resolution in MOC goes up with order, doubling with every step. Thus MOCs of order 7 have a resolution of about half a degree, MOCs of order 5 a resolution of about two degrees.

One possible next step is fetch the intersection of each of these coverages with, say, the DFBS (cf. the post on Byurakan spectra). That would look like this:

  gavo_mocintersect(coverage, dfbscoverage) as ovrlp
  SELECT ivoid, res_title, coverage
  FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  NATURAL JOIN rr.resource
  WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20) AS others
  SELECT coverage AS dfbscoverage 
  FROM rr.stc_spatial
  WHERE ivoid='ivo://org.gavo.dc/dfbsspec/q/spectra') AS dfbs

(the DFBS’ identifier I got with a quick query on WIRR). This uses the gavo_mocintersect user defined function (UDF), which takes two MOCs and returns a MOC of their common pixels. Which is another important part why MOCs are so cool: together with union and intersection, they form groups. It should not come as a surprise that there is also a gavo_mocunion UDF. The sum aggregate function we’ve used in our grouping above is (conceptually) built on that.

Planetary Nebula footprint and plate matches
Fig. 3: Left: The common footprint of VO resources declaring a subject of planetary-nebula (and declaring a footprint). Right bottom: Heidelberg plates intersecting this, and, in blue, level-6 intersections. Above this, an enlarged detail from this plot.

You can also convert polygons and circles to MOCs using the (still DaCHS-only) MOC constructor. For instance, you could compute the coverage of all resources dealing with planetary nebulae, filtering against obviously over-eager ones by limiting the total area, and then match that against the coverages of images in, say, the Königstuhl plate achives HDAP. Watch this:

  gavo_mocintersect(MOC(6, im.coverage), pn_coverage) as ovrlp
  SELECT SUM(coverage) AS pn_coverage
  FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20) AS c
JOIN lsw.plates AS im
ON 1=INTERSECTS(pn_coverage, MOC(6, coverage))

– so, the MOC(order, geo) function should give you a MOC for other geometries. There are limits to this right now because of limitations of the underlying MOC library; in particular, non-convex polygons are not supported right now, and there are precision issue. We hope this will be rectified soon-ish when we base pgsphere’s MOC operations on the CDS HEALPix library. Anyway, the result of this is plotted on the right of Fig. 3.

Open Ends

In case you have MOCs from the outside, you can also construct MOCs from literals, which happen to be the ASCII MOCs from the standard. This could look like this:

  MOC('4/30-33 38 52 7/324-934') AS ar 
FROM tap_schema.tables

For now, you cannot combine MOCs in CONTAINS and INTERSECTS expressions directly; this is mainly because in such an operation, the machine as to decide on the order of the MOC the other geometries are converted to (and computing the predicates between geometry and MOC directly is really painful). This means that if you have a local table with MOCs in a column cmoc that you want to compare against a polygon-valued column coverage in a remote table like this:

  lsw.plates AS db
  JOIN tap_upload.t6
ON 1=CONTAINS(coverage, cmoc) -- fails!

you will receive a rather scary message of the type “operator does not exist: spoly <@ smoc”. To fix it (until we’ve worked out how to reasonably let the computer do that), explicitly convert the polygon:

  lsw.plates AS db
  JOIN tap_upload.t6
ON 1=CONTAINS(MOC(7, coverage), cmoc)

(be stingy when choosing the order here – MOCs that already exist are fast, but making them at high order is expensive).

Having said all that: what I’ve written here is bleeding-edge, and it is not standardised yet. I’d wager, though, that we will see MOCs in ADQL relatively soon, and that what we will see will not be too far from this experiment. Well: Some rough edges, I’d hope, will still be smoothed out.

Getting This on Your Own DaCHS Installation

If you are running a DaCHS installation, you can contribute to takeup (and if not, you can stop reading here). To do that, you need to upgrade to DaCHS’s latest beta (anything newer than 2.1.4 will do) to have the ADQL extension, and, even more importantly, you need to install the postgresql-postgres package from our release repository (that’s version 1.1.4 or newer; in a few weeks, getting it from Debian testing would work as well).

You will probably not get that automatically, because if you followed our normal installation instructions, you will have a package called postgresql-11-pgsphere installed (apologies for this chaos; as ususal, every single step made sense). The upshot is that with our release repo added, sudo apt install postgresql-pgsphere should give you the new code.

That’s not quite enough, though, because you also need to acquaint the database with the new functions. This can only be done with database administrator privileges, which DaCHS by design does not possess. What DaCHS can do is figure out the commands to do that when it is called as dachs upgrade -e. Have a look at the output, and if you are satisfied it is about what to expect, just pipe it into psql as a superuser; in the default installation, dachsroot would be sufficiently privileged. That is:

dachs upgrade -e | psql gavo   # as dachsroot

If running

select top 1 gavo_mocunion(moc('1/3'), moc('2/9')) 
from tap_schema.tables

through your TAP endpoint returns ‘1/3 2/9’, then all is fine. For entertainment, you might also make sure that gavo_mocintersect(moc('1/3'), moc('2/13')) is 2/13 as expected, and that if you intersect with 2/3 you get back an empty string.

So – let’s bring MOCs to ADQL!

Tutorial Renewal

[Image: The DaCHS Tutorial among other seminal works]

DaCHS’ documentation (readthedocs mirror) has two fat pieces and a lot of smaller read-as-you-go pieces. One of the behmoths, the reference documentation, at roughly 350 PDF pages, has large parts generated from source code, and there is no expectation that anyone would ever read it linearly. Hence, I wasn’t terribly worried about unreadable^Wpassages of questionable entertainment value in there.

That’s a bit different with the tutorial (also available as 150 page PDF; epub on request): I think serious DaCHS deployers ought to read the DaCHS Basics and the chapters on configuring DaCHS and the interaction with the VO Registry, and they should skim the remaining material so they are at least aware of what’s there.

Ok. I give you that is a bit utopian. But given that pious wish I felt rather bad that the tutorial has become somewhat incoherent in the years since I had started the piece in April 2009 (perhaps graciously, the early history is not visible at the documentation’s current github home). Hence, when applying for funds under our current e-inf-astro project, I had promised to give the tutorial a solid makeover as, hold your breath, Milestone B1-5, due in the 10th quarter. In human terms: last December.

When it turned out the Python 3 migration was every bit as bad as I had feared, it became clear that other matters had to take priority and that we might miss this part of that “milestone” (sorry, I can’t resist these quotes). And given e-inf-astro only had two quarters to go after that, I prepared for having to confess I couldn’t make good on my promise of fixing the tutorial.

But then along came Corona, and reworking prose seemed the ideal pastime for the home office. So, on April 4, I forked off a new-tutorial branch and started a rather large overhaul that, among others, resulted in the operators’ guide with its precarious position between tutorial and reference being largely absorbed into the tutorial. In all, off and on over the last few months I accumulated (according to git diff --shortstat 6372 inserted and 3453 deleted lines in the tutorial’s source. Since that source currently is 7762 lines, I’d say that’s the complete makeover I had promised. Which is good as e-inf-astro will be over next Wednesday (but don’t worry, our work is still funded).

So – whether you are a DaCHS expert, think about running it, or if you’re just curious what it takes to build VO services, let me copy from index.html: Tutorial on importing data (tutorial.html, tutorial.pdf, tutorial.rstx). The ideal company for your vacation!

And if you find typos, boring pieces, overly radical advocacy or anything else you don’t like: there’s a bug tracker for you (not to mention PRs are welcome).

DaCHS 2.1: Say hello to Python 3

Image: DaCHS and python logos

Today, I have released DaCHS 2.1, the first stable DaCHS running on Python 3. I have tried hard to make the major version move painless and easy, and indeed “pure DaCHS” RDs should just continue to work. But wherever there’s Python in your RDs or near them, things may break, since Python 3 is different from Python 2 in some rather fundamental ways.

Hence, the Debian package even has a new name: gavodachs2-server. Unless you install that, things will keep running as they do. I will keep fixing serious DaCHS 1 bugs for a while, so there’s no immediate urgency to migrate. But unless you migrate, you will not see any new features, so one of these days you will have to migrate anyway. Why not do it today?

Migrating to DaCHS 2

In principle, just say apt install gavodachs2-server and hope for the best. If you have a development machine and regression tests defined, this is actually what we recommend, and we’d be very grateful to learn of any problems you may encounter.

If you’d rather be a little more careful, Carlos Henrique Brandt has kindly updated his Docker files in order to let you spot problems before you mess up your production server. See Test Migration for a quick intro on how to do that. If you spot any problems that are not related to the Python 3 pitfalls mentioned in the howto linked below or nevow exodus, please tell me or (preferably) the dachs-support mailing list.

A longer, more or less permanent piece elaborating possible migration pains, is in our how-to documentation: How do I go from DaCHS1 to DaCHS2?

What’s new in DaCHS2?

I’ve used the opportunity of the major version change to remove a few (mis-) features that I’m rather sure nobody uses; and there are a few new features, too. Here’s a rundown of the more notable changes:

  • DaCHS now produces VOTable 1.4 by default. This is particularly notable when you provide TIMESYS metadata (on which I’ll report some other time).
  • When doing spatial indices, prefer the new //scs#pgs-pos-index to //scs#q3cindex. While q3c is still faster and more compact than pgsphere when just indexing points, on the longer run I’d like to shed the extra dependency (note, however, that the pgsphere index limits the cone search to a maximum radius of 90 degrees at this point).
  • Talking about Cone Search: For custom parameters, DaCHS has so far used SSA-like syntax, so you could say, for instance, vmag=12/13 (for “give me rows where vmag is between 12 and 13”). Since I don’t think this was widely used, I’ve taken the liberty to migrate to DALI-compliant syntax, where intervals are written as they would be in VOTable PARAM values: vmag=12 13.
  • In certain situations, DaCHS tries to enable parallel queries (previously on this blog).
  • Some new ADQL user defined functions: gavo_random_normal, gavo_mocintersect, and gavo_mocunion. See the TAP capabilities for details, and note that the moc functions will fail until we put out a new pgsphere package that has support for the MOC-MOC operations.
  • dachs info (highly recommended after an import) now takes a --sample-percent option that helps when doing statistics on large tables.
  • For SSA services serving something other than spectra (in all likelihood, timeseries), you can now set a productType meta as per the upcoming SimpleDALRegExt 1.2.
  • If you have large, obscore-published SIAP tables, re-index them (dachs imp -I q) so queries over s_ra and s_dec get index support, too.
  • Since we now maintain RD state in the database, you can remove the files /var/gavo/state/updated* after upgrading.
  • When writing datalink metaMakers returning links, you can (and should, for new RDs) define the semantics in an attribute to the element rather in the LinkDef constructor.
  • Starting with this version, it’s a good idea to run dachs limits after an import. This, right now, will mainly set an estimate for the number of rows in a table, but that’s already relevant because the ADQL translator uses it to help the postgres query planner. It will later also update various kinds of column metadata that, or so I hope, will become relevant in VODataService 1.3.
  • forceUnique on table elements is now a no-op (and should be removed); just define a dupePolicy as before.
  • If you write bad obscore mappings, it could so far be hard to figure out the reason of the failure and, between lots of confusing error messages, to fix it. Instead, you can now run “dachs imp //obscore recover“ in such a situation. It will re-create the obscore table and throw out all stanzas that fail; after that, you can fix the obscore declarations that were thrown out one by one.
  • If you run DaCHS behind a reverse proxy that terminates https, you can now set [web]adaptProtocol in /etc/gavo.rc to False. This will make that setup work for form-based services, too.
  • If you have custom OAI set name (i.e., anything but local and ivo_managed in the sets attribute of publish elements), you now have to declare them in [ivoa]validOAISets.
  • Removed things: the docform renderer (use form instead), the soap renderer (well, it’s not actually removed, it’s just that the code it depends on doesn’t exist on python3 any more), sortKey on services (use the defaultSortKey property), //scs#q3cpositions (port the table to have ra and dec and one of the SCS index mixins), the (m)img.jpeg renderers (if you were devious enough to use these, let me know), and quite a few even more exotic things.

Some Breaking Changes

Python 3 was released in 2008, not long after DaCHS’ inception, but since quite a few of the libraries it uses to do its job haven’t been available for Python 3, we have been reluctant to make the jump over the past then years (and actually, the stability of the python2 platform was a very welcome thing).

Indeed, the most critical of our dependencies, twisted, only became properly usable with python3 in, roughly, 2017. Indeed, large parts of DaCHS weren’t even using twisted directly, but rather a nice add-on to it called nevow. Significant parts of nevow bled through to DaCHS operators; for instance, the render functions or the entire HTML templating.

Nevow, unfortunately, fell out of fashion, and so nobody stepped forward to port it. And when I started porting it myself I realised that I’m mainly using the relatively harmless parts of nevow, and hence after a while I figured that I could replace the entire dependency by something like a 1000 lines in DaCHS, which, given significant aches when porting the whole of nevow, seemed like a good deal.

The net effect is that if you built code on top of nevow – most likely in the form of a custom renderer – that will break now, and porting will probably be rather involved (having ported ~5 custom renderers, I think I can tell). If this concerns you, have a look at the README in gavo.formal (and then complain because it’s mainly notes to myself at this point). I feel a bit bad about having to break things that are not totally unreasonable in this drastic way and thus offer any help I can give to port legacy DaCHS code.

Outside of these custom renderers, there should just be a single visible change: If you have used n:data="some_key" in nevow templates to pull data from dictionaries, that won’t work any longer. Use n:data="key some_key" n:render="str" instead. And it turns out that this very construct was used in the default root template, which you may have derived from. So – see if you have /var/gavo/web/templates/root.html and if so, whether there is <ul n:data="chunk" in there. If you have that, change it to <ul n:data="key chunk".

Update (2020-11-19): Two only loosely related problems have surfaced during updates. In particular if you are updating on rather old installations, you may want to look at the points on Invalid script type preIndex and function spoint_in already exists in our list of common problems.

Parallel Queries

Image: Plot of run times
An experiment with parallel querying of PPMX, going from single-threaded execution to using seven workers.

Let me start this post with a TL;DR for

Large analysis queries (like those that contain a GROUP BY clause) profit a lot from parallel execution, and you needn’t do a thing for that.
DaCHS operators
When you have large tables, Postgres 11 together with the next DaCHS release may speed up your responses quite dramatically in some cases.

So, here’s the story –

I’ve finally overcome my stretch trauma and upgraded the Heidelberg data center’s database server to Debian buster. With that, I got Postgres 11, and I finally bothered to look into what it takes to enable parallel execution of database queries.

Turns out: My Postgres started to do parallel execution right away, but just in case, I went for the following lines in postgresql.conf:

max_parallel_workers_per_gather = 4
max_worker_processes = 10
max_parallel_workers = 10

Don’t quote me on this – I frankly admit I haven’t really developed a feeling for the consequences of max_parallel_workers_per_gather and instead just did some experiments while the box was loaded otherwise, determining where raising that number has a diminishing return (see below for more on this).

The max_worker_processes thing, on the other hand, is an educated guess: on my data center, there’s essentially never more than one person at a time who’s running “interesting”, long-running queries (i.e., async), and that person should get the majority of the execution units (the box has 8 physical CPUs that look like 16 cores due to hyperthreading) because all other operations are just peanuts in comparison. I’ll gladly accept advice to the effect that that guess isn’t that educated after all.

Of course, that wasn’t nearly enough. You see, since TAP queries can return rather large result sets – on the GAVO data center, the match limit is 16 million rows, which for a moderate row size of 2 kB already translates to 32 GB of memory use if pulled in at once, half the physical memory of that box –, DaCHS uses cursors (if you’re a psycopg2 person: named cursors) to stream results and write them out to disk as they come in.

Sadly, postgres won’t do parallel plans if it thinks people will discard a large part of the result anyway, and it thinks that if you’re coming through a cursor. So, in SVN revision 7370 of DaCHS (and I’m not sure if I’ll release that in this form), I’m introducing a horrible hack that, right now, just checks if there’s a literal “group” in the query and doesn’t use a cursor if so. The logic is, roughly: With GROUP, the result set probably isn’t all that large, so streaming isn’t that important. At the same time, this type of query is probably going to profit from parallel execution much more than your boring sequential scan.

This gives rather impressive speed gains. Consider this example (of course, it’s selected to be extreme):

import contextlib
import pyvo
import time

def timeit(activity):
  start_time = time.time()
  end_time = time.time()
  print("Time spent on {}: {} s".format(activity, end_time-start_time))

svc = pyvo.tap.TAPService("http://dc.g-vo.org/tap")
with timeit("Cold (?) run"):
  svc.run_sync("select round(Rmag) as bin, count(*) as n"
    " from ppmx.data group by bin")
with timeit("Warm run"):
  svc.run_sync("select round(Rmag) as bin, count(*) as n"
    " from ppmx.data group by bin")

(if you run it yourself and you get warnings about VOTable versions from astropy, ignore them; I’m right and astropy is wrong).

Before enabling parallel execution, this was 14.5 seconds on a warm run, after, it was 2.5 seconds. That’s an almost than a 6-fold speedup. Nice!

Indeed, that holds beyond toy examples. The showcase Gaia density plot,

        count(*) AS obs,
        source_id/140737488355328 AS hpx
FROM gaia.dr2light

(the long odd number is 235416-6, which turns source_ids into level 6-HEALPixes as per Gaia footnote id; please note that Postgres right now isn’t smart enough to parallelise ivo_healpix), which traditionally ran for about an hour is now done in less than 10 minutes.

In case you’d like to try things out on your postgres, here’s what I’ve done to establish the max_parallel_workers_per_gather value above.

  1. Find a table with a few 1e7 rows. Think of a query that will return a small result set in order to not confuse the measurements by excessive client I/O. In my case, that’s a magnitude histogram, and the query would be
    select round(Rmag) as bin, count(*) 
    as n from ppmx.data 
    group by bin;

    Run this query once so the data is in the disk cache (the query is “warm”).

  2. Establish a non-parallel baseline. That’s easy to do:
    set max_parallel_workers_per_gather=0;
  3. Then run
    explain analyze select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;

    You should see a simple query plan with the runtime for the non-parallel execution – in my case, a bit more than 12 seconds.

  4. Then raise the number of max_parallel_workers_per_gatherer successively. Make sure the query plan has lines of the form “Workers Planned” or so. You should see that the execution time falls with the number of workers you give it, up to the value of max_worker_processes – or until postgres decides your table is too small to warrant further parallelisation, which for my settings happened at 7.

Note, though, that in realistic, more complex queries, there will probably be multiple operations that will profit from parallelisation in a single query. So, if in this trivial example you can go to 15 gatherers and still see an improvement, this could actually make things slower for complex queries. But as I said above: I have no instinct yet for how things will actually work out. If you have experiences to share: I’m sure I’m not the only person on dachs-users who’t be interested.

DaCHS 1.4 is out

Dachs logo with "version 1.4" superposed

Since the Groningen Interop is over, it’s time for a DaCHS release, and so, roughly half a year after the release of DaCHS 1.3, today I’ve pushed DaCHS 1.4 into our Debian repository.

As usual, you should upgrade as soon as you find time to do so, because upgrades become more difficult if they span large version gaps; and one of these days you will need some new feature or run into one of the odd bugs. Upgrading is a good opportunity to also get your DaCHS ready for buster by adding the repos mentioned there.

The list of new features is rather short this time around. Here are some noteworthy ones:

  • There’s now an XML grammar that can be used when you have to parse smallish snippets of XML as, for instance, in VOEvent.
  • You can now use TABLESAMPLE(1) after a table specification in DaCHS’ ADQL to tell the database engine to just use 1% of a table for a query. While this isn’t a precise way to sample tables, it’s great when developing queries.
  • Also among new features I’d like to see in ADQL and have therefore put into DaCHS is GENERATE_SERIES(a,b), which is what is known as table-generating function in SQL . If you know SDSS CasJobs, you’ll have seen lots of those already. GENERATE_SERIES, however, is really plain: it just spits out a table with a column with integers between a and b. For an example of why one might what to have that, check out the poster I’m linking to in my ADASS report.
  • If you have an updating data descriptor (usually, because you keep feeding data into a data collection), DaCHS will no longer automatically re-make its dependencies (like, say, views). That’s because that’s not necessary in general, and it’s a pain if every update on an obscore-published table tears down and rebuilds the obscore view. For the rare cases when you do need to rebuild dependencies, there’s now a remakeOnDataChange attribute on data.
  • At the interop, I’ve mentioned a few use cases for knowing which server software you’re talking to, and I’ve said that people should set their server headers to informative values. DaCHS does that now.

To conclude on a low note: This is probably going to be the last release of DaCHS for python 2. Even though we will have to shed a dependency or two that simply will not be ported to python 3, and even though I’m rather unhappy with a few properties of the python 3 port of twisted, there’s probably no way to escape this, given that Debian is purging out python 2 packages quickly already.

So, when we meet again for the next release, you’ll probably be looking at DaCHS 2.0, and where you have custom code in your RDs, it’s rather likely that you’ll see a minor amount of breakage. I promise I’ll do everything I can to make the migration easy for deployers, but I can’t do higher magic, so: If there’s ever been a time to add regression tests to your RDs, it’s now.

DaCHS 1.3 is out


Almost a year has passed since release 1.2 of DaCHS – I’ve let the normal autumn release slip last year because there weren’t so many release-worthy new features in DaCHS at the traditional release time (i.e., after the College Park interop), and also because running betas when you do need a new feature is a fairly stable thing by now.

But here it finally is: Release 1.3 (tarball for the die-hard self-builders; everyone else just switches back the release branch as necessary and then runs an update/upgrade cycle).

Here’s the commented changelog:

  • New //ssap#view mixin that should be used for future SSAP services, and that existing SSAP services should migrate to at some point. See A new view on SSAP in DaCHS on this blog for details.
  • Columns can now be hidden from TAP/ADQL (and other interfaces) by setting hidden="True".
  • There is now a setting [web]maxSyncUploadSize=500000 (meaning: about 500 kByte) as the default upload limit on sync queries. In compensation, clients uploading too much now receive a more useful error message (except it doesn’t reach TOPCAT users most of the time because it does chunked uploads). To get back the behaviour of 1.2 (which is probably ok if you can live with the occasional resource hog), add maxSyncUploadSize=20000000 to your /etc/gavo.rc.
  • Adding support for https (certificate reading, certificate updating with letsencrypt, registering alternate endpoints, no WebSAMP with https). See HTTPS in DaCHS on this blog for details.
  • New source_table and preview columns in obscore. If you’re using the various obscore mixins, this should be automatic. If you have defined views manually, you will have to amend these (and have a broken obscore until a dachs upgrade ran without error).
  • No longer producing arraysize="1" in VOTables for scalars (except char, for compatibility with a legacy TOPCAT workaround; see VOTable 1.3 Erratum 3 for background information).
  • Support for draft TIMESYS in VOTable (with STC 2 annotation; ask about details if you’re interested. This is for draft VOTable 1.4 and probably only relevant to you if you’re publishing time series).
  • You can now add targetType and targetTitle properties to URL-valued columns to help Aladin figure out what to do with URLs (see Datalinks as product URLs in the reference documentation).
  • New gavo_transform, gavo_ipix, and gavo_urlescape ufuncs for ADQL, fixed gavo_urlescape to have acceptable performance.
  • New generating CatalogResource records with auxiliary capabilities in accordance with Oct 2018 VODataService WD.
  • //soda#sdm_genDesc now matches accref rather than pubDID by default. If you use Datalink with SSA and have a custom pubDID schema (or no index on accref), add a useAccref="False" to your descriptorGenerator statement.
  • There is now a --foreground option for dachs serve start. This is mainly to play nice with systemd, and indeed, the Debian package now comes with a systemd unit file. I’m not terribly familiar with systemd, so please have an eye on DaCHS controlled by systemd and let me know if you see something that’s not as it should be.
  • Fixes for various bugs (most notably: escaped quotes in ADQL, WCS in SIAP cutout products) and many minor improvements. Check out the source tree (still via subversion) and read the changelog if you want to know the whole truth.

On systems running from the Debian package, the update should be automatic with the next system upgrade. However, you’ll be saving yourself quite a bit of headache if you check the health of your installation before the upgrade; see Upgrading DaCHS in the operator’s guide on how to upgrade professionally.

A New View on SSAP in DaCHS

When I started working on the VO in 2007, my collagues in Garching already had a software that implemented major parts of the simple spectral access protocol (SSAP) that was being developed back then. It would publish spectra in the FITS format by just blindly dumping all header cards into a database table and then defining a view over that “raw” metadata table to make the whole thing match SSAP’s expectations for how the output table should look like. Sometimes you could just map through a header to an SSA column, sometimes you would just convert a unit, sometimes you would have to write a fairly complex SQL expressions combining multiple fields.

Back then, I didn’t like it – why have two things (a table and a view) that can break when one (just a table in SSA’s format) would do, too? Also, SSAP has about 50 metadata fields, but lets you put constant values into VOTable PARAMs, which seemed a very reasonable way to attain more compact responses. So, when DaCHS grew SSAP support, I defined a mixin (essentially, a configurable interface definition) that let operators define SSA tables and their constant parameters in a fairly simple fashion and directly produced a table you could base your SSAP service on.

That made assumptions about which pieces of metadata are constant and which are not; for instance, the original mixin (“hcd” for “homogeneous collection”) assumed all spectra in a data collection came from the same instrument and had the same resolution and (what was I thinking?) SNR. Unsurprisingly, that broke fairly soon. So, I added a second mixin (“mixc”) for when different instruments or codes produced the data.

But even that was headache, at the latest when I started making time series services using SSAP. And I had to fix a few bugs in the mixins themselves in the meantime, which mostly required re-imports of the data in that design. Such re-imports are non-trivial when you have millions of spectra, and they need to happen at software upgrade time or the services would break with the upgrade. Ouch.

It was about mid-2018 when it dawned on me that sometimes it’s better to have two things that can break even if one would do, after all. Specifically, if fixing the one thing is expensive, it’s an excellent idea to put a facade on top of it that’s cheap to change and can already be used to repair most deficiencies. Why re-build the house if a paint job does the trick?

As to having more compact query responses when you stuff metadata that’s constant in all the rows into VOTable PARAMS – well, in the age of web pages pulling in a megabyte of javascript and two megabytes of images to display five lines of text, I’ve become a bit cavalier in that department. Sure, the average row may have grown by a factor of three, but we’re still talking only a few megabyte even with large responses. To me, these extra bytes seem a fair price to pay for the increased flexibility and overall more straightforward architecture.

So, I’ve now come up with a view-based solution in DaCHS, too: the //ssap#view mixin. This is a bit less radical than the Garching software of 2007, as it doesn’t dump raw headers but instead lets you do the primary transformations in the RD. But it no longer constrains what pieces of metadata should be constant and which may vary between spectra, and it uses the same names for the same pieces of metadata throughout (which also is a step forward over the old SSAP mixins).

With this, DaCHS operators should no longer use the hcd and mixc mixins for new services. The new technique is already reflected the respective tutorial chapter, and the SSAP template (you’re using dachs start, aren’t you?) now uses it, too.

If you have a spectra publishing project in your pipeline, this would be the perfect time to upgrade to the DaCHS 1.2.4 beta, which has the new mixin. It would be great if we could iron out remaining wrinkles before the next release makes changes a load on my conscience.

As to migrating existing SSAP services: Well, it would be great if I could drop the old mixins in a couple of years, as they cause quite a bit of uglyness in DaCHS’s built-in //ssap RD. But the migration regrettably isn’t straightforward, so you may want to wait a bit before embarking on that journey (I’ll be happy to help, though).