DaCHS 2.12 Is Out

The DaCHS logo, a badger's head and the text "VO Data Publishing"

A bit more than one month after the last Interop, I have released the next version of GAVO's data publication package, DaCHS. This is the customary post on what is new in this release.

There is no major headline for DaCHS 2.12, but there is a fair number of nice conveniences in it. For instance, if you have a collection of time series to publish, the new time series service template might help you. You get it by calling dachs start timeseries; I will give you that it suffers from about the same malady as the existing ssap+datalink one: There is a datalink service built in from the start, which puts up a scary amount of up-front complexity you have to master before you get any sort of gratification.

There is little we can do about that; the creators of time series data sets just have not come up with a good convention for how to write them; I might be moved to admit that putting them into nice FITS binary tables might count as acceptable. In practice, none of the time series I got from my data providers came in a format remotely fit for distribution. Perhaps Ada's photometric time series convention (which is what you will deliver with the template) is not the final word on how to represent time series, but it is much better than anything else I have seen. Turning what you get from your upstreams into what you want to hand out to your users just requires Datalink, I'm afraid.

I will add tutorial chapters for how to deal with the datalink-infested templates one of these days; within them bulk commenting will play a fairly important role. For quite a while, I have recommended defining a lazy macro with a CDATA section in order to comment out a large portion of an RD. I have changed that recommendation now to open such comments with <macDef raw="True" name="todo"><![CDATA[ and close them with ]]></macDef>. The new (2.12) part is the raw="True". This only means that DaCHS will not try to expand macros within the macro definition. So far, it has done that, and that was a pain for the datalink-infested templates, because there are macro calls in the templates, but some of them will not work in the RD context the macDef is in.
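As a minimal sketch of that recommendation: the opening and closing lines below are the ones quoted above, while the service inside is only a hypothetical stand-in for whatever part of the RD you want to park for now.

```xml
<macDef raw="True" name="todo"><![CDATA[
  <!-- Hypothetical placeholder: put the RD elements you are not
       ready for here. With raw="True", macro calls in this block
       are left alone instead of being expanded. -->
  <service id="dl" allowed="dlget,dlmeta">
    <meta name="title">Datalink service, to be enabled later</meta>
  </service>
]]></macDef>
```

One general XML caveat applies: a CDATA section ends at the first ]]> it encounters, so this trick does not work if the material you want to comment out itself contains a CDATA section.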

By the way, in case you would like to write your template to a file other than q.rd (perhaps because there already is one in your resdir), there is now an -o option to dachs start.

Speaking of convenience, defining spectral coverage has become a lot less of a pain in 2.12. So far, whenever you had to manually define a resource's STC coverage (and that is not uncommon for the spectral axis, where dachs limits often finds no suitable columns or misses large gaps between observations in multiple narrow bands), you had to turn the Ångströms or GHz into Joule by throwing in the right amounts of c, h, and math operators. Now, you just add the appropriate units in square brackets and let DaCHS work out the rest; DaCHS will also ensure that the lower limit actually is smaller than the upper limit. A resource covering a number of bands in various parts of the spectrum might thus say:

<coverage>
  <spectral>100[kHz] 21.5[cm]</spectral>
  <spectral>2[THz] 1[um]</spectral>
  <spectral>653[nm] 660[nm]</spectral>
  <spectral>912[Angstrom] 10[eV]</spectral>
  <spectral>20[GeV] 100[GeV]</spectral>
</coverage>

DaCHS will produce a perfectly viable coverage declaration for the Registry from that.
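For comparison, here is the arithmetic the unit annotations save you, worked for the first interval above. It is plain Planck-relation bookkeeping with the SI values h = 6.62607015e-34 J s and c = 299792458 m/s; the rounding is only for illustration and is not what DaCHS itself prints.

```latex
\begin{align*}
E &= h\nu = \frac{hc}{\lambda},\\
E(100\,\mathrm{kHz}) &= 6.62607015\times 10^{-34}\,\mathrm{J\,s}
   \times 10^{5}\,\mathrm{s^{-1}}
   \approx 6.63\times 10^{-29}\,\mathrm{J},\\
E(21.5\,\mathrm{cm}) &= \frac{6.62607015\times 10^{-34}\,\mathrm{J\,s}
   \times 2.99792458\times 10^{8}\,\mathrm{m\,s^{-1}}}{0.215\,\mathrm{m}}
   \approx 9.24\times 10^{-25}\,\mathrm{J}.
\end{align*}
```

The same relation shows why the automatic ordering of limits is handy: in the fourth interval, 912 Ångström corresponds to about 2.18e-18 J (13.6 eV), which is more than the 10 eV (about 1.60e-18 J) written after it, so the limits as typed are in descending energy.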

Still in the convenience department, I have found myself defining a STREAM (in case you don't know what I'm talking about: read up on them in the tutorial) that creates pairs of columns for a value and its error once too often. Thus, there is now the //procs#witherror stream. Essentially, you can replace the <column in a column definition with <FEED source="//procs#witherror", and you get two columns: one with the name itself, the other with a name of err_name, and the columns ought to have suitable metadata. For instance:

<FEED source="//procs#witherror"
  name="rv" type="double precision"
  unit="km/s" ucd="spect.dopplerVeloc"
  tablehead="RV_S"
  description="Radial velocity derived by the Serval pipeline"
  verbLevel="1"/>
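To make the effect concrete, this is roughly what one would have written by hand before. It is a sketch only: the err_rv name follows the err_name rule just described, but the type, UCD, tablehead, and description shown on the error column are assumptions made for illustration, and what witherror actually generates may differ.

```xml
<!-- the value column: same attributes as given to the FEED -->
<column name="rv" type="double precision"
  unit="km/s" ucd="spect.dopplerVeloc"
  tablehead="RV_S"
  description="Radial velocity derived by the Serval pipeline"
  verbLevel="1"/>
<!-- the error column: the name is err_ plus the original name;
     all other metadata here is assumed, not copied from witherror -->
<column name="err_rv" type="double precision"
  unit="km/s" ucd="stat.error;spect.dopplerVeloc"
  tablehead="Err. RV_S"
  description="Error in radial velocity derived by the Serval pipeline"
  verbLevel="1"/>
```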

You cannot yet have values children with witherror, but it is fairly uncommon for such columns to want them: you won't enumerate values or set null values (things with errors will be floating point values), and column statistics these days are obtained automatically by dachs limits.

You can take this a turn further and put witherror into a LOOP. For instance, to define ugriz photometry with errors, you would write:

<LOOP>
  <csvItems>
  item, ucd
  u, U
  g, V
  r, R
  i, I
  z, I
  </csvItems>
  <events passivate="True">
    <FEED source="//procs#witherror" name="mag_\item"
      unit="mag" ucd="phot.mag;em.opt.\ucd"
      tablehead="m_\item"
      description="Magnitude in \item band"/>
  </events>
</LOOP>

There is a difficult part in this: the passivate="True" in the events element. If you like puzzlers, you may want to figure out why that is needed based on what I document about active tags in the reference documentation. Metaprogramming and Macros become subtle not only in DaCHS.
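Leaving the puzzler aside, what the LOOP amounts to is easy to spell out: for each row in csvItems, \item and \ucd are filled in. The first and last rows therefore correspond to writing the following by hand. This is just the substitution done manually, not output copied from DaCHS.

```xml
<!-- row "u, U" -->
<FEED source="//procs#witherror" name="mag_u"
  unit="mag" ucd="phot.mag;em.opt.U"
  tablehead="m_u"
  description="Magnitude in u band"/>
<!-- row "z, I" -->
<FEED source="//procs#witherror" name="mag_z"
  unit="mag" ucd="phot.mag;em.opt.I"
  tablehead="m_z"
  description="Magnitude in z band"/>
```

By the err_name rule above, each of these in turn yields a value column and its error column, for instance mag_u and err_mag_u.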

Far too few DaCHS operators define examples for their TAP services. Trust me, your users will love them. To ensure that they still are good, you can now pass an -x flag to dachs val (nb not dachs test); that will execute all of the TAP examples defined in the RD against the local server and complain when one does not return at least one valid row. The normal usage would be to say dachs val -x //tap if you define your examples in the userconfig RD; but with hierarchical examples, any RD might contain examples modern TAP clients will pick up.

There is another option to have an example tested: you could put the query into a macro (remember macDef above?) and then use that macro both in the example and in a regTest element. That is because url attributes now expand macros. That may be useful for other and more mundane things, too; for instance, you could have DaCHS fill in the schema in queries.

Actual new features in 2.12 are probably not very relevant to average DaCHS operators, at least for now:

  • users can add indexes to their persistent uploads (featured here before)
  • registration of VOEvent streams according to the current VOEvent 2.1 PR (ask if interested; there is minimal documentation on this at this point).
  • an \if macro that sometimes may be useful to skip things that make no sense with empty strings: \if{\relpath}{http://example.edu/p/\relpath} will not produce URLs if relpath is empty.
  • if you have tables with timestamps, it may be worth running dachs limits on them again, as DaCHS will now obtain statistics for them (in MJD, if you have to know) and consequently provide, e.g., placeholders.
  • our spatial WCS implementation no longer assumes the units are degrees (but still that it is dealing with spherical coordinates).
  • when params are array-valued, any limits defined in values are now validated component-wise.

Finally, if you inspected a diff to the last release, you would see a large number of changes due to type annotation of gavo.base. I have promised my funders to type-annotate the entire DaCHS code (except perhaps for exotic stuff I shouldn't have written in the first place, viz., gavo.stc) in order to make it easier for the community to maintain DaCHS.

From my current experience, I don't think I will keep this particular promise. After annotating several thousand lines of code my impression is that the annotation is a lot of effort even with automatic annotation helpers (the cases it can do are the ones that would be reasonably quick for a human, too). The code does in general improve in consequence (but not always), but not fundamentally, and it does not become dramatically more readable in most places (there are exceptions to that reservation, though).

All in all, the cost/benefit ratio just does not seem to be small enough. And: the community members who I hope will contribute code would feel obliged to write type annotations, too, which feels like an extra hurdle I would like to spare them.