Sofa instead of Granada

[Screenshot from an online talk]
Gesticulating wildly to a computer is what happens in an online conference. To me, at least. Let’s hope nobody watched me through the window.

It was already in the wee hours of Friday last week (CET) when the second “virtual Interop” had its rather unceremonious closing ceremony. Its predecessor in May had about it an air of a state of emergency. For instance, all sessions were monothematic. That was nice on the one hand, because a relatively large part of the time was available for discussion – which, really, is what the Interops are about. But then Interops are also about noticing what everyone else in the Virtual Observatory is cooking up, for which the short-ish talks we usually have at Interops work really well.

In contrast to that first Corona Interop, this second one, replacing what would have taken place in Granada, Spain, had a much more conventional format, which again accomodated many talks. But of course, this made one feel the lack of possibilities to quickly hash out a problem during a coffee break or in a spontaneous splinter quite a bit more.

Be that as it may, I would like to give you some insights on what I’m currently up to at the IVOA level; I am grateful for any feedback you can give on any of these topics.

Given that I currently chair the Semantics Working group, there was a natural focus on topics around vocabularies, and I gave two talks in that department. The one in DAL (DAL is the working group that builds the actual access protocols such as TAP or SIAP) was mainly on Datalink-related aspects of my Vocabularies in the VO 2 draft (VocInVO2), which in particular was an opportunity to thank everyone involved in the Vocabulary Enhancement Proposals we have been running this last year (all of which were about Datalink and hence closely tied to DAL). One thing I was asking for was reviews on a github pull request that would make the bysemantics method of Datalink accesses semantics-aware; basically, as intended by the original Datalink authors, when asking for #calibration links, this will also return, say, #bias links. If you can spare a moment for this: Please do!

Another thing I tried to raise some interest for is the proposed vocabulary of product types; this, I think, should eventually define what people may put into the dataproduct_type column of Obscore results, and there are related uses in Datalink and, believe it or not, the registration of SSAP (spectral) services. A question Alberto raised while I was discussing that made me realise I forgot to mention another vocabularies-related development relevant for DAL: I’ve put the gavo_vocmatch ADQL user-defined function into DaCHS. It lets you match something against a term or its narrower terms, referencing an IVOA vocabulary. For instance, if we had different sorts of time series (which, of course, would be odd for obscore that has the o_ucd column for this kind of thing), you could, using ADQL, still get all time series by querying

FROM ivoa.obscore

Here, the first argument is the vocabulary name (whatever is after the in the vocabulary URL), the second the “root” term, and the third the column to match against. Since postgres, for now, isn’t aware of IVOA vocabularies, the second argument must be a literal string rather than, say, an expression involving columns.

I gave a second semantics-related talk in the Registry session. That had its focus on the Unified Astronomy Thesaurus (UAT), from which people should pick the subject keywords in the VO Registry (actually, they should pick from its representation at I’ll probably blog about that a little more some other time. For now, let me recommend a little UAT-based game on my Semantics Based Registry Browser sembarebro: Choose two terms that are pretty far apart (like, perhaps, ionized-coma-gases and cosmic-background-radiation) and then try to join the two sub-graphs. Warning: This may waste your time. But it will acquaint you with the UAT, which may be a good thing.

In that second talk, I also mentioned a second draft vocabulary I’ve put up in the past six months, This builds upon the terms for VODataService’s waveband element, which enumerated certain flavours of photons (like Radio, Optical, or X-ray). Now that we explore other messengers as well and have more and more solar system resources in the Registry, I’m arguing we ought to open up things by making “Photon” explicit in there and then adding Neutrinos and, later, other messengers. I’ve received a certain amount of pushback there on mixing the electromagnetic spectrum with particle types; on the other hand, the hierarchical nature of our vocabularies would, I think, let us smartly get away with that.

Speaking about solar system resources, I’m also listed as an author on Stéphane Erard’s talk on EPN-TAP and EPNCore v2.0, probably due to my involvement in finally bringing EPN-TAP into the IVOA document repository. I’ve already talked about that in a 2017 post on this blog – and again, if you’re interested in solar system data, this would be a good time to review the EPN-TAP working draft.

Talking about things regluar readers of this blog will have heard of: September’s Crazy Shapes post I’ve referenced in a talk on MOCs in pgsphere, together with a fervent appeal to data centers to become involved in pgsphere maintenance.

And then there was my colleague Margarida’s talk on LineTAP, a proposal to obsolete the little-used SLA protocol (which lets people search for spectral lines) with something combining the much more successful VAMDC with our beloved TAP. Me, I’m in this because I’d like to bring TOSS data closer to VAMDC – but also because having competing infrastructures for the same thing sucks.

And finally, I gave a talk I’ve called Data Model Posture Review in a session of the Data Models working group; I was somewhat worried that given its rather skeptical outlook it wouldn’t be really well-received. But in fact quite a few people shared my main conclusions – and perhaps it was another step towards resolving my decade-old spot of pain: that the VO still doesn’t offer tech to reliably bring two catalogues to the same epoch without human intervention.

With this number of talks I’ve been involved in, I’m essentially back to the level of a normal Interop. Which means I’ve been fairly knocked-out on Friday. And I can’t lie: I still regret I didn’t get to spend a few more warm days in Granada. Corona begone!

DaCHS 2.2 is out

[Image: DaCHS "entails" 2.2]
DaCHS 2.2 adds support for what simple semantics we currently do in the VO. Which is a welcome excuse to abuse one of the funny symbols semanticians love so much.

Today, I’ve released DaCHS 2.2, the second stable version of DaCHS running on Python 3. Indeed, we have ironed out a few sore spots that have put that “stable” into question, especially if you didn’t run things on Debian Buster. Mind you, playing it safe and just going for Debian is still recommended: Compared to the Python 2 world, where things largely didn’t break for a decade, the Python 3 universe is still shaking out, and so the versions of dependencies do matter. It’s actually fairly gruesome how badly pyparsing 2.4 will break DaCHS. But that’s for another day.

Despite this piece of fearmongering, it’d be great if you could upgrade your installations if you are running DaCHS, and it’s pretty safe if you’re on Debian buster anyway (and if you’re running Debian in the first place, you should be running buster by now).

Here are the more notable changes in this release:

  • DaCHS can now (relatively easily) write time series in the form of what Ada Nebot’s Time Series Annotation note proposes. See the tutorial chapter on building time series for how to do that in practice. Seriously: If you have time series to publish, by all means try this out. The specification can still be fixed, and so this is the perfect time to find problems with the plan.
  • The 2.2 release contains support for the MOC ADQL functions mentioned in the last post on this blog. Of course, to make them work, you will still have to acquaint your database with the new functionality.
  • DaCHS has learned to use IVOA vocabularies as per the current draft for Vocabularies in the VO 2. The most visible effect for you probably is that DaCHS now warns if your subject keywords are not taken from the Unified Astronomy Thesaurus (UAT) – which they almost certainly are not, because the actual format of these keywords is a bit funky. On the other hand, if you employ the “plain” root page template (see the root template in our templating guide if you are not sure what I am talking about here), you will get nice, human-friendly labels for the computer-friendly terms you ought to put into subjects. In case you don’t bother: Given I’m currently serving as chair of the semantics working group of the IVOA, the whole topic will certainly come up again soon, and at that point I will probably also talk about another semantics-related newcomer to DaCHS, the gavo_vocmatch ADQL UDF.
  • There is a new command dachs datapack for interacting with frictionless data packages. The idea is that you can say dachs datapack create myres/q myres.pack and obtain an archive of all that is necessary to re-create myres on another DaCHS installation, where you would say dachs datapack load myres.pack. Frankly, this isn’t much different from just tarring up the resource directory at this point, except that any cruft that may have accumulated in the directory is skipped and there is a bit of structured metadata. But then interoperability always starts slowly. Note, by the way, that this certainly does not teach DaCHS to do anything sensible with third-party data packages; while I’ve not thought hard about this, as it seems a remote use case, I am pretty sure that even the “tabular data packages” that refine the rough general metadata quite a bit simply have nowhere near enough metadata to create a useful VO resource or TAP table.
  • As part of my never-ending struggle against bitrot (in case you’ve always wondered what “curation” means: that, essentially), I’m running dachs val -vc ALL in my own data center once every month. This used to traverse the file system to locate all RDs defined on a box and then make sure they are still ok and their definitions match the database schema. That behaviour has now changed a bit: It will only check published RDs now. I cannot lie: the main reason for the change is because on my production machine the file system traversal has taken longer and longer as data accumulated. But then beyond that there is much less to worry when unpublished gets a little bit mouldy. To get back the old behaviour of validating all RDs that are reachable by the server, use ALL_RECURSE instead of ALL.
  • DaCHS has traditionally assumed that you are running multiple services on one site, which is why its root page is rendered over a service that exposes metadata on local resources. If that doesn’t quite work for how you use DaCHS – perhaps because you want to have your own custom renderers and data functions on your root page, perhaps because you only have one browser-based service and that should be the root page right away –, you can now override what is shown when people access the root URI of your DaCHS installation by setting the [web]root config item to the path of the resource you want as root (e.g., myres/q/s/fixed when the root page should be made by the fixed renderer on the service s within the RD myres/q).
  • Scripting in DaCHS is a powerful way to execute python or SQL code when certain things happen. That seems an odd thing to want until you need it; then you need it badly. Since DaCHS 2.2, scripts executed before or after the creation of a table, before its deletion, or after its meta data has been updated, can sit on tables (where they have always belonged). Before, they could only be on makes (where they can still sit, but of course they are then only executed if the table is operated through that particular make) and RDs (from where they could be copied). That latter location is now forbidden in order to free up RD scripts for later sanitation. Use STREAM and FEED instead if you really used something like that (and I’d bet you don’t).
  • Minor behavioural changes: (a) Due to a bug, you could write things like <schema foo="bar">my_schema<schema>, i.e., have attributes on attributes written in element form. That is now flagged as an error. Since that attribute was fed to the embedding element, you might need to add it there. (b) If you have custom flot plots in one of your templates (and you don’t if you don’t know what I’m talking about), you now have to set style to Points or Lines where you had usingIndex 0 or 1 before. (c) The sidebar template no longer has links to a privacy policy (that few bothered to fill out). See extra sidebar items in the tutorial on how to get them back or add something else.

The most important change comes last: The default logo DaCHS shows unless you override it is no longer the GAVO logo. That’s, really, been inappropriate from the start. It’s now the DaCHS logo, the thing that’s in this posts’s article image. Which isn’t quite as tasteful as the GAVO one, true. But I trust we’ll all get used to it.

Crazy Shapes in TAP

OpenNGC shapes
A complex shape from OpenNGC: MOCs need not be convex, or simply connected, or anything.

So far when you did spherical geometry in ADQL, you had points, circles, and polygons as data types, and you could test for intersection and containment as operations. This feature set is a bit unsatisfying because there are no (algebraic) groups in this picture: When you join or intersect two circles, the result only is a circle if one contains the other. With non-intersecting polygons, you will again not have a (simply connected) spherical polygon in the end.

Enter MOCs (which I’ve mentioned a few times before on this blog): these are essentially arbitrary shapes on the sky, in practice represented through lists of pixels, cleverly done so they can be sufficiently precise and rather compact at the same time. While MOCs are powerful and surprisingly simple in practice, ADQL doesn’t know about them so far, which limits quite a bit what you can do with them. Well, DaCHS would serve them since about 1.3 if you managed to push them into the database, but there were no operations you could do on them.

Thanks to work done by credativ (who were really nice to work with), funded with some money we had left from our previous e-inf-astro project (BMBF FKZ 05A17VH2) on the pgsphere database extension, this has now changed. At least on the GAVO data center, MOCs are now essentially first-class citizens that you can create, join, and intersect within ADQL, and you can retrieve the results. All operators of DaCHS services are just a few updates away from being able to offer the same.

So, what can you do? To follow what’s below, get a sufficiently new TOPCAT (4.7 will do) and open its TAP client on (a.k.a. GAVO DC TAP).

Basic MOC Operations in TAP

First, let’s make sure you can plot MOCs; run

SELECT name, deepest_shape
FROM openngc.shapes

Then do Graphics/Sky Plot, and in the window that pops up then, Layers/Add Area Control. Then select your new table in the Position tab, and finally choose deepest_shape as area (yeah, this could become a bit more automatic and probably will over time). You will then see the footprints of a few NGC objects (OpenNGC’s author Mattia Verga hasn’t done all yet; he certainly welcomes help on OpenNGC’s version control repo), and you can move around in the plot, yielding perhaps something like Fig. 1.

Now let’s color these shapes by object class. If you look, has an obj_type column – let’s group on it:

  AREA(shape) AS ar
  SELECT obj_type, SUM(deepest_shape) AS shape
  FROM openngc.shapes
  GROUP BY obj_type) AS q

(the extra subquery is a workaround necessary because the area function wants a geometry or a column reference, and ADQL doesn’t allow aggregate functions – like sum – as either of these).

Coloured shapes
Fig. 2: OpenNGC shapes grouped and coloured by type.

In the result you will see that so far, contours for about 40 square degrees of star clusters with nebulae have been put in, but only 0.003 square degrees of stellar associations. And you can now plot by the areas covered by the various sorts of objects; in Fig. 2, I’ve used Subsets/Classify by Column in TOPCAT’s Row Subsets to have colours indicate the different object types – a great workaround when one deals with categorial variables in TOPCAT.

MOCs and JOINs

Another table that already has MOCs in them is rr.stc_spatial, which has the coverage of VO resources (and is the deeper reason I’ve been pushing improved MOC support in pgsphere – background); this isn’t available for all resources yet , but at least there are about 16000 in already. For instance, here’s how to get the coverage of resources talking about planetary nebulae:

SELECT ivoid, res_title, coverage
FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  NATURAL JOIN rr.resource
WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20

(the rr.subject_uat table is a local extension to RegTAP that will be the subject of some future blog post; you could also use rr.res_subject, but because people still use wildly different keyword schemes – if any –, that wouldn’t be as much fun). When plotted, that’s the left side of Fig. 3. If you do that yourself, you will notice that the resolution here is about one degree, which is a special property of the sort of MOCs I am proposing for the Registry: They are of order 6. Resolution in MOC goes up with order, doubling with every step. Thus MOCs of order 7 have a resolution of about half a degree, MOCs of order 5 a resolution of about two degrees.

One possible next step is fetch the intersection of each of these coverages with, say, the DFBS (cf. the post on Byurakan spectra). That would look like this:

  gavo_mocintersect(coverage, dfbscoverage) as ovrlp
  SELECT ivoid, res_title, coverage
  FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  NATURAL JOIN rr.resource
  WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20) AS others
  SELECT coverage AS dfbscoverage 
  FROM rr.stc_spatial
  WHERE ivoid='ivo://org.gavo.dc/dfbsspec/q/spectra') AS dfbs

(the DFBS’ identifier I got with a quick query on WIRR). This uses the gavo_mocintersect user defined function (UDF), which takes two MOCs and returns a MOC of their common pixels. Which is another important part why MOCs are so cool: together with union and intersection, they form groups. It should not come as a surprise that there is also a gavo_mocunion UDF. The sum aggregate function we’ve used in our grouping above is (conceptually) built on that.

Planetary Nebula footprint and plate matches
Fig. 3: Left: The common footprint of VO resources declaring a subject of planetary-nebula (and declaring a footprint). Right bottom: Heidelberg plates intersecting this, and, in blue, level-6 intersections. Above this, an enlarged detail from this plot.

You can also convert polygons and circles to MOCs using the (still DaCHS-only) MOC constructor. For instance, you could compute the coverage of all resources dealing with planetary nebulae, filtering against obviously over-eager ones by limiting the total area, and then match that against the coverages of images in, say, the Königstuhl plate achives HDAP. Watch this:

  gavo_mocintersect(MOC(6, im.coverage), pn_coverage) as ovrlp
  SELECT SUM(coverage) AS pn_coverage
  FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20) AS c
JOIN lsw.plates AS im
ON 1=INTERSECTS(pn_coverage, MOC(6, coverage))

– so, the MOC(order, geo) function should give you a MOC for other geometries. There are limits to this right now because of limitations of the underlying MOC library; in particular, non-convex polygons are not supported right now, and there are precision issue. We hope this will be rectified soon-ish when we base pgsphere’s MOC operations on the CDS HEALPix library. Anyway, the result of this is plotted on the right of Fig. 3.

Open Ends

In case you have MOCs from the outside, you can also construct MOCs from literals, which happen to be the ASCII MOCs from the standard. This could look like this:

  MOC('4/30-33 38 52 7/324-934') AS ar 
FROM tap_schema.tables

For now, you cannot combine MOCs in CONTAINS and INTERSECTS expressions directly; this is mainly because in such an operation, the machine as to decide on the order of the MOC the other geometries are converted to (and computing the predicates between geometry and MOC directly is really painful). This means that if you have a local table with MOCs in a column cmoc that you want to compare against a polygon-valued column coverage in a remote table like this:

  lsw.plates AS db
  JOIN tap_upload.t6
ON 1=CONTAINS(coverage, cmoc) -- fails!

you will receive a rather scary message of the type “operator does not exist: spoly <@ smoc”. To fix it (until we’ve worked out how to reasonably let the computer do that), explicitly convert the polygon:

  lsw.plates AS db
  JOIN tap_upload.t6
ON 1=CONTAINS(MOC(7, coverage), cmoc)

(be stingy when choosing the order here – MOCs that already exist are fast, but making them at high order is expensive).

Having said all that: what I’ve written here is bleeding-edge, and it is not standardised yet. I’d wager, though, that we will see MOCs in ADQL relatively soon, and that what we will see will not be too far from this experiment. Well: Some rough edges, I’d hope, will still be smoothed out.

Getting This on Your Own DaCHS Installation

If you are running a DaCHS installation, you can contribute to takeup (and if not, you can stop reading here). To do that, you need to upgrade to DaCHS’s latest beta (anything newer than 2.1.4 will do) to have the ADQL extension, and, even more importantly, you need to install the postgresql-postgres package from our release repository (that’s version 1.1.4 or newer; in a few weeks, getting it from Debian testing would work as well).

You will probably not get that automatically, because if you followed our normal installation instructions, you will have a package called postgresql-11-pgsphere installed (apologies for this chaos; as ususal, every single step made sense). The upshot is that with our release repo added, sudo apt install postgresql-pgsphere should give you the new code.

That’s not quite enough, though, because you also need to acquaint the database with the new functions. This can only be done with database administrator privileges, which DaCHS by design does not possess. What DaCHS can do is figure out the commands to do that when it is called as dachs upgrade -e. Have a look at the output, and if you are satisfied it is about what to expect, just pipe it into psql as a superuser; in the default installation, dachsroot would be sufficiently privileged. That is:

dachs upgrade -e | psql gavo   # as dachsroot

If running

select top 1 gavo_mocunion(moc('1/3'), moc('2/9')) 
from tap_schema.tables

through your TAP endpoint returns ‘1/3 2/9’, then all is fine. For entertainment, you might also make sure that gavo_mocintersect(moc('1/3'), moc('2/13')) is 2/13 as expected, and that if you intersect with 2/3 you get back an empty string.

So – let’s bring MOCs to ADQL!

Histograms and Hidden Open Clusters

[image: reddish pattern]
Colour-coded histograms for distances of stars in the direction of some NGC open clusters — one cluster per line, so you’re looking a a couple of Gigabytes of data here. If you want this a bit more precise: Read the article and generate your own image.

I have spent a bit of time last week polishing up what will (hopefully) be the definitive source of common ADQL User Defined Functions (UDFs) for IVOA review. What’s a UDF, you ask? Well, it is an extension to ADQL where service operators can invent new functionality. If you have been following this blog for a while, you will probably remember the ivo_healpix_index function from our dereddening exercise (and some earlier postings): That was an UDF, too.

This polishing work reminded me of a UDF I’ve wanted to blog about for a quite a while, available in DaCHS (and thus on our Heidelberg Data Center) since mid-2018: gavo_histogram. This, I claim, is a powerful tool for analyses over large amounts of data with rather moderate local means.

For instance, consider this classic paper on the nature of NGC 2451: What if you were to look for more cases like this, i.e., (indulging in a bit of poetic liberty) open clusters hidden “behind” other open clusters?

Somewhat more technically this would mean figuring out whether there are “interesting” patterns in the distance and proper motion histograms towards known open clusters. Now, retrieving the dozens of millions of stars that, say, Gaia, has in the direction of open clusters to just build histograms – making each row count for a lot less than one bit – simply is wasteful. This kind of counting and summing is much better done server-side.

On the other hand, SQL’s usual histogram maker, GROUP BY, is a bit unwieldy here, because you have lots of clusters, and you will not see anything if you munge all the histograms together. You could, of course, create a bin index from the distance and then group by this bin and the object name, somewhat like ...ROUND(r_est/20) as bin GROUP by name, bin – but that takes quite a bit of mangling before it can conveniently be used, in particular when you take independent distributions over multiple variables (“naive Bayesian”; but then it’s the way to go if you want to capture dependencies between the variables).

So, gavo_histogram to the rescue. Here’s what the server-provided documentation has to say (if you use TOPCAT, you will find this in the ”Service” tab in the TAP windows’ ”Use Service” tab):

gavo_histogram(val REAL, lower REAL, upper REAL, nbins INTEGER) -> INTEGER[]

The aggregate function returns a histogram of val with nbins+2 elements. Assuming 0-based arrays, result[0] contains the number of underflows (i.e., val<lower), result[nbins+1] the number of overflows. Elements 1..nbins are the counts in nbins bins of width (upper-lower)/nbins. Clients will have to convert back to physical units using some external communication, there currently is no (meta-) data as to what lower and upper was in the TAP response.

This may sound a bit complicated, but the gist really is: type gavo_histogram(r_est, 0, 2000, 20) as hist, and you will get back an array with 20 bins, roughly 0..100, 100..200, and so on, and two extra bins for under- and overflows.

Let’s try this for our open cluster example. The obvious starting point is selecting the candidate clusters; we are only interested in famous clusters, so we take them from the NGC (if that’s too boring for you: with TAP uploads you could take the clusters from Simbad, too), which conveniently sits in my data center as

select name, raj2000, dej2000, maj_ax_deg
where obj_type='OCl'

Then, we need to add the stars in their rough directions. That’s a classic crossmatch, and of course these days we use Gaia as the star catalogue:

  select name, source_id
  join gaia.dr2light
  on (
      circle(raj2000, dej2000, maj_ax_deg)))
  where obj_type='OCl')

This is now a table of cluster names and Gaia source ids of the candidate stars. To add distances, you could fiddle around with Gaia parallaxes, but because there is a 1/x involved deriving distances, the error model is complicated, and it is much easier and safer to adopt Bailer-Jones et al’s pre-computed distances and join them in through source_id.

And that distance estimation, r_est, is exactly what we want to take our histograms over – which means we have to group by name and use gavo_histogram as an aggregate function:

with ocl as (
  select name, raj2000, dej2000, maj_ax_deg, source_id
  join gaia.dr2light
  on (
      circle(raj2000, dej2000, maj_ax_deg)))
  where obj_type='OCl')

  gavo_histogram(r_est, 0, 4000, 200) as hist
  join ocl
  using (source_id)
where r_est!='NaN'
group by name

That’s it! This query will give you (admittedly somewhat raw, since we’re ignoring the confidence intervals) histograms of the distances of stars in the direction of all NGC open clusters. Of course, it will run a while, as many millions of stars are processed, but TAP async mode easily takes care of that.

Oh, one odd thing is left to discuss (ignore this paragraph if you don’t know what I’m talking about): r_est!='NaN'. That’s not quite ADQL but happens to do the isnan of normal programming languages at least when the backend is Postgres: It is true if computations failed and there is an actual NaN in the column. This is uncommon in SQL databases, and normal NULLs wouldn’t hurt gavo_histogram. In our distance table, some NaNs slipped through, and they would poison our histograms. So, ADQL wizards probably should know that this is what you do for isnan, and that the usual isnan test val!=val doesn’t work in SQL (or at least not with Postgres).

So, fire up your TOPCAT and run this on the TAP server

You will get a table with 618 (or so) histograms. At this point, TOPCAT can’t do a lot with them. So, let’s emigrate to pyVO and save this table in a file ocl.vot

My visualisation proposition would be: Let’s substract a “background” from the histograms (I’m using splines to model that background) and then plot them row by row; multi-peaked rows in the resulting image would be suspicious.

This is exactly what the programme below does, and the image for this article is a cutout of what the code produces. Set GALLERY = True to see how the histograms and background fits look like (hit ‘q’ to get to the next one).

In the resulting image, any two yellow dots in one line are at least suspicious; I’ve spotted a few, but they are so consipicuous that others must have noticed. Or have they? If you’d like to check a few of them out, feel free to let me know – I think I have a few ideas how to pull some VO tricks to see if these things are real – and if they’ve been spotted before.

So, here’s the yellow spot programme:

from astropy.table import Table
import matplotlib.pyplot as plt
import numpy
from scipy.interpolate import UnivariateSpline


def substract_background(arr):
    x = range(len(arr))
    mean = sum(arr)/len(arr)
    arr = arr/mean
    background = UnivariateSpline(x, arr, s=100)
    cleaned = arr-background(x)

    if GALLERY:
        plt.plot(x, arr)
        plt.plot(x, background(x))

    return cleaned

def main():
    tab ="ocl.vot")
    hist = numpy.array([substract_background(r["hist"][1:-1])
      for r in tab])
    plt.matshow(hist, cmap='gist_heat')

if __name__=="__main__":

Tutorial Renewal

[Image: The DaCHS Tutorial among other seminal works]

DaCHS’ documentation (readthedocs mirror) has two fat pieces and a lot of smaller read-as-you-go pieces. One of the behmoths, the reference documentation, at roughly 350 PDF pages, has large parts generated from source code, and there is no expectation that anyone would ever read it linearly. Hence, I wasn’t terribly worried about unreadable^Wpassages of questionable entertainment value in there.

That’s a bit different with the tutorial (also available as 150 page PDF; epub on request): I think serious DaCHS deployers ought to read the DaCHS Basics and the chapters on configuring DaCHS and the interaction with the VO Registry, and they should skim the remaining material so they are at least aware of what’s there.

Ok. I give you that is a bit utopian. But given that pious wish I felt rather bad that the tutorial has become somewhat incoherent in the years since I had started the piece in April 2009 (perhaps graciously, the early history is not visible at the documentation’s current github home). Hence, when applying for funds under our current e-inf-astro project, I had promised to give the tutorial a solid makeover as, hold your breath, Milestone B1-5, due in the 10th quarter. In human terms: last December.

When it turned out the Python 3 migration was every bit as bad as I had feared, it became clear that other matters had to take priority and that we might miss this part of that “milestone” (sorry, I can’t resist these quotes). And given e-inf-astro only had two quarters to go after that, I prepared for having to confess I couldn’t make good on my promise of fixing the tutorial.

But then along came Corona, and reworking prose seemed the ideal pastime for the home office. So, on April 4, I forked off a new-tutorial branch and started a rather large overhaul that, among others, resulted in the operators’ guide with its precarious position between tutorial and reference being largely absorbed into the tutorial. In all, off and on over the last few months I accumulated (according to git diff --shortstat 6372 inserted and 3453 deleted lines in the tutorial’s source. Since that source currently is 7762 lines, I’d say that’s the complete makeover I had promised. Which is good as e-inf-astro will be over next Wednesday (but don’t worry, our work is still funded).

So – whether you are a DaCHS expert, think about running it, or if you’re just curious what it takes to build VO services, let me copy from index.html: Tutorial on importing data (tutorial.html, tutorial.pdf, tutorial.rstx). The ideal company for your vacation!

And if you find typos, boring pieces, overly radical advocacy or anything else you don’t like: there’s a bug tracker for you (not to mention PRs are welcome).

DaCHS 2.1: Say hello to Python 3

Image: DaCHS and python logos

Today, I have released DaCHS 2.1, the first stable DaCHS running on Python 3. I have tried hard to make the major version move painless and easy, and indeed “pure DaCHS” RDs should just continue to work. But wherever there’s Python in your RDs or near them, things may break, since Python 3 is different from Python 2 in some rather fundamental ways.

Hence, the Debian package even has a new name: gavodachs2-server. Unless you install that, things will keep running as they do. I will keep fixing serious DaCHS 1 bugs for a while, so there’s no immediate urgency to migrate. But unless you migrate, you will not see any new features, so one of these days you will have to migrate anyway. Why not do it today?

Migrating to DaCHS 2

In principle, just say apt install gavodachs2-server and hope for the best. If you have a development machine and regression tests defined, this is actually what we recommend, and we’d be very grateful to learn of any problems you may encounter.

If you’d rather be a little more careful, Carlos Henrique Brandt has kindly updated his Docker files in order to let you spot problems before you mess up your production server. See Test Migration for a quick intro on how to do that. If you spot any problems that are not related to the Python 3 pitfalls mentioned in the howto linked below or nevow exodus, please tell me or (preferably) the dachs-support mailing list.

A longer, more or less permanent piece elaborating possible migration pains, is in our how-to documentation: How do I go from DaCHS1 to DaCHS2?

What’s new in DaCHS2?

I’ve used the opportunity of the major version change to remove a few (mis-) features that I’m rather sure nobody uses; and there are a few new features, too. Here’s a rundown of the more notable changes:

  • DaCHS now produces VOTable 1.4 by default. This is particularly notable when you provide TIMESYS metadata (on which I’ll report some other time).
  • When doing spatial indices, prefer the new //scs#pgs-pos-index to //scs#q3cindex. While q3c is still faster and more compact than pgsphere when just indexing points, on the longer run I’d like to shed the extra dependency (note, however, that the pgsphere index limits the cone search to a maximum radius of 90 degrees at this point).
  • Talking about Cone Search: For custom parameters, DaCHS has so far used SSA-like syntax, so you could say, for instance, vmag=12/13 (for “give me rows where vmag is between 12 and 13”). Since I don’t think this was widely used, I’ve taken the liberty to migrate to DALI-compliant syntax, where intervals are written as they would be in VOTable PARAM values: vmag=12 13.
  • In certain situations, DaCHS tries to enable parallel queries (previously on this blog).
  • Some new ADQL user defined functions: gavo_random_normal, gavo_mocintersect, and gavo_mocunion. See the TAP capabilities for details, and note that the moc functions will fail until we put out a new pgsphere package that has support for the MOC-MOC operations.
  • dachs info (highly recommended after an import) now takes a --sample-percent option that helps when doing statistics on large tables.
  • For SSA services serving something other than spectra (in all likelihood, timeseries), you can now set a productType meta as per the upcoming SimpleDALRegExt 1.2.
  • If you have large, obscore-published SIAP tables, re-index them (dachs imp -I q) so queries over s_ra and s_dec get index support, too.
  • Since we now maintain RD state in the database, you can remove the files /var/gavo/state/updated* after upgrading.
  • When writing datalink metaMakers returning links, you can (and should, for new RDs) define the semantics in an attribute to the element rather in the LinkDef constructor.
  • Starting with this version, it’s a good idea to run dachs limits after an import. This, right now, will mainly set an estimate for the number of rows in a table, but that’s already relevant because the ADQL translator uses it to help the postgres query planner. It will later also update various kinds of column metadata that, or so I hope, will become relevant in VODataService 1.3.
  • forceUnique on table elements is now a no-op (and should be removed); just define a dupePolicy as before.
  • If you write bad obscore mappings, it could so far be hard to figure out the reason of the failure and, between lots of confusing error messages, to fix it. Instead, you can now run “dachs imp //obscore recover“ in such a situation. It will re-create the obscore table and throw out all stanzas that fail; after that, you can fix the obscore declarations that were thrown out one by one.
  • If you run DaCHS behind a reverse proxy that terminates https, you can now set [web]adaptProtocol in /etc/gavo.rc to False. This will make that setup work for form-based services, too.
  • If you have custom OAI set name (i.e., anything but local and ivo_managed in the sets attribute of publish elements), you now have to declare them in [ivoa]validOAISets.
  • Removed things: the docform renderer (use form instead), the soap renderer (well, it’s not actually removed, it’s just that the code it depends on doesn’t exist on python3 any more), sortKey on services (use the defaultSortKey property), //scs#q3cpositions (port the table to have ra and dec and one of the SCS index mixins), the (m)img.jpeg renderers (if you were devious enough to use these, let me know), and quite a few even more exotic things.

Some Breaking Changes

Python 3 was released in 2008, not long after DaCHS’ inception, but since quite a few of the libraries it uses to do its job haven’t been available for Python 3, we have been reluctant to make the jump over the past then years (and actually, the stability of the python2 platform was a very welcome thing).

Indeed, the most critical of our dependencies, twisted, only became properly usable with python3 in, roughly, 2017. Indeed, large parts of DaCHS weren’t even using twisted directly, but rather a nice add-on to it called nevow. Significant parts of nevow bled through to DaCHS operators; for instance, the render functions or the entire HTML templating.

Nevow, unfortunately, fell out of fashion, and so nobody stepped forward to port it. And when I started porting it myself I realised that I’m mainly using the relatively harmless parts of nevow, and hence after a while I figured that I could replace the entire dependency by something like a 1000 lines in DaCHS, which, given significant aches when porting the whole of nevow, seemed like a good deal.

The net effect is that if you built code on top of nevow – most likely in the form of a custom renderer – that will break now, and porting will probably be rather involved (having ported ~5 custom renderers, I think I can tell). If this concerns you, have a look at the README in gavo.formal (and then complain because it’s mainly notes to myself at this point). I feel a bit bad about having to break things that are not totally unreasonable in this drastic way and thus offer any help I can give to port legacy DaCHS code.

Outside of these custom renderers, there should just be a single visible change: If you have used n:data="some_key" in nevow templates to pull data from dictionaries, that won’t work any longer. Use n:data="key some_key" n:render="str" instead. And it turns out that this very construct was used in the default root template, which you may have derived from. So – see if you have /var/gavo/web/templates/root.html and if so, whether there is <ul n:data="chunk" in there. If you have that, change it to <ul n:data="key chunk".

Update (2020-11-19): Two only loosely related problems have surfaced during updates. In particular if you are updating on rather old installations, you may want to look at the points on Invalid script type preIndex and function spoint_in already exists in our list of common problems.

Building consensus

[image: Markus, handwringing]
Sometimes, building consensus takes a little bending: Me, at the Shanghai Interop of 2017. In-joke: there’s “STC” on the slide.
In the Virtual Observatory, procedures are built on consensus: No (relevant) decisions are passed based some sort of majority vote. While I personally think that’s a very good thing in general – you really don’t want to clobber minorities, and I couldn’t even give a minimal size of such a minority below which it might be ok to ignore them –, there is a profound operational reason for that: We cannot force data centers or software writers to comply with our standards, so they had better agree with them in the first place.

However, building consensus (to avoid Chomsky’s somewhat odious notion of manufacturing consent) is hard. In my current work, this insight manifests itself most strongly when I wear my hat as chair of the IVOA Semantics Working Group, where we need to sort items from a certain part of the world into separate boxes and label those, that is, we’re building vocabularies. “Part of the world” can be formalised, and there are big phrases like “universe of discourse” to denote such formalisations, but to give you an idea, it’s things like reference frames, topics astronomy in general talks about (think journal keywords), relationships between data collections and services, or the roles of files related to or making up a dataset. If you visit the VO’s vocabulary repository, you will see what parts we are trying to systematise, and if you skim the current draft for the next release of Vocabularies in the VO, in section two you can find a few reasons why we are bothering to do that.

As you may expect if you have ever tried classifications like this, what boxes (”concepts” in the argot of the semantics folks) there should be and how to label them are questions with plenty of room for dissent. A case study for this is the discussion on VEP-001 and its successors that has been going on since late last year; it also illustrates that we are not talking about bikeshedding here. The discussion clarified much and, in particular, led to substantial improvements not only to the concept in question but also far beyond that. If you are interested, have a look at a few mail threads (here, here, here, or here; more discussion happened live at meetings).

An ideal outcome of such a process is, of course, a solution that is obvious in retrospect, so everyone just agrees. Sometimes, that doesn’t happen, and one of these times is VEP-001 and the VEP-003 it evolved into. A spontanous splinter between sessions of this week’s Virtual Interop yielded two rather sensible names for the concept we had identified in the previous debates: #sibling on the one hand, and #co-derived on the other (in case you’re RDF-minded: the full vocabulary URIs are obtained by prefixing this with the vocabulary URI, Choosing between the two is a bit of a matter of taste, but also of perhaps changing implementations, and so I don’t see a clear preference. And the people in the conference didn’t reach an agreement before people on the North American west coast really had to have some well-deserved sleep.

In such a situation – extensive discussion yields some very few, apparently rather equivalent solution –, I suspect it is the time to resort to some sort of polling after all. So, in the session I’ve asked the people involved to give their pain level on a scale of 1 to 10. Given there are quite a few consensus scales out there already (I’m too lazy to look for references now, but I’ll retrofit them here if you send some in), I felt this was a bit hasty after I had closed the z**m^H^H^H^H telecon client. But then, thinking about it, I started to like that scale, and so during a little bike ride I came up with what’s below. And since I started liking it, I thought I could put it into words, and into a form I can reference when similar situations come up in the future. And so, here it is:

Markus’ Pain Level Scale

  1. Oh wow. I’m enthusiastic about it, and I’d get really cross if we didn’t do it.
  2. It’s great. I don’t think we’ll find a better solution. People better have really strong reasons to reject it.
  3. Fine. Just go ahead.
  4. Quite reasonable. I have some doubts, but I either don’t have a good alternative, or the alternatives certainly won’t improve matters.
  5. Reasonable. I can live with it, possibly accepting a very moderate amount of pain (like: change an implemenation that I think is fine as it is).
  6. Sigh. I don’t like it much. If you think it’s useful, do it, but don’t blame me if it later turns out it stinks.
  7. Ouch. I wish we didn’t have to go there. For instance: This is going to uglify a few things I care about.
  8. Yikes. I think it’s a bad idea. Honestly, let’s not do it. It’s going to make quite a few things a lot uglier, though I give you it might still just barely work.
  9. OMG. What are you thinking? I won’t go near it, and I pity everyone who will have to. And it’s quite likely going to blow up some things I care about.
  10. Blech. To me, this clearly is a grave mistake that will impact a lot of things very adversely. If I can do anything within reason to stop it, I’ll do it. Consider this a veto, and shame on you if you override it.

You can qualify this with:

I’ve thought long and hard about this, and I think I understand the matter in depth. You’ll hence need arguments of the profundity of the Earth’s outer core to sway me.
I’ve thought about this, and as far as I understand the matter I’m sure about it. More information, solid arguments, or a sudden inspiration while showering might still sway me.
This is a gut feeling. It could very well be phantom pain. Feel free to try a differential diagnosis.

If you like the scale, too, feel free to reference it as

GAVO vs. Corona

[Group Photo of the 2018 Victoria Interop]
You won’t see something like this (the May 2018 Interop group photo) in Spring 2020: The Sidney Interop, planned for early May, is going to take place using remote tools. Some of which I’d rather do without.

The Corona pandemic, regrettably, has also brought with it a dramatic move to closed, proprietary communication and collaboration platforms: I’m being bombarded by requests to join Zoom meetings, edit Google docs, chat on Slack, “stream” something on any of Youtube, Facebook, Instagram, or Sauron (I’ve made one of these up).

Mind you, that’s within the Virtual Observatory. Call me pig-headed, but I feel that’s a disgrace when we’re out to establish Free and open standards (for good reasons). To pick a particularly sad case, Slack right now is my pet peeve because they first had an interface to IRC (which has been doing what they do since the late 80ies, though perhaps not as prettily in a web browser) and then cut it when they had sufficient lock-in. Of course, remembering how Google first had XMPP (that’s the interoperable standard for instant messaging) in Google talk and then cut that, too… ah well, going proprietary unfortunately is just good business sense once you have sufficient lock-in.

Be that as it may, I was finally fed up with all this proprietary tech and set up something suitable for conferecing building on open, self-hostable components. It’s on, and you’re welcome to use it for your telecons (assuming that when you’re reading this blog, you have at least some relationship to astronomy and open standards).

What’s in there?

Unfortunately, there doesn’t seem to be an established, Free conferencing system based on SIP/RTP, which I consider the standard for voice communication on the internet (if you’ve never heard of it: it’s what your landline phone uses in all likelihood). That came as a bit of a surprise to me, but the next best thing is a Free and multiply implemented solution, and there’s the great mumble system that (at least for me) works so much better than all the browser-based horrors, not to mention it’s quite a bit more bandwidth-effective. So: Get a client and connect to Join one of the two meeting rooms, done.

Mumble doesn’t have video, which, considering I’ve seen enough of peoples’ living rooms (not to mention Zoom’s silly bluebox backgrounds) to last a lifetime, counts as an advantage in my book. However, being able to share a view on a document (or slide set) and point around in it is a valid use case. Bonus points if the solution to that does not involve looking at other people’s mail, IM notifications, or screen backgrounds.

Now, a quick web search did not turn up anything acceptable to me, and since I’ve always wanted to play with websockets, I’ve created poatmyp: With it, you upload a PDF, distribute the link to your meeting partners, and all participants will see the slides and a shared pointer. And they can move around in the document together.

What’s left is shared editing. I’ve looked at a few implementations of this, but, frankly, there’s too much npm and the related curlbashware in this field to make any of it enjoyable; also, it seems nobody has bothered to provide a Debian package of one of the systems. On the other hand, there are a few trustworthy operators of etherpads out there, so for now we are pointing to them on telco.g-vo.

Setting up a mumble server and poatmyp isn’t much work if you know how to configure an nginx and have a suitable box on the web. So: perhaps you’ll use this opportunity to re-gain a bit of self-reliance? You see, there’s little point to have your local copy of the Gaia catalogue, and doing that right is hard. Thanks to people writing Free software, running a simple telecon infrastructure, on the other hand, isn’t hard any more.

The Bochum Galactic Disk Survey

[Image: Patches of higher perceived variability on the Sky]
Fig 1: How our haphazard variability ratio varies over the sky (galactic coordinates). And yes, it’s clear that this isn’t dominated by physical variability.

About a year ago, I reported on a workshop on “Large Surveys with Small Telescopes” in Bamberg; at around the same time, I’ve published an example for those, the Bochum Galactic Disk Survey BGDS, which used a twin 15 cm robotic telescope in some no longer forsaken place in the Andes mountains to monitor the brighter stars in the southern Milky Way. While some tables from an early phase of the survey have been on VizieR for a while, we now publish the source images (also in SIAP and Obscore), the mean photometry (via SCS and TAP) and, perhaps potentially most fun of all, the the lightcurves (via SSAP and TAP) – a whopping 35 million of the latter.

This means that in tools like Aladin, you can now find such light curves (and images in two bands from a lot of epochs) when you are in the survey’s coverage, and you can run TAP queries on GAVO’s server against the full photometry table and the time series.

Regular readers of this blog will not be surprised to see me use this as an excuse to show off a bit of ADQL trickery.

If you have a look at the bgds.phot_all table in your favourite TAP client, you’ll see that it has a column amp, giving the difference between the highest and lowest magnitude. The trouble is that amp for almost all objects just reflects the measurement error rather than any intrinsic variability. To get an idea what’s “normal” (based on the fact that essentially all stars have essentially constant luminosity on the range and resolution scales considered here), run a query like

SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n
FROM bgds.phot_all
WHERE nobs>10

As this scans the entire 75 million rows of the table, you will probably have to use async mode to run this.

[image: distribution of amplitude/mag error
Figure 2: The distribution of amplitude over magnitude error for all BGDS objects with nobs>10 (blue) and the subset with a mean magnitude brighter than 15 (blue).

When it comes back, you will have, for objects where any sort of statistics make sense at all (hence nobs>10), a histogram (of sorts) of the amplitude in units of upstream’s magnitude error estimation. If you log-log-plot this, you’ll see something like Figure 2. The curve at least tells you that the magnitude error estimate is not very far off – the peak at about 3 “sigma” is not unreasonable since about half of the objects have nobs of the order of a hundred and thus would likely contain outliers that far out assuming roughly Gaussian errors.

And if you’re doing a rough cutoff at amp/magerr>10, you will get perhaps not necessarily true variables, but, at least potentially interesting objects.

Let’s use this insight to see if we spot any pattern in the distribution of these interesting objects. We’ll use the HEALPix technique I’ve discussed three years ago in this blog, but with a little twist from ADQL 2.1: The Common Table Expressions or CTEs I have already mentioned in my blog post on ADQL 2.1 and then advertised in the piece on the Henry Draper catalogue. The brief idea, again, is that you can write queries and give them a name that you can use elsewhere in the query as if it were an actual table. It’s not much different from normal subqueries, but you can re-use CTEs in multiple places in the query (hence the “common”), and it’s usually more readable.

Here, we first create a version of the photometry table that contains HEALPixes and our variability measure, use that to compute two unsophisticated per-HEALPix statistics and eventually join these two to our observable, the ratio of suspected variables to all stars observed (the multiplication with 1.0 is a cheap way to make a float out of a value, which is necessary here because a/b does integer division in ADQL if a and b are both integers):

WITH photpoints AS (
    amp/err_mag AS redamp,
    ivo_healpix_index(5, ra, dec) AS hpx
  FROM bgds.phot_all
    AND band_name='SDSS i'
    AND mean_mag<16),
all_objs AS (
  SELECT count(*) AS ct,
    FROM photpoints GROUP BY hpx),
strong_var AS (
    FROM photpoints
    WHERE redamp>4 AND amp>1 GROUP BY hpx)
  strong_var.ct/(1.0*all_objs.ct) AS obs,
  all_objs.ct AS n,
FROM strong_var JOIN all_objs USING (hpx)
WHERE all_objs.ct>20

If you plot this using TOPCAT’s HEALPix thingy and ask it to use Galactic coordinates, you’ll end up with something like Figure 1.

There clearly is some structure, but given that the variables ratio reaches up to 0.2, this is still reflecting instrumental or pipeline effects and thus earthly rather than Astrophysics. And that’s going beyond what I’d like to talk about on a VO blog, although I’l take any bet that you will see significant structure in the spatial distribution of the variability ratio at about any magnitude cutoff, since there are a lot of different population mixtures in the survey’s footprint.

Be that as it may, let’s have a quick look at the time series. As with the short spectra from Byurakan use case, we’ve stored the actual time series as arrays in the database (the mjd and mags columns in bgds.ssa_time_series. Unfortunately, since they are a lot less array-like than homogeneous spectra, it’s also a lot harder to do interesting things with them without downloading them (I’m grateful for ideas for ADQL functions that will let you do in-DB analysis for such things). Still, you can at least easily download them in bulk and then process them in, say, python to your heart’s content. The Byurakan use case should give you a head start there.

For a quick demo, I couldn’t resist checking out objects that Simbad classifies as possible long-period variables (you see, as I write this, the public bohei over Betelgeuse’s brief waning is just dying down), and so I queried Simbad for:

SELECT ra, dec, main_id
FROM basic
     POINT('', ra, dec),
     POLYGON('', 127, -30, 112, -30, 272, -30, 258, -30))

(as of this writing, Simbad still needs the ADQL 2.0-compliant first arguments to POINT and POLYGON), where the POLYGON is intended to give the survey’s footprint. I obtained that by reading off the coordinates of the corners in my Figure 1 while it was still in TOPCAT. Oh, and I had to shrink it a bit because Simbad (well, the underlying Postgres server, and, more precisely, its pg_sphere extension) doesn’t want polygons with edges longer than π. This will soon become less pedestrian: MOCs in relational databases are coming; more on this soon.

[TOPCAT action shot with a light curve display]
Fig 3: V566 Pup’s BGDS lightcuve in a TOPCAT configured to auto-plot the light curves associated with a row from the bgds.ssa_time_series table on the GAVO DC TAP service.

If you now do the usual spiel with an upload crossmatch to the bgds.ssa_time_series table and check “Plot Table” in Views/Activation Action, you can quickly page through the light curves (TOPCAT will keep the plot style as you go from dataset to dataset, so it’s worth configuring the lines and the error bars). Which could bring you to something like Fig. 3; and that would suggest that V* V566 Pup isn’t really long-period unless the errors are grossly off.

Parallel Queries

Image: Plot of run times
An experiment with parallel querying of PPMX, going from single-threaded execution to using seven workers.

Let me start this post with a TL;DR for

Large analysis queries (like those that contain a GROUP BY clause) profit a lot from parallel execution, and you needn’t do a thing for that.
DaCHS operators
When you have large tables, Postgres 11 together with the next DaCHS release may speed up your responses quite dramatically in some cases.

So, here’s the story –

I’ve finally overcome my stretch trauma and upgraded the Heidelberg data center’s database server to Debian buster. With that, I got Postgres 11, and I finally bothered to look into what it takes to enable parallel execution of database queries.

Turns out: My Postgres started to do parallel execution right away, but just in case, I went for the following lines in postgresql.conf:

max_parallel_workers_per_gather = 4
max_worker_processes = 10
max_parallel_workers = 10

Don’t quote me on this – I frankly admit I haven’t really developed a feeling for the consequences of max_parallel_workers_per_gather and instead just did some experiments while the box was loaded otherwise, determining where raising that number has a diminishing return (see below for more on this).

The max_worker_processes thing, on the other hand, is an educated guess: on my data center, there’s essentially never more than one person at a time who’s running “interesting”, long-running queries (i.e., async), and that person should get the majority of the execution units (the box has 8 physical CPUs that look like 16 cores due to hyperthreading) because all other operations are just peanuts in comparison. I’ll gladly accept advice to the effect that that guess isn’t that educated after all.

Of course, that wasn’t nearly enough. You see, since TAP queries can return rather large result sets – on the GAVO data center, the match limit is 16 million rows, which for a moderate row size of 2 kB already translates to 32 GB of memory use if pulled in at once, half the physical memory of that box –, DaCHS uses cursors (if you’re a psycopg2 person: named cursors) to stream results and write them out to disk as they come in.

Sadly, postgres won’t do parallel plans if it thinks people will discard a large part of the result anyway, and it thinks that if you’re coming through a cursor. So, in SVN revision 7370 of DaCHS (and I’m not sure if I’ll release that in this form), I’m introducing a horrible hack that, right now, just checks if there’s a literal “group” in the query and doesn’t use a cursor if so. The logic is, roughly: With GROUP, the result set probably isn’t all that large, so streaming isn’t that important. At the same time, this type of query is probably going to profit from parallel execution much more than your boring sequential scan.

This gives rather impressive speed gains. Consider this example (of course, it’s selected to be extreme):

import contextlib
import pyvo
import time

def timeit(activity):
  start_time = time.time()
  end_time = time.time()
  print("Time spent on {}: {} s".format(activity, end_time-start_time))

svc = pyvo.tap.TAPService("")
with timeit("Cold (?) run"):
  svc.run_sync("select round(Rmag) as bin, count(*) as n"
    " from group by bin")
with timeit("Warm run"):
  svc.run_sync("select round(Rmag) as bin, count(*) as n"
    " from group by bin")

(if you run it yourself and you get warnings about VOTable versions from astropy, ignore them; I’m right and astropy is wrong).

Before enabling parallel execution, this was 14.5 seconds on a warm run, after, it was 2.5 seconds. That’s an almost than a 6-fold speedup. Nice!

Indeed, that holds beyond toy examples. The showcase Gaia density plot,

        count(*) AS obs,
        source_id/140737488355328 AS hpx
FROM gaia.dr2light

(the long odd number is 235416-6, which turns source_ids into level 6-HEALPixes as per Gaia footnote id; please note that Postgres right now isn’t smart enough to parallelise ivo_healpix), which traditionally ran for about an hour is now done in less than 10 minutes.

In case you’d like to try things out on your postgres, here’s what I’ve done to establish the max_parallel_workers_per_gather value above.

  1. Find a table with a few 1e7 rows. Think of a query that will return a small result set in order to not confuse . In my case, that’s a magnitude histogram, and the query would be
    select round(Rmag) as bin, count(*) 
    as n from 
    group by bin;

    Run this query once so the data is in the disk cache (the query is “warm”).

  2. Establish a non-parallel baseline. That’s easy to do:
    set max_parallel_workers_per_gather=0;
  3. Then run
    explain analyze select round(Rmag) as bin, count(*) as n from group by bin;

    You should see a simple query plan with the runtime for the non-parallel execution – in my case, a bit more than 12 seconds.

  4. Then raise the number of max_parallel_workers_per_gatherer successively. Make sure the query plan has lines of the form “Workers Planned” or so. You should see that the execution time falls with the number of workers you give it, up to the value of max_worker_processes – or until postgres decides your table is too small to warrant further parallelisation, which for my settings happened at 7.

Note, though, that in realistic, more complex queries, there will probably be multiple operations that will profit from parallelisation in a single query. So, if in this trivial example you can go to 15 gatherers and still see an improvement, this could actually make things slower for complex queries. But as I said above: I have no instinct yet for how things will actually work out. If you have experiences to share: I’m sure I’m not the only person on dachs-users who’t be interested.