DaCHS 2.1: Say hello to Python 3

Image: DaCHS and python logos

Today, I have released DaCHS 2.1, the first stable DaCHS running on Python 3. I have tried hard to make the major version move painless and easy, and indeed “pure DaCHS” RDs should just continue to work. But wherever there’s Python in your RDs or near them, things may break, since Python 3 is different from Python 2 in some rather fundamental ways.

Hence, the Debian package even has a new name: gavodachs2-server. Unless you install that, things will keep running as they do. I will keep fixing serious DaCHS 1 bugs for a while, so there’s no immediate urgency to migrate. But unless you migrate, you will not see any new features, so one of these days you will have to migrate anyway. Why not do it today?

Migrating to DaCHS 2

In principle, just say apt install gavodachs2-server and hope for the best. If you have a development machine and regression tests defined, this is actually what we recommend, and we’d be very grateful to learn of any problems you may encounter.

If you’d rather be a little more careful, Carlos Henrique Brandt has kindly updated his Docker files in order to let you spot problems before you mess up your production server. See Test Migration for a quick intro on how to do that. If you spot any problems that are not related to the Python 3 pitfalls mentioned in the howto linked below or to the nevow exodus, please tell me or (preferably) the dachs-support mailing list.

A longer, more or less permanent piece elaborating possible migration pains is in our how-to documentation: How do I go from DaCHS1 to DaCHS2?

What’s new in DaCHS2?

I’ve used the opportunity of the major version change to remove a few (mis-) features that I’m rather sure nobody uses; and there are a few new features, too. Here’s a rundown of the more notable changes:

  • DaCHS now produces VOTable 1.4 by default. This is particularly notable when you provide TIMESYS metadata (on which I’ll report some other time).
  • When doing spatial indices, prefer the new //scs#pgs-pos-index to //scs#q3cindex. While q3c is still faster and more compact than pgsphere when just indexing points, in the longer run I’d like to shed the extra dependency (note, however, that the pgsphere index limits the cone search to a maximum radius of 90 degrees at this point).
  • Talking about Cone Search: For custom parameters, DaCHS has so far used SSA-like syntax, so you could say, for instance, vmag=12/13 (for “give me rows where vmag is between 12 and 13”). Since I don’t think this was widely used, I’ve taken the liberty to migrate to DALI-compliant syntax, where intervals are written as they would be in VOTable PARAM values: vmag=12 13 (a client-side sketch of this follows right after this list).
  • In certain situations, DaCHS tries to enable parallel queries (previously on this blog).
  • Some new ADQL user defined functions: gavo_random_normal, gavo_mocintersect, and gavo_mocunion. See the TAP capabilities for details, and note that the moc functions will fail until we put out a new pgsphere package that has support for the MOC-MOC operations.
  • dachs info (highly recommended after an import) now takes a --sample-percent option that helps when doing statistics on large tables.
  • For SSA services serving something other than spectra (in all likelihood, timeseries), you can now set a productType meta as per the upcoming SimpleDALRegExt 1.2.
  • If you have large, obscore-published SIAP tables, re-index them (dachs imp -I q) so queries over s_ra and s_dec get index support, too.
  • Since we now maintain RD state in the database, you can remove the files /var/gavo/state/updated* after upgrading.
  • When writing datalink metaMakers returning links, you can (and should, for new RDs) define the semantics in an attribute to the element rather than in the LinkDef constructor.
  • Starting with this version, it’s a good idea to run dachs limits after an import. This, right now, will mainly set an estimate for the number of rows in a table, but that’s already relevant because the ADQL translator uses it to help the postgres query planner. It will later also update various kinds of column metadata that, or so I hope, will become relevant in VODataService 1.3.
  • forceUnique on table elements is now a no-op (and should be removed); just define a dupePolicy as before.
  • If you write bad obscore mappings, it could so far be hard to figure out the reason for the failure and, between lots of confusing error messages, to fix it. Instead, you can now run “dachs imp //obscore recover“ in such a situation. It will re-create the obscore table and throw out all stanzas that fail; after that, you can fix the obscore declarations that were thrown out one by one.
  • If you run DaCHS behind a reverse proxy that terminates https, you can now set [web]adaptProtocol in /etc/gavo.rc to False. This will make that setup work for form-based services, too.
  • If you have custom OAI set names (i.e., anything but local and ivo_managed in the sets attribute of publish elements), you now have to declare them in [ivoa]validOAISets.
  • Removed things: the docform renderer (use form instead), the soap renderer (well, it’s not actually removed, it’s just that the code it depends on doesn’t exist on python3 any more), sortKey on services (use the defaultSortKey property), //scs#q3cpositions (port the table to have ra and dec and one of the SCS index mixins), the (m)img.jpeg renderers (if you were devious enough to use these, let me know), and quite a few even more exotic things.
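
To illustrate the DALI-style interval constraint mentioned in the Cone Search item above, here is a minimal client-side sketch using plain requests; the service URL and the vmag parameter are placeholders for whatever your SCS endpoint and custom parameter are actually called.

import requests

# Hypothetical SCS endpoint and custom parameter; the point is only the
# DALI interval syntax: "12 13" asks for rows with 12 <= vmag <= 13.
resp = requests.get(
    "http://dc.example.org/myres/q/scs.xml",
    params={"RA": 168.2, "DEC": 22.5, "SR": 0.5, "vmag": "12 13"})
print(resp.text[:300])   # a VOTable with the matching rows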

Some Breaking Changes

Python 3 was released in 2008, not long after DaCHS’ inception, but since quite a few of the libraries it uses to do its job haven’t been available for Python 3, we have been reluctant to make the jump over the past ten years (and actually, the stability of the python2 platform was a very welcome thing).

Indeed, the most critical of our dependencies, twisted, only became properly usable with python3 in, roughly, 2017. On top of that, large parts of DaCHS weren’t even using twisted directly, but rather a nice add-on to it called nevow. Significant parts of nevow bled through to DaCHS operators; for instance, the render functions or the entire HTML templating.

Nevow, unfortunately, fell out of fashion, and so nobody stepped forward to port it. When I started porting it myself, I realised that I was mainly using the relatively harmless parts of nevow, and hence after a while I figured I could replace the entire dependency with something like 1000 lines in DaCHS, which, given the significant aches of porting the whole of nevow, seemed like a good deal.

The net effect is that if you built code on top of nevow – most likely in the form of a custom renderer – that will break now, and porting will probably be rather involved (having ported ~5 custom renderers, I think I can tell). If this concerns you, have a look at the README in gavo.formal (and then complain because it’s mainly notes to myself at this point). I feel a bit bad about having to break things that are not totally unreasonable in this drastic way and thus offer any help I can give to port legacy DaCHS code.

Outside of these custom renderers, there should just be a single visible change: If you have used n:render="some_key" in nevow templates to pull data from dictionaries, that won’t work any longer. Use n:data="key some_key" n:render="str" instead.

Building consensus

[image: Markus, handwringing]
Sometimes, building consensus takes a little bending: Me, at the Shanghai Interop of 2017. In-joke: there’s “STC” on the slide.
In the Virtual Observatory, procedures are built on consensus: No (relevant) decisions are passed based on some sort of majority vote. While I personally think that’s a very good thing in general – you really don’t want to clobber minorities, and I couldn’t even give a minimal size of such a minority below which it might be ok to ignore them –, there is a profound operational reason for that: We cannot force data centers or software writers to comply with our standards, so they had better agree with them in the first place.

However, building consensus (to avoid Chomsky’s somewhat odious notion of manufacturing consent) is hard. In my current work, this insight manifests itself most strongly when I wear my hat as chair of the IVOA Semantics Working Group, where we need to sort items from a certain part of the world into separate boxes and label those, that is, we’re building vocabularies. “Part of the world” can be formalised, and there are big phrases like “universe of discourse” to denote such formalisations, but to give you an idea, it’s things like reference frames, topics astronomy in general talks about (think journal keywords), relationships between data collections and services, or the roles of files related to or making up a dataset. If you visit the VO’s vocabulary repository, you will see what parts we are trying to systematise, and if you skim the current draft for the next release of Vocabularies in the VO, in section two you can find a few reasons why we are bothering to do that.

As you may expect if you have ever tried classifications like this, what boxes (”concepts” in the argot of the semantics folks) there should be and how to label them are questions with plenty of room for dissent. A case study for this is the discussion on VEP-001 and its successors that has been going on since late last year; it also illustrates that we are not talking about bikeshedding here. The discussion clarified much and, in particular, led to substantial improvements not only to the concept in question but also far beyond that. If you are interested, have a look at a few mail threads (here, here, here, or here; more discussion happened live at meetings).

An ideal outcome of such a process is, of course, a solution that is obvious in retrospect, so everyone just agrees. Sometimes, that doesn’t happen, and one of these times is VEP-001 and the VEP-003 it evolved into. A spontaneous splinter between sessions of this week’s Virtual Interop yielded two rather sensible names for the concept we had identified in the previous debates: #sibling on the one hand, and #co-derived on the other (in case you’re RDF-minded: the full vocabulary URIs are obtained by prefixing this with the vocabulary URI, http://www.ivoa.net/rdf/datalink/core). Choosing between the two is a bit of a matter of taste, but also of perhaps changing implementations, and so I don’t see a clear preference. And the people in the conference didn’t reach an agreement before people on the North American west coast really had to have some well-deserved sleep.

In such a situation – extensive discussion yields a very few, apparently rather equivalent solutions –, I suspect it is time to resort to some sort of polling after all. So, in the session I’ve asked the people involved to give their pain level on a scale of 1 to 10. Given there are quite a few consensus scales out there already (I’m too lazy to look for references now, but I’ll retrofit them here if you send some in), I felt this was a bit hasty after I had closed the z**m^H^H^H^H telecon client. But then, thinking about it, I started to like that scale, and so during a little bike ride I came up with what’s below. And since I started liking it, I thought I could put it into words, and into a form I can reference when similar situations come up in the future. And so, here it is:

Markus’ Pain Level Scale

  1. Oh wow. I’m enthusiastic about it, and I’d get really cross if we didn’t do it.
  2. It’s great. I don’t think we’ll find a better solution. People better have really strong reasons to reject it.
  3. Fine. Just go ahead.
  4. Quite reasonable. I have some doubts, but I either don’t have a good alternative, or the alternatives certainly won’t improve matters.
  5. Reasonable. I can live with it, possibly accepting a very moderate amount of pain (like: change an implementation that I think is fine as it is).
  6. Sigh. I don’t like it much. If you think it’s useful, do it, but don’t blame me if it later turns out it stinks.
  7. Ouch. I wish we didn’t have to go there. For instance: This is going to uglify a few things I care about.
  8. Yikes. I think it’s a bad idea. Honestly, let’s not do it. It’s going to make quite a few things a lot uglier, though I give you it might still just barely work.
  9. OMG. What are you thinking? I won’t go near it, and I pity everyone who will have to. And it’s quite likely going to blow up some things I care about.
  10. Blech. To me, this clearly is a grave mistake that will impact a lot of things very adversely. If I can do anything within reason to stop it, I’ll do it. Consider this a veto, and shame on you if you override it.

You can qualify this with:

+
I’ve thought long and hard about this, and I think I understand the matter in depth. You’ll hence need arguments of the profundity of the Earth’s outer core to sway me.
(unqualified)
I’ve thought about this, and as far as I understand the matter I’m sure about it. More information, solid arguments, or a sudden inspiration while showering might still sway me.
−
This is a gut feeling. It could very well be phantom pain. Feel free to try a differential diagnosis.

If you like the scale, too, feel free to reference it as https://blog.g-vo.org/building-consensus/#scale.

GAVO vs. Corona

[Group Photo of the 2018 Victoria Interop]
You won’t see something like this (the May 2018 Interop group photo) in Spring 2020: The Sydney Interop, planned for early May, is going to take place using remote tools, some of which I’d rather do without.

The Corona pandemic, regrettably, has also brought with it a dramatic move to closed, proprietary communication and collaboration platforms: I’m being bombarded by requests to join Zoom meetings, edit Google docs, chat on Slack, “stream” something on any of Youtube, Facebook, Instagram, or Sauron (I’ve made one of these up).

Mind you, that’s within the Virtual Observatory. Call me pig-headed, but I feel that’s a disgrace when we’re out to establish Free and open standards (for good reasons). To pick a particularly sad case, Slack right now is my pet peeve because they first had an interface to IRC (which has been doing what they do since the late 80s, though perhaps not as prettily in a web browser) and then cut it when they had sufficient lock-in. Of course, remembering how Google first had XMPP (that’s the interoperable standard for instant messaging) in Google talk and then cut that, too… ah well, going proprietary unfortunately is just good business sense once you have sufficient lock-in.

Be that as it may, I was finally fed up with all this proprietary tech and set up something suitable for conferencing, building on open, self-hostable components. It’s on https://telco.g-vo.org, and you’re welcome to use it for your telecons (assuming that when you’re reading this blog, you have at least some relationship to astronomy and open standards).

What’s in there?

Unfortunately, there doesn’t seem to be an established, Free conferencing system based on SIP/RTP, which I consider the standard for voice communication on the internet (if you’ve never heard of it: it’s what your landline phone uses in all likelihood). That came as a bit of a surprise to me, but the next best thing is a Free and multiply implemented solution, and there’s the great mumble system that (at least for me) works so much better than all the browser-based horrors, not to mention it’s quite a bit more bandwidth-effective. So: Get a client and connect to telco.g-vo.org. Join one of the two meeting rooms, done.

Mumble doesn’t have video, which, considering I’ve seen enough of peoples’ living rooms (not to mention Zoom’s silly bluebox backgrounds) to last a lifetime, counts as an advantage in my book. However, being able to share a view on a document (or slide set) and point around in it is a valid use case. Bonus points if the solution to that does not involve looking at other people’s mail, IM notifications, or screen backgrounds.

Now, a quick web search did not turn up anything acceptable to me, and since I’ve always wanted to play with websockets, I’ve created poatmyp: With it, you upload a PDF, distribute the link to your meeting partners, and all participants will see the slides and a shared pointer. And they can move around in the document together.

What’s left is shared editing. I’ve looked at a few implementations of this, but, frankly, there’s too much npm and the related curlbashware in this field to make any of it enjoyable; also, it seems nobody has bothered to provide a Debian package of one of the systems. On the other hand, there are a few trustworthy operators of etherpads out there, so for now we are pointing to them on telco.g-vo.

Setting up a mumble server and poatmyp isn’t much work if you know how to configure an nginx and have a suitable box on the web. So: perhaps you’ll use this opportunity to re-gain a bit of self-reliance? You see, there’s little point in keeping your own copy of the Gaia catalogue, and doing that right is hard. Thanks to people writing Free software, running a simple telecon infrastructure, on the other hand, isn’t hard any more.

The Bochum Galactic Disk Survey

[Image: Patches of higher perceived variability on the Sky]
Fig 1: How our haphazard variability ratio varies over the sky (galactic coordinates). And yes, it’s clear that this isn’t dominated by physical variability.

About a year ago, I reported on a workshop on “Large Surveys with Small Telescopes” in Bamberg; at around the same time, I’ve published an example for those, the Bochum Galactic Disk Survey BGDS, which used a twin 15 cm robotic telescope in some no longer forsaken place in the Andes mountains to monitor the brighter stars in the southern Milky Way. While some tables from an early phase of the survey have been on VizieR for a while, we now publish the source images (also in SIAP and Obscore), the mean photometry (via SCS and TAP) and, perhaps most fun of all, the lightcurves (via SSAP and TAP) – a whopping 35 million of the latter.

This means that in tools like Aladin, you can now find such light curves (and images in two bands from a lot of epochs) when you are in the survey’s coverage, and you can run TAP queries on GAVO’s http://dc.g-vo.org/tap server against the full photometry table and the time series.

Regular readers of this blog will not be surprised to see me use this as an excuse to show off a bit of ADQL trickery.

If you have a look at the bgds.phot_all table in your favourite TAP client, you’ll see that it has a column amp, giving the difference between the highest and lowest magnitude. The trouble is that amp for almost all objects just reflects the measurement error rather than any intrinsic variability. To get an idea what’s “normal” (based on the fact that essentially all stars have essentially constant luminosity on the range and resolution scales considered here), run a query like

SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n
FROM bgds.phot_all
WHERE nobs>10
GROUP BY bin

As this scans the entire 75 million rows of the table, you will probably have to use async mode to run this.
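
If you would rather script this, a pyVO sketch could look like the following; it uses the async job interface just mentioned and then reproduces the log-log view discussed below (bin and n are simply the column aliases from the query).

import matplotlib.pyplot as plt
import pyvo

query = ("SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n"
    " FROM bgds.phot_all WHERE nobs>10 GROUP BY bin")

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
job = svc.submit_job(query)    # async, since this scans the full table
job.run()
job.wait()
tab = job.fetch_result().to_table()

plt.loglog(tab["bin"], tab["n"])
plt.xlabel("amp/err_mag")
plt.ylabel("N")
plt.show()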

[image: distribution of amplitude/mag error]
Figure 2: The distribution of amplitude over magnitude error for all BGDS objects with nobs>10 (blue) and the subset with a mean magnitude brighter than 15 (blue).

When it comes back, you will have, for objects where any sort of statistics make sense at all (hence nobs>10), a histogram (of sorts) of the amplitude in units of upstream’s magnitude error estimation. If you log-log-plot this, you’ll see something like Figure 2. The curve at least tells you that the magnitude error estimate is not very far off – the peak at about 3 “sigma” is not unreasonable since about half of the objects have nobs of the order of a hundred and thus would likely contain outliers that far out assuming roughly Gaussian errors.

And if you’re doing a rough cutoff at amp/magerr>10, you will get perhaps not necessarily true variables, but, at least potentially interesting objects.

Let’s use this insight to see if we spot any pattern in the distribution of these interesting objects. We’ll use the HEALPix technique I’ve discussed three years ago in this blog, but with a little twist from ADQL 2.1: The Common Table Expressions or CTEs I have already mentioned in my blog post on ADQL 2.1 and then advertised in the piece on the Henry Draper catalogue. The brief idea, again, is that you can write queries and give them a name that you can use elsewhere in the query as if it were an actual table. It’s not much different from normal subqueries, but you can re-use CTEs in multiple places in the query (hence the “common”), and it’s usually more readable.

Here, we first create a version of the photometry table that contains HEALPixes and our variability measure, use that to compute two unsophisticated per-HEALPix statistics and eventually join these two to our observable, the ratio of suspected variables to all stars observed (the multiplication with 1.0 is a cheap way to make a float out of a value, which is necessary here because a/b does integer division in ADQL if a and b are both integers):

WITH photpoints AS (
  SELECT 
    amp/err_mag AS redamp,
    amp,
    ivo_healpix_index(5, ra, dec) AS hpx
  FROM bgds.phot_all
  WHERE 
    nobs>10
    AND band_name='SDSS i'
    AND mean_mag<16),
all_objs AS (
  SELECT count(*) AS ct,
    hpx
    FROM photpoints GROUP BY hpx),
strong_var AS (
  SELECT COUNT(*) AS ct,
    hpx
    FROM photpoints
    WHERE redamp>4 AND amp>1 GROUP BY hpx)
SELECT
  strong_var.ct/(1.0*all_objs.ct) AS obs,
  all_objs.ct AS n,
  hpx
FROM strong_var JOIN all_objs USING (hpx)
WHERE all_objs.ct>20

If you plot this using TOPCAT’s HEALPix thingy and ask it to use Galactic coordinates, you’ll end up with something like Figure 1.

There clearly is some structure, but given that the variables ratio reaches up to 0.2, this is still reflecting instrumental or pipeline effects and thus earthly causes rather than astrophysics. And that’s going beyond what I’d like to talk about on a VO blog, although I’ll take any bet that you will see significant structure in the spatial distribution of the variability ratio at about any magnitude cutoff, since there are a lot of different population mixtures in the survey’s footprint.

Be that as it may, let’s have a quick look at the time series. As with the short spectra from the Byurakan use case, we’ve stored the actual time series as arrays in the database (the mjd and mags columns in bgds.ssa_time_series). Unfortunately, since they are a lot less array-like than homogeneous spectra, it’s also a lot harder to do interesting things with them without downloading them (I’m grateful for ideas for ADQL functions that would let you do in-DB analysis for such things). Still, you can at least easily download them in bulk and then process them in, say, python to your heart’s content. The Byurakan use case should give you a head start there.
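
As a starting point, here is a minimal bulk-download sketch with pyVO; it only assumes the mjd and mags array columns just mentioned.

import pyvo

# A sketch: pull a handful of BGDS light curves as arrays and compute
# something trivial locally; mjd and mags are the array columns from
# the text above.
svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
res = svc.run_sync(
    "SELECT TOP 5 mjd, mags FROM bgds.ssa_time_series")
for row in res:
    mjd, mags = row["mjd"], row["mags"]
    print(len(mjd), "epochs between MJD", min(mjd), "and", max(mjd))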

For a quick demo, I couldn’t resist checking out objects that Simbad classifies as possible long-period variables (you see, as I write this, the public bohei over Betelgeuse’s brief waning is just dying down), and so I queried Simbad for:

SELECT ra, dec, main_id
FROM basic
WHERE
  otype='LP?'
  AND 1=CONTAINS(
     POINT('', ra, dec),
     POLYGON('', 127, -30, 112, -30, 272, -30, 258, -30))

(as of this writing, Simbad still needs the ADQL 2.0-compliant first arguments to POINT and POLYGON), where the POLYGON is intended to give the survey’s footprint. I obtained that by reading off the coordinates of the corners in my Figure 1 while it was still in TOPCAT. Oh, and I had to shrink it a bit because Simbad (well, the underlying Postgres server, and, more precisely, its pg_sphere extension) doesn’t want polygons with edges longer than π. This will soon become less pedestrian: MOCs in relational databases are coming; more on this soon.

[TOPCAT action shot with a light curve display]
Fig 3: V566 Pup’s BGDS lightcurve in a TOPCAT configured to auto-plot the light curves associated with a row from the bgds.ssa_time_series table on the GAVO DC TAP service.

If you now do the usual spiel with an upload crossmatch to the bgds.ssa_time_series table and check “Plot Table” in Views/Activation Action, you can quickly page through the light curves (TOPCAT will keep the plot style as you go from dataset to dataset, so it’s worth configuring the lines and the error bars). Which could bring you to something like Fig. 3; and that would suggest that V* V566 Pup isn’t really long-period unless the errors are grossly off.

Parallel Queries

Image: Plot of run times
An experiment with parallel querying of PPMX, going from single-threaded execution to using seven workers.

Let me start this post with a TL;DR for

scientists
Large analysis queries (like those that contain a GROUP BY clause) profit a lot from parallel execution, and you needn’t do a thing for that.
DaCHS operators
When you have large tables, Postgres 11 together with the next DaCHS release may speed up your responses quite dramatically in some cases.

So, here’s the story –

I’ve finally overcome my stretch trauma and upgraded the Heidelberg data center’s database server to Debian buster. With that, I got Postgres 11, and I finally bothered to look into what it takes to enable parallel execution of database queries.

Turns out: My Postgres started to do parallel execution right away, but just in case, I went for the following lines in postgresql.conf:

max_parallel_workers_per_gather = 4
max_worker_processes = 10
max_parallel_workers = 10

Don’t quote me on this – I frankly admit I haven’t really developed a feeling for the consequences of max_parallel_workers_per_gather and instead just did some experiments while the box was loaded otherwise, determining where raising that number has a diminishing return (see below for more on this).

The max_worker_processes thing, on the other hand, is an educated guess: on my data center, there’s essentially never more than one person at a time who’s running “interesting”, long-running queries (i.e., async), and that person should get the majority of the execution units (the box has 8 physical CPUs that look like 16 cores due to hyperthreading) because all other operations are just peanuts in comparison. I’ll gladly accept advice to the effect that that guess isn’t that educated after all.

Of course, that wasn’t nearly enough. You see, since TAP queries can return rather large result sets – on the GAVO data center, the match limit is 16 million rows, which for a moderate row size of 2 kB already translates to 32 GB of memory use if pulled in at once, half the physical memory of that box –, DaCHS uses cursors (if you’re a psycopg2 person: named cursors) to stream results and write them out to disk as they come in.
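
For illustration (this is a toy sketch, not DaCHS source code): with psycopg2, streaming is just a matter of giving the cursor a name, which makes postgres keep the result server-side and hand it out in chunks.

import psycopg2

# Placeholders throughout: connection string and table name.
conn = psycopg2.connect("dbname=gavo")
with conn.cursor(name="stream_results") as cur:
    cur.itersize = 10000           # rows fetched per round trip
    cur.execute("SELECT * FROM ppmx.data")
    for row in cur:
        pass                       # write the row out as it arrives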

Sadly, postgres won’t do parallel plans if it thinks people will discard a large part of the result anyway, and it thinks that if you’re coming through a cursor. So, in SVN revision 7370 of DaCHS (and I’m not sure if I’ll release that in this form), I’m introducing a horrible hack that, right now, just checks if there’s a literal “group” in the query and doesn’t use a cursor if so. The logic is, roughly: With GROUP, the result set probably isn’t all that large, so streaming isn’t that important. At the same time, this type of query is probably going to profit from parallel execution much more than your boring sequential scan.

This gives rather impressive speed gains. Consider this example (of course, it’s selected to be extreme):

import contextlib
import pyvo
import time

@contextlib.contextmanager
def timeit(activity):
  start_time = time.time()
  yield
  end_time = time.time()
  print("Time spent on {}: {} s".format(activity, end_time-start_time))


svc = pyvo.tap.TAPService("http://dc.g-vo.org/tap")
with timeit("Cold (?) run"):
  svc.run_sync("select round(Rmag) as bin, count(*) as n"
    " from ppmx.data group by bin")
with timeit("Warm run"):
  svc.run_sync("select round(Rmag) as bin, count(*) as n"
    " from ppmx.data group by bin")

(if you run it yourself and you get warnings about VOTable versions from astropy, ignore them; I’m right and astropy is wrong).

Before enabling parallel execution, this was 14.5 seconds on a warm run; after, it was 2.5 seconds. That’s almost a 6-fold speedup. Nice!

Indeed, that holds beyond toy examples. The showcase Gaia density plot,

SELECT
        count(*) AS obs,
        source_id/140737488355328 AS hpx
FROM gaia.dr2light
GROUP BY hpx

(the long odd number is 2^35·4^(12−6), which turns source_ids into level-6 HEALPixes as per the Gaia source_id footnote; please note that Postgres right now isn’t smart enough to parallelise ivo_healpix), which traditionally ran for about an hour, is now done in less than 10 minutes.

In case you’d like to try things out on your postgres, here’s what I’ve done to establish the max_parallel_workers_per_gather value above.

  1. Find a table with a few 1e7 rows. Think of a query that will return a small result set, so the measurement isn’t confused by the time it takes to transfer a large result. In my case, that’s a magnitude histogram, and the query would be
    select round(Rmag) as bin, count(*) 
    as n from ppmx.data 
    group by bin;
    

    Run this query once so the data is in the disk cache (the query is “warm”).

  2. Establish a non-parallel baseline. That’s easy to do:
    set max_parallel_workers_per_gather=0;
    
  3. Then run
    explain analyze select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;
    

    You should see a simple query plan with the runtime for the non-parallel execution – in my case, a bit more than 12 seconds.

  4. Then raise max_parallel_workers_per_gather successively. Make sure the query plan has lines of the form “Workers Planned” or so. You should see that the execution time falls with the number of workers you give it, up to the value of max_worker_processes – or until postgres decides your table is too small to warrant further parallelisation, which for my settings happened at 7.

Note, though, that in realistic, more complex queries, there will probably be multiple operations that will profit from parallelisation in a single query. So, if in this trivial example you can go to 15 gatherers and still see an improvement, this could actually make things slower for complex queries. But as I said above: I have no instinct yet for how things will actually work out. If you have experiences to share: I’m sure I’m not the only person on dachs-users who’d be interested.

LAMOST5 meets Datalink

One of the busiest spectral survey instruments operated right now is the Large Sky Area Multi-Object Fiber Spectrograph Telescope (LAMOST). And its data is in the VO, more or less: DR2 and DR3 have been brought into the VO by our Czech colleagues, but since they currently lack the resources to update their services to the latest releases, they have kindly given me their DaCHS resource descriptor, and so I had a head start for publishing DR5 in Heidelberg.

With some minor updates, here it is now: Over nine million medium-resolution spectra covering large parts of the northern sky – the spatial coverage is like this:

[Coverage Healpix map]

There’s lots of fun to be had with this; of course, there’s an SSA service, so when you point Aladin or Splat at some part of the covered sky and look for spectra, chances are you’ll see LAMOST spectra, and when working on some of our tutorials (this one, for example), it happened that LAMOST actually had what I was looking for when writing them.

But I’d like to use the opportunity to mention two other modes of accessing the data.

Tablesample and TOPCAT’s Plot Table activation action

Say you’d like to look at spectra of M stars and would like to have some sample from across the sky: fire up TOPCAT, point its TAP client at the GAVO DC TAP service (http://dc.g-vo.org/tap) and run something like

select 
  ssa_pubDID, accref, raj2000, dej2000, ssa_targsubclass
from lamost5.data tablesample(1)
where 
  ssa_targsubclass like 'M%' 

Image: stacked spectra

This is using the TABLESAMPLE modifier in the from clause, which isn’t standard ADQL yet. As mentioned in the DaCHS 1.4 announcement, DaCHS has a prototype implementation of what’s been discussed on the IVOA’s DAL mailing list: pick a part of a table rather than the full one. It takes a percentage as an argument, and tells the server to choose about this percentage of the table’s records using a reasonable and fast heuristic. Note that this won’t give you perfect statistical sampling, but if it’s not “good enough” for some purpose, I’d like to learn about that purpose.

Drawing a proper statistical sample, on the other hand, would take minutes on the GAVO database server – with tablesample, I had the roughly 6000 spectra the above query returns essentially instantaneously, and from eyeballing a sky plot of them, I’d say their distribution is close enough to that of the full DR5. So: tablesample is your friend.

For a quick look at the spectra themselves, in TOPCAT click Views/Activation Actions, check “Plot Table” and make sure TOPCAT proposes the accref column as “Table Location” (if you don’t see these items, update your TOPCAT – it’s worth it). Now click on a row or perhaps a dot on a plot and behold an M spectrum.

Cutouts via Datalink

LAMOST releases spectra in FITS format pretty much like the ones you may know from SDSS. The trick above works because we instead hand out proper, IVOA Spectral Data Model-compliant spectra through SSA and TAP. However, if you need to go back to the original files, you can, using Datalink. If you’re unsure what this Datalink thing is: call me vain, but I still like my 2015 ADASS poster explaining that. In TOPCAT, you’d be using the “Invoke Service” activation action to get to the datalinks.

If you have actual work to do, offloading repetitive work to the computer is what you want, and fortunately, pyVO knows about datalink, too. I grant you this is hard to discover so far, and the interface is… a tiny bit clunky. Until some kind soul cleans up the pyVO datalink act, a poster Stefan and I showed at the 2017 ADASS might give you an idea which buttons to press. Or read on and see how things work for LAMOST5.

The shortest way to datalinks is a TAP query that at least retrieves the ssa_pubdid column (that’s a must; Datalink can’t work without it) and, on the result, run the iter_datalinks method. This returns an object in which you can find the associated data items (in this case, a preview and the original FITS with the #progenitor semantics), plus the cutout service.

Hence, a minimal example for pulling the legacy FITS links out of the first three items in lamost5.data would look like this:

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
for dl in svc.run_sync("select top 3 ssa_pubdid"
        " from lamost5.data").iter_datalinks():
    print(next(dl.bysemantics("#progenitor")
        )["access_url"].decode("ascii"))

This is a bit different from listing 2 in the poster linked above because it’s python3, so getting the first element from an iterator looks a bit different, and (curse astropy.votable for returning VOTable chars as bytes rather than strings!) you’ll want to turn the URL into a proper string manually.

Another, actually more interesting, thing you can do with Datalink is cut out regions of interest. The LAMOST spectra are fairly long (though of course still small by image standards), so if you’re only interested in a single line, you can save a bit of storage and bandwidth over blindly pulling the whole thing.

For instance, if you wanted to pull the vicinity of the H and K Fraunhofer lines from the matches in the loop in the snippet above, you could say:

from astropy import units as u
proc = next(dl.iter_procs())
cutout = proc.processed(band=(392*u.nm,398*u.nm))

And this is what I’ve done for the decorative left border above: it’s the H and K line profiles for 0.1% of the stars LAMOST has classified as G8. Building the image didn’t take more than a few seconds (where I’d like the cutouts to be faster by a factor of 10; I guess that’s about an afternoon of work for me, so if it’d save you more than that afternoon, poke me to do it).

What’s coming back is tables. By the time python has digested these, they’re numpy record arrays. Thus, you can immediately bring in your beloved scipy (or whatever). For instance, if for some reason you’re convinced that the H and K lines should be fit by identical Gaussians in the boring case and would like to find objects for which that’s patently untrue and that hence could be un-boring, here’s how you could do that:

import numpy
from astropy import units as u
from scipy.optimize import curve_fit

# get_profiles is defined in the full script linked at the end of this post.

def spectral_model(wl, c1, c2, depth, width):
    return (1
        -depth*numpy.exp(-numpy.square(wl-c1)
            /numpy.square(width))
        -depth*numpy.exp(-numpy.square(wl-c2)
            /numpy.square(width)))

for pubdid, prof in get_profiles(
        "G8", (392*u.nm,398*u.nm), 0.01, 4):
    prof["flux"] /= max(prof["flux"])
    popt, pcov = curve_fit(
        spectral_model, prof["spectral"], prof["flux"],
        sigma=prof["flux_error"],
        p0=[3968, 3934, 1, 1])
    if pcov[3][3]>1:
        break

– where get_profiles is essentially doing the TAP plus datalink routine above, except I’m swallowing spectra with too much noise and I have the function transform the spectral coordinate into the objects’ rest frames. If you’re curious how I’m doing this just based on the IVOA Spectral Data Model, check the source linked at the bottom of this post.
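
If you want to do that part yourself rather than read my script: the shift is just the usual non-relativistic rest-frame correction, dividing the observed spectral coordinate by 1+v/c with whatever radial velocity (or redshift) the spectrum’s metadata provides; the function below is, of course, just a sketch with a made-up name.

C_KMS = 299792.458   # speed of light in km/s

def to_restframe(spectral, radial_velocity_kms):
    # Non-relativistic approximation; for a redshift z, divide by (1+z).
    return spectral/(1+radial_velocity_kms/C_KMS)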

I’ve just run this, and the first spectrum that the machinery flagged as suspicious was this:

Image: A fairly boring late G spectrum

– which doesn’t look like I’ve made a discovery just yet. But that doesn’t mean there’s not a lot to find within LAMOST5’s lines…

To get you up to speed quickly: here’s the actual python3 code I ran for the “analysis” and the plot.

DaCHS 1.4 is out

Dachs logo with "version 1.4" superposed

Since the Groningen Interop is over, it’s time for a DaCHS release, and so, roughly half a year after the release of DaCHS 1.3, today I’ve pushed DaCHS 1.4 into our Debian repository.

As usual, you should upgrade as soon as you find time to do so, because upgrades become more difficult if they span large version gaps; and one of these days you will need some new feature or run into one of the odd bugs. Upgrading is a good opportunity to also get your DaCHS ready for buster by adding the repos mentioned there.

The list of new features is rather short this time around. Here are some noteworthy ones:

  • There’s now an XML grammar that can be used when you have to parse smallish snippets of XML as, for instance, in VOEvent.
  • You can now use TABLESAMPLE(1) after a table specification in DaCHS’ ADQL to tell the database engine to just use 1% of a table for a query. While this isn’t a precise way to sample tables, it’s great when developing queries.
  • Also among new features I’d like to see in ADQL and have therefore put into DaCHS is GENERATE_SERIES(a,b), which is what is known as a table-generating function in SQL. If you know SDSS CasJobs, you’ll have seen lots of those already. GENERATE_SERIES, however, is really plain: it just spits out a table with a column of integers between a and b. For an example of why one might want to have that, check out the poster I’m linking to in my ADASS report; a minimal call is sketched right after this list.
  • If you have an updating data descriptor (usually, because you keep feeding data into a data collection), DaCHS will no longer automatically re-make its dependencies (like, say, views). That’s because that’s not necessary in general, and it’s a pain if every update on an obscore-published table tears down and rebuilds the obscore view. For the rare cases when you do need to rebuild dependencies, there’s now a remakeOnDataChange attribute on data.
  • At the interop, I’ve mentioned a few use cases for knowing which server software you’re talking to, and I’ve said that people should set their server headers to informative values. DaCHS does that now.
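
To see the series generator in action without writing an RD, you can run it through TAP; here is a minimal pyVO sketch. Keep in mind this is a DaCHS extension rather than standard ADQL, so it will only work against servers that implement it.

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# Should print a one-column table with the integers 1 through 5.
print(svc.run_sync("SELECT * FROM generate_series(1, 5)").to_table())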

To conclude on a low note: This is probably going to be the last release of DaCHS for python 2. Even though we will have to shed a dependency or two that simply will not be ported to python 3, and even though I’m rather unhappy with a few properties of the python 3 port of twisted, there’s probably no way to escape this, given that Debian is purging out python 2 packages quickly already.

So, when we meet again for the next release, you’ll probably be looking at DaCHS 2.0, and where you have custom code in your RDs, it’s rather likely that you’ll see a minor amount of breakage. I promise I’ll do everything I can to make the migration easy for deployers, but I can’t do higher magic, so: If there’s ever been a time to add regression tests to your RDs, it’s now.

ADASS and Interop

[ADASS group photo]
ADASS XXIX is a big conference with lots of attendants. I’ve taken the liberty of scaling the photo so you really won’t recognise me (though I am on the photo). Note that, regrettably, the interop will be a lot smaller.

The people that create the Virtual Observatory standards, organised in the IVOA, meet twice a year: Once in spring for a five-day meeting (this year it happened in Paris), and once in autumn for a three-day meeting back-to-back to ADASS, the venerable (this year it’s the 29th installment) meeting of people dealing with astronomy and computers.

We’re now on day three of ADASS, and for me, so far this has been more of an endless hackathon, with discussing and hacking on things like mirrors for DFBS, ADQL 2.1, the evolution of IVOA vocabularies (more on this soon somewhere around here), a vocabulary of object types, getting LAMOST 5 published properly in the VO, the measurements data model, convincing more registries to push out space-time coverage for their resources (I’m showing a poster on that), and a lot more.

So, getting to actually listen to talks during ADASS almost is something of a luxury, and a mind-widening one at that – I’ve just listened to a talk about effectively doubling the precision of VLBI geodesy (in this case, measuring the location of radio telescopes to a few millimeters) by a piece of clever software, and before that I could learn a bit about how complex it is to figure out how much interference something emitting radio waves will cause in some other place on earth (like, well, a radio telescope). In case you’re curious: A bit more than a year from now, short papers on the topics will appear in the proceedings of ADASS XXIX, which in turn you’ll find in the ADASS proceedings collections (or on arXiv before that).

Given the experience of the last few days, I doubt I’ll do anything like the live blog from Paris linked above. I still can’t resist mentioning that at ADASS, I’m having a poster that’s little more than an ad blitz for STC in the registry.

Update (2019-10-13): Well, one week later I’m sitting in the closing session of the Interop, and I’ve even already given my summary of Semantics activities during the interop. Other topics I’ve talked about at this interop include interoperable authentication (I’m really interested in this because I’d like to enable persistent TAP uploads, where your uploaded tables are still there for you when you come back), a minor update to SimpleDALRegExt (which is overall rather technical and you probably don’t want to look at), on the takeup of new Registry tech (which might come over as somewhat sad, but considering that you have to pull along many people to have changes in “the” Registry, it’s not so bad at all), and on, as Mark Taylor called it, operational identification of server software (which I consider entertaining in its somewhat erratic narrative).

And now, after 7 days of essential nonstop discussion and brainstorming, I’m longing to slump into a chair on the train back to Heidelberg and just enjoy the landscape rolling by.

GAVO at AG-Tagung Stuttgart

[towel with astro photo]
Our puzzler prize this year: a Photo of the seahorse in the LMC, taken during Hubble’s 100000th orbit around the earth, on a fluffy towel.

It’s time again for the meeting of the Astronomische Gesellschaft (as in 2017 in Göttingen; last year we had the IAU general assembly instead). We’re there with a booth (right next to the exhibition on 100 years of IAU) and a splinter meeting, at which I’ll have a sales pitch for cross-server uploads.

And, of course, there’s a puzzler again: you could win a beautiful towel if you solve a little VO-related problem. This year’s puzzler is about where in the sky you’ll see “nebulae” (in the classic sense defined by NGC) batched together most closely. If you’ve been following this blog for a while, it shouldn’t be too hard, but to participate you’d have to find someone in Stuttgart to hand in your solution.

If you are in Stuttgart: As usual, we’ll be giving hints during the coffee breaks on Tuesday and Wednesday. So, be sure to visit our booth.

DaCHS is Bustered

DaCHS is developed on Debian, and Debian is the recommended deployment platform. Hence, a new major release of Debian (where major means for them: We may break stuff) is always a big thing for me. And so it was with the release that came in July, codenamed “buster”. Both on the “big thing” and on the “break” counts. This posting gives DaCHS deployers some background for their buster upgrades. Astronomers not running Debian themselves won’t risk missing anything if they skip this post.

So, after I upgraded, the first thing I noticed was that DaCHS would no longer even start because astropy (which it needs, in particular, because that’s where pyfits sits these days) was gone. Simple explanation: Upstream astropy doesn’t support python2 any more, and so Debian buster only has python3-astropy.

Moving DaCHS to python3, unfortunately, isn’t that easy; a major dependency, nevow (essentially, a web framework), isn’t ported yet, and porting it is a major thing. Believe me, I’ve tried. The nasty thing, in particular, is that twisted, which lies below nevow still, hands out lots of byte strings. And in python3, b"a"!="a". You wouldn’t believe how many interesting bugs that simple truth introduces when you have a library that handed out “just strings” in python2 and now hands out byte strings in python3. Yikes.
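
To make that class of bug concrete, here is the pattern in miniature (a toy sketch, not actual twisted or DaCHS code):

# Under python2 the library handed out a native string; under python3
# the very same API yields bytes, and the comparison silently turns False.
request_method = b"GET"            # what you get on python3
print(request_method == "GET")     # True on python2, False on python3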

Update (2019-08-28): After quite a bit of experimentation, I finally gave up on providing a python2 version of astropy through release, because for a complicated set of reasons (including numpy declaring a conflict with existing astropys in buster) it is impossible to provide a package that works in buster and doesn’t break stretch. So, for buster only you’ll have to have a second (or, if running beta, third) gavo line in your sources.list (or equivalent):

deb http://vo.ari.uni-heidelberg.de/debian buster-foreports main

The instructions at our APT repository have been updated, so you won’t have to bookmark this particular page.

But that wasn’t the end of it. Buster comes with Postgres 11, which I look forward to in particular because it supports parallel query execution. That could help us quite a bit, given our large catalogs that we quite often want to run sequential scans on. But of course this means upgrading postgres. And attempting to do that on my development machine immediately hit a wall. What’s nice is that the q3c and pgsphere extensions that we’ve had to push out ourselves so far are now part of Debian main. What’s rather fatal is that our pgsphere extensions dealing with HEALPixes and MOCs aren’t part of the buster pgsphere package (the reasons for that are tedious and arcane and have to do with OpenSSL and the GPL).

Also, the pgsphere package coming with buster is called postgres-pgsphere, which is rather unfortunate as it’s missing the version indication. So: If you find it on your system, remove it right away. It will conflict with the one true pgsphere package (postgresql-11-pgsphere). That one you’ll get from us, and it has the HEALPix stuff built in. TL;DR: run apt install postgresql-q3c postgresql-11-pgsphere before following the postgres update recipe linked above.

There’s a bit more to upgrading the database this time. Because of fairly low-level cleanup in Postgres itself, you’re risking index corruption on string indices. Realistically, for almost anything you’ll have, it’s unlikely that you’re affected (it’s essentially about non-ASCII in strings), but then it’s better to be safe than sorry, and hence you should say

reindex database gavo

first thing after you’ve upgraded to Postgres 11 (which you should really do once the box is on buster). Only if you have very large tables might it be worth restricting the index regeneration to indices that could actually need it; see the postgres link above for how to do that.

One last thing on Postgres upgrades: I’ve not quite tried to work out why, but, probably depending on your /etc/hosts, DaCHS on buster is much more likely to connect to your database using IPv6 than it was before. Many older Postgres configurations won’t let you in then. If that happens to you, just edit /etc/postgresql/11/main/pg_hba.conf and add a line

host    all         all         ::1/32          md5

(or something less permissive if you prefer).

The next buster-related shock was when TOPCAT’s TAP uploads stopped working while my regression tests didn’t find anything wrong. After a bit of cursing I eventually figured out that that’s not actually buster’s fault but twisted’s, which in a commit from May 2018 broke chunked uploads (essentially, that’s when you’re not saying up front how large your upload will be). I’ve filed a bug report on twisted, but we can’t really wait until any sort of fix is ready and have a broken TOPCAT-DaCHS relationship until then, so for now we’re also shipping a fixed twisted package. If you’re running DaCHS without our repository enabled, you will have to patch the twisted code yourself. The bug report tells you what to do (no warranties, because I’m not entirely sure why they changed it in the first place; it’s a very small change, though).

[Update (2019-08-14) scratch the part with the fixed twisted packages. They’re too much trouble on stretch systems. You can keep using them on buster boxes if you want, though. The most recent stable release monkeypatches the problem out of presumably broken twisteds, and so will the next beta.]

I hope you’re not totally discouraged now, because upgrade you should (though perhaps not right before going on vacation) – distribution upgrades are unavoidable if you want to run services for decades, and that’s definitely a goal within the VO. See the Debian release notes for Debian’s take on dist upgrades, which arguably is a bit more alarmist than it needs to be; a lean, server-only system typically is really simple to upgrade.

Given the relatively large number of Debian packages we override in buster, I’ll be particularly grateful if you complain early about breakage you observe (ideally use the dachs-support mailing list, but see Support for alternatives), and as usual you are encouraged to try the upgrade first on a development system if you have one. Which you should.