• Tutorial Renewal

    The DaCHS Tutorial among other seminal works

    DaCHS' documentation (readthedocs mirror) has two fat pieces and a lot of smaller read-as-you-go pieces. One of the behemoths, the reference documentation, at roughly 350 PDF pages, has large parts generated from source code, and there is no expectation that anyone would ever read it linearly. Hence, I wasn't terribly worried about unreadable^W passages of questionable entertainment value in there.

    That's a bit different with the tutorial (also available as 150 page PDF; epub on request): I think serious DaCHS deployers ought to read the DaCHS Basics and the chapters on configuring DaCHS and the interaction with the VO Registry, and they should skim the remaining material so they are at least aware of what's there.

    Ok, I give you that this is a bit utopian. But given that pious wish I felt rather bad that the tutorial has become somewhat incoherent in the years since I started the piece in April 2009 (perhaps graciously, the early history is not visible at the documentation's current github home). Hence, when applying for funds under our current e-inf-astro project, I had promised to give the tutorial a solid makeover as, hold your breath, Milestone B1-5, due in the 10th quarter. In human terms: last December.

    When it turned out the Python 3 migration was every bit as bad as I had feared, it became clear that other matters had to take priority and that we might miss this part of that “milestone” (sorry, I can't resist these quotes). And given e-inf-astro only had two quarters to go after that, I prepared for having to confess I couldn't make good on my promise of fixing the tutorial.

    But then along came Corona, and reworking prose seemed the ideal pastime for the home office. So, on April 4, I forked off a new-tutorial branch and started a rather large overhaul that, among other things, resulted in the operators' guide with its precarious position between tutorial and reference being largely absorbed into the tutorial. In all, off and on over the last few months I accumulated (according to git diff --shortstat) 6372 inserted and 3453 deleted lines in the tutorial's source. Since that source currently is 7762 lines, I'd say that's the complete makeover I had promised. Which is good, as e-inf-astro will be over next Wednesday (but don't worry, our work is still funded).

    So – whether you are a DaCHS expert, are thinking about running it, or are just curious what it takes to build VO services, let me copy from index.html: Tutorial on importing data (tutorial.html, tutorial.pdf, tutorial.rstx). The ideal companion for your vacation!

    And if you find typos, boring pieces, overly radical advocacy or anything else you don't like: there's a bug tracker for you (not to mention PRs are welcome).

  • DaCHS 2.1: Say hello to Python 3

    DaCHS and python logos

    Today, I have released DaCHS 2.1, the first stable DaCHS running on Python 3. I have tried hard to make the major version move painless and easy, and indeed “pure DaCHS” RDs should just continue to work. But wherever there's Python in your RDs or near them, things may break, since Python 3 is different from Python 2 in some rather fundamental ways.

    Hence, the Debian package even has a new name: gavodachs2-server. Unless you install that, things will keep running as they do. I will keep fixing serious DaCHS 1 bugs for a while, so there's no immediate urgency to migrate. But unless you migrate, you will not see any new features, so one of these days you will have to migrate anyway. Why not do it today?

    Migrating to DaCHS 2

    In principle, just say apt install gavodachs2-server and hope for the best. If you have a development machine and regression tests defined, this is actually what we recommend, and we'd be very grateful to learn of any problems you may encounter.

    If you'd rather be a little more careful, Carlos Henrique Brandt has kindly updated his Docker files in order to let you spot problems before you mess up your production server. See Test Migration for a quick intro on how to do that. If you spot any problems that are not related to the Python 3 pitfalls mentioned in the howto linked below or to the nevow exodus, please tell me or (preferably) the dachs-support mailing list.

    A longer, more or less permanent piece elaborating on possible migration pains is in our how-to documentation: How do I go from DaCHS1 to DaCHS2?

    What's new in DaCHS2?

    I've used the opportunity of the major version change to remove a few (mis-) features that I'm rather sure nobody uses; and there are a few new features, too. Here's a rundown of the more notable changes:

    • DaCHS now produces VOTable 1.4 by default. This is particularly notable when you provide TIMESYS metadata (on which I'll report some other time).
    • When doing spatial indices, prefer the new //scs#pgs-pos-index to //scs#q3cindex. While q3c is still faster and more compact than pgsphere when just indexing points, in the longer run I'd like to shed the extra dependency (note, however, that the pgsphere index limits the cone search to a maximum radius of 90 degrees at this point).
    • Talking about Cone Search: For custom parameters, DaCHS has so far used SSA-like syntax, so you could say, for instance, vmag=12/13 (for “give me rows where vmag is between 12 and 13”). Since I don't think this was widely used, I've taken the liberty to migrate to DALI-compliant syntax, where intervals are written as they would be in VOTable PARAM values: vmag=12 13.
    • In certain situations, DaCHS tries to enable parallel queries (previously on this blog).
    • Some new ADQL user defined functions: gavo_random_normal, gavo_mocintersect, and gavo_mocunion. See the TAP capabilities for details, and note that the moc functions will fail until we put out a new pgsphere package that has support for the MOC-MOC operations (there is a small usage sketch right after this list).
    • dachs info (highly recommended after an import) now takes a --sample-percent option that helps when doing statistics on large tables.
    • For SSA services serving something other than spectra (in all likelihood, timeseries), you can now set a productType meta as per the upcoming SimpleDALRegExt 1.2.
    • If you have large, obscore-published SIAP tables, re-index them (dachs imp -I q) so queries over s_ra and s_dec get index support, too.
    • Since we now maintain RD state in the database, you can remove the files /var/gavo/state/updated* after upgrading.
    • When writing datalink metaMakers returning links, you can (and should, for new RDs) define the semantics in an attribute to the element rather than in the LinkDef constructor.
    • Starting with this version, it's a good idea to run dachs limits after an import. This, right now, will mainly set an estimate for the number of rows in a table, but that's already relevant because the ADQL translator uses it to help the postgres query planner. It will later also update various kinds of column metadata that, or so I hope, will become relevant in VODataService 1.3.
    • forceUnique on table elements is now a no-op (and should be removed); just define a dupePolicy as before.
    • If you write bad obscore mappings, it could so far be hard to figure out the reason for the failure and, between lots of confusing error messages, to fix it. To help with that, you can now run ``dachs imp //obscore recover`` in such a situation. It will re-create the obscore table and throw out all stanzas that fail; after that, you can fix the obscore declarations that were thrown out one by one.
    • If you run DaCHS behind a reverse proxy that terminates https, you can now set [web]adaptProtocol in /etc/gavo.rc to False. This will make that setup work for form-based services, too.
    • If you have custom OAI set names (i.e., anything but local and ivo_managed in the sets attribute of publish elements), you now have to declare them in [ivoa]validOAISets.
    • Removed things: the docform renderer (use form instead), the soap renderer (well, it's not actually removed, it's just that the code it depends on doesn't exist on python3 any more), sortKey on services (use the defaultSortKey property), //scs#q3cpositions (port the table to have ra and dec and one of the SCS index mixins), the (m)img.jpeg renderers (if you were devious enough to use these, let me know), and quite a few even more exotic things.
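
    To give an idea of how the new UDFs slot into everyday queries, here is a minimal sketch that uses gavo_random_normal to jitter catalogue magnitudes by their errors. I am assuming a gavo_random_normal(mu, sigma) signature here, so check your service's TAP capabilities for the authoritative declaration; the table used is the BGDS photometry table that features further down this page:

    -- Sketch only: assumes gavo_random_normal(mu, sigma) takes mean and
    -- standard deviation; the actual UDF declarations are listed in the
    -- TAP capabilities of the service.
    SELECT TOP 10
      mean_mag,
      gavo_random_normal(mean_mag, err_mag) AS jittered_mag
    FROM bgds.phot_all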

    Some Breaking Changes

    Python 3 was released in 2008, not long after DaCHS' inception, but since quite a few of the libraries it uses to do its job weren't available for Python 3 for a long time, we have been reluctant to make the jump over the past ten years (and actually, the stability of the python2 platform was a very welcome thing).

    Indeed, the most critical of our dependencies, twisted, only became properly usable with python3 in, roughly, 2017. And large parts of DaCHS weren't even using twisted directly, but rather a nice add-on to it called nevow. Significant parts of nevow bled through to DaCHS operators; for instance, the render functions or the entire HTML templating.

    Nevow, unfortunately, fell out of fashion, and so nobody stepped forward to port it. And when I started porting it myself I realised that I was mainly using the relatively harmless parts of nevow, and hence after a while I figured that I could replace the entire dependency with something like a thousand lines in DaCHS, which, given the significant aches of porting the whole of nevow, seemed like a good deal.

    The net effect is that if you built code on top of nevow – most likely in the form of a custom renderer – that will break now, and porting will probably be rather involved (having ported ~5 custom renderers, I think I can tell). If this concerns you, have a look at the README in gavo.formal (and then complain because it's mainly notes to myself at this point). I feel a bit bad about having to break things that are not totally unreasonable in this drastic way and thus offer any help I can give to port legacy DaCHS code.

    Outside of these custom renderers, there should just be a single visible change: If you have used n:data="some_key" in nevow templates to pull data from dictionaries, that won't work any longer. Use n:data="key some_key" n:render="str" instead. And it turns out that this very construct was used in the default root template, which you may have derived from. So – see if you have /var/gavo/web/templates/root.html and if so, whether there is <ul n:data="chunk" in there. If you have that, change it to <ul n:data="key chunk".

    Update (2020-11-19): Two only loosely related problems have surfaced during updates. In particular if you are updating on rather old installations, you may want to look at the points on Invalid script type preIndex and function spoint_in already exists in our list of common problems.

  • Building consensus

    Markus, handwringing

    Sometimes, building consensus takes a little bending: Me, at the Shanghai Interop of 2017. In-joke: there's “STC” on the slide.

    In the Virtual Observatory, procedures are built on consensus: No (relevant) decisions are passed based on some sort of majority vote. While I personally think that's a very good thing in general – you really don't want to clobber minorities, and I couldn't even give a minimal size of such a minority below which it might be ok to ignore them –, there is a profound operational reason for that: We cannot force data centers or software writers to comply with our standards, so they had better agree with them in the first place.

    However, building consensus (to avoid Chomsky's somewhat odious notion of manufacturing consent) is hard. In my current work, this insight manifests itself most strongly when I wear my hat as chair of the IVOA Semantics Working Group, where we need to sort items from a certain part of the world into separate boxes and label those, that is, we're building vocabularies. “Part of the world” can be formalised, and there are big phrases like “universe of discourse” to denote such formalisations, but to give you an idea, it's things like reference frames, topics astronomy in general talks about (think journal keywords), relationships between data collections and services, or the roles of files related to or making up a dataset. If you visit the VO's vocabulary repository, you will see what parts we are trying to systematise, and if you skim the current draft for the next release of Vocabularies in the VO, in section two you can find a few reasons why we are bothering to do that.

    As you may expect if you have ever tried classifications like this, what boxes (”concepts” in the argot of the semantics folks) there should be and how to label them are questions with plenty of room for dissent. A case study for this is the discussion on VEP-001 and its successors that has been going on since late last year; it also illustrates that we are not talking about bikeshedding here. The discussion clarified much and, in particular, led to substantial improvements not only to the concept in question but also far beyond that. If you are interested, have a look at a few mail threads (here, here, here, or here; more discussion happened live at meetings).

    An ideal outcome of such a process is, of course, a solution that is obvious in retrospect, so everyone just agrees. Sometimes, that doesn't happen, and one of these times is VEP-001 and the VEP-003 it evolved into. A spontaneous splinter meeting between sessions of this week's Virtual Interop yielded two rather sensible names for the concept we had identified in the previous debates: #sibling on the one hand, and #co-derived on the other (in case you're RDF-minded: the full vocabulary URIs are obtained by prefixing this with the vocabulary URI, http://www.ivoa.net/rdf/datalink/core). Choosing between the two is a bit of a matter of taste, but also of perhaps changing implementations, and so I don't see a clear preference. And the people in the conference didn't reach an agreement before people on the North American west coast really had to have some well-deserved sleep.

    In such a situation – extensive discussion yields only a very few, apparently rather equivalent solutions –, I suspect it is the time to resort to some sort of polling after all. So, in the session I asked the people involved to give their pain level on a scale of 1 to 10. Given there are quite a few consensus scales out there already (I'm too lazy to look for references now, but I'll retrofit them here if you send some in), I felt this was a bit hasty after I had closed the z**m^H^H^H^H telecon client. But then, thinking about it, I started to like that scale, and so during a little bike ride I came up with what's below. And since I started liking it, I thought I could put it into words, and into a form I can reference when similar situations come up in the future. And so, here it is:

    Markus' Pain Level Scale

    1. Oh wow. I'm enthusiastic about it, and I'd get really cross if we didn't do it.
    2. It's great. I don't think we'll find a better solution. People better have really strong reasons to reject it.
    3. Fine. Just go ahead.
    4. Quite reasonable. I have some doubts, but I either don't have a good alternative, or the alternatives certainly won't improve matters.
    5. Reasonable. I can live with it, possibly accepting a very moderate amount of pain (like: change an implementation that I think is fine as it is).
    6. Sigh. I don't like it much. If you think it's useful, do it, but don't blame me if it later turns out it stinks.
    7. Ouch. I wish we didn't have to go there. For instance: This is going to uglify a few things I care about.
    8. Yikes. I think it's a bad idea. Honestly, let's not do it. It's going to make quite a few things a lot uglier, though I give you it might still just barely work.
    9. OMG. What are you thinking? I won't go near it, and I pity everyone who will have to. And it's quite likely going to blow up some things I care about.
    10. Blech. To me, this clearly is a grave mistake that will impact a lot of things very adversely. If I can do anything within reason to stop it, I'll do it. Consider this a veto, and shame on you if you override it.

    You can qualify this with:

    +: I've thought long and hard about this, and I think I understand the matter in depth. You'll hence need arguments of the profundity of the Earth's outer core to sway me.
    (unqualified): I've thought about this, and as far as I understand the matter I'm sure about it. More information, solid arguments, or a sudden inspiration while showering might still sway me.
    -: This is a gut feeling. It could very well be phantom pain. Feel free to try a differential diagnosis.

    If you like the scale, too, feel free to reference it as https://blog.g-vo.org/building-consensus/#scale.

  • GAVO vs. Corona

    A conference group photo

    You won't see something like this (the May 2018 Interop group photo) in Spring 2020: The Sydney Interop, planned for early May, is going to take place using remote tools. Some of which I'd rather do without.

    The Corona pandemic, regrettably, has also brought with it a dramatic move to closed, proprietary communication and collaboration platforms: I'm being bombarded by requests to join Zoom meetings, edit Google docs, chat on Slack, “stream” something on any of Youtube, Facebook, Instagram, or Sauron (I've made one of these up).

    Mind you, that's within the Virtual Observatory. Call me pig-headed, but I feel that's a disgrace when we're out to establish Free and open standards (for good reasons). To pick a particularly sad case, Slack right now is my pet peeve because they first had an interface to IRC (which has been doing what they do since the late 1980s, though perhaps not as prettily in a web browser) and then cut it when they had sufficient lock-in. Of course, remembering how Google first had XMPP (that's the interoperable standard for instant messaging) in Google talk and then cut that, too... ah well, going proprietary unfortunately is just good business sense once you have sufficient lock-in.

    Be that as it may, I was finally fed up with all this proprietary tech and set up something suitable for conferencing built on open, self-hostable components. It's on https://telco.g-vo.org, and you're welcome to use it for your telecons (assuming that when you're reading this blog, you have at least some relationship to astronomy and open standards).

    What's in there?

    Unfortunately, there doesn't seem to be an established, Free conferencing system based on SIP/RTP, which I consider the standard for voice communication on the internet (if you've never heard of it: it's what your landline phone uses in all likelihood). That came as a bit of a surprise to me, but the next best thing is a solution that is Free and has multiple implementations, and that's the great mumble system, which (at least for me) works so much better than all the browser-based horrors, not to mention it's quite a bit more bandwidth-efficient. So: Get a client and connect to telco.g-vo.org. Join one of the two meeting rooms, done.

    Mumble doesn't have video, which, considering I've seen enough of peoples' living rooms (not to mention Zoom's silly bluebox backgrounds) to last a lifetime, counts as an advantage in my book. However, being able to share a view on a document (or slide set) and point around in it is a valid use case. Bonus points if the solution to that does not involve looking at other people's mail, IM notifications, or screen backgrounds.

    Now, a quick web search did not turn up anything acceptable to me, and since I've always wanted to play with websockets, I've created poatmyp: With it, you upload a PDF, distribute the link to your meeting partners, and all participants will see the slides and a shared pointer. And they can move around in the document together.

    What's left is shared editing. I've looked at a few implementations of this, but, frankly, there's too much npm and the related curlbashware in this field to make any of it enjoyable; also, it seems nobody has bothered to provide a Debian package of one of the systems. On the other hand, there are a few trustworthy operators of etherpads out there, so for now we are pointing to them on telco.g-vo.

    Setting up a mumble server and poatmyp isn't much work if you know how to configure an nginx and have a suitable box on the web. So: perhaps you'll use this opportunity to regain a bit of self-reliance? You see, there's little point in having your own local copy of the Gaia catalogue, and doing that right is hard. Thanks to people writing Free software, running a simple telecon infrastructure, on the other hand, isn't hard any more.

  • The Bochum Galactic Disk Survey

    Patches of higher perceived variability on the Sky

    Fig 1: How our haphazard variability ratio varies over the sky (galactic coordinates). And yes, it's clear that this isn't dominated by physical variability.

    About a year ago, I reported on a workshop on “Large Surveys with Small Telescopes” in Bamberg; at around the same time, I published an example for those, the Bochum Galactic Disk Survey BGDS, which used a twin 15 cm robotic telescope in some no longer forsaken place in the Andes mountains to monitor the brighter stars in the southern Milky Way. While some tables from an early phase of the survey have been on VizieR for a while, we now publish the source images (also in SIAP and Obscore), the mean photometry (via SCS and TAP) and, perhaps most fun of all, the lightcurves (via SSAP and TAP) – a whopping 35 million of the latter.

    This means that in tools like Aladin, you can now find such light curves (and images in two bands from a lot of epochs) when you are in the survey's coverage, and you can run TAP queries on GAVO's http://dc.g-vo.org/tap server against the full photometry table and the time series.

    Regular readers of this blog will not be surprised to see me use this as an excuse to show off a bit of ADQL trickery.

    If you have a look at the bgds.phot_all table in your favourite TAP client, you'll see that it has a column amp, giving the difference between the highest and lowest magnitude. The trouble is that amp for almost all objects just reflects the measurement error rather than any intrinsic variability. To get an idea what's “normal” (based on the fact that essentially all stars have essentially constant luminosity on the range and resolution scales considered here), run a query like:

    SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n
    FROM bgds.phot_all
    WHERE nobs>10
    GROUP BY bin
    

    As this scans the entire 75 million rows of the table, you will probably have to use async mode to run this.

    distribution of amplitude/mag errors

    Figure 2: The distribution of amplitude over magnitude error for all BGDS objects with nobs>10 (blue) and the subset with a mean magnitude brighter than 15 (blue).

    When it comes back, you will have, for objects where any sort of statistics make sense at all (hence nobs>10), a histogram (of sorts) of the amplitude in units of upstream's magnitude error estimation. If you log-log-plot this, you'll see something like Figure 2. The curve at least tells you that the magnitude error estimate is not very far off – the peak at about 3 “sigma” is not unreasonable since about half of the objects have nobs of the order of a hundred and thus would likely contain outliers that far out assuming roughly Gaussian errors.

    And if you're doing a rough cutoff at amp/magerr>10, you will get perhaps not necessarily true variables, but at least potentially interesting objects.
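
    If you want to fish out such candidates yourself, here is a minimal sketch of such a cutoff query (TOP and the column selection are just there to keep the result small; you may also want to add the band and brightness cuts used in the query further down):

    SELECT TOP 1000
      ra, dec, mean_mag, nobs, amp/err_mag AS redamp
    FROM bgds.phot_all
    WHERE nobs>10
      AND amp/err_mag>10
    ORDER BY redamp DESC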

    Let's use this insight to see if we spot any pattern in the distribution of these interesting objects. We'll use the HEALPix technique I discussed three years ago in this blog, but with a little twist from ADQL 2.1: the Common Table Expressions or CTEs I have already mentioned in my blog post on ADQL 2.1 and then advertised in the piece on the Henry Draper catalogue. The brief idea, again, is that you can write queries and give their results a name that you can use elsewhere in the query as if it were an actual table. It's not much different from normal subqueries, but you can re-use CTEs in multiple places in the query (hence the “common”), and they are usually more readable.

    Here, we first create a version of the photometry table that contains HEALPixes and our variability measure, use that to compute two unsophisticated per-HEALPix statistics and eventually join these two to our observable, the ratio of suspected variables to all stars observed (the multiplication with 1.0 is a cheap way to make a float out of a value, which is necessary here because a/b does integer division in ADQL if a and b are both integers):

    WITH photpoints AS (
      SELECT
        amp/err_mag AS redamp,
        amp,
        ivo_healpix_index(5, ra, dec) AS hpx
      FROM bgds.phot_all
      WHERE
        nobs>10
        AND band_name='SDSS i'
        AND mean_mag<16),
    all_objs AS (
      SELECT count(*) AS ct,
        hpx
        FROM photpoints GROUP BY hpx),
    strong_var AS (
      SELECT COUNT(*) AS ct,
        hpx
        FROM photpoints
        WHERE redamp>4 AND amp>1 GROUP BY hpx)
    SELECT
      strong_var.ct/(1.0*all_objs.ct) AS obs,
      all_objs.ct AS n,
      hpx
    FROM strong_var JOIN all_objs USING (hpx)
    WHERE all_objs.ct>20
    

    If you plot this using TOPCAT's HEALPix thingy and ask it to use Galactic coordinates, you will end up with something like Figure 1.

    There clearly is some structure, but given that the variables ratio reaches up to 0.2, this must be reflecting instrumental or pipeline effects and thus earthly rather than astrophysical causes. And that's going beyond what I would like to talk about on a VO blog, although I'll take any bet that you will see significant structure in the spatial distribution of the variability ratio at about any magnitude cutoff, since there are a lot of different population mixtures in the survey's footprint.

    Before winding down, let's have a quick look at the time series. As with the short spectra from the Byurakan use case, we have stored the actual time series as arrays in the database (the mjd and mags columns in bgds.ssa_time_series). Unfortunately, since they are a lot less array-like than homogeneous spectra, it's also a lot harder to do interesting things with them without downloading them (I'm grateful for ideas for ADQL functions that will let you do in-DB analysis for such things). Still, you can at least easily download them in bulk and then process them in, say, python to your heart's content. The Byurakan use case should give you a head start there.
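
    For a first look at what such a bulk download gives you, here is a minimal sketch of a query pulling a small batch of the arrays (it just grabs whatever rows come first; in practice you would of course constrain the selection, for instance through the crossmatch discussed below):

    -- Sketch: fetch a handful of light curves for local processing; the
    -- mjd and mags columns hold the array-valued time series.
    SELECT TOP 20 *
    FROM bgds.ssa_time_series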

    For a quick demo, I couldn't resist checking out objects that Simbad classifies as possible long-period variables (you see, as I write this, the public excitement over Betelgeuse's brief waning is just dying down), and so I queried Simbad for:

    SELECT ra, dec, main_id
    FROM basic
    WHERE
      otype='LP?'
      AND 1=CONTAINS(
         POINT('', ra, dec),
         POLYGON('', 127, -30, 112, -30, 272, -30, 258, -30))
    

    (as of this writing, Simbad still needs the ADQL 2.0-compliant first arguments to POINT and POLYGON), where the POLYGON is intended to give the survey's footprint. I obtained that by reading off the coordinates of the corners in my Figure 1 while it was still in TOPCAT. Oh, and I had to shrink it a bit because Simbad (well, the underlying Postgres server, and, more precisely, its pg_sphere extension) doesn't want polygons with edges longer than π. This will soon become less pedestrian: MOCs in relational databases are coming; more on this in a later post.

    TOPCAT action shot with a light curve display

    Fig 3: V566 Pup's BGDS light curve in a TOPCAT configured to auto-plot the light curves associated with a row from the bgds.ssa_time_series table on the GAVO DC TAP service.

    If you now do the usual spiel with an upload crossmatch to the bgds.ssa_time_series table and check “Plot Table” in Views/Activation Action, you can quickly page through the light curves (TOPCAT will keep the plot style as you go from dataset to dataset, so it's worth configuring the lines and the error bars). Which could bring you to something like Fig. 3; and that would suggest that V* V566 Pup may be long-period (perhaps we are watching a slow maximum here), but on top of that there are probably much faster ripples – unless the errors are grossly off; I am amazed that you can apparently do photometry at error levels of a dozen millimags or so from the ground these days.
