• Histograms and Hidden Open Clusters

    image: reddish pattern

    Colour-coded histograms for distances of stars in the direction of some NGC open clusters -- one cluster per line, so you're looking at a couple of Gigabytes of data here. If you want it a bit more precise: read the article and generate your own image.

    I have spent a bit of time last week polishing up what will (hopefully) be the definitive source of common ADQL User Defined Functions (UDFs) for IVOA review. What's a UDF, you ask? Well, it is an extension to ADQL through which service operators can invent new functionality. If you have been following this blog for a while, you will probably remember the ivo_healpix_index function from our dereddening exercise (and some earlier postings): that was a UDF, too.

    This polishing work reminded me of a UDF I've wanted to blog about for quite a while, available in DaCHS (and thus on our Heidelberg Data Center) since mid-2018: gavo_histogram. This, I claim, is a powerful tool for analyses over large amounts of data with rather moderate local means.

    For instance, consider this classic paper on the nature of NGC 2451: What if you were to look for more cases like this, i.e., (indulging in a bit of poetic liberty) open clusters hidden “behind” other open clusters?

    Somewhat more technically this would mean figuring out whether there are “interesting” patterns in the distance and proper motion histograms towards known open clusters. Now, retrieving the dozens of millions of stars that, say, Gaia, has in the direction of open clusters to just build histograms – making each row count for a lot less than one bit – simply is wasteful. This kind of counting and summing is much better done server-side.

    On the other hand, SQL's usual histogram maker, GROUP BY, is a bit unwieldy here, because you have lots of clusters, and you will not see anything if you munge all the histograms together. You could, of course, create a bin index from the distance and then group by this bin and the object name, somewhat like ...ROUND(r_est/20) as bin GROUP BY name, bin – but that takes quite a bit of mangling before it can conveniently be used, in particular when you take independent distributions over multiple variables (“naive Bayesian”; on the other hand, this group-by route is the way to go if you want to capture dependencies between the variables).
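
    If you want to see what that mangling looks like: here's a sketch (mine, not something from the actual analysis) of the group-by route, written as a Python string so you could later push it through pyVO; the table names anticipate the queries developed below, and you'd still have to pivot the (name, bin) pairs into per-cluster histograms client-side:

    # Sketch of manual binning via GROUP BY; "ocl" is the cluster/star match
    # table built below, gdr2dist.main provides the r_est distance estimates.
    MANUAL_BINNING_QUERY = """
    select name, round(r_est/20) as bin, count(*) as n
    from gdr2dist.main
    join ocl using (source_id)
    group by name, bin
    """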

    So, gavo_histogram to the rescue. Here's what the server-provided documentation has to say (if you use TOPCAT, you will find this in the “Service” tab in the TAP window's “Use Service” tab):

    gavo_histogram(val REAL, lower REAL, upper REAL, nbins INTEGER) -> INTEGER[]
    
    The aggregate function returns a histogram of val with
    nbins+2 elements. Assuming 0-based arrays, result[0] contains
    the number of underflows (i.e., val<lower), result[nbins+1]
    the number of overflows. Elements 1..nbins are the counts in
    nbins bins of width (upper-lower)/nbins. Clients will have to
    convert back to physical units using some external communication;
    there currently is no (meta-) data in the TAP response as to what
    lower and upper were.
    

    This may sound a bit complicated, but the gist really is: type gavo_histogram(r_est, 0, 2000, 20) as hist, and you will get back an array with 20 bins, roughly 0..100, 100..200, and so on, and two extra bins for under- and overflows.
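
    Since the response carries no metadata on the binning, you have to reconstruct the physical bin edges yourself from the arguments you passed in. A tiny helper for that could look like this (my sketch, not part of any package):

    import numpy

    def bin_edges(lower, upper, nbins):
        # The nbins regular bins have nbins+1 edges; the under- and
        # overflow bins (elements 0 and nbins+1 of the histogram array)
        # are unbounded on one side.
        return numpy.linspace(lower, upper, nbins+1)

    # For the example above: edges at 0, 100, 200, ..., 2000
    print(bin_edges(0, 2000, 20))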

    Let's try this for our open cluster example. The obvious starting point is selecting the candidate clusters; we are only interested in famous clusters, so we take them from the NGC (if that's too boring for you: with TAP uploads you could take the clusters from Simbad, too), which conveniently sits in my data center as openngc.data:

    select name, raj2000, dej2000, maj_ax_deg
    from openngc.data
    where obj_type='OCl'
    

    Then, we need to add the stars in their rough directions. That's a classic crossmatch, and of course these days we use Gaia as the star catalogue:

    select name, source_id
    from openngc.data
    join gaia.dr2light
    on (
      1=contains(
        point(ra,dec),
        circle(raj2000, dej2000, maj_ax_deg)))
    where obj_type='OCl'
    

    This is now a table of cluster names and Gaia source ids of the candidate stars. To add distances, you could fiddle around with Gaia parallaxes, but because there is a 1/x involved in deriving distances, the error model is complicated, and it is much easier and safer to adopt Bailer-Jones et al.'s pre-computed distances and join them in through source_id.

    And that distance estimation, r_est, is exactly what we want to take our histograms over – which means we have to group by name and use gavo_histogram as an aggregate function:

    with ocl as (
      select name, raj2000, dej2000, maj_ax_deg, source_id
      from openngc.data
      join gaia.dr2light
      on (
        1=contains(
          point(ra,dec),
          circle(raj2000, dej2000, maj_ax_deg)))
      where obj_type='OCl')
    
    select
      name,
      gavo_histogram(r_est, 0, 4000, 200) as hist
    from
      gdr2dist.main
      join ocl
      using (source_id)
    where r_est!='NaN'
    group by name
    

    That's it! This query will give you (admittedly somewhat raw, since we're ignoring the confidence intervals) histograms of the distances of stars in the direction of all NGC open clusters. Of course, it will run a while, as many millions of stars are processed, but TAP async mode easily takes care of that.

    Oh, one odd thing is left to discuss (ignore this paragraph if you don't know what I'm talking about): r_est!='NaN'. That's not quite ADQL, but it happens to do the isnan of normal programming languages, at least when the backend is Postgres: it is true if computations failed and there is an actual NaN in the column. This is uncommon in SQL databases, and normal NULLs wouldn't hurt gavo_histogram. In our distance table, however, some NaNs slipped through, and they would poison our histograms. So, ADQL wizards probably should know that this is what you do for isnan, and that the usual isnan test val!=val doesn't work in SQL (or at least not with Postgres).

    So, fire up your TOPCAT and run this on the TAP server http://dc.g-vo.org/tap.

    You will get a table with 618 (or so) histograms. At this point, TOPCAT can't do a lot with them. So, let's emigrate to pyVO and save this table in a file ocl.vot.
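
    The post doesn't spell out this step, so here is a minimal sketch of how it might go with a reasonably recent pyVO (the query is the one from above, the output file name is the ocl.vot just mentioned):

    import pyvo

    QUERY = """
    with ocl as (
      select name, raj2000, dej2000, maj_ax_deg, source_id
      from openngc.data
      join gaia.dr2light
      on (
        1=contains(
          point(ra,dec),
          circle(raj2000, dej2000, maj_ax_deg)))
      where obj_type='OCl')
    select
      name,
      gavo_histogram(r_est, 0, 4000, 200) as hist
    from
      gdr2dist.main
      join ocl
      using (source_id)
    where r_est!='NaN'
    group by name
    """

    # Run in async mode, since this processes many millions of rows.
    service = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    job = service.submit_job(QUERY)
    job.run()
    job.wait()
    job.raise_if_error()
    job.fetch_result().to_table().write("ocl.vot", format="votable")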

    My visualisation proposition would be: let's subtract a “background” from the histograms (I'm using splines to model that background) and then plot them row by row; multi-peaked rows in the resulting image would be suspicious.

    This is exactly what the programme below does, and the image for this article is a cutout of what the code produces. Set GALLERY = True to see what the histograms and background fits look like (hit 'q' to get to the next one).

    In the resulting image, any two yellow dots in one line are at least suspicious; I've spotted a few, but they are so conspicuous that others must have noticed. Or have they? If you'd like to check a few of them out, feel free to let me know – I think I have a few ideas how to pull some VO tricks to see if these things are real – and if they've been spotted before.

    So, here's the yellow spot programme:

    from astropy.table import Table
    import matplotlib.pyplot as plt
    import numpy
    from scipy.interpolate import UnivariateSpline
    
    GALLERY = False
    
    def subtract_background(arr):
        # Normalise the histogram and take out a smooth spline
        # "background" so that genuine peaks stand out.
        x = numpy.arange(len(arr))
        arr = arr/arr.mean()
        background = UnivariateSpline(x, arr, s=100)
        cleaned = arr-background(x)

        if GALLERY:
            # Show each histogram together with its spline fit.
            plt.plot(x, arr)
            plt.plot(x, background(x))
            plt.show()

        return cleaned
    
    
    def main():
        tab = Table.read("ocl.vot")
        # Strip the under- and overflow bins ([1:-1]) before cleaning.
        hist = numpy.array([subtract_background(r["hist"][1:-1])
          for r in tab])
        plt.matshow(hist, cmap='gist_heat')
        plt.show()
    
    
    if __name__=="__main__":
        main()
    
  • Tutorial Renewal

    The DaCHS Tutorial among other seminal works

    DaCHS' documentation (readthedocs mirror) has two fat pieces and a lot of smaller read-as-you-go pieces. One of the behemoths, the reference documentation, at roughly 350 PDF pages, has large parts generated from source code, and there is no expectation that anyone would ever read it linearly. Hence, I wasn't terribly worried about unreadable^Wpassages of questionable entertainment value in there.

    That's a bit different with the tutorial (also available as 150 page PDF; epub on request): I think serious DaCHS deployers ought to read the DaCHS Basics and the chapters on configuring DaCHS and the interaction with the VO Registry, and they should skim the remaining material so they are at least aware of what's there.

    Ok, I give you that is a bit utopian. But given that pious wish, I felt rather bad that the tutorial had become somewhat incoherent in the years since I started the piece in April 2009 (perhaps graciously, the early history is not visible at the documentation's current github home). Hence, when applying for funds under our current e-inf-astro project, I had promised to give the tutorial a solid makeover as, hold your breath, Milestone B1-5, due in the 10th quarter. In human terms: last December.

    When it turned out the Python 3 migration was every bit as bad as I had feared, it became clear that other matters had to take priority and that we might miss this part of that “milestone” (sorry, I can't resist these quotes). And given e-inf-astro only had two quarters to go after that, I prepared for having to confess I couldn't make good on my promise of fixing the tutorial.

    But then along came Corona, and reworking prose seemed the ideal pastime for the home office. So, on April 4, I forked off a new-tutorial branch and started a rather large overhaul that, among other things, resulted in the operators' guide, with its precarious position between tutorial and reference, being largely absorbed into the tutorial. In all, off and on over the last few months, I accumulated (according to git diff --shortstat) 6372 inserted and 3453 deleted lines in the tutorial's source. Since that source currently is 7762 lines, I'd say that's the complete makeover I had promised. Which is good, as e-inf-astro will be over next Wednesday (but don't worry, our work is still funded).

    So – whether you are a DaCHS expert, are thinking about running it, or are just curious what it takes to build VO services, let me copy from index.html: Tutorial on importing data (tutorial.html, tutorial.pdf, tutorial.rstx). The ideal company for your vacation!

    And if you find typos, boring pieces, overly radical advocacy or anything else you don't like: there's a bug tracker for you (not to mention PRs are welcome).

  • DaCHS 2.1: Say hello to Python 3

    DaCHS and python logos

    Today, I have released DaCHS 2.1, the first stable DaCHS running on Python 3. I have tried hard to make the major version move painless and easy, and indeed “pure DaCHS” RDs should just continue to work. But wherever there's Python in your RDs or near them, things may break, since Python 3 is different from Python 2 in some rather fundamental ways.

    Hence, the Debian package even has a new name: gavodachs2-server. Unless you install that, things will keep running as they do. I will keep fixing serious DaCHS 1 bugs for a while, so there's no immediate urgency to migrate. But unless you migrate, you will not see any new features, so one of these days you will have to migrate anyway. Why not do it today?

    Migrating to DaCHS 2

    In principle, just say apt install gavodachs2-server and hope for the best. If you have a development machine and regression tests defined, this is actually what we recommend, and we'd be very grateful to learn of any problems you may encounter.

    If you'd rather be a little more careful, Carlos Henrique Brandt has kindly updated his Docker files in order to let you spot problems before you mess up your production server. See Test Migration for a quick intro on how to do that. If you spot any problems that are not related to the Python 3 pitfalls mentioned in the howto linked below or to the nevow exodus, please tell me or (preferably) the dachs-support mailing list.

    A longer, more or less permanent piece elaborating possible migration pains, is in our how-to documentation: How do I go from DaCHS1 to DaCHS2?

    What's new in DaCHS 2?

    I've used the opportunity of the major version change to remove a few (mis-) features that I'm rather sure nobody uses; and there are a few new features, too. Here's a rundown of the more notable changes:

    • DaCHS now produces VOTable 1.4 by default. This is particularly notable when you provide TIMESYS metadata (on which I'll report some other time).
    • When doing spatial indices, prefer the new //scs#pgs-pos-index to //scs#q3cindex. While q3c is still faster and more compact than pgsphere when just indexing points, on the longer run I'd like to shed the extra dependency (note, however, that the pgsphere index limits the cone search to a maximum radius of 90 degrees at this point).
    • Talking about Cone Search: For custom parameters, DaCHS has so far used SSA-like syntax, so you could say, for instance, vmag=12/13 (for “give me rows where vmag is between 12 and 13”). Since I don't think this was widely used, I've taken the liberty to migrate to DALI-compliant syntax, where intervals are written as they would be in VOTable PARAM values: vmag=12 13 (see the sketch after this list).
    • In certain situations, DaCHS tries to enable parallel queries (previously on this blog).
    • Some new ADQL user defined functions: gavo_random_normal, gavo_mocintersect, and gavo_mocunion. See the TAP capabilities for details, and note that the moc functions will fail until we put out a new pgsphere package that has support for the MOC-MOC operations.
    • dachs info (highly recommended after an import) now takes a --sample-percent option that helps when doing statistics on large tables.
    • For SSA services serving something other than spectra (in all likelihood, timeseries), you can now set a productType meta as per the upcoming SimpleDALRegExt 1.2.
    • If you have large, obscore-published SIAP tables, re-index them (dachs imp -I q) so queries over s_ra and s_dec get index support, too.
    • Since we now maintain RD state in the database, you can remove the files /var/gavo/state/updated* after upgrading.
    • When writing datalink metaMakers returning links, you can (and should, for new RDs) define the semantics in an attribute to the element rather than in the LinkDef constructor.
    • Starting with this version, it's a good idea to run dachs limits after an import. This, right now, will mainly set an estimate for the number of rows in a table, but that's already relevant because the ADQL translator uses it to help the postgres query planner. It will later also update various kinds of column metadata that, or so I hope, will become relevant in VODataService 1.3.
    • forceUnique on table elements is now a no-op (and should be removed); just define a dupePolicy as before.
    • If you write bad obscore mappings, it could so far be hard to figure out the reason for the failure and, between lots of confusing error messages, to fix it. In such a situation, you can now run dachs imp //obscore recover. It will re-create the obscore table and throw out all stanzas that fail; after that, you can fix the obscore declarations that were thrown out one by one.
    • If you run DaCHS behind a reverse proxy that terminates https, you can now set [web]adaptProtocol in /etc/gavo.rc to False. This will make that setup work for form-based services, too.
    • If you have custom OAI set name (i.e., anything but local and ivo_managed in the sets attribute of publish elements), you now have to declare them in [ivoa]validOAISets.
    • Removed things: the docform renderer (use form instead), the soap renderer (well, it's not actually removed, it's just that the code it depends on doesn't exist on python3 any more), sortKey on services (use the defaultSortKey property), //scs#q3cpositions (port the table to have ra and dec and one of the SCS index mixins), the (m)img.jpeg renderers (if you were devious enough to use these, let me know), and quite a few even more exotic things.
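
    To illustrate the DALI-style intervals for custom cone search parameters mentioned in the list above: a request might look like the following sketch, in which the service path and the cone are entirely made up and only the vmag syntax is from the release notes:

    import requests

    resp = requests.get(
        # hypothetical SCS endpoint; substitute one of your own services
        "http://dc.g-vo.org/some/cone/scs.xml",
        params={
            "RA": 116.4, "DEC": -38.0, "SR": 0.5,  # made-up cone
            # DALI interval syntax: space-separated lower and upper bound
            "vmag": "12 13",
        })
    print(resp.status_code)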

    Some Breaking Changes

    Python 3 was released in 2008, not long after DaCHS' inception, but since quite a few of the libraries it uses to do its job haven't been available for Python 3, we have been reluctant to make the jump over the past ten years (and actually, the stability of the python2 platform was a very welcome thing).

    Indeed, the most critical of our dependencies, twisted, only became properly usable with python3 in, roughly, 2017. And large parts of DaCHS weren't even using twisted directly, but rather a nice add-on to it called nevow. Significant parts of nevow bled through to DaCHS operators; for instance, the render functions or the entire HTML templating.

    Nevow, unfortunately, fell out of fashion, and so nobody stepped forward to port it. And when I started porting it myself, I realised that I was mainly using the relatively harmless parts of nevow, and hence after a while I figured that I could replace the entire dependency by something like 1000 lines in DaCHS, which, given the significant aches of porting the whole of nevow, seemed like a good deal.

    The net effect is that if you built code on top of nevow – most likely in the form of a custom renderer – that will break now, and porting will probably be rather involved (having ported ~5 custom renderers, I think I can tell). If this concerns you, have a look at the README in gavo.formal (and then complain because it's mainly notes to myself at this point). I feel a bit bad about having to break things that are not totally unreasonable in this drastic way and thus offer any help I can give to port legacy DaCHS code.

    Outside of these custom renderers, there should just be a single visible change: If you have used n:data="some_key" in nevow templates to pull data from dictionaries, that won't work any longer. Use n:data="key some_key" n:render="str" instead. And it turns out that this very construct was used in the default root template, which you may have derived from. So – see if you have /var/gavo/web/templates/root.html and if so, whether there is <ul n:data="chunk" in there. If you have that, change it to <ul n:data="key chunk".

    Update (2020-11-19): Two only loosely related problems have surfaced during updates. In particular if you are updating on rather old installations, you may want to look at the points on Invalid script type preIndex and function spoint_in already exists in our list of common problems.

  • Building consensus

    image: Markus, handwringing

    Sometimes, building consensus takes a little bending: Me, at the Shanghai Interop of 2017. In-joke: there's “STC” on the slide.

    In the Virtual Observatory, procedures are built on consensus: No (relevant) decisions are passed based on some sort of majority vote. While I personally think that's a very good thing in general – you really don't want to clobber minorities, and I couldn't even give a minimal size of such a minority below which it might be ok to ignore them –, there is a profound operational reason for that: We cannot force data centers or software writers to comply with our standards, so they had better agree with them in the first place.

    However, building consensus (to avoid Chomsky's somewhat odious notion of manufacturing consent) is hard. In my current work, this insight manifests itself most strongly when I wear my hat as chair of the IVOA Semantics Working Group, where we need to sort items from a certain part of the world into separate boxes and label those, that is, we're building vocabularies. “Part of the world” can be formalised, and there are big phrases like “universe of discourse” to denote such formalisations, but to give you an idea, it's things like reference frames, topics astronomy in general talks about (think journal keywords), relationships between data collections and services, or the roles of files related to or making up a dataset. If you visit the VO's vocabulary repository, you will see what parts we are trying to systematise, and if you skim the current draft for the next release of Vocabularies in the VO, in section two you can find a few reasons why we are bothering to do that.

    As you may expect if you have ever tried classifications like this, what boxes (”concepts” in the argot of the semantics folks) there should be and how to label them are questions with plenty of room for dissent. A case study for this is the discussion on VEP-001 and its successors that has been going on since late last year; it also illustrates that we are not talking about bikeshedding here. The discussion clarified much and, in particular, led to substantial improvements not only to the concept in question but also far beyond that. If you are interested, have a look at a few mail threads (here, here, here, or here; more discussion happened live at meetings).

    An ideal outcome of such a process is, of course, a solution that is obvious in retrospect, so everyone just agrees. Sometimes, that doesn't happen, and one of these times is VEP-001 and the VEP-003 it evolved into. A spontaneous splinter meeting between sessions of this week's Virtual Interop yielded two rather sensible names for the concept we had identified in the previous debates: #sibling on the one hand, and #co-derived on the other (in case you're RDF-minded: the full vocabulary URIs are obtained by prefixing this with the vocabulary URI, http://www.ivoa.net/rdf/datalink/core). Choosing between the two is a bit of a matter of taste, but also of perhaps changing implementations, and so I don't see a clear preference. And the people in the conference didn't reach an agreement before people on the North American west coast really had to have some well-deserved sleep.

    In such a situation – extensive discussion yields a very few, apparently rather equivalent solutions –, I suspect it is time to resort to some sort of polling after all. So, in the session I asked the people involved to give their pain level on a scale of 1 to 10. Given there are quite a few consensus scales out there already (I'm too lazy to look for references now, but I'll retrofit them here if you send some in), I felt this was a bit hasty after I had closed the z**m^H^H^H^H telecon client. But then, thinking about it, I started to like that scale, and so during a little bike ride I came up with what's below. And since I started liking it, I thought I could put it into words, and into a form I can reference when similar situations come up in the future. And so, here it is:

    Markus' Pain Level Scale

    1. Oh wow. I'm enthusiastic about it, and I'd get really cross if we didn't do it.
    2. It's great. I don't think we'll find a better solution. People better have really strong reasons to reject it.
    3. Fine. Just go ahead.
    4. Quite reasonable. I have some doubts, but I either don't have a good alternative, or the alternatives certainly won't improve matters.
    5. Reasonable. I can live with it, possibly accepting a very moderate amount of pain (like: change an implementation that I think is fine as it is).
    6. Sigh. I don't like it much. If you think it's useful, do it, but don't blame me if it later turns out it stinks.
    7. Ouch. I wish we didn't have to go there. For instance: This is going to uglify a few things I care about.
    8. Yikes. I think it's a bad idea. Honestly, let's not do it. It's going to make quite a few things a lot uglier, though I give you it might still just barely work.
    9. OMG. What are you thinking? I won't go near it, and I pity everyone who will have to. And it's quite likely going to blow up some things I care about.
    10. Blech. To me, this clearly is a grave mistake that will impact a lot of things very adversely. If I can do anything within reason to stop it, I'll do it. Consider this a veto, and shame on you if you override it.

    You can qualify this with:

    +: I've thought long and hard about this, and I think I understand the matter in depth. You'll hence need arguments of the profundity of the Earth's outer core to sway me.
    (unqualified): I've thought about this, and as far as I understand the matter I'm sure about it. More information, solid arguments, or a sudden inspiration while showering might still sway me.
    -: This is a gut feeling. It could very well be phantom pain. Feel free to try a differential diagnosis.

    If you like the scale, too, feel free to reference it as https://blog.g-vo.org/building-consensus/#scale.

  • GAVO vs. Corona

    A conference group photo

    You won't see something like this (the May 2018 Interop group photo) in Spring 2020: The Sydney Interop, planned for early May, is going to take place using remote tools. Some of which I'd rather do without.

    The Corona pandemic, regrettably, has also brought with it a dramatic move to closed, proprietary communication and collaboration platforms: I'm being bombarded by requests to join Zoom meetings, edit Google docs, chat on Slack, “stream” something on any of Youtube, Facebook, Instagram, or Sauron (I've made one of these up).

    Mind you, that's within the Virtual Observatory. Call me pig-headed, but I feel that's a disgrace when we're out to establish Free and open standards (for good reasons). To pick a particularly sad case, Slack right now is my pet peeve, because they first had an interface to IRC (which has been doing what they do since the late eighties, though perhaps not as prettily in a web browser) and then cut it once they had sufficient lock-in. Of course, remembering how Google first had XMPP (that's the interoperable standard for instant messaging) in Google talk and then cut that, too... ah well, going proprietary unfortunately is just good business sense once you have sufficient lock-in.

    Be that as it may, I was finally fed up with all this proprietary tech and set up something suitable for conferencing, building on open, self-hostable components. It's on https://telco.g-vo.org, and you're welcome to use it for your telecons (assuming that when you're reading this blog, you have at least some relationship to astronomy and open standards).

    What's in there?

    Unfortunately, there doesn't seem to be an established, Free conferencing system based on SIP/RTP, which I consider the standard for voice communication on the internet (if you've never heard of it: it's what your landline phone uses in all likelihood). That came as a bit of a surprise to me, but the next best thing is a Free and multiply implemented solution, and there's the great mumble system that (at least for me) works so much better than all the browser-based horrors, not to mention it's quite a bit more bandwidth-efficient. So: Get a client and connect to telco.g-vo.org. Join one of the two meeting rooms, done.

    Mumble doesn't have video, which, considering I've seen enough of peoples' living rooms (not to mention Zoom's silly bluebox backgrounds) to last a lifetime, counts as an advantage in my book. However, being able to share a view on a document (or slide set) and point around in it is a valid use case. Bonus points if the solution to that does not involve looking at other people's mail, IM notifications, or screen backgrounds.

    Now, a quick web search did not turn up anything acceptable to me, and since I've always wanted to play with websockets, I've created poatmyp: With it, you upload a PDF, distribute the link to your meeting partners, and all participants will see the slides and a shared pointer. And they can move around in the document together.

    What's left is shared editing. I've looked at a few implementations of this, but, frankly, there's too much npm and the related curlbashware in this field to make any of it enjoyable; also, it seems nobody has bothered to provide a Debian package of one of the systems. On the other hand, there are a few trustworthy operators of etherpads out there, so for now we are pointing to them on telco.g-vo.

    Setting up a mumble server and poatmyp isn't much work if you know how to configure an nginx and have a suitable box on the web. So: perhaps you'll use this opportunity to re-gain a bit of self-reliance? You see, there's little point in having your own local copy of the Gaia catalogue, and doing that right is hard. Thanks to people writing Free software, running a simple telecon infrastructure, on the other hand, isn't hard any more.
