• Building consensus

    image: Markus, handwringing

    Sometimes, building consensus takes a little bending: Me, at the Shanghai Interop of 2017. In-joke: there's “STC” on the slide.

    In the Virtual Observatory, procedures are built on consensus: No (relevant) decisions are passed based some sort of majority vote. While I personally think that's a very good thing in general – you really don't want to clobber minorities, and I couldn't even give a minimal size of such a minority below which it might be ok to ignore them –, there is a profound operational reason for that: We cannot force data centers or software writers to comply with our standards, so they had better agree with them in the first place.

    However, building consensus (to avoid Chomsky's somewhat odious notion of manufacturing consent) is hard. In my current work, this insight manifests itself most strongly when I wear my hat as chair of the IVOA Semantics Working Group, where we need to sort items from a certain part of the world into separate boxes and label those, that is, we're building vocabularies. “Part of the world” can be formalised, and there are big phrases like “universe of discourse” to denote such formalisations, but to give you an idea, it's things like reference frames, topics astronomy in general talks about (think journal keywords), relationships between data collections and services, or the roles of files related to or making up a dataset. If you visit the VO's vocabulary repository, you will see what parts we are trying to systematise, and if you skim the current draft for the next release of Vocabularies in the VO, in section two you can find a few reasons why we are bothering to do that.

    As you may expect if you have ever tried classifications like this, what boxes (”concepts” in the argot of the semantics folks) there should be and how to label them are questions with plenty of room for dissent. A case study for this is the discussion on VEP-001 and its successors that has been going on since late last year; it also illustrates that we are not talking about bikeshedding here. The discussion clarified much and, in particular, led to substantial improvements not only to the concept in question but also far beyond that. If you are interested, have a look at a few mail threads (here, here, here, or here; more discussion happened live at meetings).

    An ideal outcome of such a process is, of course, a solution that is obvious in retrospect, so everyone just agrees. Sometimes, that doesn't happen, and one of these times is VEP-001 and the VEP-003 it evolved into. A spontanous splinter between sessions of this week's Virtual Interop yielded two rather sensible names for the concept we had identified in the previous debates: #sibling on the one hand, and #co-derived on the other (in case you're RDF-minded: the full vocabulary URIs are obtained by prefixing this with the vocabulary URI, http://www.ivoa.net/rdf/datalink/core). Choosing between the two is a bit of a matter of taste, but also of perhaps changing implementations, and so I don't see a clear preference. And the people in the conference didn't reach an agreement before people on the North American west coast really had to have some well-deserved sleep.

    In such a situation – extensive discussion yields some very few, apparently rather equivalent solution –, I suspect it is the time to resort to some sort of polling after all. So, in the session I've asked the people involved to give their pain level on a scale of 1 to 10. Given there are quite a few consensus scales out there already (I'm too lazy to look for references now, but I'll retrofit them here if you send some in), I felt this was a bit hasty after I had closed the z**m^H^H^H^H telecon client. But then, thinking about it, I started to like that scale, and so during a little bike ride I came up with what's below. And since I started liking it, I thought I could put it into words, and into a form I can reference when similar situations come up in the future. And so, here it is:

    Markus' Pain Level Scale

    1. Oh wow. I'm enthusiastic about it, and I'd get really cross if we didn't do it.
    2. It's great. I don't think we'll find a better solution. People better have really strong reasons to reject it.
    3. Fine. Just go ahead.
    4. Quite reasonable. I have some doubts, but I either don't have a good alternative, or the alternatives certainly won't improve matters.
    5. Reasonable. I can live with it, possibly accepting a very moderate amount of pain (like: change an implementation that I think is fine as it is).
    6. Sigh. I don't like it much. If you think it's useful, do it, but don't blame me if it later turns out it stinks.
    7. Ouch. I wish we didn't have to go there. For instance: This is going to uglify a few things I care about.
    8. Yikes. I think it's a bad idea. Honestly, let's not do it. It's going to make quite a few things a lot uglier, though I give you it might still just barely work.
    9. OMG. What are you thinking? I won't go near it, and I pity everyone who will have to. And it's quite likely going to blow up some things I care about.
    10. Blech. To me, this clearly is a grave mistake that will impact a lot of things very adversely. If I can do anything within reason to stop it, I'll do it. Consider this a veto, and shame on you if you override it.

    You can qualify this with:

    +:I've thought long and hard about this, and I think I understand the matter in depth. You'll hence need arguments of the profundity of the Earth's outer core to sway me.
    (unqualified):I've thought about this, and as far as I understand the matter I'm sure about it. More information, solid arguments, or a sudden inspiration while showering might still sway me.
    -:This is a gut feeling. It could very well be phantom pain. Feel free to try a differential diagnosis.

    If you like the scale, too, feel free to reference it as href="https://blog.g-vo.org/building-consensus/#scale">https://blog.g-vo.org/building-consensus/#scale.

  • GAVO vs. Corona

    A conference group photo

    You won't see something like this (the May 2018 Interop group photo) in Spring 2020: The Sidney Interop, planned for early May, is going to take place using remote tools. Some of which I'd rather do without.

    The Corona pandemic, regrettably, has also brought with it a dramatic move to closed, proprietary communication and collaboration platforms: I'm being bombarded by requests to join Zoom meetings, edit Google docs, chat on Slack, “stream” something on any of Youtube, Facebook, Instagram, or Sauron (I've made one of these up).

    Mind you, that's within the Virtual Observatory. Call me pig-headed, but I feel that's a disgrace when we're out to establish Free and open standards (for good reasons). To pick a particularly sad case, Slack right now is my pet peeve because they first had an interface to IRC (which has been doing what they do since the late 80ies, though perhaps not as prettily in a web browser) and then cut it when they had sufficient lock-in. Of course, remembering how Google first had XMPP (that's the interoperable standard for instant messaging) in Google talk and then cut that, too... ah well, going proprietary unfortunately is just good business sense once you have sufficient lock-in.

    Be that as it may, I was finally fed up with all this proprietary tech and set up something suitable for conferecing building on open, self-hostable components. It's on https://telco.g-vo.org, and you're welcome to use it for your telecons (assuming that when you're reading this blog, you have at least some relationship to astronomy and open standards).

    What's in there?

    Unfortunately, there doesn't seem to be an established, Free conferencing system based on SIP/RTP, which I consider the standard for voice communication on the internet (if you've never heard of it: it's what your landline phone uses in all likelihood). That came as a bit of a surprise to me, but the next best thing is a Free and multiply implemented solution, and there's the great mumble system that (at least for me) works so much better than all the browser-based horrors, not to mention it's quite a bit more bandwidth-effective. So: Get a client and connect to telco.g-vo.org. Join one of the two meeting rooms, done.

    Mumble doesn't have video, which, considering I've seen enough of peoples' living rooms (not to mention Zoom's silly bluebox backgrounds) to last a lifetime, counts as an advantage in my book. However, being able to share a view on a document (or slide set) and point around in it is a valid use case. Bonus points if the solution to that does not involve looking at other people's mail, IM notifications, or screen backgrounds.

    Now, a quick web search did not turn up anything acceptable to me, and since I've always wanted to play with websockets, I've created poatmyp: With it, you upload a PDF, distribute the link to your meeting partners, and all participants will see the slides and a shared pointer. And they can move around in the document together.

    What's left is shared editing. I've looked at a few implementations of this, but, frankly, there's too much npm and the related curlbashware in this field to make any of it enjoyable; also, it seems nobody has bothered to provide a Debian package of one of the systems. On the other hand, there are a few trustworthy operators of etherpads out there, so for now we are pointing to them on telco.g-vo.

    Setting up a mumble server and poatmyp isn't much work if you know how to configure an nginx and have a suitable box on the web. So: perhaps you'll use this opportunity to re-gain a bit of self-reliance? You see, there's little point to have your local copy of the Gaia catalogue, and doing that right is hard. Thanks to people writing Free software, running a simple telecon infrastructure, on the other hand, isn't hard any more.

  • The Bochum Galactic Disk Survey

    Patches of higher perceived variability on the Sky

    Fig 1: How our haphazard variability ratio varies over the sky (galactic coordinates). And yes, it's clear that this isn't dominated by physical variability.

    About a year ago, I reported on a workshop on “Large Surveys with Small Telescopes” in Bamberg; at around the same time, I've published an example for those, the Bochum Galactic Disk Survey BGDS, which used a twin 15 cm robotic telescope in some no longer forsaken place in the Andes mountains to monitor the brighter stars in the southern Milky Way. While some tables from an early phase of the survey have been on VizieR for a while, we now publish the source images (also in SIAP and Obscore), the mean photometry (via SCS and TAP) and, perhaps potentially most fun of all, the the lightcurves (via SSAP and TAP) – a whopping 35 million of the latter.

    This means that in tools like Aladin, you can now find such light curves (and images in two bands from a lot of epochs) when you are in the survey's coverage, and you can run TAP queries on GAVO's http://dc.g-vo.org/tap server against the full photometry table and the time series.

    Regular readers of this blog will not be surprised to see me use this as an excuse to show off a bit of ADQL trickery.

    If you have a look at the bgds.phot_all table in your favourite TAP client, you'll see that it has a column amp, giving the difference between the highest and lowest magnitude. The trouble is that amp for almost all objects just reflects the measurement error rather than any intrinsic variability. To get an idea what's “normal” (based on the fact that essentially all stars have essentially constant luminosity on the range and resolution scales considered here), run a query like:

    SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n
    FROM bgds.phot_all
    WHERE nobs>10
    GROUP BY bin
    

    As this scans the entire 75 million rows of the table, you will probably have to use async mode to run this.

    distribution of amplitude/mag errors

    Figure 2: The distribution of amplitude over magnitude error for all BGDS objects with nobs>10 (blue) and the subset with a mean magnitude brighter than 15 (blue).

    When it comes back, you will have, for objects where any sort of statistics make sense at all (hence nobs>10), a histogram (of sorts) of the amplitude in units of upstream's magnitude error estimation. If you log-log-plot this, you'll see something like Figure 2. The curve at least tells you that the magnitude error estimate is not very far off – the peak at about 3 “sigma” is not unreasonable since about half of the objects have nobs of the order of a hundred and thus would likely contain outliers that far out assuming roughly Gaussian errors.

    And if you're doing a rough cutoff at amp/magerr>10, you will get perhaps not necessarily true variables, but, at least potentially interesting objects.

    Let's use this insight to see if we spot any pattern in the distribution of these interesting objects. We'll use the HEALPix technique I've discussed three years ago in this blog, but with a little twist from ADQL 2.1: The Common Table Expressions or CTEs I have already mentioned in my blog post on ADQL 2.1 and then advertised in the piece on the Henry Draper catalogue. The brief idea, again, is that you can write queries and give their results a name that you can use elsewhere in the query as if it were an actual table. It's not much different from normal subqueries, but you can re-use CTEs in multiple places in the query (hence the “common”), and they are usually more readable.

    Here, we first create a version of the photometry table that contains HEALPixes and our variability measure, use that to compute two unsophisticated per-HEALPix statistics and eventually join these two to our observable, the ratio of suspected variables to all stars observed (the multiplication with 1.0 is a cheap way to make a float out of a value, which is necessary here because a/b does integer division in ADQL if a and b are both integers):

    WITH photpoints AS (
      SELECT
        amp/err_mag AS redamp,
        amp,
        ivo_healpix_index(5, ra, dec) AS hpx
      FROM bgds.phot_all
      WHERE
        nobs>10
        AND band_name='SDSS i'
        AND mean_mag<16),
    all_objs AS (
      SELECT count(*) AS ct,
        hpx
        FROM photpoints GROUP BY hpx),
    strong_var AS (
      SELECT COUNT(*) AS ct,
        hpx
        FROM photpoints
        WHERE redamp>4 AND amp>1 GROUP BY hpx)
    SELECT
      strong_var.ct/(1.0*all_objs.ct) AS obs,
      all_objs.ct AS n,
      hpx
    FROM strong_var JOIN all_objs USING (hpx)
    WHERE all_objs.ct>20
    

    If you plot this using TOPCAT's HEALPix thingy and ask it to use Galactic coordinates, you will end up with something like Figure 1.

    There clearly is some structure, but given that the variables ratio reaches up to 0.2, this must be reflecting instrumental or pipeline effects and thus earthly rather than astrophysical causes. And that's going beyond what I wouldd like to talk about on a VO blog, although I'll take any bet that you will see significant structure in the spatial distribution of the variability ratio at about any magnitude cutoff, since there are a lot of different population mixtures in the survey's footprint.

    Before winding down, let's have a quick look at the time series. As with the short spectra from Byurakan use case, we have stored the actual time series as arrays in the database (the mjd and mags columns in bgds.ssa_time_series). Unfortunately, since they are a lot less array-like than homogeneous spectra, it's also a lot harder to do interesting things with them without downloading them (I'm grateful for ideas for ADQL functions that will let you do in-DB analysis for such things). Still, you can at least easily download them in bulk and then process them in, say, python to your heart's content. The Byurakan use case should give you a head start there.

    For a quick demo, I couldn't resist checking out objects that Simbad classifies as possible long-period variables (you see, as I write this, the public excitement over Betelgeuse's brief waning is just dying down), and so I queried Simbad for:

    SELECT ra, dec, main_id
    FROM basic
    WHERE
      otype='LP?'
      AND 1=CONTAINS(
         POINT('', ra, dec),
         POLYGON('', 127, -30, 112, -30, 272, -30, 258, -30))
    

    (as of this writing, Simbad still needs the ADQL 2.0-compliant first arguments to POINT and POLYGON), where the POLYGON is intended to give the survey's footprint. I obtained that by reading off the coordinates of the corners in my Figure 1 while it was still in TOPCAT. Oh, and I had to shrink it a bit because Simbad (well, the underlying Postgres server, and, more precisely, its pg_sphere extension) doesn't want polygons with edges longer than π. This will soon become less pedestrian: MOCs in relational databases are coming; more on this in a later post.

    TOPCAT action shot with a light curve display

    Fig 3: V566 Pup's BGDS lightcuve in a TOPCAT configured to auto-plot the light curves associated with a row from the bgds.ssa_time_series table on the GAVO DC TAP service.

    If you now do the usual spiel with an upload crossmatch to the bgds.ssa_time_series table and check “Plot Table” in Views/Activation Action, you can quickly page through the light curves (TOPCAT will keep the plot style as you go from dataset to dataset, so it's worth configuring the lines and the error bars). Which could bring you to something like Fig. 3; and that would suggest that V* V566 Pup may be long-period (perhaps we are watching a slow maximium here), but on top of that there probably much faster ripples – unless the errors are grossly off; I am amazed that you can apparently do photometry at error levels of a dozen millimags or so from the ground these days.

  • Parallel Queries

    Image: Plot of run times

    An experiment with parallel querying of PPMX, going from single-threaded execution to using seven workers.

    Let me start this post with a TL;DR for

    scientists:
    Large analysis queries (like those that contain a GROUP BY clause) profit a lot from parallel execution, and you needn't do a thing for that.
    DaCHS operators:
    When you have large tables, Postgres 11 together with the next DaCHS release may speed up your responses quite dramatically in some cases.

    So, here's the story –

    I've finally overcome my stretch trauma and upgraded the Heidelberg data center's database server to Debian buster. With that, I got Postgres 11, and I finally bothered to look into what it takes to enable parallel execution of database queries.

    Turns out: My Postgres started to do parallel execution right away, but just in case, I went for the following lines in postgresql.conf:

    max_parallel_workers_per_gather = 4
    max_worker_processes = 10
    max_parallel_workers = 10
    

    Don't quote me on this – I frankly admit I haven't really developed a feeling for the consequences of max_parallel_workers_per_gather and instead just did some experiments while the box was loaded otherwise, determining where raising that number has a diminishing return (see below for more on this).

    The max_worker_processes thing, on the other hand, is an educated guess: on my data center, there's essentially never more than one person at a time who's running “interesting”, long-running queries (i.e., async), and that person should get the majority of the execution units (the box has 8 physical CPUs that look like 16 cores due to hyperthreading) because all other operations are just peanuts in comparison. I'll gladly accept advice to the effect that that guess isn't that educated after all.

    Of course, that wasn't nearly enough. You see, since TAP queries can return rather large result sets – on the GAVO data center, the match limit is 16 million rows, which for a moderate row size of 2 kB already translates to 32 GB of memory use if pulled in at once, half the physical memory of that box –, DaCHS uses cursors (if you're a psycopg2 person: named cursors) to stream results and write them out to disk as they come in.

    Sadly, postgres won't do parallel plans if it thinks people will discard a large part of the result anyway, and it thinks that if you're coming through a cursor. So, in SVN revision 7370 of DaCHS (and I'm not sure if I'll release that in this form), I'm introducing a horrible hack that, right now, just checks if there's a literal “group” in the query and doesn't use a cursor if so. The logic is, roughly: With GROUP, the result set probably isn't all that large, so streaming isn't that important. At the same time, this type of query is probably going to profit from parallel execution much more than your boring sequential scan.

    This gives rather impressive speed gains. Consider this example (of course, it's selected to be extreme):

    import contextlib
    import pyvo
    import time
    
    @contextlib.contextmanager
    def timeit(activity):
      start_time = time.time()
      yield
      end_time = time.time()
      print("Time spent on {}: {} s".format(activity, end_time-start_time))
    
    
    svc = pyvo.tap.TAPService("http://dc.g-vo.org/tap")
    with timeit("Cold (?) run"):
      svc.run_sync("select round(Rmag) as bin, count(*) as n"
        " from ppmx.data group by bin")
    with timeit("Warm run"):
      svc.run_sync("select round(Rmag) as bin, count(*) as n"
        " from ppmx.data group by bin")
    

    (if you run it yourself and you get warnings about VOTable versions from astropy, ignore them; I'm right and astropy is wrong).

    Before enabling parallel execution, this was 14.5 seconds on a warm run, after, it was 2.5 seconds. That's an almost than a 6-fold speedup. Nice!

    Indeed, that holds beyond toy examples. The showcase Gaia density plot:

    SELECT
            count(*) AS obs,
            source_id/140737488355328 AS hpx
    FROM gaia.dr2light
    GROUP BY hpx
    

    (the long odd number is 235416-6, which turns source_ids into level 6-HEALPixes as per Gaia footnote id; please note that Postgres right now isn't smart enough to parallelise ivo_healpix), which traditionally ran for about an hour is now done in less than 10 minutes.

    In case you'd like to try things out on your postgres, here's what I've done to establish the max_parallel_workers_per_gather value above.

    1. Find a table with a few 1e7 rows. Think of a query that will return a small result set in order to not confuse the measurements by excessive client I/O. In my case, that's a magnitude histogram, and the query would be:

      select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;

      Run this query once so the data is in the disk cache (the query is “warm”).

    2. Establish a non-parallel baseline. That's easy to do:

      set max_parallel_workers_per_gather=0;
      
    3. Then run:

      explain analyze select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;
      

      You should see a simple query plan with the runtime for the non-parallel execution – in my case, a bit more than 12 seconds.

    4. Then raise the number of max_parallel_workers_per_gatherer successively. Make sure the query plan has lines of the form “Workers Planned” or so. You should see that the execution time falls with the number of workers you give it, up to the value of max_worker_processes – or until postgres decides your table is too small to warrant further parallelisation, which for my settings happened at 7.

    Note, though, that in realistic, more complex queries, there will probably be multiple operations that will profit from parallelisation in a single query. So, if in this trivial example you can go to 15 gatherers and still see an improvement, this could actually make things slower for complex queries. But as I said above: I have no instinct yet for how things will actually work out. If you have experiences to share: I'm sure I'm not the only person on dachs-users who't be interested.

    Update 2022-05-17: In Postgres 13, I found that the planner disfavours parallel plans a lot stronger than I think it has in Postgres 11. To make up for that, I've amended my postgres configuration (in /etc/postgresql/13/main/postgresql.conf) with the slightly bizarre:

    parallel_tuple_cost = 0.001
    parallel_setup_cost = 3
    

    This is certainly not ideal for every workload, but given the queries I see in the VO I want to give Postgres no excuse not to parallelise when there is at least the shard of a chance it'll help; given I'll never execute more than very few queries per second, the extra overhead for parallelising queries that would be faster sequentially will never really bite me.

  • LAMOST5 meets Datalink

    One of the busiest spectral survey instruments operated right now is the Large Sky Area Multi-Object Fiber Spectrograph Telescope (LAMOST). And its data in the VO, more or less: DR2 and DR3 have been brought into the VO by our Czech colleagues, but since they currently lack resources to update their services to the latest releases, they have kindly given me their DaCHS resource descriptor, and so I had a head start for publishing DR5 in Heidelberg.

    With some minor updates, here it is now: Over nine million medium-resolution spectra covering large parts of the northen sky – the spatial coverage is like this:

    Coverage Healpix map

    There's lots of fun to be had with this; of course, there's an SSA service, so when you point Aladin or Splat at some part of the covered sky and look for spectra, chances are you'll see LAMOST spectra, and when working on some of our tutorials (this one, for example), it happened that LAMOST actually had what I was looking for when writing them.

    But I'd like to use the opportunity to mention two other modes of accessing the data.

    Stacked spectra

    Tablesample and TOPCAT's Plot Table activation action

    Say you'd like to look at spectra of M stars and would like to have some sample from across the sky, fire up TOPCAT, point its TAP client the GAVO DC TAP service (http://dc.g-vo.org/tap) and run something like:

    select
      ssa_pubDID, accref, raj2000, dej2000, ssa_targsubclass
    from lamost5.data tablesample(1)
    where
      ssa_targsubclass like 'M%'
    

    This is using the TABLESAMPLE modifier in the from clause, which isn't standard ADQL yet. As mentioned in the DaCHS 1.4 announcement, DaCHS has a prototype implementation of what's been discussed on the IVOA's DAL mailing list: pick a part of a table rather than the full one. It takes a percentage as an argument, and tells the server to choose about this percentage of the table's records using a reasonable and fast heuristic. Note that this won't give you perfect statistical sampling, but if it's not “good enough” for some purpose, I'd like to learn about that purpose.

    Drawing a proper statistical sample, on the other hand, would take minutes on the GAVO database server – with tablesample, I had the roughly 6000 spectra the above query returns essentially instantaneously, and from eyeballing a sky plot of them, I'd say their distribution is close enough to that of the full DR5. So: tablesample is your friend.

    For a quick look at the spectra themselves, in TOPCAT click Views/Activation Actions, check “Plot Table” and make sure TOPCAT proposes the accref column as “Table Location” (if you don't see these items, update your TOPCAT – it's worth it). Now click on a row or perhaps a dot on a plot and behold an M spectrum.

« Page 13 / 22 »