• Small Telescopes, Large Surveys

    Image: Blink comparator and survey camera

    Plate technology at Bamberg observatory: a blink comparator with one plate mounted, and a survey camera that was once used at Boyden Station, an astronomer outpost in 60ies South Africa.

    I'm currently at the workshop “Large surveys with small telescopes: past, present, and future” (or Astroplate III for short) in Bamberg, where people are discussing using and re-using the rich heritage of historical observations (hence the “plate” part) as well growing that heritage in the age of large CCDs, fast computers and large disks.

    Using and re-using is of course what the Virtual Observatory is about, and we've been keeping fairly large plate collections in our data center for quite a while (among them the Archives of Landessternwarte Königstuhl or the Palomar-Leiden Trojan surveys, and there is the WFPDB TAP-accessibly). Therefore, people from GAVO Heidelberg have been to all past astroplate conferences.

    For this one, I brought a brand-new tutorial on plate scans in the VO, which, I hope, also works as a general introduction to image discovery in the VO using SIAP, Datalink, and Obscore. If you're doing image stuff now and then, please have a quick look at the thing – I am particularly grateful for hints on what to improve or perhaps particularly obvious use cases for the material discussed.

    Such VO proselytising aside, the conference is discussing the wide variety of creative, low-cost data collectors out there as well as computer-aided re-analysis extracting new knowledge from decades-old data. If I had to choose a single come-to-think-of-it moment, it would be Norbert Zacharias' observation that if you have a well-behaved object and you'd like to know where it was in 1900, it's now more accurate to extrapolate Gaia astrometry to the epoch of observation than to measure it on the plate itself. Which is saying a lot about the amazing feat of engineering that Gaia is.

    This is not, however, an argument for dumping the old data. Usually, it is exactly what is not so well-behaved (like those) that's interesting – both in terms of astrometry and in terms of photometry (for which there's a lot more unruly behaviour in the first place). To figure out how objects don't behave well, and, for objects disguising as well-behaved only on time scales of the (say) Gaia mission, which these are, the key is “old” data. The freshness of which we're discussing this week.

  • A New View on SSAP in DaCHS

    When I started working on the VO in 2007, my collagues in Garching already had a software that implemented major parts of the simple spectral access protocol (SSAP) that was being developed back then. It would publish spectra in the FITS format by just blindly dumping all header cards into a database table and then defining a view over that “raw” metadata table to make the whole thing match SSAP's expectations for how the output table should look like. Sometimes you could just map through a header to an SSA column, sometimes you would just convert a unit, sometimes you would have to write a fairly complex SQL expressions combining multiple fields.

    Back then, I didn't like it – why have two things (a table and a view) that can break when one (just a table in SSA's format) would do, too? Also, SSAP has about 50 metadata fields, but lets you put constant values into VOTable PARAMs, which seemed a very reasonable way to attain more compact responses. So, when DaCHS grew SSAP support, I defined a mixin (essentially, a configurable interface definition) that let operators define SSA tables and their constant parameters in a fairly simple fashion and directly produced a table you could base your SSAP service on.

    That made assumptions about which pieces of metadata are constant and which are not; for instance, the original mixin (“hcd” for “homogeneous collection”) assumed all spectra in a data collection came from the same instrument and had the same resolution and (what was I thinking?) SNR. Unsurprisingly, that broke fairly soon. So, I added a second mixin (“mixc”) for when different instruments or codes produced the data.

    But even that was headache, at the latest when I started making time series services using SSAP. And I had to fix a few bugs in the mixins themselves in the meantime, which mostly required re-imports of the data in that design. Such re-imports are non-trivial when you have millions of spectra, and they need to happen at software upgrade time or the services would break with the upgrade. Ouch.

    It was about mid-2018 when it dawned on me that sometimes it's better to have two things that can break even if one would do, after all. Specifically, if fixing the one thing is expensive, it's an excellent idea to put a facade on top of it that's cheap to change and can already be used to repair most deficiencies. Why re-build the house if a paint job does the trick?

    As to having more compact query responses when you stuff metadata that's constant in all the rows into VOTable PARAMS – well, in the age of web pages pulling in a megabyte of javascript and two megabytes of images to display five lines of text, I've become a bit cavalier in that department. Sure, the average row may have grown by a factor of three, but we're still talking only a few megabyte even with large responses. To me, these extra bytes seem a fair price to pay for the increased flexibility and overall more straightforward architecture.

    So, I've now come up with a view-based solution in DaCHS, too: the //ssap#view mixin. This is a bit less radical than the Garching software of 2007, as it doesn't dump raw headers but instead lets you do the primary transformations in the RD. But it no longer constrains what pieces of metadata should be constant and which may vary between spectra, and it uses the same names for the same pieces of metadata throughout (which also is a step forward over the old SSAP mixins).

    With this, DaCHS operators should no longer use the hcd and mixc mixins for new services. The new technique is already reflected the respective tutorial chapter, and the SSAP template (you're using dachs start, aren't you?) now uses it, too.

    If you have a spectra publishing project in your pipeline, this would be the perfect time to upgrade to the DaCHS 1.2.4 beta, which has the new mixin. It would be great if we could iron out remaining wrinkles before the next release makes changes a load on my conscience.

    As to migrating existing SSAP services: Well, it would be great if I could drop the old mixins in a couple of years, as they cause quite a bit of uglyness in DaCHS's built-in //ssap RD. But the migration regrettably isn't straightforward, so you may want to wait a bit before embarking on that journey (I'll be happy to help, though).

  • A Grey Eminence of a Standard

    [Screenshot: graphs and numbers]

    Examples for extra metadata: extended column descriptions on the web pages accompanying the ARI-Gaia TAP service.

    Last friday, I've uploaded a first working draft of VODataService 1.2 to the IVOA documents repository. That's the first major step in updating a standard, and it's an invitation to everyone to have a look and comment.

    Foof, you might say, what do I care? I've not even heard of that standard.

    Well, but you've probably used it. VODataService is (among several other things) the standard that governs how a TAP service tells clients (TOPCAT, say) what tables it has and what's inside of them. So, if you see in TOPCAT that there is a column named ang_error with a unit of deg, a UCD of stat.error;pos and the meaning “1 σ confidence radius of the position”, that most likely came in a document standardised by VODataService.

    The question of what (TAP) services can tell clients about their table set is one major open point: Do we want additional metadata there? This article's image, for inspiration, shows a screenshot of extended metadata Grégory delivers to browsers on his ARI-Gaia service; among this are minima, maxima, means, standard deviations, quartiles, and fill factors (i.e., how many of the columns are NULL). He even shows histograms of the values' distributions and HEALPix maps showing how (the means of) the values vary on the sky. Another example of extended metadata could be footnotes as you will find them on many of my resources' reference URLs (example; footnotes are, unsurprisingly, near the foot of that page).

    We could define interoperable means to communicate information like this. The question is: does the added value justify the complication in implementation? This is where it would be great if you weighed in, in particular if you are a “mere” TAP user: Are there any such pieces of metadata you've always wanted to see in your TAP interfaces? Oh, and metadata of course can also be added to tables rather than columns. The current draft already lets services communicate the number of rows in each table – is there more “simple”, table-specific metadata of this sort?

    VODataService furthermore deals with several other topics; for instance, the STC in the registry business I've blogged about in February is going to be standardised here (update on this: spectral coverage is no longer in wavelength but in energy). Other changes are rather more technical in nature, like several new resource types that will improve the discovery of tables and other such resources, or a careful adjustment of some features to keep them in line with TAP evolution.

    But don't let the technicalities scare you away – just have a peek, and if you have thoughts on any of the VODataService topics: I'm just a mail away.

  • Find Outliers using ADQL and TAP

    Annie Cannon's notebook and a plot

    Two pages from Annie Cannon's notebooks[1], and a histogram of the basic BP-RP color distribution in the HD catalogue (blue) and the distribution of the outliers (red). For more of Annie Cannon's notebooks, search on ADS.

    The other day I gave one of my improvised live demos (“What, roughly, are you working on?”) and I ended up needing to translate identifiers from the Henry Draper Catalogue to modern positions. Quickly typing “Henry Draper” into TOPCAT's TAP search window didn't yield anything useful (some resources only using the HD, and a TAP service that didn't support uploads – hmpf).

    Now, had I tried the somewhat more thorough WIRR Registry interface, I'd have noted the HD catalogue at VizieR and in particular Fabricius' et al's HD-Tycho 2 match (explaining why they didn't show up in TOPCAT is a longer story; we're working on it). But alas, I didn't, and so I set out to produce a catalogue matching HD and Gaia DR2, easily findable from within TOPCAT's TAP client. Well, it's here in the form of the hdgaia.main table in our data center.

    Considering the nontrivial data discovery and some yak shaving I had to do to get from HD identifiers to Gaia DR2 ones, it was perhaps not as futile an exercise as I had thought now and then during the preparation of the thing. And it gives me the chance to show a nice ADQL technique to locate outliers.

    In this case, one might ask: Which objects might Annie Cannon and colleagues have misclassified? Or perhaps the objects have changed their spectrum between the time Cannon's photographic plates have been taken and Gaia observed them? Whatever it is: We'll have to figure out where there are unusual BP-RPs given the spectral type from HD.

    To figure this out, we'll first have to determine what's “usual”. If you've worked through our ADQL course, you know what to expect: grouping. So, to get a table of average colours by spectral type, you'd say (all queries executable on the TAP service at http://dc.g-vo.org/tap):

    select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      count(*) as ct
    from hdgaia.main
    join gaia.dr2light
    using (source_id)
    group by spectral
    

    – apart from the join that's needed here because we want to pull photometry from gaia, that's standard fare. And that join is the selling point of this catalog, so I won't apologise for using it already in the first query.

    The next question is how strict we want to be before we say something that doesn't have the expected colour is unusual. While these days you can rather easily use actual distributions, at least for an initial analysis just assuming a Gaussian and estimating its FWHM as the standard deviation works pretty well if your data isn't excessively nasty. Regrettably, there is no aggregate function STDDEV in ADQL (you could still ask for it: head over to the DAL mailing list before ADQL 2.1 is a done deal!). However, you may remember that Var(X)=E(X2)-E(X)2, that the average is an estimator for the expectation, and that the standard deviation is actually an estimator for the square root of the variance. And that these estimators will work like a charm if you're actually dealing with Gaussian data.

    So, let's use that to compute our standard deviations. While we are at it, throw out everything that's not a star[2], and ensure that our groups have enough members to make our estimates non-ridiculous; that last bit is done through a HAVING clause that essentially works like a WHERE, just for entire GROUPs:

    select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
        power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
      count(*) as ct
    from hdgaia.main
    join gaia.dr2light
      using (source_id)
    where m_v<18
    group by spectral
    having count(*)>10
    

    This may look a bit scary, but if you read it line by line, I'd argue it's no worse than our harmless first GROUP BY query.

    From here, the step to determine the outliers isn't big any more. What the query I've just written produces is a mapping from spectral type to the means and scales (“µ,σ” in the rotten jargon of astronomy) of the Gaussians for the colors of the stars having that spectral type. So, all we need to do is join that information by spectral type to the original table and then see which actual colors are further off than, say, three sigma. This is a nice application of the common table expressions I've tried to sell you in the post on ADQL 2.1; our determine-what's-usual query from above stays nicely separated from the (largely trivial) rest:

    with standards as (select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
        power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
      count(*) as ct
      from hdgaia.main
      join gaia.dr2light
      using (source_id)
      where m_v<18
      group by spectral
      having count(*)>10)
    select *
    from hdgaia.main
    join standards
    using (spectral)
    join gaia.dr2light using (source_id)
    where
      abs(phot_bp_mean_mag-phot_rp_mean_mag-col)>3*sig_col
      and m_v<18
    

    – and that's a fairly general pattern for doing an initial outlier analysis on the the remote side. For HD, this takes a few seconds and yields 2722 rows (at least until we also push HDE into the table). That means you can keep 99% of the rows (the boring ones) on the server and can just pull the ones that could be interesting. These 99% savings aren't terribly much with a catalogue like the HD that's small by today's standards. For large catalogs, it's the difference between a download of a couple of minutes and pulling data for a day while frantically freeing disk space.

    By the way, that there's only 2.7e3 outliers among 2.25e5 objects, while Annie Cannon, Williamina Fleming, Antonia Maury, Edward Pickering, and the rest of the crew not only had to come up with the spectral classification while working on the catalogue but also had to classify all these objects manually. This is an amazing feat even if all of those rows actually were misclassifications (which they certainly aren't) – the machine classifiers of today would be proud to only get 1% wrong.

    The inset in the facsimile of Annie Cannons notebooks above shows how the outliers are distributed in color space relative to the full catalogue, where the basic catalogue is in blue and the outliers (scaled by 70) in red. Wouldn't it make a nice little side project to figure out the reason for the outlier clump on the red side of the histogram?

    [1]The notebook pages are from a notebook Annie Cannon used in 1929. The material was kindly provided by Project PHAEDRA at the John G. Wolbach Library, Harvard College Observatory.
    [2]I'll not hide that I was severely tempted to undo the mapping of object classes to – for HD – unrealistic magnitudes (20 .. 50) but then left the HD as it came from ADC; I still doubt that decision was well taken, and sure enough, the example query above already has insane constraints on m_v reflecting that encoding. From today's position, of course there should have been an extra column or, better yet, a different catalogue for nonstellar objects. Ah well. It's always hard to break unhealty patterns.
  • HTTPS in DaCHS

    Browser windows with and without HTTPS.

    Another little aspect of HTTPS support in DaCHS: In the web interface, the webSAMP button must disappear in pages served through HTTPS: it simply wouldn't work.

    (Warning: No astronomy-relevant content at all this time).

    I can't say I'm a big fan of the mighty push towards HTTPS that's going on right now – as I'm arguing in the updated operator's guide it doesn't do people's privacy a lot of good (compared to, say, pushing for browsers to not execute Javascript by default or have DNSSEC widely deployed), but it's a fairly substantial operational liability. With HTTPS, operators have to deal with cryptographic material, regularly update their certificates, restart their services in time and assemble the whole thing correctly (don't get me started about proxying, SNI, and all those horrors). Users, on the other hand, have to keep their CA certificates in order, in particular when they do programmatic VO access, where the browser vendors, their employers and who knows who else doesn't do it for them. Pop quiz: How would you install a new CA certificate on your box? And will your default browser see it?

    But on the other hand, there are some scenarios in which HTTPS makes sense, and I can remotely fantasise that some of those may even be relevant to the VO. And people have been asking for HTTPS in DaCHS a number of times, at times even because their administrations urged them to switch. So, here it is, hopefully. Turning it on is reasonably easy when you use Letsencrypt (which in particular entails having ports 80 and 443); the section on Letencrypt in the operator's guide tells what to do. In particular don't forget the cron job, because without it, things would break after three months (when the initial certificate expires).

    Things get difficult after that. For one, if your box is known under several names (our data center, for instance, can be reached as any of dc.g-vo.org, vo.uni-hd.de, and dc.zah.uni-heidelberg.de; this of course also includes things like www.example.org and example.org), you'll now have to tell DaCHS about it in the new [web]alternateHostnames configuration item; for instance, we have:

    [web]
    serverURL: http://dc.zah.uni-heidelberg.de
    alternateHostnames:dc.g-vo.org, vo.uni-hd.de
    

    in our /etc/gavo.rc.

    And then the Registry has to know you have https. There's actually no convention for that in the VO yet. But since I'd really like to have at least fallback interfaces with plain HTTP, we'll have to come up with something. For now, my plan is to have the alternative protocol (i.e., HTTPS for sites that have an HTTP-serverURL and vice versa) using the brand-new VOResource 1.1 mirrorURLs (in RegTAP 1.1, they are in the mirror_url column rr.interface). To make DaCHS declare the alternate URLs, set [web]registerAlternative to True.

    Another change I've introduced for HTTPS is that the default HTML template for the form renderer (i.e., the one people use who come with a browser) now suppresses the SAMP button if the request came in through HTTPS; that's because WebSAMP doesn't work with HTTPS and probably never will – at least I can't see a way to make it happen without totally wrecking what security guarantees HTTPS gives.

    All this doesn't yet cater for the case when you use a reverse proxy to terminate HTTPS. If you are in that situation, please talk to me so we can figure out a sane way for you explain to DaCHS what to tell the Registry.

    Anyway, if you want to try things out, just switch to the beta repostitory and upgrade. Feedback is highly welcome.

    Oh, and if you're a client developer: Our data center is now reachable through HTTPS (at https://dc.g-vo.org), and we already have pushed the records with mirrorURLs declaring HTTPS support to the RegTAP service at dc.g-vo.org (the others will have to wait a bit longer, as we haven't re-published our registry records yet (it's all experimental, after all).

« Page 15 / 21 »