Find Outliers using ADQL and TAP

[Annie Cannon's notebook and a plot]
Two pages from Annie Cannon’s notebooks, and a histogram of the basic BP-RP color distribution in the HD catalogue (blue) and the distribution of the outliers (red). For more of Annie Cannon’s notebooks, search on ADS.

The other day I gave one of my improvised live demos (“What, roughly, are you working on?”) and I ended up needing to translate identifiers from the Henry Draper Catalogue to modern positions. Quickly typing “Henry Draper” into TOPCAT’s TAP search window didn’t yield anything useful (some resources only using the HD, and a TAP service that didn’t support uploads – hmpf).

Now, had I tried the somewhat more thorough WIRR Registry interface, I’d have noted the HD catalogue at VizieR and in particular Fabricius’ et al’s HD-Tycho 2 match (explaining why they didn’t show up in TOPCAT is a longer story; we’re working on it). But alas, I didn’t, and so I set out to produce a catalogue matching HD and Gaia DR2, easily findable from within TOPCAT’s TAP client. Well, it’s here in the form of the hdgaia.main table in our data center.

Considering the nontrivial data discovery and some yak shaving I had to do to get from HD identifiers to Gaia DR2 ones, it was perhaps not as futile an exercise as I had thought now and then during the preparation of the thing. And it gives me the chance to show a nice ADQL technique to locate outliers.

In this case, one might ask: Which objects might Annie Cannon and colleagues have misclassified? Or perhaps the objects have changed their spectrum between the time Cannon’s photographic plates have been taken and Gaia observed them? Whatever it is: We’ll have to figure out where there are unusual BP-RPs given the spectral type from HD.

To figure this out, we’ll first have to determine what’s “usual”. If you’ve worked through our ADQL course, you know what to expect: grouping. So, to get a table of average colours by spectral type, you’d say (all queries executable on the TAP service at http://dc.g-vo.org/tap):

select spectral, 
  avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
  count(*) as ct
from hdgaia.main
join gaia.dr2light
using (source_id)
group by spectral

– apart from the join that’s needed here because we want to pull photometry from gaia, that’s standard fare. And that join is the selling point of this catalog, so I won’t apologise for using it already in the first query.

The next question is how strict we want to be before we say something that doesn’t have the expected colour is unusual. While these days you can rather easily use actual distributions, at least for an initial analysis just assuming a Gaussian and estimating its FWHM as the standard deviation works pretty well if your data isn’t excessively nasty. Regrettably, there is no aggregate function STDDEV in ADQL (you could still ask for it: head over to the DAL mailing list before ADQL 2.1 is a done deal!). However, you may remember that Var(X)=E(X2)-E(X)2, that the average is an estimator for the expectation, and that the standard deviation is actually an estimator for the square root of the variance. And that these estimators will work like a charm if you’re actually dealing with Gaussian data.

So, let’s use that to compute our standard deviations. While we are at it, throw out everything that’s not a star1, and ensure that our groups have enough members to make our estimates non-ridiculous; that last bit is done through a HAVING clause that essentially works like a WHERE, just for entire GROUPs:

select spectral, 
  avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
  sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
    power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
  count(*) as ct
  from hdgaia.main
  join gaia.dr2light
  using (source_id)
  where m_v<18
  group by spectral
  having count(*)>10

This may look a bit scary, but if you read it line by line, I’d argue it’s no worse than our harmless first GROUP BY query.

From here, the step to determine the outliers isn’t big any more. What the query I’ve just written produces is a mapping from spectral type to the means and scales (“µ,σ,“ in the rotten jargon of astronomy) of the Gaussians for the colors of the stars having that spectral type. So, all we need to do is join that information by spectral type to the original table and then see which actual colors are further off than, say, three sigma. This is a nice application of the common table expressions I’ve tried to sell you in the post on ADQL 2.1; our determine-what’s-usual query from above stays nicely separated from the (largely trivial) rest:

with standards as (select spectral, 
  avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
  sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
    power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
  count(*) as ct
  from hdgaia.main
  join gaia.dr2light
  using (source_id)
  where m_v<18
  group by spectral
  having count(*)>10)
select * 
from hdgaia.main
join standards 
using (spectral)
join gaia.dr2light using (source_id)
where 
  abs(phot_bp_mean_mag-phot_rp_mean_mag-col)>3*sig_col
  and m_v<18

– and that's a fairly general pattern for doing an initial outlier analysis on the the remote side. For HD, this takes a few seconds and yields 2722 rows (at least until we also push HDE into the table). That means you can keep 99% of the rows (the boring ones) on the server and can just pull the ones that could be interesting. These 99% savings aren't terribly much with a catalogue like the HD that's small by today's standards. For large catalogs, it's the difference between a download of a couple of minutes and pulling data for a day while frantically freeing disk space.

By the way, that there's only 2.7e3 outliers among 2.25e5 objects, while Annie Cannon, Williamina Fleming, Antonia Maury, Edward Pickering, and the rest of the crew not only had to come up with the spectral classification while working on the catalogue but also had to classify all these objects manually, this is an amazing feat even if all of those rows actually were misclassifications (which they certainly aren't) – the machine classifiers of today would be proud to only get 1% wrong.

The inset in the facsimile of Annie Cannons notebooks above shows how the outliers are distributed in color space relative to the full catalogue, where the basic catalogue is in blue and the outliers (scaled by 70) in red. Wouldn't it make a nice little side project to figure out the reason for the outlier clump on the red side of the histogram?


1I'll not hide that I was severely tempted to undo the mapping of object classes to – for HD – unrealistic magnitudes (20 .. 50) but then left the HD as it came from ADC; I still doubt that decision was well taken, and sure enough, the example query above already has insane constraints on m_v reflecting that encoding. From today's position, of course there should have been an extra column or, better yet, a different catalogue for nonstellar objects. Ah well. It's always hard to break unhealty patterns.

GAVO at the Northern Spring Interop

[IVOA 2002-2017]
15 Years of IVOA: The birthday cake our Shanghai hosts prepared for us.

Every half year, VO enthusiasts from all over the world gather for an “Interoperability conference”, or Interop for short. The latest such event, the Shanghai Interop 2017, ended Friday a week ago. It has been a “long” one again after the short southern spring Interop in Trieste last year (featured in this blog).

As usual, it was a week of many discussions and much consensus-building. In this post, I’d like to mention a few of the GAVO-related contributions; links typcially go to slides or lecture notes PDFs.

On the Registry side of things, we’re currently (among many other things) briding the gap between DOIs and the Registry in VOResource 1.1, and we invited registry providers to take up the new features, as well as proposing how to update RegTAP (which is used to actually query the Registry) to cope with the new metadata.

Also in Registry, our efforts of almost a decade to properly support registering tables and similar data collections bore fruit (Britain’s Mark Taylor reported on his experiences taking up our current proposal), and the fairly spectacular new Aladin V10 (presented by the CDS’ Pierre Fernique, who showed off what I’m tempted to call a “visual registry interface”) urgently needs what we’ve developed over the years.

We furthermore reported on new steps to finally let people search the registry using Space-Time constraints (spoiler: the tech is almost there, registry records need lots of work).

Spatial searches in the registries are one thing enabled by storing and searching for MOCs in relational databases, as was reported by Markus Nullmeier over in an Applications session. The setting may already tell you that these MOCs (Multi Order Coverages, a healpix-based way of representing fairly arbitrary areas on the sky) have applications far beyond Registry.

Also in Apps, Ole reported on progress in packaging VO applications for easy and reliable installation, in this case for Debian and derivatives. Finally for Apps, Margarida reported on getting lines and line lists into the spectral analysis package SPLAT: Implementation of SLAP and VAMDC interfaces in SPLAT-VO.

In the wider area of data access protocols and underlying data models, we contributed to Marco’s talk on the long-overdue facelifting for the VO’s bedrock, Simple Cone Search (Keeping SCS up-to-date within DAL landscape) – the fact that there’s an installed base of 15000 of such services may let you guess that we need to tread lightly here. On the bleeding-edge side of things, we presented our current ideas on how, eventually, several data models, data modelling as such and the annotation of data according to these data models might play together in publishing time domain data with DACHS (previously featured on this blog in a slightly less technical way).

We also talked about education and outreach. Hendrik reported on our ADQL course and how it helps future astronomers learn dealing efficiently with even very large datasets. Hendrik’s long-lasting dedication to these topics did not go unpunished at this interop: since the Exec meeting on the Interop Wednesday he is vice-chairing the education interest group of the IVOA. Back in the session I also mused a bit about what metadata changes are needed to make the VO tutorial collection VOTT more useful.

It is a particular pleasure for me to mention that the IVOA has a new interest group: “Solar System”. Regular readers of the blog will have noticed that I have a particularly soft spot in my heart for that crowd, and so I gave a short overview over how DaCHS is used among them, too.

And that’s just the official programme. Much more fixing, designing, and discussion went on between sessions or in the evenings. The latter, of course, included some decidedly less technical aspects. Including, as pictured above, a nice birthday cake for the IVOA, as it is now 15 year since the first Interop meeting in January 2002.

And the Solar System, too

Virtual Observatory technologies are increasingly being adopted outside of “core” astronomy in the vicinity of the optical band (to which they have had, I’ll have to admit, a certain slant) . An excellent example for that trend is the Europlanet community. Their goal is to make solar system data accessible without fiddling, and they are employing a wide range of VO standards for that. At the heart of their efforts are TAP and the VO Registry.

While the usual VO client software will of course work fine with their services, they are offering a nice web-based discovery tool executing queries against an increasing number of services. Such uniform quering over many services is possible is because all of them implement TAP and host EPNcore tables. The resulting interface, also known as EPN-TAP, allows for very flexible discovery and retrieval of solar system data products, much like ObsTAP does for astronomical observations outside of the solar system.

Since quite a few EPN-TAP services are built using GAVO’s DaCHS publication suite, I was invited to this week’s VESPA implementation workshop 2017 in Graz to help the data providers set up their services.

I can’t deny that I’m somewhat excited when I see how our software is used to publish spectra of the ice blocks in Saturn’s ring taken by the lonely Cassini spacecraft still orbiting the gas giant, or data transmitted by Rosetta, now (and for who knows how long) sitting on comet 67P/Churyumov-Gerasimenko. There’s even an upcoming archive of solar system alerts that may, according to its builders, include events like meteor showers on Mars. I can almost hear my code whisper “I’ve archived signals of C-beams glitter in the dark near the Tannhäuser Gate”.

Even documentation can become otherworldly in this business: Already in February, DaCHS has learnt to procude GeoJSON, a format common in the GIS community and also adopted by Europlanet – planetology has lots of common ground with geoinformatics. And in the reference documentation on annotating tables to enable that, when I wrote “standards-compliant GeoJSON clients will interpret your coordinates as WGS84 on Earth if you leave [frame annotation] out”, I was severely tempted to add “which is probably not what you want” and feel like Spaceman Spiff.

Romantic space adventures aside, after this intense week, not only are there several additional or improved EPN-TAP services from places ranging from Pasadena to Villafranca to Warsaw in the pipeline, the close interaction with the data providers has also led to very significant improvements to DaCHS’ EPN-TAP support. The tutorial chapter on EPN-TAP and the reference documentation linked from there already reflect the results of this workshop. You’ll need a current DaCHS beta package for that to work, though; we expect this stuff to go into our release packages around July.

If any of the workshop participants read this: Thanks a lot for your patience with DaCHS’ sometimes somewhat cryptic diagnostics. If, on the other hand, you missed the Graz workshop and have solar system data: Please talk to us or the kind and friendly Europlanet folks – either us will be delighted to support your publication project. And perhaps we’ll meet you at the next such workshop, planned for 2018 in the Czech Republic.

Updated Proper Motion Tutorial

At the risk of turning this into a blog on nice TAP tricks (which it’s not supposed to be): Our classic short tutorial on adding proper motions to almost arbitrary object lists has just gotten a facelift today.

And there’s new content, too – I now show what to do when you don’t even have positions but just object names. In order to keep this sufficiently geeky, here’s the query as a spoiler:

SELECT col1, ra, dec
FROM TAP_UPLOAD.t1
LEFT OUTER JOIN ident
ON (id=normId(col1))
LEFT OUTER JOIN basic
ON (oidref=oid)

But to close on a non-TAP topic: Registry! There’s an experimental facility to have this kind of thing in the Registry; the PM tutorial is in, for instance, with the ivoid ivo://edu.gavo.org/hd/gavo_addpms). One thing you can do with this is generate a list of registred documents that essentially updates itself from the registry.

Another is figure out where the source code of the document is (if the authors choose to share it, which is of course a very smart thing to do); in our example it’s in Volute, the IVOA’s semi-official version control system. So, if you find a bug (defined as “superset of typo”) in the linked document, you’re most welcome to supply patches as diffs or just directly fix things if you have commit privileges in Volute.