Query the Registry with WIRR

Search windows of VODesktop and WIRR
Pixels from venerable VODesktop and WIRR: it’s supposed to be about the same thing, except WIRR uses and exposes the latest Registry standards (and then some tech that’s not standard yet).

When the VO was young, there was a programme called VODesktop that had a very nice interface for searching the Registry. Also, it would run queries against the services discovered, giving nice all-VO querying that few modern clients do quite as elegantly. Regrettably, when the UK’s AstroGrid project was de-funded, VODesktop’s development ceased in 2010.

By 2012, it had become clear that nobody would step up to continue it, and I wanted to at least provide a replacement for the Registry interface part. In consequence, Florian Rothmaier and I wrote the Web Interface to the Relational Registry, or WIRR for short; this lets you build Registry queries in your web browser in an interface inspired by VODesktop (which, I’m told, in turn was inspired by early iTunes).

WIRR’s sweet spot is between the Registry interfaces in the usual clients (TOPCAT, Aladin: these try to hide the gory details of where their service lists come from and hence are limited in what interaction they allow) and using a TAP client to write and execute RegTAP queries (where there are no limitations beyond the protocol’s, but it’s tedious unless you happen to know the RegTAP standard by heart).

In contrast to its model VODesktop, WIRR cannot run any queries against the services discovered using it. But you can transfer the services you have found to clients via SAMP (TOPCAT can handle the relevant MTypes, but I’m frankly not sure what else can). Apart from that, an obvious use for WIRR is the sort of query one needs in VO curation. For instance, I keep linking to it when sending people canned Registry queries, as in the section on claiming an authority in the DaCHS Tutorial.

Given that both Javascript and the Registry have evolved a lot in the past decade, WIRR had been in need of a major redecoration for quite some time, and in early July, I found some time to do it. The central result is that the code is now halfway modern, strict Javascript; let’s see how many web browsers are still around that can’t execute this.

On the surface, much less has changed, but there is some news I’d consider noteworthy and that might help your data discovery-fu:

  • Since I’ve added some constraint types, the constraint type selector is now a hierarchical box, sporting what I think are or should be the most common constraint types (full text, service type and UAT term) on level 0 and then having “Blind Discovery“, “Finer Grained“, and “Special Effects“ as pop-ups; all this so we obey Miller’s Rule of Seven.
  • Rather than explaining the constraints on a second, separate page, there are now brief help texts coming with each constraint.
  • You can now match against UAT concepts, and there is a completing input box for them; in case you’re wondering what this is about, see this post from last February. And yes, next time I play with WIRR I’ll probably include SemBaReBro here.
  • When constraining by column UCD, you can now choose from UCDs found in the registry (the “Pick one“ button).
  • You can now constrain by spatial, temporal, and spectral coverage, though that’s still a gamble because not many (or, actually, very few in the case of temporal and spectral) operators care to declare their services’ coverage. When they don’t, you won’t see their resources with such blind discovery constraints. For some background on this, check Space and Time not lost on the Registry on this blog.
  • There is now a “SQL” button with successful searches that lets you retrieve the SQL executed for the particular constraint. While that query does not immediately execute on RegTAP services (it’s Postgres’ SQL rather than ADQL), it ought to give you a head start when transplanting your Registry query into, say, a pyVO-based script (see the sketch after this list).
  • You can now use your browser’s back and forward buttons (or, in my case, key bindings) to navigate in your query history.
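
Since I have just mentioned pyVO scripts: a typical WIRR-style search – a full-text constraint plus a service type – also translates into fairly compact, portable RegTAP ADQL that you can run from a script. Here is a minimal sketch; the search term is made up, and I am running it against GAVO’s RegTAP endpoint (any other searchable registry should work just as well):

import pyvo

regtap = pyvo.dal.TAPService("http://reg.g-vo.org/tap")
# all TAP services whose description mentions supernovae
matches = regtap.run_sync("""
    SELECT ivoid, res_title, access_url
    FROM rr.resource
      NATURAL JOIN rr.capability
      NATURAL JOIN rr.interface
    WHERE standard_id LIKE 'ivo://ivoa.net/std/tap%'
      AND 1=ivo_hasword(res_description, 'supernova')
    """)
for match in matches:
    print(match["res_title"], match["access_url"])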

What this still doesn’t do: Work without Javascript. That’s a bit of a disgrace, since after the last changes it would actually be reasonable to provide non-javascript fallbacks for some of the basic functionality (of course, no SAMP at all then…). I’ll do it the first time someone asks. Promised.

A document that now needs at least slight updates because things have moved about a bit is the data discovery use case Florian wrote back then. The updates absolutely necessary are not terribly involved, but I would like to use the opportunity to add a bit more spice to the tutorial. If you have ideas: I’m all ears.

Oh, and before I close: you can still run VODesktop; kudos to the maintainers of the JVM for that. But it’s nevertheless not really usable any more, which perhaps isn’t too surprising for a client built on top of experimental online services ten years ago. For one, its TAP client speaks pre-release versions of both TAP and ADQL, so those won’t work on modern TAP services (and the ancient ones have vanished). Worse, it needed to use a non-standard extension of RegTAP’s predecessor (for those old enough to remember: it used XQuery), and none of the modern searchable registries understands that any more.

Which is a pity, really. It’s been a fine programme. It just was a few years early: by 2012, everything it needed had been defined in nice, stable standards that are still around and probably will be for another decade at least.

Tangible Astronomy and Movies with TOPCAT

This March, I’ve put up two new VO resources (that’s jargon for “table or service or whatever”) that, I think, fit quite well what I like to call tangible astronomy: things you can readily relate to what you see when you step out at night. And, since I’m a professing astronomy nerd, that’s always nicely gratifying.

The two resources are the Constellations as Polygons and the Gaia eDR3 catalogue of nearby stars (GCNS).

Constellations

On the constellations, you might rightfully say that’s really far from science. But then they do help getting an idea where something is, and when and from where you might see something. I’ve hence wanted for a long time to re-publish the Davenhall Constellation Boundary Data as proper, ADQL-queriable polygons, and figuring out where the loneliest star in the sky (and Voyager 1) were finally made me do it.

GCNS density around Taurus.
Taurus in the GCNS density plot: with constellations!

So, since early March there’s the cstl.geo table on the TAP service at https://dc.g-vo.org/tap with the constellation polygons in its p column. Which, for starters, means it’s trivial to overplot constellation boundaries in your favourite VO clients now, as in the plot above. To make it, I’ve just done a boring SELECT * FROM cstl.geo, did the background (a plain HEALPix density plot of GCNS), and clicked Layers → Add Area Control and selected the cstl.geo table.

If you want to identify constellations by clicking, while in the area control, choose “add central” from the Forms menu in the Form tab; that’s what I did in the figure above to ensure that what we’re looking at here is the Hyades and hence Taurus. Admittedly: these “centres“ are – as in the catalogue – just the means of the vertices rather than the centres of mass of the polygon (which are hard to compute). Oh, and: there is also the AreaLabel in the Forms menu, for when you need the identification more than the table highlighting (be sure to use a center anchor here).

Note that TOPCAT’s polygon plot is, at this point, not really geared towards large polygons (which the constellations are). At the time of writing, the documentation has: “Areas specified in this way are generally intended for displaying relatively small shapes such as instrument footprints. Larger areas may also be specified, but there may be issues with use.” That you’ll see at the edges of the sky plots – but keeping that in mind I’d say this is a fun and potentially very useful feature.

What’s a bit worse: You cannot turn the constellation polygons into MOCs yet, because the MOC library currently running within our database will not touch non-convex polygons. We’re working on getting that fixed.

Nearby Stars

Similarly tangible in my book is the GCNS: nearby stars I always find romantic.

Let’s look at the 100 nearest stars, and let’s add spectral types from Henry Draper (cf. my post on Annie Cannon’s catalogue) as well as the constellation name:

WITH nearest AS (
SELECT TOP 100 
  a.source_id, 
  a.ra, a.dec,
  phot_g_mean_mag,
  dist_50,
  spectral
FROM gcns.main AS a
LEFT OUTER JOIN hdgaia.main AS b 
  ON (b.source_id_dr3=a.source_id)
ORDER BY dist_50 ASC)
SELECT nearest.*, name
FROM nearest
JOIN cstl.geo AS g
  ON (1=CONTAINS(
    POINT(nearest.ra, nearest.dec), 
    p))

Note how I’m using CONTAINS with the polygon in the constellations table here; that’s the usage I’ve had in mind for this table (and it’s particularly handy with table uploads).
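
In pyVO, that upload pattern could look like the following sketch: a little table of positions for which you want the constellation names (the two example positions are arbitrary, and any astropy table will do for the upload):

import pyvo
from astropy.table import Table

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# a few positions (ICRS degrees) we would like constellation names for
mypos = Table(rows=[(66.75, 16.5), (279.23, 38.78)], names=("ra", "dec"))
result = svc.run_sync("""
    SELECT mine.ra, mine.dec, g.name
    FROM TAP_UPLOAD.mypos AS mine
    JOIN cstl.geo AS g
      ON 1=CONTAINS(POINT(mine.ra, mine.dec), g.p)
    """, uploads={"mypos": mypos})
print(result.to_table())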

That I have a Common Table Expression (“WITH”) here is due to SQL planner confusion (I’ll post something about that real soon now): with the WITH, the machine first selects the nearest 100 rows and then does the (relatively costly) spatial match; without it, the machine (somewhat surprisingly) did the geometric match first. This particular confusion looks fixable, but for now I’d ask you for forgiveness for the hack – and the technique is often useful anyway.

If you inspect the result, you will notice that Proxima Cen is right there, but α Cen is missing; without having properly investigated matters, I’d say it’s just too bright for the current Gaia data reduction (and quite possibly even for future Gaia analysis).

Most of the objects on that list that have made it into the HD (i.e., have a spectral type here) are K dwarfs – which is an interesting conspiracy between the limits of the HD (the late red and old white dwarfs are too faint for it) and the limits of Gaia (the few earlier stars within 6 parsec – which include such luminaries as Sirius at a bit more than 2.5 pc – are just too bright for where Gaia data reduction is now).

Animation

Another fairly tangible thing in the GCNS is the space velocity, given in km/s in the three dimensions U, V, and W. That is, of course, an invitation to look for stellar streams, as, within the relatively small portion of the Milky Way the GCNS looks at, stars on similar orbits will exhibit similar space motions.

Considering that the velocity dispersion within a stellar stream will be a few km/s, let’s have the database bin the data. Even though this data is small enough to conveniently handle locally, this kind of remote analysis is half of what TAP is really great at (the other half being the ability to just jump right into a new dataset). You can group by multiple things at the same time:

SELECT 
  COUNT(*) AS n,
  ROUND(uvel_50/5)*5 AS ubin,
  ROUND(vvel_50/5)*5 AS vbin,
  ROUND(wvel_50/5)*5 AS wbin
FROM gcns.main
GROUP BY ubin, vbin, wbin

Note that this (truly) 3D histogram only represents a small minority of the GCNS objects – you need radial velocities for space motion, and these are precious even in the Gaia age.
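
If you would like to know just how small that minority is, a quick count of the rows that actually have velocity components will tell you; here is a little pyVO sketch (COUNT(uvel_50) only counts rows in which uvel_50 is non-NULL):

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
print(svc.run_sync(
    "SELECT COUNT(*) AS n_total, COUNT(uvel_50) AS n_with_uvw"
    " FROM gcns.main").to_table())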

What really surprised me is how clumpy this distribution is – are we sure we already know all stellar streams in the solar neighbourhood? Watch for yourself (if your browser can’t play webm, complain to your vendor):

[Update (2021-04-01): Mark Taylor points out that the “flashes” you sometimes see when the grid is aligned with the viewing axes (and the general appearance) could be improved by just pulling all non-NULL UVW values out of the table and using a density plot (perhaps shading=density densemap=inferno densefunc=linear). That is quite certainly true, but it would of course defeat the purpose of having on-server aggregation. Which, again, isn’t all that critical for this dataset, so doing the prettier plot actually is a valuable exercise for the reader]

How did I make this video? Well, I started with a Cube Plot in TOPCAT as usual, configuring weighted plotting with n as its weight and played around a bit with scaling out a few outliers. And then I saved the table (to zw.vot), hit “STILTS“ in the plot window and saved the text from there to a text file, zw.sh. I had to change the “in“ clause in the script to make it look like this:

  #!/bin/sh
  stilts plot2cube \
   xpix=887 ypix=431 \
   xlabel='ubin / km/s' ylabel='vbin / km/s' \
   zlabel='wbin / km/s' \
   xmin=-184.5 xmax=49.5 ymin=-77.6 ymax=57.6 \
   zmin=-119.1 zmax=94.1 phi=-84.27 theta=90.35 \
    psi=-62.21 \
   auxmin=1 auxmax=53.6 \
   auxvisible=true auxlabel=n \
   legend=true \
   layer=Mark \
      in=zw.vot \
      x=ubin y=vbin z=wbin weight=n \
      shading=weighted size=2 color=blue

– and presto, “sh zw.sh“ would produce the plot I just had in TOPCAT. This makes a difference because now I can animate this.

In his documentation, Mark already has a few hints on how to build animations; here are a few more ideas on how to organise this. For instance, if, as I want here, you want to animate more than one variable, stilts tloop may become a bit unwieldy. Here’s how to give the camera angles in python:

import sys
from astropy import table
import numpy

angles = numpy.array(
  [float(a) for a in range(0, 360)])
table.Table([
    angles,
    40+30*numpy.cos((angles+57)*numpy.pi/180)],
  names=("psi", "theta")).write(
    sys.stdout, format="votable")

– the only thing to watch out for is that the names match the names of the arguments in stilts that you want to animate (and yes, the creation of angles will make numpy aficionados shudder – but I wasn’t sure if I might want to have somewhat more complex logic there).

[Update (2021-04-01): Mark Taylor points out that all that Python could simply be replaced with a straightforward piece of stilts using the new loop table scheme in stilts, where you would simply put

  animate=:loop:0,360,0.5
  acmd='addcol phi $1'
  acmd='addcol theta 40+30*cosDeg($1+57)'

into the plot2cube command line – and you wouldn’t even need the shell pipeline.]

What’s left to do is basically the shell script that TOPCAT wrote for me above. In the script below I’m using a little convenience hack to let me quickly switch between screen output and file output: I’m defining a shell variable OUTPUT, and when I un-comment the second OUTPUT, stilts renders to the screen. The other changes versus what TOPCAT gave me are de-dented (and I’ve deleted the theta and psi parameters from the command line, as I’m now filling them from the little python script):

OUTPUT="omode=out out=pre-movie.png"
#OUTPUT=omode=swing

python3 camera.py |\
stilts plot2cube \
   xpix=500 ypix=500 \
   xlabel='ubin / km/s' ylabel='vbin / km/s' \
   zlabel='wbin / km/s' \
   xmin=-184.5 xmax=49.5 ymin=-77.6 ymax=57.6 \
   zmin=-119.1 zmax=94.1 \
   auxmin=1 auxmax=53.6 \
phi=8 \
animate=- \
afmt=votable \
$OUTPUT \
   layer=Mark \
      in=zw.vot \
      x=ubin y=vbin z=wbin weight=n \
      shading=weighted size=4 color=blue

# render to movie with something like
# ffmpeg -i "pre-movie-%03d.png" -framerate 15 -pix_fmt yuv420p stream-movie.webm
# (the yuv420p incantation is so real-world
# web browsers will not go psychedelic
# with the colours)

The comment at the end says how to make a proper movie out of the PNGs this produces, using ffmpeg (packaged with every self-respecting distribution these days) and yielding a webm. Yes, going for mpeg x264 might be a lot faster for you as it’s a lot more likely to have hardware support, but everything around mpeg is so patent-infested that for the sake of your first-born’s soul you probably should steer clear of it.

Movies are fun in webm, too.

The Loneliest Star in the Sky

sky images and a distribution plot
The loneliest star in the sky on the left, and on the right a somewhat lonelier one (it’s explained in the text). The inset shows the distribution of the 500 loneliest stars on the whole sky in Galactic coordinates.

In early December, the object catalogue of Gaia’s early data release 3 (“eDR3”) was published, and I’ve been busy in various ways with this data off and on since then – see, for instance, The Case of the disappearing bits on this blog.

One of the things I have missed when advising people on projects with previous Gaia data releases is a table that, for every object, gives the nearest neighbour. And so for this release I’ve created it and christened it, perhaps just a bit over-grandiosely, “Gaia eDR3 Autocorrelation”. Technically, it is just a long (1811709771 rows, to be precise) list of Gaia eDR3 source ids, each paired with the id of its nearest neighbour and the spherical distance between the two.

This kind of data is useful for many applications, mostly when looking for objects that are close together or (more often) things that fail for such close pairs for a wide variety of reasons. I have taken some pains to not only have close neighbours, though, because sometimes you may want specifically objects far away from others.

As in the case of this article’s featured image: The loneliest star in the sky (as seen by Gaia, that is) is eDR3 6049144983226879232, which is 4.3 arcminutes from its neighbour, 6049144021153793024, which in turn is the second-loneliest star in the sky. They are, perhaps a bit surprisingly, in Ophiuchus (and thus fairly close to the Milky Way plane), and (probably) only about 150 parsec from Earth. Doesn’t sound too lonely, hm? Turns out: these stars are lonely because dust clouds blot out all their neighbours.

Rank three is in another dust cloud, this time in Taurus, and so it continues at low Galactic latitude until rank 8 (4402975278134691456) at a Galactic latitude of 36.79 degrees; visualising the thing, it turns out it’s again in a dark cloud. What about rank 23 at 83.92 degrees Galactic latitude (3954600105683842048)? That one is probably bona fide, or at least it doesn’t look very dusty in either DSS or PanSTARRS. Coryn (see below) estimates it’s about 1100 parsec away. More than 1 kpc above the galactic disk: that’s more like what I had expected for lonely stars.

Looking at the whole distribution of the 500 loneliest stars (inset above), things return a bit more to what I had expected: Most of them are around the galactic poles, where the stellar density is low.

So: How did I find these objects? Here’s the ADQL query I’ve used:

SELECT TOP 500 
  ra, dec, source_id, phot_g_mean_mag, ruwe,
  r_med_photogeo,
  partner_id, dist, 
  COORD2(gavo_transform('ICRS', 'GALACTIC', 
    point(ra, dec))) AS glat
FROM
  gedr3dist.litewithdist
  NATURAL JOIN gedr3auto.main
ORDER BY dist DESC

– run this on the TAP server at http://dc.g-vo.org/tap (don’t be shy, it’s a cheap query).

Most of this should be familiar to you if you’ve worked through the first pages of the ADQL course. There are two ADQL things I’d like to advertise while I have your attention:

  1. NATURAL JOIN is like a JOIN USING, except that the database auto-selects what column(s) to join on by matching the columns that have the same name. This is a convenient way to join tables designed to be joined (as they are here). And it probably won’t work at all if the tables haven’t been designed for that.
  2. The messy stuff with GALACTIC in it. Coordinate transformations had a bad start in ADQL; the original designers hoped they could hide much of this, but it’s rarely a good idea in science tools to hide complexity essentially everyone has to deal with. To get back on track in this field, DaCHS servers since about version 1.4 have been offering a user defined function gavo_transform that can transform (within reason) between a number of popular reference frames. You will find more on it in the server’s capabilities (in TOPCAT: the “service” tab). What is happening in the query is: I’m making a Point out of the RA and Dec given in the catalogue, telling the transform function it’s in ICRS, asking it to make Galactic coordinates from it, and then taking the second element of the result: the latitude.

And what about the gedr3dist.litewithdist table? That doesn’t look a lot like the gaiaedr3.gaiasource we’re supposed to query for eDR3?

Well, as for DR2, I’m again only carrying a “lite” version of the Gaia catalogue in GAVO’s Heidelberg data center, stripped down to the columns you absolutely cannot live without even for the most gung-ho science; it’s called gaia.edr3lite.

But then my impression is that almost everyone wants distances and then hacks something to make Gaia’s parallaxes work for them. That’s a bad idea once the parallax SNR goes down to levels that are very common in the Gaia result catalogue (see 2020arXiv201205220B if you don’t take my word for it). Hence, I’m offering a pre-joined view (a virtual table, if you will) with the carefully estimated distances from Coryn Bailer-Jones, and that’s this gedr3dist.litewithdist. Whenever you’re doing something with eDR3 and distances, this is where I’d point you first.

Oh, and I should be mentioning that, of course, I figured out what is in dust clouds and what is not with TOPCAT and Aladin as in our tutorial TOPCAT and Aladin working together (which needs a bit of an update, but you’ll figure it out).

There’s a lot more fun to be had with this (depending on what you find fun in). What about finding the 10 arcsec-pairs with the least different luminosities (which might actually be useful for testing some optics)? Try this:

SELECT TOP 300 
  a.source_id, partner_id, dist, 
  a.phot_g_mean_mag AS source_mag,
  b.phot_g_mean_mag AS partner_mag,
  abs(a.phot_g_mean_mag-b.phot_g_mean_mag) AS magdiff
FROM gedr3auto.main
  NATURAL JOIN gaia.edr3lite AS a
  JOIN gaia.edr3lite AS b
    ON (partner_id=b.source_id)
WHERE 
  dist BETWEEN 9.999/3600 AND 10.001/3600
  AND a.phot_g_mean_mag IS NOT NULL
  AND b.phot_g_mean_mag IS NOT NULL
ORDER BY magdiff ASC

– this one takes a bit longer, as there are many 10 arcsec-pairs in eDR3; the query above looks at 84690 of them. Of course, this only returns really faint pairs, and given the errors stars that faint have, they are probably not as equal in luminosity as the magnitude difference suggests. But fixing all that is left as an exercise to the reader. Given there are the RP and BP magnitude columns, what about looking for the most colourful pair with a given separation?

Acknowledgement: I couldn’t have coolly mumbled about Ophiuchus or Taurus without the SCS service ivo://cds.vizier/vi/42 (“Identification of a Constellation From Position”, Roman 1982).

Update [2021-02-05]: I discovered an extra twist to this story: Voyager 1 is currently flying towards Ophiuchus (or so Wikipedia claims). With an industrial-size package of artistic licence you could say: it’s coming to keep the loneliest star company. But of course: by the time Voyager will be 150 pc from Earth, eDR3 6049144983226879232 will quite certainly have left Ophiuchus (and Voyager will be in a completely different part of our sky that wouldn’t look familiar to us at all) – so, I’m afraid that apart from a nice coincidence in this very moment (galactically speaking), this whole thing won’t be Hollywood material.

The Bochum Galactic Disk Survey

[Image: Patches of higher perceived variability on the Sky]
Fig 1: How our haphazard variability ratio varies over the sky (galactic coordinates). And yes, it’s clear that this isn’t dominated by physical variability.

About a year ago, I reported on a workshop on “Large Surveys with Small Telescopes” in Bamberg; at around the same time, I published an example for those, the Bochum Galactic Disk Survey BGDS, which used a twin 15 cm robotic telescope in some no longer forsaken place in the Andes mountains to monitor the brighter stars in the southern Milky Way. While some tables from an early phase of the survey have been on VizieR for a while, we now publish the source images (also in SIAP and Obscore), the mean photometry (via SCS and TAP) and, perhaps most fun of all, the lightcurves (via SSAP and TAP) – a whopping 35 million of the latter.

This means that in tools like Aladin, you can now find such light curves (and images in two bands from a lot of epochs) when you are in the survey’s coverage, and you can run TAP queries on GAVO’s http://dc.g-vo.org/tap server against the full photometry table and the time series.

Regular readers of this blog will not be surprised to see me use this as an excuse to show off a bit of ADQL trickery.

If you have a look at the bgds.phot_all table in your favourite TAP client, you’ll see that it has a column amp, giving the difference between the highest and lowest magnitude. The trouble is that amp for almost all objects just reflects the measurement error rather than any intrinsic variability. To get an idea what’s “normal” (based on the fact that essentially all stars have essentially constant luminosity on the range and resolution scales considered here), run a query like

SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n
FROM bgds.phot_all
WHERE nobs>10
GROUP BY bin

As this scans the entire 75 million rows of the table, you will probably have to use async mode to run this.
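
If you would rather have a script do the waiting for you, pyVO’s asynchronous interface works nicely here; this sketch just wraps the query from above and saves the result for later (the file name is, of course, arbitrary):

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# run_async submits a UWS job, waits for it to finish and fetches the result
hist = svc.run_async(
    "SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n"
    " FROM bgds.phot_all"
    " WHERE nobs>10"
    " GROUP BY bin")
hist.to_table().write("amp_hist.vot", format="votable")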

[image: distribution of amplitude/mag error]
Figure 2: The distribution of amplitude over magnitude error for all BGDS objects with nobs>10 (blue) and the subset with a mean magnitude brighter than 15 (blue).

When it comes back, you will have, for objects where any sort of statistics make sense at all (hence nobs>10), a histogram (of sorts) of the amplitude in units of upstream’s magnitude error estimation. If you log-log-plot this, you’ll see something like Figure 2. The curve at least tells you that the magnitude error estimate is not very far off – the peak at about 3 “sigma” is not unreasonable since about half of the objects have nobs of the order of a hundred and thus would likely contain outliers that far out assuming roughly Gaussian errors.
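
In case you would like to reproduce that log-log plot outside of TOPCAT, a few lines of matplotlib will do; I am assuming here that you have the bin/n result saved as amp_hist.vot (for instance, from the pyVO sketch above):

import matplotlib.pyplot as plt
from astropy.table import Table

hist = Table.read("amp_hist.vot", format="votable")
hist = hist[hist["bin"] > 0]   # the zero bin cannot show up on a log axis
plt.loglog(hist["bin"], hist["n"], ".")
plt.xlabel("amp/err_mag")
plt.ylabel("number of objects")
plt.show()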

And if you’re doing a rough cutoff at amp/magerr>10, you will get perhaps not necessarily true variables, but, at least potentially interesting objects.

Let’s use this insight to see if we spot any pattern in the distribution of these interesting objects. We’ll use the HEALPix technique I’ve discussed three years ago in this blog, but with a little twist from ADQL 2.1: The Common Table Expressions or CTEs I have already mentioned in my blog post on ADQL 2.1 and then advertised in the piece on the Henry Draper catalogue. The brief idea, again, is that you can write queries and give them a name that you can use elsewhere in the query as if it were an actual table. It’s not much different from normal subqueries, but you can re-use CTEs in multiple places in the query (hence the “common”), and it’s usually more readable.

Here, we first create a version of the photometry table that contains HEALPixes and our variability measure, use that to compute two unsophisticated per-HEALPix statistics and eventually join these two to our observable, the ratio of suspected variables to all stars observed (the multiplication with 1.0 is a cheap way to make a float out of a value, which is necessary here because a/b does integer division in ADQL if a and b are both integers):

WITH photpoints AS (
  SELECT 
    amp/err_mag AS redamp,
    amp,
    ivo_healpix_index(5, ra, dec) AS hpx
  FROM bgds.phot_all
  WHERE 
    nobs>10
    AND band_name='SDSS i'
    AND mean_mag<16),
all_objs AS (
  SELECT count(*) AS ct,
    hpx
    FROM photpoints GROUP BY hpx),
strong_var AS (
  SELECT COUNT(*) AS ct,
    hpx
    FROM photpoints
    WHERE redamp>4 AND amp>1 GROUP BY hpx)
SELECT
  strong_var.ct/(1.0*all_objs.ct) AS obs,
  all_objs.ct AS n,
  hpx
FROM strong_var JOIN all_objs USING (hpx)
WHERE all_objs.ct>20

If you plot this using TOPCAT’s HEALPix thingy and ask it to use Galactic coordinates, you’ll end up with something like Figure 1.

There clearly is some structure, but given that the variables ratio reaches up to 0.2, this is still reflecting instrumental or pipeline effects and thus earthly causes rather than astrophysics. And that’s going beyond what I’d like to talk about on a VO blog, although I’ll take any bet that you will see significant structure in the spatial distribution of the variability ratio at about any magnitude cutoff, since there are a lot of different population mixtures in the survey’s footprint.

Be that as it may, let’s have a quick look at the time series. As with the short spectra from the Byurakan use case, we’ve stored the actual time series as arrays in the database (the mjd and mags columns in bgds.ssa_time_series). Unfortunately, since they are a lot less array-like than homogeneous spectra, it’s also a lot harder to do interesting things with them without downloading them (I’m grateful for ideas for ADQL functions that would let you do in-DB analysis for such things). Still, you can at least easily download them in bulk and then process them in, say, python to your heart’s content. The Byurakan use case should give you a head start there.
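
To make the bulk download part concrete, here is a little pyVO sketch that pulls a handful of light curves and just reports how many points each has; from the arrays you get back, the rest is ordinary numpy:

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# mjd and mags are array-valued columns; each row is one light curve
lcs = svc.run_sync("SELECT TOP 5 mjd, mags FROM bgds.ssa_time_series")
for row in lcs:
    print(len(row["mjd"]), "epochs, mags between",
        row["mags"].min(), "and", row["mags"].max())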

For a quick demo, I couldn’t resist checking out objects that Simbad classifies as possible long-period variables (you see, as I write this, the public brouhaha over Betelgeuse’s brief waning is just dying down), and so I queried Simbad for:

SELECT ra, dec, main_id
FROM basic
WHERE
  otype='LP?'
  AND 1=CONTAINS(
     POINT('', ra, dec),
     POLYGON('', 127, -30, 112, -30, 272, -30, 258, -30))

(as of this writing, Simbad still needs the ADQL 2.0-compliant first arguments to POINT and POLYGON), where the POLYGON is intended to give the survey’s footprint. I obtained that by reading off the coordinates of the corners in my Figure 1 while it was still in TOPCAT. Oh, and I had to shrink it a bit because Simbad (well, the underlying Postgres server, and, more precisely, its pg_sphere extension) doesn’t want polygons with edges longer than π. This will soon become less pedestrian: MOCs in relational databases are coming; more on this soon.

[TOPCAT action shot with a light curve display]
Fig 3: V566 Pup’s BGDS lightcurve in a TOPCAT configured to auto-plot the light curves associated with a row from the bgds.ssa_time_series table on the GAVO DC TAP service.

If you now do the usual spiel with an upload crossmatch to the bgds.ssa_time_series table and check “Plot Table” in Views/Activation Action, you can quickly page through the light curves (TOPCAT will keep the plot style as you go from dataset to dataset, so it’s worth configuring the lines and the error bars). Which could bring you to something like Fig. 3; and that would suggest that V* V566 Pup isn’t really long-period unless the errors are grossly off.

LAMOST5 meets Datalink

One of the busiest spectral survey instruments operated right now is the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST). And its data is in the VO, more or less: DR2 and DR3 have been brought into the VO by our Czech colleagues, but since they currently lack the resources to update their services to the latest releases, they have kindly given me their DaCHS resource descriptor, and so I had a head start for publishing DR5 in Heidelberg.

With some minor updates, here it is now: over nine million medium-resolution spectra covering large parts of the northern sky – the spatial coverage is like this:

[Coverage Healpix map]

There’s lots of fun to be had with this; of course, there’s an SSA service, so when you point Aladin or Splat at some part of the covered sky and look for spectra, chances are you’ll see LAMOST spectra, and when working on some of our tutorials (this one, for example), it happened that LAMOST actually had what I was looking for when writing them.

But I’d like to use the opportunity to mention two other modes of accessing the data.

Tablesample and TOPCAT’s Plot Table activation action

Say you’d like to look at spectra of M stars and would like to have some sample from across the sky. Fire up TOPCAT, point its TAP client to the GAVO DC TAP service (http://dc.g-vo.org/tap) and run something like

select 
  ssa_pubDID, accref, raj2000, dej2000, ssa_targsubclass
from lamost5.data tablesample(1)
where 
  ssa_targsubclass like 'M%' 

Image: stacked spectra

This is using the TABLESAMPLE modifier in the from clause, which isn’t standard ADQL yet. As mentioned in the DaCHS 1.4 announcement, DaCHS has a prototype implementation of what’s been discussed on the IVOA’s DAL mailing list: pick a part of a table rather than the full one. It takes a percentage as an argument, and tells the server to choose about this percentage of the table’s records using a reasonable and fast heuristic. Note that this won’t give you perfect statistical sampling, but if it’s not “good enough” for some purpose, I’d like to learn about that purpose.

Drawing a proper statistical sample, on the other hand, would take minutes on the GAVO database server – with tablesample, I had the roughly 6000 spectra the above query returns essentially instantaneously, and from eyeballing a sky plot of them, I’d say their distribution is close enough to that of the full DR5. So: tablesample is your friend.

For a quick look at the spectra themselves, in TOPCAT click Views/Activation Actions, check “Plot Table” and make sure TOPCAT proposes the accref column as “Table Location” (if you don’t see these items, update your TOPCAT – it’s worth it). Now click on a row or perhaps a dot on a plot and behold an M spectrum.

Cutouts via Datalink

LAMOST releases spectra in FITS format pretty much like the ones you may know from SDSS. The trick above works because we instead hand out proper, IVOA Spectral Data Model-compliant spectra through SSA and TAP. However, if you need to go back to the original files, you can, using Datalink. If you’re unsure what this Datalink thing is: call me vain, but I still like my 2015 ADASS poster explaining that. In TOPCAT, you’d be using the “Invoke Service” activation action to get to the datalinks.

If you have actual work to do, offloading repetitive work to the computer is what you want, and fortunately, pyVO knows about datalink, too. I grant you this is hard to discover so far, and the interface is… a tiny bit clunky. Until some kind soul cleans up the pyVO datalink act, a poster Stefan and I showed at the 2017 ADASS might give you an idea which buttons to press. Or read on and see how things work for LAMOST5.

The shortest way to datalinks is a TAP query that at least retrieves the ssa_pubdid column (that’s a must; Datalink can’t work without it) and, on the result, run the iter_datalinks method. This returns an object in which you can find the associated data items (in this case, a preview and the original FITS with the #progenitor semantics), plus the cutout service.

Hence, a minimal example for pulling the legacy FITS links out of the first three items in lamost5.data would look like this:

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
for dl in svc.run_sync("select top 3 ssa_pubdid"
        " from lamost5.data").iter_datalinks():
    print(next(dl.bysemantics("#progenitor")
        )["access_url"].decode("ascii"))

This is a bit different from listing 2 in the poster linked above because it’s python3, so getting the first element from an iterator looks a bit different, and (curse astropy.votable for returning VOTable chars as bytes rather than strings!) you’ll want to turn the URL into a proper string manually.

Another, actually more interesting, thing you can do with Datalink is cut out regions of interest. The LAMOST spectra are fairly long (though of course still small by image standards), so if you’re only interested in a single line, you can save a bit of storage and bandwidth over blindly pulling the whole thing.

For instance, if you wanted to pull the vicinity of the H and K Fraunhofer lines from the matches in the loop in the snippet above, you could say:

from astropy import units as u
proc = next(dl.iter_procs())
cutout = proc.processed(band=(392*u.nm,398*u.nm))

And this is what I’ve done for the decorative left border above: it’s the H and K line profiles for 0.1% of the stars LAMOST has classified as G8. Building the image didn’t take more than a few seconds (where I’d like the cutouts to be faster by a factor of 10; I guess that’s about an afternoon of work for me, so if it’d save you more than that afternoon, poke me to do it).

What’s coming back is tables. By the time python has digested these, they’re numpy record arrays. Thus, you can immediately bring in your beloved scipy (or whatever). For instance, if for some reason you’re convinced that the H and K lines should be fit by identical Gaussians in the boring case and would like to find objects for which that’s patently untrue and that hence could be un-boring, here’s how you could do that:

import numpy
from scipy.optimize import curve_fit

def spectral_model(wl, c1, c2, depth, width):
    return (1
        -depth*numpy.exp(-numpy.square(wl-c1)
            /numpy.square(width))
        -depth*numpy.exp(-numpy.square(wl-c2)
            /numpy.square(width)))

for pubdid, prof in get_profiles(
        "G8", (392*u.nm,398*u.nm), 0.01, 4):
    prof["flux"] /= max(prof["flux"])
    popt, pcov = curve_fit(
        spectral_model, prof["spectral"], prof["flux"],
        sigma=prof["flux_error"],
        p0=[3968, 3934, 1, 1])
    if pcov[3][3]>1:
        break

– where get_profiles is essentially doing the TAP plus datalink routine above, except I’m swallowing spectra with too much noise and I have the function transform the spectral coordinate into the objects’ rest frames. If you’re curious how I’m doing this just based on the IVOA Spectral Data Model, check the source linked at the bottom of this post.

I’ve just run this, and the first spectrum that the machinery flagged as suspicious was this:

Image: A fairly boring late G spectrum

– which doesn’t look like I’ve made a discovery just yet. But that doesn’t mean there’s not a lot to find within LAMOST5’s lines…

To get you up to speed quickly: here’s the actual python3 code I ran for the “analysis” and the plot.

From Byurakan to L2: Short Spectra

multiband images of carbon stars
A snapshot from the DFBS tutorial: Carbon Stars in different spectral bands.

On June 30, a small project we did together with the Armenian Virtual Observatory came to an end. Its objective was to publish the spectra from the First Byurakan Survey (the DFBS) in a VO-compliant way. The data comes from one of the big surveys with Schmidt telescopes that form a sizable part of the observational heritage from the second half of the 20th century (you’re still using a few of them daily if you tell Aladin to show a DSS plane).

In this case, spectra of objects over the entire northern sky away from the Milky Way, down to about 18th mag, were obtained. In a previous cooperation between Armenian and Italian astronomers a good decade ago, the plates were digitised and calibrated, and spectra were extracted. So far, however, they resided behind a web interface, which made them somewhat clumsy to work with.

Now, they’re in the VO, and to give you a few ideas for what kind of things you can do with this kind of data, within the project we’ve also written the tutorial “Outlier Analysis in Low-Resolution Spectra”.

Have a glance at the tutorial – you see, while the Byurakan survey certainly is a valuable resource by itself, I happen to believe at this point it’s particularly valuable because with the next Gaia data release (planned for next year), a deluxe version of it will come: Gaia’s RP/BP spectra will be all-sky, properly calibrated, and quite a bit deeper, but still low-resolution. So, if you’re just waiting for such a data collection, you can train your methods right now on the DFBS.

APPLAUSE via Obscore

[a composite of two rather noisy photo plates]
Aladin showing some Bamberg Sky Patrol plates (see towards the end of the post for what this is and how I made it).

At the Astroplate conference I blogged about recently, the people behind APPLAUSE gave a couple of talks about their Data Release 3. APPLAUSE is a fairly massive endeavour to make available data from some of the larger plate archives in Germany, and its DR3 even hit the non-Astronomy press last February.

Already for previous APPLAUSE releases, I’ve wanted to bring this data (or rather, its metadata) to the VO, but it never quite happened, basically because there was always another little thing that turned out to be too tedious to work out via mail. However, working out things interactively is exactly what conferences are great for. So, the kind APPLAUSE folks (thanks, Taavi and Harry) and I used the Astroplate to map their database schema (“schema” is jargon for what boils down to the set of tables and columns with which they describe their data) to the much simpler (and, admittedly, less powerful) IVOA Obscore one.

Sure, Obscore doesn’t deal with multiple exposures (like when the target field and the north pole were exposed on one plate to help precision photometry), object-guided images, and all the other interesting techniques that astronomers applied in the pre-digital age; it also doesn’t usefully cope with multiple scans of the same plate (for instance, to correct for imprecisions in the mechanics of flatbed scanners). APPLAUSE, of course, has to cope with them, since there are many reasons to preserve data of this kind.

Obscore, on the other hand, is geared towards uniform discovery, where too funky datasets in all likelihood cause more harm than good. So, when we mapped APPLAUSE to Obscore, of the 101138 scans of 70276 plates that the full APPLAUSE holds in DR3, only 44000 plate scans made it into the Obscore table. The advantage: whatever can be sensibly mapped to Obscore can now be queried together with all the other data in the world that others have published through Obscore.

You can immediately see the effect when you run the little python program doing the global discovery we gave in our plates tutorial. Here’s what it prints now (values from pre-APPLAUSE-in-Obscore are in square brackets):

Column t_exptime: 3460 values
  Min   12, Max 15300, Mean 890.24  [previous mean: 370.722]
---
Column em_mean: 3801 values
  Min 1.8081e-09, Max 9.3e-07, Mean 6.40804e-07 [No change: Sigh!]
---
Column t_mean: 4731 values
  Min 12564.5, Max 58126.3, Mean 49897.9 [previous mean: 51909.1]
---
Column instrument_name: 4747 values
  Matches from , Petzval, [Max Wolf's residence in
  Heidelberg, Maerzgasse, Wolf's Doppelastrograph,
  Heidelberg Koenigstuhl (24), Wolf's
  Doppelastrograph,] AG-Astrograph, [Zeiss Triplet
  15 cm Potsdam-Telegrafenberg], Zeiss Triplet,
  Astrograph (four 10-cm Tessar f/6 cameras),
  [3.5m APO, ROSAT PSPCC, Heidelberg Koenigstuhl
  (24), Bruce Astrograph, Calar Alto (493),
  Schmidt], Grosser Refraktor, [ROSAT HRI,
  DK-1.54], Hamburger Schmidt-Spiegel,
  [DFOSC_FASU], ESO 1-metre Schmidt telescope,
  Great Schmidt Camera, Lippert-Astrograph, Ross-B
  3", [AZT 22], Astrograph (six 10-cm Tessar f/6
  cameras), 1m-Spiegelteleskop, [ROSAT PSPCB],
  Astrograph (ten 10-cm Tessar f/6 cameras), Zeiss
  Objective
---
Column access_url: 4747 values [4067]

So – for the fields selected in the tutorial, there are 15% more images in the global Obscore image pool now than there were before APPLAUSE, and their mean observation date went a bit farther into the past. I’ve not made any statistics, but I suspect for many other fields the gain is going to be much higher. For a strong effect, try some random region covered by the Bamberg Sky Patrol on the southern sky.

But you have probably noticed the deep sigh in the annotations to the statistics above: yes, we don’t have the spectral band for the APPLAUSE data, which is why the stats on em_min don’t change. As a matter of fact, from the Obscore data you can’t even guess whether a plate is “more red” or “rather blue”, as Obscore doesn’t have an (agreed-upon) field for a “qualitative bandpass indicator”.

For some other data collections, we did map known emulsion/filter combinations to rough bandpasses (e.g., the Palomar-Leiden Trojan Survey, which only had a few of them). For APPLAUSE, there are 435 combinations of filter and emulsion (that’s a VOTable link that you can paste into TOPCAT’s load button in order to have a look at the table). Granted, quite a few of these pairs are (more or less) spurious because of inconsistent spelling. But we still gave up on researching the bandpasses even before we started.

If you’re a photographic plate buff: you could help us and posterity a lot if you could go through this list and, at least for some combinations, tell us what, roughly, the lower and upper limits of the corresponding bandpasses might have been (what DaCHS already knows is in the plate-relevant data near the bottom of the file). As usual, send mail to gavo@ari.uni-heidelberg.de if you have anything to contribute.

Finally, here’s the brief explanation of the image for this article: Well, I wanted to find some Bamberg Sky Patrol images for a single field to play with. I knew they were primarily located in the South, and were made using Tessar cameras. So, I ran

SELECT t_min, access_url, s_region
FROM ivoa.obscore
WHERE instrument_name like '%Tessar%'
AND 1=CONTAINS(POINT(345, -38), s_region)

on GAVO’s TAP service. Since Aladin 10, you can do that from within the program (although some versions will reject this query because they mistakenly believe the ADQL is bad. Query through TOPCAT and send the result over to Aladin if that bites you). Incidentally, when there are s_region values in Obscore tables, it’s a good idea to use them as I do here, as it’s quite a bit more likely that this query will use indices than some condition on s_ra and s_dec. But then not all services fill s_region properly, so for all-VO queries you will probably want to make do with s_ra and s_dec.

From that result I first made the inset bar graph in the article image to show the temporal distribution of the Patrol plates. And then I grabbed two (rather randomly selected) plates and had Aladin produce a red-blue composite of them. Whatever is really red or really blue in that image may correspond to a transient event. Or, as certainly the case with that little hair (or whatever) that shines out in blue, it may not.

Find Outliers using ADQL and TAP

[Annie Cannon's notebook and a plot]
Two pages from Annie Cannon’s notebooks¹, and a histogram of the basic BP-RP color distribution in the HD catalogue (blue) and the distribution of the outliers (red). For more of Annie Cannon’s notebooks, search on ADS.

The other day I gave one of my improvised live demos (“What, roughly, are you working on?”) and I ended up needing to translate identifiers from the Henry Draper Catalogue to modern positions. Quickly typing “Henry Draper” into TOPCAT’s TAP search window didn’t yield anything useful (some resources only using the HD, and a TAP service that didn’t support uploads – hmpf).

Now, had I tried the somewhat more thorough WIRR Registry interface, I’d have noted the HD catalogue at VizieR and in particular Fabricius et al’s HD-Tycho 2 match (explaining why they didn’t show up in TOPCAT is a longer story; we’re working on it). But alas, I didn’t, and so I set out to produce a catalogue matching HD and Gaia DR2, easily findable from within TOPCAT’s TAP client. Well, it’s here in the form of the hdgaia.main table in our data center.

Considering the nontrivial data discovery and some yak shaving I had to do to get from HD identifiers to Gaia DR2 ones, it was perhaps not as futile an exercise as I had thought now and then during the preparation of the thing. And it gives me the chance to show a nice ADQL technique to locate outliers.

In this case, one might ask: Which objects might Annie Cannon and colleagues have misclassified? Or perhaps the objects have changed their spectrum between the time Cannon’s photographic plates have been taken and Gaia observed them? Whatever it is: We’ll have to figure out where there are unusual BP-RPs given the spectral type from HD.

To figure this out, we’ll first have to determine what’s “usual”. If you’ve worked through our ADQL course, you know what to expect: grouping. So, to get a table of average colours by spectral type, you’d say (all queries executable on the TAP service at http://dc.g-vo.org/tap):

select spectral, 
  avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
  count(*) as ct
from hdgaia.main
join gaia.dr2light
using (source_id)
group by spectral

– apart from the join that’s needed here because we want to pull photometry from gaia, that’s standard fare. And that join is the selling point of this catalog, so I won’t apologise for using it already in the first query.

The next question is how strict we want to be before we say something that doesn’t have the expected colour is unusual. While these days you can rather easily use actual distributions, at least for an initial analysis just assuming a Gaussian and estimating its FWHM as the standard deviation works pretty well if your data isn’t excessively nasty. Regrettably, there is no aggregate function STDDEV in ADQL (you could still ask for it: head over to the DAL mailing list before ADQL 2.1 is a done deal!). However, you may remember that Var(X)=E(X²)-E(X)², that the average is an estimator for the expectation, and that the standard deviation is actually an estimator for the square root of the variance. And that these estimators will work like a charm if you’re actually dealing with Gaussian data.

So, let’s use that to compute our standard deviations. While we are at it, throw out everything that’s not a star², and ensure that our groups have enough members to make our estimates non-ridiculous; that last bit is done through a HAVING clause that essentially works like a WHERE, just for entire GROUPs:

select spectral, 
  avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
  sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
    power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
  count(*) as ct
from hdgaia.main
join gaia.dr2light
  using (source_id)
where m_v<18
group by spectral
having count(*)>10

This may look a bit scary, but if you read it line by line, I’d argue it’s no worse than our harmless first GROUP BY query.

From here, the step to determine the outliers isn’t big any more. What the query I’ve just written produces is a mapping from spectral type to the means and scales (“µ,σ” in the rotten jargon of astronomy) of the Gaussians for the colors of the stars having that spectral type. So, all we need to do is join that information by spectral type to the original table and then see which actual colors are further off than, say, three sigma. This is a nice application of the common table expressions I’ve tried to sell you in the post on ADQL 2.1; our determine-what’s-usual query from above stays nicely separated from the (largely trivial) rest:

with standards as (select spectral, 
  avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
  sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
    power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
  count(*) as ct
  from hdgaia.main
  join gaia.dr2light
  using (source_id)
  where m_v<18
  group by spectral
  having count(*)>10)
select * 
from hdgaia.main
join standards 
using (spectral)
join gaia.dr2light using (source_id)
where 
  abs(phot_bp_mean_mag-phot_rp_mean_mag-col)>3*sig_col
  and m_v<18

– and that’s a fairly general pattern for doing an initial outlier analysis on the remote side. For HD, this takes a few seconds and yields 2722 rows (at least until we also push HDE into the table). That means you can keep 99% of the rows (the boring ones) on the server and just pull the ones that could be interesting. These 99% savings aren’t terribly much with a catalogue like the HD that’s small by today’s standards. For large catalogs, it’s the difference between a download of a couple of minutes and pulling data for a day while frantically freeing disk space.

By the way: that there are only 2.7e3 outliers among 2.25e5 objects – while Annie Cannon, Williamina Fleming, Antonia Maury, Edward Pickering, and the rest of the crew not only had to come up with the spectral classification while working on the catalogue but also had to classify all these objects manually – is an amazing feat even if all of those rows actually were misclassifications (which they certainly aren’t); the machine classifiers of today would be proud to only get 1% wrong.

The inset in the facsimile of Annie Cannon’s notebooks above shows how the outliers are distributed in color space relative to the full catalogue, where the basic catalogue is in blue and the outliers (scaled by 70) in red. Wouldn’t it make a nice little side project to figure out the reason for the outlier clump on the red side of the histogram?


¹The notebook pages are from a notebook Annie Cannon used in 1929. The material was kindly provided by Project PHAEDRA at the John G. Wolbach Library, Harvard College Observatory.
²I’ll not hide that I was severely tempted to undo the mapping of object classes to – for HD – unrealistic magnitudes (20 .. 50) but then left the HD as it came from ADC; I still doubt that decision was well taken, and sure enough, the example query above already has insane constraints on m_v reflecting that encoding. From today’s position, of course there should have been an extra column or, better yet, a different catalogue for nonstellar objects. Ah well. It’s always hard to break unhealthy patterns.

Deredden using TAP

An animated color-magnitude diagram
Raw and dereddened CMD for a region in Cygnus.

Today I published a nice new dataset on our TAP service: the Bayestar17 3D dust map derived from Pan-STARRS 1 by Greg Green et al. I mention in passing that this was made particularly enjoyable because Greg and friends put an explicit license on their data (in this case, CC-BY-SA).

This dust map is probably a fascinating resource by itself, but the really nifty thing is that you can use it to correct all kinds of photometric data for extinction – at least to some extent. On the Bayestar web page, the authors give some examples for usage – and with our new service, you can use TAP as well to correct photometry for extinction.

To see how, first have a look at the table metadata for the prdust.map_union table; this is what casual users probably should look at. More specifically, at the coverage, best_fit, and grdiagnostic columns.

coverage here is an interval of 10-healpixes. It has to be an interval because the original data comes on wildly different levels; depending on the density of stars, sometimes it takes the area of a 6-healpix (about a square degree) to get enough signal, whereas in the galactic plane a 10-healpix (a thousandth of a square degree) already has enough stars. To make the whole thing conveniently queriable without exploding a 6-healpix row into 1000 identical rows, larger healpixes translate into intervals of 10-healpixes. Don’t panic, though, I’ll show how to conveniently query this below.

best_fit and grdiagnostic are arrays (remember the light curves in Gaia DR2?), in bins of 0.5 in distance modulus (which is, in case you feel a bit uncertain as to the algebraic signs, 5 log10(dist)-5 for a distance in parsec), starting with a distance modulus of 4 and ending with 19. This means that for a distance modulus of 4.2 you should check the array index 0, whereas 4.3 already would be covered by array index 1. With this, best_fit[ind] gives E(B-V) = (B-V) – (B-V)₀ in the direction of coverage in a distance modulus bin of ind/2+4. For each best_fit[ind], grdiagnostic[ind] contains a quality measure for that value. You probably shouldn’t touch the E(B-V) if that measure is larger than 1.2.
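
In code, that indexing rule looks like this (zero-based, as you would use it in Python; the ADQL query further down adds 1 because ADQL arrays are one-based):

def dm_to_index(dist_mod):
    """Return the 0-based best_fit/grdiagnostic index for a distance
    modulus, following the 0.5 mag binning starting at distance modulus 4."""
    return int(round((dist_mod-4)*2))

print(dm_to_index(4.2))   # 0
print(dm_to_index(4.3))   # 1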

So, how does one use this?

To try things, let’s pull some Gaia data with distances; in order to have interesting extinctions, I’m using a patch in Cygnus (RA 288.5, Dec 2.3). If you live on the northern hemisphere and step out tonight, you could see dust clouds there with the naked eye (provided electricity fails all around, that is). Full disclosure: I tried the Coal Sack first but after checking the coverage of the dataset – which essentially is the sky north of -30 degrees – I noticed that wouldn’t fly. But stories like these are one reason why I’m making such a fuss about having standard STC coverage representations.

We want distances, and to dodge all the intricacies involved when naively turning parallaxes to distances, discussed at length in a paper by Xavier Luri et al (and elsewhere), I’m using precomputed distances from Bailer-Jones et al. (2018AJ....156...58B); you’ll find them on the “ARI Gaia” service; in TOPCAT’s TAP dialog simply search for “Gaia” – that’ll give you the GAVO DC TAP search, too, and we’ll need that in a second.

The pre-computed distances are in the gaiadr2_complements.geometric_distance table, which can be joined to the main Gaia object catalog using the source_id column. So, here’s a query to produce a little photometric catalog around our spot in Cygnus (we’re discarding objects with excessive parallax errors while we’re at it):

SELECT 
r_est, 5*log10(r_est)-5 as dist_mod,
phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
ra, dec
FROM
gaiadr2.gaia_source
JOIN gaiadr2_complements.geometric_distance
USING (source_id)
WHERE
parallax_over_error>1
AND 1=CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 288.5, 2.3, 0.5 ))

The color-magnitude diagram resulting from this is the red point cloud in the animated GIF at the top. To reproduce it, just plot phot_bp_mean_mag-phot_rp_mean_mag against phot_g_mean_mag-dist_mod (and invert the y axis).
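
If you’d rather plot outside of TOPCAT, here’s a minimal matplotlib sketch; it assumes you have saved the result of the query above as a VOTable called cygnus.vot (the file name, of course, is arbitrary):

from astropy.table import Table
from matplotlib import pyplot as plt

t = Table.read("cygnus.vot", format="votable")

plt.scatter(
    t["phot_bp_mean_mag"] - t["phot_rp_mean_mag"],
    t["phot_g_mean_mag"] - t["dist_mod"],
    s=2, color="red")
plt.xlabel("BP - RP")
plt.ylabel("G - dist_mod")
plt.gca().invert_yaxis()   # bright is up in a CMD
plt.show()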

De-reddening this needs a few minor technicalities. The most important one is how to match against the odd intervals of healpixes in the prdust.map_union table. A secondary one is that we have only pulled equatorial coordinates, and the healpixes in prdust are in galactic coordinates.

Computing the healpix requires the ivo_healpix_index ADQL user defined function (UDF) that you may have met before; and since we have to go from ICRS to Galactic, we also need a UDF I have recently defined to finally get the discussion on having a “standard library” of astrometric functions in ADQL going: gavo_transform. Here’s how to get a 10-healpix as required for map_union from ra and dec:

CAST(ivo_healpix_index(10, 
  gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER)

The CAST call is a pure technicality – ivo_healpix_index returns a 64-bit integer, which I can’t use in my interval logic.

You could do the comparison against the intervals yourself, but as argued in the Registry-STC article, this is one of those trivial things that are easy to get wrong. So, let’s use the ivo_interval_has UDF; it goes in the join condition to properly match prdust healpixes to catalog positions. Then our total query – which, I hope, should be reasonably easy to adapt to similar problems – is:

WITH sources AS (
  SELECT phot_g_mean_mag, 
    phot_bp_mean_mag, 
    phot_rp_mean_mag,
    dist_mod,
    CAST(ivo_healpix_index(10, 
      gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER) AS hpx,
    ROUND((dist_mod-4)*2)+1 AS dist_mod_bin
  FROM TAP_UPLOAD.T1)

SELECT
  phot_bp_mean_mag-phot_rp_mean_mag-dust.best_fit[dist_mod_bin] AS color,
  phot_g_mean_mag-dist_mod-
    dust.best_fit[dist_mod_bin]*3.384 AS abs_mag,
  dust.grdiagnostic[dist_mod_bin] as qual
FROM sources
JOIN prdust.map_union AS dust
ON (1=ivo_interval_has(hpx, coverage))

(If you’re following along: you have to switch to the GAVO DC TAP service to run this, and you will probably have to change the index after TAP_UPLOAD to whatever table number TOPCAT has assigned to your upload.)

Ok, in the photometry department there’s a bit of cheating going on here – I’m correcting Gaia BP-RP with E(B-V), and I’m using the factor for Johnson V to estimate the extinction in Gaia G (if you’re curious where that comes from: the footnote on best_fit and the MC extinction service docs should get you started), so this is far from physically correct. But, as you can see from the green cloud in the plot above, it already helps a bit. And if you find out better factors, by all means let me know so I can add an update… right here:

Update (2018-09-11): The original data creator, Gregory Green, points out that the thing with having a better factor for Gaia G isn’t that simple because, as he says, “Gaia G is very broad, [and] the extinction coefficients are much more dependent on stellar type, and extinction is also more nonlinear with dust column (extinction is only linear with dust column and independent of stellar type for an infinitely narrow passband)”. So – when de-reddening, prefer narrow passbands. But whether narrow or wide: TAP helps you.

Gaia DR2: A light version and light curves

Screenshot: TOPCAT and matplotlib
TOPCAT is doing datalink, and our little Python script has plotted a two-color time series of RMC 18 (or so I think).

If anyone ever writes a history of the VO, the second data release of Gaia on April 25, 2018 will probably mark its coming-of-age – at least if you, like me, consider the Registry the central element of the VO. It was spectacular to view the spike of tens of Registry queries per second right around 12:00 CEST, the moment the various TAP services handing out the data made it public (with great aplomb, of course).

In GAVO’s Data Center we also carry Gaia DR2 data. Our host institute, the Zentrum für Astronomie in Heidelberg, also has a dedicated Gaia server. This relieves us from having to be a true mirror of the upstream data release. And since the source catalog has lots and lots of columns that most users will not be using most of the time, we figured a “light” version of the source catalog might fill an interesting ecological niche: Behold gaia.dr2light on the GAVO DC TAP service, containing essentially just the basic astrometric parameters and the diagonal of the covariance matrix.

That has two advantages: Result sets with SELECT * are a lot less unwieldy (but: just don’t do this with Gaia DR2), and, more importantly, a lighter table puts less load on the server. You see, conventional databases read entire rows when processing data, and having just 30% of the columns means we will be 3 times faster on I/O-bound tasks (assuming the same hardware, of course). Hence, and contrary to several other DR2-carrying sites, you can perform full sequential scans before timing out on our TAP service on gaia.dr2light. If, on the other hand, you need to do debugging or full-covariance-matrix error calculations: The full DR2 gaia_source table is available in many places in the VO. Just use the Registry.
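
To give you an idea of what such a full scan might look like in practice, here is a sketch using pyvo; the actual aggregate is just an example I made up, but it does have to touch every row of gaia.dr2light, and running it as an async job keeps you clear of the sync time limits:

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")

# a histogram of parallaxes in 1 mas bins; any aggregate that has to
# read the whole table would do as an example
job = svc.submit_job("""
    SELECT plx_bin, COUNT(*) AS n
    FROM (
      SELECT FLOOR(parallax) AS plx_bin
      FROM gaia.dr2light) AS q
    GROUP BY plx_bin
    """)
job.run()
job.wait()
print(job.fetch_result().to_table())
job.delete()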

Photometry via TAP

A piece of Gaia DR2 that’s not available in this form anywhere else is the lightcurves; that’s per-transit photometry in the G, BP, and RP bands for about 0.5 million objects that the reduction system classified as variable. ESAC publishes these through datalink from within their gaia_source table, and what you get back is a VOTable that has the photometry in the three bands interleaved.

I figured it might be useful if that data were available in a TAP-queriable table, with the light curves stored right in the database. And that’s how gaia.dr2epochflux came into being. In there, you have three triples of arrays: the epochs (g_transit_time, bp_obs_time, and rp_obs_time), the fluxes (g_transit_flux, bp_flux, and rp_flux), and their errors (you can probably guess their names). So, to retrieve G lightcurves where available together with a gaia_source query of your liking, you could write something like

SELECT g.*, g_transit_time, g_transit_flux
FROM gaia.dr2light AS g
LEFT OUTER JOIN gaia.dr2epochflux
USING (source_id)
WHERE ...whatever...

– the LEFT OUTER JOIN arranges things such that the g_transit_time and g_transit_flux columns simply are NULL when there are no lightcurves; with a normal (“inner”) join, rows without lightcurves would not be returned in such a query.

To give you an idea of what you can do with this, suppose you would like to discover new variable blue supergiants in the Gaia data (who knows – you might discover the precursor of the next nearby supernova!). You could start with establishing color cuts and train your favourite machine learning device on light curves of variable blue supergiants. Here’s how to get (and, for simplicity, plot) time series of stars classified as blue supergiants by Simbad for which Gaia DR2 lightcurves are available, using pyvo and a little async trick:

from matplotlib import pyplot as plt
import pyvo

def main():
  simbad = pyvo.dal.TAPService(
    "http://simbad.u-strasbg.fr:80/simbad/sim-tap")
  gavodc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")

  # Get blue supergiants from Simbad
  simjob = simbad.submit_job("""
    select main_id, ra, dec
    from basic
    where otype='BlueSG*'""")
  simjob.run()

  # Get lightcurves from Gaia
  try:
    simjob.wait()
    time_series = gavodc.run_sync("""
      SELECT b.*, bp_obs_time, bp_flux, rp_obs_time, rp_flux
      FROM (SELECT
          main_id, source_id, g.ra, g.dec
        FROM gaia.dr2light AS g
        JOIN TAP_UPLOAD.t1 AS tc
        ON (0.002>DISTANCE(tc.ra, tc.dec, g.ra, g.dec))
        OFFSET 0) AS b
      JOIN gaia.dr2epochflux
      USING (source_id)
      """, 
      uploads={"t1": simjob.result_uri})
  finally:
    simjob.delete()

  # Now plot one after the other
  for row in time_series.table:
    plt.plot(row["bp_obs_time"], row["bp_flux"])
    plt.plot(row["rp_obs_time"], row["rp_flux"])
    plt.show(block=False)
    input("{}; press return for next...".format(row["main_id"]))
    plt.cla()

if __name__=="__main__":
  main()

If you bother to read the code, you’ll notice that we transfer the Simbad result directly to the GAVO data center without first downloading it. That’s fairly boring in this case, where the table is small. But if you have a narrow pipe for one reason or another and some 10⁵ rows, passing around async result URLs is a useful trick.

In this particular case the whole thing returns just four stars, so perhaps that’s not a terribly useful target for your learning machine. But this piece of code should get you started to where there’s more data.

You should read the column descriptions and footnotes in the query results (or from the reference URL) – this tells you how to interpret the times and how to make magnitudes from the fluxes if you must. You probably can’t hear it any more, but just in case: If you can, process fluxes rather than magnitudes from Gaia, because the errors are painful to interpret in magnitudes when the fluxes are small (try it!).
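
Just for illustration, here is what that conversion and its error propagation look like in Python; the zero point is something you have to take from the column descriptions, I am deliberately not quoting a value here:

import numpy as np

def flux_to_mag(flux, flux_error, zeropoint):
    """Convert a flux (and its error) to a magnitude (and its error).

    First-order error propagation of mag = zeropoint - 2.5*log10(flux);
    note how the magnitude error blows up for small fluxes, which is
    why you should stay in flux space if you can.
    """
    mag = zeropoint - 2.5 * np.log10(flux)
    mag_error = 2.5 / np.log(10) * flux_error / flux
    return mag, mag_error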

Note how the photometry data is stored in arrays in the database, and that VOTables can just transport these. The bad news is that support for manipulating arrays in ADQL is pretty much zero at this point; this means that, when you have trained your ML device, you’ll probably have to still download lots and lots of light curves rather than write some elegant ADQL to do the filtering server-side. However, I’d be highly interested to work out how some tastefully chosen user defined functions might enable offloading at least a good deal of that analysis to the database. So – if you know what you’d like to do, by all means let me know. Perhaps there’s something I can do for you.

Incidentally, I’ll talk a bit more about ADQL arrays in a blog post coming up in a few weeks (I think). Don’t miss it: subscribe to our feed.

Datalink

In the results from queries involving gaia.dr2epochflux, we also provide datalinks. These let you retrieve lightcurves that already have mags and that are more easily plotted. Perhaps more importantly, they link back to the full ESAC lightcurves that, in addition, give you a lot more debug information and are required if you want to reliably identify photometry points with the identifiers of the transits that generated them.
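
If you would like to follow these datalinks programmatically rather than in TOPCAT, a reasonably recent pyvo can do that, too. Here is a rough sketch; I am assuming pyvo’s iter_datalinks behaves as documented, so treat this as a starting point rather than gospel:

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
res = svc.run_sync("""
    SELECT TOP 5 source_id, ra, dec, g_transit_time, g_transit_flux
    FROM gaia.dr2epochflux
    JOIN gaia.dr2light USING (source_id)
    WHERE parallax > 50
    """)

# each result row comes with a datalink document; just list what is on offer
for dl in res.iter_datalinks():
    for link in dl:
        print(link.semantics, link.access_url)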

Datalink support in clients still is not great, but it’s growing nicely. Your ideas for workflows that should be supported are (again) most welcome – and have a good chance of being adopted. So, try things out, for instance by getting the most recent TOPCAT (as of this writing) and do the following:

  1. Open the VO/TAP dialog from the menu bar and double click the GAVO DC TAP service.
  2. Enter
    SELECT source_id, ra, dec,
    phot_bp_mean_mag, phot_rp_mean_mag, phot_g_mean_mag,
    g_transit_time, g_transit_flux,
    rp_obs_time, rp_flux
    FROM gaia.dr2epochflux 
    JOIN gaia.dr2light
    USING (source_id)
    WHERE parallax>50
    

    into the ADQL text area to retrieve lightcurves for the more nearby variables (in reality, you’d have to be a bit more careful with the distances, but you already knew that).

  3. plot something like phot_bp_mean_mag-phot_rp_mean_mag vs. phot_g_mean_mag (and adapt the plot to fit your viewing habits).
  4. Open the dialog for Views/Activation Actions (from the menu bar or the tool bar – same thing), check “Invoke Service”, choose “View Datalink Table”.
  5. Whenever you click on a point in your CMD, a window will pop up in which you can choose between the time series in the various bands, and you can pull in the data from ESAC; to load a table, select “Load Table” from the actions near the foot of the datalink table and click “Invoke”.

Yeah. It’s clunky. Help us make it better with your fresh ideas for interfaces (and don’t be cross with us if we have to marry them with what’s technically feasible and readily generalised).

SSAP and Obscore

If you’re fed up with bleeding-edge tech, the light curves are also available through good old SSAP and Obscore. To use that, just get Splat (or another SSA client, preferably with a bit of time series support). Look for a Gaia DR2 time series service (you may have to update the service list before you find it), enter (in keeping with our LBV theme) S Dor as position and hit “Lookup” followed by “Send Query”. Just click on any result to view the time series – and then apply Splat’s rich tool set to it.
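
And if you prefer scripting to clicking, here is a pyvo sketch of roughly the same thing; the registry keywords and the way I resolve S Dor are guesses on my part, so you may well have to adapt them:

import pyvo
from astropy.coordinates import SkyCoord

# look for SSA services that sound like Gaia DR2 time series
# (the keywords are a guess; refine as needed)
services = pyvo.registry.search(servicetype="ssa", keywords=["gaia dr2 epoch"])

pos = SkyCoord.from_name("S Dor")   # resolved through Sesame

for resource in services:
    try:
        matches = resource.service.search(pos=pos, diameter=0.002)
    except Exception:
        continue   # skip services that don't respond as expected
    for rec in matches:
        print(resource.res_title, rec.getdataurl())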

Update (2018-05-08): Clusters

Here’s another quick application – how about looking for variable stars in clusters? This piece of ADQL should get you started:

SELECT TOP 100 
  source_id, ra, dec, parallax, g.pmra, g.pmdec,
  m.name, m.pmra AS c_pmra, m.pmde AS c_pmde, 
  m.e_pm AS c_e_pm,
  1/dist AS cluster_parallax
FROM 
  gaia.dr2epochflux
  JOIN gaia.dr2light AS g USING (source_id)
  JOIN mwsc.main AS m
  ON (1=CONTAINS(
    POINT(g.ra, g.dec),
    CIRCLE(m.raj2000, m.dej2000, rcluster)))
WHERE IN_UNIT(pmdec, 'deg/yr') BETWEEN m.pmde-m.e_pm*3 AND m.pmde+m.e_pm*3

– yes, you’ll want to constrain pmra, too, and the distance, and properly deal with errors and all. But you get simple lightcurves for free. Just add them in the SELECT clause!
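
For instance, a pyvo version that pulls the G-band light curves along with the cluster match (leaving out the extra constraints I just mentioned for brevity) might look like this:

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
res = svc.run_sync("""
    SELECT TOP 100
      source_id, m.name, g_transit_time, g_transit_flux
    FROM
      gaia.dr2epochflux
      JOIN gaia.dr2light AS g USING (source_id)
      JOIN mwsc.main AS m
      ON (1=CONTAINS(
        POINT(g.ra, g.dec),
        CIRCLE(m.raj2000, m.dej2000, rcluster)))
    WHERE IN_UNIT(pmdec, 'deg/yr') BETWEEN m.pmde-m.e_pm*3 AND m.pmde+m.e_pm*3
    """)

# one line per candidate variable, with its host cluster and epoch count
for row in res:
    print(row["name"], len(row["g_transit_time"]), "epochs")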