Crazy Shapes in TAP

[Image: OpenNGC shapes]
Fig. 1: A complex shape from OpenNGC: MOCs need not be convex, or simply connected, or anything.

So far, when you did spherical geometry in ADQL, you had points, circles, and polygons as data types, and you could test for intersection and containment as operations. This feature set is a bit unsatisfying because there are no (algebraic) groups in this picture: When you join or intersect two circles, the result is a circle only if one contains the other. With non-intersecting polygons, you will again not have a (simply connected) spherical polygon in the end.

Enter MOCs (which I’ve mentioned a few times before on this blog): these are essentially arbitrary shapes on the sky, in practice represented through lists of pixels, cleverly done so they can be sufficiently precise and rather compact at the same time. While MOCs are powerful and surprisingly simple in practice, ADQL doesn’t know about them so far, which limits quite a bit what you can do with them. Well, DaCHS would serve them since about 1.3 if you managed to push them into the database, but there were no operations you could do on them.

Thanks to work done by credativ (who were really nice to work with), funded with some money we had left from our previous e-inf-astro project (BMBF FKZ 05A17VH2) on the pgsphere database extension, this has now changed. At least on the GAVO data center, MOCs are now essentially first-class citizens that you can create, join, and intersect within ADQL, and you can retrieve the results. All operators of DaCHS services are just a few updates away from being able to offer the same.

So, what can you do? To follow what’s below, get a sufficiently new TOPCAT (4.7 will do) and open its TAP client on the GAVO DC TAP service.

Basic MOC Operations in TAP

First, let’s make sure you can plot MOCs; run

SELECT name, deepest_shape
FROM openngc.shapes

Then do Graphics/Sky Plot, and in the window that pops up then, Layers/Add Area Control. Then select your new table in the Position tab, and finally choose deepest_shape as area (yeah, this could become a bit more automatic and probably will over time). You will then see the footprints of a few NGC objects (OpenNGC’s author Mattia Verga hasn’t done all yet; he certainly welcomes help on OpenNGC’s version control repo), and you can move around in the plot, yielding perhaps something like Fig. 1.

Now let’s color these shapes by object class. If you look, openngc.shapes has an obj_type column – let’s group on it:

SELECT
  obj_type,
  shape,
  AREA(shape) AS ar
FROM (
  SELECT obj_type, SUM(deepest_shape) AS shape
  FROM openngc.shapes
  GROUP BY obj_type) AS q

(the extra subquery is a workaround necessary because the area function wants a geometry or a column reference, and ADQL doesn’t allow aggregate functions – like sum – as either of these).

Coloured shapes
Fig. 2: OpenNGC shapes grouped and coloured by type.

In the result you will see that so far, contours for about 40 square degrees of star clusters with nebulae have been put in, but only 0.003 square degrees of stellar associations. And you can now plot by the areas covered by the various sorts of objects; in Fig. 2, I’ve used Subsets/Classify by Column in TOPCAT’s Row Subsets to have colours indicate the different object types – a great workaround when one deals with categorical variables in TOPCAT.

MOCs and JOINs

Another table that already has MOCs in it is rr.stc_spatial, which has the coverage of VO resources (and is the deeper reason I’ve been pushing improved MOC support in pgsphere); this isn’t available for all resources yet, but at least there are about 16000 in already. For instance, here’s how to get the coverage of resources talking about planetary nebulae:

SELECT ivoid, res_title, coverage
FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  NATURAL JOIN rr.resource
WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20

(the rr.subject_uat table is a local extension to RegTAP that will be the subject of some future blog post; you could also use rr.res_subject, but because people still use wildly different keyword schemes – if any –, that wouldn’t be as much fun). When plotted, that’s the left side of Fig. 3. If you do that yourself, you will notice that the resolution here is about one degree, which is a special property of the sort of MOCs I am proposing for the Registry: They are of order 6. Resolution in MOC goes up with order, doubling with every step. Thus MOCs of order 7 have a resolution of about half a degree, MOCs of order 5 a resolution of about two degrees.
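To put numbers on this, here is a minimal sketch that computes the approximate size of a HEALPix cell at a given order. It takes the square root of the cell area as the “resolution”; that definition and the function name are choices made for this illustration, not something taken from the MOC standard.

```python
import math

def healpix_resolution_deg(order):
    """Approximate cell size in degrees: square root of the cell area."""
    # HEALPix has 12 base cells; each order splits every cell in four.
    n_cells = 12 * 4**order
    # The whole sky has about 41253 square degrees.
    sphere_sq_deg = 4 * math.pi * (180 / math.pi)**2
    return math.sqrt(sphere_sq_deg / n_cells)

for order in (5, 6, 7):
    print(order, round(healpix_resolution_deg(order), 2))
```

This gives roughly 1.83, 0.92, and 0.46 degrees for orders 5, 6, and 7, which matches the “about two, one, and half a degree” rule of thumb.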

One possible next step is to fetch the intersection of each of these coverages with, say, the DFBS (cf. the post on Byurakan spectra). That would look like this:

SELECT
  ivoid,
  res_title,
  gavo_mocintersect(coverage, dfbscoverage) as ovrlp
FROM (
  SELECT ivoid, res_title, coverage
  FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  NATURAL JOIN rr.resource
  WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20) AS others
CROSS JOIN (
  SELECT coverage AS dfbscoverage
  FROM rr.stc_spatial
  WHERE ivoid='ivo://org.gavo.dc/dfbsspec/q/spectra') AS dfbs

(the DFBS’ identifier I got with a quick query on WIRR). This uses the gavo_mocintersect user defined function (UDF), which takes two MOCs and returns a MOC of their common pixels. This is another important part of why MOCs are so cool: together with union and intersection, they form groups. It should not come as a surprise that there is also a gavo_mocunion UDF. The sum aggregate function we’ve used in our grouping above is (conceptually) built on that.

Planetary Nebula footprint and plate matches
Fig. 3: Left: The common footprint of VO resources declaring a subject of planetary-nebulae (and declaring a footprint). Right bottom: Heidelberg plates intersecting this, and, in blue, level-6 intersections. Above this, an enlarged detail from this plot.

You can also convert polygons and circles to MOCs using the (still DaCHS-only) MOC constructor. For instance, you could compute the coverage of all resources dealing with planetary nebulae, filtering against obviously over-eager ones by limiting the total area, and then match that against the coverages of images in, say, the Königstuhl plate archives HDAP. Watch this:

SELECT
  im.*,
  gavo_mocintersect(MOC(6, im.coverage), pn_coverage) as ovrlp
FROM (
  SELECT SUM(coverage) AS pn_coverage
  FROM rr.subject_uat
  NATURAL JOIN rr.stc_spatial
  WHERE uat_concept='planetary-nebulae'
  AND AREA(coverage)<20) AS c
JOIN lsw.plates AS im
ON 1=INTERSECTS(pn_coverage, MOC(6, coverage))

– so, the MOC(order, geo) function should give you a MOC for other geometries. There are limits to this right now because of limitations of the underlying MOC library; in particular, non-convex polygons are not supported right now, and there are precision issues. We hope this will be rectified soon-ish when we base pgsphere’s MOC operations on the CDS HEALPix library. Anyway, the result of this is plotted on the right of Fig. 3.

Open Ends

In case you have MOCs from the outside, you can also construct MOCs from literals, which happen to be the ASCII MOCs from the standard. This could look like this:

SELECT TOP 1
  MOC('4/30-33 38 52 7/324-934') AS ar
FROM tap_schema.tables

For now, you cannot combine MOCs and other geometries in CONTAINS and INTERSECTS expressions directly; this is mainly because in such an operation, the machine has to decide on the order of the MOC the other geometries are converted to (and computing the predicates between geometry and MOC directly is really painful). This means that if you have a local table with MOCs in a column cmoc that you want to compare against a polygon-valued column coverage in a remote table like this:

SELECT *
FROM
  lsw.plates AS db
  JOIN tap_upload.t6
ON 1=CONTAINS(coverage, cmoc) -- fails!

you will receive a rather scary message of the type “operator does not exist: spoly <@ smoc”. To fix it (until we’ve worked out how to reasonably let the computer do that), explicitly convert the polygon:

SELECT *
FROM
  lsw.plates AS db
  JOIN tap_upload.t6
ON 1=CONTAINS(MOC(7, coverage), cmoc)

(be stingy when choosing the order here – MOCs that already exist are fast, but making them at high order is expensive).

Having said all that: what I’ve written here is bleeding-edge, and it is not standardised yet. I’d wager, though, that we will see MOCs in ADQL relatively soon, and that what we will see will not be too far from this experiment. Well: Some rough edges, I’d hope, will still be smoothed out.

Getting This on Your Own DaCHS Installation

If you are running a DaCHS installation, you can contribute to takeup (and if not, you can stop reading here). To do that, you need to upgrade to DaCHS’s latest beta (anything newer than 2.1.4 will do) to have the ADQL extension, and, even more importantly, you need to install the postgresql-pgsphere package from our release repository (that’s version 1.1.4 or newer; in a few weeks, getting it from Debian testing would work as well).

You will probably not get that automatically, because if you followed our normal installation instructions, you will have a package called postgresql-11-pgsphere installed (apologies for this chaos; as usual, every single step made sense). The upshot is that with our release repo added, sudo apt install postgresql-pgsphere should give you the new code.

That’s not quite enough, though, because you also need to acquaint the database with the new functions. This can only be done with database administrator privileges, which DaCHS by design does not possess. What DaCHS can do is figure out the commands to do that when it is called as dachs upgrade -e. Have a look at the output, and if you are satisfied it is about what to expect, just pipe it into psql as a superuser; in the default installation, dachsroot would be sufficiently privileged. That is:

dachs upgrade -e | psql gavo   # as dachsroot

If running

select top 1 gavo_mocunion(moc('1/3'), moc('2/9')) 
from tap_schema.tables

through your TAP endpoint returns ‘1/3 2/9’, then all is fine. For entertainment, you might also make sure that gavo_mocintersect(moc('1/3'), moc('2/13')) is 2/13 as expected, and that if you intersect with 2/3 you get back an empty string.
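To see why these are the expected values, here is a toy model of MOC union and intersection on ASCII MOCs in plain Python. It is a sketch for illustration only: the function names are made up, it does not write ranges on output, and it expands everything to the deepest order involved, which is certainly not how pgsphere does it (and would be hopeless at high orders).

```python
def parse_moc(text):
    """Parse an ASCII MOC like '4/30-33 38 7/324' into (order, pixel) pairs."""
    cells, order = [], None
    for token in text.split():
        if "/" in token:
            order_text, token = token.split("/", 1)
            order = int(order_text)
        if token:
            first, _, last = token.partition("-")
            cells.extend((order, pix)
                         for pix in range(int(first), int(last or first) + 1))
    return cells


def flatten(cells, order):
    """Return the set of pixels at the given order covered by cells."""
    pixels = set()
    for cell_order, pix in cells:
        shift = 2 * (order - cell_order)  # each order step splits a pixel in four
        pixels.update(range(pix << shift, (pix + 1) << shift))
    return pixels


def to_ascii(pixels, order):
    """Serialise pixels at the given order, merging complete sibling groups."""
    by_order = {}
    while pixels:
        # Parents all four of whose children are present move up one order.
        full = {p >> 2 for p in pixels
                if order > 0
                and all(((p >> 2) * 4 + i) in pixels for i in range(4))}
        rest = sorted(p for p in pixels if (p >> 2) not in full)
        if rest:
            by_order[order] = rest
        pixels, order = full, order - 1
    return " ".join("{}/{}".format(o, " ".join(map(str, by_order[o])))
                    for o in sorted(by_order))


def combine(a, b, operation):
    """Apply a set operation to two ASCII MOCs and return an ASCII MOC."""
    cells_a, cells_b = parse_moc(a), parse_moc(b)
    order = max((o for o, _ in cells_a + cells_b), default=0)
    return to_ascii(
        operation(flatten(cells_a, order), flatten(cells_b, order)), order)


print(combine("1/3", "2/9", set.union))          # 1/3 2/9
print(combine("1/3", "2/13", set.intersection))  # 2/13
print(repr(combine("1/3", "2/3", set.intersection)))  # ''
```

The point is that pixel 1/3 consists of the order-2 pixels 12 through 15, so 2/13 is inside it while 2/3 and 2/9 are not.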

So – let’s bring MOCs to ADQL!

The Bochum Galactic Disk Survey

[Image: Patches of higher perceived variability on the Sky]
Fig 1: How our haphazard variability ratio varies over the sky (galactic coordinates). And yes, it’s clear that this isn’t dominated by physical variability.

About a year ago, I reported on a workshop on “Large Surveys with Small Telescopes” in Bamberg; at around the same time, I published an example for those, the Bochum Galactic Disk Survey BGDS, which used a twin 15 cm robotic telescope in some no longer forsaken place in the Andes mountains to monitor the brighter stars in the southern Milky Way. While some tables from an early phase of the survey have been on VizieR for a while, we now publish the source images (also in SIAP and Obscore), the mean photometry (via SCS and TAP) and, perhaps most fun of all, the lightcurves (via SSAP and TAP) – a whopping 35 million of the latter.

This means that in tools like Aladin, you can now find such light curves (and images in two bands from a lot of epochs) when you are in the survey’s coverage, and you can run TAP queries on GAVO’s server against the full photometry table and the time series.

Regular readers of this blog will not be surprised to see me use this as an excuse to show off a bit of ADQL trickery.

If you have a look at the bgds.phot_all table in your favourite TAP client, you’ll see that it has a column amp, giving the difference between the highest and lowest magnitude. The trouble is that amp for almost all objects just reflects the measurement error rather than any intrinsic variability. To get an idea what’s “normal” (based on the fact that essentially all stars have essentially constant luminosity on the range and resolution scales considered here), run a query like

SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n
FROM bgds.phot_all
WHERE nobs>10
GROUP BY bin

As this scans the entire 75 million rows of the table, you will probably have to use async mode to run this.
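In case the ROUND trick in that query looks opaque: it bins amp/err_mag into bins 0.1 wide and counts the members of each bin. A rough Python equivalent for rows you already have locally might look like this (the function name and the tuple layout are assumptions for this sketch; also note that Python rounds exact ties to even, which the database may handle differently):

```python
from collections import Counter

def amplitude_histogram(rows, min_nobs=10):
    """Count objects per bin of amp/err_mag, with bins 0.1 wide.

    rows is an iterable of (amp, err_mag, nobs) tuples.
    """
    counts = Counter()
    for amp, err_mag, nobs in rows:
        if nobs > min_nobs:
            # Same binning as ROUND(amp/err_mag*10)/10 in the query
            counts[round(amp / err_mag * 10) / 10] += 1
    return dict(counts)

sample = [(0.30, 0.1, 50), (0.31, 0.1, 12), (0.92, 0.1, 80), (0.5, 0.1, 3)]
print(amplitude_histogram(sample))
```

The last sample row is dropped because it has too few observations; the other three end up in the bins 3.0, 3.1, and 9.2.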

[Image: distribution of amplitude/mag error]
Figure 2: The distribution of amplitude over magnitude error for all BGDS objects with nobs>10 (blue) and the subset with a mean magnitude brighter than 15 (blue).

When it comes back, you will have, for objects where any sort of statistics make sense at all (hence nobs>10), a histogram (of sorts) of the amplitude in units of upstream’s magnitude error estimation. If you log-log-plot this, you’ll see something like Figure 2. The curve at least tells you that the magnitude error estimate is not very far off – the peak at about 3 “sigma” is not unreasonable since about half of the objects have nobs of the order of a hundred and thus would likely contain outliers that far out assuming roughly Gaussian errors.

And if you’re doing a rough cutoff at amp/err_mag>10, you will get perhaps not necessarily true variables, but at least potentially interesting objects.

Let’s use this insight to see if we spot any pattern in the distribution of these interesting objects. We’ll use the HEALPix technique I’ve discussed three years ago in this blog, but with a little twist from ADQL 2.1: The Common Table Expressions or CTEs I have already mentioned in my blog post on ADQL 2.1 and then advertised in the piece on the Henry Draper catalogue. The brief idea, again, is that you can write queries and give them a name that you can use elsewhere in the query as if it were an actual table. It’s not much different from normal subqueries, but you can re-use CTEs in multiple places in the query (hence the “common”), and it’s usually more readable.

Here, we first create a version of the photometry table that contains HEALPixes and our variability measure, use that to compute two unsophisticated per-HEALPix statistics and eventually join these two to our observable, the ratio of suspected variables to all stars observed (the multiplication with 1.0 is a cheap way to make a float out of a value, which is necessary here because a/b does integer division in ADQL if a and b are both integers):

WITH photpoints AS (
  SELECT
    amp/err_mag AS redamp,
    amp,
    ivo_healpix_index(5, ra, dec) AS hpx
  FROM bgds.phot_all
  WHERE
    nobs>10
    AND band_name='SDSS i'
    AND mean_mag<16),
all_objs AS (
  SELECT count(*) AS ct,
    hpx
    FROM photpoints GROUP BY hpx),
strong_var AS (
  SELECT count(*) AS ct,
    hpx
    FROM photpoints
    WHERE redamp>4 AND amp>1 GROUP BY hpx)
SELECT
  strong_var.ct/(1.0*all_objs.ct) AS obs,
  all_objs.ct AS n,
  hpx
FROM strong_var JOIN all_objs USING (hpx)
WHERE all_objs.ct>20

If you plot this using TOPCAT’s HEALPix thingy and ask it to use Galactic coordinates, you’ll end up with something like Figure 1.
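If you would rather see the logic of that query outside of SQL, here is a minimal Python sketch of the same per-HEALPix statistic. The function name and the input layout are invented for this illustration; the rows are assumed to have already passed the cuts on nobs, band, and mean magnitude. In Python 3, / between integers already yields a float, so the 1.0 trick is not needed here.

```python
from collections import Counter

def variable_ratio(rows, min_count=20):
    """Return {hpx: (ratio, n)} of suspected variables to all objects.

    rows is an iterable of (hpx, amp, err_mag) tuples.
    """
    all_objs, strong_var = Counter(), Counter()
    for hpx, amp, err_mag in rows:
        all_objs[hpx] += 1
        if amp / err_mag > 4 and amp > 1:
            strong_var[hpx] += 1
    # Like the inner join in the query, only cells with at least one
    # suspected variable make it into the result.
    return {hpx: (n_var / all_objs[hpx], all_objs[hpx])
            for hpx, n_var in strong_var.items()
            if all_objs[hpx] > min_count}

rows = [(1, 2.0, 0.1)] * 5 + [(1, 0.1, 0.1)] * 20 + [(2, 0.1, 0.1)] * 30
print(variable_ratio(rows))
```

In this made-up sample, HEALPix 1 has 5 suspected variables among 25 objects, while HEALPix 2 has none and therefore does not show up at all.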

There clearly is some structure, but given that the variables ratio reaches up to 0.2, this is still reflecting instrumental or pipeline effects and thus earthly physics rather than astrophysics. And that’s going beyond what I’d like to talk about on a VO blog, although I’ll take any bet that you will see significant structure in the spatial distribution of the variability ratio at about any magnitude cutoff, since there are a lot of different population mixtures in the survey’s footprint.

Be that as it may, let’s have a quick look at the time series. As with the short spectra from the Byurakan use case, we’ve stored the actual time series as arrays in the database (the mjd and mags columns in bgds.ssa_time_series). Unfortunately, since they are a lot less array-like than homogeneous spectra, it’s also a lot harder to do interesting things with them without downloading them (I’m grateful for ideas for ADQL functions that will let you do in-DB analysis for such things). Still, you can at least easily download them in bulk and then process them in, say, python to your heart’s content. The Byurakan use case should give you a head start there.

For a quick demo, I couldn’t resist checking out objects that Simbad classifies as possible long-period variables (you see, as I write this, the public bohei over Betelgeuse’s brief waning is just dying down), and so I queried Simbad for:

SELECT ra, dec, main_id
FROM basic
WHERE
  otype='LP?'  -- Simbad's object type for long-period variable candidates
  AND 1=CONTAINS(
     POINT('', ra, dec),
     POLYGON('', 127, -30, 112, -30, 272, -30, 258, -30))

(as of this writing, Simbad still needs the ADQL 2.0-compliant first arguments to POINT and POLYGON), where the POLYGON is intended to give the survey’s footprint. I obtained that by reading off the coordinates of the corners in my Figure 1 while it was still in TOPCAT. Oh, and I had to shrink it a bit because Simbad (well, the underlying Postgres server, and, more precisely, its pg_sphere extension) doesn’t want polygons with edges longer than π. This will soon become less pedestrian: MOCs in relational databases are coming; more on this soon.

[TOPCAT action shot with a light curve display]
Fig 3: V566 Pup’s BGDS lightcurve in a TOPCAT configured to auto-plot the light curves associated with a row from the bgds.ssa_time_series table on the GAVO DC TAP service.

If you now do the usual spiel with an upload crossmatch to the bgds.ssa_time_series table and check “Plot Table” in Views/Activation Action, you can quickly page through the light curves (TOPCAT will keep the plot style as you go from dataset to dataset, so it’s worth configuring the lines and the error bars). Which could bring you to something like Fig. 3; and that would suggest that V* V566 Pup isn’t really long-period unless the errors are grossly off.

Gaia DR2: A light version and light curves

screenshot: topcat and matplotlib
Topcat is doing datalink, and our little python script has plotted a two-color time series of RMC 18 (or so I think).

If anyone ever writes a history of the VO, the second data release of Gaia on April 25, 2018 will probably mark its coming-of-age – at least if you, like me, consider the Registry the central element of the VO. It was spectacular to view the spike of tens of Registry queries per second right around 12:00 CEST, the moment the various TAP services handing out the data made it public (with great aplomb, of course).

In GAVO’s Data Center we also carry Gaia DR2 data. Our host institute, the Zentrum für Astronomie in Heidelberg, also has a dedicated Gaia server. This relieves us from having to be a true mirror of the upstream data release. And since the source catalog has lots and lots of columns that most users will not be using most of the time, we figured a “light” version of the source catalog might fill an interesting ecological niche: Behold gaia.dr2light on the GAVO DC TAP service, containing essentially just the basic astrometric parameters and the diagonal of the covariance matrix.

That has two advantages: Result sets with SELECT * are a lot less unwieldy (but: just don’t do this with Gaia DR2), and, more importantly, a lighter table puts less load on the server. You see, conventional databases read entire rows when processing data, and having just 30% of the columns means we will be 3 times faster on I/O-bound tasks (assuming the same hardware, of course). Hence, and contrary to several other DR2-carrying sites, you can perform full sequential scans before timing out on our TAP service on gaia.dr2light. If, on the other hand, you need to do debugging or full-covariance-matrix error calculations: The full DR2 gaia_source table is available in many places in the VO. Just use the Registry.

Photometry via TAP

A piece of Gaia DR2 that’s not available in this form anywhere else is the lightcurves; that’s per-transit photometry in the G, BP, and RP band for about 0.5 million objects that the reduction system classified as variable. ESAC publishes these through datalink from within their gaia_source table, and what you get back is a VOTable that has the photometry in the three bands interleaved.

I figured it might be useful if that data were available in a TAP-queriable table with lightcurves in the database. And that’s how gaia.dr2epochflux came into being. In there, you have three triples of arrays: the epochs (g_transit_time, bp_obs_time, and rp_obs_time), the fluxes (g_transit_flux, bp_flux, and rp_flux), and their errors (you can probably guess their names). So, to retrieve G lightcurves where available together with a gaia_source query of your liking, you could write something like

SELECT g.*, g_transit_time, g_transit_flux
FROM gaia.dr2light AS g
LEFT OUTER JOIN gaia.dr2epochflux
USING (source_id)
WHERE ...whatever...

– the LEFT OUTER JOIN arranges things such that the g_transit_time and g_transit_flux columns simply are NULL when there are no lightcurves; with a normal (“inner”) join, rows without lightcurves would not be returned in such a query.
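If the difference between the two kinds of join is new to you, this toy version of a left outer join on lists of dictionaries may help. It is purely an illustration in plain Python with made-up data, not what the database does internally.

```python
def left_outer_join(left, right, key):
    """Join lists of dicts on key; left rows without a partner get None."""
    right_columns = {col for row in right for col in row if col != key}
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # No partner: pad the columns from the right table with None (NULL)
        padding = [{col: None for col in right_columns}]
        for partner in by_key.get(row[key], padding):
            joined.append({**partner, **row})
    return joined

sources = [{"source_id": 1, "ra": 10.0}, {"source_id": 2, "ra": 20.0}]
curves = [{"source_id": 2, "g_transit_flux": [5.0, 6.0]}]
result = left_outer_join(sources, curves, "source_id")
print(result)
```

Source 1 has no lightcurve and comes back with g_transit_flux set to None; an inner join would have dropped that row altogether.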

To give you an idea of what you can do with this, suppose you would like to discover new variable blue supergiants in the Gaia data (who knows – you might discover the precursor of the next nearby supernova!). You could start with establishing color cuts and train your favourite machine learning device on light curves of variable blue supergiants. Here’s how to get (and, for simplicity, plot) time series of stars classified as blue supergiants by Simbad for which Gaia DR2 lightcurves are available, using pyvo and a little async trick:

from matplotlib import pyplot as plt
import pyvo

def main():
  # The access URLs of the two TAP services are not given here;
  # fill them in (the Registry will tell you).
  simbad = pyvo.dal.TAPService("")
  gavodc = pyvo.dal.TAPService("")

  # Get blue supergiants from Simbad; the job runs asynchronously
  simjob = simbad.submit_job("""
    select main_id, ra, dec
    from basic
    where otype='BlueSG*'""")
  simjob.run()

  # Get lightcurves from Gaia, uploading the Simbad result by URL
  try:
    simjob.wait()
    time_series = gavodc.run_sync("""
      SELECT b.*, bp_obs_time, bp_flux, rp_obs_time, rp_flux
      FROM (SELECT
         main_id, source_id, g.ra, g.dec
         FROM
        gaia.dr2light as g
         JOIN TAP_UPLOAD.t1 AS tc
         ON (0.002>DISTANCE(tc.ra, tc.dec, g.ra, g.dec))
      OFFSET 0) AS b
      JOIN gaia.dr2epochflux
      USING (source_id)
      """,
      uploads={"t1": simjob.result_uri})
  finally:
    simjob.delete()

  # Now plot one after the other
  for row in time_series.to_table():
    plt.plot(row["bp_obs_time"], row["bp_flux"])
    plt.plot(row["rp_obs_time"], row["rp_flux"])
    plt.show(block=False)
    input("{}; press return for next...".format(row["main_id"]))
    plt.cla()

if __name__=="__main__":
  main()

If you bother to read the code, you’ll notice that we transfer the Simbad result directly to the GAVO data center without first downloading it. That’s fairly boring in this case, where the table is small. But if you have a narrow pipe for one reason or another and some 10^5 rows, passing around async result URLs is a useful trick.

In this particular case the whole thing returns just four stars, so perhaps that’s not a terribly useful target for your learning machine. But this piece of code should get you started to where there’s more data.

You should read the column descriptions and footnotes in the query results (or from the reference URL) – this tells you how to interpret the times and how to make magnitudes from the fluxes if you must. You probably can’t hear it any more, but just in case: If you can, process fluxes rather than magnitudes from Gaia, because the errors are painful to interpret in magnitudes when the fluxes are small (try it!).
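To illustrate why the errors are painful in magnitudes, here is a small sketch converting a flux with a symmetric error into a magnitude with its two (unequal) error bars. The zero point is a placeholder parameter; take the real one from the column footnotes. The function name is likewise just made up for this example.

```python
import math

def flux_to_mag(flux, flux_error, zero_point):
    """Return (mag, error towards fainter, error towards brighter)."""
    mag = zero_point - 2.5 * math.log10(flux)
    brighter = 2.5 * math.log10((flux + flux_error) / flux)
    if flux_error < flux:
        fainter = -2.5 * math.log10((flux - flux_error) / flux)
    else:
        fainter = math.inf  # the flux is consistent with zero
    return mag, fainter, brighter

# zero_point=25.0 is an arbitrary value chosen for the illustration
print(flux_to_mag(1000.0, 10.0, 25.0))  # high S/N: nearly symmetric errors
print(flux_to_mag(100.0, 50.0, 25.0))   # low S/N: clearly asymmetric
```

With a signal-to-noise ratio of 2, the error bar towards fainter magnitudes is about 0.75 mag while the one towards brighter magnitudes is only about 0.44 mag, and once the error reaches the flux itself the faint side is unbounded. In flux space, none of that happens.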

Note how the photometry data is stored in arrays in the database, and that VOTables can just transport these. The bad news is that support for manipulating arrays in ADQL is pretty much zero at this point; this means that, when you have trained your ML device, you’ll probably have to still download lots and lots of light curves rather than write some elegant ADQL to do the filtering server-side. However, I’d be highly interested to work out how some tastefully chosen user defined functions might enable offloading at least a good deal of that analysis to the database. So – if you know what you’d like to do, by all means let me know. Perhaps there’s something I can do for you.

Incidentally, I’ll talk a bit more about ADQL arrays in a blog post coming up in a few weeks (I think). Don’t miss it; subscribe to our feed.


In the results from queries involving gaia.dr2epochflux, we also provide datalinks. These let you retrieve lightcurves that already have mags and that are more easily plotted. Perhaps more importantly, they link back to the full ESAC lightcurves that, in addition, give you a lot more debug information and are required if you want to reliably identify photometry points with the identifiers of the transits that generated them.

Datalink support in clients still is not great, but it’s growing nicely. Your ideas for workflows that should be supported are (again) most welcome – and have a good chance of being adopted. So, try things out, for instance by getting the most recent TOPCAT (as of this writing) and doing the following:

  1. Open the VO/TAP dialog from the menu bar and double click the GAVO DC TAP service.
  2. Enter
    SELECT source_id, ra, dec,
    phot_bp_mean_mag, phot_rp_mean_mag, phot_g_mean_mag,
    g_transit_time, g_transit_flux,
    rp_obs_time, rp_flux
    FROM gaia.dr2epochflux 
    JOIN gaia.dr2light
    USING (source_id)
    WHERE parallax>50

    into “ADQL” text to retrieve lightcurves for the more nearby variables (in reality, you’d have to be a bit more careful with the distances, but you already knew that).

  3. plot something like phot_bp_mean_mag-phot_rp_mean_mag vs. phot_g_mean_mag (and adapt the plot to fit your viewing habits).
  4. Open the dialog for Views/Activation Actions (from the menu bar or the tool bar – same thing), check “Invoke Service”, choose “View Datalink Table”.
  5. Whenever you click on a point in your CMD, a window will pop up in which you can choose between the time series in the various bands, and you can pull in the data from ESAC; to load a table, select “Load Table” from the actions near the foot of the datalink table and click “Invoke”.

Yeah. It’s clunky. Help us make it better with your fresh ideas for interfaces (and don’t be cross with us if we have to marry them with what’s technically feasible and readily generalised).

SSAP and Obscore

If you’re fed up with bleeding-edge tech, the light curves are also available through good old SSAP and Obscore. To use that, just get Splat (or another SSA client, preferably with a bit of time series support). Look for a Gaia DR2 time series service (you may have to update the service list before you find it), enter (in keeping with our LBV theme) S Dor as position and hit “Lookup” followed by “Send Query”. Just click on any result to just view the time series – and then apply Splat’s rich tool set to it.

Update (8.5.2018): Clusters

Here’s another quick application – how about looking for variable stars in clusters? This piece of ADQL should get you started:

SELECT
  source_id, ra, dec, parallax, g.pmra, g.pmdec,
  m.pmra AS c_pmra, m.pmde AS c_pmde,
  m.e_pm AS c_e_pm,
  1/dist AS cluster_parallax
FROM
  gaia.dr2epochflux
  JOIN gaia.dr2light AS g USING (source_id)
  JOIN mwsc.main AS m
  ON (1=CONTAINS(
    POINT(g.ra, g.dec),
    CIRCLE(m.raj2000, m.dej2000, rcluster)))
WHERE IN_UNIT(pmdec, 'deg/yr') BETWEEN m.pmde-m.e_pm*3 AND m.pmde+m.e_pm*3

– yes, you’ll want to constrain pmra, too, and the distance, and properly deal with error and all. But you get simple lightcurves for free. Just add them in the SELECT clause!
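As a starting point for a somewhat more complete cut on the client side, here is a hedged sketch of a proper motion test on both axes. It assumes the star’s proper motions come in mas/yr (as in Gaia DR2) and the cluster’s in deg/yr, which is what the IN_UNIT call in the query suggests; the function name is invented, and the errors on the Gaia side are ignored just as in the query.

```python
MAS_PER_DEG = 3.6e6  # 3600 arcsec per degree, 1000 mas per arcsec

def pm_compatible(pmra, pmdec, c_pmra, c_pmde, c_e_pm, n_sigma=3):
    """True if a star's proper motion is within n_sigma of the cluster's.

    pmra, pmdec: star proper motion in mas/yr
    c_pmra, c_pmde, c_e_pm: cluster proper motion and its error,
    assumed to be in deg/yr
    """
    star = (pmra / MAS_PER_DEG, pmdec / MAS_PER_DEG)
    # Like BETWEEN in the query, the limits are inclusive
    return all(abs(s - c) <= n_sigma * c_e_pm
               for s, c in zip(star, (c_pmra, c_pmde)))

# A made-up cluster moving with (-3.6, 7.2) mas/yr, error 0.36 mas/yr
print(pm_compatible(-3.5, 7.0, -1e-6, 2e-6, 1e-7))
print(pm_compatible(0.0, 7.0, -1e-6, 2e-6, 1e-7))
```

The first star is within the 3-sigma window on both axes, the second is far off in right ascension and would pass the declination-only cut of the query above.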