Histograms and Hidden Open Clusters

[image: reddish pattern]
Colour-coded histograms for distances of stars in the direction of some NGC open clusters — one cluster per line, so you’re looking a a couple of Gigabytes of data here. If you want this a bit more precise: Read the article and generate your own image.

I have spent a bit of time last week polishing up what will (hopefully) be the definitive source of common ADQL User Defined Functions (UDFs) for IVOA review. What’s a UDF, you ask? Well, it is an extension to ADQL where service operators can invent new functionality. If you have been following this blog for a while, you will probably remember the ivo_healpix_index function from our dereddening exercise (and some earlier postings): That was an UDF, too.

This polishing work reminded me of a UDF I’ve wanted to blog about for a quite a while, available in DaCHS (and thus on our Heidelberg Data Center) since mid-2018: gavo_histogram. This, I claim, is a powerful tool for analyses over large amounts of data with rather moderate local means.

For instance, consider this classic paper on the nature of NGC 2451: What if you were to look for more cases like this, i.e., (indulging in a bit of poetic liberty) open clusters hidden “behind” other open clusters?

Somewhat more technically this would mean figuring out whether there are “interesting” patterns in the distance and proper motion histograms towards known open clusters. Now, retrieving the dozens of millions of stars that, say, Gaia, has in the direction of open clusters to just build histograms – making each row count for a lot less than one bit – simply is wasteful. This kind of counting and summing is much better done server-side.

On the other hand, SQL’s usual histogram maker, GROUP BY, is a bit unwieldy here, because you have lots of clusters, and you will not see anything if you munge all the histograms together. You could, of course, create a bin index from the distance and then group by this bin and the object name, somewhat like ...ROUND(r_est/20) as bin GROUP by name, bin – but that takes quite a bit of mangling before it can conveniently be used, in particular when you take independent distributions over multiple variables (“naive Bayesian”; but then it’s the way to go if you want to capture dependencies between the variables).

So, gavo_histogram to the rescue. Here’s what the server-provided documentation has to say (if you use TOPCAT, you will find this in the ”Service” tab in the TAP windows’ ”Use Service” tab):

gavo_histogram(val REAL, lower REAL, upper REAL, nbins INTEGER) -> INTEGER[]

The aggregate function returns a histogram of val with nbins+2 elements. Assuming 0-based arrays, result[0] contains the number of underflows (i.e., val<lower), result[nbins+1] the number of overflows. Elements 1..nbins are the counts in nbins bins of width (upper-lower)/nbins. Clients will have to convert back to physical units using some external communication, there currently is no (meta-) data as to what lower and upper was in the TAP response.

This may sound a bit complicated, but the gist really is: type gavo_histogram(r_est, 0, 2000, 20) as hist, and you will get back an array with 20 bins, roughly 0..100, 100..200, and so on, and two extra bins for under- and overflows.

Let’s try this for our open cluster example. The obvious starting point is selecting the candidate clusters; we are only interested in famous clusters, so we take them from the NGC (if that’s too boring for you: with TAP uploads you could take the clusters from Simbad, too), which conveniently sits in my data center as openngc.data:

select name, raj2000, dej2000, maj_ax_deg
from openngc.data
where obj_type='OCl'

Then, we need to add the stars in their rough directions. That’s a classic crossmatch, and of course these days we use Gaia as the star catalogue:

  select name, source_id
  from openngc.data 
  join gaia.dr2light
  on (
      circle(raj2000, dej2000, maj_ax_deg)))
  where obj_type='OCl')

This is now a table of cluster names and Gaia source ids of the candidate stars. To add distances, you could fiddle around with Gaia parallaxes, but because there is a 1/x involved deriving distances, the error model is complicated, and it is much easier and safer to adopt Bailer-Jones et al’s pre-computed distances and join them in through source_id.

And that distance estimation, r_est, is exactly what we want to take our histograms over – which means we have to group by name and use gavo_histogram as an aggregate function:

with ocl as (
  select name, raj2000, dej2000, maj_ax_deg, source_id
  from openngc.data 
  join gaia.dr2light
  on (
      circle(raj2000, dej2000, maj_ax_deg)))
  where obj_type='OCl')

  gavo_histogram(r_est, 0, 4000, 200) as hist
  join ocl
  using (source_id)
where r_est!='NaN'
group by name

That’s it! This query will give you (admittedly somewhat raw, since we’re ignoring the confidence intervals) histograms of the distances of stars in the direction of all NGC open clusters. Of course, it will run a while, as many millions of stars are processed, but TAP async mode easily takes care of that.

Oh, one odd thing is left to discuss (ignore this paragraph if you don’t know what I’m talking about): r_est!='NaN'. That’s not quite ADQL but happens to do the isnan of normal programming languages at least when the backend is Postgres: It is true if computations failed and there is an actual NaN in the column. This is uncommon in SQL databases, and normal NULLs wouldn’t hurt gavo_histogram. In our distance table, some NaNs slipped through, and they would poison our histograms. So, ADQL wizards probably should know that this is what you do for isnan, and that the usual isnan test val!=val doesn’t work in SQL (or at least not with Postgres).

So, fire up your TOPCAT and run this on the TAP server http://dc.g-vo.org/tap.

You will get a table with 618 (or so) histograms. At this point, TOPCAT can’t do a lot with them. So, let’s emigrate to pyVO and save this table in a file ocl.vot

My visualisation proposition would be: Let’s substract a “background” from the histograms (I’m using splines to model that background) and then plot them row by row; multi-peaked rows in the resulting image would be suspicious.

This is exactly what the programme below does, and the image for this article is a cutout of what the code produces. Set GALLERY = True to see how the histograms and background fits look like (hit ‘q’ to get to the next one).

In the resulting image, any two yellow dots in one line are at least suspicious; I’ve spotted a few, but they are so consipicuous that others must have noticed. Or have they? If you’d like to check a few of them out, feel free to let me know – I think I have a few ideas how to pull some VO tricks to see if these things are real – and if they’ve been spotted before.

So, here’s the yellow spot programme:

from astropy.table import Table
import matplotlib.pyplot as plt
import numpy
from scipy.interpolate import UnivariateSpline


def substract_background(arr):
    x = range(len(arr))
    mean = sum(arr)/len(arr)
    arr = arr/mean
    background = UnivariateSpline(x, arr, s=100)
    cleaned = arr-background(x)

    if GALLERY:
        plt.plot(x, arr)
        plt.plot(x, background(x))

    return cleaned

def main():
    tab = Table.read("ocl.vot")
    hist = numpy.array([substract_background(r["hist"][1:-1])
      for r in tab])
    plt.matshow(hist, cmap='gist_heat')

if __name__=="__main__":

Deredden using TAP

An animated color-magnitude diagram
Raw and dereddened CMD for a region in Cygnus.

Today I published a nice new service on our TAP service: The Bayestar17 3D dust map derived from Pan-STARRS 1 by Greg Green et al. I mention in passing that this was made particularly enjoyable because Greg and friends put an explicit license on their data (in this case, CC-BY-SA).

This dust map is probably a fascinating resource by itself, but the really nifty thing is that you can use it to correct all kinds of photometric data for extinction – at least to some extent. On the Bayestar web page, the authors give some examples for usage – and with our new service, you can use TAP as well to correct photometry for extinction.

To see how, first have a look at the table metadata for the prdust.map_union table; this is what casual users probably should look at. More specifically, at the coverage, best_fit, and grdiagnostic columns.

coverage here is an interval of 10-healpixes. It has to be an interval because the orginal data comes on wildly different levels; depending on the density of stars, sometimes it takes the area of a 6-healpix (about a square degree) to get enough signal, whereas in the galactic plane a 10-healpix (a thousandth of a square degree) already has enough stars. To make the whole thing conveniently queriable without exploding a 6-healpix row into 1000 identical rows, larger healpixes translate into intervals of 10-helpixes. Don’t panic, though, I’ll show how to conveniently query this below.

best_fit and grdiagnostic are arrays (remember the light cuves in Gaia DR2?). In bins of 0.5 in distance modulus (which is, in case you feel a bit uncertain as to the algebraic signs, 5 log10(dist)-5 for a distance in parsec), starting with a distance modulus of 4 and ending with 19. This means that for a distance modulus of 4.2 you should check the array index 0, whereas 4.3 already would be covered by array index 1. With this, best_fit[ind] gives E(B-V) = (B-V) – (B-V)0 in the direction of coverage in a distance modulus bin of 2*ind+4. For each best_fit[ind], grdiagnostic[ind] contains a quality measure for that value. You probably shouldn’t touch the E(B-V) if that measure is larger than 1.2.

So, how does one use this?

To try things, let’s pull some Gaia data with distances; in order to have interesting extinctions, I’m using a patch in Cygnus (RA 288.5, Dec 2.3). If you live on the northern hemisphere and step out tonight, you could see dust clouds there with the naked eye (provided electricity fails all around, that is). Full disclosure: I tried the Coal Sack first but after checking the coverage of the dataset – which essentially is the sky north of -30 degrees – I noticed that wouldn’t fly. But stories like these are one reason why I’m making such a fuss about having standard STC coverage representations.

We want distances, and to dodge all the intricacies involved when naively turning parallaxes to distances discussed at length
in a paper by Xavier Luri et al (and elsewhere), I’m using precomputed distances from Bailer-Jones et al. (2018AJ….156…58B); you’ll find them on the “ARI Gaia” service; in TOPCAT’s TAP dialog simply search for “Gaia” – that’ll give you the GAVO DC TAP search, too, and that we’ll need in a second.

The pre-computed distances are in the gaiadr2_complements.geometric_distance table, which can be joined to the main Gaia object catalog using the source_id column. So, here’s a query to produce a little photometric catalog around our spot in Cygnus (we’re discarding objects with excessive parallax errors while we’re at it):

r_est, 5*log10(r_est)-5 as dist_mod,
phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
ra, dec
JOIN gaiadr2_complements.geometric_distance
USING (source_id)
AND 1=CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 288.5, 2.3, 0.5 ))

The color-magnitude diagram resulting from this is the red point cloud in the animated GIF at the top. To reproduce it, just plot phot_bp_mean_mag-phot_rp_mean_mag against phot_g_mean_mag-dist_mod (and invert the y axis).

De-reddening this needs a few minor technicalities. The most important one is how to match against the odd intervals of healpixes in the prdust.map_union table. A secondary one is that we have only pulled equatorial coordinates, and the healpixes in prdust are in galactic coordinates.

Computing the healpix requires the ivo_healpix_index ADQL user defined function (UDF) that you may have met before, and since we have to go from ICRS to Galactic it requires a fairly new UDF I’ve recently defined to finally get the discussion on having a “standard library” of astrometric functions in ADQL going: gavo_transform. Here’s how to get a 10-healpix as required for map_union from ra and dec:

  gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER)

The CAST call is a pure technicality – ivo_healpix_index returns a 64-bit integer, which I can’t use in my interval logic.

The comparison against the intervals you could do yourself, but as argued in Registry-STC article this is one of the trivial things that are easy to get wrong. So, let’s use the ivo_interval_overlaps UDF; it goes in the join condition to properly match prdust healpixes to catalog positions. Then our total query – that, I hope, should be reasonably easy to adapt to similar problems – is:

WITH sources AS (
  SELECT phot_g_mean_mag, 
      gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER) AS hpx,
    ROUND((dist_mod-4)*2)+1 AS dist_mod_bin

  phot_bp_mean_mag-phot_rp_mean_mag-dust.best_fit[dist_mod_bin] AS color,
    dust.best_fit[dist_mod_bin]*3.384 AS abs_mag,
  dust.grdiagnostic[dist_mod_bin] as qual
FROM sources
JOIN prdust.map_union AS dust
ON (1=ivo_interval_has(hpx, coverage))

(If you’re following along: you have to switch to the GAVO DC TAP to run this, and you will probably have to change the index after TAP_UPLOAD).

Ok, in the photometry department there’s a bit of cheating going on here – I’m correcting Gaia B-R with B-V, and I’m using the factor for Johnson V to estimate the extinction in Gaia G (if you’re curious where that comes from: See the footnote on best_fit and the MC extinction service docs should get you started), so this is far from physically correct. But, as you can see from the green cloud in the plot above, it already helps a bit. And if you find out better factors, by all means let me know so I can add an update… right here:

Update (2018-09-11): The original data creator, Gregory Green points out that the thing with having a better factor for Gaia G isn’t that simple, because, as he says “Gaia G is very broad, [and] the extinction coefficients are much more dependent on stellar type, and extinction is also more nonlinear with dust column (extinction is only linear with dust column and independent of stellar type for an infinitely narrow passband)”. So – when de-reddening, prefer narrow passbands. But whether narrow or wide: TAP helps you.