Posts with the Tag Gaia:

  • Histograms and Hidden Open Clusters

    image: reddish pattern

    Colour-coded histograms for distances of stars in the direction of some NGC open clusters -- one cluster per line, so you're looking a a couple of Gigabytes of data here. If you want this a bit more precise: Read the article and generate your own image.

    I have spent a bit of time last week polishing up what will (hopefully) be the definitive source of common ADQL User Defined Functions (UDFs) for IVOA review. What's a UDF, you ask? Well, it is an extension to ADQL where service operators can invent new functionality. If you have been following this blog for a while, you will probably remember the ivo_healpix_index function from our dereddening exercise (and some earlier postings): That was an UDF, too.

    This polishing work reminded me of a UDF I've wanted to blog about for a quite a while, available in DaCHS (and thus on our Heidelberg Data Center) since mid-2018: gavo_histogram. This, I claim, is a powerful tool for analyses over large amounts of data with rather moderate local means.

    For instance, consider this classic paper on the nature of NGC 2451: What if you were to look for more cases like this, i.e., (indulging in a bit of poetic liberty) open clusters hidden “behind” other open clusters?

    Somewhat more technically this would mean figuring out whether there are “interesting” patterns in the distance and proper motion histograms towards known open clusters. Now, retrieving the dozens of millions of stars that, say, Gaia, has in the direction of open clusters to just build histograms – making each row count for a lot less than one bit – simply is wasteful. This kind of counting and summing is much better done server-side.

    On the other hand, SQL's usual histogram maker, GROUP BY, is a bit unwieldy here, because you have lots of clusters, and you will not see anything if you munge all the histograms together. You could, of course, create a bin index from the distance and then group by this bin and the object name, somewhat like ...ROUND(r_est/20) as bin GROUP by name, bin – but that takes quite a bit of mangling before it can conveniently be used, in particular when you take independent distributions over multiple variables (“naive Bayesian”; but then it's the way to go if you want to capture dependencies between the variables).

    So, gavo_histogram to the rescue. Here's what the server-provided documentation has to say (if you use TOPCAT, you will find this in the ”Service” tab in the TAP windows' ”Use Service” tab):

    gavo_histogram(val REAL, lower REAL, upper REAL, nbins INTEGER) -> INTEGER[]
    
    The aggregate function returns a histogram of val with
    nbins+2 elements. Assuming 0-based arrays, result[0] contains
    the number of underflows (i.e., val<lower), result[nbins+1]
    the number of overflows. Elements 1..nbins are the counts in
    nbins bins of width (upper-lower)/nbins. Clients will have to
    convert back to physical units using some external communication,
    there currently is no (meta-) data as to what lower and upper was in
    the TAP response.
    

    This may sound a bit complicated, but the gist really is: type gavo_histogram(r_est, 0, 2000, 20) as hist, and you will get back an array with 20 bins, roughly 0..100, 100..200, and so on, and two extra bins for under- and overflows.

    Let's try this for our open cluster example. The obvious starting point is selecting the candidate clusters; we are only interested in famous clusters, so we take them from the NGC (if that's too boring for you: with TAP uploads you could take the clusters from Simbad, too), which conveniently sits in my data center as openngc.data:

    select name, raj2000, dej2000, maj_ax_deg
    from openngc.data
    where obj_type='OCl'
    

    Then, we need to add the stars in their rough directions. That's a classic crossmatch, and of course these days we use Gaia as the star catalogue:

    select name, source_id
    from openngc.data
    join gaia.dr2light
    on (
      1=contains(
        point(ra,dec),
        circle(raj2000, dej2000, maj_ax_deg)))
    where obj_type='OCl')
    

    This is now a table of cluster names and Gaia source ids of the candidate stars. To add distances, you could fiddle around with Gaia parallaxes, but because there is a 1/x involved deriving distances, the error model is complicated, and it is much easier and safer to adopt Bailer-Jones et al's pre-computed distances and join them in through source_id.

    And that distance estimation, r_est, is exactly what we want to take our histograms over – which means we have to group by name and use gavo_histogram as an aggregate function:

    with ocl as (
      select name, raj2000, dej2000, maj_ax_deg, source_id
      from openngc.data
      join gaia.dr2light
      on (
        1=contains(
          point(ra,dec),
          circle(raj2000, dej2000, maj_ax_deg)))
      where obj_type='OCl')
    
    select
      name,
      gavo_histogram(r_est, 0, 4000, 200) as hist
    from
      gdr2dist.main
      join ocl
      using (source_id)
    where r_est!='NaN'
    group by name
    

    That's it! This query will give you (admittedly somewhat raw, since we're ignoring the confidence intervals) histograms of the distances of stars in the direction of all NGC open clusters. Of course, it will run a while, as many millions of stars are processed, but TAP async mode easily takes care of that.

    Oh, one odd thing is left to discuss (ignore this paragraph if you don't know what I'm talking about): r_est!='NaN'. That's not quite ADQL but happens to do the isnan of normal programming languages at least when the backend is Postgres: It is true if computations failed and there is an actual NaN in the column. This is uncommon in SQL databases, and normal NULLs wouldn't hurt gavo_histogram. In our distance table, some NaNs slipped through, and they would poison our histograms. So, ADQL wizards probably should know that this is what you do for isnan, and that the usual isnan test val!=val doesn't work in SQL (or at least not with Postgres).

    So, fire up your TOPCAT and run this on the TAP server http://dc.g-vo.org/tap.

    You will get a table with 618 (or so) histograms. At this point, TOPCAT can't do a lot with them. So, let's emigrate to pyVO and save this table in a file ocl.vot

    My visualisation proposition would be: Let's substract a “background” from the histograms (I'm using splines to model that background) and then plot them row by row; multi-peaked rows in the resulting image would be suspicious.

    This is exactly what the programme below does, and the image for this article is a cutout of what the code produces. Set GALLERY = True to see how the histograms and background fits look like (hit 'q' to get to the next one).

    In the resulting image, any two yellow dots in one line are at least suspicious; I've spotted a few, but they are so consipicuous that others must have noticed. Or have they? If you'd like to check a few of them out, feel free to let me know – I think I have a few ideas how to pull some VO tricks to see if these things are real – and if they've been spotted before.

    So, here's the yellow spot programme:

    from astropy.table import Table
    import matplotlib.pyplot as plt
    import numpy
    from scipy.interpolate import UnivariateSpline
    
    GALLERY = False
    
    def substract_background(arr):
        x = range(len(arr))
        mean = sum(arr)/len(arr)
        arr = arr/mean
        background = UnivariateSpline(x, arr, s=100)
        cleaned = arr-background(x)
    
        if GALLERY:
            plt.plot(x, arr)
            plt.plot(x, background(x))
            plt.show()
    
        return cleaned
    
    
    def main():
        tab = Table.read("ocl.vot")
        hist = numpy.array([substract_background(r["hist"][1:-1])
          for r in tab])
        plt.matshow(hist, cmap='gist_heat')
        plt.show()
    
    
    if __name__=="__main__":
        main()
    
  • Gaia DR2: A light version and light curves

    screenshot: topcat and matplotlib

    Topcat is doing datalink, and our little python script has plotted a two-color time series of RMC 18 (or so I think).

    If anyone ever writes a history of the VO, the second data release of Gaia on April 25, 2018 will probably mark its coming-of-age – at least if you, like me, consider the Registry the central element of the VO. It was spectacular to view the spike of tens of Registry queries per second right around 12:00 CEST, the moment the various TAP services handing out the data made it public (with great aplomb, of course).

    In GAVO's Data Center we also carry Gaia DR2 data. Our host institute, the Zentrum für Astronomie in Heidelberg, also has a dedicated Gaia server. This gives relieves us from having to be a true mirror of the upstream data release. And since the source catalog has lots and lots of columns that most users will not be using most of the time, we figured a “light” version of the source catalog might fill an interesting ecological niche: Behold gaia.dr2light on the GAVO DC TAP service, containing essentially just the basic astrometric parameters and the diagonal of the covariance matrix.

    That has two advantages: Result sets with SELECT * are a lot less unwieldy (but: just don't do this with Gaia DR2), and, more importantly, a lighter table puts less load on the server. You see, conventional databases read entire rows when processing data, and having just 30% of the columns means we will be 3 times faster on I/O-bound tasks (assuming the same hardware, of course). Hence, and contrary to several other DR2-carrying sites, you can perform full sequential scans before timing out on our TAP service on gaia.dr2light. If, on the other hand, you need to do debugging or full-covariance-matrix error calculations: The full DR2 gaia_source table is available in many places in the VO. Just use the Registry.

    Photometry via TAP

    A piece of Gaia DR2 that's not available in this form anywhere else is the lightcurves; that's per-transit photometry in the G, BP, and RP band for about 0.5 million objects that the reduction system classified as variable. ESAC publishes these through datalink from within their gaia_source table, and what you get back is a VOTable that has the photometry in the three bands interleaved.

    I figured it might be useful if that data were available in a TAP-queriable table with lightcurves in the database. And that's how gaia.dr2epochflux came into being. In there, you have three triples of arrays: the epochs (g_transit_time, bp_obs_time, and rp_obs_time), the fluxes (g_transit_flux, bp_flux, and rp_flux), and their errors (you can probably guess their names). So, to retrieve G lightcurves where available together with a gaia_source query of your liking, you could write something like:

    SELECT g.*, g_transit_time, g_transit_flux
    FROM gaia.dr2light AS g
    LEFT OUTER JOIN gaia.dr2epochflux
    USING (source_id)
    WHERE ...whatever...
    

    – the LEFT OUTER JOIN arranges things such that the g_transit_time and g_transit_flux columns simply are NULL when there are no lightcurves; with a normal (“inner”) join, rows without lightcurves would not be returned in such a query.

    To give you an idea of what you can do with this, suppose you would like to discover new variable blue supergiants in the Gaia data (who knows – you might discover the precursor of the next nearby supernova!). You could start with establishing color cuts and train your favourite machine learning device on light curves of variable blue supergiants. Here's how to get (and, for simplicity, plot) time series of stars classified as blue supergiants by Simbad for which Gaia DR2 lightcurves are available, using pyvo and a little async trick:

    from matplotlib import pyplot as plt
    import pyvo
    
    def main():
      simbad = pyvo.dal.TAPService(
        "http://simbad.u-strasbg.fr:80/simbad/sim-tap")
      gavodc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    
      # Get blue supergiants from Simbad
      simjob = simbad.submit_job("""
        select main_id, ra, dec
        from basic
        where otype='BlueSG*'""")
      simjob.run()
    
      # Get lightcurves from Gaia
      try:
        simjob.wait()
        time_series = gavodc.run_sync("""
          SELECT b.*, bp_obs_time, bp_flux, rp_obs_time, rp_flux
          FROM (SELECT
             main_id, source_id, g.ra, g.dec
             FROM
            gaia.dr2light as g
             JOIN TAP_UPLOAD.t1 AS tc
             ON (0.002>DISTANCE(tc.ra, tc.dec, g.ra, g.dec))
          OFFSET 0) AS b
          JOIN gaia.dr2epochflux
          USING (source_id)
          """,
          uploads={"t1": simjob.result_uri})
      finally:
        simjob.delete()
    
      # Now plot one after the other
      for row in time_series.table:
        plt.plot(row["bp_obs_time"], row["bp_flux"])
        plt.plot(row["rp_obs_time"], row["rp_flux"])
        plt.show(block=False)
        raw_input("{}; press return for next...".format(row["main_id"]))
        plt.cla()
    
    if __name__=="__main__":
      main()
    

    If you bother to read the code, you'll notice that we transfer the Simbad result directly to the GAVO data center without first downloading it. That's fairly boring in this case, where the table is small. But if you have a narrow pipe for one reason or another and some 105 rows, passing around async result URLs is a useful trick.

    In this particular case the whole thing returns just four stars, so perhaps that's not a terribly useful target for your learning machine. But this piece of code should get you started to where there's more data.

    You should read the column descriptions and footnotes in the query results (or from the reference URL) – this tells you how to interpret the times and how to make magnitudes from the fluxes if you must. You probably can't hear it any more, but just in case: If you can, process fluxes rather than magnitudes from Gaia, because the errors are painful to interpret in magnitudes when the fluxes are small (try it!).

    Note how the photometry data is stored in arrays in the database, and that VOTables can just transport these. The bad news is that support for manipulating arrays in ADQL is pretty much zero at this point; this means that, when you have trained your ML device, you'll probably have to still download lots and lots of light curves rather than write some elegant ADQL to do the filtering server-side. However, I'd be highly interested to work out how some tastefully chosen user defined functions might enable offloading at least a good deal of that analysis to the database. So – if you know what you'd like to do, by all means let me know. Perhaps there's something I can do for you.

    Incidentally, I'll talk a bit more about ADQL arrays in a blog post coming up in a few weeks (I think). Don't miss it, subscribe to our feed).

    SSAP and Obscore

    If you're fed up with bleeding-edge tech, the light curves are also available through good old SSAP and Obscore. To use that, just get Splat (or another SSA client, preferably with a bit of time series support). Look for a Gaia DR2 time series service (you may have to update the service list before you find it), enter (in keeping with our LBV theme) S Dor as position and hit “Lookup” followed by “Send Query”. Just click on any result to just view the time series – and then apply Splat's rich tool set to it.

    Update (8.5.2018): Clusters

    Here's another quick application – how about looking for variable stars in clusters? This piece of ADQL should get you started:

    SELECT TOP 100
      source_id, ra, dec, parallax, g.pmra, g.pmdec,
      m.name, m.pmra AS c_pmra, m.pmde AS c_pmde,
      m.e_pm AS c_e_pm,
      1/dist AS cluster_parallax
    FROM
      gaia.dr2epochflux
      JOIN gaia.dr2light AS g USING (source_id)
      JOIN mwsc.main AS m
      ON (1=CONTAINS(
        POINT(g.ra, g.dec),
        CIRCLE(m.raj2000, m.dej2000, rcluster)))
    WHERE IN_UNIT(pmdec, 'deg/yr') BETWEEN m.pmde-m.e_pm*3 AND m.pmde+m.e_pm*3
    

    – yes, you'll want to constrain pmra, too, and the distance, and properly deal with error and all. But you get simple lightcurves for free. Just add them in the SELECT clause!

  • PPMXL+Gaia DR1=HSOY in the Heidelberg Data Center

    The stunning precision of Gaia's astrometry is already apparent in the first release of the data obtained by the satellite, available since last September. However, apart from the small TGAS subset (objects already observed by the 90ies HIPPARCOS astrometry satellite) there is no information on the objects' proper motions in DR1.

    Until Gaia-quality proper motions will become available in DR2, the HSOY catalog – described in Altmann et al's paper Hot Stuff for One Year (HSOY) freshly up in arXiv and online at http://dc.g-vo.org/hsoy – can help if you can live with somewhat lesser-quality kinematics.

    It derives proper motions for roughly half a billion stars from PPMXL and Gaia DR1, which already gives an unprecedented source for 4D astrometry around J2015. And you can start working with it right now. The catalog is available in GAVO's Heidelberg data center (TAP access URL: http://dc.g-vo.org/tap; there's also an SCS service). Fire up your favourite TAP or SCS client (our preference: TOPCAT) and search for HSOY.

    An all-sky heatmap showing much larger errors south of -30 deg.

    Oh, and in case you're new to the whole TAP/ADQL game: There's our ADQL introduction, and if you're at a German astronomical institution, we'd be happy to hold one of our VO Days at your institute – just drop us a mail.

« Page 2 / 2