Articles from Data

  • LAMOST5 meets Datalink

    One of the busiest spectral survey instruments operated right now is the Large Sky Area Multi-Object Fiber Spectrograph Telescope (LAMOST). And its data in the VO, more or less: DR2 and DR3 have been brought into the VO by our Czech colleagues, but since they currently lack resources to update their services to the latest releases, they have kindly given me their DaCHS resource descriptor, and so I had a head start for publishing DR5 in Heidelberg.

    With some minor updates, here it is now: Over nine million medium-resolution spectra covering large parts of the northen sky – the spatial coverage is like this:

    Coverage Healpix map

    There's lots of fun to be had with this; of course, there's an SSA service, so when you point Aladin or Splat at some part of the covered sky and look for spectra, chances are you'll see LAMOST spectra, and when working on some of our tutorials (this one, for example), it happened that LAMOST actually had what I was looking for when writing them.

    But I'd like to use the opportunity to mention two other modes of accessing the data.

    Stacked spectra

    Tablesample and TOPCAT's Plot Table activation action

    Say you'd like to look at spectra of M stars and would like to have some sample from across the sky, fire up TOPCAT, point its TAP client the GAVO DC TAP service (http://dc.g-vo.org/tap) and run something like:

    select
      ssa_pubDID, accref, raj2000, dej2000, ssa_targsubclass
    from lamost5.data tablesample(1)
    where
      ssa_targsubclass like 'M%'
    

    This is using the TABLESAMPLE modifier in the from clause, which isn't standard ADQL yet. As mentioned in the DaCHS 1.4 announcement, DaCHS has a prototype implementation of what's been discussed on the IVOA's DAL mailing list: pick a part of a table rather than the full one. It takes a percentage as an argument, and tells the server to choose about this percentage of the table's records using a reasonable and fast heuristic. Note that this won't give you perfect statistical sampling, but if it's not “good enough” for some purpose, I'd like to learn about that purpose.

    Drawing a proper statistical sample, on the other hand, would take minutes on the GAVO database server – with tablesample, I had the roughly 6000 spectra the above query returns essentially instantaneously, and from eyeballing a sky plot of them, I'd say their distribution is close enough to that of the full DR5. So: tablesample is your friend.

    For a quick look at the spectra themselves, in TOPCAT click Views/Activation Actions, check “Plot Table” and make sure TOPCAT proposes the accref column as “Table Location” (if you don't see these items, update your TOPCAT – it's worth it). Now click on a row or perhaps a dot on a plot and behold an M spectrum.

  • From Byurakan to L2: Short Spectra

    A snapshot from the DFBS tutorial: Carbon Stars in different spectral bands.

    A snapshot from the DFBS tutorial: Carbon Stars in different spectral bands.

    On June 30, a small project we've done together with the Armenian Virtual Observatory has ended. Its objective was to publish the spectra from the First Byurakan Survey (the DFBS) in a VO-compilant way. The data comes from one of the big surveys with Schmidt telescopes that form a sizable part of the observational heritage from the second part of the 20th century (you're still using a few of them daily if you tell Aladin to show a DSS plane).

    In this case, spectra from objects on the entire northern sky off the milky way down to about 18th mag were obtained. In a previous cooperation between Armenian and Italian astronomers a good decade ago, the plates were digitised and calibrated, and spectra were extracted. However, they resided behind a web interface so far, which made them somewhat clumsy to work with.

    Now, they're in the VO, and to give you a few ideas for what kind of things you can do with this kind of data, within the project we've also written the tutorial “Outlier Analysis in Low-Resolution Spectra”.

    Have a glance at the tutorial – you see, while the Byurakan survey certainly is a valuable resource by itself, I happen to believe at this point it's particularly valuable because with the next Gaia data release (planned for next year), a deluxe version of it will come: Gaia's RP/BP spectra will be all-sky, properly calibrated, and quite a bit deeper, but still low-resolution. So, if you're just waiting for such a data collection, you can train your methods right now on the DFBS.

  • APPLAUSE via Obscore

    A composite of two rather noisy photo plates

    Aladin showing some Bamberg Sky Patrol plates (see towards the end of the post for what this is and how I made it).

    At the Astroplate conference I blogged about recently, the people behind APPLAUSE gave a couple of talks about their Data Release 3. APPLAUSE is a fairly massive endeavour to make available data from some of the larger plate archives in Germany, and its DR3 even hit the non-Astronomy press last February.

    Already for previous APPLAUSE releases, I've wanted to bring this data (or rather, its metadata) to the VO, but it never quite happened, basically because there was always another little thing that turned out to be too tedious to work out via mail. However, working out things interactively is exactly what conferences are great for. So, the kind APPLAUSE folks (thanks, Taavi and Harry) and I used the Astroplate to map their database schema (“schema” is jargon for what boils down to the set of tables and columns with which they describe their data) to the much simpler (and, admittedly, less powerful) IVOA Obscore one.

    Sure, Obscore doesn't deal with multiple exposures (like when the target field and the north pole were exposed on one plate to help precision photometry), object-guided images, and all the other interesting techniques that astronomers applied in the pre-digital age; it also doesn't usefully cope with multiple scans of the same plate (for instance, to correct for imprecisions in the mechanics of flatbed scanners). APPLAUSE, of course, has to cope with them, since there are many reasons to preserve data of this kind.

    Obscore, on the other hand, is geared towards uniform discovery, where too funky datasets in all likelihood cause more harm than good. So, when we mapped APPLAUSE to Obscore, of the 101138 scans of 70276 plates that the full APPLAUSE holds in DR3, only 44000 plate scans made it into the Obscore table. The advantage: whatever can be sensibly mapped to Obscore can now be queried together with all the other data in the world that others have published through Obscore.

    You can immediately see the effect when you run the little python program doing the global discovery we gave in our plates tutorial. Here's what it prints now (values from pre-APPLAUSE-in-Obscore are in square brackets):

    Column t_exptime: 3460 values
      Min   12, Max 15300, Mean 890.24  [previous mean: 370.722]
    ---
    Column em_mean: 3801 values
      Min 1.8081e-09, Max 9.3e-07, Mean 6.40804e-07 [No change: Sigh!]
    ---
    Column t_mean: 4731 values
      Min 12564.5, Max 58126.3, Mean 49897.9 [previous mean: 51909.1]
    ---
    Column instrument_name: 4747 values
      Matches from , Petzval, [Max Wolf's residence in
      Heidelberg, Maerzgasse, Wolf's Doppelastrograph,
      Heidelberg Koenigstuhl (24), Wolf's
      Doppelastrograph,] AG-Astrograph, [Zeiss Triplet
      15 cm Potsdam-Telegrafenberg], Zeiss Triplet,
      Astrograph (four 10-cm Tessar f/6 cameras),
      [3.5m APO, ROSAT PSPCC, Heidelberg Koenigstuhl
      (24), Bruce Astrograph, Calar Alto (493),
      Schmidt], Grosser Refraktor, [ROSAT HRI,
      DK-1.54], Hamburger Schmidt-Spiegel,
      [DFOSC_FASU], ESO 1-metre Schmidt telescope,
      Great Schmidt Camera, Lippert-Astrograph, Ross-B
      3", [AZT 22], Astrograph (six 10-cm Tessar f/6
      cameras), 1m-Spiegelteleskop, [ROSAT PSPCB],
      Astrograph (ten 10-cm Tessar f/6 cameras), Zeiss
      Objective
    ---
    Column access_url: 4747 values [4067]
    

    So – for the fields selected in the tutorial, there are 15% more images in the global Obscore image pool now than there were before APPLAUSE, and their mean observation date went a bit farther into the past. I've not made any statistics, but I suspect for many other fields the gain is going to be much higher. For a strong effect, try some random region covered by the Bamberg Sky Patrol on the southern sky.

    But you have probably noticed the deep sigh in the annotations to the statistics above: Yes, we don't have the spectral band for the APPLAUSE data, which is why the stats on em_min doesn't change. As a matter of fact, from the Obscore data you can't even guess whether a plate is “more red” or “rather blue”, as Obscore doesn't have an (agreed-upon) field for “qualititive bandpass indicator”.

    For some other data collections, we did map known emulsion/filter combinations to rough bandpasses (e.g., the Palomar-Leiden Trojan Survey, which only had a few of them). For APPLAUSE, there are 435 combinations of filter and emulsion (that's a VOTable link that you can paste into TOPCAT's load button in order to have a look at the table). Granted, quite a few of these pairs are (more or less) spurious because of inconsistent spelling. But we still gave up on researching the bandpasses even before we started.

    If you're a photographic plate buff: You could help us and posteriority a lot if you could go through this list and at least for some combinations tell us what, roughly, the lower and upper limits of the corresponding bandpasses might have been (what DaCHS already knows, plate-relevant data near the bottom of the file). As usual, send mail to gavo@ari.uni-heidelberg.de if you have anything to contribute.

    Finally, here's the brief explanation of the image for this article: Well, I wanted to find some Bamberg Sky Patrol images for a single field to play with. I knew they were primarily located in the South, and were made using Tessar cameras. So, I ran:

    SELECT t_min, access_url, s_region
    FROM ivoa.obscore
    WHERE instrument_name like '%Tessar%'
    AND 1=CONTAINS(POINT(345, -38), s_region)
    

    on GAVO's TAP service. Since Aladin 10, you can do that from within the program (although some versions will reject this query because they mistakenly believe the ADQL is bad. Query through TOPCAT and send the result over to Aladin if that bites you). Incidentally, when there are s_region values in Obscore tables, it's a good idea to use them as I do here, as it's quite a bit more likely that this query will use indices than some condition on s_ra and s_dec. But then not all services fill s_region properly, so for all-VO queries you will probably want to make do with s_ra and s_dec.

    From that result I first made the inset bar graph in the article image to show the temporal distribution of the Patrol plates. And then I grabbed two (rather randomly selected) plates and had Aladin produce a red-blue composite of them. Whatever is really red or really blue in that image may correspond to a transient event. Or, as certainly the case with that little hair (or whatever) that shines out in blue, it may not.

  • Find Outliers using ADQL and TAP

    Annie Cannon's notebook and a plot

    Two pages from Annie Cannon's notebooks[1], and a histogram of the basic BP-RP color distribution in the HD catalogue (blue) and the distribution of the outliers (red). For more of Annie Cannon's notebooks, search on ADS.

    The other day I gave one of my improvised live demos (“What, roughly, are you working on?”) and I ended up needing to translate identifiers from the Henry Draper Catalogue to modern positions. Quickly typing “Henry Draper” into TOPCAT's TAP search window didn't yield anything useful (some resources only using the HD, and a TAP service that didn't support uploads – hmpf).

    Now, had I tried the somewhat more thorough WIRR Registry interface, I'd have noted the HD catalogue at VizieR and in particular Fabricius' et al's HD-Tycho 2 match (explaining why they didn't show up in TOPCAT is a longer story; we're working on it). But alas, I didn't, and so I set out to produce a catalogue matching HD and Gaia DR2, easily findable from within TOPCAT's TAP client. Well, it's here in the form of the hdgaia.main table in our data center.

    Considering the nontrivial data discovery and some yak shaving I had to do to get from HD identifiers to Gaia DR2 ones, it was perhaps not as futile an exercise as I had thought now and then during the preparation of the thing. And it gives me the chance to show a nice ADQL technique to locate outliers.

    In this case, one might ask: Which objects might Annie Cannon and colleagues have misclassified? Or perhaps the objects have changed their spectrum between the time Cannon's photographic plates have been taken and Gaia observed them? Whatever it is: We'll have to figure out where there are unusual BP-RPs given the spectral type from HD.

    To figure this out, we'll first have to determine what's “usual”. If you've worked through our ADQL course, you know what to expect: grouping. So, to get a table of average colours by spectral type, you'd say (all queries executable on the TAP service at http://dc.g-vo.org/tap):

    select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      count(*) as ct
    from hdgaia.main
    join gaia.dr2light
    using (source_id)
    group by spectral
    

    – apart from the join that's needed here because we want to pull photometry from gaia, that's standard fare. And that join is the selling point of this catalog, so I won't apologise for using it already in the first query.

    The next question is how strict we want to be before we say something that doesn't have the expected colour is unusual. While these days you can rather easily use actual distributions, at least for an initial analysis just assuming a Gaussian and estimating its FWHM as the standard deviation works pretty well if your data isn't excessively nasty. Regrettably, there is no aggregate function STDDEV in ADQL (you could still ask for it: head over to the DAL mailing list before ADQL 2.1 is a done deal!). However, you may remember that Var(X)=E(X2)-E(X)2, that the average is an estimator for the expectation, and that the standard deviation is actually an estimator for the square root of the variance. And that these estimators will work like a charm if you're actually dealing with Gaussian data.

    So, let's use that to compute our standard deviations. While we are at it, throw out everything that's not a star[2], and ensure that our groups have enough members to make our estimates non-ridiculous; that last bit is done through a HAVING clause that essentially works like a WHERE, just for entire GROUPs:

    select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
        power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
      count(*) as ct
    from hdgaia.main
    join gaia.dr2light
      using (source_id)
    where m_v<18
    group by spectral
    having count(*)>10
    

    This may look a bit scary, but if you read it line by line, I'd argue it's no worse than our harmless first GROUP BY query.

    From here, the step to determine the outliers isn't big any more. What the query I've just written produces is a mapping from spectral type to the means and scales (“µ,σ” in the rotten jargon of astronomy) of the Gaussians for the colors of the stars having that spectral type. So, all we need to do is join that information by spectral type to the original table and then see which actual colors are further off than, say, three sigma. This is a nice application of the common table expressions I've tried to sell you in the post on ADQL 2.1; our determine-what's-usual query from above stays nicely separated from the (largely trivial) rest:

    with standards as (select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
        power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
      count(*) as ct
      from hdgaia.main
      join gaia.dr2light
      using (source_id)
      where m_v<18
      group by spectral
      having count(*)>10)
    select *
    from hdgaia.main
    join standards
    using (spectral)
    join gaia.dr2light using (source_id)
    where
      abs(phot_bp_mean_mag-phot_rp_mean_mag-col)>3*sig_col
      and m_v<18
    

    – and that's a fairly general pattern for doing an initial outlier analysis on the the remote side. For HD, this takes a few seconds and yields 2722 rows (at least until we also push HDE into the table). That means you can keep 99% of the rows (the boring ones) on the server and can just pull the ones that could be interesting. These 99% savings aren't terribly much with a catalogue like the HD that's small by today's standards. For large catalogs, it's the difference between a download of a couple of minutes and pulling data for a day while frantically freeing disk space.

    By the way, that there's only 2.7e3 outliers among 2.25e5 objects, while Annie Cannon, Williamina Fleming, Antonia Maury, Edward Pickering, and the rest of the crew not only had to come up with the spectral classification while working on the catalogue but also had to classify all these objects manually. This is an amazing feat even if all of those rows actually were misclassifications (which they certainly aren't) – the machine classifiers of today would be proud to only get 1% wrong.

    The inset in the facsimile of Annie Cannons notebooks above shows how the outliers are distributed in color space relative to the full catalogue, where the basic catalogue is in blue and the outliers (scaled by 70) in red. Wouldn't it make a nice little side project to figure out the reason for the outlier clump on the red side of the histogram?

    [1]The notebook pages are from a notebook Annie Cannon used in 1929. The material was kindly provided by Project PHAEDRA at the John G. Wolbach Library, Harvard College Observatory.
    [2]I'll not hide that I was severely tempted to undo the mapping of object classes to – for HD – unrealistic magnitudes (20 .. 50) but then left the HD as it came from ADC; I still doubt that decision was well taken, and sure enough, the example query above already has insane constraints on m_v reflecting that encoding. From today's position, of course there should have been an extra column or, better yet, a different catalogue for nonstellar objects. Ah well. It's always hard to break unhealty patterns.
  • Deredden using TAP

    An animated color-magnitude diagram

    Raw and dereddened CMD for a region in Cygnus.

    Today I published a nice new set of tables on our TAP service: The Bayestar17 3D dust map derived from Pan-STARRS 1 by Greg Green et al. I mention in passing that this was made particularly enjoyable because Greg and friends put an explicit license on their data (in this case, CC-BY-SA).

    This dust map is probably a fascinating resource by itself, but the really nifty thing is that you can use it to correct all kinds of photometric data for extinction – at least to some extent. On the Bayestar web page, the authors give some examples for usage – and with our new service, you can use TAP as well to correct photometry for extinction.

    To see how, first have a look at the table metadata for the prdust.map_union table; this is what casual users probably should look at. More specifically, at the coverage, best_fit, and grdiagnostic columns.

    coverage here is an interval of 10-healpixes. It has to be an interval because the orginal data comes on wildly different levels; depending on the density of stars, sometimes it takes the area of a 6-healpix (about a square degree) to get enough signal, whereas in the galactic plane a 10-healpix (a thousandth of a square degree) already has enough stars. To make the whole thing conveniently queriable without exploding a 6-healpix row into 1000 identical rows, larger healpixes translate into intervals of 10-helpixes. Don't panic, though, I'll show how to conveniently query this below.

    best_fit and grdiagnostic are arrays (remember the light cuves in Gaia DR2?). In bins of 0.5 in distance modulus (which is, in case you feel a bit uncertain as to the algebraic signs, 5 log10(dist)-5 for a distance in parsec), starting with a distance modulus of 4 and ending with 19. This means that for a distance modulus of 4.2 you should check the array index 0, whereas 4.3 already would be covered by array index 1. With this, best_fit[ind] gives E(B-V) = (B-V) - (B-V)0 in the direction of coverage in a distance modulus bin of 2*ind+4. For each best_fit[ind], grdiagnostic[ind] contains a quality measure for that value. You probably shouldn't touch the E(B-V) if that measure is larger than 1.2.

    So, how does one use this?

    To try things, let's pull some Gaia data with distances; in order to have interesting extinctions, I'm using a patch in Cygnus (RA 288.5, Dec 2.3). If you live on the northern hemisphere and step out tonight, you could see dust clouds there with the naked eye (provided electricity fails all around, that is). Full disclosure: I tried the Coal Sack first but after checking the coverage of the dataset – which essentially is the sky north of -30 degrees – I noticed that wouldn't fly. But stories like these are one reason why I'm making such a fuss about having standard STC coverage representations.

    We want distances, and to dodge all the intricacies involved when naively turning parallaxes to distances discussed at length in a paper by Xavier Luri et al (and elsewhere), I'm using precomputed distances from Bailer-Jones et al. (2018AJ....156...58B); you'll find them on the "ARI Gaia" service; in TOPCAT's TAP dialog simply search for “Gaia” – that'll give you the GAVO DC TAP search, too, and that we'll need in a second.

    The pre-computed distances are in the gaiadr2_complements.geometric_distance table, which can be joined to the main Gaia object catalog using the source_id column. So, here's a query to produce a little photometric catalog around our spot in Cygnus (we're discarding objects with excessive parallax errors while we're at it):

    SELECT
    r_est, 5*log10(r_est)-5 as dist_mod,
    phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
    ra, dec
    FROM
    gaiadr2.gaia_source
    JOIN gaiadr2_complements.geometric_distance
    USING (source_id)
    WHERE
    parallax_over_error>1
    AND 1=CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 288.5, 2.3, 0.5 ))
    

    The color-magnitude diagram resulting from this is the red point cloud in the animated GIF at the top. To reproduce it, just plot phot_bp_mean_mag-phot_rp_mean_mag against phot_g_mean_mag-dist_mod (and invert the y axis).

    De-reddening this needs a few minor technicalities. The most important one is how to match against the odd intervals of healpixes in the prdust.map_union table. A secondary one is that we have only pulled equatorial coordinates, and the healpixes in prdust are in galactic coordinates.

    Computing the healpix requires the ivo_healpix_index ADQL user defined function (UDF) that you may have met before, and since we have to go from ICRS to Galactic it requires a fairly new UDF I've recently defined to finally get the discussion on having a “standard library” of astrometric functions in ADQL going: gavo_transform. Here's how to get a 10-healpix as required for map_union from ra and dec:

    CAST(ivo_healpix_index(10,
      gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER)
    

    The CAST call is a pure technicality – ivo_healpix_index returns a 64-bit integer, which I can't use in my interval logic.

    The comparison against the intervals you could do yourself, but as argued in Registry-STC article this is one of the trivial things that are easy to get wrong. So, let's use the ivo_interval_overlaps UDF; it goes in the join condition to properly match prdust healpixes to catalog positions. Then our total query – that, I hope, should be reasonably easy to adapt to similar problems – is:

    WITH sources AS (
      SELECT phot_g_mean_mag,
        phot_bp_mean_mag,
        phot_rp_mean_mag,
        dist_mod,
        CAST(ivo_healpix_index(10,
          gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER) AS hpx,
        ROUND((dist_mod-4)*2)+1 AS dist_mod_bin
      FROM TAP_UPLOAD.T1)
    
    SELECT
      phot_bp_mean_mag-phot_rp_mean_mag-dust.best_fit[dist_mod_bin] AS color,
      phot_g_mean_mag-dist_mod+
        dust.best_fit[dist_mod_bin]*3.384 AS abs_mag,
      dust.grdiagnostic[dist_mod_bin] as qual
    FROM sources
    JOIN prdust.map_union AS dust
    ON (1=ivo_interval_has(hpx, coverage))
    

    (If you're following along: you have to switch to the GAVO DC TAP to run this, and you will probably have to change the index after TAP_UPLOAD).

    Ok, in the photometry department there's a bit of cheating going on here – I'm correcting Gaia B-R with B-V, and I'm using the factor for Johnson V to estimate the extinction in Gaia G (if you're curious where that comes from: See the footnote on best_fit and the MC extinction service docs should get you started), so this is far from physically correct. But, as you can see from the green cloud in the plot above, it already helps a bit. And if you find out better factors, by all means let me know so I can add an update... right here:

    Update (2018-09-11): The original data creator, Gregory Green points out that the thing with having a better factor for Gaia G isn't that simple, because, as he says “Gaia G is very broad, [and] the extinction coefficients are much more dependent on stellar type, and extinction is also more nonlinear with dust column (extinction is only linear with dust column and independent of stellar type for an infinitely narrow passband)”. So – when de-reddening, prefer narrow passbands. But whether narrow or wide: TAP helps you.

« Page 2 / 3 »