• APPLAUSE via Obscore

    A composite of two rather noisy photo plates

    Aladin showing some Bamberg Sky Patrol plates (see towards the end of the post for what this is and how I made it).

    At the Astroplate conference I blogged about recently, the people behind APPLAUSE gave a couple of talks about their Data Release 3. APPLAUSE is a fairly massive endeavour to make available data from some of the larger plate archives in Germany, and its DR3 even hit the non-Astronomy press last February.

    Already for previous APPLAUSE releases, I had wanted to bring this data (or rather, its metadata) to the VO, but it never quite happened, basically because there was always another little thing that turned out to be too tedious to work out via mail. However, working out things interactively is exactly what conferences are great for. So, the kind APPLAUSE folks (thanks, Taavi and Harry) and I used the Astroplate conference to map their database schema (“schema” is jargon for what boils down to the set of tables and columns with which they describe their data) to the much simpler (and, admittedly, less powerful) IVOA Obscore one.

    Sure, Obscore doesn't deal with multiple exposures (like when the target field and the north pole were exposed on one plate to help precision photometry), object-guided images, and all the other interesting techniques that astronomers applied in the pre-digital age; it also doesn't usefully cope with multiple scans of the same plate (for instance, to correct for imprecisions in the mechanics of flatbed scanners). APPLAUSE, of course, has to cope with them, since there are many reasons to preserve data of this kind.

    Obscore, on the other hand, is geared towards uniform discovery, where too funky datasets in all likelihood cause more harm than good. So, when we mapped APPLAUSE to Obscore, of the 101138 scans of 70276 plates that the full APPLAUSE holds in DR3, only 44000 plate scans made it into the Obscore table. The advantage: whatever can be sensibly mapped to Obscore can now be queried together with all the other data in the world that others have published through Obscore.

    You can immediately see the effect when you run the little python program for global discovery that we gave in our plates tutorial. Here's what it prints now (the values from before APPLAUSE was in Obscore are in square brackets):

    Column t_exptime: 3460 values
      Min   12, Max 15300, Mean 890.24  [previous mean: 370.722]
    ---
    Column em_mean: 3801 values
      Min 1.8081e-09, Max 9.3e-07, Mean 6.40804e-07 [No change: Sigh!]
    ---
    Column t_mean: 4731 values
      Min 12564.5, Max 58126.3, Mean 49897.9 [previous mean: 51909.1]
    ---
    Column instrument_name: 4747 values
      Matches from , Petzval, [Max Wolf's residence in
      Heidelberg, Maerzgasse, Wolf's Doppelastrograph,
      Heidelberg Koenigstuhl (24), Wolf's
      Doppelastrograph,] AG-Astrograph, [Zeiss Triplet
      15 cm Potsdam-Telegrafenberg], Zeiss Triplet,
      Astrograph (four 10-cm Tessar f/6 cameras),
      [3.5m APO, ROSAT PSPCC, Heidelberg Koenigstuhl
      (24), Bruce Astrograph, Calar Alto (493),
      Schmidt], Grosser Refraktor, [ROSAT HRI,
      DK-1.54], Hamburger Schmidt-Spiegel,
      [DFOSC_FASU], ESO 1-metre Schmidt telescope,
      Great Schmidt Camera, Lippert-Astrograph, Ross-B
      3", [AZT 22], Astrograph (six 10-cm Tessar f/6
      cameras), 1m-Spiegelteleskop, [ROSAT PSPCB],
      Astrograph (ten 10-cm Tessar f/6 cameras), Zeiss
      Objective
    ---
    Column access_url: 4747 values [4067]
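
    In case you would like to reproduce numbers like these yourself, here is a minimal sketch of what such a global-discovery script might look like with a recent pyvo; this is not the tutorial's actual program, and the example region, the single column, and the row limit are made up for illustration:

    # sketch: global Obscore discovery with pyvo; the region, column
    # and limits here are illustrative only
    import numpy
    import pyvo

    QUERY = """
        SELECT t_exptime
        FROM ivoa.obscore
        WHERE 1=CONTAINS(POINT('ICRS', s_ra, s_dec),
            CIRCLE('ICRS', 345, -38, 2))
    """

    exptimes = []
    # ask the Registry for all services claiming to serve Obscore
    for rec in pyvo.registry.search(datamodel="obscore"):
        try:
            svc = rec.get_service("tap")
            tbl = svc.run_sync(QUERY, maxrec=10000).to_table()
        except Exception:
            # in an all-VO loop, a few broken services are expected
            continue
        col = tbl["t_exptime"]
        if hasattr(col, "filled"):
            col = col.filled(numpy.nan)
        exptimes.extend(v for v in numpy.asarray(col, dtype=float)
            if not numpy.isnan(v))

    if exptimes:
        print("Column t_exptime: {} values".format(len(exptimes)))
        print("  Min {:g}, Max {:g}, Mean {:g}".format(
            min(exptimes), max(exptimes), numpy.mean(exptimes)))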
    

    So – for the fields selected in the tutorial, there are 15% more images in the global Obscore image pool now than there were before APPLAUSE, and their mean observation date has moved a bit further into the past. I've not made any statistics, but I suspect the gain is going to be much higher for many other fields. For a strong effect, try some random region in the southern sky covered by the Bamberg Sky Patrol.

    But you have probably noticed the deep sigh in the annotations to the statistics above: yes, we don't have the spectral band for the APPLAUSE data, which is why the em_mean statistics don't change. As a matter of fact, from the Obscore data you can't even guess whether a plate is “more red” or “rather blue”, as Obscore doesn't have an (agreed-upon) field for a “qualitative bandpass indicator”.

    For some other data collections, we did map known emulsion/filter combinations to rough bandpasses (e.g., the Palomar-Leiden Trojan Survey, which only had a few of them). For APPLAUSE, there are 435 combinations of filter and emulsion (that's a VOTable link that you can paste into TOPCAT's load button in order to have a look at the table). Granted, quite a few of these pairs are (more or less) spurious because of inconsistent spelling. But we still gave up on researching the bandpasses even before we started.

    If you're a photographic plate buff: you could help us and posterity a lot if you could go through this list and, at least for some combinations, tell us what, roughly, the lower and upper limits of the corresponding bandpasses might have been (what DaCHS already knows; the plate-relevant data is near the bottom of the file). As usual, send mail to gavo@ari.uni-heidelberg.de if you have anything to contribute.

    Finally, here's the brief explanation of the image for this article: Well, I wanted to find some Bamberg Sky Patrol images for a single field to play with. I knew they were primarily located in the South, and were made using Tessar cameras. So, I ran:

    SELECT t_min, access_url, s_region
    FROM ivoa.obscore
    WHERE instrument_name like '%Tessar%'
    AND 1=CONTAINS(POINT(345, -38), s_region)
    

    on GAVO's TAP service. Since Aladin 10, you can do that from within the program (although some versions will reject this query because they mistakenly believe the ADQL is bad; if that bites you, run the query through TOPCAT and send the result over to Aladin). Incidentally, when there are s_region values in an Obscore table, it's a good idea to use them as I do here, as it's quite a bit more likely that such a query will use indices than one with conditions on s_ra and s_dec. But then not all services fill s_region properly, so for all-VO queries you will probably want to make do with s_ra and s_dec.
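
    If you would rather run this from a script, here is a rough pyvo sketch; the coordinate box in the comment only illustrates the s_ra/s_dec fallback just mentioned, and the output file name is of course arbitrary:

    import pyvo

    # GAVO's TAP service, as used in the text
    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")

    plates = svc.run_sync("""
        SELECT t_min, access_url, s_region
        FROM ivoa.obscore
        WHERE instrument_name LIKE '%Tessar%'
        AND 1=CONTAINS(POINT(345, -38), s_region)
        """).to_table()

    # fallback for services without usable s_region: a (crude) box in
    # s_ra and s_dec, e.g.
    #   AND s_ra BETWEEN 343 AND 347 AND s_dec BETWEEN -40 AND -36

    print(len(plates), "plate scans found")
    # keep the result for inspection in TOPCAT or Aladin
    plates.write("tessar-plates.vot", format="votable")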

    From that result I first made the inset bar graph in the article image to show the temporal distribution of the Patrol plates. And then I grabbed two (rather randomly selected) plates and had Aladin produce a red-blue composite of them. Whatever is really red or really blue in that image may correspond to a transient event. Or, as is certainly the case with that little hair (or whatever) that shines out in blue, it may not.

  • Small Telescopes, Large Surveys

    Image: Blink comparator and survey camera

    Plate technology at Bamberg observatory: a blink comparator with one plate mounted, and a survey camera that was once used at Boyden Station, an astronomers' outpost in 1960s South Africa.

    I'm currently at the workshop “Large surveys with small telescopes: past, present, and future” (or Astroplate III for short) in Bamberg, where people are discussing using and re-using the rich heritage of historical observations (hence the “plate” part) as well as growing that heritage in the age of large CCDs, fast computers, and large disks.

    Using and re-using is of course what the Virtual Observatory is about, and we've been keeping fairly large plate collections in our data center for quite a while (among them the Archives of Landessternwarte Königstuhl and the Palomar-Leiden Trojan surveys; the WFPDB is TAP-accessible there, too). Therefore, people from GAVO Heidelberg have been to all past Astroplate conferences.

    For this one, I brought a brand-new tutorial on plate scans in the VO which, I hope, also works as a general introduction to image discovery in the VO using SIAP, Datalink, and Obscore. If you're doing image stuff now and then, please have a quick look at the thing – I would be particularly grateful for hints on what to improve, or for especially obvious use cases for the material discussed.

    Such VO proselytising aside, the conference is discussing the wide variety of creative, low-cost data collectors out there as well as computer-aided re-analysis extracting new knowledge from decades-old data. If I had to choose a single come-to-think-of-it moment, it would be Norbert Zacharias' observation that if you have a well-behaved object and you'd like to know where it was in 1900, it's now more accurate to extrapolate Gaia astrometry to the epoch of observation than to measure it on the plate itself. Which is saying a lot about the amazing feat of engineering that Gaia is.

    This is not, however, an argument for dumping the old data. Usually, it is exactly what is not so well-behaved (like those) that is interesting – both in terms of astrometry and in terms of photometry (where there is a lot more unruly behaviour in the first place). To figure out how objects misbehave, and to tell which objects only masquerade as well-behaved on the time scale of, say, the Gaia mission, the key is “old” data – the freshness of which is what we are discussing this week.

  • A New View on SSAP in DaCHS

    When I started working on the VO in 2007, my colleagues in Garching already had a piece of software that implemented major parts of the simple spectral access protocol (SSAP) that was being developed back then. It would publish spectra in the FITS format by just blindly dumping all header cards into a database table and then defining a view over that “raw” metadata table to make the whole thing match SSAP's expectations of what the output table should look like. Sometimes you could map a header straight through to an SSA column, sometimes you would just convert a unit, and sometimes you would have to write a fairly complex SQL expression combining multiple fields.

    Back then, I didn't like it – why have two things (a table and a view) that can break when one (just a table in SSA's format) would do, too? Also, SSAP has about 50 metadata fields, but lets you put constant values into VOTable PARAMs, which seemed a very reasonable way to attain more compact responses. So, when DaCHS grew SSAP support, I defined a mixin (essentially, a configurable interface definition) that let operators define SSA tables and their constant parameters in a fairly simple fashion and directly produced a table you could base your SSAP service on.

    That made assumptions about which pieces of metadata are constant and which are not; for instance, the original mixin (“hcd” for “homogeneous collection”) assumed all spectra in a data collection came from the same instrument and had the same resolution and (what was I thinking?) SNR. Unsurprisingly, that broke fairly soon. So, I added a second mixin (“mixc”) for when different instruments or codes produced the data.

    But even that was a headache, certainly by the time I started making time series services using SSAP. And I had to fix a few bugs in the mixins themselves in the meantime, which, in that design, mostly required re-imports of the data. Such re-imports are non-trivial when you have millions of spectra, and they had to happen at software upgrade time or the services would break with the upgrade. Ouch.

    It was about mid-2018 when it dawned on me that sometimes it's better to have two things that can break even if one would do, after all. Specifically, if fixing the one thing is expensive, it's an excellent idea to put a facade on top of it that's cheap to change and can already be used to repair most deficiencies. Why re-build the house if a paint job does the trick?

    As to having more compact query responses when you stuff metadata that's constant across all rows into VOTable PARAMs – well, in the age of web pages pulling in a megabyte of javascript and two megabytes of images to display five lines of text, I've become a bit cavalier in that department. Sure, the average row may have grown by a factor of three, but we're still talking only a few megabytes even with large responses. To me, these extra bytes seem a fair price to pay for the increased flexibility and the overall more straightforward architecture.

    So, I've now come up with a view-based solution in DaCHS, too: the //ssap#view mixin. This is a bit less radical than the Garching software of 2007, as it doesn't dump raw headers but instead lets you do the primary transformations in the RD. But it no longer constrains what pieces of metadata should be constant and which may vary between spectra, and it uses the same names for the same pieces of metadata throughout (which also is a step forward over the old SSAP mixins).

    With this, DaCHS operators should no longer use the hcd and mixc mixins for new services. The new technique is already reflected in the respective tutorial chapter, and the SSAP template (you're using dachs start, aren't you?) now uses it, too.

    If you have a spectra publishing project in your pipeline, this would be the perfect time to upgrade to the DaCHS 1.2.4 beta, which has the new mixin. It would be great if we could iron out any remaining wrinkles before the next release, after which changes become a load on my conscience.

    As to migrating existing SSAP services: well, it would be great if I could drop the old mixins in a couple of years, as they cause quite a bit of ugliness in DaCHS's built-in //ssap RD. But the migration regrettably isn't straightforward, so you may want to wait a bit before embarking on that journey (I'll be happy to help, though).

  • A Grey Eminence of a Standard

    [Screenshot: graphs and numbers]

    Examples for extra metadata: extended column descriptions on the web pages accompanying the ARI-Gaia TAP service.

    Last Friday, I uploaded a first working draft of VODataService 1.2 to the IVOA documents repository. That's the first major step in updating a standard, and it's an invitation to everyone to have a look and comment.

    Foof, you might say, what do I care? I've not even heard of that standard.

    Well, but you've probably used it. VODataService is (among several other things) the standard that governs how a TAP service tells clients (TOPCAT, say) what tables it has and what's inside of them. So, if you see in TOPCAT that there is a column named ang_error with a unit of deg, a UCD of stat.error;pos and the meaning “1 σ confidence radius of the position”, that most likely came in a document standardised by VODataService.
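
    If you would like to see for yourself what a service declares, a few lines of pyvo will print roughly the metadata TOPCAT shows; the service and table name below are merely examples, and the table set is read from the service's VOSI tables endpoint, which is a VODataService tableset document:

    import pyvo

    # any TAP service will do; this one is used elsewhere on this blog
    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")

    # svc.tables is parsed from the VOSI tables endpoint, i.e.,
    # a VODataService tableset document
    table = svc.tables["ivoa.obscore"]
    for col in table.columns:
        print(col.name, col.unit, col.ucd)
        print("   ", col.description)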

    The question of what (TAP) services can tell clients about their table set is one major open point: do we want additional metadata there? This article's image, for inspiration, shows a screenshot of extended metadata Grégory delivers to browsers on his ARI-Gaia service; among this are minima, maxima, means, standard deviations, quartiles, and fill factors (i.e., how many of a column's values are NULL). He even shows histograms of the values' distributions and HEALPix maps showing how (the means of) the values vary on the sky. Another example of extended metadata could be footnotes as you will find them on many of my resources' reference URLs (example; footnotes are, unsurprisingly, near the foot of that page).

    We could define interoperable means to communicate information like this. The question is: does the added value justify the complication in implementation? This is where it would be great if you weighed in, in particular if you are a “mere” TAP user: Are there any such pieces of metadata you've always wanted to see in your TAP interfaces? Oh, and metadata of course can also be added to tables rather than columns. The current draft already lets services communicate the number of rows in each table – is there more “simple”, table-specific metadata of this sort?

    VODataService furthermore deals with several other topics; for instance, the STC in the registry business I've blogged about in February is going to be standardised here (update on this: spectral coverage is no longer in wavelength but in energy). Other changes are rather more technical in nature, like several new resource types that will improve the discovery of tables and other such resources, or a careful adjustment of some features to keep them in line with TAP evolution.

    But don't let the technicalities scare you away – just have a peek, and if you have thoughts on any of the VODataService topics: I'm just a mail away.

  • Find Outliers using ADQL and TAP

    Annie Cannon's notebook and a plot

    Two pages from Annie Cannon's notebooks[1], and a histogram of the basic BP-RP color distribution in the HD catalogue (blue) and the distribution of the outliers (red). For more of Annie Cannon's notebooks, search on ADS.

    The other day I gave one of my improvised live demos (“What, roughly, are you working on?”), and I ended up needing to translate identifiers from the Henry Draper Catalogue to modern positions. Quickly typing “Henry Draper” into TOPCAT's TAP search window didn't yield anything useful (a few resources merely referencing the HD, and a TAP service that didn't support uploads – hmpf).

    Now, had I tried the somewhat more thorough WIRR Registry interface, I'd have noted the HD catalogue at VizieR and in particular Fabricius et al's HD-Tycho 2 match (explaining why they didn't show up in TOPCAT is a longer story; we're working on it). But alas, I didn't, and so I set out to produce a catalogue matching HD and Gaia DR2, easily findable from within TOPCAT's TAP client. Well, it's here in the form of the hdgaia.main table in our data center.
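
    Incidentally, if neither TOPCAT's search window nor WIRR are at hand, a Registry keyword search is also just a few lines of pyvo; this is only a sketch, and the keyword is of course whatever you are after:

    import pyvo

    # look for registered resources mentioning the Henry Draper catalogue
    for res in pyvo.registry.search(keywords=["henry draper"]):
        print(res.res_title, "--", res.short_name)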

    Considering the nontrivial data discovery and some yak shaving I had to do to get from HD identifiers to Gaia DR2 ones, it was perhaps not as futile an exercise as I occasionally feared while preparing the thing. And it gives me the chance to show a nice ADQL technique for locating outliers.

    In this case, one might ask: which objects might Annie Cannon and colleagues have misclassified? Or perhaps some objects have changed their spectra between the time Cannon's photographic plates were taken and the time Gaia observed them? Whatever it is: we'll have to figure out where there are unusual BP-RP colours given the spectral type from HD.

    To figure this out, we'll first have to determine what's “usual”. If you've worked through our ADQL course, you know what to expect: grouping. So, to get a table of average colours by spectral type, you'd say (all queries executable on the TAP service at http://dc.g-vo.org/tap):

    select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      count(*) as ct
    from hdgaia.main
    join gaia.dr2light
    using (source_id)
    group by spectral
    

    – apart from the join that's needed here because we want to pull photometry from gaia, that's standard fare. And that join is the selling point of this catalog, so I won't apologise for using it already in the first query.

    The next question is how strict we want to be before we say that something that doesn't have the expected colour is unusual. While these days you can rather easily use actual distributions, at least for an initial analysis just assuming a Gaussian and characterising its width through the standard deviation works pretty well if your data isn't excessively nasty. Regrettably, there is no aggregate function STDDEV in ADQL (you could still ask for it: head over to the DAL mailing list before ADQL 2.1 is a done deal!). However, you may remember that Var(X) = E(X²) − E(X)², that the average is an estimator for the expectation, and that the standard deviation is just the square root of the variance. These estimators will work like a charm if you're actually dealing with Gaussian data.

    So, let's use that to compute our standard deviations. While we are at it, throw out everything that's not a star[2], and ensure that our groups have enough members to make our estimates non-ridiculous; that last bit is done through a HAVING clause that essentially works like a WHERE, just for entire GROUPs:

    select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
        power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
      count(*) as ct
    from hdgaia.main
    join gaia.dr2light
      using (source_id)
    where m_v<18
    group by spectral
    having count(*)>10
    

    This may look a bit scary, but if you read it line by line, I'd argue it's no worse than our harmless first GROUP BY query.

    From here, the step to determine the outliers isn't big any more. What the query I've just written produces is a mapping from spectral type to the means and scales (“µ,σ” in the rotten jargon of astronomy) of the Gaussians for the colors of the stars having that spectral type. So, all we need to do is join that information by spectral type to the original table and then see which actual colors are further off than, say, three sigma. This is a nice application of the common table expressions I've tried to sell you in the post on ADQL 2.1; our determine-what's-usual query from above stays nicely separated from the (largely trivial) rest:

    with standards as (select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
        power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
      count(*) as ct
      from hdgaia.main
      join gaia.dr2light
      using (source_id)
      where m_v<18
      group by spectral
      having count(*)>10)
    select *
    from hdgaia.main
    join standards
    using (spectral)
    join gaia.dr2light using (source_id)
    where
      abs(phot_bp_mean_mag-phot_rp_mean_mag-col)>3*sig_col
      and m_v<18
    

    – and that's a fairly general pattern for doing an initial outlier analysis on the remote side. For HD, this takes a few seconds and yields 2722 rows (at least until we also push HDE into the table). That means you can keep 99% of the rows (the boring ones) on the server and just pull the ones that could be interesting. These 99% savings aren't terribly much with a catalogue like the HD, which is small by today's standards. For large catalogues, it's the difference between a download of a couple of minutes and pulling data for a day while frantically freeing disk space.
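
    If you would like to take the result home for closer inspection, here is, as a sketch, how you might run the final query with pyvo and keep the outliers (the file name is, of course, arbitrary):

    import pyvo

    OUTLIER_QUERY = """
    with standards as (select spectral,
      avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
      sqrt(avg(power(phot_bp_mean_mag-phot_rp_mean_mag, 2))-
        power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
      count(*) as ct
      from hdgaia.main
      join gaia.dr2light
      using (source_id)
      where m_v<18
      group by spectral
      having count(*)>10)
    select *
    from hdgaia.main
    join standards
    using (spectral)
    join gaia.dr2light using (source_id)
    where
      abs(phot_bp_mean_mag-phot_rp_mean_mag-col)>3*sig_col
      and m_v<18
    """

    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    outliers = svc.run_sync(OUTLIER_QUERY).to_table()
    print(len(outliers), "outliers")
    # keep them for later inspection in TOPCAT or Aladin
    outliers.write("hd-outliers.vot", format="votable")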

    By the way, there are only 2.7e3 outliers among 2.25e5 objects. Annie Cannon, Williamina Fleming, Antonia Maury, Edward Pickering, and the rest of the crew, on the other hand, not only had to come up with the spectral classification while working on the catalogue but also had to classify all these objects manually. This is an amazing feat even if all of those rows actually were misclassifications (which they certainly aren't) – the machine classifiers of today would be proud to only get 1% wrong.

    The inset in the facsimile of Annie Cannon's notebooks above shows how the outliers are distributed in color space relative to the full catalogue, where the basic catalogue is in blue and the outliers (scaled by 70) in red. Wouldn't it make a nice little side project to figure out the reason for the outlier clump on the red side of the histogram?

    [1]The notebook pages are from a notebook Annie Cannon used in 1929. The material was kindly provided by Project PHAEDRA at the John G. Wolbach Library, Harvard College Observatory.
    [2]I'll not hide that I was severely tempted to undo the mapping of object classes to – for HD – unrealistic magnitudes (20 .. 50), but then left the HD as it came from ADC; I still doubt that decision was well taken, and sure enough, the example queries above already have insane constraints on m_v reflecting that encoding. From today's perspective, of course, there should have been an extra column or, better yet, a different catalogue for nonstellar objects. Ah well. It's always hard to break unhealthy patterns.
