I recently got embarrassed by ADQL NULLs, i.e., the magic value indicating that a value in a given column is missing. And since that’s a common source of errors when writing ADQL queries, I’ll take this as a cue for a blog post.
The concrete background is fairly technical and registry-ish; suffice it to say that some data providers who implemented interfaces conforming to some standard didn’t properly say so in their registry records. Back in RegTAP 1.0 (that’s the standard that says how a client like TOPCAT talks to the VO Registry), I decided to work around that by fudging the pattern for how to discover those interfaces so they’d still be found.
In RegTAP 1.1, which is now under review by the VO community, I wanted to do away with that workaround. But would that break anything? This question translates to “are there vs:ParamHTTP interfaces that don’t have a role attribute of std”. Whatever “ParamHTTP” and “role attribute” actually mean, just appreciate that it looks like it might translate into SQL like
select * from rr.interface
and not intf_role='std'
I ran that query, rejoiced because it didn’t return anything, removed the workarund from the standard, and then was shot down when I read Mark’s mail (politely) saying I’m wrong and there are services still requiring the workaround. As usual: If a query returns what you expect, be double careful.
What went wrong? Well, NULL semantics. You see, in SQL NULL is never equal to anything, not even itself (it’s like NaN in IEEE floats in that: try n = float('nan');print(n==n) in Python and look again if you’re cool about it). It’s also not unequal. Don’t take my word for it. Try
select * from tap_schema.schemas where NULL=NULL
select * from tap_schema.schemas where NULL!=NULL
– you’ll get empty results in both cases.
What does that mean for science queries? Well, whenever there’s NULLs in columns (and the only safe assumption for now is that they may hide in there; we should probably add nun-null as a column property in the tap schema and in VODataService some day), you need to be careful in particular with inverted logic.
Here’s an example: Suppose you want to investigate NGC objects brighter than 10 mag in B in one bin in everything else in another. The ones brighter are simple:
select count(*) from openngc.data where mag_b<10
(try it on the TAP server at http://dc.g-vo.org/tap, it’s 383 in the current release). It becomes difficult for “the rest”. If you write
select count(*) from openngc.data where not mag_b<10
select count(*) from openngc.data where mag_b>=10
you’ll get (for the current release) 10887. However, the whole catalogue has 13954 entries, so there’s 13954-10887-383=2684 rows missing. Your “rest” has missed everything for which mag_b isn’t given. Sure enough,
select count(*) from openngc.data where mag_b is null
(and this is the only good way to compare against null) gives 2684.
The right way to say “anything for which mag_b is not smaller than 10” thus is
select count(*) from openngc.data
or mag_b is null
Morale: Unless you’re sure there are no missing values (i.e., NULLs) in a column you’re looking at, think about what these mean to your research (or other) question: Should these rows just vanish? Then you usually don’t need to do anything and the SQL semantics magically do the right thing (which is why things are defined as they are). If, however, the corresponding rows would mean something to your question, you need to be explicit, and you must have some condition involving IS NULL or IS NOT NULL.
The trouble, of course, is that just knowing this still isn’t enough. You need to remember it in the right moment. Or you’ll share my fate of suffering some public embarrassement.
Almost a year has passed since release 1.2 of DaCHS – I’ve let the normal autumn release slip last year because there weren’t so many release-worthy new features in DaCHS at the traditional release time (i.e., after the College Park interop), and also because running betas when you do need a new feature is a fairly stable thing by now.
But here it finally is: Release 1.3 (tarball for the die-hard self-builders; everyone else just switches back the release branch as necessary and then runs an update/upgrade cycle).
Columns can now be hidden from TAP/ADQL (and other interfaces) by setting hidden="True".
There is now a setting [web]maxSyncUploadSize=500000 (meaning: about 500 kByte) as the default upload limit on sync queries. In compensation, clients uploading too much now receive a more useful error message (except it doesn’t reach TOPCAT users most of the time because it does chunked uploads). To get back the behaviour of 1.2 (which is probably ok if you can live with the occasional resource hog), add maxSyncUploadSize=20000000 to your /etc/gavo.rc.
Adding support for https (certificate reading, certificate updating with letsencrypt, registering alternate endpoints, no WebSAMP with https). See HTTPS in DaCHS on this blog for details.
New source_table and preview columns in obscore. If you’re using the various obscore mixins, this should be automatic. If you have defined views manually, you will have to amend these (and have a broken obscore until a dachs upgrade ran without error).
No longer producing arraysize="1" in VOTables for scalars (except char, for compatibility with a legacy TOPCAT workaround; see VOTable 1.3 Erratum 3 for background information).
Support for draft TIMESYS in VOTable (with STC 2 annotation; ask about details if you’re interested. This is for draft VOTable 1.4 and probably only relevant to you if you’re publishing time series).
You can now add targetType and targetTitle properties to URL-valued columns to help Aladin figure out what to do with URLs (see Datalinks as product URLs in the reference documentation).
New gavo_transform, gavo_ipix, and gavo_urlescale ufuncs for ADQL, fixed gavo_urlescape to have acceptable performance.
//soda#sdm_genDesc now matches accref rather than pubDID by default. If you use Datalink with SSA and have a custom pubDID schema (or no index on accref), add a useAccref="False" to your descriptorGenerator statement.
There is now a --foreground option for dachs serve start. This is mainly to play nice with systemd, and indeed, the Debian package now comes with a systemd unit file. I’m not terribly familiar with systemd, so please have an eye on DaCHS controlled by systemd and let me know if you see something that’s not as it should be.
Fixes for various bugs (most notable: ” in ADQL, WCS in SIAP cutout products) and many minor improvements. Check out the source tree (still via subversion) and read the changelog if you want to know the whole truth.
On systems running from the Debian package, the update should be automatic with the next system upgrade. However, you’ll be saving yourself quite a bit of headache if you check the health of your installation before the upgrade; see Upgrading DaCHS in the operator’s guide on how to upgrade professionally.
About every six month, the people making the standards for the Virtual Observatory meet to sort out the next things we need to tackle, to show off what we’ve done, and to meet each other in person, which sometimes is what it takes to take some excessive heat out of a debate or two. We’ve talkedaboutInterops before. And now it’s time for this (northern) spring’s Interop, which is taking place in Paris (Program).
This time I thought I’d see if there’s any chance I can copy the pattern I’m enjoying at Skyweek now and then: A live blog, where I’ll extend the post as I go. If that’s a plan that can fly remains to be seen, as I’ll give seven talks until Friday, and there’s a plethora of side meetings and other things requiring my attention.
Anyway, the first agenda item is a meeting of the TCG, the Technical Coordination Group, which is made up of the chairs and vice-chairs of the IVOA’s working groups (I’m in there as the vice chair of the semantics WG). We’ll review how the standards under review progress, sanction (or perhaps defer) errata, and generally look at issues of general VO interest.
Update (2019-05-12, 10:50): Oh dang, my VOResource 1.1 Erratum 1 hasn’t quite made it. You see, it’s about authentication, i.e., restricting service access, which, in a federated, interoperating system is trickier than you would think, and quite a few discussions on that will happen during this Interop. So, the TCG has just decided to only consider it passed if nothing happens this week that would kill it. To give you an idea of other things we’ve talked about: Obscore 1.1 Erratum 1 and SODA 1.0 Erratum 1 both try to fix problems with UCD annoation (i.e., a rough idea what it is) not directly related to the standards themselves but intended to help when service results are consumed outside of the standard context, and RegTAP 1.0 Erratum 1 fixes an example in the standard regulating registry discovery that didn’t properly take into account my old nemesis, case-insensitivity of IVOA identifiers. So, yay!, at least one of my Errata made the TCG review.
Update (2019-05-12, 12:15): Yay! After some years of back and forth, the TCG has finally endorsed my Discovering Data Collections note. This is another example of the class of text you don’t really notice. It’s supposed to let you, for instance, type in a table name into TOPCAT and then figure out at which TAP service to query it. You say: I can already do that! I say: Yeah, but only because I’m running a non-standard service, which I’d like to cease at some point.
Update (2019-05-12, 15:55): The TCG meeting slowly draws to an end. This second half was, in particular, concerned with reports from Working and Interest Groups; this is, essentially, an interactive version of the roadmaps, where the various chairs say what they’d like to do in the six month following an Interop. The one from after College Park (VO insiders live by Interops, named by the towns they’re in) you could read at 2018 B Roadmap in the IVOA Wiki – but really, as of next Friday, you’d rather look at what’s going be cooked up here (which will be at 2019 A Roadmap).
Update (2019-05-12, 16:30) It’s now Exec, i.e., the governing body of the VO, consisting of the principal investigators (or, bosses), of the national VO projects (I’m just sitting in for my boss, really). This has, for instance, the final say on what gets to be a standard and what doesn’t. This is, of course, a bit more formal than the hands-on debates going on in the TCG, so I get to look around a bit in the meeting room. And what a meeting room they have here at Paris observatory. Behind me there’s a copy of Louis XIV’s most famous portrait (and for a reason: Louis XIV had the main part of the building we’re in built), along the walls around me are the portraits of the former directors of Paris observatory (among them names all mathematicians or astronomers know: Laplace, Delaunay, Lalande, the Cassinis, and so on), and above me, in the meeting room’s dome, there’s an allegoric image of a Venus transit that I can’t link here lest schools block this important outreach site. What a pity we’ll have to move into a tent when everyone else comes in tomorrow…
Update (2019-05-13, 9:11) The logistics speech is being given by Baptiste Cecconi, who’s just given the carbon footprint of this meeting – 155 tons of CO2 for travel alone, or 1.2 tons per person. That, as he points out, is about what would be sustainable per year. Well, they’re trying to make amends as far as possible. We’ll have vegetarian-only food today (good for me), and locally grown food as far as possible. Also, the conference freebie is a reusable cup so people won’t produce endless amounts of waste plastic cups. I have to say I’m impressed.
Update (2019-05-13, 9:43): One important function of these meetings is that when software authors and users sit together, it’s much easier to fix things. And, first success for me this time around: The LAMOST services at the data center of the Czech academy of sciences do fast positional searches now; you’ll find them by looking for LAMOST in TOPCAT‘s SSAP window, in Aladin 10, in Splat, or really whereever clients let you do discovery of spectral services in the VO.
Update (2019-05-13, 10:59):Next up: “Charge to the Working Groups”. That’s when the various working group chairs give lightning talks on what’s going to happen in their sessions and try to pull as many people as they can. Meanwhile, in the coffee break, I’ve had the next little success: With the people involved, we’ve worked out a good way to fix a Registry problem briefly described by “two publishing registries claim the same authority” (it’s always nice to pretend I’m in Star Trek) – indeed, we’ll only need a single deletion at a single point. Given the potential fallout of such a problem, that’s very satisfying.
Update (2019-05-13, 14:07): While the IG/WG chairs presented their plans, the Ghost of Le Verrier (or was it just the wind?) occasionally haunted the tent, which gave off dreadful noises. And after the session, I quickly ported the build infrastructure for the future EPN-TAP specification (SVN for nerds; previously in this blog for the rest of you) to python 3. Le Verrier was quiet during that time, so I’m sure the guy who led the way to the discovery of Uranus approved.
Update (2019-05-13, 14:29): Mark SubbaRao from Chicago’s Adler Planetarium is giving a plenary talk (in other places, this might be called a “keynote”) on Planetaria and the VO. And he makes the point that there’s 150 million people visiting a plenetarium each year, which, he claims, is a kind of outreach opportunity that no other science has. I’d not bet on that last statement given all the natural history museums, exploratoria, maker faires and the like, but still: That the existence of planetaria says something about the relationship of the public with astronomy is an insight I just had.
Update (2019-05-13, 15:07): So, you think you just sit back and enjoy a colourful talk, and then suddenly there’s work in there. Specifially, there’s a standard called AVM designed to annotate astronomical images to show them in the right place on a planetarium dome (ok, FITS WCS can do that as well) and furnish it with other metadata useful in outreach and education. As Registry and Semantics enthusiast, I immediately clicked on the AVM link at the foot of http://www.data2dome.org and was greeted by something pretty close to a standard IVOA document header. Except it declares itself as an “IVOA draft”; such a document category doesn’t really exist. Even if it did, after around 10 years (there are conflicting date specs in the document) a document shouldn’t be a “draft” any more. If it’s survived that long and is still used, it deserves to be some sort of proper document, I think. So, I took the liberty of cold-contacting one of the authors. Let’s see where that goes.
Update (2019-05-13, 16:29): We’ve just learned about the standardisation process at IPDA (that’s a bit like the IVOA, just for planetary data), and interestingly, people are voting there on their standards – this is against the IVOA practice of requiring consensus. Our argument has always been that a standard only makes sense if all interested parties adopt it and thus have to at least not veto it. I wonder if these different approaches have to do with the different demographics: within the IPDA, there are far fewer players (space agencies, really) with much clearer imbalances (e.g., between NASA and the space agency of the UAE). Hm. I couldn’t say how these would impact our arguments for requiring consensus…
Update (2019-05-13, 17:11): Isn’t that nice? In the session of the solar system interest group, Eleonora Alei is just reporting on her merged catalog of explanets – which is nice in itself, but what’s pleasant for me is to learn she got to make this because of the skills she learned at the ASTERICS school in Strasbourg last November. You see, I was one of the tutors there!
Update (2019-05-14, 8:50): Next up is the first Registry session, with a talk on how to get the information on all our fine VO services into B2Find, a Registry-like thing for the Eurpean Open Science Cloud as its highlight. I’ll also present my findings on what we (as the VO) have gotten wrong when we used “capabilities” do describe things, and also progress on VODataService 1.2; this latter thing is, as far as users are concerned, mainly about finally enabling registry searches by space, time, and spectral coverage.
Update (2019-05-14, 14:11): So, I did run into overtime a bit with my talks, which mostly is a good sign in Interops, because it indicates there’s discussion, which again indicates interest in the topic at hand. The rest of the morning I spent trying to work out how we can map the VO Registry (i.e., the set of metadata records about our services) into b2find in a way that it’s actually useful. I guess we – that’s Claudia from b2find, Theresa as Registry chair, and me – made good progress on this, perhaps not the least because of the atmosphere of the meeting: In the sun in the beautiful garden of Paris observatory. And now: Data Models I.
Update (2019-05-14, 14:51): Whoops – Steve just mentions in his talk on the Planetary Data System that there’s ISO 14721, a reference model for an Open Archival Information System. Since I run such an archive, I’m a bit embarrassed to admit I’ve never heard of that standard. The question, of course, being if this has the same relationship to actually running an Archive as ISO 9001 has to “quality” (Scott Adams once famously said something to the effect of: if you’ve not worked with ISO 9001, you probably don’t know what it is. If you have worked with ISO 9001, you certainly don’t know what it is).
Update (2019-05-15, 9:30): I’ve already given my first talk today: TIMESYS and TOPOCENTER, on a quick way to deal with the problem of adjusting for light travel times when people have not reduced the times they give to one of the standard reference positions. There’s more things close to my heart in this session: MOCs in Space and Time, which might become relevant for the Registry [up-update: and, wow, of quick searches against planetary or asteroid orbits. Gasp]; you see, MOCs are rather compact representations of (so far only spatial) coverages, and the space MOCs are already in use for the Registry in the rr.stc_spatial table on the TAP service at http://dc.g-vo.org/tap. The temporal part of STC-based discovery is just intervals at this point, which probably is good enough – but who knows? And I’m also curious about Dave’s thoughts on the registration of VOEvents, which takes up something I’ve reviewed ages ago and that went dormant then – which was somewhat of a pity, because there’s to this day no way to find active VOEvent streams.
Update (2019-05-15, 14:18): After another Exec session over lunch I ran over to a session somewhat flamboyantly called “TAP-fostered Authentication in the Server-Client scenario“. This is about enabling running access-controlled services, which I’m not really a fan of; but then I figure if people can use VO tools to access their proprietary data, chances are better that that data will eventually be usable from everyone’s VO tools. Data dumped behind custom-written web pages will much less likely be freed in the end, or so I believe. Anyway, I’m now in the game of figuring out how to do this, and I’m giving the (current) Registry perspective. The main part of the session, however, will be free discussion, a time-honored and valuable tradition at Interops.
Update (2019-05-16, 9:00): I’m now in the Theory session, where people deal with simulated data and such things (rather than, as you might guess, with the theory of publishing and/or processing data). The main reason I’m here is that theory was an early adopter of vocabularies. Due to my new(ish) role in the semantics WG, I’ll have to worry about this, because things changed a bit since they started (I’ll talk about that later today) – and also, some of their vocabularies – for instance, object types – are of general interest and shouldn’t probably be theory-only. Let’s see how far my charm goes…
Update (2019-05-16, 12:20): I was doing a bit of back-and-forth between a DAL session (in which, among other things, my colleague Jon gave a talk on a machine-readable grammar for ADQL and Dave tells us how ADQL 2.1 goes on (previously on this blog), and a code sprint the astropy folks have next to the conference, where we’ve been discussing pyVO’s future (remember pyVO? See the update for yesterday 11:16 if not).
Update (2019-05-16, 14:27): Again, in-session running: I gave a quick talk on how we’ll finally get to do data collection-based discovery (rather than service-based, as we do now; lecture notes) and then walked through the garden of Paris observatory to the semantics session, where I joined while people were still discussing the age-old problem of enumerating the observatories, space-probes, and instruments in the world (an endeavour that, very frankly, scares me a tiny bit because of its enormous size). After talks on the use of vocabularies in CAOM2 (Pat) and theory (Emeric), I’ll then do my first formal action in the semantics WG: I’ll disclose my plans for specifying how the IVOA should do vocabulary work in the future.
Update (2019-05-16, 17:56): So, the afternoon, between my talks in Registry II and Semantics, planning for the Semantics roadmap (this is something where WG chairs say what they’re planning until the next Interop; more on this, I guess, tomorrow), talking with the theory people about how their vocabularies will better integrate with the wider VO, and passing on pyVO to core astropy folks, was a bit too busy for live-blogging. I conclude with a “splinter” on the development of Datalink. This is pure discussion without a formal talk, which, frankly, often is the most useful format for things we do at Interops, and there’s almost 20 people here. In contrast to yesterday’s after-show splinter (which was on integration of the VO Registry with b2find), I’m just a participant here. Phewy.
Update (2019-05-17, 8:52): We’re going to start the last act of Interops, where the working group chairs report on the progress made during the interop. That, at the time of writing, only three WGs already have their slide on it shows that that’s always a bit of a real-time affair – understandibly, because the last bargains and agreements are being worked out as I write. This time around, though, there’s a variation to that theme: The astropy hackathon that ran in parallel to the Interop will also present its findings, and I particularly rejoice because they’re taking over pyVO development. That’s excellent news because Stefan, who’s curated pyVO for the last couple of years from Heidelberg, has moved on and so pyVO might have orphaned. That’s what I call a happy end.
Update (2019-05-17, 13:01): So, after reviews and a kind good-bye speech by the Exec chair Mark Allen – which included quite a bit well-deserved applause for the organisers of the meeting –, the official part is over. Of course, I still have a last side-meeting: planning for what we’re going to do within ESCAPE, a project linking astronomy with the European Open Science Cloud. But that’s not going to be more than an hour. Good-bye.
At the Astroplate conference I blogged about recently, the people behind APPLAUSE gave a couple of talks about their Data Release 3. APPLAUSE is a fairly massive endeavour to make available data from some of the larger plate archives in Germany, and its DR3 even hit the non-Astronomy press last February.
Already for previous APPLAUSE releases, I’ve wanted to bring this data (or rather, its metadata) to the VO, but it never quite happened, basically because there was always another little thing that turned out to be too tedious to work out via mail. However, working out things interactively is exactly what conferences are great for. So, the kind APPLAUSE folks (thanks, Taavi and Harry) and I used the Astroplate to map their database schema (“schema” is jargon for what boils down to the set of tables and columns with which they describe their data) to the much simpler (and, admittedly, less powerful) IVOA Obscore one.
Sure, Obscore doesn’t deal with multiple exposures (like when the target field and the north pole were exposed on one plate to help precision photometry), object-guided images, and all the other interesting techniques that astronomers applied in the pre-digital age; it also doesn’t usefully cope with multiple scans of the same plate (for instance, to correct for imprecisions in the mechanics of flatbed scanners). APPLAUSE, of course, has to cope with them, since there are many reasons to preserve data of this kind.
Obscore, on the other hand, is geared towards uniform discovery, where too funky datasets in all likelihood cause more harm than good. So, when we mapped APPLAUSE to Obscore, of the 101138 scans of 70276 plates that the full APPLAUSE holds in DR3, only 44000 plate scans made it into the Obscore table. The advantage: whatever can be sensibly mapped to Obscore can now be queried together with all the other data in the world that others have published through Obscore.
You can immediately see the effect when you run the little python program doing the global discovery we gave in our plates tutorial. Here’s what it prints now (values from pre-APPLAUSE-in-Obscore are in square brackets):
Column t_exptime: 3460 values
Min 12, Max 15300, Mean 890.24 [previous mean: 370.722]
Column em_mean: 3801 values
Min 1.8081e-09, Max 9.3e-07, Mean 6.40804e-07 [No change: Sigh!]
Column t_mean: 4731 values
Min 12564.5, Max 58126.3, Mean 49897.9 [previous mean: 51909.1]
Column instrument_name: 4747 values
Matches from , Petzval, [Max Wolf's residence in
Heidelberg, Maerzgasse, Wolf's Doppelastrograph,
Heidelberg Koenigstuhl (24), Wolf's
Doppelastrograph,] AG-Astrograph, [Zeiss Triplet
15 cm Potsdam-Telegrafenberg], Zeiss Triplet,
Astrograph (four 10-cm Tessar f/6 cameras),
[3.5m APO, ROSAT PSPCC, Heidelberg Koenigstuhl
(24), Bruce Astrograph, Calar Alto (493),
Schmidt], Grosser Refraktor, [ROSAT HRI,
DK-1.54], Hamburger Schmidt-Spiegel,
[DFOSC_FASU], ESO 1-metre Schmidt telescope,
Great Schmidt Camera, Lippert-Astrograph, Ross-B
3", [AZT 22], Astrograph (six 10-cm Tessar f/6
cameras), 1m-Spiegelteleskop, [ROSAT PSPCB],
Astrograph (ten 10-cm Tessar f/6 cameras), Zeiss
Column access_url: 4747 values 
So – for the fields selected in the tutorial, there are 15% more images in the global Obscore image pool now than there were before APPLAUSE, and their mean observation date went a bit farther into the past. I’ve not made any statistics, but I suspect for many other fields the gain is going to be much higher. For a strong effect, try some random region covered by the Bamberg Sky Patrol on the southern sky.
But you have probably noticed the deep sigh in the annotations to the statistics above: Yes, we don’t have the spectral band for the APPLAUSE data, which is why the stats on em_min doesn’t change. As a matter of fact, from the Obscore data you can’t even guess whether a plate is “more red” or “rather blue”, as Obscore doesn’t have an (agreed-upon) field for “qualititive bandpass indicator”.
For some other data collections, we did map known emulsion/filter combinations to rough bandpasses (e.g., the Palomar-Leiden Trojan Survey, which only had a few of them). For APPLAUSE, there are 435 combinations of filter and emulsion (that’s a VOTable link that you can paste into TOPCAT’s load button in order to have a look at the table). Granted, quite a few of these pairs are (more or less) spurious because of inconsistent spelling. But we still gave up on researching the bandpasses even before we started.
If you’re a photographic plate buff: You could help us and posteriority a lot if you could go through this list and at least for some combinations tell us what, roughly, the lower and upper limits of the corresponding bandpasses might have been (what DaCHS already knows, plate-relevant data near the bottom of the file). As usual, send mail to firstname.lastname@example.org if you have anything to contribute.
Finally, here’s the brief explanation of the image for this article: Well, I wanted to find some Bamberg Sky Patrol images for a single field to play with. I knew they were primarily located in the South, and were made using Tessar cameras. So, I ran
SELECT t_min, access_url, s_region
WHERE instrument_name like '%Tessar%'
AND 1=CONTAINS(POINT(345, -38), s_region)
on GAVO’s TAP service. Since Aladin 10, you can do that from within the program (although some versions will reject this query because they mistakenly believe the ADQL is bad. Query through TOPCAT and send the result over to Aladin if that bites you). Incidentally, when there are s_region values in Obscore tables, it’s a good idea to use them as I do here, as it’s quite a bit more likely that this query will use indices than some condition on s_ra and s_dec. But then not all services fill s_region properly, so for all-VO queries you will probably want to make do with s_ra and s_dec.
From that result I first made the inset bar graph in the article image to show the temporal distribution of the Patrol plates. And then I grabbed two (rather randomly selected) plates and had Aladin produce a red-blue composite of them. Whatever is really red or really blue in that image may correspond to a transient event. Or, as certainly the case with that little hair (or whatever) that shines out in blue, it may not.
Using and re-using is of course what the Virtual Observatory is about, and we’ve been keeping fairly large plate collections in our data center for quite a while (among them the Archives of Landessternwarte Königstuhl or the Palomar-Leiden Trojan surveys, and there is the WFPDB TAP-accessibly). Therefore, people from GAVO Heidelberg have been to all past astroplate conferences.
For this one, I brought a brand-new tutorial on plate scans in the VO, which, I hope, also works as a general introduction to image discovery in the VO using SIAP, Datalink, and Obscore. If you’re doing image stuff now and then, please have a quick look at the thing – I am particularly grateful for hints on what to improve or perhaps particularly obvious use cases for the material discussed.
Such VO proselytising aside, the conference is discussing the wide variety of creative, low-cost data collectors out there as well as computer-aided re-analysis extracting new knowledge from decades-old data. If I had to choose a single come-to-think-of-it moment, it would be Norbert Zacharias’ observation that if you have a well-behaved object and you’d like to know where it was in 1900, it’s now more accurate to extrapolate Gaia astrometry to the epoch of observation than to measure it on the plate itself. Which is saying a lot about the amazing feat of engineering that Gaia is.
This is not, however, an argument for dumping the old data. Usually, it is exactly what is not so well-behaved (like those) that’s interesting – both in terms of astrometry and in terms of photometry (for which there’s a lot more unruly behaviour in the first place). To figure out how objects don’t behave well, and, for objects disguising as well-behaved only on time scales of the (say) Gaia mission, which these are, the key is “old” data. The freshness of which we’re discussing this week.
When I started working on the VO in 2007, my collagues in Garching already had a software that implemented major parts of the simple spectral access protocol (SSAP) that was being developed back then. It would publish spectra in the FITS format by just blindly dumping all header cards into a database table and then defining a view over that “raw” metadata table to make the whole thing match SSAP’s expectations for how the output table should look like. Sometimes you could just map through a header to an SSA column, sometimes you would just convert a unit, sometimes you would have to write a fairly complex SQL expressions combining multiple fields.
Back then, I didn’t like it – why have two things (a table and a view) that can break when one (just a table in SSA’s format) would do, too? Also, SSAP has about 50 metadata fields, but lets you put constant values into VOTable PARAMs, which seemed a very reasonable way to attain more compact responses. So, when DaCHS grew SSAP support, I defined a mixin (essentially, a configurable interface definition) that let operators define SSA tables and their constant parameters in a fairly simple fashion and directly produced a table you could base your SSAP service on.
That made assumptions about which pieces of metadata are constant and which are not; for instance, the original mixin (“hcd” for “homogeneous collection”) assumed all spectra in a data collection came from the same instrument and had the same resolution and (what was I thinking?) SNR. Unsurprisingly, that broke fairly soon. So, I added a second mixin (“mixc”) for when different instruments or codes produced the data.
But even that was headache, at the latest when I started making time series services using SSAP. And I had to fix a few bugs in the mixins themselves in the meantime, which mostly required re-imports of the data in that design. Such re-imports are non-trivial when you have millions of spectra, and they need to happen at software upgrade time or the services would break with the upgrade. Ouch.
It was about mid-2018 when it dawned on me that sometimes it’s better to have two things that can break even if one would do, after all. Specifically, if fixing the one thing is expensive, it’s an excellent idea to put a facade on top of it that’s cheap to change and can already be used to repair most deficiencies. Why re-build the house if a paint job does the trick?
So, I’ve now come up with a view-based solution in DaCHS, too: the //ssap#view mixin. This is a bit less radical than the Garching software of 2007, as it doesn’t dump raw headers but instead lets you do the primary transformations in the RD. But it no longer constrains what pieces of metadata should be constant and which may vary between spectra, and it uses the same names for the same pieces of metadata throughout (which also is a step forward over the old SSAP mixins).
With this, DaCHS operators should no longer use the hcd and mixc mixins for new services. The new technique is already reflected the respective tutorial chapter, and the SSAP template (you’re using dachs start, aren’t you?) now uses it, too.
If you have a spectra publishing project in your pipeline, this would be the perfect time to upgrade to the DaCHS 1.2.4 beta, which has the new mixin. It would be great if we could iron out remaining wrinkles before the next release makes changes a load on my conscience.
As to migrating existing SSAP services: Well, it would be great if I could drop the old mixins in a couple of years, as they cause quite a bit of uglyness in DaCHS’s built-in //ssap RD. But the migration regrettably isn’t straightforward, so you may want to wait a bit before embarking on that journey (I’ll be happy to help, though).
Last friday, I’ve uploaded a first working draft of VODataService 1.2 to the IVOA documents repository. That’s the first major step in updating a standard, and it’s an invitation to everyone to have a look and comment.
Foof, you might say, what do I care? I’ve not even heard of that standard.
Well, but you’ve probably used it. VODataService is (among several other things) the standard that governs how a TAP service tells clients (TOPCAT, say) what tables it has and what’s inside of them. So, if you see in TOPCAT that there is a column named ang_error with a unit of deg, a UCD of stat.error;pos and the meaning “1 σ confidence radius of the position”, that most likely came in a document standardised by VODataService.
The question of what (TAP) services can tell clients about their table set is one major open point: Do we want additional metadata there? This article’s image, for inspiration, shows a screenshot of extended metadata Grégory delivers to browsers on his ARI-Gaia service; among this are minima, maxima, means, standard deviations, quartiles, and fill factors (i.e., how many of the columns are NULL). He even shows histograms of the values’ distributions and HEALPix maps showing how (the means of) the values vary on the sky. Another example of extended metadata could be footnotes as you will find them on many of my resources’ reference URLs (example; footnotes are, unsurprisingly, near the foot of that page).
We could define interoperable means to communicate information like this. The question is: does the added value justify the complication in implementation? This is where it would be great if you weighed in, in particular if you are a “mere” TAP user: Are there any such pieces of metadata you’ve always wanted to see in your TAP interfaces? Oh, and metadata of course can also be added to tables rather than columns. The current draft already lets services communicate the number of rows in each table – is there more “simple”, table-specific metadata of this sort?
VODataService furthermore deals with several other topics; for instance, the STC in the registry business I’ve blogged about in February is going to be standardised here (update on this: spectral coverage is no longer in wavelength but in energy). Other changes are rather more technical in nature, like several new resource types that will improve the discovery of tables and other such resources, or a careful adjustment of some features to keep them in line with TAP evolution.
But don’t let the technicalities scare you away – just have a peek, and if you have thoughts on any of the VODataService topics: I’m just a mail away.
The other day I gave one of my improvised live demos (“What, roughly, are you working on?”) and I ended up needing to translate identifiers from the Henry Draper Catalogue to modern positions. Quickly typing “Henry Draper” into TOPCAT’s TAP search window didn’t yield anything useful (some resources only using the HD, and a TAP service that didn’t support uploads – hmpf).
Now, had I tried the somewhat more thorough WIRR Registry interface, I’d have noted the HD catalogue at VizieR and in particular Fabricius’ et al’s HD-Tycho 2 match (explaining why they didn’t show up in TOPCAT is a longer story; we’re working on it). But alas, I didn’t, and so I set out to produce a catalogue matching HD and Gaia DR2, easily findable from within TOPCAT’s TAP client. Well, it’s here in the form of the hdgaia.main table in our data center.
Considering the nontrivial data discovery and some yak shaving I had to do to get from HD identifiers to Gaia DR2 ones, it was perhaps not as futile an exercise as I had thought now and then during the preparation of the thing. And it gives me the chance to show a nice ADQL technique to locate outliers.
In this case, one might ask: Which objects might Annie Cannon and colleagues have misclassified? Or perhaps the objects have changed their spectrum between the time Cannon’s photographic plates have been taken and Gaia observed them? Whatever it is: We’ll have to figure out where there are unusual BP-RPs given the spectral type from HD.
To figure this out, we’ll first have to determine what’s “usual”. If you’ve worked through our ADQL course, you know what to expect: grouping. So, to get a table of average colours by spectral type, you’d say (all queries executable on the TAP service at http://dc.g-vo.org/tap):
avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
count(*) as ct
group by spectral
– apart from the join that’s needed here because we want to pull photometry from gaia, that’s standard fare. And that join is the selling point of this catalog, so I won’t apologise for using it already in the first query.
The next question is how strict we want to be before we say something that doesn’t have the expected colour is unusual. While these days you can rather easily use actual distributions, at least for an initial analysis just assuming a Gaussian and estimating its FWHM as the standard deviation works pretty well if your data isn’t excessively nasty. Regrettably, there is no aggregate function STDDEV in ADQL (you could still ask for it: head over to the DAL mailing list before ADQL 2.1 is a done deal!). However, you may remember that Var(X)=E(X2)-E(X)2, that the average is an estimator for the expectation, and that the standard deviation is actually an estimator for the square root of the variance. And that these estimators will work like a charm if you’re actually dealing with Gaussian data.
So, let’s use that to compute our standard deviations. While we are at it, throw out everything that’s not a star1, and ensure that our groups have enough members to make our estimates non-ridiculous; that last bit is done through a HAVING clause that essentially works like a WHERE, just for entire GROUPs:
avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
count(*) as ct
group by spectral
This may look a bit scary, but if you read it line by line, I’d argue it’s no worse than our harmless first GROUP BY query.
From here, the step to determine the outliers isn’t big any more. What the query I’ve just written produces is a mapping from spectral type to the means and scales (“µ,σ,“ in the rotten jargon of astronomy) of the Gaussians for the colors of the stars having that spectral type. So, all we need to do is join that information by spectral type to the original table and then see which actual colors are further off than, say, three sigma. This is a nice application of the common table expressions I’ve tried to sell you in the post on ADQL 2.1; our determine-what’s-usual query from above stays nicely separated from the (largely trivial) rest:
with standards as (select spectral,
avg(phot_bp_mean_mag-phot_rp_mean_mag) as col,
power(avg(phot_bp_mean_mag-phot_rp_mean_mag), 2)) as sig_col,
count(*) as ct
group by spectral
join gaia.dr2light using (source_id)
– and that's a fairly general pattern for doing an initial outlier analysis on the the remote side. For HD, this takes a few seconds and yields 2722 rows (at least until we also push HDE into the table). That means you can keep 99% of the rows (the boring ones) on the server and can just pull the ones that could be interesting. These 99% savings aren't terribly much with a catalogue like the HD that's small by today's standards. For large catalogs, it's the difference between a download of a couple of minutes and pulling data for a day while frantically freeing disk space.
By the way, that there's only 2.7e3 outliers among 2.25e5 objects, while Annie Cannon, Williamina Fleming, Antonia Maury, Edward Pickering, and the rest of the crew not only had to come up with the spectral classification while working on the catalogue but also had to classify all these objects manually, this is an amazing feat even if all of those rows actually were misclassifications (which they certainly aren't) – the machine classifiers of today would be proud to only get 1% wrong.
The inset in the facsimile of Annie Cannons notebooks above shows how the outliers are distributed in color space relative to the full catalogue, where the basic catalogue is in blue and the outliers (scaled by 70) in red. Wouldn't it make a nice little side project to figure out the reason for the outlier clump on the red side of the histogram?
1I'll not hide that I was severely tempted to undo the mapping of object classes to – for HD – unrealistic magnitudes (20 .. 50) but then left the HD as it came from ADC; I still doubt that decision was well taken, and sure enough, the example query above already has insane constraints on m_v reflecting that encoding. From today's position, of course there should have been an extra column or, better yet, a different catalogue for nonstellar objects. Ah well. It's always hard to break unhealty patterns.
(Warning: No astronomy-relevant content at all this time).
But on the other hand, there are some scenarios in which HTTPS makes sense, and I can remotely fantasise that some of those may even be relevant to the VO. And people have been asking for HTTPS in DaCHS a number of times, at times even because their administrations urged them to switch. So, here it is, hopefully. Turning it on is reasonably easy when you use Letsencrypt (which in particular entails having ports 80 and 443); the section on Letencrypt in the operator’s guide tells what to do. In particular don’t forget the cron job, because without it, things would break after three months (when the initial certificate expires).
Things get difficult after that. For one, if your box is known under several names (our data center, for instance, can be reached as any of dc.g-vo.org, vo.uni-hd.de, and dc.zah.uni-heidelberg.de; this of course also includes things like www.example.org and example.org), you’ll now have to tell DaCHS about it in the new [web]alternateHostnames configuration item; for instance, we have
And then the Registry has to know you have https. There’s actually no convention for that in the VO yet. But since I’d really like to have at least fallback interfaces with plain HTTP, we’ll have to come up with something. For now, my plan is to have the alternative protocol (i.e., HTTPS for sites that have an HTTP-serverURL and vice versa) using the brand-new VOResource 1.1 mirrorURLs (in RegTAP 1.1, they are in the mirror_url column rr.interface). To make DaCHS declare the alternate URLs, set [web]registerAlternative to True.
Another change I’ve introduced for HTTPS is that the default HTML template for the form renderer (i.e., the one people use who come with a browser) now suppresses the SAMP button if the request came in through HTTPS; that’s because WebSAMP doesn’t work with HTTPS and probably never will – at least I can’t see a way to make it happen without totally wrecking what security guarantees HTTPS gives.
All this doesn’t yet cater for the case when you use a reverse proxy to terminate HTTPS. If you are in that situation, please talk to me so we can figure out a sane way for you explain to DaCHS what to tell the Registry.
Oh, and if you’re a client developer: Our data center is now reachable through HTTPS (at https://dc.g-vo.org), and we already have pushed the records with mirrorURLs declaring HTTPS support to the RegTAP service at dc.g-vo.org (the others will have to wait a bit longer, as we haven’t re-published our registry records yet (it’s all experimental, after all).
Today I published a nice new service on our TAP service: The Bayestar17 3D dust map derived from Pan-STARRS 1 by Greg Green et al. I mention in passing that this was made particularly enjoyable because Greg and friends put an explicit license on their data (in this case, CC-BY-SA).
This dust map is probably a fascinating resource by itself, but the really nifty thing is that you can use it to correct all kinds of photometric data for extinction – at least to some extent. On the Bayestar web page, the authors give some examples for usage – and with our new service, you can use TAP as well to correct photometry for extinction.
To see how, first have a look at the table metadata for the prdust.map_union table; this is what casual users probably should look at. More specifically, at the coverage, best_fit, and grdiagnostic columns.
coverage here is an interval of 10-healpixes. It has to be an interval because the orginal data comes on wildly different levels; depending on the density of stars, sometimes it takes the area of a 6-healpix (about a square degree) to get enough signal, whereas in the galactic plane a 10-healpix (a thousandth of a square degree) already has enough stars. To make the whole thing conveniently queriable without exploding a 6-healpix row into 1000 identical rows, larger healpixes translate into intervals of 10-helpixes. Don’t panic, though, I’ll show how to conveniently query this below.
best_fit and grdiagnostic are arrays (remember the light cuves in Gaia DR2?). In bins of 0.5 in distance modulus (which is, in case you feel a bit uncertain as to the algebraic signs, 5 log10(dist)-5 for a distance in parsec), starting with a distance modulus of 4 and ending with 19. This means that for a distance modulus of 4.2 you should check the array index 0, whereas 4.3 already would be covered by array index 1. With this, best_fit[ind] gives E(B-V) = (B-V) – (B-V)0 in the direction of coverage in a distance modulus bin of 2*ind+4. For each best_fit[ind], grdiagnostic[ind] contains a quality measure for that value. You probably shouldn’t touch the E(B-V) if that measure is larger than 1.2.
So, how does one use this?
To try things, let’s pull some Gaia data with distances; in order to have interesting extinctions, I’m using a patch in Cygnus (RA 288.5, Dec 2.3). If you live on the northern hemisphere and step out tonight, you could see dust clouds there with the naked eye (provided electricity fails all around, that is). Full disclosure: I tried the Coal Sack first but after checking the coverage of the dataset – which essentially is the sky north of -30 degrees – I noticed that wouldn’t fly. But stories like these are one reason why I’m making such a fuss about having standard STC coverage representations.
We want distances, and to dodge all the intricacies involved when naively turning parallaxes to distances discussed at length
in a paper by Xavier Luri et al (and elsewhere), I’m using precomputed distances from Bailer-Jones et al. (2018AJ….156…58B); you’ll find them on the “ARI Gaia” service; in TOPCAT’s TAP dialog simply search for “Gaia” – that’ll give you the GAVO DC TAP search, too, and that we’ll need in a second.
The pre-computed distances are in the gaiadr2_complements.geometric_distance table, which can be joined to the main Gaia object catalog using the source_id column. So, here’s a query to produce a little photometric catalog around our spot in Cygnus (we’re discarding objects with excessive parallax errors while we’re at it):
r_est, 5*log10(r_est)-5 as dist_mod,
phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
AND 1=CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 288.5, 2.3, 0.5 ))
The color-magnitude diagram resulting from this is the red point cloud in the animated GIF at the top. To reproduce it, just plot phot_bp_mean_mag-phot_rp_mean_mag against phot_g_mean_mag-dist_mod (and invert the y axis).
De-reddening this needs a few minor technicalities. The most important one is how to match against the odd intervals of healpixes in the prdust.map_union table. A secondary one is that we have only pulled equatorial coordinates, and the healpixes in prdust are in galactic coordinates.
Computing the healpix requires the ivo_healpix_index ADQL user defined function (UDF) that you may have met before, and since we have to go from ICRS to Galactic it requires a fairly new UDF I’ve recently defined to finally get the discussion on having a “standard library” of astrometric functions in ADQL going: gavo_transform. Here’s how to get a 10-healpix as required for map_union from ra and dec:
gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER)
The CAST call is a pure technicality – ivo_healpix_index returns a 64-bit integer, which I can’t use in my interval logic.
The comparison against the intervals you could do yourself, but as argued in Registry-STC article this is one of the trivial things that are easy to get wrong. So, let’s use the ivo_interval_overlaps UDF; it goes in the join condition to properly match prdust healpixes to catalog positions. Then our total query – that, I hope, should be reasonably easy to adapt to similar problems – is:
WITH sources AS (
gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER) AS hpx,
ROUND((dist_mod-4)*2)+1 AS dist_mod_bin
phot_bp_mean_mag-phot_rp_mean_mag-dust.best_fit[dist_mod_bin] AS color,
dust.best_fit[dist_mod_bin]*3.384 AS abs_mag,
dust.grdiagnostic[dist_mod_bin] as qual
JOIN prdust.map_union AS dust
ON (1=ivo_interval_has(hpx, coverage))
(If you’re following along: you have to switch to the GAVO DC TAP to run this, and you will probably have to change the index after TAP_UPLOAD).
Ok, in the photometry department there’s a bit of cheating going on here – I’m correcting Gaia B-R with B-V, and I’m using the factor for Johnson V to estimate the extinction in Gaia G (if you’re curious where that comes from: See the footnote on best_fit and the MC extinction service docs should get you started), so this is far from physically correct. But, as you can see from the green cloud in the plot above, it already helps a bit. And if you find out better factors, by all means let me know so I can add an update… right here:
Update (2018-09-11): The original data creator, Gregory Green points out that the thing with having a better factor for Gaia G isn’t that simple, because, as he says “Gaia G is very broad, [and] the extinction coefficients are much more dependent on stellar type, and extinction is also more nonlinear with dust column (extinction is only linear with dust column and independent of stellar type for an infinitely narrow passband)”. So – when de-reddening, prefer narrow passbands. But whether narrow or wide: TAP helps you.