• Registry: A Janitor Speaks Out

    I sometimes claim the reason I like working on the VO Registry is that I am a librarian at heart. Perhaps there is some truth to that, in that ugly metadata does make me unhappy – but beyond that, it also makes the Virtual Observatory look or even work a good deal worse than it should.

    Given that, in this post I'm afraid I will sound more like a grumpy janitor than a wise librarian, but let me still attempt to contribute to better metadata by pointing out a few things to watch out for when writing a resource record. People consuming resource records (i.e., VO-using astronomers) are welcome here, too: when you encounter antipatterns mentioned here, a polite complaint to the service publisher is entirely a good thing.

    Note that I am using real metadata found in the registry – in case you recognise some of own records, do not feel reprimanded individually. Most of the problems I discuss here are really common at this point, and thus if I picked your metadata, that was mere bad luck. I actually picked some of my own occasionally (but duly fixed the problem then).

    Missing Coverage

    Since VODataService 1.2, you can say what part of the sky, spectrum, and time your resource covers. That is incredibly useful metadata in practice. Spatial coverage, for instance, is used in Aladin like this:

    Screenshot: Resource names in white, orange and green, and a part of the sky (h and χ Persei) next to them

    Green means: these services could have data for the patch of sky shown, orange means don't bother with these, and white means: No idea because the resource does not declare its coverage.

    Similarly, it would be great if researchers or clients could reliably say:

    SELECT * FROM rr.resource JOIN rr.stc_spectral WHERE
      1=ivo_interval_overlaps(spectral_start, spectral_end,
          ivo_specconv(658, 'nm', 'J'), ivo_specconv(654, 'nm', 'J'))
    

    to find resources having data covering the Hα line on the spectral axis. Currently, that's just 2064 resources, and given that Hα sits smack in the middle of the optical window that's an indication that far too few resources say where they are.

    So – add STC coverage to your data today. It's not hard with pymoc or pgsphere and chapter 3.2 of VODataService. DaCHS operators will probably get by just studying the corresponding section of the tutorial.

    Broken Author Names

    On the ADS, last time I had information on that, about 90% of the queries were by author. In the VO registry, by my unscientific estimate, less than 5% of queries constrain authors. Sure, people look for literature and data in different ways and for different purposes, but an important reason for the difference still is that we don't do a good job giving creator/name (which contains the equvialent of the author name).

    The ideal format is to have last name first, then a comma, and then abbreviated initials or full first names, as in von der Heide, J.. Many names in the VO are almost in this format do not have a comma; but the comma makes parsing these names a lot simpler, so please put it in. Of all the forms to write names in, that's most easily constrained without guessing how many first names are where. Remember, there are people out their with names like „Kirsten-Claude Selim de Vaucouleurs-van der Heide Lobos“ (or, for that matter, Uthamadhanapuram Venkatasubbaiyer Swaminatha Iyer), and a computer cannot efficiently decide where the last name starts in first name first order (and conversely, without the comma in last name first order, it has a hard time figuring out where the last name stops). Also, last name first almost always gives a more useful natural sort order.

    Realistically, people will have to live with J. von der Heide, too, so author searches in the VO will have to look like LIKE '%von der Heide%' for some years to come, but let's at least try to improve. And whatever you do, don't do any of (in approximate order of severity):

    • Dump in half an acknowledgement, e.g., under a cooperative agreement with the NSF on behalf of the Gemini partnership: the National Science Foundation (United States), or, about as bad: provided by S. Snowden from data by Dickey and Lockman – that's useless for author searches but invites lots of false positives
    • Dump more than one name into one creator/name element, e.g., Zhuang Z.,Kirby E.N.,Leethochawalit N.,de los Reyes M.A.C. or Voges, W.; Aschenbach, B.; Boller, Th.; (and ~200 more characters) – that's really hard to search and essentially impossible to use for, e.g., author datagraphies
    • Include affiliations (the VO can't properly deal with those yet), e.g., Zub M. (The PLANET Collaboration) or a combination of this and the previous: Zhu W. (The Spitzer team) Dominik M.
    • Forget citation debris, e.g., et al. MNRAS (in press), or, shockingly common: and Scheck M.; of course, entire citations (WALKER I. Astron. J. 106) are inappropriate, too – all of this will prevent the use of meaningful name constraints
    • Give a bibcode: 2014ApJ...787...78M – this likely belongs into content/source
    • Have empty author name elements (as, at this moment, 13 records)
    • Cheat with effectively empty author names: <NOT GIVEN>, or "We forgot to give credit, please complain"
    • Go all uppercase, e.g., ZINNECKER H. – standards-compliant ADQL string comparisons are case-sensitive, and case-folding would require special indexes. Perhaps case-insensitive author matches should be made easier in that van der Waals is probably the same person as Van der Waals, but for now that's not how it works right now. And I don't think that will change any time soon, because if I have learned one thing in my life it is that case insensitivity is almost always evil
    • Have just a first name: walter or W.I. or W-J
    • Combine author lists from different contributing papers: Wright et al.; Griffith, Wright, Burke, Ekers; Griffith, Wright – if you really need to do something like this, merge the two author lists – and then of course use one name per creator element

    In principle, these considerations would apply to contributors, contacts and perhaps publishers, too, but since I don't think people should use these in discovery queries, their format does not matter too much: If they're human-readable, that's enough.

    Fragile Contact Info

    Quite regularly I need to ask people to fix something in their publishing registries, and then it's really useful to have reliable contact information. That's also nice for VO users; pyVO, for instance, has the get_contact method on registry records, and in WIRR, you can pop up contact info on all records:

    a screenshot showing a match in a registry query.  A subwindow is popped up that shows a mail address and a telephone number of a “GAVO Data Center Team“.

    For that to work, personal addresses in the contact information are really dangerous – it is my experience that these break significantly more often than institutional addresses. So, please avoid things like (I'm making all of these up because there may still be folks around harvesting mail addresses to send spam):

    • a.b.miller-parachtnix@gmail.com (well: avoid using gmail.com unconditionally)
    • friederike.student@ari.uni-heidelberg.de

    Rather, create an alias that you can hand on and that perhaps is even a bit speaking. This could be:

    • vo-help@great-telescope.org
    • gavo@ari.uni-heidelberg.de
    • uni-hd-vo@posteo.de (in case your own institution absolutely loathes the idea of addresses not bound to persons)

    Non-machine-readable Subjects

    VOResource 1.1 said that subjects are to be taken “from the UAT” (that's the Unified Astronomy Thesaurus), but failed to say what exactly that means. Since last July, this is properly defined: Use fragment identifiers into http://www.ivoa.net/rdf/uat, that is, something like abell-clusters.

    Having all subject keywords in a predictable format, with useful metadata, and part of a proper hierarchy enables all kinds of cool stuff, and hence it would be great if we could stomp out the following sorts of mispractice in the VO:

    • Multiple things in one subject element: ATLAS DR1, SIAP, Images – have one term per subject element
    • Undefined NULL values: NOT PROVIDED, ??? – if you really cannot find a pertinent term, use astronomical-research (or one of the other top-level terms). If nowhere else, that at least helps when your record moves to interdisciplinary search engines
    • Random free text: optical lines equivalent width catalog – that's multiple terms rolled into one, and the machine will not know what it means
    • Project or instrument names: 6dF Data Release 3 Spectra, COROT N2 – there's the instrument metadata for some uses of that. For the rest, see above on having projects in creator/name.
    • Protocol names: TAP – that's what capabilities are for
    • Service titles: CADC image/cube HiPS service – that's what the title element is for
    • Non-UAT keyword schemes: Galaxy:general – let's not force VO components to learn about multiple keyword systems. If you are missing something from the UAT, tell them about it

    Unfulfilling Resource Descriptions

    Descriptions of VO resources serve a dual purpose: The should give researches a quick idea of what to expect and not expect of a resource, and they should mention all the important buzzwords for the benefit of full-text searches. Hence, if you only have two words as in:

    Survey (LoLSS).

    or have something like a title:

    Convolution of normalized synthetic stellar spectra.

    or use somewhat uncommon abbreviations and technical details that probably will not help much during data discovery:

    USET Group form

    (what group? Does „form“ really mean „web browser-facing“? If so, that's again better expressed through the capabilities), you should work a bit on your description.

    It is usually helpful to start the description with „this service is…“ or something similar. While it's marginally ok to mention terms and conditions like:

    When referencing results from this online catalog, please cite &lt;a href="https://iopscience.iop.org/article/10.384

    further down in the description (the proper place for this kind of thing is the rights element, though), don't discuss stuff like this before you have told people what is in the “online catalog” in the first place. Also: registry records are like e-mail in that you shouldn't use HTML anywhere in registry metadata. If you have to include URLs in text for human consumption, just put them in as text.

    Talking about markup: You cannot rely on any of that in descriptions. Even basic ASCII art (or, well, tables) will always come out bad:

    Only the data from the first catalog that was matched is reported here according to the following priority list. This means for example, if a star was matched with Hipparcos, that information was used while possible other catalog data are not listed here. -------------------------------------------------------- # stars flg catalog -------------------------------------------------------- 53500 0 no catalog match 55549 1 Hipparcos 254 2 Yale Parallax Catalog 1041 3 Finch and Zacharias 2016 (UPM NNNN-NNNN) 1431 4 MEarth parallaxes 402 5 SIMBAD Database (w/parallax) -------------------------------------------------------- 112177 total number stars in catalog -------------------------------------------------------- Not all parallaxes from the...

    (of course, that in this case the newlines and longer sequences of blanks have been normalised to single blanks already in the original resource record makes it particularly certain that the table will come out wrong).

    And where in titles abbreviations are usually a good thing, in particular when you can expect your target audience too look for the abbreviation rather spelled-out names in discovery queries, in descriptions you have space, and hence you normally should explain MCQA as „Monte Carlo Quality Assessment“ in something like the following:

    Herschel sources in Planck fields measured at 350 µm MCQA

    Remember: The people who read your descriptions may come from the future (as in: 25 years from now) or at least may be unfamilar with your project's jargon.

    Lame Relationships

    There are an incredible 136958 relationships in the current VO that have related-to as their relationship type. This is deplorable because the relevant vocabulary, https://www.ivoa.net/rdf/voresource/relationship_type, marks it as deprecated, and that's for a good reason: Just stating “some relationship“ between two resources is rarely useful. Decide what the relationship is and then pick a proper term (or, if that does not exist, prepare a VEP).

    Missing Tablesets

    Tablesets are a VODataService feature giving metadata on the return table (or, in the case of the flexible TAP services, the queried tables). They are really useful if you look for services returning some sort of physics – and if you are running TAP services, they will one day let me shut down the GloTS service that replicates a good deal of registry functionality for no good reason at all.

    So, if you have a catalog service and your registry record ends somewhat like this:

      </capability>
    </ri:Resource>
    

    it is almost certainly missing a tableset (which would normally go after the capabilities; you are probably also missing the sky coverage, though, because that would sit there, too).

    Writing basic tablesets is not hard. In fact, if you are running a TAP service, you have a working tableset on your service's tables endpoint. But even without VOSI tables, making a tableset from the VOTable you return is straightforward – with a few encouraging words, I could be talked to write a few lines of Python that do that.

    I will readily admit that writing good tablesets is more involved, but what is hard about it you should be doing anyway, because it also will improve the VOTables that you write, and hence the usability of your data all around. So, until the end of this post let me look at some common warts of the column metadata in today's VO.

    Deficient Column Descriptions

    Column descriptions like ?, ??, or even ??? are surprisingly common. Please don't do that. If you really have no idea what your upstream has put into a column, admit that, aplogise and try to make your upstream explain.

    And while RA somewhat works among astronomers, a word or two on the reference system (“IRCS”) and an informal provenance (“from PSF fits”) would certainly be much appreciated by your users and might even come handy in discovery.

    Or consider “Age” – this could immediately be improved by revealing just what has aged here and, again, some hint on how the age was estimated (e.g., “obtained from ivo://foo.bar/res” versus “by isochrone fitting”).

    But don't overdo it, either: Do not include entire footnotes in descriptions, because that will lead to many false positives in full text searches (not to mention slow down the Registry as a whole if this became common practice). DaCHS operators: you can have footnotes in your RD by using note meta items; cf. Typed Meta Elements in the DaCHS reference.

    Near the upper limit of what is appropriate in a column description is perhaps something like this:

    The 2.5 percentile of the Log total SFR PDF. This is derived by combining emission line measurements from within the fibre where possible and aperture corrections are done by fitting models ala Gallazzi et al (2005), Salim et al (2007) to the photometry outside the fibre. For those objects where the emission lines within the fibre do not provide an estimate of the SFR, model fits were made to the integrated photometry.

    – but at the same time it illustrates how you can provide a lot of information that helps casual users.

    The position angles I will turn to in a second give another nice example of why human-readable descriptions are so important: There is no reliable convention of the direction and the baseline of these, so stating something like „north over east“ in a description will avoid a lot of head-scratching.

    Column UCDs: Missing, Outdated, or Useless

    A very plausible discovery scenario involves UCDs: „give me resources with (some photometry | redshifts | kinematics | dynamics | positions on earth)“. Hence, make sure your columns' metadata has predictable and halfway correct UCDs.

    Sure, that's not always straightforward (note, by the way, that there is a reasonably simple process to suggest new UCDs), but there's no excuse for there being 117 columns called pa without any UCD, where pos.posAng will almost certainly fit all of them (though, who knows: 30 of these in addition don't even have a description).

    To make sure the UCDs you assign exist, run them through astropy at least once. Do not ignore complaints by astropy; it is actually preferable to have no UCD rather than “??” (which currently a whopping 30342 column sport, in addition to which we have 41 times “???“ and 70 times “????“[1]). Also, resist the temptation to freely invent things, such as the “mjd” UCD I'm seeing on 13 columns. In this particular case, by the way, I give you that saying “this column contains MJDs“ has been a pain in VOTables for a long time, but since version 1.4, TIMESYS lets you do that in a reasonable way.

    Oh, let me qualify the “freely invent“ in the last paragraph: It could be[2] that MJD has actually been part of the original UCDs you may still know from cone search (“POS_EQ_RA”); that people have not updated their metadata from these ancient days is also the reason I'm still seeing 13827 columns with an (invalid) UCD of “error“ in column metadata (and 84 with pos_eq_dec).

    Unrelatedly (though with an undisputable entertainment value): the longest UCD in the current VO is meta.code;phot.flux.density;arith.ratio;em.ir.15-30um;em.radio.750-1500mhz; unless I and astropy are missing something, it's even syntactically correct.

    Bad Units

    While I do not see many discovery scenarios that would make good use of units, do not forget to update your units to VOUnits when you touch up your tablesets. This will let software like astropy do the unit calculus for its users, which is a win overall. It cannot do that if you ignore VOUnits and write, say, ABmag/arcsec2 – the AB part you will have to communicate in the description for now, and exponentiation is ** in VOUnits.

    Recent versions of the stilts validators (votlint, taplint) will complain about bad units. And you can use stilts interactively to figure out whether you got it right:

    $ stilts calc 'vounitStatus("ABmag/arcsec2")'
      BAD_SYNTAX
    $ stilts calc 'vounitStatus("mag/arcsec**2")'
      OK
    

    [In a previous version of this post, I have given a piece of astropy to do unit checking; it turns out that astropy by default is rather forgiving, and you want stilts on your box anyway; why not use it for unit validation? If your stilts says something about “bad expression“ with the command lines above, it's an indication that you should update it.]

    And with this somewhat non-registry topic: Go forth and polish your resource records. Or, as a consumer of such metadata, ask the publishers of bad resource metadata to fix it.

    [1]Remarkably, there are no ????? or even longer sequences of question marks, and even more remarkably, nobody has put in a lonely question mark. If someone versed in cognitive psychology has a plausible interpretation for that fact: can you share it with me?
    [2]Since the original UCDs predate my VO involvement and, for all I know, never were properly standardised, I frankly can't say.
  • Updates to GAVO's Tutorials

    Over the years, GAVO has produced a number of VO tutorials, i.e., texts that introduce some technique related to using the Virtual Observatory, preferably within some halfway plausible scenario. In effect, they are software documentation, and as software itself, software documentation suffers from bit rot. To work against that, the tutorials have to be revised occasionally.

    My two student assistants Sonja Gabriel and Chuanming Mao have recently done some of that revising. Let me use this opportunity to show off some of these freshly polished tutorials.

    A classic one (that has, if I may say so myself, aged rather well), is Adding catalog data to object lists using the VO. This is a thinly disguised introduction to TAP uploads, arguably the most powerful of all the VO tech to date. If you have come to this place without ever having done a TAP upload, you owe it to yourself to at least skim the tutorial and quickly follow along the few steps to do positional crossmatches with just about any astronomical catalog and with just about any level of sophistication.

    part of a screenshot: a histogram, a sky photo with overplotted points

    Another classic – it has its roots in the original Italian VO Days[1] – is TOPCAT and Aladin working together. It is using SDSS data of some galaxy cluster to try and get you to to send around data and positions between different programs using SAMP. If you are reading VO blogs, it is not unlikely this kind of thing will make you yawn. But at VO Days, it's little things like this that usually most immediately appeal to students and researchers alike.

    part of a screenshot: a color-magnitude diagram is a very narrow main sequence, and a proper motion plot

    From a tech point of view, Explore the Pleiades with TOPCAT and Aladin also mainly looks at SAMP (perhaps even somewhat less convincingly), but it's such a striking demo of what an amazing instrument Gaia is, and it's a nice introduction to TOPCAT's VO interface and subsetting facility that it's definitely worth a look, in particular as a showcase of having instant results with the VO.

    circular cloud of red crosses and blue circles in a celestial coordinate system

    An entirely different topic (well: it also employs SAMP for a moment) is covered by Data Discovery Using the Virtual Observatory Registry. This is trying to motivate looking for data collections in the VO Registry (in the form of our Browser interface to it). This tutorial has grown quite a bit during the review and now includes two sections joining data from different resources for various purposes. One section illustrates how systematics of quasar redshifts might be looked into using different sources, the other investigates the Tully-Fisher relationship in different spectral bands.

    A TOPCAT-plotted histogram with a sharp peak around 39.5 AU and a much wider one around 44.

    The tutorial on Asteroids in the Solar System was entriely overhauled. It was (and still is) mainly intended to be used in schools, and thus it originally just built on things that ran in a web browser. As is typical of things in web browsers, they have long since vanished. Hence, a rather fundamental update was necessary anyway. While we were looking for interesting things to do – the plot above, by the way, is the distribution of major halfaxes in the Kuiper belt –, we ended up even includeding a brief bit on ADQL.

    Due to its school focus, we are also offering this particular text in German as well as in English. If you are an Astronomy teacher with particularly motivated pupils , we would like to hear from you…

    An aladin window showing two aligned photos of the ring nebula in Lyra

    The last revised tutorial I would like to mention also has a somewhat special (main) target audience: Astrometric Calibration using Aladin. Admittedly, automatic, or “blind” calibration has become really great, and I think getting their images located on the sky is not much of a problem even for amateurs any more, thanks in part to services like astrometry.net. But then – sometimes there is nothing like a good, old manual, ummm, “plate” solution. Aladin and the VO make that lot less tedious than it used to be.

    Of course, I cannot have a post on tutorials without mentioning the VO Text Treasures, a web page that shows the educational material currently registered in the VO Registry. This little page also accounts for bit rot: You can sort by the time last inspected there, and thanks to Sonja's and Chuanming's efforts, our tutorials look very good in that representation at the moment.

    In case you have some material suitable for WIRR yourself: Please register it, too. Send me a mail and I will lend you a hand (or, if you are a VO pro, directly read the pertinent standard).

    [1]That's block courses on VO matters lasting a day or two. If you are in Germany, you can book us for your very own one!
  • HEALPix Maps: In General and in Gaia

    blue and reddish pixels drawing a bar on the sky.

    A map of average Gaia colours in HEALPixes 2/83 and 2/86 (Orion south-east). This post tells you how to (relatively) quickly produce such maps.

    This year's puzzler for the AG Tagung turned out to be a valuable source of interesting ADQL queries. I have already written about finding dusty spots on the sky, and in the puzzler solution, I had promised some words on creating dust maps, or, more generally, HEALPix maps of any sort.

    Making HEALPix maps with Gaia source_ids

    The basic technique is explained in Mark Taylor's classical ADASS poster from 2016. On GAVO's TAP service (access URL http://dc.g-vo.org/tap), you will also find an example for that (in TOPCAT's TAP window, check the Service-provided section unter the Examples button for it). However, once you have Gaia source_ids, there is something a lot faster and arguably not much less convenient. Let me quote the footnote on source_id from my DR3 lite table:

    For the contents of Gaia DR3, the source ID consists of a 64-bit integer, least significant bit = 1 and most significant bit = 64, comprising:

    • a HEALPix index number (sky pixel) in bits 36 - 63; by definition the smallest HEALPix index number is zero.
    • […]

    This means that the HEALpix index level 12 of a given source is contained in the most significant bits. HEALpix index of 12 and lower levels can thus be retrieved as follows:

    • [...]
    • HEALpix [at] level n = (source_id)/(235⋅412 − n).

    That is: Once you have a Gaia source_id, you an compute HEALpix indexes on levels 12 or less by a simple integer division! I give you that the more-than-35-bit numbers you have to divide by do look a bit scary – but you can always come back here for cutting and pasting:

    HEALPix level Integer-divide source id by
    12 34359738368
    11 137438953472
    10 549755813888
    9 2199023255552
    8 8796093022208
    7 35184372088832
    6 140737488355328
    5 562949953421312
    4 2251799813685248
    3 9007199254740992
    2 36028797018963968

    If you know – and that is very valuable knowledge far beyond this particular application – that you can simply jump between HEALPix indexes of different levels by multiplying with or integer-dividing by four, the general formula in the footnote actually becomes rather memorisable. Let me illustrate that with an example in Python. HEALPix number 3145 on level 6 is:

    >>> 3145//4  # ...within this HEALPix on level 5...
    786
    >>> 3145*4, (3145+1)*4  # ..and covers these on level 7...
    (12580, 12584)
    

    Simple but ingenious.

    You can immediately exploit this to make HEALPix maps like the one in the puzzler. This piece of ADQL does the job within a few seconds on the GAVO DC TAP service[1]:

    SELECT source_id/8796093022208 AS pix,
      AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.edr3lite
    WHERE distance(ra, dec, 246.7, -24.5)<2
    GROUP by pix
    

    Using the table above, you see that the horrendous 8796093022208 is the code for HEALPix level 8. When you remember (and you should) that HEALPix level 6 corresponds to a linear dimension of about 1 degree and each level is a factor of two in linear dimension, you see that the map ought to have a resolution of about 1/8th of a degree.

    HEALPix to Screen Pixel

    How do you plot this? Well, in TOPCAT, do GraphicsSky Plot, and then in the plot window LayersAdd HEALPix control (there are icons for both of these, too). You then have to manually configure the plot for the table you just retrieved: Set the Level to 8, the index to pix and the Value to avgcol – we're working on making the annotation a bit richer so that TOPCAT has a chance to figure this out by itself.

    With a bit of extra configuration, you get the following map of average colours (really: dust concentration):

    Plot: Black and reddish pixels showing a bit of structure

    This is not totally ideal, as at the border of the cone, certain Healpixes are only partially covered, which makes statistics unnecessarily harder.

    Positional Constraints using source_ids

    Due to Gaia's brilliant numbering scheme, we can do analysis by HEALpix, too, circumventing (among other things) this problem. Say you are interested in the vicinity of the M42 and would like to investigate a patch of about 8 degrees. By our rule of thumb, 8 degrees is three levels up from the one-degree level 6. To find the corresponding HEALpix index, on DaCHS servers with their gavo_simbadpoint UDF you could say:

    SELECT TOP 1 ivo_healpix_index(3, gavo_simbadpoint('M42'))
    FROM tap_schema.tables
    

    Hu, you ask, what's tap_schema.tables to do with this? Well, nothing, really. It's just that ADQL's syntax requires selecting from a table, even if what we select is completely independent of any table, as for instance the index of M42's 3-HEALpix. The hack above picks in a table guaranteed to exist on all TAP services, and the TOP 1 makes sure we only compute the value once. In case you ever feel the need to abuse a TAP service as a calculator: Keep this trick in mind.

    The result, 334, you could also have found more graphically, as follows:

    1. Start Aladin
    2. Check OverlayHEALPix grid
    3. Enter M42 in Command
    4. Zoom out until you see HEALPix indexes of level 3 in the grid.

    An advantage you have with this method: You see that M42 happens to lie on a border of HEALPixes; perhaps you should include all of 334, 335, 356, and 357 if you were really interested in the Orion Nebula's vicinity.

    We, on the other hand, are just interested in instructive examples, and hence let's just repeat our colour mapping with all Gaia objects from HEALPix 3/334. How do you select these? Well, by source_id's construction, you know their source_ids will be between 334⋅9007199254740992 and (334 + 1)⋅9007199254740992 − 1:

    SELECT source_id/8796093022208 AS pix,
      AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.edr3lite
    WHERE source_id BETWEEN 334*9007199254740992 AND 335*9007199254740992-1
    GROUP by pix
    

    This is computationally cheap (though Postgres, not being a column store still has to do quite a bit of I/O; note how much faster this query is when you run it again and all the tuples are already in memory). Even going to HEALPix level 2 would in general still be within our sync time limit. The opening figure was produced with the constraint:

    source_id BETWEEN 83*36028797018963968 AND 84*36028797018963968-1
    OR source_id BETWEEN 86*36028797018963968 AND 87*36028797018963968-1
    

    – and with a sync query.

    Aggregating over a Non-HEALPix

    One last point: The constraints we have just been using are, in effect, positional constraints. You can also use them as quick and in some sense rather unbiased sampling tools.

    For instance, if you would like so see how the reddening in one of the “dense“ spots in the opening picture behaves with distance, you could first pick a point – α = 98, δ = 4, say –, then convert that to a level 7 healpix as above (that's/88974) and then write:

    SELECT ROUND(r_med_photogeo/200)*200 AS distbin, COUNT(*) as n,
        AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.dr3lite
    JOIN gedr3dist.main USING (source_id)
    WHERE source_id BETWEEN 88974*35184372088832 and 88975*35184372088832-1
    GROUP BY distbin
    

    This is creating 200 pc bins in distance based on the estimates in the gedr3dist.main table (note that this adds subtle correlations, because these estimates already contain Gaia colour information). Since quite a few of these bins will be very sparsely populated, I'm also fetching the number of objects contributing. And then I plot the whole thing, using the conventional (n) ⁄ n as a rough estimate for the relative error:

    Plot: A line that first slowly declines, then rises quite a bit, then flattens out and becomes crazy as errors start to dominate.

    This plot immediatly shows that colour systematics are not exclusively due to dust, as in that case things would only get redder all the time. The blueward trend up to 700 pc is reasonably well explained by the brighter, bluer upper main sequence becoming more dominant in the population sampled as red dwarfs become too faint for Gaia.

    The strong reddening setting in after that is rather certainly due to the Orion complex, though I would perhaps not have expected it to reach out to 2 kpc (the conventional distance to M42 is about 0.5 kpc); without having properly thought about it, I'll chalk it off as “the Orion arm“. And after that, it's again what I'd call Malmquist-blueing until the whole things dissolves into noise.

    In conclusion: Did you know you can group by both healpix and distbin at the same time? I am sure there are interesting structures to be found in what you will get from such a query…

    [1]You may be tempted to write source_id/(POWER(2, 35)*POWER(4, 3) here for clarity. Resist that temptation. POWER returns floating point numbers. If you have one float in a division, not even a ROUND will get you back into the integer division realm, and the whole trick implodes. No, you will need the integer literals for now.
  • Computing Residuals of an Astrometric Calibration

    Two plots, left a fairly good correlation, right a cloudy wave

    The kind of plot you can make following the recipe given here: Left, a comparison of the photometry, right, a positional residuals, not taking into account the SIP plate solution, when comparing the HDAP plate B3261a against Gaia DR3. Note that the cut-off a 4 arcsec is because of the match radius when obtaining the calibrator stars.

    I recently had to assess the quality of the astrometric calibration of a photographic plate. What I am going to show you in this post will of course work just as well for CCD frames, and if these have a sufficiently large field of view, this may be an issue for them as well. However, the sort of data that needs this assessment most typically are scans of plates, as these tend to have a “wobble”, systematic offsets in the scan direction resulting from imperfections in the mechanics.

    Prerequisites: An astronomical frame with a calibration in ICRS (or some frame not very far from it), called my-image.fits in the following, SExtractor (in Debian and derivatives: apt install source-extractor – long live Debian Astro; since it's called source-extractor in Debian, that's what I'll use here, too), and of course TOPCAT.

    Step 1: Extract Sources. Source extraction is of course a high science, and if you know better than me, by all means do it the way you think is appropriate. Meanwhile, the following might very well work for you sufficiently well.

    Create a working directory and enter it. Then, to create a file telling source-extractor what columns you would like to see, write the following to a file default.param:

    ALPHA_SKY
    DELTA_SKY
    X_IMAGE
    Y_IMAGE
    MAG_ISO
    FLUX_AUTO
    ELONGATION
    

    Next, give a few parameters to source-extractor; depending on the sort of image you have, you may want to play around with DETECT_MINAREA (how many pixels need to show a signal to register as a source) and DETECT_THRESH (how many sigmas a pixel has to be above the background to register as a candidate for belonging to a source). Meanwhile, write the following into a file default.control:

    CATALOG_TYPE     FITS_1.0
    CATALOG_NAME     img.axy
    PARAMETERS_NAME  default.param
    FILTER           N
    DETECT_MINAREA   30
    DETECT_THRESH    4
    SEEING_FWHM      1.2
    

    – but if the following call gives you a few hundred sources, that ought to work for the present purpose.

    Then run:

    source-extractor -c default.control my-image.fits
    

    This will give you a catalogue of extracted objects in the file img.axy.

    Step 2: Fix source-extractor's output. Load that img.axy into TOPCAT. Regrettably, source-extractor does not add any useful metadata to the columns of its output table. To add the absolute bare minimum, in TOPCAT go to ViewsColumn Info. In that window, check UCD in the Display menu, and then put pos.eq.ra and pos.eq.dec into the UCD fields of the ALPHA_SKY and DELTA_SKY columns, respectively; double click to change fields in TOPCAT.

    To see if you have done the annotation right, in TOPCAT's main window, click GraphicsSky Plot. If the objects show up, you have just provided enough annotation to let TOPCAT figure out the position for each row.

    Step 3: Get calibrators. We will now try to add counterparts for Gaia DR3 to the extracted sources. To do that, click VOTable Access Protocol, and in the window popping up double click the entry for the GAVO DC TAP.

    In the Find box, type dr3lite to look for this site's version of the Gaia DR3 source catalogue. Click on gaia.dr3lite to select that table, and then select the Columns pane. This should show some of the Gaia DR3 columns.

    Now ExamplesUpload Join will generate a query that will cross-match your extracted sources with the Gaia sources. You should edit it a bit, only selecting the columns you will actually need, removing the TOP 1000 (at least on large images with more than 1000 sources), and reducing the match radius a bit when the calibration is not actually completely off and your epoch is sufficiently close to J2000.

    Hint: you can control-click in the Columns pane and then use the Cols button to insert all the column names in one go[1]. For me, the resulting query would be:

    SELECT
       source_id, ra, dec, phot_bp_mean_mag,
       tc.*
       FROM gaia.dr3lite AS db
       JOIN TAP_UPLOAD.t1 AS tc
       ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                     CIRCLE('ICRS', tc.ALPHA_SKY, tc.DELTA_SKY, 4./3600.))
    

    This should result in about as many matches as your extraction had – a few more is ok, because you will have some spurious matches, a few less is ok, too, as there are always some outliers and artefacts, but you should clearly not pull a magnitude more or less objects here than you put in; fiddle with the match radius as necessary.

    See if there is a rough correlation between the Gaia calibrators and your extracted sources by plotting phot_bp_mean_mag against MAG_ISO. Absent more information, MAG_ISO, source-extractor's guess for the magnitude of the extracted object, will be just some crazy number, but it should have some discernable correlation with the actual magnitude. Do not expect too much here, in particular with old plates, for which good photometry is a science of their own.

    In my example, this looked like this:

    Plot: a rough correlation in red with a green tail

    The green points certainly are spurious matches; this observation did not reach beyond 14th magnitude or so, and there are many weak stars on the sky, so a few of them will show up in just about any cross match. See the opening picture for an example with a better correlation.

    Step 4: Do the correlation plot. Do GraphicsPlane Plot and then plot ra-alpha_sky or dec-delta_sky against X_IMAGE or Y_IMAGE. You could get something like this:

    Plot: A single wavy thing

    This rather certainly reflects some optical distortion; source-extractor regrettably does not take into account SIP corrections yet, so it is likely that a large part of this would be taken care of by the polynomials of the plate solution (the github issue I am linking to tells you how to be sure).

    But it can also look like this:

    Plot: Multiple wobbles

    This certainly is not the result of a lens or anything optical at all. It's the scanner's gears that you are looking at here. With an amplitude of perhaps three arcseconds this is rather excessive here; but something like this you will rather likely see even on good scanners – though it may essentially be invisible, as of the Heidelberg scanner we used for HDAP:

    Plot: A vertical cloud with no discernible structure.
    [1]I'm using the BP magnitude in the query below as most historical plates tend to be “blue sensitive“ (in some sense). Hence, BP magnitudes should be a bit closer to what source-extractor has extracted.
  • DaCHS is now at Version 2.7

    Logo-ish 2.7 with a multi-array plot

    Last Friday, I have released Version 2.7 of GAVO's Virtual Observatory server package DaCHS. As is customary, I will give a brief overview of the more noteworthy changes in this blog post. This is probably only of interest to people running DaCHS-based data centres. What I discuss here is both a bit more verbose and a bit less extensive than what you find in the Changes file (when installed from package, you would read it by running /usr/share/doc/python3-gavo/changelog.gz).

    The highlight in this release from my view are simple, numpy-like vector operations in ADQL. Regular readers of this blog will already have seen an example for their use. This is altogether a prototype, which is why what specification is there is only on the IVOA wiki. It is thus likely some details of the vector math will change until they make it into any sort of standard (I am hoping for ADQL 2.2). This should not keep you from trying it out and telling your users about it.

    In that same vein, the FITS binary table grammar now copes with vectors, which makes it easier to populate tables that make these operations useful, and for the sort of large tables where the array magic has particularly much promise, it is now a lot simpler to feed array-valued columns with C boosters.

    Other ADQL work includes the addition of proper, standards-compliant epoch propagation (i.e., “application of proper motion and radial velocity“) in the form of the ivo_epoch_prop and ivo_epoch_prop_pos user defined functions. Regrettably, this will not immediately work for you, as it builds on a feature in pgsphere that upstream has not merged yet; comments on that PR will certainly help make that happen. Of course, if you want, you can just build the pgsphere branch containing the new feature yourself. To make up for this complication, DaCHS will no longer advertise UDFs that will not work given the database extensions present – which will help me be a bit more liberal in letting in UDFs wrapping functionality not in Postgres' default distribution in the future.

    If you run datalink services and have multiple items with the same semantics, you may be interested in using local_semantics in Datalink. The use case here is that clients like TOPCAT will remain on, say, light curves in a red filter when the user jumps between records rather than randomly switching between red and blue ones when both have #coderived semantics (Mark's proposal). If you have data of this kind: you can now pass a localSemantics parameter to the makeLink and makeLinkFromFile methods of datalink descriptors; what string you use is up to you, as long as it's the same between similar rows for different datasets.

    I tend to forget that surprisingly many people actually do something with the ADQL form you get on DaCHS' web interface rather than use a TAP client. Well, a DaCHS operator complained about really sub-standard table headings in the HTML tables coming out of this service. Looking again, I had to admit he was right. So, TAP columns now have more meaningful table headings; in particular, if you write expressions, up to a certain length these expressions will be used as table headings. At least in this respect the ADQL form now has an advantage over using a proper client.

    In case you have a processor doing astrometric calibration with astrometry.net (you probably don't because it would have been very hard to make that work on without a lot of hacks so far) – have another look at the documentation because I have had various reasons to change api.AnetHeaderProcessor's API in quite a number of ways. It's now a lot easier to use with astrometry.net and source-extractor as distributed by Debian, but I'd still not have broken the API so badly if I had suspected anyone but me had significant code against this.

    I should also warn you that DaCHS now uses astropy to format sexagesimal times and coordinates. This is probably welcome news to those who ever encountered one of DaCHS' 05:59:60 outputs (which happened due to the way it did its rounding). Still, if you have regression tests testing for strings like that, you will need to update them.

    From the many minor fixes I should probably mention that DaCHS is now ready for Postgres 15 (which will probably the Postgres version in the next Debian stable). This used to be broken on new installations because Postgres 15 no longer lets normal users write to the public schema. DaCHS needs a database role that can do this, though, because it defines public functions. Since version 2.7, it does the necessary setup to make this possible. If you make your public schema non-world-writable manually – Postgres upgrades will not do that for you, and I would say there is no strong reason to do so for databases backing DaCHS –, do not forget to GRANT ALL ON SCHEMA public TO gavoadmin.

    With this – don't wait, upgrade. If you have GAVO's repository enabled, apt update && apt upgrade would probably do the trick, though of course I recommend having a look at our upgrading guide for robustness and good housekeeping.

« Page 5 / 21 »