2021-09-23
Markus Demleitner
Mild warning: This is exclusively technobabble mainly addressing DaCHS
deployers. If you're an astronomer (or yet something else), you're of
course still welcome to enjoy it, but don't complain if you're bored.
My development machine has been on Debian bullseye for a while, which
means I've been running Postgres 13 for the past few months. Compared to
Postgres 11, Postgres 13 is a lot more optimistic about doing
Just-In-Time (JIT) compilation, and that's the beginning of this story.
This JIT thing in plain language means that Postgres writes small
programmes to compute query results, compiles them to machine code,
and executes that rather than running the query plan in some sort of
interpreter. This at first sounds like a great idea that should speed up
large queries quite a bit. But for one, query time is often bounded not
so much by CPU but by I/O, and the sort of analysis that happens for JIT
compilation is not free. Not at all.
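To see whether your server will consider JIT at all, and from which
estimated query cost onwards it does, you can simply ask it in psql;
this should work on any Postgres 11 or later:
SHOW jit;
SHOW jit_above_cost;
SHOW jit_optimize_above_cost;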
I noticed the cost of all that when a query in the regression test suite
I'm running before every commit to DaCHS started to fail occasionally.
That test executes:
SELECT TOP 1 obs_publisher_did
FROM ivoa.obscore
WHERE distance(s_ra, s_dec, 83.8,-5.4)<0.2
and then asserts that the result comes in within 10 seconds. The purpose of
this particular regression test is to make
sure all sizable tables in the obscore view have a usable spatial index
on the production system. On the development system, there really aren't
any tables in obscore that would be slow even when seqscanned.
How on earth could this query be slow then?
The natural reaction in such a situation is to use EXPLAIN in psql. In
this case, there is some non-trivial rewriting of the query going on
between ADQL and postgres, which means you cannot just paste the ADQL to
Postgres. To figure out the query that DaCHS actually executes, I picked
the translated query from the VOTable returned from a successful request
(look for the sql_query INFO; that's a DaCHS extension, so that
trick won't work for other TAP servers), ran the psql gavo that DaCHS
operators are probably used to, and then typed:
EXPLAIN SELECT obs_publisher_did
FROM ivoa.obscore
WHERE q3c_join(83.8, - 5.4, s_ra, s_dec, 0.2) LIMIT 1;
to it. The result was inconspicuous; a few seqscans here and there, but
the total cost estimate was “0.00..7.12”, which in physical units works
out to “basically nothing”, many orders of magnitude away from the 10
seconds I occasionally saw in the regression tests.
Well, when a query plan doesn't match your expectations, the next thing
to do is EXPLAIN ANALYZE. With that, Postgres executes the plan it has
made and then compares its estimates to what the cost turned out to be;
this, by the way, is also a good way to find out when you should raise
the statistics target of one or more of your columns (see Element
Column in the
DaCHS reference for details).
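If you want to experiment with statistics targets outside of DaCHS, the
plain-Postgres way is a one-off ALTER TABLE followed by an ANALYZE; this
is just a sketch, with made-up table and column names:
ALTER TABLE myschema.mytable ALTER COLUMN s_ra SET STATISTICS 1000;
ANALYZE myschema.mytable;
But back to the EXPLAIN ANALYZE at hand.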
For me, the output looked something like this:
Limit (cost=10000000000.00..10156565675.53 rows=1 width=57) (actual time=6206.883..6206.899 rows=1 loops=1)
[...]
Planning Time: 22.174 ms
JIT:
Functions: 130
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 55.404 ms, Inlining 107.280 ms, Optimization 3479.626 ms, Emission 2601.411 ms, Total 6243.721 ms
Execution Time: 6263.243 ms
Ok, I'm lying a bit; there is a reason other than the ANALYZE for why
the cost estimate exploded from 7.12 to 10156565675.53. I'll confess to
it in the appendix to this post.
The main point, however, is: the execution time now is of the order that
I'm expecting (the database is rather busy during a regression test, so
those 6 seconds can easily become double that then). Interestingly,
essentially all the execution time went into “Optimization” and
“Emission”. Until yesterday, I'd never seen a thing like that in
Postgres query plans.
That is because here the JIT is at work, and that was at least a lot
less likely in Postgres 11. Now, estimating 10 Gigapennies of execution
cost up front, Postgres 13 figured some extra time for writing and
compiling a little programme would be well spent. Of course, that estimate is
badly off, and the right thing to do is to fix the reason for the bad
estimate. See the appendix for why I don't just yet.
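By the way, if you want to quantify what JIT costs for a given query on
your own box, you can switch it off for just one transaction and compare;
none of this is DaCHS-specific:
BEGIN;
SET LOCAL jit = off;
EXPLAIN ANALYZE SELECT obs_publisher_did
FROM ivoa.obscore
WHERE q3c_join(83.8, -5.4, s_ra, s_dec, 0.2) LIMIT 1;
ROLLBACK;
Run the same EXPLAIN ANALYZE without the SET LOCAL, and the difference in
execution time is (roughly) what JIT is costing you here.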
That my obscore view has 32 tables contributing to it, giving its
definition a whopping 1280 lines, probably does not help. But since query
plans in the presence of Q3C and pgsphere are usually badly off anyway,
it is probably wise to discourage Postgres a bit from using JIT
compilation with DaCHS' workloads in your configuration, too, at least
if you are running TAP services (you should), and ideally before you
upgrade to Postgres 13. To do that, add a:
jit_above_cost = 20000000000
(or so; perhaps you can set your limit a good deal lower) to your
postgresql.conf. On Debian boxes, that file is in
/etc/postgresql/13/main/ (obviously, change the 13 if you have a
different version). You need to restart postgres to make this take
effect.
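If you would rather not edit the file by hand, ALTER SYSTEM from a
superuser psql session does the same job; it writes to
postgresql.auto.conf, which Postgres reads on top of postgresql.conf,
and it takes effect once you have restarted (or at least reloaded) the
server. Either way, a quick SHOW afterwards tells you whether the new
limit is actually live:
ALTER SYSTEM SET jit_above_cost = 20000000000;
SHOW jit_above_cost;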
While I was in that file, I thought I'd share what other configuration I
have in there, because it is likely you can speed up your data centre
quite a bit by judicious tuning (a quick way to check what your server
currently uses follows after the list). The following settings aren't
particularly well thought out, but I claim they are not unreasonable for
a 64 GB machine that runs as a dedicated server; that last point is also
behind the first configuration item, because for two-server operation
you have to set
- listen_addresses = '*' – only then can you talk to postgres from
another machine (disregarding hacks like ssh tunnels that may even
work as last-resort options). Of course, this may mean your postgres
port is visible to the internet, which means you ought to understand
what pg_hba.conf is before configuring that. Other configuration I'm
doing includes:
- max_connections = 200 – I actually ran out of connections once;
DaCHS itself is now a bit more parsimonious with them, but if you have
enough RAM, it still doesn't hurt to be generous here.
- timezone = UTC – TIMESTAMPs suck, because it is hard to compute
with them, they are a pain when plotting, there are time zones, and they
generally are a Babylonian mess (as evinced by base-60 numbers). But you
can't always escape timestamps, and if you somehow manage to create them
“with time zone”, telling the server to do UTC helps limit their damage
radius.
- shared_buffers = 15GB – the Postgres documentation says 25% of the
RAM is a good default for shared_buffers, so that's roughly what I went
for here. Note that the kernel usually limits how much shared memory
processes are allowed to allocate, and you will have to adjust those
limits for this to take effect. On Debian, the postgresql-common package
installs a file /etc/sysctl.d/30-postgresql-shm.conf for easy
adjusting of the limits.
- temp_buffers = 100MB – that one gives buffers for temporary
tables, and raising it helps TAP uploads (which use those, at least for
now). Since our TAP uploads tend to be large as temporary tables go, it
pays to set aside a good deal of memory for them. Now that I look at
this again and think about what people upload into my data centre: I
think I could even raise that a bit more.
- work_mem = 64MB – this one is for doing joins and the like (which
includes cross-matches), and again these tend to be larger in astronomy
than in many other disciplines, where matching tens of millions of rows
against billions would already count as Big Data. Hence, postgres'
default of 4 MB is quite certainly going to cause a lot of unnecessary
disk activity.
That said, DaCHS could be a bit smarter here and raise work_mem itself
when running TAP jobs (or perhaps only TAP jobs that actually do joins).
Note that a single query can use up many times work_mem, which means you
shouldn't choose this too high, either. One thing I'd like to look into
one day is the hash_mem_multiplier (cf. a bit further down in the
Postgres docs on resource limits).
If you do research in that direction with astronomy workloads, please
let me know.
- maintenance_work_mem = 2048MB – this is relevant to keep VACUUM
runs fast, which become necessary as rows are added to or replaced in
the database. I have some relatively large tables that regularly see
deletes (e.g., the relational registry), and hence I want smooth
vacuuming. If you don't have large tables that regularly change, you
probably don't need to bother with maintenance_work_mem.
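As promised above, here is a quick way to see where your server
currently stands on these knobs before you touch anything; pg_settings
is plain Postgres and has nothing DaCHS-specific about it:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'temp_buffers', 'work_mem',
  'maintenance_work_mem', 'max_connections', 'jit_above_cost');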
If you have additional (or contradicting) advice on Postgres
configuration for DaCHS: Please let us know, preferably on the
dachs-support mailing list (see DaCHS support).
Appendix: As I said: I was lying above. The original with-JIT plan
was just fine. The horrible, cost-10-Giga plan was only chosen when I
did the SET enable_seqscan=false. Why would I do a thing like that,
forcing Postgres in the wrong direction? Well, DaCHS' TAP executor makes
the same setting. And why does it do that to Postgres? That's a long
story closely related to the Q3C and pgsphere troubles I've mentioned
above – and for which there's now finally hope: See q3c issue #30 if you're curious.
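If you want to reproduce the effect yourself, the whole experiment fits
into one transaction, so nothing sticks; the query is the translated one
from above:
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN SELECT obs_publisher_did
FROM ivoa.obscore
WHERE q3c_join(83.8, -5.4, s_ra, s_dec, 0.2) LIMIT 1;
ROLLBACK;
The conspicuous 10000000000.00 in the resulting startup cost is Postgres'
way of saying “I had to pick a plan type you told me to avoid”, which is
also, presumably, what pushed the estimate over the JIT thresholds here.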