Posts with the Tag Standards:

  • Porting a DaCHS SIAv1 service to SIAP2

    a distorted title page of the SIAP version 2 standard, centred on the date 2015-12-23

    Ten years after, let me talk about SIAP version 2.

    In December 2015, the IVOA made Simple Image Access Version 2.0 (hereafter: SIAv2) a Recommendation (that is: the standard you should be following). I am fairly sure that most people into computers would have understood that as “Don't do Simple Image Access version 1 (SIAv1) any more“. As of ten years ago.

    This is not how things worked out. Actually, to this day new SIAv1 services still come online. In the talk about major version transitions I gave in College Park last June, I remark that 20% of the registered SIAv1 services were younger than 30 months.

    There are many reasons why obsoleting SIAv1 has not worked (yet); very frankly, I had rather fiercely argued we don't want SIAv2 at all on grounds that Obscore is all you need to discover products of observations.

    But since it's there now I feel I should do something for its adoption, beginning with not pushing out any new SIAv1 services myself. So, when a data provider sent me an RD they built from a previous one and it would have published a new SIAv1 service, I thought this was the time to start updating my own services.

    The next step then is to encourage DaCHS adopters to help out, too, that is, to port over their RDs from doing SIAP version 1 to doing SIAP version 2[1]. That's why I am writing this blog post.

    Going From SIAv1 to SIAv2 in 11 Moderately Difficult Steps

    Since the output table schema (and quite a bit beyond that) changed between the two version, the port is not entirely trivial; if it were, we wouldn't have done a major version (i.e., breaking) change in the first place. But I'd argue it's quite doable when two conditions are met:

    • You have a DaCHS version 2.8 or later (if not, you should upgrade anyway).
    • You are not using siapCutoutCore right now; what this does is hard to replicate in SIAv2 (because positional constraints are now optional), and so if you want to keep the auto-cutout functionality, you probably are stuck on SIAv1.

    That said, here's my recipe:

    1. Change the mixin on the table that keeps the image metadata. So far, you probably had mixin="//siap#pgs". Drop this and add:

      <mixin have_bandpass_id="True">//siap2#pgs</mixin>
      

      to the table body instead. If you really have no bandpass you would like to mention, you can leave out the attribute definition.

    2. Change the obscore mixin in the table body if you did an obscore publication (skip this step if not). With SIAv2, write instead:

      <mixin preview="access_url || '?preview=True'"
        >//obscore#publishObscoreLike</mixin>
      

      It is really simple now because SIAv2 just re-uses the obscore schema.

      Keep your old mixin definition in a scratch pad (or the version control history at least), because it will help you when you fill out the parameters to //siap2#setMeta.

    3. Change any index statements for standard columns you may have; the column names are completely different between SIAv1 and SIAv2. Classic examples include:

      • bandpassId is bandpass_id (if available)
      • bandpassLo is em_min
      • bandpassHi is em_max
      • dateObs should become indexes on t_min and t_max.

      If your table is small enough that you managed without indexes so far, don't bother creating new ones.

    4. Check custom extension fields for whether they are now in core SIAv2. The classic case is exposure time, which was missing in SIAv1. Just drop your custom column definition(s).

    5. If there is datalink on the SIAP table, you will have to change its definition, too; the relevant column is now obs_publisher_did. If your datalink service has the id dl, the result of the operation would be this:

       <meta name="_associatedDatalinkService">
        <meta name="serviceId">dl</meta>
        <meta name="idColumn">obs_publisher_did</meta>
      </meta>
      

      This may lead to datalink failures in DaCHS < 2.13 (in that the datasets are no longer found). If this bites you, let me know.

    6. Fix the rowmaker for the SIAP table. For the computePGS and getBandFromFilter apply, just add a 2 to their procDef references, so that these become //siap2#computePGS and //siap2#getBandFromFilter (if applicable).

      The main work is going from //siap#setMeta to //siap2#setMeta, because their parameter sets are somewhat different, although they do map to each other to some degree.

      The way to do the migration is to go through SIAv2's setMeta's parameter list in the reference documentation and identify the old parameters, or take the values from your obscore definition. Once you are past this point, you have done the heavy lifiting.

      (For completeness, let me mention that you will probably get away with dropping pixflags and keeping the other parameters as they are, as there is some compatibility glue; but you'd miss setting up extra SIAv2 metadata, and that would be a shame).

    7. Experimentally run dachs imp. This will probably fail because there are references to old column names in, say, service definitions. Resolve these based on the names you used in setMeta (which largely double as the column names). When you made DaCHS accept your refurbished RD and have run the import, use dachs info to catch metadata items you have missed.

    8. If you used a shared core for both a service with the siap.xml renderer and the web form service, move that core into the web form service. Use //siap2#humanInput for the new positional constraint, and drop the #protoInput, if it is there, because it is no longer needed.

    9. The protocol service has to have allowed="siap2.xml", and its new core is:

      <dbCore queriedTable="main">
        <FEED source="//siap2#parameters"/>
      </dbCore>
      

      Replace "main" with whatever your table is called, and add any custom parameters you would like to have.

    10. In your regression tests (you have some, don't you?), change the renderers in the URIs (siap2.xml instead of siap.xml), and change POS and SIZE into POS="CIRCLE ..."; it is likely that you will also have to change column names in the assertions.

    11. Run dachs pub q to tell the Registry that your access URL has changed.

    That's it.

    I would argue this is time well spent. Even if one day there will be a successor to SIAv2 (and I do hope there will be one), it is highly likely that its metadata schema will align very well with obscore's, and hence most of the work you just did will put you in a very good position to switch to DAP with just a few keystrokes.

    [1]If you have built SIA services with DaCHS 2.8 (2023) or later using dachs start, you will already have a SIAv2 service; see the discussion in the pertaining release notes.
  • Multimessenger Astronomy and the Virtual Observatory

    A lake with hills behind it; in it a sign saying “Bergbaugelände/ Benutzung auf eigene Gefahr”

    It's pretty in Görlitz, the location of the future German Astrophysics Research Centre DZA. The sign says “Mining area, enter at your own risk”. Indeed, the meeting this post was inspired by happened on the shores of a lake that still was an active brown coal mine as late as 1997.

    This week, I participated in the first workshop on multimessenger astronomy organised by the new DZA (Deutsches Zentrum für Astrophysik), recently founded in the town of Görlitz – do not feel bad if you have not yet heard of it; I trust you will read its name in many an astronomy article's authors' affiliations in the future, though.

    I went there because facilitating research across the electromagnetic spectrum and beyond (neutrinos, recently gravitational waves, eventually charged particles, too) has been one of the Virtual Observatory's foundational narratives (see also this 2004 paper from GAVO's infancy), and indeed the ease with which you can switch between wavebands in, say, Aladin, would have appeared utopian two decades ago:

    A screenshot with four panes showing astronomical images from SCUBA, allWISE, PanSTARRS, XMM, and a few aladin widgets around it.

    That's the classical quasar 3C 273 in radio, mid-infrared, optical, and X-rays, visualised within a few seconds thanks to the miracles of HiPS and Aladin.

    But of course research-level exploitation of astronomical data is far from a solved problem yet. Each messenger – and I would consider the concepts in the IVOA's messenger vocabulary a useful working definition for what sorts of messengers there are[1] – holds different challenges, has different communities using different detectors, tools and conventions.

    For instance, in the radio band, working with raw-ish interferometry data (“visibilities”) is common and requires specialised tools as well as a lot of care and experience. Against that, high energy observations, be them TeV photons or neutrinos, have to cope with the (by optical standards) extreme scarcity of the messengers: at the meeting, ESO's Xavier Rodrigues (unless I misunderstood him) counted one event per year as viable for source detection. To robustly interpret such extremely low signal levels one in particular needs extremely careful modelling of the entire observation, from emission to propagation through various media to background contamination to the instrument state, with a lot of calibration and simulation data frequently necessary to make statistical sense of even fairly benign data.

    The detectors for graviational waves, in turn, basically only match patterns in what looks like noise even to the aided eye – at the meeting, Samaya Nissanke showed impressive examples –, and when they do pick up a signal, the localisation of the signal is a particular challenge resulting, at least at this point, in large, banana-shaped regions.

    At the multimessenger workshop, I was given the opportunity to delineate what, from my Virtual Observatory point of view, I think are requirements for making multi-messenger astronomy more accessible “for the masses”, that is for researchers that do not have immdiate access to experts for a particular sort of messenger. Since a panel pitch is always a bit cramped, let me give a long version here.

    Science-Ready is an Effort

    The most important ingredient is: Science-ready data. Once you can say “we get a flux of X ± Y Janskys from a σ-circle around α, δ between T1 and T2 and messenger energy E1 and E2” or “here is a spectrum, i.e., pairs of sufficiently many messenger energy intervals and a calibrated flux in them, for source S”, matters are at least roughly understandable to visitors from other parts of the spectrum.

    I will not deny that there is still much that can go wrong, for instance because the error models of the data can become really tricky for complex instruments doing indirect measurements (say, gamma-ray telescopes observing atmospheric showers). But having to cope with weirdly correlated errors or strong systematics is something that happens even while staying within your home within the spectrum – I had an example from the quaint optical domain right here on my blog when I posted on the Gaia XP spectra –, so that is not a problem terribly specific to the multi-messenger setting.

    Still, the case of the Gaia XP spectra and the sampling procedure Rene has devised back then are, I think, a nice example for what “provide science-ready data” might concretely mean: work, in this case, trying to de-correlate data points so people unfamiliar with the particular formalism used in Gaia DR3 can do something with the data with low effort. And I will readily admit that it is not only work, it also sacrifices quite a bit of detail that may actually be in the data if you spend more time with the individual dataset and methods.

    That this kind of service to people outside of the narrower sub-domain is rarely honoured certainly is one of the challenges of multi-messenger astronomy for the masses.

    Generality, Systematics, Statistics

    But of course the most important part of “science-ready” is removing instrument signatures. That is particularly important in multi-messenger astronomy because outside users will generally be fairly unfamiliar with the instruments, even with the types of instruments. Granted, even within a sub-domain setting up reduction pipelines and locating calibration data is rarely easy, and it is not uncommon to get three different answers when you ask two instrument specialists about the right formalism and data to calibrate any given observation. But that is not much compared with having to understand the reduction process of, say, LIGO, as someone who has so far mainly divided by flatfields.

    Even in the optical, serving data with strong instrumental signatures (e.g., without flats and bias frames applied) has been standard until fairly recently. Many people in the VLBI community still claim that real-space data is not much good. And I will not dispute that carefully analysing the systematics of a particular dataset may improve your error budget over what a generic pipeline does, possibly even to the point of pushing an observation over the significance threshold.

    But against that, canned science-ready data lets non-experts at least “see” something. That way, they learn that there may be some signal conveyed by a foreign messenger that is worth a closer look.

    Enabling that “closer look” brings me to my second requirement for multimessenger astronomy: expert access.

    From Data Discovery to Expert Discovery

    Of course, on the forefront of research, an extra 10% systematics squeezed out of data may very well make or break a result, and that means that people may need to go back to raw(er) data. Part of this problem is that the necessary artefacts for doing so need to be available. With Datalink, I'd say at least an important building block for enabling that is there.

    Certainly, that is not full provenance information yet – that would, for instance, include references to the tools used in the reduction, and the parameters fed to them. And regrettably, even the IVOA's provenance data model does not really tell you how to provide that. However, even machine-readable provenance will not let an outsider suddenly do, say, correlation with CASA with sufficient confidence to do bleeding-edge science with the result, let alone improve on the generic reduction hopefully provided by the observatory.

    This is the reason for my conviction that there is an important social problem with multi-messenger astronomy: Assuming I have found some interesting data in unfamiliar spectral territories and I want to try and improve on the generic reduction, how do I find someone who can work all the tools and actually know what they are doing?

    Sure, from registry records you can find contact information (see also the .get_contact() in pyVO's registry API), but that is most often a technical contact, and the original authors may very well have moved on and be inaccessible to these technical contacts. I, certainly, have failed to re-establish contact to previous data providers to the GAVO data centre in two separate cases.

    And yes, you can rather easily move to scholarly publications from VO results – in particular if they implement the INFO elements that the new Data Origin in the VO note asks for–, but that may not help either when the authors have moved on to a different institution, regardless of whether that is a scholarly or, say, banking institution.

    On top of that, our notorious 2013 poster on lame excuses for not publishing one's data has, as an excuse: “People will contact me and ask about stuff.” Back then, we flippantly retorted:

    Well, science is about exchange. Think how much you learned by asking other people.

    Plus, you’ll notice that quite a few of those questions are actually quite clever, so answering them is a good use of your time.

    As to the stupid questions – well, they are annoying, but at least for us even those were eye-openers now and then.

    Admittedly, all this is not very helpful, in particular if you are on the requesting side. And truly I doubt there is a (full) technical solution to this problem.

    I also acknowledge that it even has a legal side – the sort of data you need to process when linking up sub-domain experts and would-be data users is GDPR-relevant, and I would much rather not have that kind of thing on my machine. Still, the problem of expert discovery becomes very pertinent whenever a researcher leaves their home turf – it's even more important in cross-discipline data discovery[2] than in multiwavelength. I would therefore advocate at least keeping the problem in mind, as that might yield little steps towards making expert discovery a bit more realistic.

    Perhaps even just planning for friendly and welcoming helpdesks that link people up without any data processing support at all is already good enough?

    Blind Discovery

    The last requirement I have mentioned in my panel discussion pitch for smooth multi-messenger astronomy is, I think, quite a bit further along: Blind discovery. That is, you name your location(s) in space, time, and spectrum, say what sort of data product you are looking for, and let the computer inundate you with datasets matching these constraints. I have recently posted on this very topic and mentioned a few remaining problems in that field.

    Let me pick out one point in particular, both because I believe there is substantial scientific merit in its treatment and because it is critical when it comes to efficient global blind discovery: Sensitivity; while for single-object spectra, I give you that SNR and resolving power are probably enough most of the time, for most other data products or even catalogues, nothing as handy is available across the spectrum.

    For instance, on images (“flux maps”, if you will) the simple concept of a limiting magnitude obviously does not extend across the spectrum without horrible contortions. Replacing it with something that works for many messengers, has robust and accessible statistical interpretations, and is reasonably easy to obtain as part of reduction pipelines even in the case of strongly model-driven data reduction: that would be high art.

    Also in the panel discussion, it was mentioned that infrastructure work as discussed on these pages is thankless art that will, if your institute indulges into too much of it, get your shop closed because your papers/FTE metric looks too lousy. Now… it's true that beancounters are a bane, but if you manage to come up with such a robust, principled, easy-to-obtain measure, I fully expect its wide adoption – and then you never have to worry about your bean^W citation count again.

    [1]In case you are missing gravitational waves: there has been a discussion about the proper extension and labelling of that concept, and it has petered out the first time around. If you miss them operationally (which will give us important hints about how to properly include them), please to contact me or the IVOA Semantics WG.
    [2]Let me use this opportunity to again solicit contributions to my Stories on Cross-Discipline Data Discovery, as – to my chagrin – it seems to me that beyond metrics (which, between disciplines, are even more broken than within any one discipline) we lack convincing ideas why we should strive for data discovery spanning multiple disciplines in the first place.

Page 1 / 1