Multimessenger Astronomy and the Virtual Observatory

2024-03-28 Markus Demleitner

A lake with hills behind it; in it a sign saying “Bergbaugelände/ Benutzung auf eigene Gefahr”

It's pretty in Görlitz, the location of the future German Astrophysics Research Centre DZA. The sign says “Mining area, enter at your own risk”. Indeed, the meeting this post was inspired by happened on the shores of a lake that still was an active brown coal mine as late as 1997.

This week, I participated in the first workshop on multimessenger astronomy organised by the new DZA (Deutsches Zentrum für Astrophysik), recently founded in the town of Görlitz – do not feel bad if you have not yet heard of it; I trust you will read its name in many an astronomy article's authors' affiliations in the future, though.

I went there because facilitating research across the electromagnetic spectrum and beyond (neutrinos, recently gravitational waves, eventually charged particles, too) has been one of the Virtual Observatory's foundational narratives (see also this 2004 paper from GAVO's infancy), and indeed the ease with which you can switch between wavebands in, say, Aladin, would have appeared utopian two decades ago:

A screenshot with four panes showing astronomical images from SCUBA, allWISE, PanSTARRS, XMM, and a few aladin widgets around it.

That's the classical quasar 3C 273 in radio, mid-infrared, optical, and X-rays, visualised within a few seconds thanks to the miracles of HiPS and Aladin.

But of course research-level exploitation of astronomical data is far from a solved problem yet. Each messenger – and I would consider the concepts in the IVOA's messenger vocabulary a useful working definition for what sorts of messengers there are[1] – holds different challenges, has different communities using different detectors, tools and conventions.

For instance, in the radio band, working with raw-ish interferometry data (“visibilities”) is common and requires specialised tools as well as a lot of care and experience. Against that, high energy observations, be them TeV photons or neutrinos, have to cope with the (by optical standards) extreme scarcity of the messengers: at the meeting, ESO's Xavier Rodrigues (unless I misunderstood him) counted one event per year as viable for source detection. To robustly interpret such extremely low signal levels one in particular needs extremely careful modelling of the entire observation, from emission to propagation through various media to background contamination to the instrument state, with a lot of calibration and simulation data frequently necessary to make statistical sense of even fairly benign data.

The detectors for graviational waves, in turn, basically only match patterns in what looks like noise even to the aided eye – at the meeting, Samaya Nissanke showed impressive examples –, and when they do pick up a signal, the localisation of the signal is a particular challenge resulting, at least at this point, in large, banana-shaped regions.

At the multimessenger workshop, I was given the opportunity to delineate what, from my Virtual Observatory point of view, I think are requirements for making multi-messenger astronomy more accessible “for the masses”, that is for researchers that do not have immdiate access to experts for a particular sort of messenger. Since a panel pitch is always a bit cramped, let me give a long version here.

Science-Ready is an Effort

The most important ingredient is: Science-ready data. Once you can say “we get a flux of X ± Y Janskys from a σ-circle around α, δ between T₁ and T₂ and messenger energy E₁ and E₂” or “here is a spectrum, i.e., pairs of sufficiently many messenger energy intervals and a calibrated flux in them, for source S”, matters are at least roughly understandable to visitors from other parts of the spectrum.

I will not deny that there is still much that can go wrong, for instance because the error models of the data can become really tricky for complex instruments doing indirect measurements (say, gamma-ray telescopes observing atmospheric showers). But having to cope with weirdly correlated errors or strong systematics is something that happens even while staying within your home within the spectrum – I had an example from the quaint optical domain right here on my blog when I posted on the Gaia XP spectra –, so that is not a problem terribly specific to the multi-messenger setting.

Still, the case of the Gaia XP spectra and the sampling procedure Rene has devised back then are, I think, a nice example for what “provide science-ready data” might concretely mean: work, in this case, trying to de-correlate data points so people unfamiliar with the particular formalism used in Gaia DR3 can do something with the data with low effort. And I will readily admit that it is not only work, it also sacrifices quite a bit of detail that may actually be in the data if you spend more time with the individual dataset and methods.

That this kind of service to people outside of the narrower sub-domain is rarely honoured certainly is one of the challenges of multi-messenger astronomy for the masses.

Generality, Systematics, Statistics

But of course the most important part of “science-ready” is removing instrument signatures. That is particularly important in multi-messenger astronomy because outside users will generally be fairly unfamiliar with the instruments, even with the types of instruments. Granted, even within a sub-domain setting up reduction pipelines and locating calibration data is rarely easy, and it is not uncommon to get three different answers when you ask two instrument specialists about the right formalism and data to calibrate any given observation. But that is not much compared with having to understand the reduction process of, say, LIGO, as someone who has so far mainly divided by flatfields.

Even in the optical, serving data with strong instrumental signatures (e.g., without flats and bias frames applied) has been standard until fairly recently. Many people in the VLBI community still claim that real-space data is not much good. And I will not dispute that carefully analysing the systematics of a particular dataset may improve your error budget over what a generic pipeline does, possibly even to the point of pushing an observation over the significance threshold.

But against that, canned science-ready data lets non-experts at least “see” something. That way, they learn that there may be some signal conveyed by a foreign messenger that is worth a closer look.

Enabling that “closer look” brings me to my second requirement for multimessenger astronomy: expert access.

From Data Discovery to Expert Discovery

Of course, on the forefront of research, an extra 10% systematics squeezed out of data may very well make or break a result, and that means that people may need to go back to raw(er) data. Part of this problem is that the necessary artefacts for doing so need to be available. With Datalink, I'd say at least an important building block for enabling that is there.

Certainly, that is not full provenance information yet – that would, for instance, include references to the tools used in the reduction, and the parameters fed to them. And regrettably, even the IVOA's provenance data model does not really tell you how to provide that. However, even machine-readable provenance will not let an outsider suddenly do, say, correlation with CASA with sufficient confidence to do bleeding-edge science with the result, let alone improve on the generic reduction hopefully provided by the observatory.

This is the reason for my conviction that there is an important social problem with multi-messenger astronomy: Assuming I have found some interesting data in unfamiliar spectral territories and I want to try and improve on the generic reduction, how do I find someone who can work all the tools and actually know what they are doing?

Sure, from registry records you can find contact information (see also the .get_contact() in pyVO's registry API), but that is most often a technical contact, and the original authors may very well have moved on and be inaccessible to these technical contacts. I, certainly, have failed to re-establish contact to previous data providers to the GAVO data centre in two separate cases.

And yes, you can rather easily move to scholarly publications from VO results – in particular if they implement the INFO elements that the new Data Origin in the VO note asks for–, but that may not help either when the authors have moved on to a different institution, regardless of whether that is a scholarly or, say, banking institution.

On top of that, our notorious 2013 poster on lame excuses for not publishing one's data has, as an excuse: “People will contact me and ask about stuff.” Back then, we flippantly retorted:

Well, science is about exchange. Think how much you learned by asking other people.

Plus, you’ll notice that quite a few of those questions are actually quite clever, so answering them is a good use of your time.

As to the stupid questions – well, they are annoying, but at least for us even those were eye-openers now and then.

Admittedly, all this is not very helpful, in particular if you are on the requesting side. And truly I doubt there is a (full) technical solution to this problem.

I also acknowledge that it even has a legal side – the sort of data you need to process when linking up sub-domain experts and would-be data users is GDPR-relevant, and I would much rather not have that kind of thing on my machine. Still, the problem of expert discovery becomes very pertinent whenever a researcher leaves their home turf – it's even more important in cross-discipline data discovery[2] than in multiwavelength. I would therefore advocate at least keeping the problem in mind, as that might yield little steps towards making expert discovery a bit more realistic.

Perhaps even just planning for friendly and welcoming helpdesks that link people up without any data processing support at all is already good enough?

Global Dataset Discovery in PyVO

2024-02-23 Markus Demleitner

A Tkinter user interface with inputs for Space, Spectrum, and Time, a checkbox marked "inclusive", and buttons Run, Stop, Broadcast, Save, and Quit.

Admittedly somewhat old-style: As part of teaching global dataset discovery to pyVO, I have also come up with a Tkinter GUI for it. See A UI for more on this.

One of the more exciting promises of the Virtual Observatory was global dataset discovery: You say “Give me all spectra of object X that there are“, and the computer relates that request to all the services that might have applicable data. Once the results come in, they are merged into some uniformly browsable form.

In the early VO, there were a few applications that let you do this; I fondly remember VODesktop. As the VO grew and diversified, however, this became harder and harder, partly because there were more and more services, partly because there were more protocols through which to publish data. Thus, for all I can see, there is, at this point, no software that can actually query all services plausibly serving, say, images or spectra in the VO.

I have to say that writing such a thing is not for the faint-hearted, either. I probably wouldn't have tackled it myself unless the pyVO maintainers had made it an effective precondition for cleaning up the pyVO Servicetype constraint.

But they did, and hence as a model I finally wrote some code to do all-VO image searches using all of SIA1, SIA2, and obscore, i.e., the two major versions of the Simple Image Access Protocol plus Obscore tables published through TAP services. I actually have already reported in Tucson on some preparatory work I did last summer and named a few problems:

There are too many services to query on a regular basis, but filtering them would require them to declare their coverage; far too many still don't.
With the current way of registering obscore tables, there is no way to know their coverage.
One dataset may be availble through up to three protocols on a single host.
SIA1 does not even let you constrain time and spectrum.

Some of these problems I can work around, others I can try to fix. Read on to find out how I fared so far.

The pyVO API

Currently, the development happens in pyVO PR #470. While it is still a PR, let me point you to temporary pyVO docs on the proposed pyvo.discover module – of course, all of this is for review and probably not in the shape it will remain in[1].

Followup (2024-11-28)

With the recent release of pyVO 1.6, what is described here is actually available in the release (or by checking out the main branch of the repository).

To quote from there, the basic usage would be something like:

from pyvo import discover
from astropy import units as u
from astropy import time

datasets, log = discover.images_globally(
  space=(339.49, 3.1, 0.1),
  spectrum=650*u.nm,
  time=(time.Time('1995-01-01'), time.Time('1995-12-31')))

At this point, only a cone is supported as a space constraint, and only a single point in spectrum. It would certainly be desirable to be more flexible with the space constraint, but given the capabilities of the various protocols, that is hard to do. Actually, even with the plain cone Obscore (i.e., ironically, the most powerful of the discovery protocols covered here) currently results in an implementation that makes me unhappy: ugly, slow, and wrong. This is requires a longer discussion; see Appendix: Optionality Considered Harmful.

datasets at this point is a list of, conceputally, Obscore records. Technically, the list contains instances of a custom class ImageFound, which have attributes named after the Obscore columns. In case you have doubts about the Semantics of any column, the Obscore specification is there to help. And yes, you can argue we should create a single astropy table from that list. You are probably right.

PyVO adds an extra column over the mandatory obscore set, origin_service. This contains the IVOA identifier (IVOID) of the service at which the dataset was found. You have probably seen IVOIDs before: they are URIs with a scheme of ivo:. What you may not know: these things actually resolve, specifically to registry resource records. You can do this resolution in a web browser: Just prepend https://dc.g-vo.org/I/ to an IVOID and paste the result into the address bar. For instance, my Obscore table has the IVOID ivo://org.gavo.dc/__system__/obscore/obscore; the link below the IVOID leads you to an information page, which happens to be the resource's Registry record formatted with a bit of XSLT. A somewhat more readable but less informative rendering is available when you prepend https://dc.g-vo.org/LP/ (“landing page”).

The second value returned from discover.images_globally is a list of strings with information on how the global discovery progressed. For now, this is not intended to be machine-readable. Humans can figure out which resources were skipped because other services already cover their data, which services yielded how many records, and which services failed, for instance:

Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
Obscore GAVO Data Center Obscore Table: 2 records
SIA2 The VO @ ASTRON SIAP Version 2 Service: 0 records
SIA2 ivo://au.csiro/casda/sia2 skipped: ReadTimeout: HTTPSConnectionPool(host=&apos;casda.csiro.au&apos;, port=443): Read timed out. (read timeout=20)
SIA2 CADC Image Search (SIA): 0 records
SIA2 European HST Archive SIAP service: 0 records
...

(On the skipping, see Relationships below). I consider this crucial provenance, as that lets you assess later what you may have missed. When you save the results, be sure to save these, too.

A feature that will presumably (see Inclusivity for the reasons for this expectation) be important at least for a few years is that you can pass the result of a Registry query, and pyVO will try to find services suitable for image discovery on that set of resources.

A relatively straightforward use case for that is global obscore discovery. This would look like this:

from pyvo import discover
from pyvo import registry
from astropy import units as u
from astropy import time

def say(discoverer, s):
        print(s)

datasets, log = discover.images_globally(
  space=(274.6880, -13.7920, 1),
  time=(time.Time('1995-01-01'), time.Time('1995-12-31')),
  services=registry.search(registry.Datamodel("obscore")),
  watcher=say)

The watcher thing lets you, well, watch the progress of the discovery; it receives an instance of the discoverer -- this is so you can abort a discoverer's activities from within some UI -- and the human-readable string to display or process in some other way.

A UI

To get an idea whether this API might one day work for the average astronomer, I have written a Tkinter-based GUI to global image discovery as it is now: tkdiscover (only available from github at this point). This is what a session with it might look like:

Lots of TOPCAT windows with various graphs and tables, an x-ray image of the sky with overplotted points, and a play gray window offering the specification of space, spectrum, and time constraints.

The actual UI is in the top right: A plain window in which you can configure a global discovery query by straightfoward serialisations of discover.images_globally's arguments:

Space (currently, a cone in RA, Dec, and search radius, separated by whitespace of commas)
Spectrum (currently, a single point as a wavelength in metres)
Time (currently, either a single point in time – which probably is rarely useful – or an interval, to be entered as civil DALI dates
Inclusivity.

When you run this, this basically calls discover.images_globally and lets you know how it is progressing. You can click Broadcast (which sends the current result to all VOTable clients on the SAMP bus) or Save at any time and inspect how discovery is progressing. I predict you will want to do that, because querying dozens of services will take time.

There is also a Stop button that aborts the dataset search (you will still have the records already found). Note that the Stop button will not interrupt running network operations, because the network library underneath pyVO, requests, is not designed for being interrupted. Hence, be patient when you hit stop; this may take as long as the configured timeout (currently is 20 seconds) if the service hangs or has to do a lot of work. You can see that tkdiscover has noticed your stop request because the service counter will show a leading zero.

Service counter? Oh, that's what is at the bottom right of the window. Once service discovery is done, that contains three numbers: The number of services to query, the number of services queried already, and the number of services that failed.

The table contains the obscore records described above, and the log lines are in the discovery_log INFO. I will give you that this is extremely unreadable in particular in TOPCAT, which normalises the line separators to plain whitespace. Perhaps some other representation of these log lines would be preferable: A PARAM with a char[][] (but VOTable still is terrible with arrays of variable-length strings)? Or a separate table with char[*] entries?

Inclusivity

I have promised above I'd explain the “Inclusive” part in both the pyVO API and the Tk UI. Well, this is a bit of a sad story.

All-VO-queries take time. Thus, in pyVO we try to only query services that we expect serve data of interest. How do we arrive at expectations like that? Well, quite a few records in the Registry by now declare their coverage in space and time (cf. my 2018 post for details).

The trouble is: Most still don't. The checkmark at inclusive decides whether or not to query these “undecidable” services. Which makes a huge difference in runtime and effort. With the pre-configured constraints in the current prototype (X-Ray images a degree around 274.6880, -13.7920 from the year 1995), we currently discover three services (of which only one actually needs to be queried) when inclusive is off. When it is on, pyVO will query a whopping 323 services (today).

The inclusivity crisis is particularly bad with Obscore tables because of their broken registration pattern; I can say that so bluntly because I am the author of the standard at fault, TAPRegExt. I am preparing a note with a longer explanation and proposals for fixing matters – <cough> follow me on github –, but in all brevity: Obscore data is discovered using something like a flag on TAP services. That is bad because the TAP services usually have entriely different metadata from their Obscore table; think, in particular, of the physical coverage that is relevant here.

It will be quite a bit of effort to get the data providers to do the Registry work required to improve this situation. Until that is done, you will miss Obscore tables when you don't check inclusive (or override automatic resource selection as above) – and if you do check inclusive, your discovery runs will take something like a quarter of an hour.

Relationships

In general, the sheer number of services to query is the Achilles' heel in the whole plan. There is nothing wrong with having a machine query 20 services, but querying 200 is starting to become an effort.

With multi-data collection services like Obscore (or collective SIA2 services), getting down to a few dozen services globally for a well-constrained search is actually not unrealistic; once all resources properly declare their coverage, it is not very likely that more than 20 institutions worldwide will have data in a credibly small region of space, time, and spectrum. If all these run collective services and properly declare the datasets to be served by them, that's our 20-services global query right there.

However, pyVO has to know when data contained in a resource is actually queriable by a collective service. Fortunately, this problem has already been addressed in the 2019 endorsed note on Discovering Data Collections Within Services: Basically, the individual resource declares an IsServedBy relationship to the collective service. PyVO global discovery already looks at these. That is how it could figure out these two things in the sample log given above:

Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore

But of course the individual services have to declare these relationships. Surprisingly many already do, as you can observe yourself when you run:

select ivoid, related_id from
rr.relationship
natural join rr.capability
where
standard_id like 'ivo://ivoa.net/std/sia%'
and relationship_type='isservedby'

on your favourite RegTAP endpoint (if you have no preferences, use mine: http://dc.g-vo.org/tap). If you have collective services and run individual SIA services, too, please run that query, see if you are in there, and if not, please declare the necessary relationships. In case you are unsure as to what to do, feel free to contact me.

Future Directions

At this point, this is a rather rough prototype that needs a lot of fleshing out. I am posting this in part to invite the more adventurous to try (and break) global discovery and develop further ideas.

Some extensions I am already envisaging include:

Write a similar module for spectra based on SSAP and Obscore. That would then probably also work for time series and similar 1D data.
Do all the Registry work I was just talking about.
Allow interval-valued spectral constraints. That's pretty straightforward; if you are looking for some place to contribute code, this is what I'd point you to.
Track overflow conditions. That should also be simple, probably just a matter of perusing the pyVO docs or source code and then conditionally produce a log entry.
Make an obscore s_region out of the SIA1 WCS information. This should also be easy – perhaps someone already has code for that that's tested around the poles and across the stitching line? Contributions are welcome.
Allow more complex geometries to define the spatial region of interest. To keep SIA1 viable in that scenario it would be conceivable to compute a bounding box for SIA1 POS/SIZE and do “exact” matching locally on the coarser SIA1 result.
Enable multi-position or multi-interval constraints. This pretty certainly would exclude SIA1, and, realistically, I'd probably only enable Obscore services with TAP uploads with this. With those constraints, it would be rather straightforward.
Add SODA support: It would be cool if my ImageFound had a way to say “retrieve data for my RoI only”. This would use SODA and datalink to do server-side cutouts where available and do the cut-out locally otherwise. If this sounds like rocket science: No, the standards for that are actually in place, and pyVO also has the necessary support code. But still the plumbing is somewhat tricky, partly also because pyVO's datalink API still is a bit clunky.
Going async? Right now, we civilly query one service after the other, waiting for each result before proceeding to the next service. This is rather in line with how pyVO is written so far.

However, on the network side for many years asynchronous programming has been a very successful paradigm – for instance, our DaCHS package has been based on an async framework from the start, and Python itself has growing in-language support for async, too.

Async allows you to you fire off a network request and forget about it until the results come back (yes, it's the principle of async TAP, too). That would let people run many queries in parallel, which in turn would result in dramatically reduced waiting times, while we can rather easily ensure that a single client will not overflow any server. Still, it would be handing a fairly powerful tool into possibly unexperienced hands… Well: for now there is no need to decide on this, as pyVO would need rather substantial upgrades to support async.

Appendix: Optionality Considered Harmful

The trouble with obscore and cones is a good illustration of the traps of attempting to fix problems by adding optional features. I currently translate the cone constraint on Obscore using:

"(distance(s_ra, s_dec, {}, {}) < {}".format(
  self.center[0], self.center[1], self.radius)
+" or 1=intersects(circle({}, {}, {}), s_region))".format(
  self.center[0], self.center[1], self.radius))

which is all of ugly, presumably slow, and wrong.

To appreciate what is going on, you need to know that Obscore has two ways to define the spatial coverage of an observation. You can give its “center” (s_ra, s_dec) and something like a rough radius (s_fov), or you can give some sort of geometry (e.g., a polygon: s_region). When the standard was written, the authors wanted to enable Obscore services even on databases that do not know about spherical geometry, and hence s_region is considered rather optional. In consequence, it is missing in many services. And even the s_ra, s_dec, s_fov combo is not mandatory non-null, so you are perfectly entitled to only give s_region.

That is why there are the two conditions or-ed together (ugly) in the code fragment above. 1=intersects(circle(.), s_region) is the correct part; this is basically how the cone is interpreted in SIA1, too. But because s_region may be NULL even when s_ra and s_dec are given, we also need to do a test based on the center position and the field of view. That rather likely makes things slower, possibly quite a bit.

Even worse, the distance-based condition actually is wrong. What I really ought to take into account is s_fov and then do something like distance(.) < {self.radius}+s_fov, that is, the dataset position need only be closer than the cone radius plus the dataset's FoV (“intersects”). But that would again produce a lot of false negatives because s_fov may be NULL, too, and often is, after which the whole condition would be false.

On top of that, it is virtually impossible that such an expression would be evaluated using an index, and hence with this code in place, we would likely be seqscanning the entire obscore table almost every time – which really hurts when you have about 85 Million records in your Obscore table (as I do).

The standard could immediately have sanitised all this by saying: when you have s_ra and s_dec, you must also give a non-empty s_fov and s_region. This is a classic case for where a MUST would have been necessary to produce something that is usable without jumping through hoops. See my post on Requirements and Validators on this blog for a longer exposition on this whole matter.

I'm not sure if there is a better solution than the current “if the operators didn't bother with s_region, the dataset's FoV will be ignored“. If you have good ideas, by all means let me know.

Followup (2024-11-28)

I've given a talk at the Malta interop giving another view on this matter: VO at the limit.

[1]	If you want to try this (in particular without clobbering your “normal” pyVO), do something like this: virtualenv --system-site-packages global-datasets . global-datasets/bin/activate cd global-datasets git clone https://github.com/msdemlei/pyvo cd pyvo git checkout global-datasets pip install .

News From the VO Via ActivityPub

2024-01-22 Markus Demleitner

If you ask us: Get a proper client to join the Fediverse. But as shown here, in a pinch a web browser will do, too.

When Twitter was still fairly young, we had an account there that would tweet out when new data collections appeared in the VO. Even back then, I was rather doubtful whether using a proprietary platform to disseminate open data is a good idea, but as long as the content was also available through standard protocols (RSS in this case), I thought it might be worth a try. Well: It never really took off, and after Twitter broke the whole thing a couple of times by incompatible API changes, I finally let it go ca. 2017.

Given to the recent mass exodus from the smouldering remains of Twitter into the open and standard Fediverse, I thought reviving our little missives there might actually be a worthwhile effort. Specifically, joining Mastodon – which speaks the ActivityPub protocol and hence is part of the Fediverse – has become really straightforward.

So, if the VO Fresh RSS Feed is not for you (perhaps because you do not have an RSS aggregator, which would be a shame), maybe following our new Mastodon account @gavo@botsin.space would be for you?

Followup (2024-11-21)

In late 2024, botsin.space shut down, and we moved our operations to @gavo@astrodon.social; so, please point your fediverse clients there.

Followup (2025-03-13)

While we would obviously nudge people to properly open and federated systems like the mastodon or the Fediverse in general, you can follow us from bluesky, too. Try @gavo.astrodon.social.ap.brid.gy there.

Followup (2025-07-01)

Oh bother, astrodon.social shut down, too. Perhaps we really need to run some sort of activityPub server of our own? Or convince some university to do it? Until then, we have moved on to fediscience.org. You can now follow @gavo@fediscience.org. From bluesky, we are now @gavo.fediscience.org.ap.brid.gy.

Oh, and yes, I give you the previews the Mastodon web client produces for VizieR resources are not overly pretty yet (curse Javascript templating!), but then if I were you, I'd disable URL previews anyway; really, they are little more than an annoyance.

This post also doubles as identity verification, so:
Visit Our Mastodon Page.
Category: Operations

DaCHS 2.9 is out

2023-11-24 Markus Demleitner

Our VO server package DaCHS almost always sees two releases per year, each time roughly after the Interops[1]. So, with the Tucson Interop over, it's time for DaCHS 2.9, and this is the traditional what's new post.

Data Origin – the big headline for this release could be “curation”, in that three upcoming standardoid entities in that field are prototyped in 2.9. One is Data Origin, which is a note on how to embed some very basic provenance information into VOTables.

This is going to help your users figure out how they came up with a VOTable when the referee has clever questions about the paper they submitted half a year earlier. The good news is: if you defined your metadata in your RD with sufficient care, with DaCHS 2.9 you will automatically do Data Origin.

Feed your D links – another curation-related new thing in DaCHS is an implementation of what will hopefully be known as BibVO in the future. At this point, it is an unpublished note on Github. In essence, the purpose is to feed bibliographic services – and in particular the ADS – “D links”, i.e., links from publications to data. A part of this works automatically (the source metadatum), but the more advanced biblinks need a bit of manual intervention.

If you even have, say, an observatory bibliography consisting pairs of papers and data used by these papers, you will probably have to write a handful of code. See biblinks in the reference documentation for details if any of this sounds as if it could apply to you. In this context, I have also enabled passing multiple accrefs to the /get endpoint. Users will then receive a tar file of the referenced data products.

altIdentifiers in relationships – still in the bibliographic realm, VOResource 1.2 will (almost certainly) let you set altIdentifiers, in particular DOIs, when you declare relationships to other resources. That is probably of interest in particular when you want to declare relationships to things outside of the VO to services like b2find that themselves do not understand ivoids. In that situation, you would write something like:

Cites: Some external thing
Cites.altIdentifier: doi:10.fake/123412349876

in a <meta> tag in your RD.

json columns – postgresql has the very tempting and apparently all-powerful json type; it lets you stick complex structures into database columns and thus apparently relieve you of all the tedious tasks of designing database tables and documenting metadata.

Written like this, you probably notice it's a slippery slope at best. Still, there are some non-hazardous uses for such columns, and thus you can now say type="json" or (probably preferably) type="jsonb" in column definitions. You can feed these columns with dicts, lists or JSON literals in strings. Clients will receive both of them as JSON string literals in char[*] FIELDs with an xtype of json. Neither astropy nor TOPCAT do anything with that xtype yet, but I expect that will change soon.

Copy coverage – sometimes two resources have the same spatial (and potentially temporal and spectral) coverage. Since obtaining the coverage is an expensive operation, it would be nice to be able to say “aw, look at that other resource and take its coverage.” The classic example in DaCHS is the system-wide SIAP2 service that really is just a parametric wrapper around obscore. In such cases, you can now say something like:

<coverage fallbackTo="__system__/obscore"/>

– and //siap2 already does. That's one more reason to occasionally run dachs limits //obscore if you offer an obscore table.

First VOTable row in tests – if you have calls to getFirstVOTableRow in regression tests (you have regression tests, right?) that return multiple rows, these will fail now until you also pass rejectExtras=False to that call. I've had regressions that were hidden by the function's liberal acceptance of extra rows, and it's too simple to produce unstable tests (that magically succeed and fail depending to the current state of the database) with the old behaviour. I hence hope for your sympathy and understanding in case I broke one of your tests.

ADQL extensions – there is now arr_count to complement the array extension added in 2.7. Also, our custom UDFs transform, normal_random, to_jd, to_mjd, and simbadpoint now have a prefix of ivo_ rather than the previous gavo_. In order not to break existing queries, DaCHS will still accept the gavo_-prefixed names for the forseeable future, but it will no longer advertise them.

Minor fixes – as usual, there are many minor bug fixes and improvements, the most visible of which is probably that DaCHS now correctly handles literal + chars in multipart-encoded (”uploads”) requests again; that was broken in 2.8 after the removal of the dependency on python's CGI module. Also, MOC-valued columns can now be serialised into non-VOTable formats like JSON or CSV.

If you have been using DaCHS' built-in HTTPS support, certain clients may have rejected its certificates. That was because we were pulling an expired intermediate certificate from letsencrypt. If you don't understand what I was just saying, don't worry. If you do understand that and know a good way to avoid this kind of calamity in the future, I'm grateful for advice.

VCS move – when DaCHS was born, using the venerable subversion for version control was considered reputable. These days, fewer and fewer people can still deal with that, and thus I have moved the DaCHS source code into a git repository: https://gitlab-p4n.aip.de/gavo/dachs/.

I hear you moan “why not github?” Well: don't get me started unless you are prepared to listen to a large helping of proselytising. Suffice it to say that we in academia invented the internet (for all intents and purposes) and it's a shame that we now rely so much on commercial entities to provide our basic services (and then without paying them, as a rule, which is always a dangerous proposition towards commercial entities).

Anyway: Feel free to use that service's bug tracker; we try to find ways to let you log in there without undue hardship, too.

At this point, I customarily urge: don't wait, upgrade. If you have our Debian repository enabled, apt update && apt upgrade should do the trick, except if you missed our announcement on dachs-users that our repository key has changed. If you have not updated it, please have a look at our repo page to see what needs to be done. Sorry about this, but our old 1024D key was being frowned upon, so we had to do something.

Unless you are an old hand and have upgraded many times before, let me recommend a quick glance at our upgrading guide before doing the actual upgrade.

[1]	The reason we wait for the Interops is that we are generally promising to put something into DaCHS at or around these conferences. This time, the preliminary support for json-typed database columns is an example for that.

GAVO at the Fall 2023 Interop in Tucson

2023-11-13 Markus Demleitner

The Virtual Observatory, in practical terms, is the set of standards created and maintained by the IVOA. The IVOA, in turn, is a community almost defined by the two conferences it holds every year, the Interops (previously on this blog). The most recent Interop has just ended: The 2023 Tucson Fall Interop. Here are a few notes on what went on there from my (and to some extent GAVO's) perspective.

This fall's IVOA Interop was hosted by Steward Observatory, where they had ripening oranges in the backyard. They were edible!

For at least a decade and a half, the autumn Interops have been back-to-back with the ADASS conferences. ADASS, short for Astronomical Data Analysis Software and Systems, is a venerable conference series, created far in the last century (this year: ADASS XXXIII) to have a forum for people who work in the magic triangle of astronomy, instrumentation, and data processing. Clearly, such a forum is very well suited to spread the word about the miracles we are working in the VO.

To that end, I was involved in the creation of three posters: One on the use of MOCs in TAP – a somewhat extended version of something you saw on this blog first –, then one on data discovery in pyVO by Renaud Savalle (Paris) et al – a topic again familiar to readers of this blog – and finally one on improving the description of ADQL to enable more reliable machine validation of its grammar by Grégory Mantelet (Strasbourg) et al.

As the conference at large goes, I was really delighted to see how basically everyone talking about data publication at all was stressing they are “doing VO”, which was a very welcome change from, perhaps, 10 years ago when this kind of talk was typcially extolling the virtues of one particular web or javascript framework. One of the great thing about standards in general and the VO in particular is that they tend to be a lot more durable than all those frameworks.

The following Interop was a “short” one, lasting from Friday morning until Sunday noon, which meant that I was far too busy to do anything like a live blog while it went on. Let me hence just briefly point out the main talks related to GAVO's current activities and DaCHS.

In Data Curation and Preservation on Saturday morning, Baptiste Cecconi (Paris) gave a nice overview of – among other things – what our bridge between the Registry and b2find (in particular, using the VOResource to DataCite mapper) enables in the context of the EOSC, and he briefly touched the question of how to properly make landing pages for VO resources (for which I am currently using another piece of XSLT).

In the Radio session later that morning, Ixaka Labadie (Granada) gave a talk on how he is using DaCHS to deliver 3D visualisations for fairly impressive (prototype) SKA data. I particularly liked his illustrations of how DaCHS does Datalink and SODA. See his slide 12:

In the afternoon, there was the Registry session, which featured me talking about the harvest trigger service I have been running for a while to help people across the anticlimactic moment when you have published your new resource but it won't show up in TOPCAT or pyVO for a day or so.

The bulk of this session, however, was used for a discussion about various shortcomings of the Registry or its interfaces that I found pleasantly productive – incidentally, just like the discussion on word lists in EPN-TAP on Friday afternoon's Solar System Session that I had the pleasure to chair.

In the DAL session on that afternoon, I had two talks: One was on the proposed new interoperable user-defined functions already implemented in DaCHS' ADQL and now coming up in several other services, too. Note to self: Some of these would probably be rather suitable blog post material.

The second talk was a sort of brief show-and-tell pitch, in which I pointed out that hierarchical TAP examples using the elegant examples:continued property now actually work in both pyVO and TOPCAT:

Finally, in Sunday morning's Apps session, I talked about global image discovery in pyVO. This was about an early promise of the VO: just say where in space, time, and spectrum you need an image (or spectrum, or time series, or whatever), and some apparatus will find and query all the services that could have pertinent data. It would then present the metadata of the datasets it found in some useful form that would let you make informed decisions which to fetch.

This was not too difficult in the olden days, but by now the VO is so big and complicated that a pyVO module with fairly involved logic is required. If you don't want to read the notes here, don't worry: I can safely predict that you'll read more about that topic on this blog.

This is nowhere near done yet; so, it is one more piece of homework that I am taking home with me.

Category: Meetings

« Page 4 / 22 »