Towards Data Discovery in pyVO

When I struggled with ways to properly integrate TAP services – which may have hundreds or thousands of different resources in one service – into the VO Registry without breaking what we already had, I realised that there are really two fundamentally different modes of using the VO Registry. In Discovering Data Collections's abstract I wrote:

the Registry must support both VO-wide discovery of services by type ("service enumeration") and discovery by data collection ("data discovery").

To illustrate the difference in a non-TAP case, suppose I have archived images of lensed quasars from Telescopes A, B, and C. All these image collections are resources in their own right and should be separately findable when people look for “resources with data from Telescope A“ or perhaps “images obtained between 2011-01-01 and 2011-12-31”.

However, when a machine wants to find all images at a certain position, publishing the three resources through three different services would mean that that machine has to do three requests where one would work just as well. That is very relevant when you think about how the VO will evolve: At this point there are 342 SIAP services in the VO, and when you read this, that number may have grown further. Adding one service per collection will simply not scale when we want to keep the possibility of all-VO searches. Since I claim that is a very desirably thing, we need to enable collective services covering multiple subordinate resources.

So, while in the first (“data discovery”) case one wants to query (or at least discover) the three resources separately, in the second case they should be ignored, and only a collective “images of lensed quasars” service should be queried.

The technical solution to this requirement was creating “auxliary capabilities” as discribed in the endorsed note on discoving data collections cited above. But these of course need client support; VO clients up to now by and large do service enumeration, as that has been what we started with in the VO Registry. Client support would, roughly, mean that clients would present their users with data collections, and then offer the various ways to to access them.

There are quite a number of technicalities involved in why that's not terribly straightforward for the “big” clients like TOPCAT and Aladin (though Aladin's discovery tree already comes rather close).

Now that quite a number of people use pyVO interactively in jupyter notebooks, extending pyVO's registry interface to do data discovery in addition to the conventional service enumeration becomes an attractive target to have data discovery in practice.

I have hence created pyVO PR #289. I think some the rough edges will need to be smoothed out before it can be merged, but meanwhile I'd be grateful if you could try it out already. To facilitate that, I have prepared a jupyter notebook that shows the basic ideas.

Followup (2023-12-15)

I have just prepared a slightly updated version of the notebook.

To run it while the PR is not merged, you need to install the forked pyVO. In order to not clobber your main installation, you can install astropy using your package manager and then do the following (assuming your shell is bash or something suitably similar):

virtualenv --system-site-packages try-discoverdata
. try-discoverdata/bin/activate
cd try-discoverdata
git clone https://github.com/msdemlei/pyvo
cd pyvo
git checkout add-discoverdata
python3 setup.py develop
ipython3 notebook

That should open a browser window in which you can open the notebook (you probably want to download it into the pyvo checkout in order to make the notebook selector see it). Enjoy!

Zitiert in: At the Bologna Interop Another Virtual Interop