• Watch Sphinx Doctests

    No astronomy at all here; please move on if tooling for improving tooling bores you.

    While giving a lecture on pyVO, I am churning out quite a few pull requests against pyVO at the moment. I am also normally also fairly religious about running unit tests before doing a commit. But then PyVO unit tests became really, really slow a while ago when pytesting of the examples in the documentation was turned on, and so I started relying on the github continuous integration, which feels fairly wasteful – and also makes all kinds of minor idiocies public that I would have caught locally with a test suite that finishes within a minute or so.

    Regrettably, tooling for inspecting how doctests with sphinx and pytest run is not really great: All the code from one documentation file translates into a single test, and when that runs for five minutes, it's anyone's guess where the time is spent. After a bit of poking and asking around, it seemed to me that there indeed is no “doctest profiler” (if you will), at least not for pytest-executable doctests embedded in sphinx-processable ReStructuredText.

    Well, I thought, let's write a quick one. Originally, I had wanted to use the docutils parser for robustness, but once I tried to pull in the sphinx extensions and got lost in their modules I decided a simple, RE-based parser has to be enough.

    And here it is, my my quick-and-dirty doctest profiler: watch-doctests.py. Just put it into your path, make it executable, and you can do something like this:

    pyvo/docs/dal > watch-doctests.py index.rst | head -30
    ---0.00---------------
    
    import pyvo as vo
    ---0.94---------------
    
    service = vo.dal.SIAService("http://dc.zah.uni-heidelberg.de/lswscans/res/positions/siap/siap.xml")
    ---0.94---------------
    
    print(service.description)
    Scans of plates kept at Landessternwarte Heidelberg-Königstuhl. They
    were obtained at location, at the German-Spanish Astronomical Center
    (Calar Alto Observatory), Spain, and at La Silla, Chile. The plates
    cover a time span between 1880 and 1999.
    
    Specifically, HDAP is essentially complete for the plates taken with
    the Bruce telescope, the Walz reflector, and Wolf's Doppelastrograph
    at both the original location in Heidelberg and its later home on
    Königstuhl.
    ---1.02---------------
    
    import pyvo as vo
    ---1.02---------------
    
    from astropy.coordinates import SkyCoord
    ---1.02---------------
    
    from astropy.units import Quantity
    

    – so, you pass in the ReStructuredText with the embedded sphinx/pytest doctests, and then the thing extracts every line to be executed in the doctests (it ignores the outputs, so it will not actually check any assertions), prints the runtime so far in a separator and then runs the code through Python as usual: note that no automatic repr() of any non-None results – that the REPL does – happens. This is for profiling, not for test development.

    The quick hack helped me speed up the dal and registry doctests by sizeable factors, for instance because I am now avoiding downloads of large datasets, and I am using faster queries where I can.

    So, that's nice. But unless someone asks, I will distribute the code here only and in this ad-hoc fashion (probably with a link in the pyVO hackers' docs). I still believe there must be something a lot less hacky that does about the same thing somewhere out there…

  • A Data Publisher's Diary: Wide Images in DASCH

    An Aladin screenshot with many green squares overplotted on a DSS image sized 20×15 degrees.

    This is the new resonse when you query the DASCH SIAP service for Aladin's default view on the horsehead nebula. As you can see, at least the returned images no longer are distributed over half of the sky (note the size of the view).

    The first reaction I got when the new DASCH in the VO service hit Aladin was: “your SIAP service is broken, it just dumps all images it has at me rather than honouring my positional constraint.”

    I have to admit I was intially confused as well when an in-view search from Aladin came back with images with centres on almost half the sky as shown in my DASCH-in-Aladin illustration. But no, the computer did the right thing. The matching images in fact did have pixels in the field of view. They were just really wide field exposures, made to “patrol” large parts of the sky or to count meteors.

    DASCH's own web interface keeps these plates out of the casual users' views, too. I am following this example now by having two tables, dasch.narrow_plates (the “narrow” here follows DASCH's nomenclature; of course, most plates in there would still count as wide-field in most other contexts) and dasch.wide_plates. And because the wide plates are probably not very helpful to modern mainstream astronomers, only the narrow plates are searched by the SIAP2 service, and only they are included with obscore.

    In addition to giving you a little glimpse into the decisions one has to make when running a data centre, I wrote this post because making a provisional (in the end, I will follow DASCH's classification, of course) split betwenn “wide” and “narrow” plates involved a bit of simple ADQL that may still be not totally obvious and hence may merit a few words.

    My first realisation was that the problem is less one of pixel scale (it might also be) but of the large coverage. How do we figure out the coverage of the various instruments? Well, to be robust against errors in the astrometric calibration (these happen), let us average; and average over the area of the polygon we have in s_region, for which there is a convenient ADQL function. That is:

    SELECT instrument_name, avg(area(s_region)) as meanarea
    FROM dasch.plates
    GROUP BY instrument_name
    

    It is the power of ADQL aggregate function that for this characterisation of the data, you only need to download a few kilobytes, the equivalent of the following histogram and table:

    A histogram with a peak of about 20 at zero, with groups of bars going all the way beyond 4000.  The abscissa is marked “meanarea/deg**2”.
    Instrument Name mean size [sqdeg]
    Eastman Aero-Ektar K-24 Lens on a K-1...  
    Cerro Tololo 4 meter  
    Logbook Only. Pages without plates.  
    Roe 6-inch  
    Palomar Sky Survey (POSS)  
    1.5 inch Ross (short focus) 4284.199799877725
    Patrol cameras 4220.802442888225
    1.5-inch Ross-Xpress 4198.678060743206
    2.8-inch Kodak Aero-Ektar 3520.3257323233624
    KE Camera with Installed Rough Focus 3387.5206396388453
    Eastman Aero-Ektar K-24 Lens on a K-1... 3370.5283986677637
    Eastman Aero-Ektar K-24 Lens on a K-1... 3365.539790633015
    3 inch Perkin-Zeiss Lens 1966.1600884072298
    3 inch Ross-Tessar Lens 1529.7113188540836
    2.6-inch Zeiss-Tessar 1516.7996790591587
    Air Force Camera 1420.6928219265849
    K-19 Air Force Camera 1414.074101143854
    1.5 in Cooke "Long Focus" 1220.3028263587332
    1 in Cook Lens #832 Series renamed fr... 1215.1434235932702
    1-inch 1209.8102811770807
    1.5-inch Cooke Lenses 1209.7721123964636
    2.5 inch Cooke Lens 1160.1641223648048
    2.5-inch Ross Portrait Lens 1137.0908812243645
    Damons South Yellow 1106.5016573891376
    Damons South Red 1103.327982978934
    Damons North Red 1101.8455616455205
    Damons North Blue 1093.8380971825375
    Damons North Yellow 1092.9407550755682
    New Cooke Lens 1087.918570304363
    Damons South Blue 1081.7800084709982
    2.5 inch Voigtlander (Little Bache or... 548.7147592220762
    NULL 534.9269386355818
    3-inch Ross Fecker 529.9219051692568
    3-inch Ross 506.6278856912204
    3-inch Elmer Ross 503.7932693652602
    4-inch Ross Lundin 310.7279860552893
    4-inch Cooke (1-327) 132.690621660727
    4-inch Cooke Lens 129.39637516917298
    8-inch Bache Doublet 113.96821604869973
    10-inch Metcalf Triplet 99.24964308212328
    4-inch Voightlander Lens 98.07368690379751
    8-inch Draper Doublet 94.57937153909593
    8-inch Ross Lundin 94.5685388440282
    8-inch Brashear Lens 37.40061588712761
    16-inch Metcalf Doublet (Refigured af... 33.61565584978583
    24-33 in Jewett Schmidt 32.95324914757339
    Asiago Observatory 92/67 cm Schmidt 32.71623733985344
    12-inch Metcalf Doublet 31.35112644688316
    24-inch Bruce Doublet 22.10390937657793
    7.5-inch Cooke/Clark Refractor at Mar... 14.625992810622787
    Positives 12.600189007151709
    YSO Double Astrograph 10.770798601877804
    32-36 inch BakerSchmidt 10 1/2 inch r... 10.675406541122827
    13-inch Boyden Refractor 6.409447066606171
    11-inch Draper Refractor 5.134521254785461
    24-inch Clark Reflector 3.191361603405415
    Lowel 40 inch reflector 1.213284257086087
    200 inch Hale Telescope 0.18792105301170514

    For the instruments with an empty mean size, no astrometric calibrations have been created yet. To get a feeling for what these numbers mean, recall that the celestial sphere has an area of 4 π rad², that is, 4⋅180²/π or 42'000 square degrees. So, some instruments here indeed covered 20% of the night sky in one go.

    I was undecided between cutting at 150 (there is a fairly pronounced gap there) or at 50 (the gap there is even more pronounced) square degrees and provisionally went for 150 (note that this might still change in the coming days), mainly because of the distribution of the plates.

    You see, the histogram above is about instruments. To assess the consequences of choosing one cut or the other, I would like to know how many images a given cut will remove from our SIAP and ObsTAP services. Well, aggregate functions to the rescue again:

    SELECT ROUND(AREA(s_region)/100)*100 AS platebin, count(*) AS ct
    FROM dasch.plates
    GROUP BY platebin
    

    To plot such a pre-computed histogram in TOPCAT, tell the histogram plot window to use ct as the weight, and you will see something like this:

    A wide histogram with a high peak at about 50, rising to 1.2e5. Another noticeable concentration is around 1250, and there is signifiant weight also approaching 450 from the left.

    It was this histogram that made me pick 150 deg² as the cutoff point for what should be discoverable in all-VO queries: I simply wanted to retain the plates in the second bar from left.

  • DASCH is now in the VO

    Black dots on a white-ish background.  In the middle, some diffuse greyish stuff around a relatively large black dot.

    This frame would show comet 2P/Encke during its proximity to Earth in 1941 – if it went deep enough. But never mind practicalities: If you want to learn about matching ephemeris against the DASCH plate collection (or, really, any sort of obscore-like table), read on.

    For about a century – that is, into the 1980s –, being an observational astronomer meant taking photographic plates and doing tricks with them (unless you were a radio astronomer or one of the very few astronomers peeking beyond radio and optical in those days, of course). This actually is somewhat fortunate for archivists, because unlike many of the early CCD observations that by now are lost with our ability to read the tapes they were stored on, the plates are still there.

    Why Bother?

    However, to make them usable, the plates need to be digitised. In the GAVO data centre, we keep the results of several scan campaigns large and small, such as HDAP, the various data collections joined in the historical photographic plate image archive HPPA, or the delightfully quirky Münster Flare Plates.

    I personally care a lot about these data collections. This is partly because they are indispensible for understanding the history of astronomy. But more importantly, they are the next best thing we have to a time machine; if we have a way of knowing how the sky looked like seventy years ago, it is these plate collections. Having such a time machine is important for all kinds of scientific efforts, including figuring out whether there are aliens (i.e., 2016ApJ...822L..34S) on Tabby's Star.

    Somewhat to my chagrin, the cited paper 2016ApJ...822L..34S did not use the VO to obtain the plate images but went straight to DASCH's web interface. DASCH, in case you have not heard of it before, is probably the most ambitious project concerned with plate digitisation at the moment – or perhaps: “was”, because they just finished scanning the core part of Harvard's plate collections, which was their primary goal.

    I can understand why Bradley Schaefer, the paper's author, did not bother with a VO search In 2016. For starters, working with halfway homogeneous data from instruments you are somewhat familiar saves a substantial amount of work and thought, in particular if you are, in addition, up against the usual lack of machine-readable metadata. Also, at that time DASCH probably had about as many digitised plates as all the VO's contemporary plate collections taken together.

    DASCH: The Harvard Plates

    Given such stats, I have always wanted to have at least the metadata from DASCH's plates in the VO. Thanks to a recent update to DASCH's publication system, this is now a reality. Since 2024-04-29, I am publishing the metadata of the DASCH plates via Obscore and and SIAP2.

    Followup (2024-05-03)

    This is now DASCH news, and one of my two main contacts on the DASCH side, Peter Williams, has written an insightful post on this, too. Let me use this opportunity to thank him for the delightful cooperation, and extend these thanks to Ben Sabath, who is primarily responsible for the update to the DASCH publication system I mentioned above.

    Matching plates are returned as datalink documents, pointing to a preview, photos of the plate and its jacket, and links to the science data, once downsampled by a factor of 16, once in the original size (example). For now, #this points to the downsampled version, as Amazon charges DASCH about three cents per full-scale plate at the moment, and that can quickly add up by accident (there's nothing wrong with consciously downloading full-scale FITS-es if you need them, of course).

    This is a bit fishy in that the size of the image in the obscore/SIAP2 fields s_xel1 and s_xel2 refers to the unscaled image, and thus I should be returning the full-scale image as datalink #this. I hope I will not cause much confusion with this design.

    In case you look at the links in the datalink documents, let me include a disclaimer: Although they point into the GAVO data centre, the data is served courtesy of the DASCH project. The links only go to us because we need to sign links for you. I mention this because you can save the datalink documents and the links within them; the URLs you are redirected to from there, however, will expire fast. Just do not look at them.

    Plates in Global Discovery

    So – what can you do with DASCH in the VO that you could not do before?

    Most importantly, you will discover DASCH in registry interfaces and its datasets in global queries (in particular the global dataset queries I have discussed a few weeks ago). For instance, DASCH is now in Aladin's discovery tree:

    A screen shot with many selected points, highlighted in green, on the right side.  On the left side, an tree display with many branches folded in.  On a folded-out branch, there is “DASCH SIAP2“ highlighted.  On the right side, there is a large rectangle overplotted in red.

    You can now find DASCH in Aladin and do the usual “in view“ searches. However, currently this yields many matches that are, in practical terms, spurious, as they come from extremely wide-angle instruments. The red rectangle is the footprint of one of these images; note that the view here is a full two pi sky. We will probably do something about this “noise“.

    The addition of DASCH to the VO has a strong effect in some use cases. For instance, at the end of the GAVO plates tutorial, we do an all-VO obscore query that, at the time of the last update of the tutorial in 2019, yielded 4067 datasets (of course, including modern and/or non-optical observations) potentially showing some strongly lensed quasar. With DASCH – and, admittedly, a few more collections that came into the VO since 2019 –, that number is now 10'489; the range of observation dates grew from MJD 12550…52000 to MJD 9800…58600, with the mean decreasing from 51'909 to 30'603. That the mean observation date moves that much back in time is a certain sign that a major part of the expansion is due to DASCH (well, and certainly to APPLAUSE, too).

    Followup (2024-05-03)

    As discussed in my DASCH update, I have taken out the large-coverage plates from my obscore table, which changes the stats (but not the conclusions) quite a bit. They is now 10'098 plates and mean observation date 36'396

    TAP, Uploads, and pyVO on DASCH

    But this is not just about bringing astronomical heritage to the VO. It is also about exposing DASCH through the powerful ADQL/TAP interface. As an example of how this may be useful, consider the comet P2/Encke, which, according to JPL's Small-Body Database was relatively close to Earth (about half an AU) in May 1941. It would have had about 14.5 mag at that point and hence was safely within reach of several of the instruments archived in DASCH. Perhaps we can find serendipitous or even targeted observations of the comet in the collection?

    The plan to find that out is: compute an ephemeris (we are lazy and use an external service, Miriade ephemcc) and then for each day see whether there are DASCH observations in the vicinity of the sky location obtained in this way.

    As usual, it's never that easy because the call to the ephemeris webservice (paste the link into TOPCAT to have a look) returns cursed sexagesimal coordinates. We need to fix them before doing anything serious with the table, and while we are at it, we also repair the date, which is simpler to consume if it is MJD to begin with. Getting the ephemeris thus takes quite a few lines:

    from astropy import table
    from astropy import units as u
    from astropy.coordinates import SkyCoord
    from astropy.time import Time
    
    ephem = table.Table.read(
      "https://vo.imcce.fr/webservices/miriade/ephemcc.php?-from=vespa"
      "&-name=c:p/encke&-ep=1941-04-01&-nbd=90&-step=1d&-observer=500"
      &-mime=votable")
    
    parsed = SkyCoord(ephem["ra"], ephem["dec"], unit=(u.hourangle, u.deg))
    ephem["ra"] = parsed.ra.degree
    ephem["dec"] = parsed.dec.degree
    
    parsed = Time(ephem["epoch"])
    ephem["epoch"] = parsed.mjd
    

    Compared to that, the actual matching against DASCH is almost trivial if you are somewhat familiar with crossmatching in ADQL and the Obscore schema:

    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    res = svc.run_sync("""
        SELECT *
        FROM
            dasch.plates
            JOIN tap_upload.orbit
            ON (1=CONTAINS(POINT(ra, dec), s_region))
        WHERE
            t_min<epoch
            AND t_max>epoch""",
        uploads={"orbit": ephem})
    

    Followup (2024-05-03)

    You would probably query the dasch.narrow_plates table in actual operations; querying dasch.plates is probably more for people interested in the history of astronomy or DASCH itself.

    Inspect the query for a moment: This is a normal upload join, except we are constructing an ADQL POINT on the fly to be able to see whether we are in the spatial region covered by a DASCH dataset (given in obscore's s_region column). We could have put the temporal condition into the join's ON; but I think the intention is somewhat clearer with the WHERE constraint, and the database engine will probably go through identical motions for both queries – the beauty of having a query planner in the loop is that you do not need to think about such details most of the time.

    Actually, in this case there is one last complication: As said above, we have put a datalink service between you and the downloads to discourage accidental large downloads. We hence use pyVO's (suboptimally documented) datalink interface (iter_datalinks):

    with pyvo.samp.connection() as conn:
        for dl in res.iter_datalinks():
            link = next(dl.bysemantics("#preview-image"))
            pyvo.samp.send_image_to(
                conn,
                link.access_url,
                client_name="Aladin")
    

    Among the artefacts available we pick the scaled jpegs in this fragment (#preview-image), since these are almost free even on the Amazon cloud. Change that #preview-image to #this in the to get scaled calibrated FITS-es, which are still fairly small. This would, for instance, let you overplot the ephemeris in Aladin, which you cannot do with the jpegs as they lack astrometric calibration (for now). But even with #preview-image, we can use Aladin as a glorified image viewer by SAMP-sending the images there, which is why we do the minor magic with functions from pyvo.samp.

    If you want to try this yourself or mangle the program to do something else that requires querying against a reasonable number positions in time and space, just get encke.py and hack away. Make sure to start Aladin before running the program so it has something to send the images to.

    Disclaimers

    This is a contrived example, and it is likely that this particular use case is astronomically wrong in several ways. Let me enumerate a few things that would need looking into before this approaches proper science:

    • We compute the ephemeris for the center of the Earth. At half an AU distance, the resulting parallax will not shift the position enough to hide a plate we should know about, but at least for anything closer, you should try to do a bit better; admittedly, for a resource like DASCH – that contains plates from observatories all over the place – you will have to compromise.
    • The ephemeris is probably wrong; comet's orbits change over time, and I have no idea if the ephemeris service actually uses 2P/Encke's 1941 orbit to compute the positions.
    • The coordinate metadata may be wrong. Ephemcc's documentation says something that sounds a lot as if they were sometimes returning RA and Dec for the equator of the time rather than for J2000 (i.e., ICRS for all intents and purposes), but of course our obscore coverages are for the ICRS. Regrettably, the VOTable returned by the service does not contain a COOSYS element yet, and so there is no easy way to tell.
    • If you look at the table with DASCH matches, you will see they all were observed with an extremely wide-angle instrument sporting an aperture of a mere three inches. Even at the whopping exposure times (two hours), there is probably no way you would see a diffuse object of 14th mag on a plate with a 1940s-era photographic emulsion with that kind of optics (well: feel free to prove me wrong).
    • It would of course be a huge waste of bandwidth to pull the entire plates if we already had a good idea of where we would expect the comet (i.e., had a reliable ephemeris). Hence, a cutout service that would let you retrieve more or less exactly the pixels you would like to use for your research and not the cruft around it would be a nifty supplement. It's in the works, and I'd say you can almost hold your breath. The cutout will simply appear as a SODA service in the datalink documents. See 2020ASPC..522..295D for how you would operate such a service.
  • Multimessenger Astronomy and the Virtual Observatory

    A lake with hills behind it; in it a sign saying “Bergbaugelände/ Benutzung auf eigene Gefahr”

    It's pretty in Görlitz, the location of the future German Astrophysics Research Centre DZA. The sign says “Mining area, enter at your own risk”. Indeed, the meeting this post was inspired by happened on the shores of a lake that still was an active brown coal mine as late as 1997.

    This week, I participated in the first workshop on multimessenger astronomy organised by the new DZA (Deutsches Zentrum für Astrophysik), recently founded in the town of Görlitz – do not feel bad if you have not yet heard of it; I trust you will read its name in many an astronomy article's authors' affiliations in the future, though.

    I went there because facilitating research across the electromagnetic spectrum and beyond (neutrinos, recently gravitational waves, eventually charged particles, too) has been one of the Virtual Observatory's foundational narratives (see also this 2004 paper from GAVO's infancy), and indeed the ease with which you can switch between wavebands in, say, Aladin, would have appeared utopian two decades ago:

    A screenshot with four panes showing astronomical images from SCUBA, allWISE, PanSTARRS, XMM, and a few aladin widgets around it.

    That's the classical quasar 3C 273 in radio, mid-infrared, optical, and X-rays, visualised within a few seconds thanks to the miracles of HiPS and Aladin.

    But of course research-level exploitation of astronomical data is far from a solved problem yet. Each messenger – and I would consider the concepts in the IVOA's messenger vocabulary a useful working definition for what sorts of messengers there are[1] – holds different challenges, has different communities using different detectors, tools and conventions.

    For instance, in the radio band, working with raw-ish interferometry data (“visibilities”) is common and requires specialised tools as well as a lot of care and experience. Against that, high energy observations, be them TeV photons or neutrinos, have to cope with the (by optical standards) extreme scarcity of the messengers: at the meeting, ESO's Xavier Rodrigues (unless I misunderstood him) counted one event per year as viable for source detection. To robustly interpret such extremely low signal levels one in particular needs extremely careful modelling of the entire observation, from emission to propagation through various media to background contamination to the instrument state, with a lot of calibration and simulation data frequently necessary to make statistical sense of even fairly benign data.

    The detectors for graviational waves, in turn, basically only match patterns in what looks like noise even to the aided eye – at the meeting, Samaya Nissanke showed impressive examples –, and when they do pick up a signal, the localisation of the signal is a particular challenge resulting, at least at this point, in large, banana-shaped regions.

    At the multimessenger workshop, I was given the opportunity to delineate what, from my Virtual Observatory point of view, I think are requirements for making multi-messenger astronomy more accessible “for the masses”, that is for researchers that do not have immdiate access to experts for a particular sort of messenger. Since a panel pitch is always a bit cramped, let me give a long version here.

    Science-Ready is an Effort

    The most important ingredient is: Science-ready data. Once you can say “we get a flux of X ± Y Janskys from a σ-circle around α, δ between T1 and T2 and messenger energy E1 and E2” or “here is a spectrum, i.e., pairs of sufficiently many messenger energy intervals and a calibrated flux in them, for source S”, matters are at least roughly understandable to visitors from other parts of the spectrum.

    I will not deny that there is still much that can go wrong, for instance because the error models of the data can become really tricky for complex instruments doing indirect measurements (say, gamma-ray telescopes observing atmospheric showers). But having to cope with weirdly correlated errors or strong systematics is something that happens even while staying within your home within the spectrum – I had an example from the quaint optical domain right here on my blog when I posted on the Gaia XP spectra –, so that is not a problem terribly specific to the multi-messenger setting.

    Still, the case of the Gaia XP spectra and the sampling procedure Rene has devised back then are, I think, a nice example for what “provide science-ready data” might concretely mean: work, in this case, trying to de-correlate data points so people unfamiliar with the particular formalism used in Gaia DR3 can do something with the data with low effort. And I will readily admit that it is not only work, it also sacrifices quite a bit of detail that may actually be in the data if you spend more time with the individual dataset and methods.

    That this kind of service to people outside of the narrower sub-domain is rarely honoured certainly is one of the challenges of multi-messenger astronomy for the masses.

    Generality, Systematics, Statistics

    But of course the most important part of “science-ready” is removing instrument signatures. That is particularly important in multi-messenger astronomy because outside users will generally be fairly unfamiliar with the instruments, even with the types of instruments. Granted, even within a sub-domain setting up reduction pipelines and locating calibration data is rarely easy, and it is not uncommon to get three different answers when you ask two instrument specialists about the right formalism and data to calibrate any given observation. But that is not much compared with having to understand the reduction process of, say, LIGO, as someone who has so far mainly divided by flatfields.

    Even in the optical, serving data with strong instrumental signatures (e.g., without flats and bias frames applied) has been standard until fairly recently. Many people in the VLBI community still claim that real-space data is not much good. And I will not dispute that carefully analysing the systematics of a particular dataset may improve your error budget over what a generic pipeline does, possibly even to the point of pushing an observation over the significance threshold.

    But against that, canned science-ready data lets non-experts at least “see” something. That way, they learn that there may be some signal conveyed by a foreign messenger that is worth a closer look.

    Enabling that “closer look” brings me to my second requirement for multimessenger astronomy: expert access.

    From Data Discovery to Expert Discovery

    Of course, on the forefront of research, an extra 10% systematics squeezed out of data may very well make or break a result, and that means that people may need to go back to raw(er) data. Part of this problem is that the necessary artefacts for doing so need to be available. With Datalink, I'd say at least an important building block for enabling that is there.

    Certainly, that is not full provenance information yet – that would, for instance, include references to the tools used in the reduction, and the parameters fed to them. And regrettably, even the IVOA's provenance data model does not really tell you how to provide that. However, even machine-readable provenance will not let an outsider suddenly do, say, correlation with CASA with sufficient confidence to do bleeding-edge science with the result, let alone improve on the generic reduction hopefully provided by the observatory.

    This is the reason for my conviction that there is an important social problem with multi-messenger astronomy: Assuming I have found some interesting data in unfamiliar spectral territories and I want to try and improve on the generic reduction, how do I find someone who can work all the tools and actually know what they are doing?

    Sure, from registry records you can find contact information (see also the .get_contact() in pyVO's registry API), but that is most often a technical contact, and the original authors may very well have moved on and be inaccessible to these technical contacts. I, certainly, have failed to re-establish contact to previous data providers to the GAVO data centre in two separate cases.

    And yes, you can rather easily move to scholarly publications from VO results – in particular if they implement the INFO elements that the new Data Origin in the VO note asks for–, but that may not help either when the authors have moved on to a different institution, regardless of whether that is a scholarly or, say, banking institution.

    On top of that, our notorious 2013 poster on lame excuses for not publishing one's data has, as an excuse: “People will contact me and ask about stuff.” Back then, we flippantly retorted:

    Well, science is about exchange. Think how much you learned by asking other people.

    Plus, you’ll notice that quite a few of those questions are actually quite clever, so answering them is a good use of your time.

    As to the stupid questions – well, they are annoying, but at least for us even those were eye-openers now and then.

    Admittedly, all this is not very helpful, in particular if you are on the requesting side. And truly I doubt there is a (full) technical solution to this problem.

    I also acknowledge that it even has a legal side – the sort of data you need to process when linking up sub-domain experts and would-be data users is GDPR-relevant, and I would much rather not have that kind of thing on my machine. Still, the problem of expert discovery becomes very pertinent whenever a researcher leaves their home turf – it's even more important in cross-discipline data discovery[2] than in multiwavelength. I would therefore advocate at least keeping the problem in mind, as that might yield little steps towards making expert discovery a bit more realistic.

    Perhaps even just planning for friendly and welcoming helpdesks that link people up without any data processing support at all is already good enough?

    Blind Discovery

    The last requirement I have mentioned in my panel discussion pitch for smooth multi-messenger astronomy is, I think, quite a bit further along: Blind discovery. That is, you name your location(s) in space, time, and spectrum, say what sort of data product you are looking for, and let the computer inundate you with datasets matching these constraints. I have recently posted on this very topic and mentioned a few remaining problems in that field.

    Let me pick out one point in particular, both because I believe there is substantial scientific merit in its treatment and because it is critical when it comes to efficient global blind discovery: Sensitivity; while for single-object spectra, I give you that SNR and resolving power are probably enough most of the time, for most other data products or even catalogues, nothing as handy is available across the spectrum.

    For instance, on images (“flux maps”, if you will) the simple concept of a limiting magnitude obviously does not extend across the spectrum without horrible contortions. Replacing it with something that works for many messengers, has robust and accessible statistical interpretations, and is reasonably easy to obtain as part of reduction pipelines even in the case of strongly model-driven data reduction: that would be high art.

    Also in the panel discussion, it was mentioned that infrastructure work as discussed on these pages is thankless art that will, if your institute indulges into too much of it, get your shop closed because your papers/FTE metric looks too lousy. Now… it's true that beancounters are a bane, but if you manage to come up with such a robust, principled, easy-to-obtain measure, I fully expect its wide adoption – and then you never have to worry about your bean^W citation count again.

    [1]In case you are missing gravitational waves: there has been a discussion about the proper extension and labelling of that concept, and it has petered out the first time around. If you miss them operationally (which will give us important hints about how to properly include them), please to contact me or the IVOA Semantics WG.
    [2]Let me use this opportunity to again solicit contributions to my Stories on Cross-Discipline Data Discovery, as – to my chagrin – it seems to me that beyond metrics (which, between disciplines, are even more broken than within any one discipline) we lack convincing ideas why we should strive for data discovery spanning multiple disciplines in the first place.
  • Global Dataset Discovery in PyVO

    A Tkinter user interface with inputs for Space, Spectrum, and Time, a checkbox marked "inclusive", and buttons Run, Stop, Broadcast, Save, and Quit.

    Admittedly somewhat old-style: As part of teaching global dataset discovery to pyVO, I have also come up with a Tkinter GUI for it. See A UI for more on this.

    One of the more exciting promises of the Virtual Observatory was global dataset discovery: You say “Give me all spectra of object X that there are“, and the computer relates that request to all the services that might have applicable data. Once the results come in, they are merged into some uniformly browsable form.

    In the early VO, there were a few applications that let you do this; I fondly remember VODesktop. As the VO grew and diversified, however, this became harder and harder, partly because there were more and more services, partly because there were more protocols through which to publish data. Thus, for all I can see, there is, at this point, no software that can actually query all services plausibly serving, say, images or spectra in the VO.

    I have to say that writing such a thing is not for the faint-hearted, either. I probably wouldn't have tackled it myself unless the pyVO maintainers had made it an effective precondition for cleaning up the pyVO Servicetype constraint.

    But they did, and hence as a model I finally wrote some code to do all-VO image searches using all of SIA1, SIA2, and obscore, i.e., the two major versions of the Simple Image Access Protocol plus Obscore tables published through TAP services. I actually have already reported in Tucson on some preparatory work I did last summer and named a few problems:

    • There are too many services to query on a regular basis, but filtering them would require them to declare their coverage; far too many still don't.
    • With the current way of registering obscore tables, there is no way to know their coverage.
    • One dataset may be availble through up to three protocols on a single host.
    • SIA1 does not even let you constrain time and spectrum.

    Some of these problems I can work around, others I can try to fix. Read on to find out how I fared so far.

    The pyVO API

    Currently, the development happens in pyVO PR #470. While it is still a PR, let me point you to temporary pyVO docs on the proposed pyvo.discover module – of course, all of this is for review and probably not in the shape it will remain in[1].

    Followup (2024-11-28)

    With the recent release of pyVO 1.6, what is described here is actually available in the release (or by checking out the main branch of the repository).

    To quote from there, the basic usage would be something like:

    from pyvo import discover
    from astropy import units as u
    from astropy import time
    
    datasets, log = discover.images_globally(
      space=(339.49, 3.1, 0.1),
      spectrum=650*u.nm,
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')))
    

    At this point, only a cone is supported as a space constraint, and only a single point in spectrum. It would certainly be desirable to be more flexible with the space constraint, but given the capabilities of the various protocols, that is hard to do. Actually, even with the plain cone Obscore (i.e., ironically, the most powerful of the discovery protocols covered here) currently results in an implementation that makes me unhappy: ugly, slow, and wrong. This is requires a longer discussion; see Appendix: Optionality Considered Harmful.

    datasets at this point is a list of, conceputally, Obscore records. Technically, the list contains instances of a custom class ImageFound, which have attributes named after the Obscore columns. In case you have doubts about the Semantics of any column, the Obscore specification is there to help. And yes, you can argue we should create a single astropy table from that list. You are probably right.

    PyVO adds an extra column over the mandatory obscore set, origin_service. This contains the IVOA identifier (IVOID) of the service at which the dataset was found. You have probably seen IVOIDs before: they are URIs with a scheme of ivo:. What you may not know: these things actually resolve, specifically to registry resource records. You can do this resolution in a web browser: Just prepend https://dc.g-vo.org/I/ to an IVOID and paste the result into the address bar. For instance, my Obscore table has the IVOID ivo://org.gavo.dc/__system__/obscore/obscore; the link below the IVOID leads you to an information page, which happens to be the resource's Registry record formatted with a bit of XSLT. A somewhat more readable but less informative rendering is available when you prepend https://dc.g-vo.org/LP/ (“landing page”).

    The second value returned from discover.images_globally is a list of strings with information on how the global discovery progressed. For now, this is not intended to be machine-readable. Humans can figure out which resources were skipped because other services already cover their data, which services yielded how many records, and which services failed, for instance:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Obscore GAVO Data Center Obscore Table: 2 records
    SIA2 The VO @ ASTRON SIAP Version 2 Service: 0 records
    SIA2 ivo://au.csiro/casda/sia2 skipped: ReadTimeout: HTTPSConnectionPool(host=&apos;casda.csiro.au&apos;, port=443): Read timed out. (read timeout=20)
    SIA2 CADC Image Search (SIA): 0 records
    SIA2 European HST Archive SIAP service: 0 records
    ...
    

    (On the skipping, see Relationships below). I consider this crucial provenance, as that lets you assess later what you may have missed. When you save the results, be sure to save these, too.

    A feature that will presumably (see Inclusivity for the reasons for this expectation) be important at least for a few years is that you can pass the result of a Registry query, and pyVO will try to find services suitable for image discovery on that set of resources.

    A relatively straightforward use case for that is global obscore discovery. This would look like this:

    from pyvo import discover
    from pyvo import registry
    from astropy import units as u
    from astropy import time
    
    def say(discoverer, s):
            print(s)
    
    datasets, log = discover.images_globally(
      space=(274.6880, -13.7920, 1),
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')),
      services=registry.search(registry.Datamodel("obscore")),
      watcher=say)
    

    The watcher thing lets you, well, watch the progress of the discovery; it receives an instance of the discoverer -- this is so you can abort a discoverer's activities from within some UI -- and the human-readable string to display or process in some other way.

    A UI

    To get an idea whether this API might one day work for the average astronomer, I have written a Tkinter-based GUI to global image discovery as it is now: tkdiscover (only available from github at this point). This is what a session with it might look like:

    Lots of TOPCAT windows with various graphs and tables, an x-ray image of the sky with overplotted points, and a play gray window offering the specification of space, spectrum, and time constraints.

    The actual UI is in the top right: A plain window in which you can configure a global discovery query by straightfoward serialisations of discover.images_globally's arguments:

    • Space (currently, a cone in RA, Dec, and search radius, separated by whitespace of commas)
    • Spectrum (currently, a single point as a wavelength in metres)
    • Time (currently, either a single point in time – which probably is rarely useful – or an interval, to be entered as civil DALI dates
    • Inclusivity.

    When you run this, this basically calls discover.images_globally and lets you know how it is progressing. You can click Broadcast (which sends the current result to all VOTable clients on the SAMP bus) or Save at any time and inspect how discovery is progressing. I predict you will want to do that, because querying dozens of services will take time.

    There is also a Stop button that aborts the dataset search (you will still have the records already found). Note that the Stop button will not interrupt running network operations, because the network library underneath pyVO, requests, is not designed for being interrupted. Hence, be patient when you hit stop; this may take as long as the configured timeout (currently is 20 seconds) if the service hangs or has to do a lot of work. You can see that tkdiscover has noticed your stop request because the service counter will show a leading zero.

    Service counter? Oh, that's what is at the bottom right of the window. Once service discovery is done, that contains three numbers: The number of services to query, the number of services queried already, and the number of services that failed.

    The table contains the obscore records described above, and the log lines are in the discovery_log INFO. I will give you that this is extremely unreadable in particular in TOPCAT, which normalises the line separators to plain whitespace. Perhaps some other representation of these log lines would be preferable: A PARAM with a char[][] (but VOTable still is terrible with arrays of variable-length strings)? Or a separate table with char[*] entries?

    Inclusivity

    I have promised above I'd explain the “Inclusive” part in both the pyVO API and the Tk UI. Well, this is a bit of a sad story.

    All-VO-queries take time. Thus, in pyVO we try to only query services that we expect serve data of interest. How do we arrive at expectations like that? Well, quite a few records in the Registry by now declare their coverage in space and time (cf. my 2018 post for details).

    The trouble is: Most still don't. The checkmark at inclusive decides whether or not to query these “undecidable” services. Which makes a huge difference in runtime and effort. With the pre-configured constraints in the current prototype (X-Ray images a degree around 274.6880, -13.7920 from the year 1995), we currently discover three services (of which only one actually needs to be queried) when inclusive is off. When it is on, pyVO will query a whopping 323 services (today).

    The inclusivity crisis is particularly bad with Obscore tables because of their broken registration pattern; I can say that so bluntly because I am the author of the standard at fault, TAPRegExt. I am preparing a note with a longer explanation and proposals for fixing matters – <cough> follow me on github –, but in all brevity: Obscore data is discovered using something like a flag on TAP services. That is bad because the TAP services usually have entriely different metadata from their Obscore table; think, in particular, of the physical coverage that is relevant here.

    It will be quite a bit of effort to get the data providers to do the Registry work required to improve this situation. Until that is done, you will miss Obscore tables when you don't check inclusive (or override automatic resource selection as above) – and if you do check inclusive, your discovery runs will take something like a quarter of an hour.

    Relationships

    In general, the sheer number of services to query is the Achilles' heel in the whole plan. There is nothing wrong with having a machine query 20 services, but querying 200 is starting to become an effort.

    With multi-data collection services like Obscore (or collective SIA2 services), getting down to a few dozen services globally for a well-constrained search is actually not unrealistic; once all resources properly declare their coverage, it is not very likely that more than 20 institutions worldwide will have data in a credibly small region of space, time, and spectrum. If all these run collective services and properly declare the datasets to be served by them, that's our 20-services global query right there.

    However, pyVO has to know when data contained in a resource is actually queriable by a collective service. Fortunately, this problem has already been addressed in the 2019 endorsed note on Discovering Data Collections Within Services: Basically, the individual resource declares an IsServedBy relationship to the collective service. PyVO global discovery already looks at these. That is how it could figure out these two things in the sample log given above:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    

    But of course the individual services have to declare these relationships. Surprisingly many already do, as you can observe yourself when you run:

    select ivoid, related_id from
    rr.relationship
    natural join rr.capability
    where
    standard_id like 'ivo://ivoa.net/std/sia%'
    and relationship_type='isservedby'
    

    on your favourite RegTAP endpoint (if you have no preferences, use mine: http://dc.g-vo.org/tap). If you have collective services and run individual SIA services, too, please run that query, see if you are in there, and if not, please declare the necessary relationships. In case you are unsure as to what to do, feel free to contact me.

    Future Directions

    At this point, this is a rather rough prototype that needs a lot of fleshing out. I am posting this in part to invite the more adventurous to try (and break) global discovery and develop further ideas.

    Some extensions I am already envisaging include:

    • Write a similar module for spectra based on SSAP and Obscore. That would then probably also work for time series and similar 1D data.

    • Do all the Registry work I was just talking about.

    • Allow interval-valued spectral constraints. That's pretty straightforward; if you are looking for some place to contribute code, this is what I'd point you to.

    • Track overflow conditions. That should also be simple, probably just a matter of perusing the pyVO docs or source code and then conditionally produce a log entry.

    • Make an obscore s_region out of the SIA1 WCS information. This should also be easy – perhaps someone already has code for that that's tested around the poles and across the stitching line? Contributions are welcome.

    • Allow more complex geometries to define the spatial region of interest. To keep SIA1 viable in that scenario it would be conceivable to compute a bounding box for SIA1 POS/SIZE and do “exact” matching locally on the coarser SIA1 result.

    • Enable multi-position or multi-interval constraints. This pretty certainly would exclude SIA1, and, realistically, I'd probably only enable Obscore services with TAP uploads with this. With those constraints, it would be rather straightforward.

    • Add SODA support: It would be cool if my ImageFound had a way to say “retrieve data for my RoI only”. This would use SODA and datalink to do server-side cutouts where available and do the cut-out locally otherwise. If this sounds like rocket science: No, the standards for that are actually in place, and pyVO also has the necessary support code. But still the plumbing is somewhat tricky, partly also because pyVO's datalink API still is a bit clunky.

    • Going async? Right now, we civilly query one service after the other, waiting for each result before proceeding to the next service. This is rather in line with how pyVO is written so far.

      However, on the network side for many years asynchronous programming has been a very successful paradigm – for instance, our DaCHS package has been based on an async framework from the start, and Python itself has growing in-language support for async, too.

      Async allows you to you fire off a network request and forget about it until the results come back (yes, it's the principle of async TAP, too). That would let people run many queries in parallel, which in turn would result in dramatically reduced waiting times, while we can rather easily ensure that a single client will not overflow any server. Still, it would be handing a fairly powerful tool into possibly unexperienced hands… Well: for now there is no need to decide on this, as pyVO would need rather substantial upgrades to support async.

    Appendix: Optionality Considered Harmful

    The trouble with obscore and cones is a good illustration of the traps of attempting to fix problems by adding optional features. I currently translate the cone constraint on Obscore using:

    "(distance(s_ra, s_dec, {}, {}) < {}".format(
      self.center[0], self.center[1], self.radius)
    +" or 1=intersects(circle({}, {}, {}), s_region))".format(
      self.center[0], self.center[1], self.radius))
    

    which is all of ugly, presumably slow, and wrong.

    To appreciate what is going on, you need to know that Obscore has two ways to define the spatial coverage of an observation. You can give its “center” (s_ra, s_dec) and something like a rough radius (s_fov), or you can give some sort of geometry (e.g., a polygon: s_region). When the standard was written, the authors wanted to enable Obscore services even on databases that do not know about spherical geometry, and hence s_region is considered rather optional. In consequence, it is missing in many services. And even the s_ra, s_dec, s_fov combo is not mandatory non-null, so you are perfectly entitled to only give s_region.

    That is why there are the two conditions or-ed together (ugly) in the code fragment above. 1=intersects(circle(.), s_region) is the correct part; this is basically how the cone is interpreted in SIA1, too. But because s_region may be NULL even when s_ra and s_dec are given, we also need to do a test based on the center position and the field of view. That rather likely makes things slower, possibly quite a bit.

    Even worse, the distance-based condition actually is wrong. What I really ought to take into account is s_fov and then do something like distance(.) < {self.radius}+s_fov, that is, the dataset position need only be closer than the cone radius plus the dataset's FoV (“intersects”). But that would again produce a lot of false negatives because s_fov may be NULL, too, and often is, after which the whole condition would be false.

    On top of that, it is virtually impossible that such an expression would be evaluated using an index, and hence with this code in place, we would likely be seqscanning the entire obscore table almost every time – which really hurts when you have about 85 Million records in your Obscore table (as I do).

    The standard could immediately have sanitised all this by saying: when you have s_ra and s_dec, you must also give a non-empty s_fov and s_region. This is a classic case for where a MUST would have been necessary to produce something that is usable without jumping through hoops. See my post on Requirements and Validators on this blog for a longer exposition on this whole matter.

    I'm not sure if there is a better solution than the current “if the operators didn't bother with s_region, the dataset's FoV will be ignored“. If you have good ideas, by all means let me know.

    Followup (2024-11-28)

    I've given a talk at the Malta interop giving another view on this matter: VO at the limit.

    [1]

    If you want to try this (in particular without clobbering your “normal” pyVO), do something like this:

    virtualenv --system-site-packages global-datasets
    . global-datasets/bin/activate
    cd global-datasets
    git clone https://github.com/msdemlei/pyvo
    cd pyvo
    git checkout global-datasets
    pip install .
    

« Page 2 / 20 »