We'd still have IDL

2021-11-15 Markus Demleitner

I am newly appointed as a member of the topic group for Federated Infrastructures of DIG-UM (that's an acronym for Digital Transformation in the Research on Universe and Matter), a “bottom-up organization for synergetic research on the digital transformation” (as it says in their Guidelines) in the fields covered by what the German Ministry for Research (BMBF) funds as part of its “Erforschung von Universum und Materie” (ErUM) programme. Since GAVO's work has largely been funded through that programme and its predecessors, I feel obliged to overcome my natural aversion against committee work in this case.

The first thing I am trying to do in that function is explain the VO to our partners, which come from different branches of physics ranging from astroparticle physiscs (where I still feel relatively at home, though I haven't quite got around to figuring out root, a programme and format that's really common there) to accelerator physics to the Komitee Forschung mit nuklearen Sonden und Ionenstrahlen (KFSI), where people are probing into solid state matter using positron beams, which to me sounds (a) cool and (b) as if you'd better have your 511 keV-protective suit on when visiting them.

A part of this was summarising what I think are the VO's most difficult challenges at this point. Probably the most pressing of those is the problem that we now routinely have data that is so large that moving it around in full is not a good idea. Now, for large catalogues, I think TAP and ADQL are a good basis for giving people tools for remote analysis, so there I'd say all that is needed is detail work.

In contrast, for collections of array-like (images, say, but what I'm saying would also apply for things like a bulk analysis of a big collection of spectra) data, we do not have anything remotely comparable; the best you can do is make a remote cutout if you're lucky and your operator has implemented SODA. Doing something like “give me all spectra that have a strong Hα feature”, for instance, requires you to download all spectra, or at least the lines in question.

Most data providers at this point respond to this challenge is to give their users jupyter hubs next to the data, which boils down to letting people write and execute Python scripts on the data providers' boxes from within a web browser. Admittedly, this works rather nicely for the moment, but I consider this a massive regression over the current VO, for at least the following reasons:

Lock-in: You cannot in general transport the jupyter notebooks you write from one provider to the next, because the execution environments are massively different (Python and package versions, package availability, data access).
Ephemeral: You probably will not even be able to execute the notebook reliably after the next update of the provider's platform: Python evolves relatively quickly, and many of the packages evolve even faster.
Undiscoverable: Nobody currently as figured out how these things could sensibly be registered such that you could ask: “Give me all execution environments I can use on data from ivo://dc.g-vo.org/tap.” Not that many are trying, given all the other problems.
Browser-based: Web browsers are probably the most broken and least sustainable element in current computing; if you've ever tried to tweak one of the “major browsers” to your liking, you probably know what I mean. With jupyter hubs, not only do I have to work through one of these horrible “major browsers”, the data providers also control what code is being executed in it. If they don't let me edit in vi, I can't edit in vi. Full stop[1].
Central control: More generally, with the current VO and its API endpoints, users get to choose what tools they use. If you'd like to use the APIs from lua or Haskell or want to cobble together stilts and shell script, go ahead. Yes, there is some initial effort to parse VOTable and perhaps support the more subtle aspects of TAP, but that's still not unreasonable. With the “platforms”, it is up to the service operators what tools they let you use.

As a big fan of Python, I'm happy this platform thing happened exactly in the moment when Python was all the fashion (at least in Astronomy). But Python certainly isn't the end of history. People will think of smarter things (arguably, they already have), and very certainly the expectation that one tool fits all is very wrong.

All that went through my head this morning when riding to work. And then a slogan crossed my mind that I liked so much for bringing the Platform Problem to a point that I wrote this entire post so I could publish it:

If science platforms had come around 15 years ago, we'd all still be stuck with IDL.

[1]	Ok, there's greasemonkey-like hacks, but that's really to fragile to seriously consider.

Tutorial Renewal

2020-06-25 Markus Demleitner

The DaCHS Tutorial among other seminal works

DaCHS' documentation (readthedocs mirror) has two fat pieces and a lot of smaller read-as-you-go pieces. One of the behmoths, the reference documentation, at roughly 350 PDF pages, has large parts generated from source code, and there is no expectation that anyone would ever read it linearly. Hence, I wasn't terribly worried about unreadable^Wpassages of questionable entertainment value in there.

That's a bit different with the tutorial (also available as 150 page PDF; epub on request): I think serious DaCHS deployers ought to read the DaCHS Basics and the chapters on configuring DaCHS and the interaction with the VO Registry, and they should skim the remaining material so they are at least aware of what's there.

Ok. I give you that is a bit utopian. But given that pious wish I felt rather bad that the tutorial has become somewhat incoherent in the years since I had started the piece in April 2009 (perhaps graciously, the early history is not visible at the documentation's current github home). Hence, when applying for funds under our current e-inf-astro project, I had promised to give the tutorial a solid makeover as, hold your breath, Milestone B1-5, due in the 10th quarter. In human terms: last December.

When it turned out the Python 3 migration was every bit as bad as I had feared, it became clear that other matters had to take priority and that we might miss this part of that “milestone” (sorry, I can't resist these quotes). And given e-inf-astro only had two quarters to go after that, I prepared for having to confess I couldn't make good on my promise of fixing the tutorial.

But then along came Corona, and reworking prose seemed the ideal pastime for the home office. So, on April 4, I forked off a new-tutorial branch and started a rather large overhaul that, among others, resulted in the operators' guide with its precarious position between tutorial and reference being largely absorbed into the tutorial. In all, off and on over the last few months I accumulated (according to git diff --shortstat 6372 inserted and 3453 deleted lines in the tutorial's source. Since that source currently is 7762 lines, I'd say that's the complete makeover I had promised. Which is good as e-inf-astro will be over next Wednesday (but don't worry, our work is still funded).

So – whether you are a DaCHS expert, think about running it, or if you're just curious what it takes to build VO services, let me copy from index.html: Tutorial on importing data (tutorial.html,tutorial.pdf,tutorial.rstx). The ideal company for your vacation!

And if you find typos, boring pieces, overly radical advocacy or anything else you don't like: there's a bug tracker for you (not to mention PRs are welcome).

Posts with the Tag Funding: