Posts with the Tag LineTAP:

  • Requirements and Validators

    Content Warning: this is mainly VO lore. I am not claiming any immediate applicability to the use or publication of astronomical data.

    This morning, I set out to reply to a mail by Mark Taylor and noticed after a while that I was writing a philosophical piece on how to write standards – and how not to – that I may want to refer to again later. So, I'll make this a blog post.

    The story started when, during my monthly validation routine, the excellent stilts taplint produced an error while exercising my data centre's TAP endpoint:

    I-OBS-QSUB-5 Submitting query: SELECT TOP 1 obs_id FROM ivoa.ObsCore WHERE obs_id IS NULL
    E-OBS-QERR-1 TAP query failed [Service error: "Field query: Query timed out (took too long).
    

    What happened is that stilts tried to ascertain that all rows in my obscore table satisfy the standard's requirement that the obs_id column is non-NULL (see page 20). This made Postgres – the database system actually executing the queries – run what is known as a sequential scan through the tables involved in obscore; the reason underlying this bad judgement is a bit involved and has to do with the fact that in DaCHS, ivoa.obscore is a view composed of many tables. I will spare you the details, but the net effect of that is that it is not easy to tell Postgres that rows with obs_id NULL, if they exist at all, will be few and far between.

    By now, the number of data sets in my obscore table approaches 100'000'000, and fetching all that data simply takes time, more time than a synchronous query has on my site[1].
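
    To make the shape of that check concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Postgres and DaCHS. The two tables, their columns, and the example.org URLs are made-up stand-ins for the many tables behind the real view. The point it illustrates: when no row has a NULL obs_id, a LIMIT 1 (or TOP 1) does not help, because without a usable index the database has to look at every row before it can report that there is nothing to report.

```python
import sqlite3

# Hypothetical stand-ins for two of the tables behind the obscore view.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE images (obs_id TEXT, access_url TEXT);
    CREATE TABLE spectra (obs_id TEXT, access_url TEXT);
    CREATE VIEW obscore AS
        SELECT obs_id, access_url FROM images
        UNION ALL
        SELECT obs_id, access_url FROM spectra;
""")
con.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [("img-%d" % i, "http://example.org/img/%d" % i) for i in range(1000)])
con.executemany(
    "INSERT INTO spectra VALUES (?, ?)",
    [("spec-%d" % i, "http://example.org/spec/%d" % i) for i in range(1000)])


def find_null_obs_id(connection):
    """Return one offending row, or None if obs_id is never NULL."""
    return connection.execute(
        "SELECT obs_id, access_url FROM obscore"
        " WHERE obs_id IS NULL LIMIT 1").fetchone()


# On a compliant table this is None, found only after scanning everything.
offender = find_null_obs_id(con)
```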

    Granted, I could fix that by adding indexes on the columns involved, but since these come from several dozen tables, that would be quite a bit of work for both me and the computer. Is that work worth it? Well, it certainly is if otherwise I'm breaking the standard, but since it is a serious amount of work, I am tempted to wonder: does the requirement actually make sense? And this leads to the question:

    Why do we require things in standards?

    In the end, there is just one reason to require something in a standard: Without the requirement, something important breaks. When one thinks about this a bit more deeply, one can distinguish two somewhat finer classes of requirements.

    (a) “Internal requirements”. These are rules imposed so machines can do their job. The most obvious examples here are requirements on how to write things. For instance, if a client writes an interval as lower/upper and the service expects lower upper, it just won't work. Hence, a standard has to say “The separator in intervals MUST be whitespace” (or whatever).
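
    As a small illustration of why such a rule has to be spelled out, here is a hedged sketch of a service-side parser that accepts only the whitespace-separated form. The function name parse_interval is hypothetical and not taken from any VO library.

```python
def parse_interval(literal):
    """Parse 'lower upper' (whitespace-separated) into a pair of floats."""
    parts = literal.split()
    if len(parts) != 2:
        raise ValueError("expected 'lower upper', got %r" % literal)
    lower, upper = (float(part) for part in parts)
    return lower, upper


# The form the (hypothetical) standard requires is accepted ...
print(parse_interval("0.5 1.5"))

# ... while a client writing lower/upper is rejected.
try:
    parse_interval("0.5/1.5")
except ValueError as error:
    print("rejected:", error)
```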

    There are more subtle requirements in that department. For instance, many tables need a primary key because other tables may want to refer to them. For Obscore, this becomes relevant just about now, when we think about having extensions for it. Those would add specific metadata for, say, radio or gamma observations. We will probably create them by adding per-extension tables holding a foreign key into ivoa.obscore. This is nice because then you can write something like:

    SELECT ...
    FROM ivoa.obscore
    JOIN ivoa.obs_visibility
      USING (obs_publisher_did)
    WHERE (some visibility-specific constraint)
    

    – and almost everything just works without further thought or effort: No plethora of columns that are NULL in ivoa.obscore for anything that is not a visibility, and no manual filtering out of non-visibilities either: JOIN does it all nicely for you. Isn't relational algebra great?

    But this is only possible if obs_publisher_did (well: it's not certain yet whether that actually will be obscore's designated primary key, but bear with me there) really is non-NULL, and if there are no two rows with the same publisher DID (which are the general criteria to make something a primary key in a relation). Hence, these two constraints are something we simply MUST (pun intended) require.
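
    The following sketch plays this through with Python's sqlite3 module. It is an illustration under assumptions, not the eventual extension design: the n_antennas column and the identifiers are invented, and since SQLite has no ivoa schema, the tables carry bare names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE obscore (
        obs_publisher_did TEXT PRIMARY KEY NOT NULL,
        dataproduct_type TEXT);
    CREATE TABLE obs_visibility (
        obs_publisher_did TEXT NOT NULL
            REFERENCES obscore (obs_publisher_did),
        n_antennas INTEGER);  -- hypothetical extension column
""")
con.executemany("INSERT INTO obscore VALUES (?, ?)", [
    ("ivo://example/img1", "image"),
    ("ivo://example/vis1", "visibility"),
    ("ivo://example/vis2", "visibility")])
con.executemany("INSERT INTO obs_visibility VALUES (?, ?)", [
    ("ivo://example/vis1", 27),
    ("ivo://example/vis2", 64)])

# The join drops the image row without any explicit filtering.
matches = con.execute("""
    SELECT obs_publisher_did, n_antennas
    FROM obscore
    JOIN obs_visibility USING (obs_publisher_did)
    WHERE n_antennas > 30
    ORDER BY obs_publisher_did""").fetchall()


def try_insert(did):
    """Try to add an obscore row; return False if a constraint rejects it."""
    try:
        con.execute("INSERT INTO obscore VALUES (?, 'image')", (did,))
        return True
    except sqlite3.IntegrityError:
        return False
```

    With the key declared like this, try_insert("ivo://example/img1") (a duplicate) and try_insert(None) (a NULL) are both refused by the database, which is exactly the pair of guarantees the join relies on.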

    (b) “Functional requirements”. These are requirements resulting from considerations of the use of the standard. I have just encountered a nice example when working on LineTAP, a future standard on how to access data about spectral lines. An important use case there is that the client displays the lines on top of a spectrum, and it will want to put something next to the lines so the user has at least a first indication of just what would cause the line to show up. That it can only do if the service provides it with a plausible label – asking clients to invent a label based on the data they have is likely to produce very unsatisfying results, as no machine is smart enough to figure out nice, idiomatic strings like “21 cm HI” or “Hα”. Hence, we simply have to require that each row in such a LineTAP table has a title (technically: the corresponding column has a non-NULL constraint).
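
    A validator for such a functional requirement can be very small. The sketch below assumes rows arrive as dictionaries with a title key; the vacuum_wavelength key and its approximate values (in Ångström) are only there to make the sample rows look like line data and are not taken from the LineTAP draft.

```python
def missing_titles(rows):
    """Return the indices of rows whose title is NULL (None in Python)."""
    return [index for index, row in enumerate(rows)
            if row.get("title") is None]


# Illustrative sample rows; the values are approximate.
rows = [
    {"title": "Hα", "vacuum_wavelength": 6564.6},
    {"title": "21 cm HI", "vacuum_wavelength": 2.11e9},
    {"title": None, "vacuum_wavelength": 5008.2},
]

# A client could not label the third line, so the validator flags it.
print(missing_titles(rows))
```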

    Going back to the obs_id example, it does not seem there is a strong case to invoke either (a) or (b) – since the column explicitly has no uniqueness requirement, it will not work as a primary key, and users will probably only want to use it for “grouped” data, where multiple artefacts belong to one “observation”. For data sets not within such groups, there really is no application for obs_id I can see. Of course, I may be missing something, which is why I asked around on the mailing lists.

    If we figure out nothing breaks when we remove the requirement, then we should drop it: Every requirement causes some overhead in implementation and validation. In the present case, the implementation overhead would be all the indexes on the various obs_id columns, which I would not otherwise need. The validation overhead is the extra queries that taplint needs to run. Having overhead for no benefit (in terms of things not breaking) goes against sensible parsimony in what we ask our adopters to do (and I'll officially admit here that we do ask quite a bit already).

    … and why do we validate them?

    In the mail I have cited above, Mark has kindly offered to just not run the query in the validation suite, and all this philosophy was really intended to lead up to a “thanks, but no thanks”.

    That is because, first of all, requirements that are not checked by a machine are requirements that are not met. You see, what we do is hard. Sure, there are harder problems in computing, but globally distributed information systems run by only loosely connected parties are rather non-trivial. People writing code to solve non-trivial problems will get it wrong.

    The common way to deal with this fact is to test with one client and call it a day when that client seems to work for whatever was chosen as a test case. To mention a non-VO standard where this implement-to-the-client method failed horribly and continues to fail horribly: ACPI, the part of the firmware that's supposed to make, for instance, suspend-to-RAM something one doesn't have to think about. Vendors usually stop developing their ACPI code when the current version of Windows does not fail horribly with their implementation. A paper in the proceedings of the 2007 Linux symposium discusses some of the consequences in the least offensive way conceivable – and in a way that I, as a VO developer running quite a few Linux boxes, can very much relate to.

    The bottom line is that if an unmet requirement breaks things and validators do not check for that requirement, then services will work to some degree with a certain client and break as soon as people switch to a different client (or perhaps only try to be smart). That's in stark contrast to one of my main selling points when I do VO teaching: “Hey, you can prototype with TOPCAT, and when you've figured out things, just switch to pyVO so you can scale, automate, and make your work reproducible”.

    So, let's try to avoid unvalidated requirements.

    Instead, let's have as few requirements as we can while covering the use cases we envision. And then let's have great validators that make sure these requirements are met by the services (or instance documents, or whatever it may be). Such validators not only help make the VO an effective environment that's fun to work with. They also give service operators – like… me – a peace of mind that nothing else can provide.

    [1] I keep a rather tight limit on the sync queries because the system also answers registry discovery queries, and these should be reasonably snappy. If I let long sync queries run, it is very easy to overload the system by accident. If I don't, people who want to run long queries can move to async. There, jobs are queued and only let in one or two at a time. That will not (usually) overload anything.
  • Sofa instead of Granada

    Screenshot from an online talk

    Gesticulating wildly to a computer is what happens in an online conference. To me, at least. Let's hope nobody watched me through the window.

    It was already in the wee hours of Friday last week (CET) when the second "virtual Interop" had its rather unceremonious closing ceremony. Its predecessor in May had about it an air of a state of emergency. For instance, all sessions were monothematic. That was nice on the one hand, because a relatively large part of the time was available for discussion – which, really, is what the Interops are about. But then Interops are also about noticing what everyone else in the Virtual Observatory is cooking up, for which the short-ish talks we usually have at Interops work really well.

    In contrast to that first Corona Interop, this second one, replacing what would have taken place in Granada, Spain, had a much more conventional format, which again accommodated many talks. But of course, this made one feel the lack of possibilities to quickly hash out a problem during a coffee break or in a spontaneous splinter quite a bit more.

    Be that as it may, I would like to give you some insights on what I'm currently up to at the IVOA level; I am grateful for any feedback you can give on any of these topics.

    Given that I currently chair the Semantics Working Group, there was a natural focus on topics around vocabularies, and I gave two talks in that department. The one in DAL (DAL is the working group that builds the actual access protocols such as TAP or SIAP) was mainly on Datalink-related aspects of my Vocabularies in the VO 2 draft (VocInVO2), which in particular was an opportunity to thank everyone involved in the Vocabulary Enhancement Proposals we have been running this last year (all of which were about Datalink and hence closely tied to DAL). One thing I was asking for was reviews on a GitHub pull request that would make the bysemantics method of Datalink accesses semantics-aware; basically, as intended by the original Datalink authors, when asking for #calibration links, this will also return, say, #bias links. If you can spare a moment for this: Please do!
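
    What “semantics-aware” means here can be sketched in a few lines of Python. This is not the code from the pull request: the function names are made up, the links are invented, and the NARROWER mapping is a tiny hand-written excerpt standing in for the hierarchy that a real client would take from the Datalink vocabulary.

```python
# Assumed excerpt of a term hierarchy: term -> directly narrower terms.
NARROWER = {
    "#calibration": ["#bias", "#dark", "#flat"],
    "#preview": [],
}


def with_narrower(term, narrower=NARROWER):
    """Return the set containing term and all terms narrower than it."""
    found = {term}
    for child in narrower.get(term, ()):
        found |= with_narrower(child, narrower)
    return found


def links_by_semantics(links, term):
    """Select links whose semantics is term or one of its narrower terms."""
    wanted = with_narrower(term)
    return [link for link in links if link["semantics"] in wanted]


# Invented links, as a Datalink document might describe them.
links = [
    {"semantics": "#preview", "access_url": "http://example.org/p.png"},
    {"semantics": "#bias", "access_url": "http://example.org/bias.fits"},
    {"semantics": "#calibration", "access_url": "http://example.org/cal.fits"},
]

for link in links_by_semantics(links, "#calibration"):
    print(link["semantics"], link["access_url"])
```

    Asking for #calibration thus yields both the #bias and the #calibration link, while a plain string comparison would have returned only the latter.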

    Another thing I tried to raise some interest for is the proposed vocabulary of product types; this, I think, should eventually define what people may put into the dataproduct_type column of Obscore results, and there are related uses in Datalink and, believe it or not, the registration of SSAP (spectral) services. A question Alberto raised while I was discussing that made me realise I forgot to mention another vocabularies-related development relevant for DAL: I've put the gavo_vocmatch ADQL user-defined function into DaCHS. It lets you match something against a term or its narrower terms, referencing an IVOA vocabulary. For instance, if we had different sorts of time series (which, of course, would be odd for obscore that has the o_ucd column for this kind of thing), you could, using ADQL, still get all time series by querying:

    SELECT TOP 5 *
    FROM ivoa.obscore
    WHERE
      1=gavo_vocmatch(
        'product-type',
        'timeseries',
        dataproduct_type)
    

    Here, the first argument is the vocabulary name (whatever is after the http://www.ivoa.net/rdf in the vocabulary URL), the second the “root” term, and the third the column to match against. Since Postgres, for now, isn't aware of IVOA vocabularies, the second argument must be a literal string rather than, say, an expression involving columns.

    I gave a second semantics-related talk in the Registry session. That had its focus on the Unified Astronomy Thesaurus (UAT), from which people should pick the subject keywords in the VO Registry (actually, they should pick from its representation at http://www.ivoa.net/rdf/uat). I'll probably blog about that a little more some other time. For now, let me recommend a little UAT-based game on my Semantics Based Registry Browser sembarebro: Choose two terms that are pretty far apart (like, perhaps, ionized-coma-gases and cosmic-background-radiation) and then try to join the two sub-graphs. Warning: This may waste your time. But it will acquaint you with the UAT, which may be a good thing.

    In that second talk, I also mentioned a second draft vocabulary I've put up in the past six months, http://www.ivoa.net/rdf/messenger. This builds upon the terms for VODataService's waveband element, which enumerated certain flavours of photons (like Radio, Optical, or X-ray). Now that we explore other messengers as well and have more and more solar system resources in the Registry, I'm arguing we ought to open up things by making “Photon” explicit in there and then adding Neutrinos and, later, other messengers. I've received a certain amount of pushback there on mixing the electromagnetic spectrum with particle types; on the other hand, the hierarchical nature of our vocabularies would, I think, let us smartly get away with that.

    Speaking about solar system resources, I'm also listed as an author on Stéphane Erard's talk on EPN-TAP and EPNCore v2.0, probably due to my involvement in finally bringing EPN-TAP into the IVOA document repository. I've already talked about that in a 2017 post on this blog – and again, if you're interested in solar system data, this would be a good time to review the EPN-TAP working draft.

    Talking about things regular readers of this blog will have heard of: I referenced September's Crazy Shapes post in a talk on MOCs in pgsphere, together with a fervent appeal to data centers to become involved in pgsphere maintenance.

    And then there was my colleague Margarida's talk on LineTAP, a proposal to obsolete the little-used SLA protocol (which lets people search for spectral lines) with something combining the much more successful VAMDC with our beloved TAP. Me, I'm in this because I'd like to bring TOSS data closer to VAMDC – but also because having competing infrastructures for the same thing sucks.

    And finally, I gave a talk I've called Data Model Posture Review in a session of the Data Models working group; I was somewhat worried that given its rather skeptical outlook it wouldn't be really well-received. But in fact quite a few people shared my main conclusions – and perhaps it was another step towards resolving my decade-old spot of pain: that the VO still doesn't offer tech to reliably bring two catalogues to the same epoch without human intervention.

    With this number of talks I've been involved in, I'm essentially back to the level of a normal Interop. Which means I was fairly knocked out on Friday. And I can't lie: I still regret I didn't get to spend a few more warm days in Granada. Corona begone!
