Articles from Standards

  • Persistent TAP Uploads Update: A Management Interface

    a screenshot of a python notebook; a few lines of python yield the current date, a few more a date a year from now.

    There is a new version of the jupyter notebook showing off the persistent TAP uploads in python coming with this post, too: Get it.

    Six months ago, I reported on my proposal for persistent uploads into TAP services on this very blog: Basically, you could have and keep your own tables in databases of TAP servers supporting this, either by uploading them or by creating them with an ADQL query. Each such table has a URI; you PUT to it to create it, you GET from it to inspect its metadata, VOSI-style, and you DELETE to it to drop the table once you're done.

    Back then, I enumerated a few open issues; two of these I have recently addressed: Lifetime management and index creation. Here is how:

    Setting Lifetimes

    In my scheme, services assign a lifetime to user-uploaded tables, mainly in order to nicely recover when users don't keep books on what they created. The service will eventually clean up their tables after them, in the case of the reference implementation in DaCHS after a week.

    However, there are clearly cases when you would like to extend the lifetime of your table beyond that week. To let users do that, my new interface copies the pattern of UWS. There, jobs have a destruction child. You can post DALI-style timestamps[1] there, and both the POST and the GET return a DALI timestamp of the destruction time actually set; this may be different from what you asked for because services may set hard limits (in my case, a year).

    For instance, to find out when the service will drop the table you will create when you follow last October's post, you could run:

    curl http://dc.g-vo.org/tap/user_tables/my_upload/destruction
    

    To request that the table be preserved until the Epochalypse you would say:

    curl -F DESTRUCTION=2038-01-19T03:14:07Z http://dc.g-vo.org/tap/user_tables/my_upload/destruction
    

    Incidentally (and as you can see from the POST response), until January 2037, my service will reduce your request to “a year from now”.
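
    The clamping behaviour described above can be sketched in a few lines of Python. This is only an illustration of the logic, not DaCHS's actual code; the function names and the 365-day constant are assumptions made for the example, and the parser only handles second-precision timestamps.

```python
from datetime import datetime, timedelta, timezone

# Assumed hard limit, mirroring the "a year from now" behaviour in the text.
MAX_LIFETIME = timedelta(days=365)


def dali_timestamp(moment):
    """Format an aware datetime as a DALI-style UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def parse_dali_timestamp(text):
    """Parse a second-precision DALI timestamp; the trailing Z is optional."""
    bare = text[:-1] if text.endswith("Z") else text
    parsed = datetime.strptime(bare, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def effective_destruction(requested, now, max_lifetime=MAX_LIFETIME):
    """Return the destruction time a service with a hard limit would set."""
    latest_allowed = now + max_lifetime
    return min(parse_dali_timestamp(requested), latest_allowed)


# A fixed "now" keeps the example reproducible.
now = datetime(2025, 4, 1, 12, 0, 0, tzinfo=timezone.utc)
print(dali_timestamp(effective_destruction("2038-01-19T03:14:07Z", now)))
# → 2026-04-01T12:00:00Z
```

    Requests that stay within the limit pass through unchanged, which is why a client should always read back the timestamp the service returns rather than assume its request was honoured.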

    I can't say I'm too wild about posting a parameter called “DESTRUCTION” to an endpoint that's called “destruction” (even if it weren't such a mean word). UWS did that because they wanted to make it easy to operate a compliant service from a web browser. Whether that still is a reasonable design goal (in particular because everyone seems to be wild about dumping 20 metric tons of Javascript on their users, even though things like UWS would make it easy to not do that) is certainly debatable. But I thought it's better to have a single questionable pattern throughout rather than have something a little odd in one place and something a little less odd in another place.

    Creating Indexes

    For many applications of database systems, having indexes is crucial. You really, really don't want to have to go through a table with 2 billion rows just to find a single object (that's a quarter of a day when you manage to pull through 100'000 rows a second; ok: today's computers are faster than that). While persistently uploaded tables won't (regularly) have two billion rows any time soon, indexes are already very valuable even for tables in the million-row range.
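
    The back-of-the-envelope figure above is easy to check. The row count and scan rate are the ones from the text; the comparison count for an index is an idealised binary-search bound added here for contrast, not a statement about how any particular database lays out its indexes.

```python
import math

ROWS = 2_000_000_000        # table size from the text
ROWS_PER_SECOND = 100_000   # sequential scan rate from the text

scan_seconds = ROWS / ROWS_PER_SECOND
scan_hours = scan_seconds / 3600
print(f"{scan_seconds:.0f} s = {scan_hours:.1f} h = {scan_hours / 24:.2f} days")
# → 20000 s = 5.6 h = 0.23 days

# Idealised lookup in a sorted index: about log2(N) comparisons.
comparisons = math.ceil(math.log2(ROWS))
print(comparisons)
# → 31
```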

    On the other hand, there are many sorts of indexes, and there are many ways to qualify indexes. To get an idea of what you might want to tell a database about an index, see Postgres' CREATE INDEX docs. And that's just for Postgres; other database systems still do it differently, and of course when you index on expressions, there is no limit to the complexity you can build into your indexes.

    Building a cross-database API that would reflect all that is entirely out of the question. Hence, I went for the other extreme: You just specify which column(s) you would like to have indexed, and the service is supposed to choose a plausible index type for you.

    Following the model of destruction (typography matters!), this is done by POST-ing one or more column names in INDEX parameters to the index child of the table url. For instance, if you have put a table my_upload that has a column Kmag (e.g., from last October's post), you would say:

    curl -L -F INDEX=Kmag http://dc.g-vo.org/tap/user_tables/my_upload/index
    

    The -L makes curl follow the redirect that this issues. Why would it redirect, you ask? The index request creates a UWS job behind the scenes, that is, something like a TAP async job. What you get redirected to is that job.

    The background is that for large tables and complex indexes, you may easily get your (apparently idle) connection cut while the index is being created, and you would never learn if a problem had materialised or when the indexing is done. Against that, UWS lets us keep running, and you have a URI at which to inspect the progress of the indexing operation (well, frankly: nothing yet beyond “is it done?”).

    Speaking UWS with curl is no fun, but then you don't need to: The job starts in QUEUED and will automatically execute when the machine next has time. In case you are curious, see the notebook linked above, where there is an example for manually following the job's progress. You could use generic UWS clients to watch it, too.
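
    For readers who would rather watch the job from a script than from the notebook, here is a minimal polling sketch. It relies only on the standard UWS convention that GET-ting the phase child of a job returns the phase as plain text; the function names, the polling interval, and the timeout are assumptions for illustration, and the job URL is whatever the redirect from the index POST leads to.

```python
import time
from urllib.request import urlopen

# Phases after which a UWS job will not run any further.
FINAL_PHASES = {"COMPLETED", "ERROR", "ABORTED"}


def fetch_phase(job_url):
    """GET <job_url>/phase, which UWS defines to return plain text."""
    with urlopen(job_url.rstrip("/") + "/phase") as response:
        return response.read().decode("ascii").strip()


def wait_for_job(job_url, get_phase=fetch_phase, poll_interval=2.0,
                 timeout=600.0, sleep=time.sleep):
    """Poll a UWS job until it reaches a final phase; return that phase."""
    waited = 0.0
    while True:
        phase = get_phase(job_url)
        if phase in FINAL_PHASES:
            return phase
        if waited >= timeout:
            raise TimeoutError(f"job still {phase} after {waited:.0f} s")
        sleep(poll_interval)
        waited += poll_interval
```

    Passing get_phase and sleep as arguments keeps the loop testable without a network connection; in real use you would call wait_for_job(job_url) with the defaults.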

    A weak point of the scheme (and one that's surprisingly hard to fix) is that the index is immediately shown in the table metadata; the notebook linked to above shows this. I'll spare you the VODataService XML that curl-ing the table URL will spit at you, but in there you will see the Kmag index whether or not the indexer job has run.

    It shares this deficit with another way to look at indexes. You see, since there is so much backend-specific stuff you may want to know about an index, I am also proposing that when you GET the index child, you get back the actual database statements, or at least something rather similar. This is expressly not supposed to be machine readable, if only because what you see is highly dependent on the underlying database.

    Here is what this looks like on DaCHS over postgres after the index call on Kmag:

    $ curl http://dc.g-vo.org/tap/user_tables/my_upload/index
    Indexes on table tap_user.my_upload
    
    CREATE INDEX my_upload_Kmag ON tap_user.my_upload (Kmag)
    

    I would not want to claim that this is particularly human-readable either. But humans that try to understand why a computer does not behave as they expect will certainly appreciate something like this.

    Special Index Types

    If you look at the tmp.vot from last October's post, you will see that there is a pair of equatorial coordinates in _RAJ2000 and _DEJ2000. It is nicely marked up with pos.eq UCDs, and the units are deg: This is an example of a column set that DaCHS has special index magic for. Try it:

    curl -L -F INDEX=_RAJ2000 -F INDEX=_DEJ2000 \
      http://dc.g-vo.org/tap/user_tables/my_upload/index > /dev/null
    

    Another GET against index will show you that this index is a bit different, stuttering something about q3c (or perhaps spoint at another time or on another service):

    Indexes on table tap_user.my_upload
    
    CREATE INDEX my_upload__RAJ2000__DEJ2000 ON tap_user.my_upload (q3c_ang2ipix("_RAJ2000","_DEJ2000"))
    CLUSTER my_upload__RAJ2000__DEJ2000 ON tap_user.my_upload
    CREATE INDEX my_upload_Kmag ON tap_user.my_upload (Kmag)
    

    DaCHS will also recognise spatial points. Let's quickly create a table with a few points by running:

    CREATE TABLE tap_user.somepoints AS
    SELECT TOP 30 preview, ssa_location
    FROM gdr3spec.ssameta
    

    on the TAP server at http://dc.g-vo.org/tap, for instance in TOPCAT (as explained in post one, the “Table contained no rows” message you will see then is to be expected). Since TOPCAT does not know about persistent uploads yet, you have to create the index using curl:

    curl -LF INDEX=ssa_location http://dc.g-vo.org/tap/user_tables/somepoints/index
    

    GET-ting the index URL after that will yield:

    Indexes on table tap_user.somepoints
    
    CREATE INDEX ndpmaliptmpa_ssa_location ON tap_user.ndpmaliptmpa USING GIST (ssa_location)
    CLUSTER ndpmaliptmpa_ssa_location ON tap_user.ndpmaliptmpa
    

    The slightly shocking name of the table is an implementation detail that I might want to hide at some point; the important thing here is the USING GIST that indicates DaCHS has realised that for spatial queries to be able to use the index, a special method is necessary.

    Incidentally, I was (and still am) not entirely sure what to do when someone asks for this:

    curl -L -F INDEX=_Glon -F INDEX=_DEJ2000 \
      http://dc.g-vo.org/tap/user_tables/my_upload/index > /dev/null
    

    That's a latitude and a longitude all right, but of course they don't belong together. Do I want to treat these as two random columns being indexed together, or do I decide that the user very probably wants to use a very odd coordinate system here?

    Well, try it and see how I decided; after this post, you know what to do.

    [1]Many people call that “ISO format”, but I cannot resist pointing out that ISO, in addition to charging people who want to read their standards an arm and a leg, admits a panic-inducing variety of date formats, and so “ISO format” is not a particularly useful term.
  • A Proposal for Persistent TAP Uploads

    From its beginning, the IVOA's Table Access Protocol TAP has let users upload their own tables into the services' databases, which is an important element of TAP's power (cf. our upload crossmatch use case for a minimal example). But these uploads only exist for the duration of the request. Having more persistent user-uploaded tables, however, has quite a few interesting applications.

    Inspired by Pat Dowler's 2018 Interop talk on youcat I have therefore written a simple implementation for persistent tables in GAVO's server package DaCHS. This post discusses what is implemented, what is clearly still missing, and how you can play with it.

    If all you care about is using this from Python, you can jump directly to a Jupyter notebook showing off the features; it by and large explains the same things as this blogpost, but using Python instead of curl and TOPCAT. Since pyVO does not know about the proposed extensions, the code necessarily is still a bit clunky in places, but if something like this becomes more standard, working with persistent uploads will look a lot less like black art.

    Before I dive in: This is certainly not what will eventually become a standard in every detail. Do not do large implementations against what is discussed here unless you are prepared to throw away significant parts of what you write.

    Creating and Deleting Uploads

    Where Pat's 2018 proposal re-used the VOSI tables endpoint that every TAP service has, I have provisionally created a sibling resource user_tables – and I found that usual VOSI tables and the persistent uploads share virtually no server-side code, so for now this seems a smart thing to do. Let's see what client implementors think about it.

    What this means is that for a service with a base URL of http://dc.g-vo.org/tap[1], you would talk to (children of) http://dc.g-vo.org/tap/user_tables to operate the persistent tables.

    As with Pat's proposal, to create a persistent table, you do an http PUT to a suitably named child of user_tables:

    $ curl -o tmp.vot https://docs.g-vo.org/upload_for_regressiontest.vot
    $ curl -H "content-type: application/x-votable+xml" -T tmp.vot \
      http://dc.g-vo.org/tap/user_tables/my_upload
    Query this table as tap_user.my_upload
    

    The actual upload at this point returns a reasonably informative plain-text string, which feels a bit ad-hoc. Better ideas are welcome, in particular after careful research of the rules for 30x responses to PUT requests.
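
    The same PUT can be issued from Python's standard library. The following is a sketch rather than a reference client: the helper names are made up for this example, and decoding the reply as UTF-8 is an assumption about the plain-text response.

```python
from urllib.request import Request, urlopen

USER_TABLES = "http://dc.g-vo.org/tap/user_tables"  # endpoint from the text


def make_upload_request(table_name, votable_bytes, base_url=USER_TABLES):
    """Build (but do not send) the PUT request creating a persistent table."""
    return Request(
        f"{base_url}/{table_name}",
        data=votable_bytes,
        headers={"Content-Type": "application/x-votable+xml"},
        method="PUT")


def upload_table(table_name, votable_path, base_url=USER_TABLES):
    """Send the VOTable at votable_path; return the service's text reply."""
    with open(votable_path, "rb") as source:
        request = make_upload_request(table_name, source.read(), base_url)
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

    Calling upload_table("my_upload", "tmp.vot") would then correspond to the curl invocation above. Separating request construction from sending makes it easy to inspect what would go over the wire before anything does.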

    Trying to create tables with names that will not work as ADQL regular table identifiers will fail with a DALI-style error. Try something like:

    $ curl -H "content-type: application/x-votable+xml" -T tmp.vot \
      http://dc.g-vo.org/tap/user_tables/join
    ... <INFO name="QUERY_STATUS" value="ERROR">'join' cannot be used as an
      upload table name (which must be regular ADQL identifiers, in
      particular not ADQL reserved words).</INFO> ...
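
    A client can catch most such names before bothering the server. The check below is a rough sketch: the pattern follows ADQL's rule for regular identifiers (a Latin letter followed by letters, digits, or underscores), but the reserved-word set is a deliberately small, incomplete sample, so the service's verdict remains authoritative.

```python
import re

# ADQL regular identifiers: a Latin letter, then letters, digits, underscores.
REGULAR_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")

# A small, deliberately incomplete sample of ADQL reserved words.
SOME_RESERVED_WORDS = {
    "SELECT", "FROM", "WHERE", "JOIN", "TOP", "AS", "ON", "GROUP", "ORDER",
    "BY", "TABLE", "CREATE", "DROP"}


def is_plausible_table_name(name):
    """Return True if name looks usable as an upload table name."""
    return (REGULAR_IDENTIFIER.match(name) is not None
            and name.upper() not in SOME_RESERVED_WORDS)


for candidate in ["my_upload", "join", "3.6mag"]:
    print(candidate, is_plausible_table_name(candidate))
```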
    

    After a successful upload, you can query the VOTable's content as tap_user.my_upload:

    A TOPCAT screenshot with a query 'select avg("3.6mag") as blue, avg("5.8mag") as red from tap_user.my_upload' that has a few red warnings, and a result window showing values for blue and red.

    TOPCAT (which is what painted these pixels) does not find the table metadata for tap_user tables (yet), as I do not include them in the “public“ VOSI tables. This is why you see the reddish syntax complaints here.

    I happen to believe there are many good reasons for why the volatile and quickly-changing user table metadata should not be mixed up with the public VOSI tables, which can be several 10s of megabytes (in the case of VizieR). You do not want to have to re-read that (or discard caches) just because of a table upload.

    If you have the table URL of a persistent upload, however, you inspect its metadata by GET-ting the table URL:

    $ curl http://dc.g-vo.org/tap/user_tables/my_upload | xmlstarlet fo
    <vtm:table [...]>
      <name>tap_user.my_upload</name>
      <column>
        <name>"_r"</name>
        <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
        <unit>arcmin</unit>
        <ucd>pos.angDistance</ucd>
        <dataType xsi:type="vs:VOTableType">float</dataType>
        <flag>nullable</flag>
      </column>
      ...
    

    – this is a response as from VOSI tables for a single table. Once you are authenticated (see below), you can also retrieve a full list of tables from user_tables itself as a VOSI tableset. Enabling that for anonymous uploads did not seem wise to me.
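
    Both kinds of response are simple enough to digest with Python's standard library. The sketch below collects table and column names from either a single-table document or a full tableset; the function names are invented for this example, and the sample document (including its namespace URI and the reduced column list) is an assumption modelled on the excerpt above, not actual service output.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip a namespace prefix in ElementTree's {uri}name notation."""
    return element.tag.rsplit("}", 1)[-1]


def describe_tables(vosi_xml):
    """Map each table name in a VOSI response to its list of column names."""
    root = ET.fromstring(vosi_xml)
    tables = {}
    # iter() includes the root, so this covers vtm:table and vtm:tableset.
    for element in root.iter():
        if local_name(element) == "table":
            columns = [column.findtext("name")
                       for column in element.findall("column")]
            tables[element.findtext("name")] = columns
    return tables


# Abridged, hand-written sample; the namespace URI is assumed.
SAMPLE = """<vtm:table xmlns:vtm="http://www.ivoa.net/xml/VOSITables/v1.0">
  <name>tap_user.my_upload</name>
  <column><name>"_r"</name><unit>arcmin</unit></column>
  <column><name>Kmag</name></column>
</vtm:table>"""

print(describe_tables(SAMPLE))
# → {'tap_user.my_upload': ['"_r"', 'Kmag']}
```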

    When done, you can remove the persistent table, which again follows Pat's proposal:

    $ curl -X DELETE http://dc.g-vo.org/tap/user_tables/my_upload
    Dropped user table my_upload
    

    And again, the text/plain response seems somewhat ad hoc, but in this case it is somewhat harder to imagine something less awkward than in the upload case.

    If you do not delete yourself, the server will garbage-collect the upload at some point. On my server, that's after seven days. DaCHS operators can configure that grace period on their services with the [ivoa]userTableDays setting.

    Authenticated Use

    Of course, as long as you do not authenticate, anyone can drop or overwrite your uploads. That may be acceptable in some situations, in particular given that anonymous users cannot browse their uploaded tables. But obviously, all this is intended to be used by authenticated users. DaCHS at this point can only do HTTP basic authentication with locally created accounts. If you want one in Heidelberg, let me know (and otherwise push for some sort of federated VO-wide authentication, but please do not push me).

    To just play around, you can use uptest as both username and password on my service. For instance:

    $ curl -H "content-type: application/x-votable+xml" -T tmp.vot \
      --user uptest:uptest \
      http://dc.g-vo.org/tap/user_tables/privtab
    Query this table as tap_user.privtab
    

    In recent TOPCATs, you would enter the credentials once you hit the Log In/Out button in the TAP client window. Then you can query your own private copy of the uploaded table:

    A TOPCAT screenshot with a query 'select avg("3.6mag") as blue, avg("5.8mag") as red from tap_user.my_upload' that has a few red warnings, and a result window showing values for blue and red; there is now a prominent Log In/Out-button showing we are logged in.

    There is a second way to create persistent tables (that would also work for anonymous): run a query and prepend it with CREATE TABLE. For instance:

    A TOPCAT screenshot with a query 'create table tap_user.smallgaia AS SELECT * FROM gaia.dr3lite TABLESAMPLE(0.001)'. Again, TOPCAT flags the create as an error, and there is a dialog "Table contained no rows".

    The “error message” about the empty table here is to be expected; since this is a TAP query, it stands to reason that some sort of table should come back for a successful request. Sending the entire newly created table back without solicitation seems a waste of resources, and so for now I am returning a “stub” VOTable without rows.
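
    Outside of TOPCAT, such a CREATE TABLE query is just an ordinary TAP sync request. The sketch below builds one with the standard TAP parameters (REQUEST, LANG, QUERY) without sending it; the helper name is an assumption for this example, and the query is the one from the screenshot above.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

TAP_URL = "http://dc.g-vo.org/tap"  # service base URL from the text


def make_sync_request(adql, tap_url=TAP_URL):
    """Build (but do not send) a TAP sync POST running the given query."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    return Request(
        tap_url + "/sync",
        data=urlencode(params).encode("ascii"),
        method="POST")


request = make_sync_request(
    "CREATE TABLE tap_user.smallgaia AS "
    "SELECT * FROM gaia.dr3lite TABLESAMPLE(0.001)")

# Show what would be posted, decoded back from the form encoding.
print(request.full_url)
print(parse_qs(request.data.decode("ascii"))["QUERY"][0])
```

    Sending the request with urllib.request.urlopen(request) should then return the row-less stub VOTable described above rather than the table contents.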

    As an authenticated user, you can also retrieve a full tableset for what user-uploaded tables you have:

    $ curl --user uptest:uptest http://dc.g-vo.org/tap/user_tables | xmlstarlet fo
    <vtm:tableset ...>
      <schema>
        <name>tap_user</name>
        <description>A schema containing users' uploads. ...  </description>
        <table>
          <name>tap_user.privtab</name>
          <column>
            <name>"_r"</name>
            <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
            <unit>arcmin</unit>
            <ucd>pos.angDistance</ucd>
            <dataType xsi:type="vs:VOTableType">float</dataType>
            <flag>nullable</flag>
          </column>
          ...
        </table>
        <table>
          <name>tap_user.my_upload</name>
          <column>
            <name>"_r"</name>
            <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
            <unit>arcmin</unit>
            <ucd>pos.angDistance</ucd>
            <dataType xsi:type="vs:VOTableType">float</dataType>
            <flag>nullable</flag>
          </column>
          ...
        </table>
      </schema>
    </vtm:tableset>
    

    Open Questions

    Apart from the obvious question whether any of this will gain community traction, there are a few obvious open points:

    1. Indexing. For tables of non-trivial sizes, one would like to give users an interface to say something like “create an index over ra and dec interpreted as spherical coordinates and cluster the table according to it”. Because this kind of thing can change runtimes by many orders of magnitude, enabling it is not just some optional embellishment.

      On the other hand, what I just wrote already suggests that even expressing the users' requests in a sufficiently flexible cross-platform way is going to be hard. Also, indexing can be a fairly slow operation, which means it will probably need some sort of UWS interface.

    2. Other people's tables. It is conceivable that people might want to share their persistent tables with other users. If we want to enable that, one would need some interface on which to define who should be able to read (write?) what table, some other interface on which users can find what tables have been shared with them, and finally some way to let query writers reference these tables (tap_user.<username>.<tablename> seems tricky since with federated auth, user names may be just about anything).

      Given all this, for now I doubt that this is a use case sufficiently important to make all the tough nuts delay a first version of user uploads.

    3. Deferring destruction. Right now, you can delete your table early, but you cannot tell my server that you would like to keep it for longer. I suppose POST-ing to a destruction child of the table resource in UWS style would be straightforward enough. But I'd rather wait to see whether the other lacunae require a completely different pattern before I will touch this; for now, I don't believe many persistent tables will remain in use beyond a few hours after their creation.

    4. Scaling. Right now, I am not streaming the upload, and several other implementation details limit the size of realistic user tables. Making things more robust (and perhaps scalable) hence will certainly be an issue. Until then I hope that the sort of table that worked for in-request uploads will be fine for persistent uploads, too.

    Implemented in DaCHS

    If you run a DaCHS-based data centre, you can let your users play with the stuff I have shown here already. Just upgrade to the 2.10.2 beta (you will need to enable the beta repo for that to happen) and then type the magic words:

    dachs imp //tap_user
    

    It is my intention that users cannot create tables in your DaCHS database server unless you say these words. And once you say dachs drop --system //tap_user, you are safe from their huge tables again. I would consider any other behaviour a bug – of which there are probably still quite a few. Which is why I am particularly grateful to all DaCHS operators that try persistent uploads now.

    [1]As already said in the notebook, if http bothers you, you can write https, too; but then it's much harder to watch what's going on using ngrep or friends.
  • Requirements and Validators

    Content Warning: this is mainly VO lore. I am not claiming any immediate applicability to the use or publication of astronomical data.

    This morning, I set out to reply to a mail by Mark Taylor and noticed after a while that I was writing a philosophical piece on how to write standards – and how not to – that I may want to refer to again later. So, I'll make this a blog post.

    The story started when the excellent stilts taplint during my monthly validation routine produced an error when exercising my data centre's TAP endpoint:

    I-OBS-QSUB-5 Submitting query: SELECT TOP 1 obs_id FROM ivoa.ObsCore WHERE obs_id IS NULL
    E-OBS-QERR-1 TAP query failed [Service error: "Field query: Query timed out (took too long).
    

    What happened is that stilts tried to ascertain that all rows in my obscore table satisfy the standard's requirement that the obs_id column is non-NULL (see page 20). This made Postgres – the database system actually executing the queries – run what is known as a sequential scan through the tables involved in obscore; the reason underlying this bad judgement is a bit involved and has to do with the fact that in DaCHS, ivoa.obscore is a view composed of many tables. I will spare you the details, but the net effect of that is that it is not easy to tell Postgres that rows with obs_id NULL, if they exist at all, will be few and far between.

    By now, the number of data sets in my obscore table approaches 100'000'000, and fetching all that data simply takes time, more time than a synchronous query has on my site[1].

    Granted, I could fix that by adding indexes on the columns involved, but since these come from several dozen tables, that would be quite a bit of work for both me and the computer. Is that work worth it? Well, it certainly is if otherwise I'm breaking the standard, but since it is a serious amount of work, I am tempted to wonder: does the requirement actually make sense? And this leads to the question:

    Why do we require things in standards?

    In the end, there is just one reason to require something in a standard: Without the requirement, something important breaks. When one thinks about this a bit more deeply, one can distinguish two somewhat finer classes of requirements.

    (a) “Internal requirements“. These are rules imposed so machines can do their job. The most obvious examples here are requirements on how to write things. For instance, if a client writes an interval as lower/upper and the service expects lower upper, it just won't work. Hence, a standard has to say “The separator in intervals MUST be whitespace” (or whatever).

    There are more subtle requirements in that department. For instance, many tables need a primary key because other tables may want to refer to them. For Obscore, this becomes relevant just about now, when we think about having extensions for it. Those would add specific metadata for, say, radio or gamma observations. We will probably create them by adding per-extension tables holding a foreign key into ivoa.obscore. This is nice because then you can write something like:

    SELECT ...
    FROM ivoa.obscore
    JOIN ivoa.obs_visibility
      USING (obs_publisher_did)
    WHERE (some visibility-specific constraint)
    

    – and almost everything just works without further thought or effort: No plethora of columns that are NULL in ivoa.obscore for anything that is not a visibility, and no manual filtering out of non-visibilities either: JOIN does it all nicely for you. Isn't relational algebra great?
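
    To see the effect in miniature, here is a toy version using SQLite from Python. Everything in it is made up for illustration: the tables drop the ivoa schema prefix because SQLite has no schemas, the uv_distance_max column is a hypothetical extension column, and the identifiers are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obscore (
        obs_publisher_did TEXT PRIMARY KEY NOT NULL,
        dataproduct_type TEXT);
    -- Extension table holding a foreign key into obscore.
    CREATE TABLE obs_visibility (
        obs_publisher_did TEXT NOT NULL REFERENCES obscore,
        uv_distance_max REAL);
    INSERT INTO obscore VALUES
        ('ivo://example/img1', 'image'),
        ('ivo://example/vis1', 'visibility'),
        ('ivo://example/vis2', 'visibility');
    INSERT INTO obs_visibility VALUES
        ('ivo://example/vis1', 120.0),
        ('ivo://example/vis2', 4500.0);
""")

# The image row has no partner in obs_visibility, so the JOIN drops it
# without any explicit filter on dataproduct_type.
rows = conn.execute("""
    SELECT obs_publisher_did, dataproduct_type, uv_distance_max
    FROM obscore
    JOIN obs_visibility USING (obs_publisher_did)
    WHERE uv_distance_max > 1000
""").fetchall()
print(rows)
# → [('ivo://example/vis2', 'visibility', 4500.0)]
```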

    But this only is possible if obs_publisher_did (well: it's not certain yet whether that actually will be obscore's designated primary key, but bear with me there) really is non-NULL, and if there are no two rows with the same publisher DID (which are the general criteria to make something a primary key in a relation). Hence, these two constraints are something we simply MUST (pun intended) require.

    (b) “Functional requirements”. These are requirements resulting from considerations of the use of the standard. I have just encountered a nice example when working on LineTAP, a future standard on how to access data about spectral lines. An important use case there is that the client displays the lines on top of a spectrum, and it will want to put something next to the lines so the user has at least a first indication just what would cause the line to show up. That it can only do if the service provides it with a plausible label – asking clients to invent a label based on the data it has is likely to produce very unsatisfying results, as no machine is smart enough to figure out nice, idiomatic strings like „21 cm HI“ or „Hα“. Hence, we simply have to require that each row in such a LineTAP table has a title (technically: the corresponding column has a non-NULL constraint).

    Going back to the obs_id example, it does not seem there is a strong case to invoke either (a) or (b) – since the column explicitly has no uniqueness requirement, it will not work as a primary key, and users will probably only want to use it for “grouped” data, where multiple artefacts belong to one “observation”. For data sets not within such groups, there really is no application for obs_id I can see. Of course, I may be missing something, which is why I asked around on the mailing lists.

    If we figure out nothing breaks when we remove the requirement, then we should drop it: Every requirement causes some overhead in implementation and validation. In the present case, the implementation overhead would be all the indexes on the various obs_id columns, which I would not otherwise need. The validation overhead is the extra queries that taplint needs to do. Having overhead for no benefit (in terms of things not breaking) goes against sensible parsimony in what we ask our adopters to do (and I'll officially admit here that we do ask quite a bit already).

    … and why do we validate them?

    In the mail I have cited above, Mark has kindly offered to just not run the query in the validation suite, and all this philosophy was really intended to lead up to a “thanks, but no thanks”.

    That is because, first of all, requirements that are not checked by a machine are requirements that are not met. You see, what we do is hard. Sure, there are harder problems in computing, but globally distributed information systems run by only loosely connected parties are rather non-trivial. People writing code to solve non-trivial problems will get it wrong.

    The common way to deal with this fact is to test with one client and call it a day when that client seems to work for whatever was chosen as a test case. To mention a non-VO standard where this implement-to-the-client method failed horribly and continues to fail horribly: ACPI, the part of the firmware that's supposed to make, for instance, suspend-to-RAM something one doesn't have to think about. Vendors usually stop developing their ACPI code when the current version of Windows does not fail horribly with their implementation. A paper in the proceedings of the 2007 Linux symposium discusses some of the consequences in the least offensive way conceivable – and in a way that I, as a VO developer running quite a few Linux boxes, can very much relate to.

    The bottom line is that if an unmet requirement breaks things and validators do not check for that requirement, then services will work to some degree with a certain client and break as soon as people switch to a different client (or perhaps only try to be smart). That's in stark contrast to one of my main selling points when I do VO teaching: „Hey, you can prototype with TOPCAT, and when you've figured out things, just switch to pyVO so you can scale, automate, and make your work reproducible“.

    So, let's try to avoid unvalidated requirements.

    Instead, let's have as few requirements as we can while covering the use cases we envision. And then let's have great validators that make sure these requirements are met by the services (or instance documents, or whatever it may be). Such validators not only help making the VO an effective environment that's fun to work with. They also give service operators – like… me – a peace of mind that nothing else can provide.

    [1]I keep a rather tight limit on the sync queries because the system also answers registry discovery queries, and these should be reasonably snappy. If I let long sync queries run, it is very easy to overload the system by accident. If I don't, people who want to run long queries can move to async. There, jobs are queued and only let in one or two at a time. That will not (usually) overload anything.
  • Small Change, Big Win

    Screenshot with the Erratum content (2 lines) highlighted

    That's SCS 1.03 Erratum 2 rendered in my browser with a bit of image processing to celebrate that there's one painful VO legacy less in this world.

    PSA: what follows is VO lore that may be entertaining but will not help you use or publish astronomical data.

    Today, I've made a very small commit to my VO publication package DaCHS (revision 8452):

    --- gavo/web/vodal.py (revision 8451)
    +++ gavo/web/vodal.py (working copy)
    @@ -260,7 +260,6 @@
            version = "1.0"
            parameterStyle = "dali"
            standardId = "ivo://ivoa.net/std/ConeSearch"
    -     defaultOutputFormat = "votable1.1"
    

    One deleted line, small cause, huge effect.

    This story starts with the oldest „operational“ VO standard, Simple Cone Search, which was formally published in 2008 but really got its current shape a lot earlier.

    I was not there back then, but I think the authors expected that clients would be parsing the VOTables that the services were returning using something called XML binding. That, well, was a technique where code was generated from an XML schema, and only instance documents conforming to that exact schema could be parsed with that code.

    That is of course the opposite of the golden rule of interoperability (“be strict in what you produce and lenient in what you accept”) and thus would have been a terrible implementation choice for interoperable clients (and I believe nobody ever tried it). But somehow – or that is my explanation – the XML binding reasoning translated into the requirement that SCS services could only return VOTable 1.0 or VOTable 1.1, and that made it into the standard. It was hence the law. And that meant DaCHS had to keep alive VOTable 1.1 for writing (which the above commit of course doesn't remove, but I can remove it now any time I feel like it). And that it couldn't do a lot of useful things that required features not present in VOTable 1.1.

    Nobody dared to touch the problem for about a decade, as it was actually unclear whether some ancient code might still be doing useful work with SCS and XML binding. And I shouldn't be scolding them after I have recently broken ESO examples under the assumption that “aw, nobody's gonna do this”. Then, starting about five years ago, we had a couple of discussions at various conferences about how we might bring SCS into the present VO (where it, it has to be said, sticks out a bit for several other reasons, too, like its funky error reporting and the funny UCDs it uses). But these weren't easy: What exactly are we allowed to break within a minor version under the above assumption (“aw, nobody…”)? If we do a major version, how do we plan for the co-existence of two parallel major versions?

    Well: For the version restriction, in the end a simple Erratum was enough. On January 26, 2022, the IVOA Technical Coordination Group accepted SCS 1.03 Erratum 2. And now I can return whatever VOTable version suits me. Phewy.

    I can now have GROUPs in GROUPs (which I need to annotate photometry), and I can finally return tables with my old proposal for STC in VOTable in SCS results (where they would have mattered most – not that anyone cares any more, as that ship has sailed somewhere completely different).

    Hey, I can have xtypes. Doesn't mean anything to you? Well, try this: In TOPCAT, open VO/Cone Search. Type “Constellations” and select the “cslt cone” service. Run a query for some part of the sky, with a size of a few tens of degrees. Open a sky plot, and in there, do Layers → Add Area Control, and in that control select the table you have just pulled in. Presto: You'll see the constellation boundaries without further configuration, and that's because TOPCAT has the xtype to figure out that the odd numbers it sees are really the vertex coordinates of a spherical polygon in DALI serialisation.
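
    To make the “odd numbers” a bit more concrete, here is a minimal Python sketch of what a client can do once the xtype tells it that a value is a DALI polygon, that is, a flat sequence of alternating longitude and latitude values. The function name parse_dali_polygon and the sample coordinates are made up for illustration; this is not how TOPCAT does it internally.

```python
def parse_dali_polygon(serialised):
    """Turn a DALI-style polygon string into a list of (lon, lat) tuples.

    Assumes the serialisation is a whitespace-separated sequence of
    numbers, alternating longitude and latitude, one pair per vertex.
    """
    values = [float(token) for token in serialised.split()]
    if len(values) % 2 != 0:
        raise ValueError("polygon needs an even number of coordinates")
    if len(values) < 6:
        raise ValueError("polygon needs at least three vertices")
    # Pair up every longitude (even index) with the latitude following it.
    return list(zip(values[0::2], values[1::2]))


# Hypothetical sample value; a real one would come out of a VOTable cell.
vertices = parse_dali_polygon("10.0 20.0 15.0 20.0 12.5 25.0")
print(vertices)
```

    Without the xtype, a client would have no way of knowing that the array should be read in pairs at all, which is why the sky plot needed manual configuration before.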

    Not a big deal, you say? Perhaps. But lots of small deals accumulated make the difference between what you can do and what you cannot, in particular across services (which is what the VO is about).

    Removing the erroneous constraint on VOTable versions in SCS opened the standard up for quite a few small deals. Thanks, TCG!

  • We'd still have IDL

    I am newly appointed as a member of the topic group for Federated Infrastructures of DIG-UM (that's an acronym for Digital Transformation in the Research on Universe and Matter), a “bottom-up organization for synergetic research on the digital transformation” (as it says in their Guidelines) in the fields covered by what the German Ministry for Research (BMBF) funds as part of its “Erforschung von Universum und Materie” (ErUM) programme. Since GAVO's work has largely been funded through that programme and its predecessors, I feel obliged to overcome my natural aversion against committee work in this case.

    The first thing I am trying to do in that function is explain the VO to our partners, which come from different branches of physics ranging from astroparticle physics (where I still feel relatively at home, though I haven't quite got around to figuring out root, a programme and format that's really common there) to accelerator physics to the Komitee Forschung mit nuklearen Sonden und Ionenstrahlen (KFSI), where people are probing into solid state matter using positron beams, which to me sounds (a) cool and (b) as if you'd better have your 511 keV-protective suit on when visiting them.

    A part of this was summarising what I think are the VO's most difficult challenges at this point. Probably the most pressing of those is the problem that we now routinely have data that is so large that moving it around in full is not a good idea. Now, for large catalogues, I think TAP and ADQL are a good basis for giving people tools for remote analysis, so there I'd say all that is needed is detail work.
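
    As a rough illustration of what “remote analysis” means for catalogues, the following Python sketch builds the URL of a synchronous TAP query that has the server do the counting and ship back only the result. The helper name make_sync_query_url is made up; the parameter names are the ones from the TAP standard, and TAP_SCHEMA.tables is used because every TAP service has to provide it. The sketch only assembles the URL and does not talk to the network.

```python
from urllib.parse import urlencode


def make_sync_query_url(base_url, adql):
    """Return the URL for running an ADQL query on a TAP sync endpoint.

    base_url is the TAP service's access URL; the query is passed in the
    standard LANG and QUERY parameters.
    """
    params = {
        "REQUEST": "doQuery",  # required by TAP 1.0, harmless later
        "LANG": "ADQL",
        "QUERY": adql,
    }
    return base_url.rstrip("/") + "/sync?" + urlencode(params)


# The aggregation happens on the server; only one number travels back.
url = make_sync_query_url(
    "http://dc.g-vo.org/tap",
    "SELECT COUNT(*) AS n FROM TAP_SCHEMA.tables",
)
print(url)
```

    The same pattern works against a multi-terabyte catalogue just as well as against TAP_SCHEMA, which is the point: the data stays where it is.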

    In contrast, for collections of array-like (images, say, but what I'm saying would also apply for things like a bulk analysis of a big collection of spectra) data, we do not have anything remotely comparable; the best you can do is make a remote cutout if you're lucky and your operator has implemented SODA. Doing something like “give me all spectra that have a strong Hα feature”, for instance, requires you to download all spectra, or at least the lines in question.

    Most data providers at this point respond to this challenge by giving their users jupyter hubs next to the data, which boils down to letting people write and execute Python scripts on the data providers' boxes from within a web browser. Admittedly, this works rather nicely for the moment, but I consider this a massive regression over the current VO, for at least the following reasons:

    • Lock-in: You cannot in general transport the jupyter notebooks you write from one provider to the next, because the execution environments are massively different (Python and package versions, package availability, data access).
    • Ephemeral: You probably will not even be able to execute the notebook reliably after the next update of the provider's platform: Python evolves relatively quickly, and many of the packages evolve even faster.
    • Undiscoverable: Nobody currently has figured out how these things could sensibly be registered such that you could ask: “Give me all execution environments I can use on data from ivo://dc.g-vo.org/tap.” Not that many are trying, given all the other problems.
    • Browser-based: Web browsers are probably the most broken and least sustainable element in current computing; if you've ever tried to tweak one of the “major browsers” to your liking, you probably know what I mean. With jupyter hubs, not only do I have to work through one of these horrible “major browsers”, the data providers also control what code is being executed in it. If they don't let me edit in vi, I can't edit in vi. Full stop[1].
    • Central control: More generally, with the current VO and its API endpoints, users get to choose what tools they use. If you'd like to use the APIs from lua or Haskell or want to cobble together stilts and shell script, go ahead. Yes, there is some initial effort to parse VOTable and perhaps support the more subtle aspects of TAP, but that's still not unreasonable. With the “platforms”, it is up to the service operators what tools they let you use.

    As a big fan of Python, I'm happy this platform thing happened exactly at the moment when Python was all the fashion (at least in Astronomy). But Python certainly isn't the end of history. People will think of smarter things (arguably, they already have), and very certainly the expectation that one tool fits all is very wrong.

    All that went through my head this morning when riding to work. And then a slogan crossed my mind that I liked so much for bringing the Platform Problem to a point that I wrote this entire post so I could publish it:

    If science platforms had come around 15 years ago, we'd all still be stuck with IDL.

    [1]Ok, there are greasemonkey-like hacks, but that's really too fragile to seriously consider.
