• Limits, Materialisation, and Anchor Texts: DaCHS 2.13 is out

    AI slop: Ten badgers on a grassy floor.

    With all the crazy Star Trek-sounding talk of “materialising obscore” below, I could not resist and asked stablediffusion.com for “Thirteen badgers materialising obscore”. Well, counting badgers is hard, and I wouldn't have been sure how to visualise obscore, either. Rest assured, though, that the remainder of this post is not AI slop and is at least factually correct.

    It's been almost a year since the last release of our publication package, DaCHS, and so it's high time for DaCHS 2.13. I put it into our repository last Friday, and here is the obligatory post on the major news coming with it.

    Perhaps the biggest headline (and one that I'd ask you to act upon if you run a DaCHS system) is support for the new features in the brand-new VODataService 1.3 Working Draft. That is:

    • Column statistics. This is following my Note on Advanced Column Statistics on the way to improved blind discovery in the VO. To have them in your DaCHS, all you have to do is upgrade and run dachs limits ALL – and then make sure you run dachs limits after a dachs imp you are satisfied with (or use the new -l flag discussed below). Please do it – one can do a lot of interesting discovery in the Registry (and perhaps quite a bit more) if this is taken up broadly.

    • Product type declaration. So far, when you wanted to discover, say, spectra, you would enumerate the SSAP services in the Registry, perhaps with some additional constraints (e.g., on coverage), and then query each of those.

      Linking data types and protocols was a reasonable shortcut in the early VO. It no longer is, for a whole host of reasons, among which Obscore (which can publish any sort of observational data) ranks pretty high up. So, in the future, we need to be explicit on what among the terms from http://www.ivoa.net/rdf/product-type will come out of a service.

      Where this is immediately useful is when you publish time series through SSAP (which is not uncommon). Then, just put:

      <meta name="productType">timeseries</meta>
      

      into the root of your RD (the time series template in 2.13 already does this). If you publish cubes through SIAP, you should similarly say:

      <meta name="productType">cube</meta>
      

      For other SSAP and SIAP services, you probably don't need to bother at this point.

      For obscore, DaCHS will do the declarations for you if you have run:

      dachs limits //obscore
      

      – which is a good thing to do anyway (see above).

    • Data source declaration. For many purposes, it is really important to know whether some piece of data you found is based on actual observations or whether it comes out of some sort of simulation.

      So far, the only protocol that let you say something like that was SSAP. But there are now all kinds of other non-observational data in the VO, and so VODataService 1.3 introduces the vocabulary http://www.ivoa.net/rdf/data-source to let you say where the data you publish comes from.

      The default is going to be observational for a long while. If that's what you have, don't bother. But if you publish results from simulations (more or less: starting from random numbers), put:

      <meta name="dataSource">theory</meta>
      

      into your RD's root, and if it's data based on actual objects (simulated observations for a new instrument, say, or model spectra for concrete stars), make it:

      <meta name="dataSource">artificial</meta>
      

    To make filling in the VODataService column statistics somewhat less of a hassle, I have added an -l flag to dachs imp. This makes it run (in effect) a dachs limits after the import. I'm not doing this on every import because that would slow down the development of an RD; obtaining the statistics may take quite some time, and for certain sorts of tables you may prefer to run dachs limits with your own options.

    You could argue I should have inverted the logic, where you'd rather pass a flag saying “don't do limits” during development. You could probably convince me. But until someone protests, just remember to add an -l flag to your last import command.

    There are a few more prototypes for (possibly) upcoming standards in DaCHS 2.13. For one, you can now write units in ADQL queries as per my proposal at the Görlitz ADASS. That is, you can annotate literals with units in curly braces (as in 10{pc}), and you can convert values with known units into other units using a new operator @. For instance, if you were fed up with the stupid angle unit we've been forced to accept since… well, about 2000 BC, you could put the interface to saner units into your queries like this:

    SELECT TOP 20
      ra@{rad}, dec@{rad}, pmra@{rad/hyr}, pmdec@{rad/hyr}
    FROM gaia.dr3lite
    

    This is not a big advantage if you write queries just for a single catalogue. It does make a difference when you write queries that ought to work across multiple tables and services.
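    If you wonder what the @ operator actually does to the values, here is a minimal Python sketch of the two conversions in the query above, with hand-written factors; on the server, the factors come from the columns' declared units (deg and mas/yr in gaia.dr3lite), and the function names here are made up for illustration:

```python
import math

# Sketch of value@{unit} for the two conversions in the query above.
# Assumption: source columns are in deg (positions) and mas/yr
# (proper motions), as in gaia.dr3lite.

DEG_TO_RAD = math.pi / 180.0

def at_rad(value_deg):
    """ra@{rad}: degrees to radians."""
    return value_deg * DEG_TO_RAD

def at_rad_per_hyr(value_mas_per_yr):
    """pmra@{rad/hyr}: mas/yr to rad per hectoyear (1 hyr = 100 yr)."""
    # 1 mas = 1e-3 arcsec = 1e-3/3600 deg, hence the 3.6e6
    return value_mas_per_yr / 3.6e6 * DEG_TO_RAD * 100

print(at_rad(180))            # pi
print(at_rad_per_hyr(3.6e4))  # one degree per hectoyear, in rad
```

    The hectoyear may look exotic, but it is just the standard hecto prefix on yr, and it gives proper motions handy orders of magnitude.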

    While you should not notice the per-mode limit declarations coming from an unpublished draft of TAPRegExt 1.1 (except that the async limits TOPCAT shows will now better match what DaCHS actually enforces), you could appreciate the support for StaticFile that comes out of DocRegExt 1.0. There, it is used to register single PDF files or perhaps ipython notebooks. When you register such things[1], you can now say something like:

    <publish render="edition" sets="ivo_managed">
      <meta>
        accessURL: \internallink{\rdId/static/myfile.txt}
        accessURL.resultType: text/plain
      </meta>
    </publish>
    

    The result of this will be that DaCHS produces a doc:StaticFile interface rather than vs:WebBrowser, and it will produce a resultType element saying that what you get back is plain text (in this case). If you have other applications for having static files like that in registry records, do let me know.

    My investigation into slow obscore queries that I already reported on here led to two changes: For one, some types in the obscore table changed, and in consequence dachs val -vc ALL will complain if you have pulled the obscore columns into your own tables. Just try the val -vc and either re-import the affected resources at your leisure (it's only an aesthetic defect; things will continue to work) or change the column types as described in the blog post linked above.

    Probably more importantly, you can now materialise the obscore view (actually, in order to let you drop the contributing tables at will, it's not a materialised view but a table, but that's… immaterial here). You want to do that if you have many contributions to your obscore table and at least some queries against it become slow without your being able to figure out why. See Materialised Obscore in the tutorial for what to do if you want to materialise your obscore table, too.

    Something perhaps worth exploring for you is that you can now publish entire RDs. I implemented this for a resource with lots of little “services” (actually, HiPSes) that share so many pieces of metadata that it just seemed wrong to give them all separate resource records (though I am in discussion with the HiPS people, who are not particularly fond of having multiple HiPSes in one resource record). Beyond that, you could have, say, a cone search for extracted sources, an image service, and a browser service for both in one RD and then say, in the RD section with top-level metadata:

    <publish sets="ivo_managed"/>
    

    – everything should then live nicely as separate capabilities within one resource record, without any of the publish/@service tomfoolery you had to use so far to glue together VO and browser services.

    For local publications (i.e., browser services appearing on your front page), this will result in a link to the RD info (minor DaCHS secret: <your server URL>/browse/<rd-id> gives an overview over the tables and services defined in an RD). Whether that's useful enough for you in such a case I cannot predict. But you can mix all-RD publications in ivo_managed with conventional <publish sets="local"/> elements for browser services.

    Among the more minor changes, the default web form template now employs a WebSAMP connector, which means that the SAMP button on results of the form renderer is now greyed out until a SAMP hub becomes visible on your machine.

    If you use a display hint type=url, you can now control the anchor text on the a element in HTML output by setting a property anchorText on the corresponding column. Yes, that will then be constant for all the products. If you really need more control than that, you will have to define a formatter for a custom outputField.
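    In RD terms, this could look somewhat like the following (the column name and anchor text are made up for illustration; check the DaCHS reference documentation for the exact spelling of the property if this does not take):

```xml
<column name="datalink_url" type="text" ucd="meta.ref.url"
    description="Link to the dataset"
    displayHint="type=url">
  <property name="anchorText">Get dataset</property>
</column>
```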

    So far, the fullDLURL macro could only be used when you actually had a normal, filename-based DaCHS access reference. This was unfortunate because this kind of thing is particularly convenient for “virtual” data generated on the fly. Hence, you can now pass some python code in a second fullDLURL argument that must return the accref to use. Read a bit more on the context in Datalinks as Product URLs.

    There are many other minor changes and fixes that you hopefully will only notice because some annoying behaviour of DaCHS is now a little less annoying.

    If you spot problems or miss something, feel free to report that at our new repository at Codeberg. The main VCS for DaCHS still is https://gitlab-p4n.aip.de/gavo/dachs. But we will probably migrate to Codeberg by the 2.14 release to make reporting bugs and writing pull requests simpler.

    Perhaps we will receive some from you?

    [1]Using resType: document; I notice I should really add some material on registering educational material with DaCHS to the tutorial.
  • Porting a DaCHS SIAv1 service to SIAP2

    a distorted title page of the SIAP version 2 standard, centred on the date 2015-12-23

    Ten years after, let me talk about SIAP version 2.

    In December 2015, the IVOA made Simple Image Access Version 2.0 (hereafter: SIAv2) a Recommendation (that is: the standard you should be following). I am fairly sure that most people into computers would have understood that as “Don't do Simple Image Access version 1 (SIAv1) any more”. As of ten years ago.

    This is not how things worked out. Actually, to this day new SIAv1 services still come online. In the talk about major version transitions I gave in College Park last June, I remarked that 20% of the registered SIAv1 services were younger than 30 months.

    There are many reasons why obsoleting SIAv1 has not worked (yet); very frankly, I had rather fiercely argued that we did not want SIAv2 at all, on the grounds that Obscore is all you need to discover products of observations.

    But since it's there now I feel I should do something for its adoption, beginning with not pushing out any new SIAv1 services myself. So, when a data provider sent me an RD they built from a previous one and it would have published a new SIAv1 service, I thought this was the time to start updating my own services.

    The next step then is to encourage DaCHS adopters to help out, too, that is, to port over their RDs from doing SIAP version 1 to doing SIAP version 2[1]. That's why I am writing this blog post.

    Going From SIAv1 to SIAv2 in 11 Moderately Difficult Steps

    Since the output table schema (and quite a bit beyond that) changed between the two versions, the port is not entirely trivial; if it were, we wouldn't have done a major version (i.e., breaking) change in the first place. But I'd argue it's quite doable when two conditions are met:

    • You have a DaCHS version 2.8 or later (if not, you should upgrade anyway).
    • You are not using siapCutoutCore right now; what this does is hard to replicate in SIAv2 (because positional constraints are now optional), and so if you want to keep the auto-cutout functionality, you probably are stuck on SIAv1.

    That said, here's my recipe:

    1. Change the mixin on the table that keeps the image metadata. So far, you probably had mixin="//siap#pgs". Drop this and add:

      <mixin have_bandpass_id="True">//siap2#pgs</mixin>
      

      to the table body instead. If you really have no bandpass you would like to mention, you can leave out the attribute definition.

    2. Change the obscore mixin in the table body if you did an obscore publication (skip this step if not). With SIAv2, write instead:

      <mixin preview="access_url || '?preview=True'"
        >//obscore#publishObscoreLike</mixin>
      

      It is really simple now because SIAv2 just re-uses the obscore schema.

      Keep your old mixin definition in a scratch pad (or the version control history at least), because it will help you when you fill out the parameters to //siap2#setMeta.

    3. Change any index statements for standard columns you may have; the column names are completely different between SIAv1 and SIAv2. Classic examples include:

      • bandpassId becomes bandpass_id (if available)
      • bandpassLo becomes em_min
      • bandpassHi becomes em_max
      • an index on dateObs should become indexes on t_min and t_max.

      If your table is small enough that you managed without indexes so far, don't bother creating new ones.
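      Assuming DaCHS' plain index elements, the dateObs replacement from the list above would then look somewhat like:

```xml
<index columns="t_min"/>
<index columns="t_max"/>
```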

    4. Check custom extension fields for whether they are now in core SIAv2. The classic case is exposure time, which was missing in SIAv1. Just drop your custom column definition(s).

    5. If there is datalink on the SIAP table, you will have to change its definition, too; the relevant column is now obs_publisher_did. If your datalink service has the id dl, the result of the operation would be this:

      <meta name="_associatedDatalinkService">
        <meta name="serviceId">dl</meta>
        <meta name="idColumn">obs_publisher_did</meta>
      </meta>
      

      This may lead to datalink failures in DaCHS < 2.13 (in that the datasets are no longer found). If this bites you, let me know.

    6. Fix the rowmaker for the SIAP table. For the computePGS and getBandFromFilter apply, just add a 2 to their procDef references, so that these become //siap2#computePGS and //siap2#getBandFromFilter (if applicable).

      The main work is going from //siap#setMeta to //siap2#setMeta, because their parameter sets are somewhat different, although they do map to each other to some degree.

      The way to do the migration is to go through SIAv2's setMeta's parameter list in the reference documentation and identify the old parameters, or take the values from your obscore definition. Once you are past this point, you have done the heavy lifting.

      (For completeness, let me mention that you will probably get away with dropping pixflags and keeping the other parameters as they are, as there is some compatibility glue; but you'd miss setting up extra SIAv2 metadata, and that would be a shame).

    7. Experimentally run dachs imp. This will probably fail because there are references to old column names in, say, service definitions. Resolve these based on the names you used in setMeta (which largely double as the column names). Once you have made DaCHS accept your refurbished RD and have run the import, use dachs info to catch metadata items you have missed.

    8. If you used a shared core for both a service with the siap.xml renderer and the web form service, move that core into the web form service. Use //siap2#humanInput for the new positional constraint, and drop the #protoInput, if it is there, because it is no longer needed.

    9. The protocol service has to have allowed="siap2.xml", and its new core is:

      <dbCore queriedTable="main">
        <FEED source="//siap2#parameters"/>
      </dbCore>
      

      Replace "main" with whatever your table is called, and add any custom parameters you would like to have.

    10. In your regression tests (you have some, don't you?), change the renderers in the URIs (siap2.xml instead of siap.xml), and change POS and SIZE into POS="CIRCLE ..."; it is likely that you will also have to change column names in the assertions.

    11. Run dachs pub q to tell the Registry that your access URL has changed.

    That's it.

    I would argue this is time well spent. Even if one day there will be a successor to SIAv2 (and I do hope there will be one), it is highly likely that its metadata schema will align very well with obscore's, and hence most of the work you just did will put you in a very good position to switch to DAP with just a few keystrokes.

    [1]If you have built SIA services with DaCHS 2.8 (2023) or later using dachs start, you will already have a SIAv2 service; see the discussion in the pertaining release notes.
  • Queries Against My Obscore Are Slow!

    Content Warning: This is fairly deep nerd stuff. If you are just a normal VO user, you probably don't want to know about this. You probably even don't want to know about it if you are running a smallish DaCHS site. But perhaps you'll enjoy it anyway.

    Last May, I finally tried to get to the bottom of why certain queries against my obscore table – and in particular some joins I care about – were unnecessarily slow. The immediate use case was that I wanted to join the proposed radio extension for obscore to the main obscore table like this:

    SELECT COUNT(*)
        FROM ivoa.obscore
        JOIN ivoa.obs_radio
        USING (obs_publisher_did)
    

    This looks harmless, in particular since there are almost always indexes on obs_publisher_did columns for operational purposes: DaCHS uses them to locate rows in, for instance, Datalink operation.

    It is not. Harmless, I mean. On the contrary.

    Match Your Types Before You UNION ALL

    The main reason why there is a trap is that ivoa.obscore in DaCHS is a view (i.e., some sort of virtual table defined by a SQL query). This is because typically, multiple data collections contribute, and they can change independently of each other. We do not want to have to rebuild a full obscore table (which has almost 150 million rows in the Heidelberg data centre right now) just because we fix the metadata of a handful of images somewhere.

    Hence, ivoa.obscore is built somewhat like this in DaCHS[1]:

    CREATE OR REPLACE VIEW ivoa.obscore AS
     SELECT 'image'::text AS dataproduct_type,
        NULL::text AS dataproduct_subtype,
        2::smallint AS calib_level,
        'BGDS'::text AS obs_collection,
        ...
     FROM bgds.data
    UNION ALL
      SELECT 'image'::text AS dataproduct_type,
        NULL::text AS dataproduct_subtype,
        3::smallint AS calib_level,
      ...
    [and 42 further subqueries that are union-ed together]
    

    It turns out that this architecture is dangerous in Postgres.

    Laurenz Albe has a writeup on the underlying problem, which he summarises in a cartoon as “Before I UNION ALL you, be sure that your types match”. In short, UNION ALL becomes a planner barrier when the types of the columns of the relations being merged do not exactly match. For this purpose, a bigint is completely different from an integer.
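    To see the barrier in isolation, here is a minimal sketch with made-up tables; the only thing that matters is the mismatched type of access_estsize:

```sql
-- Minimal sketch of the planner barrier (made-up table names).
CREATE TABLE obs_a (obs_publisher_did text, access_estsize bigint);
CREATE TABLE obs_b (obs_publisher_did text, access_estsize integer);
CREATE INDEX ON obs_a (obs_publisher_did);
CREATE INDEX ON obs_b (obs_publisher_did);
CREATE TABLE dids (obs_publisher_did text);

CREATE VIEW both_obs AS
  SELECT obs_publisher_did, access_estsize FROM obs_a
  UNION ALL
  SELECT obs_publisher_did, access_estsize FROM obs_b;
  -- the integer is silently promoted to bigint here

-- Because of the type mismatch, the planner cannot flatten the view,
-- so it will not consider index lookups inside the branches for the
-- join; EXPLAIN typically shows full scans of obs_a and obs_b:
EXPLAIN SELECT * FROM both_obs JOIN dids USING (obs_publisher_did);
```

    With matching types (both bigint, say), the view is flattened and the per-branch indexes become usable again.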

    Full disclosure: it's not like I figured out the applicability of Laurenz' analysis to the DaCHS troubles by myself. It actually took multiple applications of the cluestick by Tom Lane, Laurenz, and others on pgsql-general.

    Known Problem Is Not Solved Problem

    Hence, since May, I had sort-of understood the problem. Fixing it, on the other hand, seemed rather overwhelming given the size of the view and the sometimes multiple levels of view building. In consequence, I procrastinated actually doing something about it until some time last November, when I realised that the computer could help with the analysis of which types from which tables do not match.

    I therefore wrote analyze-obscore.py and added it to the DaCHS repo. It will (presumably) never be part of the DaCHS package, but you can simply run it from a clone of the repo – and should do so if you have an obscore view fed from multiple tables.

    The output then is something like:

    ==== access_estsize ====
    
      bgds.data                      accsize/1024
      danish.data                    accsize/1024
      dfbsspec.ssa                   accsize/1024
      plts.data                      accsize/1024
      emi.main                       access_estsize (bigint)
      rosat.images                   accsize/1024
      califadr3.cubes                10
      robott.data                    accsize/1024
      k2c9vst.timeseries             accsize/1024
      dasch.narrow_plates            access_estsize (bigint)
      onebigb.ssa                    accsize/1024
      [...]
    
    ==== access_format ====
    
      bgds.data                      mime (text)
      danish.data                    mime (text)
      dfbsspec.ssa                   mime (text)
      plts.data                      mime (text)
      emi.main                       access_format (text)
      rosat.images                   mime (text)
      califadr3.cubes                'application/x-votable+xml;content=datalink'
      [...]
    
    ==== calib_level ====
    
      bgds.data                      2
      danish.data                    2
      dfbsspec.ssa                   2
      plts.data                      1
      emi.main                       calib_level (smallint)
      rosat.images                   2
      califadr3.cubes                3
      [...]
    

    and so on. That is: for each table contributing to a column, it either shows the source column together with its type, a literal, or the full expression. Literals are not problematic: as it turns out, DaCHS has always cast them to the appropriate type, so as long as the other source columns match what obscore thinks the columns ought to be, you should be fine.

    Expressions are more difficult. The only way to be sure there is to ask Postgres, somewhat like this:

    select pg_typeof(accsize/1024) from bgds.data limit 1
    

    Changing Types En Masse

    In my case, I had lots of inconsistencies between columns coming from SSA and columns coming more directly from obscore-like tables. If you have spectra and other things in one obscore table created by DaCHS <2.12.2, so do you.

    This is because in my obscore implementation I followed the somewhat ill-advised types written down in (but in my reading not actually required by) the obscore specification (p. 21). There is no conceivable scenario that would require more than 2³¹ polarisation states (the pol_xel column, which is supposed to be “adql:BIGINT”), and I do not feel overly future-skeptic when I say that it will also be some time until we have images with a linear dimension of more than two billion pixels. There is also no good reason to have an order-of-magnitude value like em_res_power to 16 significant digits (as implied by “adql:DOUBLE”)[2].

    I have cleaned this up in DaCHS 2.12.2. With this, the types of Obscore and the corresponding columns in SSA and SIAP are consistent within DaCHS' metadata declarations.

    However, the on-disk tables will keep their original types regardless of what DaCHS claims they are. You could fix this by re-importing the tables, but that would take quite a while, at least in my case. I have hence opted for targeted updates.

    The first step in that procedure is to figure out where Postgres' ideas of columns are now different from DaCHS' ideas given the recent metadata updates. For that, dachs val has had the -c (or --compare-db) flag for a long time. Running:

    dachs val -vc ALL
    

    gives you a list of all RDs that need work because the on-disk types (which are what actually determines the query plan) differ from DaCHS' expectations (which are what the 2.12.2 cleanup fixed). Once they match, you can feel entitled to a good query plan.

    Based on this, I have incrementally built a fixing script on my development system. As I'm pointing out towards the end of Publishing a Service in the DaCHS tutorial, the recommended way to run a DaCHS-based data centre is to have test snippets of almost all the resources on the production system on a <cough> development system (presumably: your laptop). That's what I do, and in this way I built this script:

    import subprocess
    
    from gavo import api
    
    with api.getWritableAdminConn() as conn:
            conn.execute("DROP VIEW IF EXISTS ivoa.obscore")
            conn.execute("DROP VIEW IF EXISTS dasch.plates")
    
            for table_name in ["emi.main", "dasch.narrow_plates"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER t_xel TYPE integer")
    
            for table_name in [
                            "emi.main", "dasch.narrow_plates", "ppakm31.cubes", "applause.main",]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_xel1 TYPE integer")
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_xel2 TYPE integer")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER pol_xel TYPE integer")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates",
                            "califadr3.cubes"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_xel TYPE integer")
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_res_power TYPE real")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates",
                            "ppakm31.cubes",
                            "applause.main",
                            "califadr3.cubes"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_min TYPE real")
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_max TYPE real")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates",
                            "applause.main"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_resolution TYPE real")
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_pixel_scale TYPE real")
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_fov TYPE real")
    
    
    for rd_id in ["emi/q", "califa/q3", "rome/q", "dasch/q", "ppakm31/q"]:
            subprocess.call(["dachs", "imp", "-m", rd_id])
    
    subprocess.call(["dachs", "imp", "dasch/q", "make-view"])
    subprocess.call(["dachs", "imp", "//obscore"])
    

    As I said: which columns to fix I learned from dachs val -vc; the extra DaCHS operations were necessary because Postgres refused the type changes as long as the views were still defined.

    Success?

    This entire operation has made quite a few obscore queries a lot faster.

    Regrettably, the motivating query, viz.:

    select count(*)
    from ivoa.obscore
    natural join ivoa.obs_radio
    

    is still slow. I have dug a bit into why Postgres does not find the seemingly obvious plan of just materialising the join with the tiny obs_radio table and contented myself with the note that has been in section 9.21 of the postgres documentation forever:

    Users accustomed to working with other SQL database management systems might be disappointed by the performance of the count aggregate when it is applied to the entire table. A query like:

    SELECT count(*) FROM sometable;
    

    will require effort proportional to the size of the table: PostgreSQL will need to scan either the entire table or the entirety of an index that includes all rows in the table.

    But at least a query like:

    select dataproduct_type, access_url, t_min, t_max
    from ivoa.obscore
    natural join ivoa.obs_radio
    where t_min between 56000 and 56005
    

    is fast, and until further trouble that's good enough for me.
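    Incidentally, when a rough figure is all you need, Postgres keeps a row count estimate in its catalogue that you can read without scanning anything. Note that this is only meaningful for actual tables (a materialised obscore, say); for a view it tells you nothing:

```sql
-- Planner's row count estimate, maintained by autovacuum/ANALYZE;
-- only meaningful for real tables, not for views.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE oid = 'ivoa.obscore'::regclass;
```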

    Followup (2026-03-03)

    Well, further trouble came afoot, and with DaCHS 2.12.3 you can therefore materialise your obscore table. This is as simple as saying:

    materialiseObscore: True
    

    in the [ivoa] section of your /etc/gavo.rc and then saying:

    dachs imp //obscore
    dachs limits //obscore
    

    in a shell. For large obscore tables, this will take a while (about 30 minutes for the imp in my data centre). I don't intend to do that more than once a month on average, and while queries to ivoa.obscore will block in that time, I think it's worth it: Query plans and all become a lot more readable, and my count(*) query suddenly finishes in less than a second. That's a big win over the several minutes I had before.

    Well: At least I have learned quite a bit about UNION ALL, and also about gathering metadata from many RDs at a time. So, this whole investigation was not a total waste of time.

    And if you have to know: this is not actually a materialised view but rather a normal, full-fledged table. That is because you cannot drop tables that are part of a materialised view, whereas once their rows are in a table, postgres lets you drop them as you like. And dropping is important if you want to develop your data collections.

    [1]In case you wonder: the individual parts of this union are kept in a table ivoa._obscoresources that you can inspect and even manipulate for special effects. The management of that table is among the more complex things one can do in DaCHS RDs. If you are curious, dachs adm dump //obscore will show you all the magic.
    [2]I put these type names into quotation marks because they were never formally defined. What Obscore does there has been identified as an antipattern in the meantime; newer specifications of similar schemas only distinguish floating point, integral, and string types and leave the choice of lengths to the implementations. If I may say so myself, I like the considerations on types within section 8 of RegTAP.
  • Out But Not Down

    A business phone with many custom buttons on a moderately cluttered desk

    Well, at least Uni Heidelberg still lets in calls to the phone on my desk. For connections to our data centre's servers, even after five days: no signal.

    Yesterday morning my phone rang. It was a call from Italy, and it was a complaint that my registry service was terribly loaded and didn't respond in time. That struck me as fairly odd, because I had just used it a few minutes before and it felt particularly snappy.

    A few keystrokes showed that was because it was entirely unloaded. A few more keystrokes showed that was because the University lets all incoming connections starve. They did that for all hosts within the networks of the University of Heidelberg, in particular also for their own web server. No advance warning, nothing. I still have no explanation, only rumours that they may have lost their entire Kerberos^WActive Directory. Even if that were true, I can't really see why they would kill all data services in their network: that's hashed passwords in there, no?

    So, while we're up, to the rest of the world it seems we're terribly down. This is also the longest downtime we've ever had, longer even than during the diskocalypse of 2017.

    I also have no indication when they plan to restore network connectivity. Apologies, and also apologies that they don't even send an honest connection refused and hence your clients are going to hang until there is a timeout.

    Meanwhile, our registry service at reg.g-vo.org keeps working; this is a good opportunity to thank my colleagues in Paris and Potsdam for running backup services for that critical piece of infrastructure.

    Followup (2025-11-21)

    Going into the weekend, there is still no communication from the computation centre on a timeframe to get us back online. At least they sent around a mail to all employees urging them to change their passwords; I am thus inclined to believe that they lost the content of their user database, and given they use these passwords in all kinds of contexts, I could well imagine they were stored using what's called “Reversible Encryption” in Windowsese. If that's true, they are hosed, but that is no excuse for killing my services.

    Followup (2025-11-24)

    Still no news from the University and its “CISO” on when we might get back connectivity. I consider this beyond embarrassing and have thus taken matters into my own hands. While the minor services (telco.g-vo.org, www.g-vo.org, docs.g-vo.org and so on) are still unreachable and will still hang until a timeout (what an unnecessary additional annoyance!), dc.g-vo.org should be back, at least to some extent.

    To pull this off, I went to Hetzner and clicked myself a minimal machine (funnily enough, it's physically located in Helsinki). I then configured the sidedoor Debian package to let it connect as root on that new server (this is a bit tricky; you have to manage the files in /etc/sidedoor manually, including key generation; I ended up pulling the known_hosts entry out of my own ~/.ssh/known_hosts).
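    In case it helps others in the same bind, the manual setup might look roughly like this. The file locations are the ones I believe the sidedoor Debian package uses; the relay's name, uhd-kruecke, is just my alias for the Hetzner box, so adapt everything to your situation:

    ```shell
    # Generate a dedicated, passphrase-less key for the unattended tunnel;
    # sidedoor looks for its key below /etc/sidedoor:
    ssh-keygen -t ed25519 -N "" -f /etc/sidedoor/id_rsa

    # Pull the relay's host key from a known-good ~/.ssh/known_hosts to
    # avoid a leap-of-faith first connection (this only works if your
    # known_hosts is not hashed, i.e., HashKnownHosts is off):
    grep uhd-kruecke ~/.ssh/known_hosts >> /etc/sidedoor/known_hosts

    # Finally, authorise the new public key for root on the relay
    # (root because privileged ports will be forwarded later):
    ssh-copy-id -i /etc/sidedoor/id_rsa.pub root@uhd-kruecke
    ```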

    And then you just run your equivalent of:

    sidedoor -R "*:80:dc.zah.uni-heidelberg.de:80" -R "*:443:dc.zah.uni-heidelberg.de:443" root@uhd-kruecke
    

    Regrettably, it needs to be root because of the privileged ports involved.

    So, we should be back in the VO. Please let me know if you disagree.

    Followup (2025-11-24)

    Uh, it seems I was not quite clear in the last update. The main message simply is: You should see dc.g-vo.org and its services normally now.

    All the talk about sidedoor and ssh tunnels was just an illustration of how I fixed the network outage. I was so specific partly to help others in the same situation, partly so the computation centre can't say they didn't know what I was up to.

    Followup (2025-11-28)

    If you speak German, there is a fan page for this entire disaster on the aptly-named page urz.wtf.

    Followup (2025-12-03)

    Two weeks into the disaster, there is the first official communication from the responsible persons to the service providers they cut off. In their denial of large-scale breakage and hermetic murmur about secrecy, the feeble words frankly remind me of Brezhnev-era bulletins, except back then they did not use stock illustrations supposed to illustrate… confusion?

    A question and exclamation mark each in a blue circle, centered between German text.

    I have to say that I am fairly angry with a statement like:

    These ongoing measures [taking everyone offline] proved to be proportionate and effective. [Diese Schritte, deren Umsetzung noch andauert, haben sich als angemessen und effektiv erwiesen.]

    Proportionate!? Shutting off services that have absolutely nothing to do with whatever was compromised for two weeks?

    There is the apt German phrase „Arroganz der Macht“ (“conceit of the powerful”). Seeing that the URZ has not only not deigned to react to the distress signals that not only I have sent them over these past two weeks, but clearly ignores them completely and utterly: I can't deny that this is infuriating.

    Good disaster management means being transparent and showing some humility, ideally apologising to those that had a hard time because of the accident you had (or, in this case more likely, caused). The URZ does the opposite, pointing in all other directions:

    The computation centre has established a task force and closely works with the responsible agencies [...police, domestic intelligence, “cyber security agency Baden-Württemberg”]. [Das Universitätsrechenzentrum hat einen Krisenstab eingerichtet und arbeitet derzeit sehr eng mit den zuständigen Landesbehörden, insbesondere mit dem Landeskriminalamt Baden-Württemberg unter der Sachleitung der Generalstaatsanwaltschaft Karlsruhe, dem Landesamt für Verfassungsschutz, der Cybersicherheitsagentur Baden-Württemberg sowie dem Landesdatenschutzbeauftragten und der Hochschulföderation bwInfoSec, zusammen.]

    Dear URZ: If you are running Active Directory with “reversible encryption“ (and no, I don't know whether that's what they did[1], but it certainly seems like it), you're juggling with chainsaws, and nobody can help you, least of all the domestic intelligence service.

    At least we are given some perspective:

    The services will now, after a diligent examination and after establishing additional protective measures, step by step, expected by the middle of the coming week, i.e., Wednesday, 10 December 2025, again be available on the internet without VPN. This only applies to services complying with the necessary security standards. [Die Dienste werden jetzt, nach sorgfältiger Prüfung und nach der Etablierung von zusätzlichen Schutzmaßnahmen, Schritt für Schritt voraussichtlich bis Mitte der kommenden Woche, d.h. Mittwoch, den 10. Dezember 2025, wieder über Internet ohne VPN verfügbar sein. Dies gilt nur für Dienste, die die nötigen Sicherheitsstandards erfüllen.]

    That's a downtime of three weeks (well, would be if I hadn't established workarounds for the most important services), a large multiple of the combined downtimes I had due to all the mishaps in 15 years of running a data centre on a shoestring budget. It is hard to imagine an attack that causes worse damage.

    And I shudder to imagine what “necessary security standards” might be unleashed on us.

    Sorry for venting. But it's really not nice to be on the receiving end of an entirely botched crisis reaction.

    Followup (2025-12-19)

    Over the last few weeks, I have brought back all of GAVO's services, including the version control system with our sample RDs, the web page with the various educational materials, and our Debian repository, through our reverse ssh tunnels. It's a fragile mess, but I think by and large I have successfully mitigated the attack on our services by… who knows; there's still no reliable statement who has decided to take everyone offline.

    Of course, nothing happened about the block against our services on December 10th (see the university's announcement above). Even nine days later, I have still not been officially informed whom I'd have to petition and beg to get my job done without a metric ton of kludges.

    I was almost content with the fragile mess when today I received the official press thingy of the University, Unispiegel Digital, more precisely the December issue. And that came with another statement that is so outlandishly bizarre that I just can't keep quiet. They say:

    Auf die Universität Heidelberg ist ein großangelegter Cyberangriff mit dem Ziel vorbereitet worden, die gesamten universitären IT-Dienste zum Erliegen zu bringen. Dieser Angriff konnte identifiziert und rechtzeitig abgewehrt werden.

    It seems there is no official translation, so let me paraphrase: “A large-scale cyber attack on Heidelberg University has been prepared with the goal of shutting down all the university's IT services. This attack could be identified and thwarted in time.”

    Call me negative and disloyal, but… what can I say? You see, all of GAVO's IT services would have been shut down for four weeks by now without my emergency hacks. How this could possibly pass as a successful defence against an attack with the goal of shutting down these very services is completely and utterly beyond me.

    Dear university officials: Since I have been unable to otherwise contact the responsible persons, let me state again here that I understand that shit happens and you can lose all your employees' credentials. But then own up to it, don't clam up, and work with the victims of your failure to get operations back up. Completely counterfactual pie-in-the-sky declarations, on the other hand, will certainly not help to calm the waters.

    Followup (2026-02-05)

    Believe it or not, we'd still be offline without our mitigations. If you can read German, you may enjoy the main communication I've received about the whole disaster so far, an interview in the Unispiegel with the University's “CIO“ Heuveline that repeats the story about an “attack that is larger and more dangerous than everything detected so far“ and then goes on to speculate that this was ransomware (“such attacks mostly aim to encrypt data in order to paralyse and blackmail the institution”). Larger and more dangerous my ass. Ransomware!

    Also Heuveline says that “we still have to close 30 to 40 cases”, where a case is a service they've blown. Ha! That would make me and my various services about 10% of their remaining workload, which sounds extremely unlikely to me.

    Anyway, at least Heuveline concedes that it would be a mistake to “turn the University into a fortress”. Good to hear that after more than two months of having to hobble on with ssh tunnels.

    Which brings me to my main topic: For some reason I have been unable to fathom so far, the sidedoor tunnels through which I keep my infrastructure open sometimes get “clogged”: They look open, but no data moves.

    My icinga duly notices that, but until I read the mail, there is significant downtime. Hence, I have written a quick hack that tells the sidedoors to rebuild the tunnels immediately (by sending a USR1 signal) when the box seems unreachable. It looks like this, and I'll run it in the foreground of a screen session, hoping I will get some better insight into what might cause the clogging:

    #!/bin/sh
    # Watchdog: when a probe through the tunnel times out, send USR1 to
    # all sidedoor instances so they tear down and rebuild their tunnels.
    WATCH_URL=https://dc.g-vo.org/.well-known/security.txt
    while true; do
            # --max-time also catches "clogged" tunnels that still accept
            # the TCP connection but then pass no data; curl exits with
            # code 28 on either sort of timeout.
            curl -s -S --connect-timeout 10 --max-time 30 "$WATCH_URL" > /dev/null
            if [ $? -eq 28 ]; then
                    logger "Found dc.g-vo.org hanging, USR1-ing all sidedoors"
                    killall -USR1 sidedoor
                    date
            fi
            sleep 120
    done
    

    Yes, I am piling hack upon hack. But there is a perspective of one day moving my VO services out of Fortress Heidelberg and hence getting rid of all the ugly hackery again. It's just a crying shame that this seems to be necessary.
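    Incidentally, a usual suspect for tunnels that silently die is some middlebox dropping idle connection state, and ssh's built-in keepalives are the standard countermeasure for that. The following is only a sketch, under the assumption that sidedoor passes -o options through to the underlying ssh (check its manpage before relying on this):

    ```shell
    # With keepalives, ssh itself notices a dead peer after about
    # ServerAliveInterval * ServerAliveCountMax seconds and exits,
    # upon which sidedoor re-establishes the tunnel on its own:
    sidedoor \
            -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
            -R "*:80:dc.zah.uni-heidelberg.de:80" \
            -R "*:443:dc.zah.uni-heidelberg.de:443" \
            root@uhd-kruecke
    ```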

    [1]I don't know that because the URZ, against all sane policies, still doesn't own up and instead murmurs “further information cannot be transmitted while investigations are going on [Weitere Informationen können während der laufenden Ermittlungen derzeit nicht übermittelt werden].” I'm sorry, but if I had to write a book on what not to do when you've been compromised, I'd include exactly that sentence, including the awkward „übermittelt“.
  • ADASS and Interop in Görlitz

    The end of a poster wall; there is a simple poster with large text: „Enthusiastic about the VO?  Interested?“ and a lot of small print.

    This is what DZA kindly turned my little A3-format job ad into. They even let me display it next to the serious science posters of ADASS. Well: we will be hiring soon.

    It's time for the Southern Spring Interop (coverage of previous Interops) again, which traditionally happens back-to-back with ADASS. And since ADASS XXXV (yes, it's 35 years now since the first ADASS, a timespan that Christophe Arviset illustrated rather impressively in a conference talk) takes place in Görlitz, Germany, at least the legwork on the ground for the Interop fell into the lap of the German VO organisation, i.e., GAVO. Oh my: I'm LOC chair!

    Right now, ADASS is still going on, and thus I am just another blissful conference participant at this point. Well, except that we will be hiring soon, and the ADASS organisers were kind enough to print and let me display something like an oversized and somewhat vague vacancy notice. I had thought of something in A3. See the opening photo for how it worked out: Thanks!

    Let me repeat its contents for my gentle readers: If you are enthusiastic about the VO and would like to contribute to it, do contact me (or perhaps first have a look at the PDF detailing what you could be doing).

    Given my extra duties as part of the LOC, I do not think I will do my traditional live coverage of the Interop (which starts on Thursday). But still: Watch this space for updates.

    Update: Soapbox (2025-11-12)

    We have heard a lot of talks again advertising one “science platform“ or other here at ADASS. I fairly invariably cringe when watching them, because to me these platforms are (usually) a return to the old „data silos“ (where someone sat on a bunch of tapes or, later, disks and handed out data on request if you asked politely and had some way to divine it was there), except that now people not only control the metadata and data but also who can perform which sort of computation until when.

    Even worse: Something you developed on one such platform will almost never work on the next platform; it will also break at the platform operators' discretion, and even the data you worked with will be gone at the whim of the platform operators or, more frequently, their funders.

    Against that, I'm a strong believer in Mike Masnick's 2019 credo Protocols, Not Platforms – which of course is also underlying the much older IVOA; back in 2000, it would have been “protocols, not FTP Servers“, and a little later “protocols, not data silos“.

    Let's try really hard to keep the user in control of their data and execution environments.

    „But, but“, I hear you pant, „nobody can download our petabytes of data“.

    Sure. Nor should they. You can do exciting things with the dozens-of-terabytes (soon to be roughly-a-petabyte) Gaia data from a tiny little device, because you can select and aggregate using standard protocols (“learn once, use anywhere“) on the server side – and then only transfer and store locally not much more than ten times the data you will eventually use in your research. That is thanks to TAP and ADQL.
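    To make that concrete: such a server-side aggregation needs nothing but an HTTP client. Here is a sketch against GAVO's TAP endpoint; the table gaia.dr3lite and its column phot_g_mean_mag are what I remember being there, so verify the names before relying on this:

    ```shell
    # Histogram of Gaia G magnitudes, computed on the server; only a
    # few dozen result rows ever cross the network.
    curl -s https://dc.g-vo.org/tap/sync \
            --data-urlencode "REQUEST=doQuery" \
            --data-urlencode "LANG=ADQL" \
            --data-urlencode "FORMAT=csv" \
            --data-urlencode "QUERY=SELECT ROUND(phot_g_mean_mag) AS mag_bin,
                    COUNT(*) AS n
                    FROM gaia.dr3lite
                    GROUP BY mag_bin ORDER BY mag_bin"
    ```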

    For array-like data (images, cubes, and the like) we don't have anything standardised that would be nearly as powerful as TAP and ADQL (well: there is ArraySQL, as advertised by me in 2017), which is part of why so many people feel compelled to take refuge in platforms. That is a pity, because all the work sunk into these endeavours would be much better spent on developing standards that let people work with remote arrays through standard protocols.

    An example for such standards was just presented here at ADASS: Pierre Fernique talked about “Big data exploration: a hierarchical visualisation solution for cubic surveys“. Check out his talk materials on the talk's ADASS page – in particular before you embark on building yet another platform.

    Update: Looking Back at the Interop (2025-11-16)

    The 2025 Southern Spring IVOA Interop is now over, and I will freely admit that I took a deep breath when everyone was out of Görlitz' Wichernhaus, where we have discussed the Virtual Observatory's past, present, and future since Friday.

    As I had expected, I had too much else to worry about to think about live reporting; and by my standards, I was fairly modest in having talks, too. I was only talking about evolving TAPRegExt (that is rather technical, and the main user-visible change would be that clients like TOPCAT would report more accurate limits as you switch between sync and async modes) and about Plans for Cone Search 2.

    This last thing was an outcome of the session on major version transitions at the College Park Interop last June (my coverage of that; and thoughts leading up to it). As promised back then, I have recently sketched what I think it will take to replace one major version of a protocol with another in a draft for SCS2.

    I do not think the plan for the standard itself is terribly interesting or creative, but people have asked why the migration timeframe lasts until 2031 when Google and their ilk shove changes down their users' throats within half a year if they (the users) are lucky. Have a look at Appendix B to get an idea of what it ideally takes if you don't have Google's lock-in and commercial power and hence no means of shoving anything down anyone's throat – not the server-side adopters' and much less the service users'.

    In the talk, I did not discuss the plan in all its gory details but only showed the timeline from the document:

    A coloured timeline starting with a WD review in 2026 and long bars for transition teams trying to manage takeup.

    Mind you: I consider it likely that all of this takes a whole lot longer, in particular because this is only a side project of mine.

    And now I will sink back into my train seat and take a long break. The seven days of straight conference action are bad enough for normal ADASS+Interop combos. When you are LOC[1] for the Interop, it's quite a bit worse. Heartfelt thanks to my LOC colleagues Daniela, Kai, and Sebastian, without whose help everything would have been a lot messier; running a hybrid conference without the resources of an established university is, let me share that experience with you, nothing for people with my sort of nerves.

    [1]That's Local Organising Committee if you're not into science argot: The people who make sure there's chairs, network, coffee, and everything else you need for a successful meeting these days.
