Out But Not Down

A business phone with many custom buttons on a moderately cluttered desk

Well, at least Uni Heidelberg still lets in calls to the phone on my desk. For connections to our data centre's servers, even after five days: no signal.

Yesterday morning my phone rang. It was a call from Italy, and it was a complaint that my registry service was terribly loaded and didn't respond in time. That struck me as fairly odd, because I had just used it a few minutes before and it felt particularly snappy.

A few keystrokes showed that was because it was entirely unloaded. A few more keystrokes showed that was because the University lets all incoming connections starve. They did that for all hosts within the networks of the University of Heidelberg, in particular also for their own web server. No advance warning, nothing. I still have no explanation, only rumours that they may have lost their entire Kerberos^WActive Directory. Even if that were true, I can't really see why they would kill all data services in their network: that's hashed passwords in there, no?

So, while we're up, to the rest of the world it seems we're terribly down. This is also the longest downtime we've ever had, longer even than during the diskocalypse of 2017.

I also have no indication when they plan to restore network connectivity. Apologies for that, and apologies too that they don't even send an honest “connection refused”, so your clients are going to hang until they time out.
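For readers who want to tell a silently dropped connection from an honest refusal on their own end, here is a minimal Python sketch. The function name probe and the five-second default timeout are my own illustrative choices, not anything the services here prescribe.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open" if the handshake completes, "refused" if the peer
    answers with a reset, and "timeout" if nothing comes back at all
    (which is what silently dropped packets look like to a client).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks and the like end up here.
        return f"error: {exc}"


# Example call (needs network access, hence commented out):
# print(probe("www.g-vo.org", 443))
```

A result of "timeout" means the client waited the full interval before giving up, whereas "refused" returns almost immediately. That difference is why a refusal would have been the kinder failure mode.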

Meanwhile, our registry service at reg.g-vo.org keeps working; this is a good opportunity to thank my colleagues in Paris and Potsdam for running backup services for that critical piece of infrastructure.

Followup (2025-11-21)

Going into the weekend, there is still no communication from the computation centre on a timeframe to get us back online. At least they sent around a mail to all employees urging them to change their passwords; I am thus inclined to believe that they lost the content of their user database, and given they use these passwords in all kinds of contexts, I could well imagine they were stored using what's called “Reversible Encryption” in Windowsese. If that's true, they are hosed, but that is no excuse for killing my services.

Followup (2025-11-24)

Still no news from the University and its “CISO” on when we might get back connectivity. I consider this beyond embarrassing and have thus helped myself. While the minor services (telco.g-vo.org, www.g-vo.org, docs.g-vo.org and so on) are still unreachable and will still hang until a timeout (what an unnecessary additional annoyance!), dc.g-vo.org should be back, at least to some extent.

To pull this off, I went to Hetzner and clicked myself a minimal machine (funnily enough, it's physically located in Helsinki). I then configured the sidedoor Debian package to connect as root to that new server (this is a bit tricky; you have to manage the files in /etc/sidedoor manually, including key generation; I ended up pulling the known_hosts entry out of my own ~/.ssh/known_hosts).

And then you just run your equivalent of:

sidedoor -R "*:80:dc.zah.uni-heidelberg.de:80" -R "*:443:dc.zah.uni-heidelberg.de:443" root@uhd-kruecke

Regrettably, it needs to be root because of the privileged ports involved.
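If you have more than two ports to forward, the -R arguments get repetitive. The following Python sketch only assembles the argument list; it does not run anything. The helper names reverse_forward and tunnel_command are hypothetical, and the forward specification follows the usual ssh form bind_address:port:host:hostport.

```python
import shlex


def reverse_forward(port: int, target_host: str, bind_addr: str = "*") -> str:
    """Build one ssh-style remote forward spec, using the same port on both ends."""
    return f"{bind_addr}:{port}:{target_host}:{port}"


def tunnel_command(target_host, relay, ports=(80, 443), program="sidedoor"):
    """Return the argument vector for a reverse tunnel through `relay`."""
    argv = [program]
    for port in ports:
        argv += ["-R", reverse_forward(port, target_host)]
    argv.append(relay)
    return argv


argv = tunnel_command("dc.zah.uni-heidelberg.de", "root@uhd-kruecke")
# shlex.join quotes the arguments so the shell does not expand the "*".
print(shlex.join(argv))
```

One thing to watch on the relay side: sshd normally binds remote forwards to the loopback interface only, so a wildcard bind address like the one above presumably needs GatewayPorts set to yes or clientspecified in the relay's sshd_config. How the relay here was configured is not something this post spells out, so treat that as an assumption.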

So, we should be back in the VO. Please let me know if you disagree.

Followup (2025-11-24)

Uh, it seems I was not quite clear in the last update. The main message simply is: You should see dc.g-vo.org and its services normally now.

All the talk about sidedoor and ssh tunnels was just an illustration of how I fixed the network outage. I was so specific partly to help others in the same situation, partly so the computation centre can't say they didn't know what I was up to.

Followup (2025-11-28)

If you speak German, there is a fan page for this entire disaster at the aptly named urz.wtf.

Followup (2025-12-03)

Two weeks into the disaster, there is the first official communication from the responsible persons to the service providers they cut off. In their denial of large-scale breakage and hermetic murmur about secrecy, the feeble words frankly remind me of Brezhnev-era bulletins, except back then they did not use stock illustrations supposed to illustrate… confusion?

A question and exclamation mark each in a blue circle, centered between German text.

I have to say that I am fairly angry with a statement like:

These ongoing measures [taking everyone offline] proved to be proportionate and effective. [Diese Schritte, deren Umsetzung noch andauert, haben sich als angemessen und effektiv erwiesen.]

Proportionate!? Shutting off services that have absolutely nothing to do with whatever was compromised for two weeks?

There is the apt German phrase of „Arroganz der Macht“ (“conceit of the powerful”). Seeing that URZ has not only not deigned to react to the distress signals that I and others have sent them in these past two weeks, but clearly ignores them completely and utterly: I can't deny that that is infuriating.

Good disaster management means being transparent and showing some humility, ideally apologising to those that had a hard time because of the accident you had (or, in this case more likely, caused). The URZ does the opposite, pointing in all other directions:

The computation centre has established a task force and closely works with the responsible agencies [...police, domestic intelligence, “cyber security agency Baden-Württemberg”]. [Das Universitätsrechenzentrum hat einen Krisenstab eingerichtet und arbeitet derzeit sehr eng mit den zuständigen Landesbehörden, insbesondere mit dem Landeskriminalamt Baden-Württemberg unter der Sachleitung der Generalstaatsanwaltschaft Karlsruhe, dem Landesamt für Verfassungsschutz, der Cybersicherheitsagentur Baden-Württemberg sowie dem Landesdatenschutzbeauftragten und der Hochschulföderation bwInfoSec, zusammen.]

Dear URZ: If you are running Active Directory with “symmetric encryption” (and no, I don't know whether that's what they did[1], but it certainly seems like it), you're juggling with chainsaws, and nobody can help you, least of all the domestic intelligence service.

At least we are given some perspective:

The services will now, after a diligent examination and after establishing extra protective measures, step by step, presumably by the middle of the coming week, i.e., Wednesday Dec 10 2025, again be available on the internet without VPN. This only applies to services complying with the necessary security standards. [Die Dienste werden jetzt, nach sorgfältiger Prüfung und nach der Etablierung von zusätzlichen Schutzmaßnahmen, Schritt für Schritt voraussichtlich bis Mitte der kommenden Woche, d.h. Mittwoch, den 10. Dezember 2025, wieder über Internet ohne VPN verfügbar sein. Dies gilt nur für Dienste, die die nötigen Sicherheitsstandards erfüllen.]

That's a downtime of three weeks (well, would be if I hadn't established workarounds for the most important services), a large multiple of the combined downtimes I had due to all the mishaps in 15 years of running a data centre on a shoestring budget. It is hard to imagine an attack that causes worse damage.

And I shudder to imagine what “necessary security standards” might be unleashed on us.

Sorry for venting. But it's really not nice to be on the receiving end of an entirely botched crisis reaction.

[1] I don't know that because URZ, against all sane policies, still doesn't own up and instead murmurs “further information cannot be transmitted while investigations are going on [Weitere Informationen können während der laufenden Ermittlungen derzeit nicht übermittelt werden].” I'm sorry, but if I had to write a book on what not to do if you've been compromised, I'd include exactly that sentence, including the awkward „übermittelt“.