Posts with the Tag Disaster:

  • Out But Not Down

    A business phone with many custom buttons on a moderately cluttered desk

    Well, at least Uni Heidelberg still lets in calls to the phone on my desk. For connections to our data centre's servers, even after five days: no signal.

    Yesterday morning my phone rang. It was a call from Italy, and it was a complaint that my registry service was terribly loaded and didn't respond in time. That struck me as fairly odd, because I had just used it a few minutes before and it felt particularly snappy.

    A few keystrokes showed that was because it was entirely unloaded. A few more keystrokes showed that was because the University lets all incoming connections starve. They did that for all hosts within the networks of the University of Heidelberg, in particular also for their own web server. No advance warning, nothing. I still have no explanation, only rumours that they may have lost their entire Kerberos^WActive Directory. Even if that were true, I can't really see why they would kill all data services in their network: that's hashed passwords in there, no?

    So, while we're up, to the rest of the world it seems we're terribly down. This is also the longest downtime we've ever had, longer even than during the diskocalypse of 2017.

    I also have no indication when they plan to restore network connectivity. Apologies, and apologies too that they do not even send an honest connection refused, which means your clients will hang until they time out.
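    If your own scripts talk to the affected hosts, a client-side timeout at least makes the failure quick instead of letting things hang; with curl, for instance, something like this (the five seconds are of course just a suggestion):

    curl --connect-timeout 5 -sI https://dc.g-vo.org/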

    Meanwhile, our registry service at reg.g-vo.org keeps working; this is a good opportunity to thank my colleagues in Paris and Potsdam for running backup services for that critical piece of infrastructure.

    Followup (2025-11-21)

    Going into the weekend, there is still no communication from the computation centre on a timeframe to get us back online. At least they sent around a mail to all employees urging them to change their passwords; I am thus inclined to believe that they lost the content of their user database, and given they use these passwords in all kinds of contexts, I could well imagine they were stored using what's called “Reversible Encryption” in Windowsese. If that's true, they are hosed, but that is no excuse for killing my services.

    Followup (2025-11-24)

    Still no news from the University and its “CISO” on when we might get our connectivity back. I consider this beyond embarrassing and hence took matters into my own hands. While the minor services (telco.g-vo.org, www.g-vo.org, docs.g-vo.org and so on) are still unreachable and will still hang until a timeout (what an unnecessary additional annoyance!), dc.g-vo.org should be back, at least to some extent.

    To pull this off, I went to Hetzner and clicked myself a minimal machine (funnily enough, it's phyiscally located in Helsinki). I then configured the sidedoor Debian package to enable connect to root on that new server (this is a bit tricky; you have to manage the files in /etc/sidedoor manually, including key generation; I ended up pulling the known_hosts entry out of my own ~/.ssh/known_hosts).
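    For what it's worth, my manual setup boiled down to roughly the following sketch; I am quoting from memory, and the exact file names sidedoor expects are best checked against the package's documentation:

    # generate a key for the tunnel; the exact path is an assumption, see above
    ssh-keygen -t ed25519 -N "" -f /etc/sidedoor/id_ed25519
    # copy the remote host's entry out of my own known_hosts
    grep uhd-kruecke ~/.ssh/known_hosts >> /etc/sidedoor/known_hosts
    # ...and then append /etc/sidedoor/id_ed25519.pub to root's
    # authorized_keys on the Hetzner box.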

    And then you just run your equivalent of:

    sidedoor -R "*:80:dc.zah.uni-heidelberg.de:80" -R "*:443:dc.zah.uni-heidelberg.de:443" root@uhd-kruecke
    

    Regrettably, it needs to be root because of the privileged ports involved.
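    If you do not want to bother with sidedoor, plain ssh does the same thing, just without the automatic restarts; note that for the "*:" binds to work, the server's sshd_config needs GatewayPorts set to yes or clientspecified:

    ssh -N -R "*:80:dc.zah.uni-heidelberg.de:80" \
      -R "*:443:dc.zah.uni-heidelberg.de:443" root@uhd-kruecke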

    So, we should be back in the VO. Please let me know if you disagree.

    Followup (2025-11-24)

    Uh, it seems I was not quite clear in the last update. The main message simply is: You should see dc.g-vo.org and its services normally now.

    All the talk about sidedoor and ssh tunnels was just an illustration of how I fixed the network outage. I was so specific partly to help others in the same situation, partly so the computation centre can't say they didn't know what I was up to.

    Followup (2025-11-28)

    If you speak German, there is a fan page for this entire disaster on the aptly-named page urz.wtf.

    Followup (2025-12-03)

    Two weeks into the disaster, there is the first official communication from the responsible persons to the service providers they cut off. With their denial of large-scale breakage and their hermetic murmuring about secrecy, the feeble words frankly remind me of Brezhnev-era bulletins, except that back then they did not use stock illustrations supposed to illustrate… confusion?

    A question and exclamation mark each in a blue circle, centered between German text.

    I have to say that I am fairly angry with a statement like:

    These ongoing measures [taking everyone offline] proved to be proportionate and effective. [Diese Schritte, deren Umsetzung noch andauert, haben sich als angemessen und effektiv erwiesen.]

    Proportionate!? Shutting off services that have absolutely nothing to do with whatever was compromised for two weeks?

    There is an apt German phrase for this: „Arroganz der Macht“ (“the arrogance of power”). Seeing that the URZ has not deigned to react to the distress signals that not only I have sent it over these past two weeks, but has clearly ignored them completely and utterly: I can't deny that this is infuriating.

    Good disaster management means being transparent and showing some humility, ideally apologising to those that had a hard time because of the accident you had (or, in this case more likely, caused). The URZ does the opposite, pointing in all other directions:

    The computation centre has established a task force and works closely with the responsible agencies [...police, domestic intelligence, “cyber security agency Baden-Württemberg”]. [Das Universitätsrechenzentrum hat einen Krisenstab eingerichtet und arbeitet derzeit sehr eng mit den zuständigen Landesbehörden, insbesondere mit dem Landeskriminalamt Baden-Württemberg unter der Sachleitung der Generalstaatsanwaltschaft Karlsruhe, dem Landesamt für Verfassungsschutz, der Cybersicherheitsagentur Baden-Württemberg sowie dem Landesdatenschutzbeauftragten und der Hochschulföderation bwInfoSec, zusammen.]

    Dear URZ: If you are running Active Directory with “symmetric encryption” (and no, I don't know whether that's what they did[1], but it certainly seems like it), you're juggling with chainsaws, and nobody can help you, least of all the domestic intelligence service.

    At least we are given some perspective:

    The services will now, after a diligent examination and after establishing additional protective measures, step by step, presumably by the middle of the coming week, i.e., Wednesday Dec 10 2025, again be available on the internet without VPN. This only applies to services complying with the necessary security standards. [Die Dienste werden jetzt, nach sorgfältiger Prüfung und nach der Etablierung von zusätzlichen Schutzmaßnahmen, Schritt für Schritt voraussichtlich bis Mitte der kommenden Woche, d.h. Mittwoch, den 10. Dezember 2025, wieder über Internet ohne VPN verfügbar sein. Dies gilt nur für Dienste, die die nötigen Sicherheitsstandards erfüllen.]

    That's a downtime of three weeks (well, would be if I hadn't established workarounds for the most important services), a large multiple of the combined downtimes I had due to all the mishaps in 15 years of running a data centre on a shoestring budget. It is hard to imagine an attack that causes worse damage.

    And I shudder to imagine what “necessary security standards” might be unleashed on us.

    Sorry for venting. But it's really not nice to be on the receiving end of an entirely botched crisis reaction.

    [1]I don't know that because URZ, against all sane policies, still won't own up and instead murmurs “further information cannot be transmitted while investigations are going on [Weitere Informationen können während der laufenden Ermittlungen derzeit nicht übermittelt werden].” I'm sorry, but if I had to write a book on what not to do when you've been compromised, I'd include exactly that sentence, including the awkward „übermittelt“.
  • Heidelberg Data Center Down^WUp again

    Well, it has happened – perhaps it was the strain of restoring a couple of terabytes of data (as reported yesterday), perhaps it's uncorrelated, but our main database server's RAID threw errors and then disappeared from the SCSI bus today at about 15:03 UTC.

    This means that all services from http://dc.g-vo.org are broken for the moment. We're sorry, and we will try to at least limp on as fast as possible.

    Update (2017-11-13, 14:30 UTC): Well, it's official. What's broken is the lousy Adaptec controller – whatever configuration we tried, it can't talk to its backplane any more. Worse, we don't have a spare part for that piece here. We're trying to get one as quickly as possible, but even medium-sized shops don't have multi-channel SAS controllers in stock, so it'll have to be express mail.

    Of course, the results of the weekend's restore are lost; so, we'll need about 24 hours of restore again to get up to 90% of the services after the box is back up, with large tables being restored after that. Again, we're unhappy about the long downtime, but it could only have been averted by having a hot spare, which for this kind of infrastructure just wouldn't have been justifiable over the last ten years.

    Another lesson learned: Hardware RAID sucks. It was really hard to analyse the failure, and the messages of the controller BIOS were completely unhelpful. We, at least, will migrate to JBOD (one of the cool IT acronyms with a laid-back expansion: Just a Bunch Of Disks) and software RAID.
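    For the record, the software-RAID end of that plan does not amount to much more than the following sketch (device names and RAID level are made up for illustration; our actual layout may end up different):

    # assemble four JBOD disks into a software RAID 10 and put a file system on it
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]
    mkfs.ext4 /dev/md0
    # persist the array definition so it comes back after a reboot
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf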

    And you know what? At least the box had two power supplies. If these weren't redundant, you bet the power supply would have failed.

    To give you an idea how bad things are, here is the open server with the controller card that probably caused the mayhem (left), and 12 TB of fast disk, yearning for action (right).

    A database server in pieces

    Update (2017-11-14, 12:21 UTC): We're cursed. The UPS guys with the new controller were in the main institute building. They claimed they couldn't find anyone. Ok, our janitor is on sick leave, and it was lunch break, but still. It can't be that hard to walk up a single flight of stairs. Do we really have to wait another day?

    Update (2017-11-14, 14:19 UTC): Well, UPS must have read this – or the original delivery report was bogus. Anyway, not an hour after the last entry the delivery status changed to "delivered", and there the thing was in our mailbox.

    Except – it wasn't the controller in the first place. It turned out that, in fact, four disks had failed at the same time. It's hard to believe but that's what it is. Seems we'll have to step carefully until the disks are replaced. We'll run a thorough check tonight while we prepare the database tables.

    Unless more disaster strikes, we should be back by tomorrow morning CET – but without the big tables, and I'm not sure yet whether I dare to put them onto these flimsy, enterprise-class, 15k, SAS disks. Well, I'll grant that they've run for five years now.

    Update (2017-11-15, 14:37 UTC): After a bit more consideration, I figured I wouldn't trust the aging enterprise disks any more. Our admins then gave me a virtual machine on one of their boxes that should be powerful enough to keep the data center afloat for a while. So, the data center is back up at 90% (counting by the number of regression tests still failing) since an hour ago or so.

    Again, the big tables are missing (and a few obscure services the RDs of which showed bitrot and need polishing); they should come in over the next days, one by one; provided the VM isn't much slower than our DB server, you should see about two of them come in per day, with my planned sequence being hsoy, ppmxl, gps1, gaia, 2mass, sdssdr7, urat1, wise, ucac5, ucac4, rosat, ucac3, mwsc, mwsc-e14a, usnob, supercosmos.

    Feel free to vote a table up if you miss it badly.

    And all this assumes no further disaster strikes...

    Update (2017-11-16, 9:22 UTC): Well, it ain't pretty. The first large catalog, HSOY, is finally in, and the CLUSTER operation (which dominates restore time) took almost 12 hours; and HSOY, at 0.5 Gigarecord, isn't all that large. So, our replacement machine really is a good deal slower than our normal database server, which did that operation in less than three hours. I guess you'll want to do your large-table queries on a different service for the next couple of weeks. Use the Registry!

    Update (2017-11-20, 9:05 UTC): With a bit more RAM (DaCHS operators: version 1.1 will have a new configuration item for indexing work memory!), things have been going faster over the weekend. We're now down to 15 regression tests failing (of 330), with just 4 large catalogs missing still, and then a few nitty-gritty, almost invisible tables still needing some manual work.
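    If you would rather apply the same trick by hand than wait for that release: the postgres knob governing memory for index builds (and CLUSTER) is maintenance_work_mem, which you can raise globally along these lines (the 8GB is just an example; pick something that fits your RAM):

    sudo -u postgres psql -c "ALTER SYSTEM SET maintenance_work_mem = '8GB'"
    sudo -u postgres psql -c "SELECT pg_reload_conf()"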

    Update (2017-11-23, 14:51 UTC): Only 10 regression tests are still failing, but progress has become slow again – the machine has been clustering supercosmos.data for the last 36 hours now; it's not that huge a table, so it's a bit hard to understand why it is holding things up so much. On the plus side, new SSDs for our database server are being shipped, so we should see faster operation soon.

    Update (2017-12-01, 13:05 UTC): We've just switched the database back to our own server with its fresh SSDs. A few esoteric big tables are still missing, but we'd say the crisis is over. Hence, this is the last update. Thank you for your attention.

  • A Tale of CLUSTER and Failure

    Screenshot of a terminal with the command: aptitude purge '~c'

    This command nuked 5 TB of database tables (with a bit of folly before).

    Whenever you read “backup”, the phrase “lessons learned” is usually not far off. And so it is here, with a little story for DaCHS operators (food for thought, I'd say), astronomers (knowing what's going on behind the curtain sometimes helps write better queries), and everyone else (for amusement and a generous helping of schadenfreude).

    It all started yesterday when I upgraded the main database server of our data center (most anything in the VO with org.gavo.dc in the IVOID depends on it) to Debian stretch. When that was done, I decided that with about 1000 installed packages, too much cruft had accumulated, and started happily removing unused software. Until I accidentally removed the postgres package. In itself, that would not have been so disastrous – we're running Debian, which means packages usually keep the configuration and, in particular, the data around even if you remove them. The postgres packages, at the very least, do, and so does DaCHS.

    Unless, that is, you purge the postgres package before you notice you've removed it. I, for one, found it appropriate to purge all packages deleted but not purged right after my package deletion spree. Oh bother. Can you imagine my horror when the beastly machine said “dropping cluster main”? And ignored my panic-induced ^C (which, of course, was the right thing to do; the database was toast already anyway).

    There I had just flushed 5 Terabytes of highly structured data down the drain.
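    For the record, the non-destructive way to see what such a purge would catch is to first list the packages that are removed but still have their configuration (and, in postgres' case, the data) around:

    aptitude search '~c'                 # packages in "config-files" state
    dpkg -l | awk '/^rc/ {print $2}'     # the same with plain dpkg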

    Well, go restore from backup, you say? As usual with backups, it's not that simple™. You see, backing up databases is tricky. One can of course just back up the files as they are and then try to restore from them. However, while the database is running, it is continually modifying what's on the disk, so such a backup will be an inconsistent, unusable mess. Even if one had a file system that can do snapshots, a running server has in-memory state that is typically needed to make heads and tails of the disk image.

    So, to back up a database, there are essentially variations of two themes, roughly:

    • ask the database to dump itself. The result is a conventional file that essentially is a recipe for how to re-create a particular state of the database (see the sketch after this list).
    • have a “hot spare”. That's another machine with a database server running. In one way or another that other box snoops on what the main machine is doing and just replicates the actions it sees. The net effect is that you have an immediately usable copy of your database server.
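    For postgres, the first variant boils down to something like this (the database name gavo is just for illustration):

    # dump the whole database into one compressed, restorable file
    pg_dump --format=custom --file=gavo.dump gavo
    # ...and, after the disaster, feed it back into a freshly created database
    pg_restore --dbname=gavo gavo.dump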

    Anyway, after the opening of this article you'll not be surprised to learn that we did neither. The hot spare scenario needs a machine powerful enough to usefully serve as a stand-in and to not slow down the main machine when we feed data by the Gigarecords. Running such a machine just for backup would be a major waste of electricity – after all, this is the first time in about 10 years that it would really have been needed, and such a box slurps juice like it's... well, juice.

    As to maintaining a dump: Well, for the big catalogs, we use DaCHS' direct grammars [PSA: don't follow this link unless you're running DaCHS]. These are, except perhaps for a small factor, just as fast as a restore from a dump. And the indices (i.e., data structures that tell the computer where to look for objects with a certain position or magnitude rather than having to go through the whole table) need to be re-made when restoring from dumps, too, so we'd be pushing around files of several terabyte for almost no benefit.

    Except. Except I could have known better, because during catalog ingestions the most time-consuming task usually is the CLUSTER operation. That's when the machine re-organises the data on disk so it matches expected access patterns – for astronomical data, that's usually by spatial location. Having a large table clustered makes an astonishing difference, in particular when you're still using spinning disks (as we are). So, there's really no way around it.
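    In postgres, by the way, that step is a single statement; schema, table, and index names below are made up for illustration (DaCHS issues the real thing during ingestion):

    psql gavo -c 'CLUSTER myschema.main USING main_q3c_idx'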

    But it takes time. And more time. And that time is saved when restoring from a dump, because the dump (hopefully) largely preserves the on-disk organisation, and so the CLUSTER is almost a no-op.

    Well, the bottom line is: on our Heidelberg data center, the big tables are only coming back slowly; as I write this, from the gigarecord league, PPMXL and GPS1 are back, with SDSS DR7 and HSOY expected later today. But it'll probably take until late next week until all the big tables are back in and properly indexed and clustered.

    Apologies for any inconvenience. On the other hand, as measured by our regression tests (DaCHS operators: required reading!) 90% of our stuff is fine again, so we could fare worse given we just had a database disaster of magnitude 5 on the Terabyte scale.

    Which begs the question: Was it better this way? At least many important services are safely back up, and that might very well not be the case were we running the restore from an actual dump. Hm.
