Heidelberg Data Center Down^WUp again

Well, it has happened – perhaps it was the strain of restoring a couple of terabyte of data (as reported yesterday), perhaps it’s uncorrelated, but our main database server’s RAID threw errors and then disappeared from the SCSI bus today at about 15:03 UTC.

This means that all services from http://dc.g-vo.org are broken for the moment. We’re sorry, and we will try to at least limp on as fast as possible.

Update (2017-11-13, 14:30 UTC): Well, it’s official. What’s broken is the lousy Adaptec controller – whatever configuration we tried, it can’t talk to its backplane any more. Worse, we don’t have a spare part for that piece here. We’re trying to get one as quickly as possible, but even medium-sized shops don’t have multi-channel SAS controllers in stock, so it’ll have to be express mail.

Of course, the results of the weekend’s restore are lost; so, we’ll need about 24 hours of restore again to get up to 90% of the services after the box is back up, with large tables being restored after that. Again, we’re unhappy about the long downtime, but it could only have been averted by having a hot spare, which for this kind of infrastructure just wouldn’t have been justifiable over the last ten years.

Another lesson learned: Hardware RAID sucks. It was really hard to analyse the failure, and the messages of the controller BIOS were completely unhelpful. We, at least, will migrate to JBOD (one of the cool IT acronyms with a laid-back expansion: Just a Bunch Of Disks) and software RAID.

And you know what? At least the box had two power supplies. If these weren’t redundant, you bet the power supply would have failed.

To give you an idea how bad things are, here is the open server with the controller card that probably caused the mayhem (left), and 12 TB of fast disk, yearning for action (right).

[A database server in pieces]

Update (2017-11-14, 12:21 UTC): We’re cursed. The UPS guys with the new controller were in the main institute building. They claimed they couldn’t find anyone. Ok, our janitor is on sick leave, and it was lunch break, but still. It can’t be that hard to see walk up a single flight of steps. Do we really have to wait another day?

Update (2017-11-14, 14:19 UTC): Well, UPS must have read this – or the original delivery report was bogus. Anyway, not an hour after the last entry the delivery status changed to “delivered”, and there the thing was in our mailbox.

Except – it wasn’t the controller in the first place. It turned out that, in fact, four disks had failed at the same time. It’s hard to believe but that’s what it is. Seems we’ll have to step carefully until the disks are replaced. We’ll run a thorough check tonight while we prepare the database tables.

Unless more disaster strikes, we should be back by tomorrow morning CET – but without the big tables, and I’m not sure yet whether I dare putting them in on these flimsy, enterprise-class, 15k, SAS disks. Well, I give you they’ve run for five years now.

Update (2017-11-15, 14:37 UTC): After a bit more consideration, I figured I wouldn’t trust the aging enterprise disks any more. Our admins then gave me a virtual machine on one of their boxes that should be powerful enough to keep the data center afloat for a while. So, the data center is back up at 90% (counting by the number of regression tests still failing) since an hour ago or so.

Again, the big tables are missing (and a few obscure services the RDs of which showed bitrot and need polishing); they should come in over the next days, one by one; provided the VM isn’t much slower than our DB server, you should see about two of them come in per day, with my planned sequence being hsoy, ppmxl, gps1, gaia, 2mass, sdssdr7, urat1, wise, ucac5, ucac4, rosat, ucac3, mwsc, mwsc-e14a, usnob, supercosmos.

Feel free to vote tables up if you severely miss a table.

And all this assumes no further disaster strikes…

Update (2017-11-16, 9:22 UTC): Well, it ain’t pretty. The first large catalog, HSOY, is finally in, and the CLUSTER operation ((which dominates restore time) took almost 12 hours; and HSOY, at 0.5 Gigarecord, isn’t all that large. So, our replacement machine really is a good deal slower than our normal database server that did that operation in less than three hours. I guess you’ll want to do your large-table queries on a different service for the next couple of weeks. Use the Registry!

Update (2017-11-20, 9:05 UTC): With a bit more RAM (DaCHS operators: version 1.1 will have a new configuration item for indexing work memory!), things have been going faster over the weekend. We’re now down to 15 regression tests failing (of 330), with just 4 large catalogs missing still, and then a few nitty-gritty, almost invisible tables still needing some manual work.

Update (2017-11-23, 14:51 UTC): Only 10 regression tests are still failing, but progress has become slow again – the machine has been clustering supercosmos.data for the last 36 hours now; it’s not that huge a table, so it’s a bit hard to understand why this table is holding up things so much. On the plus side, new SSDs for our database server are being shipped, so we should see faster operation soon.

Update (2017-12-01, 13:05 UTC): We’ve just switched back the database server back to our own server with its fresh SSDs. A few esoteric big tables are yet missing, but we’d say the crisis is over. Hence, that’s the last update. Thank you for your attention.

A Tale of CLUSTER and Failure

[Screenshot: aptitude purge '~c']
This command nuked 5 TB of database tables (with a bit of folly before).

Whenever you read “backup”, the phrase “lessons learned” is usually not far off. And so it is here, with a little story for DaCHS operators (food for thought, I’d say), astronomers (knowing what’s going on behind the curtain sometimes helps write better queries), and everyone else (for amusement and a generous helping of schadenfreude).

It all started yesterday when I upgraded the main database server of our data center (most anything in the VO with a org.gavo.dc in the IVOID depends on it) to Debian stretch. When that was done, I decided that with about 1000 installed packages, too much cruft had accumulated and started happily removing unused software. Until I accidentally removed the postgres package. In itself, that would not have been so disastrous – we’re running Debian, which means packages usually keep the configuration and, in particular, the data around even if you remove them. The postgres packages, at the very least, do, and so does DaCHS.

Unless, that is, you purge the postgres package before you notice you’ve
removed it. I, for one, found it appropriate to purge all packages deleted but not purged right after my package deletion spree. Oh bother. Can you imagine my horror when the beastly machine said “dropping cluster main”? And ignored my panic-induced ^C (which, of course, was the right thing to do; the database was toast already anyway).

There I had just flushed 5 Terabytes of highly structured data down the drain.

Well, go restore from backup, you say? As usual with backups, it’s not that simple™. You see, backing up databases is tricky. One can of course just back up the files as they are and then try to restore from them. However, while the database is running, it is continually modifying what’s on the disk, so such a backup will be an inconsistent, unusable mess. Even if one had a file system that can do snapshots, a running server has in-memory state that is typically needed to make heads and tails of the disk image.

So, to back up a database, there are essentially variations of two themes, roughly:

  • ask the database to dump itself. The result is a conventional file that essentially is a recipe for how to re-create a particular state of the database.
  • have a “hot spare”. That’s another machine with a database server running. In one way or another that other box snoops on what the main machine is doing and just replicates the actions it sees. The net effect is that you have an immediately usable copy of your database server.

Anyway, after the opening of this article you’ll not be surprised to learn that we did neither. The hot spare scenario needs a machine powerful enough to usefully serve as a stand-in and to not slow down the main machine when we feed data by the Gigarecords. Running such a machine just for backup would be a major waste of electricity – after all, this is the first time in about 10 years that it would really have been needed, and such a box slurps juice like it’s… well, juice.

As to maintaining a dump: Well, for the big catalogs, we use DaCHS’ direct grammars [PSA: don’t follow this link unless you’re running DaCHS]. These are, except perhaps for a small factor, just as fast as a restore from a dump. And the indices (i.e., data structures that tell the computer where to look for objects with a certain position or magnitude rather than having to go through the whole table) need to be re-made when restoring from dumps, too, so we’d be pushing around files of several terabyte for almost no benefit.

Except. Except I could have known better, because during catalog ingestions the most time-consuming task usually is the CLUSTER operation. That’s when the machine re-organises the data on disk so it matches expected access patterns – for astronomical data, that’s usually by spatial location. Having a large table clustered makes an astonishing difference, in particular when you’re still using spinning disks (as we are). So, there’s really no way around it.

But it takes time. And more time. And that time is saved when restoring from a dump, because the dump (hopefully) largely preserves the on-disk organisation, and so the CLUSTER is almost a no-op.

Well, the bottom line is: on our Heidelberg data center, the big tables are only coming back slowly; as I write this, from the gigarecord league PPMXL and GPS1 are back, with SDSS DR7 and HSOY expected later today. But it’ll probably take until late next week until all the big tables are back in and properly indexed and clustered.

Apologies for any inconvenience. On the other hand, as measured by our regression tests (DaCHS operators: required reading!) 90% of our stuff is fine again, so we could fare worse given we just had a database disaster of magnitude 5 on the Terabyte scale.

Which begs the question: Was it better this way? At least many important services are safely back up, and that might very well not be the case were we running the restore from an actual dump. Hm.

Register your stuff with purx!

TOPCAT screenshot
If you open the TAP dialog of TOPCAT, what you see is Registry content.

The VO Registry lets people find astronomical resources (which is jargon for “dataset, service, or stuff“). Currently, most of its users don’t even notice they’re using the Registry, as when TOPCAT just magically lists what TAP services are available (image above) – but there are also interfaces that let you directly interact with the registry, for instance GAVO’s WIRR service or ESAVO’s Registry Search.

Arguably, the usefulness of the Registry scales with its completeness. With sufficient completeness, the domain-specific, structured metadata will also make it interesting for generic discovery of astronomical data; in a quip, looking for UCDs in google will never work quite well – and without that, it’s hard to find things with queries like „radio fluxes of early-type stars”.

Either way: If you have a data set or a service dealing with astronomy, it’d be great if you could register it. To do this, so far you either had to set up a publishing registry, which is nontrivial even if you have a software that natively speaks a protocol called OAI-PMH (DaCHS does, but most other publishing suites don’t) or you could use one of two web interfaces to define your resource (notes for a talk on this I gave in 2016).

Neither of these options is really attractive if you publish only a few resources (so the overhead of running a publishing registry looks excessive) that change now and then (so using a web browser to update the resource records again and again is tedious). Therefore, GAVO has developed purx, the publishing registry proxy. We’ve officially announced it during the recent Southern Spring Interop in Santiago de Chile (Program), and the lecture notes for that talk are probably a good introduction to what this is about.

If you’re running VO services and have not registered them so far, you probably want to read both these notes and the service documentation. If, on the other hand, you just have a web-published directory of files or a browser-based service, you probably can skip even that. Just grab a sample record (use the one for a simple browser service in both cases) and adapt it to what’s fitting for your website. Then put the resulting file online somewhere and paste the URL of that location on purx’ enrollment service. In case you’re uncertain about some of the terms in the record, perhaps our crib sheet for metadata we ask our data providers for will be helpful.

There’s really no excuse any more for not being in the Registry!