2017-11-11
Markus Demleitner
Well, it has happened – perhaps it was the strain of restoring a couple
of terabytes of data (as reported yesterday), perhaps it's
uncorrelated, but our main database server's RAID threw errors and then
disappeared from the SCSI bus today at about 15:03 UTC.
This means that all services from http://dc.g-vo.org are broken for the
moment. We're sorry, and we will try to get things at least limping
along again as fast as possible.
Update (2017-11-13, 14:30 UTC): Well, it's official. What's broken
is the lousy Adaptec controller – whatever configuration we tried, it
can't talk to its backplane any more. Worse, we don't have a spare part
for that piece here. We're trying to get one as quickly as possible, but
even medium-sized shops don't have multi-channel SAS controllers in
stock, so it'll have to be express mail.
Of course, the results of the weekend's restore are lost, so once the
box is back up we'll need about another 24 hours of restoring to get
back to 90% of the services, with the large tables following after
that. Again,
we're unhappy about the long downtime, but it could only have been
averted by having a hot spare, which for this kind of infrastructure
just wouldn't have been justifiable over the last ten years.
Another lesson learned: Hardware RAID sucks. It was really hard to
analyse the failure, and the messages of the controller BIOS were
completely unhelpful. We, at least, will migrate to JBOD (one of the
cool IT acronyms with a laid-back expansion: Just a Bunch Of Disks) and
software RAID.
And you know what? At least the box had two power supplies. If those
hadn't been redundant, you can bet it would have been a power supply
that failed.
To give you an idea how bad things are, here is the open server with the
controller card that probably caused the mayhem (left), and 12 TB of
fast disk, yearning for action (right).
Update (2017-11-14, 12:21 UTC): We're cursed. The UPS guys with the
new controller were in the main institute building. They claimed they
couldn't find anyone. OK, our janitor is on sick leave, and it was
lunch break, but still – it can't be that hard to walk up a single
flight of steps. Do we really have to wait another day?
Update (2017-11-14, 14:19 UTC): Well, UPS must have read this – or
the original delivery report was bogus. Anyway, not an hour after the
last entry the delivery status changed to "delivered", and there the
thing was in our mailbox.
Except – it wasn't the controller in the first place. It turned out
that, in fact, four disks had failed at the same time. It's hard to
believe, but that's how it is. It seems we'll have to step carefully
until
the disks are replaced. We'll run a thorough check tonight while we
prepare the database tables.
Unless more disaster strikes, we should be back by tomorrow morning CET
– but without the big tables, and I'm not sure yet whether I dare put
them in on these flimsy, enterprise-class, 15k SAS disks. Well, granted,
they've run for five years now.
Update (2017-11-15, 14:37 UTC): After a bit more consideration, I
figured I wouldn't trust the aging enterprise disks any more. Our admins
then gave me a virtual machine on one of their boxes that should be
powerful enough to keep the data center afloat for a while. So, as of
about an hour ago, the data center is back up at 90% (as measured by
the number of regression tests still failing).
Again, the big tables are missing (as are a few obscure services whose
RDs showed bitrot and need polishing); they should come in over the next
few days, one by one. Provided the VM isn't much slower than our DB
server, you should see about two of them arrive per day, in roughly this
planned sequence: hsoy, ppmxl, gps1, gaia, 2mass, sdssdr7, urat1, wise,
ucac5, ucac4, rosat, ucac3, mwsc, mwsc-e14a, usnob, supercosmos.
Feel free to vote a table up if you badly miss it.
And all this assumes no further disaster strikes...
Update (2017-11-16, 9:22 UTC): Well, it ain't pretty. The first
large catalog, HSOY, is finally in, and the CLUSTER operation (which
dominates restore time) took almost 12 hours; and HSOY, at 0.5
Gigarecord, isn't all that large. So, our replacement machine really is
a good deal slower than our normal database server, which did that
operation in less than three hours. I
guess you'll want to do your large-table queries on a different service
for the next couple of weeks. Use the Registry!
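In case you wonder what that CLUSTER step actually does: it rewrites the
whole table in the physical order of one of its indexes, so every single
record gets moved on disk – which is why it dominates the restore time
for a half-gigarecord catalog. Just to illustrate the idea, here is a
minimal sketch in Python with psycopg2; the connection parameters, table,
and index names are invented for this example and are not our actual
setup:

    import psycopg2

    conn = psycopg2.connect("dbname=gavo")   # hypothetical DSN
    conn.autocommit = True
    cur = conn.cursor()
    # Rewrite the (made-up) table hsoy.main in the order of its (made-up)
    # positional index; moving every row is what takes the hours.
    cur.execute("CLUSTER hsoy.main USING hsoy_main_q3c")
    # Refresh the planner statistics for the rewritten table.
    cur.execute("ANALYZE hsoy.main")
    cur.close()
    conn.close()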
Update (2017-11-20, 9:05 UTC): With a bit more RAM (DaCHS operators:
version 1.1 will have a new configuration item for indexing work
memory!), things have been going faster over the weekend. We're now down
to 15 regression tests failing (of 330), with just 4 large catalogs
still missing, plus a few nitty-gritty, almost invisible tables that
still need some manual work.
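Independent of DaCHS, the stock PostgreSQL knob behind this kind of
speedup is maintenance_work_mem, which caps the memory that index builds
(and, as far as I can tell, the sorting CLUSTER does) may use. Here is a
sketch – again Python with psycopg2, with a made-up memory budget and
made-up table and column names – of handing one session a generous
allowance before building a big index:

    import psycopg2

    conn = psycopg2.connect("dbname=gavo")   # hypothetical DSN
    conn.autocommit = True
    cur = conn.cursor()
    # Session-local setting; the server-wide default stays untouched.
    cur.execute("SET maintenance_work_mem = '2GB'")
    # Made-up index on a made-up column, just to show where the memory goes.
    cur.execute("CREATE INDEX ppmxl_jmag ON ppmxl.main (jmag)")
    cur.close()
    conn.close()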
Update (2017-11-23, 14:51 UTC): Only 10 regression tests are still
failing, but progress has become slow again – the machine has been
clustering supercosmos.data for the last 36 hours now; it's not that
huge a table, so it's a bit hard to understand why it is holding things
up so much. On the plus side, new SSDs for our database server are
being shipped, so we should see faster operation soon.
Update (2017-12-01, 13:05 UTC): We've just switched the database
back to our own server with its fresh SSDs. A few esoteric big tables
are still missing, but we'd say the crisis is over.
Hence, that's the last update. Thank you for your attention.