Apologies that this root log took so long in coming. I'm going to try
to document here mostly the things that happened during July
concerning the server downtime and the recovery up until the point
when I'm writing this. Although coding slowed down for June, stuff
still happened, and I've asked OldMiner to document that, which he
should post soon hereafter.
Downtime and recovery
The exact chronology of events is fuzzy in my head. I know I have the
order in which things happened correct, but exactly when, I don't know.
Summary
The actual complete downtime was three or four days around July 7,
2009. I know I managed to bring the site back to a flaky state on
Friday the 10th. I remember it being a Friday quite well, because I
stayed very late at the office until I knew the site could at
least get back on its feet, however shakily. It took maybe ten more
days to get the site in a usable state, and I've been tweaking it
almost daily since then until this noding.
The dirty details
The beginning
Out of necessity this may get a bit technical, but despite the
jargon, you should still get the general idea of what happened.
The site went down due to overheating. The AC in the university server
room where e2 was hosted malfunctioned. Now, e2 was hosted on
basically three machines:
- web2: load balancer, web frontend, runs cronjobs (maintenance tasks
  like updating New Writeups and cleaning out node row)
- web5: main webhead, runs all the Perl, serves requests for web2
- db2: database server, hosts the actual data, doesn't do much else
In addition we have
- web4: what we use for development, completely sandboxed from
  everything else, running its own database and cronjobs
- web6: a webhead supposedly identical in function to web5, which
  wasn't doing much of anything other than contributing to global
  warming, not to mention server room warming
All of them were running various Ubuntu versions in various states
of update. The proposed scheduled downtime, which probably jinxed us
into unscheduled downtime, was meant to bring all of the Ubuntu
versions up to the latest server version, Hardy Heron.
So clampe received notice that not all was right in the server
room. He went in. Blown fuse or something. He powered down all the
servers. And then...
web5 and db2 didn't come back.
So we were out of a database server and the main webhead we had.
nate remotely rerouted power to web6, but without a database, all it
could serve was Nate's word galaxy. web4 was also powered down, and at
the time it was unclear whether it had suffered damage too, meaning we
were running on pretty much nothing.
Emeritus commander-in-chief nate decided that we needed
on-site backup or else we'd never recover from this. Taking nothing
but some rations, a jackknife, a standard-issue potato gun, and bare
equipment donated by Kurt, nate infiltrated the
tremendous heat of the server room through the ventilation ducts,
suspended horizontally from a safety harness so as to not touch the
Hades the place had become. alex and I offered remote backup.
nate soon confirmed that indeed our main webhead and database
server were out of commission, their motherboards fried. Performing
careful surgery, he removed the RAIDed hard drives from the database
server and backed up our database, reportedly to DVD as well. Much to
my relief, he brought web4 back online; it wasn't fried. I quickly
proceeded to back up our development work to our spare web6, in case
anything else might happen. nate worked most of that day getting the
basic layout ready for e2 recovery. db2 was replaced with kurt1, our
new database server, and web5 got replaced with tcwest, our new
webhead, both running Fedora and both part of the replacement
equipment that Kurt had temporarily donated.
In the meantime, we had already planned to redirect people with the
Word Galaxy to an IRC room in case they wanted to know more about the
server outage or just plain see some familiar faces for comfort. While
nate was working, alex and I did what we could with
the Word Galaxy and checking things over in IRC. Once she seemed
networthy, the rest could be done remotely. nate retreated
from the site and left most of the remaining work to me and alex.
Except for the RAID [1] that we lost in db2, our replacement hardware
was superior to what got burned. More RAM, more disk space, faster
processors.
First attempts at recovery
The new servers had to be readied for serving e2. I worked on
installing ecore on tcwest, our
replacement webhead (which all the other servers still called "web5").
This involved fetching all of the necessary Perl modules from CPAN
and, for both alex and me, getting acquainted with package
management in Fedora. Ecore's configuration had to be adapted to the
new server and the Perl configuration also took a little while.
alex handled the routing on web2 so that it would serve requests
from both tcwest and web6. In addition, alex handled
most of the Apache webserver configuration in web5, with just a little
help from me for pointing out the e2-specific configurations and our
URL-rewriting rules. web6 only needed a small ecore update since she
was already more or less configured to handle serving e2.
The MySQL database configuration in kurt1 was much more difficult.
All three of us tried a couple of stock configurations for MySQL, but
when we tried to bring up the site with them, she keeled over in
minutes and we had tons of locked queries. We were also running MySQL
5.1 instead of 5.0. Although 5.1 had been declared stable, most server
distributions still shipped 5.0 as stable and 5.1 as experimental. This
is a consequence of Fedora being primarily a desktop distribution, not
a server one.
At any rate, we had no better idea of how to restore site operation,
so we decided to see if switching database engines would help. This was
part of the scheduled upgrades anyway. After a few failed tries due to
various bugs, I stayed up late one night in the office at work,
because I had miscalculated how long it would take to switch engines
and was unable to leave while the operation was in progress. After
several hours, we had switched away from the MyISAM engine to the
InnoDB engine, which is supposed to handle locks much better but makes
different assumptions about how it is used. InnoDB locks individual
rows rather than whole tables, so it handles concurrent queries
better, but it has other issues we would soon discover.
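For the curious, the engine switch comes down to one ALTER TABLE
statement per table. Here is a minimal sketch in Python that only
builds and prints those statements. The table names are hypothetical,
not necessarily e2's real schema, and nothing here talks to a
database. Each such statement rebuilds the table it touches, which is
part of why the operation can run for hours on a large database.

```python
def engine_switch_statements(tables, engine="InnoDB"):
    """Build one ALTER TABLE statement per table name.

    Names are quoted with backticks; a backtick inside a name is doubled.
    """
    statements = []
    for name in tables:
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append(f"ALTER TABLE {quoted} ENGINE={engine};")
    return statements


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for stmt in engine_switch_statements(["node", "writeup", "vote"]):
        print(stmt)
```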
This was about three or four days of downtime. In the meantime, the
IRC channel became a refugee camp. Around the third or fourth night,
after the move to a new database engine was complete, I hit the gas,
started the cronjobs, redirected web2 to serve ecore pages, and with
much difficulty, we got the website to 88 miles per hour and our first
few pageloads started trickling in.
Improving pageloads, losing and recovering one engine
Once the website was up and at least staying up, it looked like most
of our work was done, and I suggested that the IRC refugee camp could
be liberated. However, pageloads were unbearable for about ten days or
so. We tried everything we could think of, and I spent a lot of time
more or less randomly tinkering with our MySQL configuration.
Somewhere along the line, I also tried to synchronise the ecores and
homenode images between tcwest and web6 so that it would be easier to
keep ecore updated with our work and you wouldn't get a semirandom
homenode image depending on which of the two webheads served your
request. Unfortunately, core temperature in web6 was probably too high
and she shut down, most likely due to overheating. She was the one
hosting the single ecore and single set of homenode images, so that
incurred a couple of hours of downtime. alex rerouted all power to
tcwest
and questioned the wisdom of serving ecore and homenode images from a
single server. nate rebooted web6 remotely, observed she was
still webworthy, and I begrudgingly kept the ecore separate, but
stubbornly insisted on mounting homenode images from one place only,
still web6.
Back to MySQL and InnoDB configuration, I tried increasing memory
usage of InnoDB to allow it to load as much of the database as
possible into RAM to alleviate the biggest bottleneck in almost
anything: hard disk input/output. This brought a marginal improvement
in pageloads, and 503s became rarer, but load times were still
unacceptable and occasionally web2 would give up waiting for tcwest or
web6 to respond, hence the 503s.
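The knob in question is MySQL's innodb_buffer_pool_size setting. As a
rough sketch of the idea, the Python below turns an amount of RAM into
a my.cnf fragment. The 8192 MB figure and the 70% fraction are
assumptions for illustration only; this log does not record how much
RAM kurt1 has or what value we ended up with.

```python
def buffer_pool_fragment(total_ram_mb, fraction=0.7):
    """Return a my.cnf fragment sizing the InnoDB buffer pool.

    fraction is the share of RAM handed to InnoDB; the rest is left for
    the operating system and MySQL's other buffers.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    pool_mb = int(total_ram_mb * fraction)
    return "[mysqld]\ninnodb_buffer_pool_size = {}M\n".format(pool_mb)


if __name__ == "__main__":
    # Assumed machine size, not kurt1's actual RAM.
    print(buffer_pool_fragment(8192))
```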
I spent a lot of time reading about InnoDB and MySQL optimisation,
trying various things, but I was out of my element. We all were. I
asked around for help. Since our hardware was in fact better than
what we burned, except for the lost RAID, I reasoned that I couldn't
give up on finding the magic combination that could bring pageloads
back to what they were before the server crash. Finally, after asking
wherever I could, I found an IRC chap by the name of Raymond DeRoo who
was kind enough to teach me a few basic optimisation tricks. Turns out
that kurt1 had tons of sleeping processes hogging MySQL connections,
so the first thing Raymond suggested was to reduce the timeout on
sleeping connections from 8 hours to 3 seconds, and this had a very
noticeable effect of bringing down pageload averages from a few
minutes or longer to just a few seconds. He offered some advice on how
to set up our my.cnf for improved performance and taught me how
to track down and monitor bad queries. Some queries that worked well
with MyISAM are terrible for InnoDB, so with the tricks learned, I
spent the next week tracking down and modifying or sometimes outright
killing bad queries. OldMiner proved to be an invaluable sounding
board and also squashed a few bad queries of his own. After a few
iterations of the same loop (site is slow, find bad query, kill bad
query), we arrived at where we are now. During this process, pageloads
were generally acceptable enough that edev could now offer some
much-needed help.
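The sleeping-connection timeout is MySQL's wait_timeout variable,
whose default of 28800 seconds is the 8 hours mentioned above; it can
be set in my.cnf or with SET GLOBAL wait_timeout = 3. The hunt for
sleeping connections and bad queries can be sketched roughly as
follows in Python, working on rows shaped like the output of SHOW
PROCESSLIST. The sample rows and the 5-second threshold are made up
for illustration, and this is not the exact procedure Raymond showed
me.

```python
def summarize_processlist(rows, slow_after=5):
    """Split process-list rows into sleeping connections and slow queries.

    Each row is a dict with the SHOW PROCESSLIST columns Command, Time
    (seconds in the current state) and Info (the statement, if any).
    """
    sleeping = [r for r in rows if r["Command"] == "Sleep"]
    slow = [
        r for r in rows
        if r["Command"] == "Query" and r["Time"] >= slow_after
    ]
    # Longest-running queries first: those are the ones worth inspecting.
    slow.sort(key=lambda r: r["Time"], reverse=True)
    return sleeping, slow


if __name__ == "__main__":
    # Made-up rows, not captured from kurt1.
    sample = [
        {"Command": "Sleep", "Time": 3600, "Info": None},
        {"Command": "Query", "Time": 42, "Info": "SELECT ... (hypothetical)"},
        {"Command": "Query", "Time": 0, "Info": "SHOW PROCESSLIST"},
    ]
    sleeping, slow = summarize_processlist(sample)
    print(len(sleeping), "sleeping connection(s)")
    for row in slow:
        print(row["Time"], "s:", row["Info"])
```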
So it looks like the Sourceforge netops' wisdom, as retold by
nate, worked: we fixed our fucking code, and pageloads seem to
be good.
From here on
For the past few days, it seems to me like every pageload is very
good, and subjectively, they look even better than
what they were before the crash. I'm hoping that we killed all the
major sources of slowdown, but it's almost certain that more still
exist. Work in this direction should still be relevant. Also, long
pageloads can result in database burps, writeup reputations not being
accrued properly, votes or cools not registering, and that sort of
thing. We should work to reduce the possibility of this happening.
I've been keeping tabs on core temperatures on all of our servers as
well as load. They're mostly doing ok, but web6's core is constantly at
an alarming 70 C, so it's no surprise she went offline. Thankfully,
she's not mission-critical at this point, so in case we lose
her to heat as well, the site shouldn't suffer considerably.
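Keeping tabs on temperatures amounts to comparing each reading against
a limit. A tiny Python sketch of that check follows. Only web6's 70 C
comes from this log; the other readings and the 65 C limit are
assumptions for illustration, and how the readings are collected from
each machine is left out.

```python
def overheating(temps_c, limit_c=65.0):
    """Return (host, temperature) pairs at or above the limit, hottest first."""
    hot = [(host, temp) for host, temp in temps_c.items() if temp >= limit_c]
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # web6's reading is from the log; the others are invented.
    readings = {"web2": 48.0, "tcwest": 52.0, "kurt1": 55.0, "web6": 70.0}
    for host, temp in overheating(readings):
        print(host, "is running hot at", temp, "C")
```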
Additionally, nate has hinted that we will probably get
replacement hardware again near the beginning of August. He describes
kurt1 and tcwest as "loaners", so they're not intended to be permanent
replacements. I am thus loosely documenting the work necessary to
bring e2 up to speed here in hopes that changing hardware again can go
a bit smoother next time.
Other odds and ends
Amidst bringing the site back up to speed, a few bugs arose due to
various changes. For one, we had to use the development ecore that we
had to patch as quickly as possible. I think most of this instability
is now behind us. Part of this work had the side effect of making our
URLs a bit more readable, as they should be in almost all cases, but
also had the side effect of temporarily disabling things like
favouriting noders or bookmarking nodes.
Additionally, since site stability seemed dubious, under
GrouchyOldMan's insistence I coded up node backup. No more offsite
unmaintained clients for backing up nodes! It seems to work, but it's
possible it's still a little broken.
About bugs, let me remind you that all software
has bugs. Bugs are a fact of life. We can't get rid of all of them,
but we can squash the biggest ones. Please help us do so by reporting
bugs to e2 bugs whenever you encounter them. Don't tolerate them, and
don't hesitate to report them because you think they only affect you:
they shouldn't be tolerated, and they probably affect other
people besides you. We can't guarantee that we'll be able to squash
all the bugs, but we should always try.
A personal note
I am going to be taking an extended vacation from e2, starting this
Saturday, my birthday, one day after sysadmin day. I am aiming for
four or five months, however long I can manage or need. The grand goal
is to not come back to e2 until 2010, but I can't guarantee I'll be
able to stay away from e2 for this long. I'm already suffering
premature withdrawal symptoms. ;-)
Problem is that I've been feeling way too responsible for the site,
and I need a little time to cool off from e2 and also dedicate more
time to number theory and similar endeavours. It's not my job to fix
e2, just something I do in my spare time, so it shouldn't be something
that makes me feel ultimately responsible for site operations. As soon
as I get permission from alex and/or nate, I'd like to
give fellow splat OldMiner a stroll through the backend
servers so that we can still have someone reliable fixing backend
stuff and familiar with server layout. OldMiner has privately
agreed with me to pick up the pace from where I leave off, so this
gives me peace of mind.
Of course, this shouldn't mean that you should now direct all coding
issues to OldMiner. The primary venue of communication with coders is
still e2 bugs and suggestions for e2 at least until we get the
ticketing system ready. Even if one person happens to hog the code
like I did since February, we still have a team of people available,
and we should exploit everyone's strengths if we are to bring e2 out
of late 20th century web development practices.
For the rest of this week, I want to go on a bug-squashing spree and
clean out as much of e2 bugs as I can. On the Sabbath, I rest. I
will come back to keep working on e2. I can't stay away from this site
for too long, but I need a vacation.
[1] RAID: Several hard drives, in this case two,
working in tandem as if they were one, for speed and backup purposes.