Servers Asplode!

During November we have had some severe site responsiveness issues which were capped off, most recently, by the loss of one of the three main servers that power E2.  Although most of the hardware from that server has been recovered, its motherboard is fried, and so we are temporarily going to have to do without it.

I appreciate that having the site slow or outright unavailable can be very frustrating. I empathize. I know many of you folks feel frustrated having a site you care about unavailable and being powerless to fix it. The words of support I've gotten from folks thanking me for trying to get the site back on its feet have been quite heartening, and I thank you all for sending them. I'm sorry I've not been able to get the site functioning better sooner.

This situation isn't the best for anybody involved. Although I am the most active person with regards to working with the server configuration, I would rather be spending more time improving site features and giving the place a more modern feel. Hopefully, I will be able to put more time into that area soon.

Here's a timeline of recent events (times in U.S. Central, since that's our servers' location):

Tuesday Nov. 22, initial downtime

Around 10:15am, somebody powered down and unplugged two of our servers. Sadly, the room in which our servers are currently stored, an electronics work room, has been used as a classroom this semester, something that was not expected when our servers were placed there. This has resulted in multiple bouts of downtime when folks unaffiliated with the site, and unaware of the consequences of what they were doing, powered down our servers.

When I saw the servers down, I contacted the folks on site by email to inform them of the situation. They were able to get to the server room a bit after noon; a presentation was going on in it at the time, so the room was dark. They brought the primary DB server online, but were unable to power up the other box. The site was very slow, but back up.

At 4:10pm, Joe was able to get in and turn on the other machine. He discovered it wasn't just powered down, but unplugged from the wall, which is why its power button did not work earlier. This got all of our machines back online.

Tuesday Nov. 22, further downtime

Around 6:45pm, our primary database server went offline. Two folks were good enough to travel back to campus to check on it. Let's call them Diligence and Hiro. They arrived to find the server powered up, fan spinning, but without monitor output and unresponsive to input. It was the day before Thanksgiving Break started, our primary database server was down, and the site was offline.

Let me stress that at no time was the data backing the site in danger. The drives holding the data for E2 are still in working order, the replication database server was still perfectly functional, and we have recent offsite backups of all nodes. I hope you'll forgive me for being redundant, as I say this every time there are server issues, but we care about the data that makes up E2 and have taken steps to keep it safe.

Wednesday Nov. 23

Joe and Diligence spent most of the morning attempting to assemble a scavenged machine to serve as a replacement database server. Unable to get the scavenged bits to work together, they retasked our replication database server to fill in for the dead primary server. This brought the site back up, limping, after around a full day of downtime. This is the configuration we've been running on since then.

Unfortunately, our replication database server had also been our static file server and our cron job server, so we needed a new home for those duties. We moved the cron jobs to one of the overtaxed web virtual machines. For the static file server, alex was kind enough to let us use one of his off-site machines. Hammering out this configuration was part of why the site looked very weird for a while; stylesheets were not loading.

Saturday Nov. 26

After profiling the site, I made a small technical change to reduce load.
Sunday Nov. 27

With the site still overtaxed, I made a somewhat risky code change. CPU load dropped significantly. Things are still problematic, but a fair bit faster than they were before. However, since it was a change that affects most of the code on the site, bugs may remain undiscovered. As always, please report problems to E2 Bugs, even if you suspect it may be just you. I'm afraid I don't have the time to respond to all of the messages I receive, let alone monitor the chatterbox 24/7, so if you report problems in one of those venues instead, they are less likely to be attended to by one of the coders.

Going Forward

The site administration, including alex, clampe, and nate, is actively discussing the site's needs and the potential for relocation to other servers in the coming weeks to months. I'm advising as far as technical requirements go, and, should we relocate, I will do what I can to assist in the transition.

If we do move, relocation and downtime will be announced on the front page beforehand with a lead time of days if not weeks. Since we have the luxury of good support staff, relocation will almost certainly involve having the hardware and site running on the hypothetical new machines before we shut down the current servers.

Keep calm and don't panic. If you are concerned about the site, please contact the people who know what's going on. I am more than happy to answer questions regarding site functioning addressed privately to me via /msg or email. For anything I can't answer, I will refer you to somebody who can. Or, if you think others may share your concern and benefit from a public response, drop a note in a daylog or in Letters to the Editors. Any lack of transparency isn't due to shadowy plotting; that's not our bag. We just sometimes lack time or don't know folks are curious.

Slowness Prior to Downtime

Prior to the downtime in November, we had been seeing a fair number of error pages and slow pageloads. I could get fairly technical about this, but suffice it to say, E2 could really use a couple of extra servers right now. If you are seeing slow pageloads or frequent timeouts, for the time being you may try going to your nodelet settings and disabling some of your nodelets. For some of our more expensive users, an average pageload spends more than half its time generating nodelets.

Offers to Help

Several people have offered to send money or hardware to get the site functioning better. Although the offers are appreciated, we can't take money right now. For those unfamiliar with how E2 got into its current hosting relationship, read clampe's writeup on the state of the site back in 2007. In the past, hardware was a difficult proposition, as we couldn't guarantee the resources to plug the boxes in even if somebody was generous enough to ship us up-to-date machines. That is less of an issue now, but with relocation a possibility, it might be sketchy to plug a new machine in, especially if we pull up stakes a week later. Regardless, if you are interested in such a thing, get in contact with alex. I can't really say yes or no on these things, but I get these questions a lot, so I wanted to provide a (non-authoritative) answer.

There are plenty of opportunities to improve E2's code and reap performance gains with judicious patches. I know quite a few of the active users at E2, especially the more vocally irritated, have the technical background to help in this area. I've got a list as long as my arm of small projects that could make a difference but which I simply can't get to. If you're at all interested, please join edev and let us know. Prior knowledge of Perl is not required. We'd be happy to have you aboard.

Server Configuration

In my last root log, several months back, I covered our servers. They hadn't changed much since then, except that I moved the dev server off of dom00 and over to dom10, and expanded its RAM from 512MB to 1GB. The configuration at the start of the month looked like this:

dom00
4 core Intel Xeon X3360 (@2.83GHz), 8GB RAM, 147 GB SSD, RAID1 of unimpressive stats
Primary MySQL server
xen host to:
  • www1 - 2GB RAM, 2 core virtual web server
  • www2 - 2GB RAM, 2 core virtual web server
dom10
4 core Intel Xeon X3360 (@2.83GHz), 8GB RAM, RAID1 of unimpressive stats
xen host to:
  • www3 - 2GB RAM, 2 core virtual web server
  • www4 - 2GB RAM, 2 core virtual web server
  • dev3 - 1GB RAM, 1 core dev box
dom52
4 core Intel Xeon X3360 (@2.83GHz), 8GB RAM, 147 GB SSD, RAID1 of unimpressive stats
MySQL replication server
backup server
cron job server
nfs server
stats server
dom51
Intel P4 (@3GHz), 1GB RAM, a regular-old HD
haproxy server (gateway to the outside world)
Mercurial server
Itty bitty stats, big job

As you can see, dom00 had several pretty important jobs. Since the web servers and MySQL were on the same machine, they competed for CPU time. Further, any RAM assigned to one of the web servers was RAM that MySQL couldn't use. Ideally, the primary MySQL server would have its own machine, but we didn't have a spare machine for that.

To reduce load, I tasked dom52 with the job of being a static file server, serving images, stylesheets, static Javascript, and all of Guest User's Javascript. This was a natural move since dom52 was being underutilized, and it was already the cron job server, so the generated files were already being dropped there. It moved about a quarter of our HTTP requests off of our primary web servers, which helped.

Here's the configuration at present:

dom00 - the new name for the machine that *was* dom52
4 core Intel Xeon X3360 (@2.83GHz), 8GB RAM, 147 GB SSD, RAID1 of unimpressive stats
Primary MySQL server
xen host to:
  • www1 - 2GB RAM, 2 core virtual web server
  • www2 - 2GB RAM, 2 core virtual web server
dom10
4 core Intel Xeon X3360 (@2.83GHz), 8GB RAM, RAID1 of unimpressive stats
xen host to:
  • www3 - 3GB RAM, 2 core virtual web server, cron server
  • www4 - 3GB RAM, 2 core virtual web server
  • dev server shut down to reduce load
old dom00
4 core Intel Xeon X3360 (@2.83GHz), 8GB RAM, 147 GB SSD, RAID1 of unimpressive stats, disassembled
dom51
Intel P4 (@3GHz), 1GB RAM, a regular-old HD
haproxy server (gateway to the outside world)
Mercurial server
Itty bitty stats, big job
alex's external server - e2forum
Unknown stats
Static file server

Since dom10 had RAM that was going to waste, we expanded the memory for www3 and www4. Because we lost our cron server, we had to make one of the web machines dual-purpose or assign resources from somewhere else; for now, www3 carries that burden. www3 also serves as the backup static file server if the external static server goes down. The dev server was shut down to reduce load on that machine.

Something I didn't realize had been overlooked until I wrote this up is the NFS server. The job of the NFS server is to hold uploaded images and writeup backups so they are consistent across all servers. I'll have to fix that up. The current dom00 was the old NFS server, so I should be able to copy the files off of there. Most likely dom51 will become their new home. I will update this when it's been taken care of.

Enumerated Patches

isSpecialDate
- I made Halloween last a *little* bit longer by making special date use L.A. time, so the special Halloween doohickeys kicked on at midnight UTC, briefly turned off at the following midnight UTC, and finally went off for 2011 at midnight U.S. Pacific Time. For future holidays, we will likely have festivities run from midnight UTC until midnight Pacific from the start, but that code hasn't been put in place yet.
Your Nodeshells II
- A bug report came in (I forget from whom) that this no longer worked since we changed the owners of nodeshells. I changed it to use createdby_user instead of author_user, and it works once more.
document linkview page
- DonJaime pointed out that linkview wasn't making softlinks, so I applied the appropriate classes so that softlinking works as expected, and removed the no-longer-necessary angle-bracket direct-linking code.
showchatter
- Fixed a bug where /fireball would sometimes display weirdly, showing 'singe' as if it were a 'sing' command.
message
- Big code cleanup: regularize spaces/tabs; remove the dangerous /settings feature; remove dead code; clean up regexes; stop commands from being silently swallowed; make sendPrivateMessage pass arguments the new way.
static javascript
stylesheet serve page
linkStylesheet
- To speed up pageloads, I wrote a cron job to generate the stylesheets and Guest User's Javascript into static files that can be served from a server other than our main application web servers. DonJaime helped out quite a bit by pointing out items I had overlooked, like initially failing to generate softlink gradients and not handling the autofixing behavior that stylesheet serve page does. Further, DonJaime patched things up so that a stylesheet's author always gets the dynamically generated stylesheet, so they don't have to wait for the cron job to fire before seeing their stylesheet changes.
voteit
weblog
Random Nodes

After turning on htmlcode compile caching, some lexically scoped variables were retaining their values between calls. For example:

my $subRef = sub {
  my $foo;
  $foo .= "Foo!\n";
  return $foo;
};

On its second call, &$subRef(); would return "Foo!\nFoo!\n". So we had to patch some htmlcodes to explicitly initialize their variables in order to resolve the resulting bugs. I am honestly confused why this happens; normally a lexically-scoped variable simply expires when its scope ends, and a new one is created when the scope is entered again. I can only guess this is an artifact of running in a mod_perl environment. Worst-case scenario, patches like this will allow us to eventually run E2 with use warnings, which will make coding easier anyhow.
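
The fix is simply to give such variables an explicit starting value. A minimal sketch of the kind of patch applied (using the same toy example as above, not an actual htmlcode):

my $subRef = sub {
  my $foo = "";        # explicitly initialized, so a stale value can't carry over
  $foo .= "Foo!\n";
  return $foo;
};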

DonJaime submitted a bunch more patches to cover potential bugs like this, as well, but I'm sure he'll cover those in his log.

Mercurial-side Patches

Fixed wonky error output (coders only)
Because printf was being used to generate them, error messages that had percent signs in them would get garbled. Fixed.
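
For the curious, the failure mode looks roughly like this (a minimal illustration, not the actual logging code):

my $error = "UPDATE matched 100% of rows";
printf STDERR $error;   # the '%' is taken as a format directive, so the message comes out mangled
print  STDERR $error;   # passing the message as data rather than as a format avoids the problem
# or, equivalently: printf STDERR '%s', $error;
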
More logging on Apache crashes
When Apache crashed mid-pageload, it wasn't logging much data, and I had hoped the crashes were being caused by some particular type of query, so I made it log all query parameters on these occasions. Unfortunately, this didn't reveal much. It seems Apache is receiving an external signal that causes it to shut down all processes at once.
Fix tag balancing code
The tag balancing code had been broken for quite a while, and raincomplex brought it to my attention. I believe I resolved this, but didn't get to test it on the live server. If you're interested in testing, try making a draft with broken table tags and see whether they now get closed properly. Post a link in E2 Bugs if it appears to still be broken.
Calculate date without shelling out
The logging code called `date` to get the date, which showed up surprisingly high in the profile data, presumably because shelling out requires hitting the disk to load bash, check for .bash_profile and .bashrc, and then load /bin/date. I replaced it with an appropriate use of DateTime. I believe Tick Tock has this same issue and still needs patching, though it runs rarely and is cached.
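
The replacement amounts to something like this (a sketch; the actual timestamp format used by the logging code may differ):

use DateTime;

# Build the timestamp in-process instead of forking a shell and /bin/date.
# The servers run on U.S. Central time, per the timeline note above.
my $now   = DateTime->now( time_zone => 'America/Chicago' );
my $stamp = $now->strftime('%Y-%m-%d %H:%M:%S');
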
Statically generate stylesheets and Guest User Javascript
As described above, there are now cron jobs to generate static versions of these files in an attempt to reduce load on our primary web servers. Since there are no scripts yet to distribute these files to another server, the cron server would normally also have to be the static file server. Because we are currently using an external server for static files instead, updates to Javascript and stylesheets will not propagate without manually copying the files over. If we get a reliable cron server soon, this won't be an issue; if not, we'll have to find a way to distribute these files when they're updated.
Avoid uninitialized value warnings
I made some patches to ecore to get us slightly closer to being able to run with use warnings.
Set nodetypes to be static
Static nodetypes is an option built into ecore long ago, before I came on staff, intended to lower load at the expense of the site being less dynamic under code and design changes. Profiling showed that a significant portion of every pageload was spent doing dynamic lookups for nodetypes, so I set this option. Realistically, when a nodetype is updated, ecore should find all types that derive from that type, update them, and dump the type cache for each of those types; that would be the best of both worlds. For now, any change to nodetypes will require a restart of Apache.
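
For the curious, here is a self-contained sketch of that "best of both worlds" invalidation (the data structures are illustrative stand-ins, not ecore's real ones):

# Hypothetical example: when a nodetype changes, drop the cached definition
# for that type and for every type that derives from it.
my %parent_of  = ( document => 'node', writeup => 'document', draft => 'document' );
my %type_cache = ( node => {}, document => {}, writeup => {}, draft => {} );

sub derives_from {
    my ($type, $ancestor) = @_;
    while ( defined $type ) {
        return 1 if $type eq $ancestor;
        $type = $parent_of{$type};
    }
    return 0;
}

sub on_nodetype_update {
    my ($updated) = @_;
    delete $type_cache{$_} for grep { derives_from( $_, $updated ) } keys %type_cache;
}

on_nodetype_update('document');   # document, writeup, and draft are dumped; node is kept
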
Cache compiled htmlcodes and opcodes

E2 uses the function evalCode to dynamically evaluate Perl stored in the database. This feature, contentious though it is, is very neat from a development perspective, as it allows significant code changes without server access and without restarting the web server. It also comes with the downside that we were recompiling the code for regularly-called functions every time they were called.

Near the beginning of the year, I had attempted to cache the results of these compilations to reduce CPU load. When I did so, DonJaime pointed out that I had messed up: the value of $NODE was being frozen to whatever it was when the code first ran. I effectively disabled the code (while leaving most of its guts in place) and left the project by the wayside, worrying that another attempt might result in skulking bugs or require a lot of cleverness to ensure that references to package-level variables like $USER and $VARS didn't get frozen. But good DJ had previously suggested a course of action for trying again, and, seeing huge CPU spikes, I felt it would be foolish not to.

So now we cache the compiled form of htmlcodes and opcodes. This has greatly reduced CPU load and has triggered few bugs that we've observed. What's even nicer is that the cached code is tied to the node, so we get the benefits of the existing node caching: if we patch an htmlcode or opcode, the cached code is silently discarded.
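
The shape of the change is roughly this (a sketch under assumptions: evalCode and the node cache are real, but the cache field and the hash-based "node" below are illustrative, not ecore's actual structures):

# Illustrative only: compile the stored Perl once, stash the coderef on the
# cached node, and reuse it until the node itself is invalidated.
sub run_htmlcode {
    my ($NODE, @args) = @_;

    $NODE->{compiled} ||= eval "sub { $NODE->{code} }"
        or die "htmlcode compile failed for $NODE->{title}: $@";

    return $NODE->{compiled}->(@args);
}

my $htmlcode = { title => 'hello', code => 'return "Hello, " . shift;' };
print run_htmlcode($htmlcode, 'world'), "\n";   # compiled on the first call, cached coderef reused afterward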

superdocs, pages, htmlpages, and containers could all potentially benefit from the same treatment, but doing so is not as straightforward, so that will have to wait until further profiling shows it's worthwhile.

Did I Forget Anything?

Please toss me a message if you feel there is anything I should have covered here but didn't. Aside from bug reports, this is effectively the only communication I have with you guys where I'm acting as a staff member, so I try to be thorough.