Scaling, in addition to the activity of cleaning a fish, describes the process of adding resources to a system in order to increase its capability. For more information on the properties of a system w/r/t this, see the scalability node.
In the wider world of networked systems (actually, in the narrower world that I'm going to discuss) scaling doesn't simply mean throwing resources at a problem, although that's part of it. The general steps I follow when scaling a system roughly correspond to the following:
- Survey
- Measure
- Evaluate
- Look Ahead
- Plan
- Implement
- TEST and MEASURE
(I know, I need to come up with a list with a better acronym…SMELPIT don't cut it…)
These aren't hard and fast rules; they're the general phases of the process that I go through. I find it tends to keep me on track to have this tacked up above my desk/cube/burrow/server/booth at McDonald's while working. Other folks will have other ideas. There is no doubt a 'standard' model that you can find in some books. Whatever works to keep you thorough and consistent.
Let's start at the top!
First, survey. What do you have now? That's the base question. Find out precisely what your present system looks like. Even if you were the one who put it together, go look at it carefully. Touch it. Inspect it. You never know if someone decided to stick another component in when you weren't looking, or if one of your RAID arrays has been limping along in degraded mode because your monitoring failed…you get the idea.
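If you want to script part of the survey, a minimal sketch using only the Python standard library might look like the following. The fields and their names are my own choices for illustration, not any kind of standard:

```python
import json
import os
import platform
import shutil
import socket

def survey_host(path="/"):
    """Collect a few basic facts about this host for a survey document."""
    total, used, free = shutil.disk_usage(path)
    return {
        "hostname": socket.gethostname(),
        "os": platform.platform(),
        "cpus": os.cpu_count(),
        "disk_total_gb": round(total / 1e9, 1),
        "disk_free_gb": round(free / 1e9, 1),
    }

if __name__ == "__main__":
    # One JSON blob per host; collect these into your survey document.
    print(json.dumps(survey_host(), indent=2))
```

Run it on every box and diff the output against what you *thought* you had; the surprises are the point of the exercise.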
A good survey will include more than a server list (although that's the core of it); it should explain the whole system, clearly, to any qualified reader who has never seen this particular system before.
I find it's much more helpful, when going over a complex system, to have a clear picture of what this thing or collection of things is supposed to be doing. Ideally, if you are working on a system that supports a product or software package or process, you will have easy access to a pre-existing diagram and/or flowchart and text that explain clearly what goes on in them there boxen. If not, well, you need one. Write it up now.
Next, measure. In this phase, you want to find out how well (or how poorly) your current system performs. In order to scale, you need a metric: a baseline, if you will. Your current system is your baseline.
You may be fortunate enough to have been given metrics by someone else. If not, you'll have to identify/invent them yourself. Try to avoid inventing them; that just introduces an enormous level of uncertainty into the process. Of course, if your system does something that no-one else's does, or does it in a way that no-one else's does, then you may have to use unique metrics. For example, if my system is designed to, say, frob llamas, then I may simply say "OK, this system can frob a certain number of llamas per minute consistently without failing more often than is acceptable (or has been historically observed)." Thus, my metric is llamas frobbed/minute.
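A llamas-frobbed-per-minute metric is easy enough to compute once you log a timestamp per frob. A toy sketch (the event timestamps are made up):

```python
from collections import Counter

def per_minute_rate(timestamps):
    """Bucket event timestamps (seconds) into minutes and return
    per-minute counts -- the 'llamas frobbed per minute' metric."""
    return dict(Counter(int(t // 60) for t in timestamps))

# Hypothetical frob log: one timestamp (in seconds) per frobbed llama.
events = [3, 10, 59, 61, 62, 125]
rates = per_minute_rate(events)   # {0: 3, 1: 2, 2: 1}
peak = max(rates.values())        # 3 llamas/minute, so far
```

The same bucketing works for any events-per-interval metric; only the log source changes.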
Your situation may be easier. There are, of course, standard metrics, and they will usually be given to you by those folks who foresee or demand an increased load on your system. For example, some metrics might include:
- Web 'hits' per second
- Unique users per hour
- (x)Bytes of data transferred per hour
- Number of database records updated per minute
- Number of tracked (but not simultaneous) users in the system
- Response time to a web request in milliseconds
- etc. etc.
So, your task is to determine the current performance of your system in your chosen metric(s). You should have a method of doing this in place; if not, you'll need to implement and test one. Don't skip this to meet a deadline; the answers you produce will be useless unless backed up by solid pre-modification measurement. If you are working (as I do) on large networked systems that run websites or databases, there are more than likely a plethora of ways to performance-test them, some freeware and some commercialware. Check freshmeat.net, or look around in your market space. Measure.

Rule: don't just measure the metric that you're trying to improve. Measure every variable you reasonably can that affects your system's QoS. It never fails that if you only measure the desired metric, the upgrade will deliver on that one…and kill another one that you hadn't thought to check.
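For the web-hits case, even something this simple will pull hits-per-second out of an access log. I'm assuming Apache-style bracketed timestamps here, and the sample lines are invented:

```python
import re
from collections import Counter

# Assumes Apache common/combined-log-style lines; the regex only
# pulls out the bracketed timestamp field.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\]')

def hits_per_second(lines):
    """Count requests per timestamp (to the second) from access-log lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[m.group("ts")] += 1
    return counts

log = [
    '1.2.3.4 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '1.2.3.5 - - [01/Jan/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 100',
    '1.2.3.6 - - [01/Jan/2024:10:00:01 +0000] "GET /b HTTP/1.1" 200 99',
]
counts = hits_per_second(log)
```

For real work you'd feed this a file object and parse the timestamp properly, but the shape of the job is the same.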
Ideally, you want to measure these variables across at least an entire 'cycle' of use of the system. If the system's use runs in repeating weekly cycles (e.g. 'heavy Monday, light Tues.-Fri., heavy Saturday, light Sunday, repeat) then get at least two to three weeks of data. Hopefully, you'll have access to this data from regular evaluations performed earlier. It's a good idea to note which things you end up having to track due to lack of existing information; when you're done with the upgrade, make a note to implement monitoring or logging of those variables.
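One way to make that weekly cycle visible is to average your measurements by weekday. A sketch, with made-up daily frob counts:

```python
from collections import defaultdict
from datetime import date

def weekday_profile(daily):
    """Average a {date: value} series by weekday (0=Monday)
    to expose a weekly usage cycle."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[day.weekday()].append(value)
    return {wd: sum(vs) / len(vs) for wd, vs in buckets.items()}

# Hypothetical daily frob totals over two weeks (dates are illustrative).
daily = {
    date(2024, 1, 1): 100,   # a Monday
    date(2024, 1, 8): 120,   # the next Monday
    date(2024, 1, 2): 50,    # a Tuesday
}
profile = weekday_profile(daily)   # {0: 110.0, 1: 50.0}
```

Two to three weeks of data gives each weekday bucket enough samples to spot the heavy days.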
Now, evaluate. This is the 'catch-all' phase; here's where you do the skullwork. Well, most of it. Have a nice hit of caffeine. Find an empty whiteboard (or, if your numbers are on paper, a nice empty conference table). Spread out your data. What you're looking for is the limit of your desired metric. For example, what's the most llamas your system appears to have frobbed in a minute over the course of the week? Did it ever hit that level again? Why not? Was demand not high enough? (Unlikely, or you wouldn't be scaling; but if you're pre-emptively scaling, it might not have been.)
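Finding that ceiling in the data can be as simple as this (the per-minute counts are invented):

```python
def ceiling(rates):
    """Given {minute: events}, return the peak rate and
    every minute at which it was hit."""
    peak = max(rates.values())
    return peak, sorted(m for m, c in rates.items() if c == peak)

# Hypothetical per-minute frob counts pulled from a week's logs:
rates = {0: 3, 1: 5, 2: 5, 3: 2}
peak, when = ceiling(rates)   # peak of 5, hit at minutes 1 and 2
```

Whether the peak recurs, and under what demand, is exactly the whiteboard question.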
What resources can you identify that limited the maximum performance? Check your logs and look for correlations. Some examples on servers: available memory, CPU cycles, disk space, disk system throughput, NIC throughput, network throughput, file-serving requests/sec, etc. Somewhere in here is the reason your system isn't fast enough; most likely it's a combination of two or more factors. You should be able to identify the constraint and be confident in that identification. If possible, test your hypothesis in a test-bed environment: lower the RAM and see if the ceiling drops; raise the RAM and see if it rises. Etc. NEVER do this in the production environment!
When you're pretty sure you know what factors are causing your 'ceiling,' move on.
Next, look ahead. Armed with this data (your performance ceiling and its likely causes), take some time to think about what your goals are. How many more users per minute are you going to get? How many more llamas will present themselves for frobbing per minute? And so on. You may have been given a 'target performance' level; great. If you have, DO THIS ANYWAY. Who knows if the marketing guys are right, anyhow?
Try to determine the degree of scaling you'll need to do. This means carefully considering the limitations on your scaling techniques. For example, if you have a CPU-cycle constraint, and your system is a maxed-out Sun UltraSPARC Enterprise 420, then you'll need either more boxen or bigger box(en) to make this work. This, of course, means you'll get more RAM, disk, etc. as well; but you'll also increase the network load, the power demand, and the complexity of distributing those resources. Note that you shouldn't yet be thinking about how precisely you're going to do this; you're thinking about what levels you want your upgrade to achieve. Just bear in mind that it's almost impossible to increase one variable (CPU, memory, etc.) without affecting others. In some cases it's viable (adding a CPU to a box which has an empty CPU slot), but even then, you'll now be hosting twice as many processes on the box; do you have enough RAM? …and so on.
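Back-of-the-envelope sizing can be scripted too. This sketch assumes you keep each box at or below some headroom fraction of its measured ceiling; the 70% default and all the numbers are mine, for illustration:

```python
import math

def boxes_needed(target_rate, per_box_rate, headroom=0.7):
    """Estimate how many machines cover a target load while keeping
    each box at or below `headroom` of its measured ceiling."""
    usable = per_box_rate * headroom
    return math.ceil(target_rate / usable)

# e.g. target 1000 frobs/min, one box tops out at 180, run boxes at 70%:
n = boxes_needed(1000, 180)   # 8 boxen
```

The headroom factor is where your judgment about failure tolerance and growth lives; don't let it default silently.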
Now, plan. In the last step, you figured out what you're trying to achieve; in this step, you figure out how you're going to do it. Here, you need to produce a list of changes to the system that will effect the performance increases required. So, say, if you know you need to double the number of concurrent web connections, and your constraint was CPU cycles, you will likely be adding CPUs, and/or adding machines, and/or adding RAM to handle the increased process load, at a minimum. If you're adding machines, you're adding network load, clustering complexity, potential failure points (you have backup units, right?) and more.
You should come out of this with a 'shopping list' for those who buy the gear (you, maybe) and a budget for the changes. Likely the budget will have been given to you. In either case, have a list and pricing. In addition to the list, have a clear picture of the kind of impact on service this upgrade will have. Will you be building the new system in parallel with the old and switching over, keeping the old one running just-in-case? Will you be adding new machines to a DNS round-robin? Etc. Your plan should have a clear and concise explanation of what all this activity will do to the system as the outside world sees it. Will you need new SSL certs? Will you need new IP addresses? Etc. Ideally, you will also have a schedule explaining when you'll be getting the gear and when you'll be performing the steps. This way, client services can be prepared if on that particular Monday the system isn't up by 6 a.m. because something broke (although you really want to avoid this, obviously).
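The shopping list is worth keeping in a machine-checkable form so the budget math is never hand-waved. A trivial sketch, with invented items and prices:

```python
# Hypothetical shopping list: (item, unit_price, quantity).
shopping = [
    ("1U server", 4200.00, 4),
    ("32GB RAM kit", 180.00, 8),
    ("Gig-E switch", 900.00, 1),
]

budget = 20000.00
total = sum(price * qty for _, price, qty in shopping)
over = total - budget   # positive means you're over budget
```

Keep this next to the plan document and re-run it every time an item or price changes.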
I know, this is vague. That's because everyone's planning process is different, as are their requirements. Do what works for you. If you don't know how to plan a project like this, you shouldn't be trying to do it; find someone who does and either have them do it while you watch or have them advise/assist you.
Implementation is just what it sounds like. Do the deed. Build the boxen. Pull the lever. Push the buttons. Frobben das blinkenleitz. Make the changes, make them according to plan, and (hopefully) make them perform as advertised. However (and this is the kicker), if possible, DON'T BRING THEM LIVE YET. If you have increased the size of a production system, keep traffic off the new resources until you've tested them. If this is a new system built in parallel, don't switch it online yet. The most important phase is next.
Finally, test. This is probably the most important part. Run the system under load (not production load, but a comparable load you've generated). Make SURE it works the way you think it does. Then run it under every test scenario you ever put the old one through, and any others you can think of; make sure you didn't cause a problem elsewhere while fixing one here. Run it for several cycles, in as many conditions as you can come up with. Do you have a QA or QE department? Good, go get them to do it.
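A load generator doesn't have to be fancy. This threaded sketch times concurrent calls to a stand-in request function; swap in a real request against your test rig (and never point it at production):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def hammer(func, requests=100, workers=10):
    """Drive `func` with concurrent calls and collect per-call latencies.
    Point this at a test system, never at production."""
    latencies = []

    def one():
        t0 = time.perf_counter()
        func()
        latencies.append(time.perf_counter() - t0)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(requests):
            pool.submit(one)
    # The context manager waits for all submitted calls to finish.
    return latencies

# Stand-in for a real request; replace the lambda with an HTTP call
# (or a llama frob) against your test environment.
lat = hammer(lambda: time.sleep(0.001), requests=50, workers=5)
```

Graph the latency distribution, not just the mean; the tail is usually where the new fragility hides.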
You're looking for performance levels (does it work as well as you planned?) as well as reliability (has anything become more fragile?) and compatibility (does the backup system still work? Are the tapes big enough for the new size? Can your internet feed handle the traffic?). I know you ideally would have thought of those during the planning phase; however, that's why you test. No-one ever thinks of everything in advance, and no admin/engineer worth his or her salt would ever think of releasing a product to the user (be it public or internal) with their name on it unless they'd thoroughly tested it first.
And always have a plan for what happens when it fails. Always. Because if you don't have a plan, it will fail. Murph guarantees it.