An operating system which implements paging is, under the right circumstances, susceptible to thrashing. In order for a computer system to be actually thrashing, it is not sufficient for it to be doing a lot of paging. A system which is doing a lot of paging might still be running quite well. In fact, it is possible to design a system (see mainframe) so that it can handle a pretty amazing paging load (I can recall using systems that were doing a couple of hundred page faults per second and which were still not thrashing).

Strictly speaking, a computer system is thrashing if the reason that it is doing a lot of paging is that it is doing a lot of paging. In other words, thrashing is the result of a positive feedback effect.

An example is probably in order:

Consider a system which is currently doing a modest amount of paging. Processes are being delayed from time to time because they suffer a page fault. These delays are probably unimportant to the processes themselves and are almost certainly irrelevant to the computer system as a whole, for the simple reason that when one process is delayed by a page fault, there's almost always another process which can take over the CPU (of course, the real reason why the current level of paging is irrelevant to the system as a whole is that the system administrator's phone and e-mail in-basket aren't being inundated by complaints about system performance but that, as they say, is "another story").

I'm sure that everyone will agree that this system is NOT thrashing.

Time goes by and the load on the system increases. The system is now doing quite a lot of paging and processes are much more likely to experience a page fault. Although the paging subsystem is busy, it is able to deal with the page faults in a timely fashion. The delay experienced by any given process due to any given page fault is still relatively insignificant although the performance of individual processes has probably suffered noticeably. System performance as a whole is being impacted by the amount of paging and there's little doubt that the system would run noticeably faster if it had more real memory (or less load). System performance may have even degraded to the point where it is "clearly unacceptable" (see previous point about calls and e-mails to the system administrator and think about where you're going to get funding for more memory).

By the definition stated above, this system is NOT thrashing.

Time goes by and the load on the system continues to increase. The system is still doing quite a lot of paging. The total paging rate (i.e. paging-related disk I/O operations per second) hasn't changed all that much but something new IS happening - it's taking so long for an "average" page fault to be processed that a process which suffers a page fault tends to lose memory-resident pages (i.e. suffer page outs) while it is waiting for the page fault to be processed. Processes are losing these memory-resident pages because the paging subsystem is being almost constantly forced to find real memory into which it can read pages to satisfy already outstanding page fault requests. Memory-resident pages are being taken from processes in the page fault queue because these pages appear to be idle (i.e. good candidates for being paged out). The problem is that some of the pages which are being taken (the technical term is "stolen" (really!)) are only idle because the processes which own them are waiting for other pages to be paged in. Once the "other pages" have been paged in, the process will almost immediately need at least some of the pages which were just taken.
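
Here is a minimal Python sketch of why a waiting process's pages look like good candidates for stealing. It assumes an LRU-style policy that steals whichever frame has gone unreferenced the longest; the frame layout, the process names A and B, and the functions steal_page and run are all hypothetical and not taken from any particular kernel.

```python
def steal_page(frames):
    """Return the frame that has gone unreferenced the longest (assumed LRU-style policy)."""
    return min(frames, key=lambda frame: frame["last_used"])


def run(ticks, blocked):
    """Advance a toy clock; only processes that are not blocked touch their pages."""
    frames = [
        {"owner": owner, "page": page, "last_used": 0}
        for owner in ("A", "B")
        for page in range(2)
    ]
    for now in range(1, ticks + 1):
        for frame in frames:
            if frame["owner"] not in blocked:
                frame["last_used"] = now  # a running process keeps its pages "hot"
    return steal_page(frames)


# Process A is waiting on a page fault, so its resident pages look idle.
victim = run(ticks=5, blocked={"A"})
print(victim["owner"])  # prints A
```

The page that gets stolen belongs to the process that is stuck in the page fault queue, which is exactly the page it is likely to want back as soon as its outstanding fault is satisfied.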

This is REALLY bad news because the system is now having to process page faults which have happened only because earlier page faults delayed processes long enough that they had pages taken away from them. In other words, the system's paging load is increasing BECAUSE the system's paging load is heavy! (The system administrator probably isn't answering the phone or replying to e-mails anymore and the "powers that be" are likely to be much less interested than they used to be in mundane questions like "is there money in the budget for more memory?")

A system in this condition does satisfy the definition of thrashing given above.

Sidebar: A system which is almost but not quite thrashing can suddenly find itself to be thrashing because of a momentary event (e.g. a process makes a sudden and fairly short request for more memory-resident pages). Unfortunately, once the system is thrashing it might very well continue thrashing even though the condition which caused it to start thrashing (i.e. a short term spike in demand for memory-resident pages) is no longer in effect.

This can happen because once the system starts thrashing, it starts paging out pages that really shouldn't be paged out. Since the pages shouldn't be paged out, they get paged back in again almost right away. This maintains the pressure on the paging subsystem (i.e. page in request queues continue to be excessively long) which in turn results in more pages being paged out (because they appear to be idle) which really shouldn't be paged out (because they are only idle because the process that owns the pages is being slowed down by the page faults that it is suffering).

In other words, a system which starts thrashing is potentially in a LOT of trouble!
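
The feedback effect in the sidebar can be illustrated with a toy model. This is only a sketch under invented assumptions: the function simulate, the parameters capacity, threshold and steal_factor, and all of the numbers are hypothetical. The model assumes that the paging subsystem can service a fixed number of faults per tick, and that once the fault queue is longer than some threshold, waiting processes start losing pages and so generate extra faults in proportion to the excess.

```python
def simulate(demand_per_tick, capacity=10, threshold=5, steal_factor=0.5):
    """Return the page fault queue length after each tick of a toy paging model."""
    queue = 0.0
    history = []
    for demand in demand_per_tick:
        # Faults that exist only because earlier faults waited too long
        # (pages stolen from processes stuck in the queue).
        induced = steal_factor * max(0.0, queue - threshold)
        queue = max(0.0, queue + demand + induced - capacity)
        history.append(queue)
    return history


# Normal demand (8 faults per tick) is below capacity (10 per tick) in every run.
small_spike = [8] * 5 + [12] * 2 + [8] * 13  # brief, mild burst
big_spike = [8] * 5 + [30] * 2 + [8] * 13    # brief, severe burst

print(simulate(small_spike)[-1])                            # queue drains back to 0.0
print(simulate(big_spike)[-1] > simulate(big_spike)[6])     # True: queue keeps growing
```

In both runs the demand returns to the same level after two ticks. In the second run the queue keeps growing anyway, because the induced faults alone now exceed the spare capacity of the paging subsystem.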

There's really only one short term solution that will get a system out of the thrashing state - the paging load on the system has to be reduced. This observation leads to the classic solution for thrashing - pick a process which is doing a lot of paging and simply suspend it (i.e. do not allow it to execute). Two things happen almost immediately:
  1. the system does less paging because one of the processes which was generating a lot of paging requests is no longer generating any paging requests.

  2. the memory-resident pages owned by the process which was suspended almost immediately become candidates for being paged out (some systems deliberately page out these pages right away while others don't bother since they'll get paged out fairly quickly anyway and simplicity is a virtue of the highest order in an operating system kernel).

The end result is usually quite dramatic as the system almost immediately stops thrashing. Of course, if the system load was high enough then suspending one process might not be sufficient. Consequently, the kernel continues to monitor the system and if it is still thrashing a short time later (a second or two is usually long enough to wait) then it picks another "victim" and suspends it. It doesn't take very long before the system stops thrashing.
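
The suspend-a-victim loop described above might be sketched as follows. This is an illustration rather than any real kernel's code: the Process class, the fault_rate field, the total_fault_rate helper and the LIMIT used to decide that the system is thrashing are all hypothetical, and a real kernel would wait a second or two between rounds instead of re-checking immediately.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    fault_rate: float  # page faults per second (hypothetical metric)
    suspended: bool = False


def total_fault_rate(processes):
    """Sum the fault rates of the processes that are still allowed to run."""
    return sum(p.fault_rate for p in processes if not p.suspended)


def relieve_thrashing(processes, is_thrashing):
    """Suspend the heaviest pager, re-check, and repeat until thrashing stops."""
    victims = []
    while is_thrashing(processes):
        running = [p for p in processes if not p.suspended]
        if not running:
            break  # nothing left to suspend
        victim = max(running, key=lambda p: p.fault_rate)
        victim.suspended = True
        victims.append(victim)
        # A real kernel would wait a second or two here before checking again.
    return victims


LIMIT = 100  # assumed fault rate above which this toy system counts as thrashing
procs = [Process("a", 50), Process("b", 120), Process("c", 30)]
victims = relieve_thrashing(procs, lambda ps: total_fault_rate(ps) > LIMIT)
print([v.name for v in victims])  # ['b']
```

Picking the process with the highest fault rate is one reasonable reading of "a process which is doing a lot of paging"; other selection rules would fit the description just as well.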

Now the question becomes "what do we do with the suspended processes?" Once the kernel is fairly certain that the system is no longer thrashing, it releases one of the suspended processes. This causes an almost immediate flurry of page ins as the previously suspended process needs to get the pages that it was using back into memory. Once the released process has had a chance to "get back into the game", the kernel checks again and, if the system is still not thrashing, releases another "victim". This continues until either the system is found to be thrashing again or all suspended processes have been released.
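
The release phase can be sketched in the same spirit. Again this is a hedged illustration: release_suspended, the is_thrashing and resume callbacks, and the fault rates below are hypothetical, and releasing processes in the order in which they were suspended is an assumption made for simplicity.

```python
def release_suspended(suspended, is_thrashing, resume):
    """Release one process at a time; stop as soon as thrashing is detected again."""
    waiting = list(suspended)
    while waiting and not is_thrashing():
        resume(waiting.pop(0))
        # A real kernel would pause here so that the released process can page
        # its pages back in before the next check.
    return waiting  # processes that are still suspended


LIMIT = 100                             # assumed thrashing threshold (faults per second)
running = {"c": 30}                     # fault rate of each running process
rates = {"a": 50, "b": 120, "d": 10}    # rates the suspended processes will resume at

left = release_suspended(
    ["a", "b", "d"],
    is_thrashing=lambda: sum(running.values()) > LIMIT,
    resume=lambda name: running.update({name: rates[name]}),
)
print(left)  # ['d']
```

In this run, releasing "b" pushes the toy system back over the limit, so "d" stays suspended and the kernel would go back to looking for a victim.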

And that, as they say, is about that. There are, of course, a number of details which have been skimmed over (e.g. it isn't a good idea to suspend a process which is currently holding a kernel resource that other processes will need almost immediately) and there are variations on the described approach. That said, the basic idea is quite simple: the way to get a system out of a thrashing condition is to reduce the paging load.


References

  • personal experience