Two phase commit (2PC, sometimes tpc) is a technique used for distributing data between
databases. There is no standard two phase commit protocol or
implementation; every database has their own implementation. Two phase commit
is used when data has to be updated in several databases, and when it is
paramount that all databases are in synchronization. Examples of such database
environments are
- airline booking systems
- stock and money markets
- banking
systems
- credit card systems.
In these systems it is important that all data
is in synch, in order to maintain data integrity. If just one of the
many databases in these systems are out of synch, the whole system can be
corrupt. End-users can be working on the same data entities but in different
databases, and if they aren't correctly updated throughout the system, there's a
huge problem. Systems with only one database use locking of the data that is
accessed, but this doesn't work in distributed database systems.
The problem with synchronizing data is not as trivial as one first may think.
If two separate databases are to be kept in synch, they have to rely on
transactions. They also have to know that the transaction was not interrupted
in any way. If you have a Transaction Processing Monitor (TPM), you rely on
the TPM to deliver the transaction at any price - which is also done by a two
phase commit protocol. Usually, the price is high, since TPM's are very
expensive. The other solution is to use database two phase commit, which also
will ensure data integrity when transactions fail. The implementations differ
between TPM and DB two phase commits.
Consider the following setup: One database system (A) has to send data
to another database (B), and A cannot use the data itself until it knows that the
data has entered the other database as well. As you know from using the
internet, data transactions can disappear for no apparent reason. How can A be
sure that B has received the data ? This could be accomplished by having B send
an acknowledgement when it has received the data. But, and this is the
crux; the ACK is data in itself, so how can we know that the ACK has been
received ? We would end up in an infinite loop of ACK's being sent between A
and B.
___ ___
| | | | A: Sending data
| A |-->---| B | B: Waiting
|___| |___|
___ ___
| | | | A: Waiting for ack for data received
| A |------> B | B: Receiving data
|___| |___|
___ ___
| | | | A: Waiting for ack for data received
| A |---<--| B | B: Sending ack for data received
|___| |___|
___ ___
| | | | A: Receiving ack for data received
| A <------| B | B: Waiting for ack for ack received
|___| |___|
___ ___
| | | | A: Sending ack for ack received
| A |-->---| B | B: Waiting for ack for ack received
|___| |___|
___ ___
| | | | A: Waiting for ack for ack for ack received
| A |------> B | B: Receiving ack for ack received
|___| |___|
This would go on and on and on, as you can see, and therefore the two phase
commit technology is used.
There are two steps in the two phase commit, as the name suggests:
- Prepare - The initiating database ask all receiving
databases to promise to either commit or rollback a given
transaction
- Commit - The receiving databases respond whether they are
prepared to commit the transaction, and the initiating database
request the commit, if all are ready.
If not all receiving databases respond that they are prepared, the initiating
database will require everyone to rollback the transaction. When the
commit is issued, all databases commit the transaction and send back an acknowledgement.
No further communication is used, and it is expected and taken for granted that
all data is simultaneously updated throughout the whole system.
In order to protect the commit process from network, software or
hardware problems, the databases have a pending transactions table,
which is used by the recovery process that is automatically invoked if
the commit fails. After the decision has been made by the initiating database
whether to commit or rollback, this decision will be honored by each system as
soon as the become available again. The reason is that when the databases
reply that they are prepared to commit, they have already committed the
transaction in the pending transactions table. If they don't get an answer
back that they should commit, the pending transaction stays until they get this
message. The initiating database will send the commit until it gets the the
acknowledgement back.
The downside with the two phase commit is, besides that it requires a
lot of network traffic, it is also dependent on its weakest link. If
something fails, the data entity will be locked for use. If you don't have
valuable data, or trust your network technicians, you can just leave the two
phase commit turned off.