Wednesday, March 17, 2010

PBXT Engine Level replication, works!

I have been talking about this for a while, now at last I have found the time to get started! Below is a picture from my 2008 MySQL User Conference presentation. It illustrates how engine level replication works, and also shows how this can be ramped up to provide a multi-master HA setup.

What I now have running is the first phase: asynchronous replication, in a master/slave configuration. The way it works is simple. For every slave in the configuration the master PBXT engine starts a thread which reads the transaction log, and transfers modifications to a thread which applies the changes to PBXT tables on the slave.

Where to get it

I have pushed the changes that do this trick to PBXT 2.0 on Launchpad. The branch to try out is lp:pbxt/2.0.

Getting started

Setup of the replication is dead easy. Assuming you already have a PBXT database, what you need to do is the following:

1. Copy the Master data: Shutdown the MySQL server and make a complete copy of the data directory.

2. Setup a Slave server: Setup a second MySQL server using the copy of the data directory.

3. Declare the Slave: Create a text file called slaves, in the data/pbxt directory of the master server, with the following entry:
slave-process-name is any name you like, and is used to identify the replication thread running on the master. host-name-of-slave is the host name or IP address of the slave MySQL server. 37656 is the default port used by the PBXT slave engine to receive replication changes.

4. Enable replication: On the master server set pbxt_enable_replication=1, and on the slave server set pbxt_enable_replication=2. Also make sure that both servers have different server IDs (system parameter: server_id).

5. Start both servers: Replication will begin immediately if the slave server is started before master server, otherwise replication will begin after a minute (see below).

How it works

PBXT engine level replication, unlike MySQL replication, pushes changes to the slave. For every entry in the data/pbxt/slaves file, PBXT starts a thread (the supplier thread). The thread connects to the slave on the given address, and pushes the changes to an applier thread run by the PBXT engine on the slave side. If any error occurs, the supplier thread on the master will pause, and then try again in a minute.

On connect the supplier thread requests the global transaction ID (GID) of the last transaction committed on the slave. The applier determines the GID of the last transaction by searching backwards through its own transaction logs.

Replication is row-based, and fairly low level. Changes refer to the PBXT internal row and table IDs. The row data is transferred in the same format used to store the information on disk. This makes the replication extremely efficient. The supplier thread does not even have to read the log from disk if it is fairly up-to-date, because PBXT already caches the last changes to the transaction log for use by the writer and the sweeper threads.

Probably the most important thing about this type of replication is that it (theoretically) has almost no affect on the "foreground" activity on the master machine. I am interested to find out if this really is the case.

What's next?

Replication of DDL changes are not implemented yet. So if you do ALTER TABLE or any other such operation, replication will stop, and have to be restarted by copying over the data directory to the slave again.

After DDL changes the next step is to add synchronous replication, as illustrated above. This requires waiting for a commit from the slave before continuing. Latency in this case can be kept to a minimum by sending transactions to the slave before they have been committed on the master.

I believe this would then provide the basis for an extremely simple (and efficient) HA solution based on MySQL.