Wednesday, March 17, 2010

PBXT Engine Level replication, works!

I have been talking about this for a while, now at last I have found the time to get started! Below is a picture from my 2008 MySQL User Conference presentation. It illustrates how engine level replication works, and also shows how this can be ramped up to provide a multi-master HA setup.

What I now have running is the first phase: asynchronous replication, in a master/slave configuration. The way it works is simple. For every slave in the configuration the master PBXT engine starts a thread which reads the transaction log, and transfers modifications to a thread which applies the changes to PBXT tables on the slave.

Where to get it

I have pushed the changes that do this trick to PBXT 2.0 on Launchpad. The branch to try out is lp:pbxt/2.0.

Getting started

Setup of the replication is dead easy. Assuming you already have a PBXT database, what you need to do is the following:

1. Copy the Master data: Shutdown the MySQL server and make a complete copy of the data directory.

2. Setup a Slave server: Setup a second MySQL server using the copy of the data directory.

3. Declare the Slave: Create a text file called slaves, in the data/pbxt directory of the master server, with the following entry:
slave-process-name is any name you like, and is used to identify the replication thread running on the master. host-name-of-slave is the host name or IP address of the slave MySQL server. 37656 is the default port used by the PBXT slave engine to receive replication changes.

4. Enable replication: On the master server set pbxt_enable_replication=1, and on the slave server set pbxt_enable_replication=2. Also make sure that both servers have different server IDs (system parameter: server_id).

5. Start both servers: Replication will begin immediately if the slave server is started before master server, otherwise replication will begin after a minute (see below).

How it works

PBXT engine level replication, unlike MySQL replication, pushes changes to the slave. For every entry in the data/pbxt/slaves file, PBXT starts a thread (the supplier thread). The thread connects to the slave on the given address, and pushes the changes to an applier thread run by the PBXT engine on the slave side. If any error occurs, the supplier thread on the master will pause, and then try again in a minute.

On connect the supplier thread requests the global transaction ID (GID) of the last transaction committed on the slave. The applier determines the GID of the last transaction by searching backwards through its own transaction logs.

Replication is row-based, and fairly low level. Changes refer to the PBXT internal row and table IDs. The row data is transferred in the same format used to store the information on disk. This makes the replication extremely efficient. The supplier thread does not even have to read the log from disk if it is fairly up-to-date, because PBXT already caches the last changes to the transaction log for use by the writer and the sweeper threads.

Probably the most important thing about this type of replication is that it (theoretically) has almost no affect on the "foreground" activity on the master machine. I am interested to find out if this really is the case.

What's next?

Replication of DDL changes are not implemented yet. So if you do ALTER TABLE or any other such operation, replication will stop, and have to be restarted by copying over the data directory to the slave again.

After DDL changes the next step is to add synchronous replication, as illustrated above. This requires waiting for a commit from the slave before continuing. Latency in this case can be kept to a minimum by sending transactions to the slave before they have been committed on the master.

I believe this would then provide the basis for an extremely simple (and efficient) HA solution based on MySQL.


Shlomi N. said...

Honestly, I can't wait to put my hands on PBXT on production!

Please keep up the good work

Diego Medina said...

This looks great, one small question, is it:





Paul McCullagh said...

Thanks Shlomi!

Diego: I guess thread or process would both be correct.

The most important thing about the name is that errors during replication are identified using the name in the error log.

So if you are replicating to multiple slaves you can easily identify which slave is causing the problem.

erkules said...

great great great!

Mark Callaghan said...

Can queries run on the slave while replication is running?

Paul McCullagh said...

Hi Mark,

Yes, queries can be run on the slave during replication.

In fact, there is nothing stopping updates from being done either.

However, replication will stop if a row is not found, that needs to be updated.

Øystein said...

Why do you need to shut down the server to make a copy of the database? I would assume that since you are replaying the transaction log, it would be possible to start with an fuzzy image.

Paul McCullagh said...

Whew Øystein!

You have really got me thinking on that one...

Firstly, there is no need for a "clean" snapshot, because when the slave starts up on the copy it will perform recovery.

So basically the requirement is to have a recoverable copy of the data. So something like an LVM snapshot would be fine.

Now if you make a copy while the server is running (assuming update activity), one thing that will not work is if you make a copy of the transaction logs first.

This is because you will have changes applied inconsistently to tables that you have no record of in the transaction log.

What would have to be confirmed is: whether there is a copy order for the restart files, table files, transaction logs and data logs that would produce a recoverable image.

Assuming we have a way to tell PBXT not to remove data or transaction logs while the copy is in progress, I _think_ this may be the case!

So thanks for the suggesting! :)

Chang said...

will you implement master-master replication and when?