Thursday, December 20, 2007

Making PBXT Fully Durable

Until now PBXT has been ACId (with a lower-case d). This is soon to change as I have had some weeks to work on a fully durable version of the transactional engine (

My first concern in making PBXT fully durable was to what extent I would have to abandon the original "write-once" design. While there are a number of ways to implement durability, the only method used by databases (as far as I know) is the write-ahead log.

The obvious advantage of this method is that all changes can be flushed at once. However, this requires that all data be written twice: once to the log and after that, to the database itself.

My solution to this problem is a compromise, but I think it is a good one. In a nutshell: short records are written twice, and long records are written once. When it comes to durability, this compromise, I believe, is a good one.

If a transaction writes only short records, then one flush will suffice to commit it. Because the records are short, contention on the write-ahead log is at a minimum. If a transaction writes any long records, most of the data will be written once to a data log (as opposed to a transaction log). Contention for writing on the data log is zero (because each writer has its own data log), but two flushes are required to commit the transaction.

By doing this I have saved other transactions having to wait while a certain transaction copies a large amount of data to the transaction log. Although the transaction log uses a double buffering system, this will still cause a hold up.

In summary: if you have a transaction which writes large records, then it will basically just hold up itself, and not everybody else.

Another innovation I have introduced to reduce contention on the transaction log is an "operation sequence number".

Normally operations must be synchronized on the transaction log to ensure consistency. For example, the allocation of a block must be written to the log before usage. But this means all threads need to lock the transaction log when performing an operation.

Instead of doing this, I issue a unique sequence number for each operation done on a table. The operations are then written to the log in batches without concern about the order.

The process that then applies the changes in the log to the database sorts the operations by sequence number before they are applied. This is also done on restart, during recovery.


Frank said...

Very interesting. I am curious to know what criteria you will use to classify a transaction as "short" vs "long"?


Paul McCullagh said...

Hi Frank,

This will probably not be a case of "one size fits all", so I plan to make this a tunable parameter (an amount in bytes). :)

Anonymous said...

Then how is this affecting replication? Obviously, you need MySQL to be able to read from the binlog and ship the DML to the slaves. You cannot just ship the short DML statements, disregarding the big ones. Cheers!

Paul McCullagh said...

Hi Gigiduru,

The engine's transaction log is not the same as the MySQL binlog. This is the case for all storage engines. So all the data is still written to the binlog.

Certainly it would be more efficient if MySQL accessed the engine's transaction log directly for replication purposes. This is an improvement to be considered in the future.