PrimeBase XT

Friday, June 11, 2010

An Overview of PBXT Versions

If you follow PBXT development you may have noticed a number of different versions of the engine have been mentioned in various talks and blogs.

There is actually a consistent strategy behind all this, which I would like to explain here.

PBXT 1.0 - Current: 1.0.11-3 Pre-GA

Launchpad: lp:pbxt

This is the current PBXT production release. It is stable in all tests and environments in which it is currently in use.

The 1.0.11 version of the engine is available in MariaDB 5.1.47.

PBXT 1.1 - Stability: RC

Launchpad: lp:pbxt/1.1

PBXT 1.1 implements memory resident (MR) tables. These tables can be used for fast, concurrent access to non-persistent data.

1.1 also adds parallel checkpointing. To do this, PBXT starts multiple threads to flush several tables at once during a checkpoint.
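
As a rough illustration of the idea (not the PBXT source), flushing several tables at once can be sketched with a small thread pool. The names flush_table and checkpoint, and the thread count, are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def flush_table(name):
    # Stand-in for writing one table's dirty data to disk.
    return f"{name}: flushed"

def checkpoint(tables, threads=4):
    # Flush several tables at once; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(flush_table, tables))

results = checkpoint(["t1", "t2", "t3"])
```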

This version is feature complete. Unless someone is interested in using MR tables in production, my plan is to leave 1.1 at the RC level and concentrate development on PBXT 1.5 and 2.0.

PBXT 1.1 is part of Drizzle.

PBXT 1.5 - Current: 1.5.01 Beta

Launchpad: lp:pbxt/1.5

PBXT 1.5 changes how the data logs are written, which makes the engine much faster, depending on the database schema.

Previously each user thread wrote its own data log. In version 1.5 the data logs are written the same way the transaction log is written. This means that group commit is implemented for the data logs.
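
To show what group commit buys you, here is a toy model (an assumption-laden sketch, not how PBXT actually writes its logs): several threads append their records to a shared log, and one flush makes all of them durable together. The class and attribute names are invented for the example.

```python
class GroupCommitLog:
    def __init__(self):
        self.pending = []   # records written but not yet flushed
        self.durable = []   # records that have reached "disk"
        self.flushes = 0    # number of (expensive) flush operations

    def write(self, record):
        self.pending.append(record)

    def flush(self):
        # One flush covers every record queued since the last flush.
        if self.pending:
            self.durable.extend(self.pending)
            self.pending.clear()
            self.flushes += 1

log = GroupCommitLog()
for thread_id in range(3):
    log.write(f"commit from thread {thread_id}")
log.flush()
```

Three commits cost one flush here; with a log per thread they would cost three.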

I have also added a data log cache which can help significantly if your data has hot spots.

The log-based architecture of PBXT makes it possible to write Terabytes of data without degrading performance. But, as the amount of data increases, garbage collection and random read speed can become a problem. I am currently focusing on solving these problems in 1.5.

PBXT 2.0 - Stability: Alpha

Launchpad: lp:pbxt/2.0

The major feature in PBXT 2.0 is engine level replication (ELR). This is an extremely efficient form of replication, while being fully transactional and reliable.

ELR works by transferring changes directly from the PBXT transaction and data logs to the PBXT engine on the slave. This means the binary log does not need to be written or flushed, which can greatly increase the speed of the master server (up to 10x in some tests).

Currently the replication does not handle database schema changes, but it works and is ready for testing.

Setting Priorities

PBXT is a free, open source project which is largely funded by a big name database company.

Nevertheless, I am not bound as to how I set priorities, which means I usually focus on what is important to those using and testing the engine.

Now that you have an overview of what's happening in the PBXT world, let me know if you have a problem that PBXT might fix. I'd be happy to hear from you... :)

Wednesday, May 12, 2010

PBXT 1.0.11 Pre-GA Released!

I have just released PBXT 1.0.11, which I have titled "Pre-GA". Going by our internal tests, and by all instances of PBXT in production and testing by the community, this is a GA version!

However, although PBXT has 1000's of instances in production, it is not used in very diverse applications. So I am waiting for wider testing and usage before removing the "Pre" prefix.

You can download the source code from primebase.org, or pull it straight from Launchpad. Here are instructions how to compile and build the engine with MySQL. PBXT builds with MySQL 5.1.46 GA, and earlier 5.1 versions.

If you don't want to compile it yourself, PBXT 1.0.11 will soon be available in the 5.1.46 release of MariaDB. And, for the more adventurous, PBXT 1.1 is included in Drizzle.

A complete list of all the changes in this version is in the release notes.

If you are testing PBXT and have any questions send me an e-mail. I will be glad to help.

And, oh yes. If you are looking for development or production support for MySQL/MariaDB and PBXT then please write to: support-at-primebase-dot-org.

We are working together with Percona and Monty Program Ab to provide the service level you require.

Monday, April 19, 2010

Stuck in the US of A

As far as I know, nobody who was at the MySQL User Conference and lives in Europe has made it back home yet!

Please leave a comment on this blog as soon as you get home. I am interested to know...

My flight was yesterday, so I have the worst prospects. I am booked on a flight for next week Wednesday (10 days delay)! No joke! :(

Friday, April 16, 2010

The other Oracle ACE Director

While the choices of Ronald Bradford and Sheeri Cabral were natural for Oracle ACE Director, my own nomination was perhaps a bit of a surprise. Well, it was to me anyway.

Those of you at the conference may have noticed that I had no (super-cool) ACE Director jacket when I was called up on the stage...

Well that was because the jacket was too big, and I had already returned it to Lenz for it to be exchanged.

Unfortunately I can't return the shoes because they are too big for me as well...

Wednesday, April 14, 2010

Slides of the PBXT Presentation

Here are the slides to my talk yesterday: A Practical Guide to the PBXT Storage Engine.

For anyone who missed my talk, I think it is worth going through the slides, because they are fairly self-explanatory.

If there are any questions, please post them as a comment to the blog. I will be glad to answer :)

Friday, April 09, 2010

PBXT at the MySQL User Conference 2010

At this year's User Conference I have some interesting results to present. But more than anything else, my talk will explain how you can really get the most out of the engine. The design of PBXT makes it flexible, but this provides a lot of options. What tools are available to help you make the right decisions? I will explain.

Every design has trade-offs. How does this work out in practice for PBXT? And how can you take advantage of the strengths of the storage engine? I will explain in:

A Practical Guide to the PBXT Storage Engine
Paul McCullagh
2:00pm - 3:00pm Tuesday, 04/13/2010
Ballroom E

Don't miss it! :)

Wednesday, March 17, 2010

PBXT Engine Level replication, works!

I have been talking about this for a while, now at last I have found the time to get started! Below is a picture from my 2008 MySQL User Conference presentation. It illustrates how engine level replication works, and also shows how this can be ramped up to provide a multi-master HA setup.

What I now have running is the first phase: asynchronous replication, in a master/slave configuration. The way it works is simple. For every slave in the configuration the master PBXT engine starts a thread which reads the transaction log, and transfers modifications to a thread which applies the changes to PBXT tables on the slave.

Where to get it

I have pushed the changes that do this trick to PBXT 2.0 on Launchpad. The branch to try out is lp:pbxt/2.0.

Getting started

Setup of the replication is dead easy. Assuming you already have a PBXT database, what you need to do is the following:

1. Copy the Master data: Shutdown the MySQL server and make a complete copy of the data directory.

2. Setup a Slave server: Setup a second MySQL server using the copy of the data directory.

3. Declare the Slave: Create a text file called slaves, in the data/pbxt directory of the master server, with the following entry:

[slave]
name=slave-thread-name
host=host-name-of-slave
port=37656

slave-thread-name is any name you like, and is used to identify the replication thread running on the master. host-name-of-slave is the host name or IP address of the slave MySQL server. 37656 is the default port used by the PBXT slave engine to receive replication changes.

4. Enable replication: On the master server set pbxt_enable_replication=1, and on the slave server set pbxt_enable_replication=2. Also make sure that both servers have different server IDs (system parameter: server_id).

5. Start both servers: Replication will begin immediately if the slave server is started before the master server, otherwise replication will begin after a minute (see below).
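
Step 3 can be scripted. The sketch below only formats the entry shown above and writes it to data/pbxt/slaves; the helper names slave_entry and write_slaves_file are hypothetical, and the demo uses a throwaway directory rather than a real MySQL data directory.

```python
import os
import tempfile

def slave_entry(name, host, port=37656):
    """Format one entry for the slaves file, using the layout shown in step 3."""
    return f"[slave]\nname={name}\nhost={host}\nport={port}\n"

def write_slaves_file(data_dir, entry):
    """Write the entry to <data_dir>/pbxt/slaves and return the path."""
    pbxt_dir = os.path.join(data_dir, "pbxt")
    os.makedirs(pbxt_dir, exist_ok=True)
    path = os.path.join(pbxt_dir, "slaves")
    with open(path, "w") as f:
        f.write(entry)
    return path

# Demo against a temporary directory standing in for the master's data directory.
with tempfile.TemporaryDirectory() as data_dir:
    path = write_slaves_file(
        data_dir, slave_entry("slave-thread-name", "host-name-of-slave")
    )
    with open(path) as f:
        written = f.read()
```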

How it works

PBXT engine level replication, unlike MySQL replication, pushes changes to the slave. For every entry in the data/pbxt/slaves file, PBXT starts a thread (the supplier thread). The thread connects to the slave on the given address, and pushes the changes to an applier thread run by the PBXT engine on the slave side. If any error occurs, the supplier thread on the master will pause, and then try again in a minute.

On connect the supplier thread requests the global transaction ID (GID) of the last transaction committed on the slave. The applier determines the GID of the last transaction by searching backwards through its own transaction logs.
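
The handshake can be pictured with a toy in-memory model. This is only a sketch of the idea, not the engine's code: it assumes GIDs increase with commit order, and the names Applier and push_changes are invented for the example.

```python
class Applier:
    """Slave side: applies transactions and records them in its own log."""

    def __init__(self):
        self.log = []  # (gid, change) pairs in commit order

    def last_committed_gid(self):
        # Search backwards through the log; 0 means nothing applied yet.
        for gid, _change in reversed(self.log):
            return gid
        return 0

    def apply(self, gid, change):
        self.log.append((gid, change))


def push_changes(master_log, applier):
    """Supplier side: ask where the slave is, then push everything newer."""
    start = applier.last_committed_gid()
    sent = 0
    for gid, change in master_log:
        if gid > start:
            applier.apply(gid, change)
            sent += 1
    return sent


master_log = [(1, "insert a"), (2, "update a"), (3, "delete a")]
slave = Applier()
slave.apply(1, "insert a")          # the slave already has transaction 1
sent = push_changes(master_log, slave)
resent = push_changes(master_log, slave)
```

Because the supplier always starts from the slave's last GID, reconnecting after an error does not resend transactions the slave already has.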

Replication is row-based, and fairly low level. Changes refer to the PBXT internal row and table IDs. The row data is transferred in the same format used to store the information on disk. This makes the replication extremely efficient. The supplier thread does not even have to read the log from disk if it is fairly up-to-date, because PBXT already caches the last changes to the transaction log for use by the writer and the sweeper threads.

Probably the most important thing about this type of replication is that it (theoretically) has almost no effect on the "foreground" activity on the master machine. I am interested to find out if this really is the case.

What's next?

Replication of DDL changes is not implemented yet. So if you do ALTER TABLE or any other such operation, replication will stop, and will have to be restarted by copying the data directory over to the slave again.

After DDL changes the next step is to add synchronous replication, as illustrated above. This requires waiting for a commit from the slave before continuing. Latency in this case can be kept to a minimum by sending transactions to the slave before they have been committed on the master.

I believe this would then provide the basis for an extremely simple (and efficient) HA solution based on MySQL.

Friday, February 26, 2010

Embedded PBXT is Cool

Martin Scholl (@zeit_geist) has started a new project based on the PBXT storage engine: EPBXT - Embedded PBXT! In his first blog he describes how you can easily build the latest version: Building Embedded PBXT from bzr.

The interesting thing about this project is that it exposes the "raw" power of the engine. Some basic performance tests show this really is the case.

At the lowest level, PBXT does not impose any format on the data stored in tables and indexes. When running as a MySQL storage engine it uses the MySQL native row and index formats. Theoretically it would be possible to expose this in an embedded API. The work Martin is doing goes in at this level. The wrapper around the engine determines the data types, data sizes, row and index format. Comparison operations for the data types are also supplied by the embedded code or user program.

This flexibility will make it possible for an application to store its own data very efficiently. As Martin suggested, it would also be possible to use Google's protobufs for the row format. This would eliminate the need to use an ALTER TABLE for many types of changes to a table's definition!

Of course, EPBXT is still a way from realizing this vision, and Martin has some very specific problems he wants to solve with the development. However, judging by his command of the code within such a short time, this is going to be a project to watch in the future!

Monday, February 08, 2010

Ken we will miss you!

What does it take for someone fiercely loyal to a company to suddenly leave? Ken Jacobs, Oracle employee number 18, a man who sincerely loves the company, has resigned! The only reason I can think of is an extreme snub!

I must say, I am very disappointed. The prospect of Ken running MySQL was a light at the end of the tunnel for the community. Why? Because Ken is a MySQL insider! He knows the project, he knows the community.

As an engine developer I have come to know Ken well over the last 4 years. He led the InnoDB team and is largely responsible for the improvements made to the engine since the Oracle acquisition. At the yearly Engine Summit he was always professional and constructive in his suggestions, with a deep technical knowledge of the subject. His track record shows that he has always kept his word with regard to Oracle's intentions with InnoDB, and I would trust him to do the same with MySQL.

Goodbye Ken. This is a great loss for both the MySQL community and Oracle!

Thursday, December 31, 2009

PBXT 1.0.10, New Year Release!

I have just released PBXT 1.0.10 RC4. The sources can be downloaded from primebase.org, or from Launchpad.

The major feature in this release is the implementation of the pbxt_flush_log_at_trx_commit system variable. Similar to the InnoDB equivalent, this variable allows you to determine the level of durability of transactions.

This is a trade-off: by decreasing durability, the speed of database update operations can be increased.

The default setting is 1, which means full durability: the transaction log is flushed on every transaction commit.

Setting the variable to 2 reduces durability, by just writing the log on transaction commit (no flush is done). In this case, transactions can only be lost if the entire server machine goes down (for example a power failure).

The lowest level of durability is 0. In this case the transaction log is not written on transaction commit. Transactions can be lost if the server crashes.

In the case of 2 and 0, the engine flushes the transaction log at least once per second. So only transactions executed within the last second can be lost.
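
The three settings can be summarized in a few lines. This is a simplified model of the behaviour described above, not the engine's code: the function name commit and the pending list are made up, and the once-per-second background flush is not modelled.

```python
import os
import tempfile

pending = []  # level 0: records wait in memory for the background flush

def commit(fd, record, level=1):
    """Return the log actions taken at commit time for a durability level."""
    actions = []
    if level == 0:
        pending.append(record)      # nothing is written at commit
    else:
        os.write(fd, record)        # levels 1 and 2 write the log
        actions.append("write")
        if level == 1:
            os.fsync(fd)            # only level 1 also flushes to disk
            actions.append("flush")
    return actions

with tempfile.TemporaryFile() as log:
    taken = {level: commit(log.fileno(), b"trx\n", level) for level in (1, 2, 0)}
```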

Ironically, PBXT started life as a "partially durable" storage engine (level 2 according to the description above). Almost exactly 2 years ago I started the implementation of full durability. It has taken a while to build in the original "feature" :)

The main reason for doing this has been the Mac version, and our work with TeamDrive. On the Mac the fsync() operation is a fake. To do a true flush to disk you have to call fcntl(of->of_filedes, F_FULLFSYNC, 0). Problem is, the real flush is incredibly slow (about 20 times slower than fsync), but necessary to avoid any corruption.
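
For readers who want to try the difference outside the engine, here is a small portable sketch in Python (Unix only). It uses F_FULLFSYNC where the platform's fcntl module defines it and falls back to a plain fsync elsewhere; the function name full_sync is an invention for this example.

```python
import fcntl
import os
import tempfile

def full_sync(fd):
    """Flush fd to stable storage, using F_FULLFSYNC where the platform has it."""
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        fcntl.fcntl(fd, full)   # macOS: ask the drive to flush its cache too
        return "F_FULLFSYNC"
    os.fsync(fd)                # other platforms: ordinary fsync
    return "fsync"

with tempfile.TemporaryFile() as f:
    f.write(b"important data")
    f.flush()                   # push Python's buffer to the OS first
    method = full_sync(f.fileno())
```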

The advantage of a lot of applications like TeamDrive is that they can tolerate a lower level of durability. So we can look forward to an even speedier TeamDrive in the future :)

I would love to hear from anyone testing the new version. Bugs can be reported on Launchpad, as usual.

Happy New Year to you all!

Monday, December 14, 2009

Monty's appeal is selfless!

What many people don't get is that Monty's appeal to the MySQL community to help save MySQL is really quite selfless.

The fact is, Monty's own company, Monty Program Ab, stands to benefit the most from bad stewardship of MySQL by Oracle.

If Oracle slows and closes up development, rejects community contributions and creates a commercial version of MySQL, then Monty Program's MariaDB fork will become very popular, very quickly.

This would translate into income for Monty Program Ab as customers come to his company for the additions, features and bug fixes that they need to secure their own production.

What Monty is concerned about is the commercial vendors of MySQL (one of which Monty Program is not).

These vendors either:

OEM MySQL and integrate it into a commercial software or hardware product, or
they produce a closed source (or dual-license) storage engine, which is sold with a commercial version of MySQL.

Oracle could kill both businesses, and this is Monty's main concern. As Monty explained in a phone call this morning: he sees the existence of commercial/dual-license vendors of MySQL as very important to the long-term survival of "his baby".

Of course Oracle cannot prevent 3rd parties from continuing to offer consulting, support and training for MySQL. But close sourcing and vigorous enforcement of trademarks can make things very difficult for such companies.

Unfortunately Oracle's latest concessions may not be enough to satisfy investors in MySQL-based technology either, because there is no guarantee of what happens after 5 years.

Wednesday, November 11, 2009

The EU's real problem: MySQL and Oracle do not compete!

I think that most people are missing the point, Oracle included. The main objection of the EU is not that Oracle is swallowing up a major competitor.

To understand this you have to read between the lines of the EU decision:

"The regulators see a major conflict of interest in the world's largest commercial database company owning its largest open-source competitor"

This should actually read: "the world's largest commercial database company owning the largest open-source database"

The database market is divided into 2 parts: the back-office and the online world.

And now you know what I am going to say ... Oracle has a near monopoly in back-office and MySQL has a monopoly in online applications.

So let's do a little maths:

Let's assume that back-office and online applications divide the database market into 2 equal parts, and that Oracle owns 60% of the back-office, and MySQL 90% of the online world.

This means that Oracle controls 30% (60% of 50%) of the entire database market today, but after the acquisition this number will be 75% (30% + 90% of 50%).
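
The arithmetic is easy to check. The figures below are just the assumptions stated above (an even split, 60% and 90%), not market data.

```python
from fractions import Fraction

# Assumed split of the database market into two equal halves.
back_office = online = Fraction(1, 2)

oracle_before = Fraction(60, 100) * back_office   # 60% of 50%
mysql_share = Fraction(90, 100) * online          # 90% of 50%
oracle_after = oracle_before + mysql_share        # combined share

print(f"before: {float(oracle_before):.0%}, after: {float(oracle_after):.0%}")
```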

Something to think about.

Friday, September 18, 2009

The mysterious Storage Engine Independent Test Suite

Recently Mark observed that we now all need a storage engine independent test suite, Sun included! Well, as far as I know, there is such a thing at Sun, sort of. Apparently it has been used to test PBXT and other engines, but I've heard it is not in good enough shape to be released.

But my question is, why not release it anyway? We could turn it into an engine community project. I believe there are enough engine developers out there to get this moving forward.

The secret is to start small, and just get a few tests to run with all engines. Then additional tests can be added step by step. Engines need a way to specify that they want to skip a test entirely (e.g. transactional tests), and it should be easy to customize results for various engines.

An example of a simple and elegant solution can be found in Drizzle. As Monty Taylor mentioned in a comment to Mark's blog: "We have some patches to test-run in Drizzle to allow running the whole test suite with a specified storage engine".

I think it has been long enough. This could be a good opportunity to start a Sun/Community project, something like Drizzle. In other words, get something out there, even if it is incomplete, and let the community also take a large part of the responsibility.

Friday, September 11, 2009

PBXT 1.0.09 RC3 implements XA and online backup

I have just released PBXT 1.0.09 RC3. Besides bug fixes (details in the release notes), this version includes 2 Beta features:

XA/2-Phase Commit support
Native online backup Driver

XA support has been around in MySQL for quite a while, and we all know of its usefulness, for example when sharding. So I was surprised to find a bug in the XA recovery: Bug #47134. Contrary to what is reported, the crash can also occur when using XA with just the default engines installed, so watch out for that one (the good news: the bug fix is simple).

Online backup is really cool! I have heard that it may soon be released in a coming version of 5.4, so let's hope that this is true.

In a little test, I did a backup of a 10GB database in 49.26 seconds! Admittedly this was on a system with 4 15K drives in a RAID 0 configuration. But that is still fantastic, considering the tables are not even locked during this time!

The database itself took 19 min. 56 sec. to generate. A complete restore took only 14 min. 29 sec.

But, it gets even better....

I have been working on PBXT 1.1, where I have done a number of things to improve the I/O performance of the engine.

In the same test as above, run with PBXT 1.1, the time to generate the database was 9 min. 35 sec., and the time to restore was 6 min. 18 sec.! (Time to generate the backup was identical.)
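
To put these timings in perspective, the backup rate and the 1.0.09 versus 1.1 speed-ups can be worked out directly. Only the numbers quoted in this post are used; GB is taken at face value.

```python
def seconds(minutes, secs):
    return minutes * 60 + secs

# 10 GB backed up in 49.26 seconds.
backup_rate = 10 / 49.26                                  # GB per second

# Time with PBXT 1.0.09 divided by time with PBXT 1.1.
gen_speedup = seconds(19, 56) / seconds(9, 35)            # generating the database
restore_speedup = seconds(14, 29) / seconds(6, 18)        # restoring the backup

print(f"backup: {backup_rate * 1000:.0f} MB/s")
print(f"generate: {gen_speedup:.2f}x faster, restore: {restore_speedup:.2f}x faster")
```

So the backup ran at roughly 200 MB/s, and 1.1 was a little over twice as fast as 1.0.09 for both generating and restoring the database.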

PBXT 1.1 is available directly from Launchpad here: lp:~pbxt-core/pbxt/staging, if you are interested in trying it out. 1.1 also has full support for memory based tables.

The new release candidate (PBXT 1.0.09) can be downloaded from primebase.org/download. It is also available from Launchpad as the rc3 series: lp:pbxt/rc3.

Please report bugs here.

Any feedback is welcome! You can use Launchpad questions or the PBXT mailing list for this purpose.

Friday, August 21, 2009

PBXT at the OpenSQL Camp hosted by the FrOSCon 2009

Vladimir will be giving a presentation on PBXT at the FrOSCon 2009 in St. Augustin, near Bonn in Germany tomorrow:

PBXT: Technology trends that affect your Database
Room: C120/OpenSQLCamp
Time: 22 Aug 2009, 18:15 - 18:45

The talk is packed with interesting information about how the design of PBXT handles the major technological challenges of the future, including multiple cores, lots of RAM and solid state drives.

If you are in the area, check it out! :)

Thursday, August 20, 2009

What if MySQL dropped the Dual License?

In his blog Does the GPL Matter? In a Word, Yes, Stephen O'Grady makes the significant point that the dual-licensing model has a major drawback:

Sun/MySQL can only include patches and contributions if they fully own the copyright to those changes.

This gives forks like Drizzle, OurDelta, Percona and MariaDB a major advantage over the Sun version: they can include the best patches from all over. And it is clear that the momentum is building.

In a follow-up blog, Stephen asks: "what would the implications be if MySQL, of all projects, were forced to abandon the dual-licensing model it had long championed?"

Thinking about this, there is something that really bothers me:

Let's assume MySQL took on patches without ownership of the copyright, and thereby lost the ability to provide a commercial license to OEM customers.

According to the GPL this would mean that nobody could ever ship a commercial product with MySQL built-in!

To prevent this possibility from being lost to the world forever, surely MySQL would have to abandon the GPL, and maybe change to LGPL or BSD!

Tuesday, August 11, 2009

Jeremy's article on PBXT in Linux Magazine

Jeremy Zawodny of Craigslist wrote a great article on PBXT for Linux Magazine:

PBXT: Your Next MySQL Storage Engine?

Check it out...

Thanks Jeremy :)

Tuesday, June 30, 2009

PBXT 1.0.08 RC2 Released!

The second Release Candidate of PBXT, version 1.0.08, has just been released.

As I have mentioned in my previous blogs (here and here), I did a lot to improve performance for this version.

At the same time I am confident that this release is stable as we now have a large number of tests, including functionality, concurrency and crash recovery. But even more important, the number of users of PBXT has increased significantly since the last RC release, and that is the best test for an engine.

So there has never been a better time to try out PBXT! :)

You can download the source code, and selected binaries from here: primebase.org/download.

Vladimir and I have made a lot of changes; for details check out the release notes.

Bugs can be reported on Launchpad, here.

There is also a new PBXT mailing list, so if you have any questions this is the best place for them.

PBXT is a high-performance, MVCC-based, transactional storage engine for MySQL. The project is open source (GPL) and hosted on Launchpad. PBXT supports referential integrity, row-level locking and is fully ACID compliant.

For more information please go to the PBXT home at: primebase.org.

Wednesday, May 13, 2009

At last we have a MySQL Foundation, it's called The Open Database Alliance

Just over a year ago we registered the domain name mysqlfoundation.org in the hope that Sun/MySQL would actually create such an entity.

My idea was to move the development of the MySQL Community server to the Foundation and make the development fully community orientated. The Foundation would have its own development goals and release schedule. Sun could then pull patches from the Foundation's Community server into the Enterprise server once they had stabilized.

I pitched the idea to several people at Sun back then and over the last year, however, for some reason, the foundation concept just proved impossible to push through.

I believe this would have been a great opportunity for Sun to take the leadership in the community, as the foundation idea dates back to before things really started splitting up. But Sun's loss is now that of Oracle, who perhaps doesn't care anyway.

What is really most important is that we in the community now have an entity that is going to tie our side of things together: The Open Database Alliance. For the community it is critical that things do not split up any further and that instead our efforts are bundled. I believe the Alliance can do this for us.

So where does that leave Oracle?

Well, as I see it, we now have a new, more relevant, community/enterprise split: the Oracle MySQL Enterprise server and the MariaDB Community server.

And, I guess I have to stand up and say, for us (primebase.org) this difference is real and significant.

PBXT is already part of most community builds including MariaDB, OurDelta and XAMPP. But it is not part of the official MySQL 5.1 Community Server.

Please note, this has nothing to do with my many great friends at MySQL! They help us in lots of other ways and I am very thankful for this :)

But even with the "community" label, any download offered by Sun (now Oracle of course - no change there) is about business! That is very difficult to change, and I accept that.

But the community does not need to change anything. It is, what it is.

Friday, April 17, 2009

PrimeBase Engines at the MySQL Conference 2009

Barry, Vladimir and I (the entire PrimeBase dev team!) will be presenting next week at the MySQL User Conference and Expo.

We've got lots of cool stuff going on. Barry will tell you how PBMS can store your BLOBs in the clouds, Vladimir will be explaining what makes PBXT so fast, and I will be talking about the past, the present and the future...

Even if that all doesn't interest you, be sure to just drop by to say hi. We're friendly, really! :)

The PBXT Storage Engine: Meeting Future Challenges
Paul McCullagh
3:05pm - 3:50pm Tuesday, 04/21/2009
Ballroom B

BLOB Streaming: Efficient Reliable BLOB Handling
for all Storage Engines
Barry Leslie
2:50pm - 3:35pm Thursday, 04/23/2009
Ballroom B

Making PBXT Faster
Vladimir Kolesnikov
11:15am - 11:55am Thursday, 04/23/2009
Percona Performance Conference - Rooms 203 & 204