PrimeBase XT: 2009

Thursday, December 31, 2009

PBXT 1.0.10, New Year Release!

I have just released PBXT 1.0.10 RC4. The sources can be downloaded from primebase.org, or from Launchpad.

The major feature in this release is the implementation of the pbxt_flush_log_at_trx_commit system variable. Similar to the InnoDB equivalent, this variable allows you to determine the level of durability of transactions.

This is a trade-off: by decreasing durability, the speed of database update operations can be increased.

The default setting is 1, which means full durability: the transaction log is flushed on every transaction commit.

Setting the variable to 2 reduces durability, by just writing the log on transaction commit (no flush is done). In this case, transactions can only be lost if the entire server machine goes down (for example a power failure).

The lowest level of durability is 0. In this case the transaction log is not written on transaction commit. Transactions can be lost if the server crashes.

In the case of 2 and 0, the engine flushes the transaction log at least once per second. So only transactions executed within the last second can be lost.
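
The three settings can be sketched roughly as follows. This is only an illustration in Python, not PBXT code: the class TrxLog, its counters and the demo() helper are hypothetical names, and the once-per-second background flush is modelled as an explicit tick() call rather than a real thread.

```python
import os
import tempfile

class TrxLog:
    """Toy transaction log illustrating the three durability levels."""

    def __init__(self, path, flush_log_at_trx_commit=1):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.level = flush_log_at_trx_commit
        self.buffer = []      # records not yet handed to the operating system
        self.writes = 0       # number of write() calls issued
        self.flushes = 0      # number of fsync() calls issued

    def _write(self):
        if self.buffer:
            os.write(self.fd, b"".join(self.buffer))
            self.buffer.clear()
            self.writes += 1

    def _flush(self):
        os.fsync(self.fd)
        self.flushes += 1

    def commit(self, record):
        self.buffer.append(record)
        if self.level == 1:       # full durability: write and flush
            self._write()
            self._flush()
        elif self.level == 2:     # write only: survives a server crash
            self._write()
        # level 0: nothing is done until the next tick()

    def tick(self):
        """Stand-in for the background flush done at least once per second."""
        self._write()
        self._flush()

    def close(self):
        self.tick()
        os.close(self.fd)

def demo(level):
    """Commit one record and report (writes, flushes) before any tick()."""
    with tempfile.TemporaryDirectory() as d:
        log = TrxLog(os.path.join(d, "xt.log"), level)
        log.commit(b"trx-1\n")
        counts = (log.writes, log.flushes)
        log.close()
        return counts
```

Calling demo() with each level shows how much work a commit does on its own: a write plus a flush at level 1, a write only at level 2, and nothing at level 0.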

Ironically, PBXT started life as a "partially durable" storage engine (level 2 according to the description above). Almost exactly 2 years ago I started the implementation of full durability. It has taken a while to build in the original "feature" :)

The main reason for doing this has been the Mac version, and our work with TeamDrive. On the Mac, the fsync() operation is a fake. To do a true flush to disk you have to call fcntl(of->of_filedes, F_FULLFSYNC, 0). The problem is that the real flush is incredibly slow (about 20 times slower than fsync), but necessary to avoid any corruption.
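
The same idea can be sketched in Python (the helper name full_sync is made up for this example). It assumes the fcntl module exposes F_FULLFSYNC only on platforms that define it, and falls back to a plain fsync() everywhere else, or if the file system refuses the request.

```python
import os
import tempfile

try:
    import fcntl                      # not available on Windows
except ImportError:
    fcntl = None

# Only defined on platforms that support a full flush (e.g. Mac OS X).
F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)

def full_sync(fd):
    """Flush fd as far as the platform allows; return the method used."""
    if F_FULLFSYNC is not None:
        try:
            fcntl.fcntl(fd, F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            pass                      # request refused: fall back to fsync
    os.fsync(fd)
    return "fsync"

if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"commit record")
        f.flush()                     # push Python's buffer to the OS first
        print(full_sync(f.fileno()))
```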

The advantage of a lot of applications like TeamDrive is that they can tolerate a lower level of durability. So we can look forward to an even speedier TeamDrive in the future :)

I would love to hear from anyone testing the new version. Bugs can be reported on Launchpad, as usual.

Happy New Year to you all!

Monday, December 14, 2009

Monty's appeal is selfless!

What many people don't get is that Monty's appeal to the MySQL community to help save MySQL is really quite selfless.

The fact is, Monty's own company, Monty Program Ab, stands to benefit the most from bad stewardship of MySQL by Oracle.

If Oracle slows and closes up development, rejects community contributions and creates a commercial version of MySQL, then Monty Program's MariaDB fork will become very popular, very quickly.

This would translate into income for Monty Program Ab as customers come to his company for the additions, features and bug fixes that they need to secure their own production.

What Monty is concerned about is the commercial vendors of MySQL (one of which Monty Program is not).

These vendors either:

OEM MySQL and integrate it into a commercial software or hardware product, or
they produce a closed source (or dual-license) storage engine, which is sold with a commercial version of MySQL.

Oracle could kill both businesses, and this is Monty's main concern. As Monty explained in a phone call this morning: he sees the existence of commercial/dual-license vendors of MySQL as very important to the long-term survival of "his baby".

Of course Oracle cannot prevent 3rd parties from continuing to offer consulting, support and training for MySQL. But close sourcing and vigorous enforcement of trademarks can make things very difficult for such companies.

Unfortunately, Oracle's latest concessions may not be enough to satisfy investors in MySQL-based technology either, because there is no guarantee of what happens after 5 years.

Wednesday, November 11, 2009

The EU's real problem: MySQL and Oracle do not compete!

I think that most people are missing the point, Oracle included. The main objection of the EU is not that Oracle is swallowing up a major competitor.

To understand this you have to read between the lines of the EU decision:

"The regulators see a major conflict of interest in the world's largest commercial database company owning its largest open-source competitor"

This should actually read: "the world's largest commercial database company owning the largest open-source database"

The database market is divided into 2 parts: the back-office and the online world.

And now you know what I am going to say ... Oracle has a near monopoly in back-office and MySQL has a monopoly in online applications.

So let's do a little maths:

Let's assume that back-office and online applications divide the database market into 2 equal parts, that Oracle owns 60% of the back-office, and that MySQL owns 90% of the online world.

This means that Oracle controls 30% (60% of 50%) of the entire database market today, but after the acquisition this number will be 75% (30% + 90% of 50%).
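
The calculation can be spelled out in a few lines of Python. The figures are the assumed ones from the paragraph above, not market data, and the function name market_share is just for this example.

```python
from fractions import Fraction

def market_share(segment_sizes, shares):
    """Overall share = sum over segments of (segment size * share within it)."""
    return sum(segment_sizes[s] * shares.get(s, 0) for s in segment_sizes)

# Assumption from the text: the market splits into 2 equal parts.
segments = {"back-office": Fraction(1, 2), "online": Fraction(1, 2)}

# Assumption from the text: 60% of back-office, and 90% of online after the deal.
oracle_before = market_share(segments, {"back-office": Fraction(60, 100)})
oracle_after = market_share(segments, {"back-office": Fraction(60, 100),
                                       "online": Fraction(90, 100)})

print(f"before: {float(oracle_before):.0%}, after: {float(oracle_after):.0%}")
```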

Something to think about.

Friday, September 18, 2009

The mysterious Storage Engine Independent Test Suite

Recently Mark observed that we now all need a storage engine independent test suite, Sun included! Well, as far as I know, there is such a thing at Sun, sort of. Apparently it has been used to test PBXT and other engines, but I've heard it is not in good enough shape to be released.

But my question is, why not release it anyway? We could turn it into an engine community project. I believe there are enough engine developers out there to get this moving forward.

The secret is to start small, and just get a few tests to run with all engines. Then additional tests can be added step by step. Engines need a way to specify that they want to skip a test entirely (e.g. transactional tests), and it should be easy to customize results for various engines.
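
To make this concrete, here is a minimal sketch in Python of how skipping and per-engine results could be described. Everything in it is hypothetical: the test names, the feature tags and the plan() function are invented for illustration and are not taken from the MySQL or Drizzle test suites.

```python
# Hypothetical test catalogue: each test lists the features it needs
# and the result expected from an engine with no special behaviour.
TESTS = {
    "basic_insert": {"needs": set(), "expected": "1 row"},
    "rollback": {"needs": {"transactions"}, "expected": "0 rows"},
}

# Hypothetical engine descriptions: supported features plus any
# engine-specific expected results that override the default.
ENGINES = {
    "pbxt": {"features": {"transactions"}, "overrides": {}},
    "myisam": {"features": set(),
               "overrides": {"basic_insert": "1 row (non-transactional)"}},
}

def plan(engine):
    """Return {test: expected result, or None if the engine skips the test}."""
    eng = ENGINES[engine]
    result = {}
    for name, test in TESTS.items():
        if not test["needs"] <= eng["features"]:
            result[name] = None       # engine lacks a feature: skip entirely
        else:
            result[name] = eng["overrides"].get(name, test["expected"])
    return result

if __name__ == "__main__":
    for engine in ENGINES:
        print(engine, plan(engine))
```

Starting small then just means starting with a short TESTS table that every engine can run, and growing it step by step.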

An example of a simple and elegant solution can be found in Drizzle. As Monty Taylor mentioned in a comment to Mark's blog: "We have some patches to test-run in Drizzle to allow running the whole test suite with a specified storage engine".

I think it has been long enough. This could be a good opportunity to start a Sun/Community project, something like Drizzle. In other words, get something out there, even if it is incomplete, and let the community also take a large part of the responsibility.

Friday, September 11, 2009

PBXT 1.0.09 RC3 implements XA and online backup

I have just released PBXT 1.0.09 RC3. Besides bug fixes (details in the release notes), this version includes 2 Beta features:

XA/2-Phase Commit support
Native online backup Driver

XA support has been around in MySQL for quite a while, and we all know of its usefulness, for example when sharding. So I was surprised to find a bug in the XA recovery: Bug #47134. Contrary to what is reported, the crash can also occur when using XA with just the default engines installed, so watch out for that one (the good news: the bug fix is simple).

Online backup is really cool! I have heard that it may soon be released in a coming version of 5.4, so let's hope that this is true.

In a little test, I did a backup of a 10GB database in 49.26 seconds! Admittedly this was on a system with 4 15K drives in a RAID 0 configuration. But that is still fantastic, considering the tables are not even locked during this time!

The database itself took 19 min. 56 sec. to generate. A complete restore took only 14 min. 29 sec.

But, it gets even better....

I have been working on PBXT 1.1, where I have done a number of things to improve the I/O performance of the engine.

In the same test as above, run with PBXT 1.1, the time to generate the database was 9 min. 35 sec., and the time to restore was 6 min 18 sec! (Time to generate the backup was identical.)

PBXT 1.1 is available directly from Launchpad here: lp:~pbxt-core/pbxt/staging, if you are interested in trying it out. 1.1 also has full support for memory based tables.

The new release candidate (PBXT 1.0.09) can be downloaded from primebase.org/download. It is also available from Launchpad as the rc3 series: lp:pbxt/rc3.

Please report bugs here.

Any feedback is welcome! You can use Launchpad questions or the PBXT mailing list for this purpose.

Friday, August 21, 2009

PBXT at the OpenSQL Camp hosted by the FrOSCon 2009

Vladimir will be giving a presentation on PBXT at the FrOSCon 2009 in St. Augustin, near Bonn in Germany tomorrow:

PBXT: Technology trends that affect your Database
Room: C120/OpenSQLCamp
Time: 22 Aug 2009, 18:15 - 18:45

The talk is packed with interesting information about how the design of PBXT handles the major technological challenges of the future, including multiple cores, lots of RAM and solid state drives.

If you are in the area, check it out! :)

Thursday, August 20, 2009

What if MySQL dropped the Dual License?

In his blog "Does the GPL Matter? In a Word, Yes", Stephen O'Grady makes the significant point that the dual-licensing model has a major drawback:

Sun/MySQL can only include patches and contributions if they fully own the copyright to those changes.

This gives forks like Drizzle, OurDelta, Percona and MariaDB a major advantage over the Sun version: they can include the best patches from all over. And it is clear that the momentum is building.

In a follow-up blog, Stephen asks: "what would the implications be if MySQL, of all projects, were forced to abandon the dual-licensing model it had long championed?"

Thinking about this, there is something that really bothers me:

Let's assume MySQL took on patches without ownership of the copyright, and thereby lost the ability to provide a commercial license to OEM customers.

According to the GPL this would mean that nobody could ever ship a commercial product with MySQL built-in!

To prevent this possibility from being lost to the world forever, surely MySQL would have to abandon the GPL, and maybe change to LGPL or BSD!

Tuesday, August 11, 2009

Jeremy's article on PBXT in Linux Magazine

Jeremy Zawodny of Craigslist wrote a great article on PBXT for Linux Magazine:

PBXT: Your Next MySQL Storage Engine?

Check it out...

Thanks Jeremy :)

Tuesday, June 30, 2009

PBXT 1.0.08 RC2 Released!

The second Release Candidate of PBXT, version 1.0.08, has just been released.

As I have mentioned in my previous blogs (here and here), I did a lot to improve performance for this version.

At the same time I am confident that this release is stable as we now have a large number of tests, including functionality, concurrency and crash recovery. But even more important, the number of users of PBXT has increased significantly since the last RC release, and that is the best test for an engine.

So there has never been a better time to try out PBXT! :)

You can download the source code, and selected binaries from here: primebase.org/download.

Vladimir and I have made a lot of changes; for details, check out the release notes.

Bugs can be reported on Launchpad, here.

There is also a new PBXT mailing list, so if you have any questions this is the best place for them.

PBXT is a high-performance, MVCC-based, transactional storage engine for MySQL. The project is open source (GPL) and hosted on Launchpad. PBXT supports referential integrity, row-level locking and is fully ACID compliant.

For more information please go to the PBXT home at: primebase.org.

Wednesday, May 13, 2009

At last we have a MySQL Foundation, it's called The Open Database Alliance

Just over a year ago we registered the domain name mysqlfoundation.org in the hope that Sun/MySQL would actually create such an entity.

My idea was to move the development of the MySQL Community server to the Foundation and make the development fully community orientated. The Foundation would have its own development goals and release schedule. Sun could then pull patches from the Foundation's Community server into the Enterprise server once they had stabilized.

I pitched the idea to several people at Sun back then and over the last year. However, for some reason, the foundation concept just proved impossible to push through.

I believe this would have been a great opportunity for Sun to take the leadership in the community, as the foundation idea dates back to before things really started splitting up. But Sun's loss is now that of Oracle, who perhaps doesn't care anyway.

What is really most important is that we in the community now have an entity that is going to tie our side of things together: The Open Database Alliance. For the community it is critical that things do not split up any further and that instead our efforts are bundled. I believe the Alliance can do this for us.

So where does that leave Oracle?

Well, as I see it, we now have a new, more relevant, community/enterprise split: the Oracle MySQL Enterprise server and the MariaDB Community server.

And, I guess I have to stand up and say, for us (primebase.org) this difference is real and significant.

PBXT is already part of most community builds including MariaDB, OurDelta and XAMPP. But it is not part of the official MySQL 5.1 Community Server.

Please note, this has nothing to do with my many great friends at MySQL! They help us in lots of other ways and I am very thankful for this :)

But even with the "community" label, any download offered by Sun (now Oracle of course - no change there) is about business! That is very difficult to change, and I accept that.

But the community does not need to change anything. It is what it is.

Friday, April 17, 2009

PrimeBase Engines at the MySQL Conference 2009

Barry, Vladimir and I (the entire PrimeBase dev team!) will be presenting next week at the MySQL User Conference and Expo.

We've got lots of cool stuff going on. Barry will tell you how PBMS can store your BLOBs in the clouds, Vladimir will be explaining what makes PBXT so fast, and I will be talking about the past, the present and the future...

Even if that all doesn't interest you, be sure to just drop by to say hi. We're friendly, really! :)

The PBXT Storage Engine: Meeting Future Challenges
Paul McCullagh
3:05pm - 3:50pm Tuesday, 04/21/2009
Ballroom B

BLOB Streaming: Efficient Reliable BLOB Handling
for all Storage Engines
Barry Leslie
2:50pm - 3:35pm Thursday, 04/23/2009
Ballroom B

Making PBXT Faster
Vladimir Kolesnikov
11:15am - 11:55am Thursday, 04/23/2009
Percona Performance Conference - Rooms 203 & 204

Wednesday, March 25, 2009

Solving the PBXT DBT2 Scaling Problem

One little bit of wisdom I would like to pass on:

If a program runs fast with 20 threads, that does not mean it will run fast with 50. And if it runs fast with 50, it does not mean that it will run fast with 100, and if it runs fast with 100 ... don't bet on it running fast with 200 :)

In my last blog I discussed some improvements to the performance of PBXT running the DBT2 benchmark. Despite the overall significant increase in performance I noted a drop off at 32 threads that indicated a scaling problem. For the last couple of weeks I have been working on this problem and I have managed to fix it:

As before, this test was done using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database. The test is memory bound and does not test the effects of checkpointing.

PBXT Baseline is the code revision indicated as PBXT 1.0.08 in my last blog. PBXT 1.0.07 is the current PBXT GA release version. PBXT 1.0.08 is the latest revision of the PBXT trunk. The baseline graph shows the extent of the scaling problem of the last version.

The latest version is over 20 times faster than PBXT 1.0.07 and 140% faster than the previous version. But most important is the fact that performance remains almost constant as the number of threads increases.

My thanks also to InnoDB which, looking at it positively, offers an excellent measure of how well you are doing! :) It looks like PBXT now actually scales better than InnoDB for this type of test.

So what has changed?

Basically I have made 2 changes, one major and one smaller but significant change. The first change, which got PBXT running faster with 50 threads has to do with conflict handling.

As I mentioned before, DBT2 causes a lot of row-level conflicts. This is especially the case as the number of threads increases. In fact, at any given time during the test with 100 threads (performance results above), 80 of the threads are waiting for row locks. (Of the remaining 20, 4 are waiting for network I/O, and the rest are doing the actual work!)

The result is that if the handling of these conflicts is not optimal, the engine loses a lot of time, which you can clearly see from both the baseline and 1.0.07 results reported above.

To fix this I completely re-wrote the row-level conflict handling. Code paths are now much shorter for row-lock/update detection and handling and threads are now notified directly when they can continue.
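
The general idea of direct notification can be sketched as follows. This is a toy Python version, not the PBXT implementation: the RowLocks class and demo() are hypothetical, and a real engine would also have to deal with transactions, deadlocks and timeouts. Each waiter queues its own event, and the thread releasing the row hands ownership straight to the first waiter instead of letting all waiters wake up and retry.

```python
import threading
from collections import deque

class RowLocks:
    """Toy row-lock table: the releasing thread wakes the next waiter directly."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._owner = {}      # row id -> id of the owning thread
        self._waiters = {}    # row id -> deque of (thread id, Event)

    def lock(self, row):
        me = threading.get_ident()
        with self._mutex:
            if row not in self._owner:
                self._owner[row] = me
                return
            wake = threading.Event()
            self._waiters.setdefault(row, deque()).append((me, wake))
        wake.wait()           # we already own the row when this returns

    def unlock(self, row):
        with self._mutex:
            queue = self._waiters.get(row)
            if queue:
                next_owner, wake = queue.popleft()
                self._owner[row] = next_owner   # hand over, no re-contention
                wake.set()
            else:
                del self._owner[row]

def demo(threads=8, updates=200):
    """All threads update the same row; return the final counter value."""
    locks, counter = RowLocks(), [0]

    def worker():
        for _ in range(updates):
            locks.lock("row-1")
            counter[0] += 1
            locks.unlock("row-1")

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```

Because ownership is transferred inside unlock(), a woken thread never has to go back and compete for the row again, which keeps the code path after a conflict short.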

The other change I made involved the opening and closing of tables when the MySQL open table cache is too small. This is something that really killed performance starting at about 100 threads. PBXT was doing quite a bit of unnecessary stuff on open and close table, which was fairly easy to move out.

So now that the scaling is good up to 200 threads, should I assume that performance will also be good for 400 threads? Of course it will be! Well, at least until I test it... :)

Friday, March 06, 2009

Improving PBXT DBT2 Performance

DBT2, with over 40% conflicts, is a very challenging benchmark, especially for an MVCC-based engine. And, as a result, it is not a test that an engine is automatically good at. InnoDB has been extensively optimized for DBT2, and it shows.

For the last few weeks I have had the opportunity to focus on PBXT DBT2 performance for the first time. I started with a memory bound DBT2 test and the current state of this work is illustrated below.

These results were achieved using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database.

The dip off at 32 threads is left as an exercise for the reader :) Patches will be accepted!

So what were the major changes that led to this improvement?

Don't Wait Too Long!

When I began the optimizations, PBXT was only using 120% CPU (i.e. just over 1 core), while InnoDB uses 440% (i.e. about 4.5 cores). I noticed that the pauses I was using in some situations were way too long (even at 1/1000 of a second). For example, shortly before commit, PBXT waits if some other transaction may soon commit in order to improve group commit efficiency.

If the pause is too long the program waits around even after the condition to continue has been fulfilled. So I changed these pauses to a yield. However, I have noticed that even a yield causes some threads to wait too long, so I am considering putting in a small amount of spinning.
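
A wait loop that spins briefly and then yields might look like the sketch below. It is written in Python purely for illustration (the wait_for helper and its parameters are invented here), and time.sleep(0) is used as the nearest portable stand-in for a real thread yield.

```python
import threading
import time

def wait_for(condition, spins=100, timeout=5.0):
    """Wait until condition() is true: spin briefly, then yield the CPU."""
    deadline = time.monotonic() + timeout
    for _ in range(spins):            # short busy-wait for the common fast case
        if condition():
            return True
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(0)                 # give up the time slice, no fixed pause
    return condition()

if __name__ == "__main__":
    committed = threading.Event()
    # Another "transaction" commits shortly after we start waiting.
    threading.Timer(0.05, committed.set).start()
    print(wait_for(committed.is_set))
```

The point of the design is that the waiter re-checks the condition as soon as it is scheduled again, instead of sleeping through a fixed pause after the condition has already been fulfilled.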

Anyway, after that change, PBXT performance was still only 50% of InnoDB.

Too Many memcpy's

The next problem was the number of memcpy's in the standard index operations: search, insert and delete. PBXT was using the index cache like a disk with a read and write call interface. Such functions involve a buffer and a memcpy on every access. On average 8K was transferred per call, which is significant.

To get rid of the memcpy's I had to change the way the indexing system and the index cache work together. This was a major change. Instead of an index operation using a buffer and a copy it now "pins" the index cache page, and accesses the index cache page directly.
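
The difference between the two access styles can be illustrated with a small Python sketch. The IndexCache class and its methods are hypothetical and much simpler than a real index cache: read_copy() stands for the old read-call interface that copies the page, while pin() returns a view onto the cached page itself.

```python
import threading

PAGE_SIZE = 8192  # roughly the 8K per call mentioned above

class IndexCachePage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.pins = 0
        self.lock = threading.Lock()

class IndexCache:
    """Toy cache: pages can be pinned in place rather than copied out."""

    def __init__(self):
        self.pages = {}

    def _page(self, page_id):
        return self.pages.setdefault(page_id, IndexCachePage())

    def read_copy(self, page_id):
        # Old style: copy the whole page into a buffer for the caller.
        return bytes(self._page(page_id).data)

    def pin(self, page_id):
        # New style: return a view onto the cached page; nothing is copied.
        page = self._page(page_id)
        with page.lock:
            page.pins += 1            # a pinned page must not be evicted
        return page, memoryview(page.data)

    def unpin(self, page):
        with page.lock:
            page.pins -= 1

    def evictable(self, page_id):
        return self.pages[page_id].pins == 0
```

A caller pins the page, works on the view directly, and unpins it when done. While the pin count is above zero, the cache has to leave the page where it is.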

Unfortunately I didn't see the improvement that I expected after this change because I ran straight into the next problem...

Updating an Index in Parallel

Threads were now waiting on an exclusive lock required to perform modifications to an index. I was using an exclusive lock because it is simply the easiest way to keep indices consistent during update.

To fix this I found a way to modify an index in parallel with other readers and writers in over 90% of index update cases. This was one of the most complex changes, although the idea behind the solution is relatively straightforward.

But, I have decided I'm not going to say how I did it here ... for that you will have to attend my (plug) talk at the User's Conference! He he :)

The Cost of Index Scans

At this stage PBXT was running 3 of the 5 procedures used by DBT2 (slightly) faster than InnoDB. The remaining problem involved the index scan, something that InnoDB is pretty good at.

In order to scan an index, PBXT was making a copy of each index page, and then stepping through the items. A copy of the page was required because after each step the engine returns a row to MySQL. So, all-in-all, the time taken to scan a page is too long to pin the index cache page.

To avoid making a copy of each page scanned I implemented index cache page "handles". An index scanner now gets a handle to the index page it is scanning. The handles are "copy-on-write", so changes to the index page are transparent to the scanner.
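
A copy-on-write handle can be sketched like this in Python. The PageSlot and IndexPage classes are invented for the example, and the sketch is single-threaded, so it leaves out the locking a real index cache would need. The page is copied only when a writer finds that a scanner still holds a handle to it.

```python
class IndexPage:
    def __init__(self, items):
        self.items = list(items)
        self.handles = 0              # scanners currently reading this version

class PageSlot:
    """Cache slot whose page is copied only if a writer meets an open handle."""

    def __init__(self, items):
        self.page = IndexPage(items)

    def get_handle(self):
        self.page.handles += 1
        return self.page              # the scanner keeps this version

    def release_handle(self, page):
        page.handles -= 1

    def insert(self, item):
        if self.page.handles > 0:
            # Copy-on-write: scanners keep the old version,
            # the slot continues with a fresh copy.
            self.page = IndexPage(self.page.items)
        self.page.items.append(item)
        self.page.items.sort()

if __name__ == "__main__":
    slot = PageSlot([10, 30])
    handle = slot.get_handle()        # a scan starts on the page
    slot.insert(20)                   # a writer modifies the index meanwhile
    print(handle.items, slot.page.items)
    slot.release_handle(handle)
```

If no scan is in progress, insert() modifies the page in place, so the cost of a copy is only paid when a scanner and a writer actually meet on the same page.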

Work in Progress...

So DBT2 performance of PBXT is now more or less on par with InnoDB for memory bound tests. There are some scaling issues which I will look at next, and I have not yet investigated the effects of a disk bound load and checkpointing.

I also had a quick look at the mysqlslap performance following the optimizations. Some of it is great and some of it is "interesting". But I think I'll leave that for another blog...