PrimeBase XT

Friday, September 11, 2009

PBXT 1.0.09 RC3 implements XA and online backup

I have just released PBXT 1.0.09 RC3. Besides bug fixes (details in the release notes), this version includes 2 Beta features:

XA/2-Phase Commit support
Native online backup Driver

XA support has been around MySQL for quite a while, and we all know of it usefulness, for example when sharding. So I was surprised to find a bug in the XA recovery: Bug #47134. Contrary to what is reported, the crash can also occur when using XA with just the default engines installed, so watch out for that one (the good news: the bug fix is simple).

Online backup is really cool! I have heard that it may soon be released in a coming version of 5.4, so lets hope that this is true.

In a little test, I did a backup of a 10GB database in 49.26 seconds! Admitedly this was on a system with 4 15K drives in a RAID 0 configuration. But that is still a fantastic, considering the tables are not even locked during this time!

The database itself took 19 min. 56 sec. to generate. A complete restore took only 14 min. 29 sec.

But, it gets even better....

I have been working on PBXT 1.1, where I have done a number of things to improve the I/O performance of the engine.

In the same test as above, run with PBXT 1.1, the time to generate the database was 9 min. 35 sec., and the time to restore was 6 min 18 sec! (Time to generate the backup was identical.)

PBXT 1.1 is available directly from Launchpad here: lp:~pbxt-core/pbxt/staging, if you are interested in trying it out. 1.1 also has full support for memory based tables.

The new release candidate (PBXT 1.0.09) can be downloaded from primebase.org/download. It is also available from Lauchpad as the rc3 series: lp:pbxt/rc3.

Please report bugs here.

Any feedback is welcome! You can use Launchpad questions or the PBXT mailing list for this purpose.

Friday, August 21, 2009

PBXT at the OpenSQL Camp hosted by the FrOSCon 2009

Vladimir will be giving a presentation on PBXT at the FrOSCon 2009 in St. Augustin, near Bonn in Germany tomorrow:

PBXT: Technology trends that affect your Database
Room: C120/OpenSQLCamp
Time: 22 Aug 2009, 18:15 - 18:45

The talks is packed with interesting information about how the design of PBXT handles the major technological challenges of the future, including multiple cores, lots of RAM and solid state drives.

If you are in the area, check it out! :)

Thursday, August 20, 2009

What if MySQL dropped the Dual License?

In his blog Does the GPL Matter? In a Word, Yes, Stephen O'Grady makes the significant point that the dual-licensing model has a major drawback:

Sun/MySQL can only include patches and contributions if they fully own the copyright to those changes.

This gives forks like Drizzle, OurDelta, Percona and MariaDB a major advantage over the Sun version: they can include the best patches from all over. And it is clear that the momentum is building.

In a follow-up blog, Stephen asks: "what would the implications be if MySQL, of all projects, were forced to abandon the dual-licensing model it had long championed?"

Thinking about this, there is something that really bothers me:

Let's assume MySQL took on patches without ownership of the copyright, and thereby lost the ability to provide a commercial license to OEM customers.

According to the GPL this would mean that nobody could ever ship a commercial product with MySQL built-in!

To avoid this possibility from being lost to the world forever, surely MySQL would have to abandon the GPL, and maybe change to LGPL or BSD!

Tuesday, August 11, 2009

Jeremy's article on PBXT in Linux Magazine

Jeremy Zawodny of Craigslist wrote a great article on PBXT for Linux Magazine:

PBXT: Your Next MySQL Storage Engine?

Check it out...

Thanks Jeremy :)

Tuesday, June 30, 2009

PBXT 1.0.08 RC2 Released!

The second Release Candidate of PBXT, version 1.0.08, has just been released.

As I have mentioned in my previous blogs (here and here), I did a lot to improve performance for this version.

At the same time I am confident that this release is stable as we now have a large number of tests, including functionality, concurrency and crash recovery. But even more important, the number of users of PBXT has increased significantly since the last RC release, and that is the best test for an engine.

So there has never been a better time to try out PBXT! :)

You can download the source code, and selected binaries from here: primebase.org/download.

Vladimir and I have made a lot of changes, for details checkout the release notes.

Bugs can be reported on Launchpad, here.

There is also a new PBXT mailing lis t, so if you have any questions this is the best place for them.

PBXT is a high-performance, MVCC-based, transactional storage engine for MySQL. The project is open source (GPL) and hosted on Launchpad. PBXT supports referential integrity, row-level locking and is fully ACID compliant.

For more information please go to the PBXT home at: primebase.org.

Wednesday, May 13, 2009

At last we have a MySQL Foundation, its called The Open Database Alliance

Just over a year ago we registered the domain name mysqlfoundation.org in the hopes that Sun/MySQL will actually create such an entity.

My idea was to move the development of the MySQL Community server to the Foundation and make the development fully community orientated. The Foundation would have its own development goals and release schedule. Sun could then pull patches from the Foundation's Community server into the Enterprise server once they had stabilized.

I pitched the idea to several people at Sun back then and over the last year, however, for some reason, the foundation concept just proved impossible to push through.

I believe this would have been a great opportunity for Sun to take the leadership in the community, as the foundation idea dates back to before things really started splitting up. But Sun's loss is now that of Oracle, who perhaps doesn't care anyway.

What is really most important is that we in the community now have an entity that is going to tie our side of things together: The Open Database Alliance. For the community it is critical that things do not split up any further and that instead our efforts are bundled. I believe the Alliance can do this for us.

So where does that leave Oracle?

Well, as I see it, we now have a new, more relevant, community/enterprise split: the Oracle MySQL Enterprise server and the MariaDB Community server.

And, I guess I have to stand up and say, for us (primebase.org) this difference is real and significant.

PBXT is already part of most community builds including MariaDB, OurDelta and XAMPP. But is is not part of the official MySQL 5.1 Community Server.

Please note, this has nothing to do with my many great friends at MySQL! They help us in lots of other ways and I am very thankful for this :)

But even with the "community" label, any download offered by Sun (now Oracle of course - no change there) is about business! That is very difficult to change, and I accept that.

But the community does not need to change anything. It is, what it is.

Friday, April 17, 2009

PrimeBase Engines at the MySQL Conference 2009

Barry, Vladimir and I (the entire PrimeBase dev team!) will be presenting next week at the MySQL User Conference and Expo.

We've got lots of cool stuff going on. Barry will tell you how PBMS can store your BLOBs in the clouds, Vladimir will be explaining what makes PBXT so fast, and I will be talking about the past, the present and the future...

Even if that all doesn't interest you, be sure to just drop by to say hi. We're friendly, really! :)

The PBXT Storage Engine: Meeting Future Challenges
Paul McCullagh
3:05pm - 3:50pm Tuesday, 04/21/2009
Ballroom B

BLOB Streaming: Efficient Reliable BLOB Handling
for all Storage Engines
Barry Leslie
2:50pm - 3:35pm Thursday, 04/23/2009
Ballroom B

Making PBXT Faster
Vladimir Kolesnikov
11:15am - 11:55am Thursday, 04/23/2009
Percona Performance Conference - Rooms 203 & 204

Wednesday, March 25, 2009

Solving the PBXT DBT2 Scaling Problem

One little bit of wisdom I would like to pass on:

If a program runs fast with 20 threads, that does not mean it will run fast with 50. And if it runs fast with 50, it does not mean that it will run fast with 100, and if it runs fast with 100 ... don't bet on it running fast with 200 :)

In my last blog I discussed some improvement to the performance of PBXT running the DBT2 benchmark. Despite the overall significant increase in performance I noted a drop off at 32 threads that indicated a scaling problem. For the last couple of weeks I have been working on this problem and I have managed to fix it:

As before, this test was done using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database. The test is memory bound and does not test the affects of checkpointing.

PBXT Baseline is the code revision indicated as PBXT 1.0.08 in my last blog. PBXT 1.0.07 is the current PBXT GA release version. PBXT 1.0.08 is the latest revision of the PBXT trunk. The baseline graph shows the extent of the scaling problem of the last version.

The latest version is over 20 times faster than PBXT 1.0.07 and 140% faster than the previous version. But most important is the fact that performance remains almost constant as the number of threads increases.

My thanks also to InnoDB which, looking at it positively, offers an excellent measure of how well you are doing! :) It looks like PBXT now actually scales better than InnoDB for this type of test.

So what has changed?

Basically I have made 2 changes, one major and one smaller but significant change. The first change, which got PBXT running faster with 50 threads has to do with conflict handling.

As I mentioned before DBT2 causes a lot of row level conflicts. This is especially the case as the number threads increase. In fact, at any given time during the test with 100 threads (performance results above), 80 of the threads are waiting for row locks. (Of the remaining 20, 4 are waiting for network I/O, and the rest are doing the actual work!)

The result is, if the handling of these conflicts is not optimal the engine looses a lot of time. Which you can clearly see from both the baseline and 1.0.07 results reported above.

To fix this I completely re-wrote the row-level conflict handling. Code paths are now much shorter for row-lock/update detection and handling and threads are now notified directly when they can continue.

The other change I made involved the opening and closing of tables when the MySQL open table cache is too small. This is something that really killed performance starting at about 100 threads. PBXT was doing quite a bit of unnecessary stuff on open and close table, which was fairly easy to move out.

So now that the scaling is good up to 200 threads, should I assume that performance will also be good for 400 threads? Of course it is! Well, at least until I test it... :)

Friday, March 06, 2009

Improving PBXT DBT2 Performance

DBT2, with over 40% conflicts, is an very challenging benchmark, especially for an MVCC based engine. And, as a result, it is not a test that an engine is automatically good at. InnoDB has been extensively optimized for DBT2, and it shows.

For the last few weeks I have had the opportunity to focus on PBXT DBT2 performance for the first time. I started with a memory bound DBT2 test and the current state of this work is illustrated below.

These results were achieved using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database.

The dip off at 32 threads is left as an exercise for the reader :) Patches will be excepted!

So what were the major changes that lead to this improvement?

Don't Wait Too Long!

When I began the optimizations, PBXT was only using 120% CPU (i.e. just over 1 core), while InnoDB uses 440% (i.e. about 4.5 cores). I noticed that the pauses I was using in some situations were way too long (even at 1/1000 of a second). For example, shortly before commit, PBXT waits if some other transaction may soon commit in order to improve group commit efficiency.

If the pause is too long the program waits around even after the condition to continue has been fulfilled. So I changed these pauses to a yield. However, I have noticed that even a yield causes some threads to wait too long, so I am considering putting in a small amount of spinning.

Anyway, after that change, PBXT performance was still only 50% of InnoDB.

Too Many memcpy's

The next problem was the number of memcpy's in the standard index operations: search, insert and delete. PBXT was using the index cache like a disk with a read and write call interface. Such functions involve a buffer and a memcpy on every access. On average 8K was transferred per call, which is significant.

To get rid of the memcpy's I had to change the way the indexing system and the index cache work together. This was a major change. Instead of an index operation using a buffer and a copy it now "pins" the index cache page, and accesses the index cache page directly.

Unfortunately I didn't see the improvement that I expected after this change because I ran straight into the next problem...

Updating an Index in Parallel

Threads were now waiting on an exclusive lock required to perform modification to an index. I was using an exclusive lock because it is simply the easiest way keep indices consistent during update.

To fix this I found a way to do modify an index in parallel with other readers and writers in over 90% of index update cases. This was one of the most complex changes. Although, the idea behind the solution is relatively straight forward.

But, I have decided I'm not going to say how I did it here ... for that you will have to attend my (plug) talk at the User's Conference! He he :)

The Cost of Index Scans

At this stage PBXT was running 3 of the 5 procedures used by DBT2 (slightly) faster than InnoDB. The remaining problem involved the index scan, something that InnoDB is pretty good at.

In order to scan an index, PBXT was making a copy of each index page, and then stepping through the items. A copy of the page was required because after each step the engine returns a row to MySQL. So, all-in-all, the time taken to scan a page is too long to pin the index cache page.

To avoid making a copy of each page scanned I implemented index cache page "handles". An index scanner now gets a handle to the index page it is scanning. The handles are "copy-on-write", so changes to the index page are transparent to the scanner.

Work in Progress...

So DBT2 performance of PBXT is now more or less on par with InnoDB for memory bound tests. There are some scaling issues which I will look at next, and I have not yet investigated the affects of a disk bound load and checkpointing.

I also had a quick look at the mysqlslap performance following the optimizations. Some of it is great and some of it is "interesting". But I think I'll leave that for another blog...

Thursday, December 18, 2008

PBXT goes RC!

With all the booha about MySQL not being ready for GA, it makes me almost afraid to announce, ahem, ... and PBXT is, ehr, RC.

It has been just over a year now since I started developing the fully durable version of PBXT. Before that, PBXT was Beta. After that, it was Alpha again.

Now we have 2 solid Beta versions behind us, Vladimir and I have fixed all known bugs for this version, including quite a number of foreign key bugs. We have all 259 mysql-test-run tests that were adapted for PBXT (and a bunch of our own) running through without any errors on 4 platforms: Mac OS X, Linux 32-bit and 64-bit, and Windows. Our buildbot is giving us a green light, at last!

Besides this we have done crash tests, load tests and crash and load tests (I mean recovery)! And maybe most important, we have it ticking away in a very demanding OEM product called TeamDrive. And it is doing it 20% to 30% faster than the "most commonly used" transactional storage engine.

Then we also have PBXT 1.0.07 RC compiling and running with MySQL 5.1.30, MySQL 6.0.8 and Drizzle! And it compiles on Linux, Windows, Mac OS X, FreeBSD, netbsd, OpenSolaris and Solaris (last patch pending on this one), whew!

So what's next?

Well next stop is GA, and I would like to have it done before the MySQL conference. Heard that one before? Nah ;)

Seriously though, we are not planning to add anymore features to this version so there is only one way to stop us: by testing and reporting bugs! Right here: https://bugs.launchpad.net/pbxt

Would be much appreciated! :)

BTW, the version is available, as usual, from http://www.primebase.org/download, or get it straight from Launchpad.net:

bzr branch lp:pbxt/1.0.07-rc

Monday, December 15, 2008

xtstat: Tells you exactly what PBXT is doing!

I have created a new tool, called xtstat, for analyzing the performance of the PBXT storage engine.

The way it works is simple. PBXT now counts all kinds of things: transactions committed and rolled back, statements executed, records read and written, tables and indexes scanned, bytes read, written and flushed to various types of files: record, index, data logs, transaction logs, and so on.

A SELECT on the system table PBXT.STATISTICS (or INFORMATION_SCHEMA.PBXT_STATISTICS if PBXT was built inside the MySQL tree) returns the current totals of all these counters. xtstat does a SELECT every second on this table and prints the difference. In this way, you can see how much work PBXT is doing in each area.

There are currently 48 different statistics:

To ensure all this counting does not itself cost any performance, each thread counts for itself, so no locking is required. The SELECT on STATISTICS then sums over all running threads.

The default output of xtstat is 201 characters wide (281 characters are required to display all statistics), but using the --display option you can specify exactly which statistics you would like to look at.

Here is an example, of the default output with some of the middle columns removed:

To display large byte values, such as data read from the data log files (data-in column), xtstat uses K (Kilobytes), M (Megabytes) and G (Gigabytes) to ensure the values don't overflow the column space. Counts like the number of rows inserted (row-ins column) use, t (thousands), m (millions) and b (billions) to keep things lined up.

Using xtstat it is possible to ask questions like:

How efficiently is group commit working?

Notice here that the number of transaction commits in this example (xact-commt column), is larger than the number transaction log flushes (xlog-syncs).

Is their enough index cache?

Displaying the index stats shows that index cache misses (ind-miss column) are at about 30%, and that the server is reading 8MB per second from the index file. So increasing the index cache would result in better performance.

How much time is being spent flushing files?

In the example above you can see that xtstat displays the number of flushes (syncs) and the time spent in flushing in milliseconds (ms and msec columns).

When you build PBXT with make install, xtstat will be installed in the same directory as other MySQL tools such as mysqladmin, mysqldump, etc.

For more information on xtstat, just enter:

$ bin/xtstat --help

All in all, I think xtstat will be very useful in analyzing and tuning the performance of the engine.

Monday, November 10, 2008

PBXT 1.0.06 Beta Released

On friday we released the second Beta version of PBXT. PBXT is a transactional storage engine for MySQL 5.1 and 6.0. You can find out more about the engine at www.primebase.org.

PBXT is pluggable, so it can be built separately from the MySQL tree, and loaded dynamically at runtime using the LOAD PLUGIN statement.

You can download PBXT from here. A "quick guide" to building and installing the plugin is provided. I have also updated the documentation for this version.

There are no major new features in this release because we are working towards the RC version in December. But we wrote some release notes to prove we have been busy :)

There is now also a version of PBXT available for Drizzle. You will find the source code here: https://code.launchpad.net/~drizzle-pbxt/drizzle/pbxt.

By the way, did any of you see this report: Sun releases MySQL 5.1?! It's dated 7 Nov, but no sign of the new release on the MySQL website ... to bad.

Tuesday, September 30, 2008

PBXT moves to Launchpad

It's been a week or 2 and some of you may already have heard that PBXT has moved from Sourceforge to Lauchpad.net: https://launchpad.net/pbxt.

There are several very good reasons for the move, not the least of which is that MySQL has already moved to Launchpad, and Drizzle is there too. It simply makes sense for a storage engine like PBXT to be on the same platform.

And check this out, Stewart Smith has already ported PBXT to Drizzle. You will find the tree here: PBXT in Drizzle. I will be pulling Stewart's changes back into the PBXT tree. Creating new branches, merging branches and generally contributing to projects is easy on Launchpad.

Besides this, Launchpad has great tools for bug reporting, planning, Q&A and managing releases which we plan to use. In general, I find these tools are much better integrated than those on Sourceforge. For example it is easy to attach a branch which fixes a bug to the bug report.

Jay Pipes has written some excellent articles on getting started with Launchpad:

A Contributor's Guide to Launchpad.net - Part 1 - Getting Started

A Contributor's Guide to Launchpad.net - Part 2 - Code Management

As Jay explains, making a contribution is done in a few easy steps: create a branch, make your changes, push the branch back to Launchpad and request a merge into the project. That's it!

Give it a try, Vladimir and I will certainly be glad to have your help. :)

Monday, September 01, 2008

PBXT Beta Version Released!

I am pleased to announce that the Beta version of PBXT has just been released. You can download the source code of the storage engine from www.primebase.org/download. I have also updated the documentation for this version.

Configuring and building the engine is easier than ever now. To configure PBXT all you have to do is specify the path to the MySQL source code tree (after building MySQL), for example:

./configure --with-mysql=/home/foo/mysql/mysql-5.1.26-rc

The PBXT configure command will retrieve all required options from the MySQL build. For example whether to do a debug or optimized build and where to install the plugin are determined automatically, depending on how you configured MySQL.

This was a source of some mistakes when building the plugin, so I think it is really cool!

So what's next?

My goal is a RC (release candidate) version before the end of the year. Considering the stability of the new Beta, I think this is realistic.

The main work is testing, performance tuning, and fixing all those bugs you are about to find as you give PBXT a spin, right? :)

Besides, the size of the PBXT programming team will soon double! But more about that later...

Another thing I would love to do soon is a Drizzle version of PBXT. This has one significant advantage. If I discover a bottleneck in Drizzle, while performance tuning the engine, a patch for the problem in the server will probably be accepted fairly quickly.

But first I need to move PBXT to launchpad where all the music is playing these days!

Saturday, August 02, 2008

New PBXT Release 1.0.04 Improves Performance

Lets face it, when it comes to storage engines, performance is everything. But then again, so is stability and data integrity!

So as a developer of an engine, which should you concentrate on first: performance, stability or data integrity?

I know there are not many that have to deal with this stuff, but here is my advice anyway: go for performance first.

The reason is simple, significant performance tuning can have a serious affect on both stability and data integrity. And this means you need to repeat a lot of the debugging and testing you did before.

For example one of the optimizations I made for 1.0.04 required a number of changes to the index cache. One thing was to make the LRU (least recently used) list global, it was segment based before. During the change I copy-pasted a "lru" pointer instead of a "mru" pointer :(

The result was not a crash, but the engine lost cache pages! So I only noticed the problem when a test just ran to slowly. When I got it up in the debugger, I noticed that the engine was flushing the index constantly, and this was because it was running on only 4 cache pages! All-in-all that typo cost me a half a day of debugging.

Anyway, there is still more to be done in way of optimization, but so far I am happy with the results. Here is a comparison between 1.0.04 and the previous version of PBXT:

This test was done on a 2-core machine using sysbench-0.4.7 running various selects on a table with 1M rows.

As you can see performance of the 1.0.03 version breaks completely at 4 threads. However, although 1.0.04 performance is significantly better (10 times faster at 4 threads), it also degrades substantially.

So why is this?

Well that is the thing that prompted me to have a look at the performance of MySQL itself, which I reported here: Mutex contention and other bottlenecks in MySQL.

Suffice to say that at 16 threads, MySQL is hanging 43% of the time in a mutex in open_table(), and 45% of the time in a mutex in lock_table(). And the solution is ... on its way down ... Drizzle :)

As usual you can download the latest version from www.primebase.org/download or checkout using svn directly from SourceForge.net. Give it a spin...

Wednesday, July 23, 2008

Drizzle goes back to the Roots

Will Drizzle (Brian, Monty, Mark, MontyT, and others ...) become a cloudburst? I think so, and here is why...

First a simple question: what made diverse systems such as PHP, the HTTP protocol and memcached so popular?

Answer: ease of use, simplicity, speed and scalability.

And what made the original version of MySQL so popular? Well, exactly the same things.

Drizzle goes back to the roots, concentrating on what made the use of MySQL so widespread in the first place.

You could say, with 5.0, MySQL lost its way while introducing many complex features: stored procedures, triggers, views, query cache, etc.

So why did MySQL add these features? I see two reasons:

Popular opinion: It is a simple fact that analysts, journalists and, in particular, investors, refused to take MySQL seriously unless it "grew up", and gained all the features that a mature database should have. As a venture capital financed company heading for IPO its hard to ignore popular opinion.

To compete with Oracle: MySQL management believed (understandably) that MySQL would not make it unless it competed head-to-head with the industry leader. Characteristic of this was the effort to run SAP on MySQL.

And what came of all this?

Two years ago already MySQL gave up trying to compete directly with Oracle. Back then Martin Mickos stated MySQL's mission as follows: "to become the best online database in the world". And all efforts to run SAP, including MaxDB, have also been dropped since then.

But at least the critics have been silenced! And let's face it, Sun would never have paid $1B for a "toy" database. And still today, these heavy duty features are important for Sun's effort to sell MySQL into the corporate IT space.

However, this leaves a void to be filled by Drizzle: a lightweight database that scales for demanding Web 2.0 applications and Cloud computing. By concentrating on core functionality I believe Drizzle can really make progress in this space. Just one example: developers don't have to worry whether the query cache breaks scalability on each release.

So what can I learn from this?

So far I have resisted adding features such as savepoints and 2-phase commit to PBXT, but I was thinking I would have to do this stuff at some stage. Well, I am not so sure anymore... :)

Monday, July 14, 2008

Mutex contention and other bottlenecks in MySQL

Over the last few weeks I have been doing some work on improving the concurrency performance of PBXT. The last Alpha version (1.0.03) has quite a few problems in this area.

Most of the problems have been with r/w lock and mutex contention but, I soon discovered that MySQL has some serious problems of it's own. In fact, I had to remove some of the bottlenecks in MySQL in order to continue the optimization of PBXT.

The result for simple SELECT performance is shown in the graph below.

Here you can see that the gain is over 60% for 32 or more concurrent threads. Both results show the performance with the newly optimized version of PBXT. The test is running on a 2.16 MHz dual core processor, so I expect an even greater improvement on 4 or 8 cores. The query I ran for this test is of the form SELECT * FROM table WHERE ID = ?.

So what did it do to achieve this? Well first of all, as you will see below, I cheated in some cases. I commented out or avoided some locks that were a bit too complicated to solve properly right now. But in other cases, I used solutions that can actually be taken over, as-is, by MySQL. In particular, the use of spinlocks.

All-in-all though, my intension here is just to demonstration the potential for concurrency optimization in MySQL.

Optimization 1: LOCK_plugin in plugin_foreach_with_mask()

The LOCK_plugin mutex in plugin_foreach_with_mask() is the first bottleneck you hit in just about any query. In my tests with 32 threads it takes over 60% of the overall execution time.

In order to get further with my own optimizations, I commented out the pthread_mutex_lock() and pthread_mutex_lock() calls in this function, knowing that the lock is only really needed if plug-ins are installed or uninstalled. However, later I needed to find a better solution (see below).

Optimization 2: LOCK_grant in check_grant()

After removing the above bottleneck I hit a wall in check_grant(). pthread_rwlock_rdlock() was taking 50%, and pthread_rwlock_unlock() was taking 45.6% CPU time! Once again I commented out the calls rw_rdlock(&LOCK_grant) and rw_unlock(&LOCK_grant) in check_grant() to get around the problem.

In order to really eliminate this lock, MySQL needs to switch to a different type of read/write lock. 99.9% of the time only a read lock is required because a write lock is only required when loading and changing privileges.

For similar purposes, in PBXT, I have invented a special type of read/write lock that requires almost zero time to gain a read lock ... hmmmm ;)

Optimization 3: Mutex in LOCK and UNLOCK tables

I then discovered that 51.7% of the time was taken in pthread_mutex_lock() called from thr_lock() called from open_and_lock_tables().
And, 44.5% of the time was taken in thread_mutex_lock() called from thr_unlock() called from mysql_unlock_tables().

Now this is a tough nut. The locks used here are used all over the place, but I think they can be replaced with a spinlock to good effect (see below). I did not try this though. Instead I used LOCK TABLES in my test code, to avoid the calls to LOCK and UNLOCK tables for every query.

Optimization 4: LOCK_plugin in plugin_unlock_list()

Once again the LOCK_plugin is the bottleneck, this time taking 94.7% of the CPU time in plugin_unlock_list(). This time I did a bit of work. Instead of commenting it out, I replaced LOCK_plugin with a spinlock (I copied and adapted the PBXT engine implementation for the server).

This worked to remove the bottleneck because LOCK_plugin is normally only held for a very short time. However, when a plugin is installed or unstalled this lock will be a killer and some more work probably needs to be done here.

Optimization 5: pthread_setschedparam()

I was a bit shocked to find pthread_setschedparam() was now taking 17% of the CPU time required to execute the SELECT. This call can be easily avoided by first checking to see if the schedule parameter needs to be changed at all. For the moment, I commented the call out.

Of course, the more optimized the code is, the worse such a call becomes. After all other optimizations pthread_setschedparam() CPU time increases to 52.6%!

Optimization 6: LOCK_thread_count in dispatch_command()

The LOCK_thread_count mutex in dispatch_command() is next in line with 96.1% of the execution time.

Changing this to a spinlock completely removes the bottleneck.

Optimization 7: LOCK_alarm in thr_end_alarm() and thr_alarm()

my_net_read() calls my_real_read() which calls the functions thr_end_alarm() and thr_alarm(). At this point in the optimization these 2 calls required 99.5% of the CPU time between them. Replacing LOCK_alarm with a spinlock fixed this problem.

Conclusion:

Without too much effort it is possible to make a huge improvement to the threading performance of MySQL. The fact that such bottlenecks have not yet been investigated may be due the fact that MySQL currently has no performance analysis team.

Following the last optimization, execution time was divided as follows:

25.8% of the time in net_end_statement(), which hangs in net_flush()
32.8% of the time in my_net_read()
7.6% in ha_pbxt::index_read(), this is the time spent in the engine
32.2% in init_sql_alloc() which waits on the spinlock in malloc()

From this you can see that the optimization is almost optimal because the program is spending almost 60% of its time waiting on the network.

However, it is also clear where the next optimization would come from. Remove the call to malloc() in init_sql_alloc() which is called by open_tables(). This could be done by reusing the block of memory required by the thread, from call to call.

Ultimately, the goal of optimizing for scale like this is to bring the code to the point that it is either network, CPU, or disk bound. Only then will the end-user really see an improvement in performance as the hardware is upgraded.

I think I have shown that it is worth putting some effort into such optimizations. Even more so as multi-core systems become more and more commonplace.

Friday, June 13, 2008

PBXT compiles without change under MySQL 5.1.25!

OK, now I know that the GA version of 5.1 is rapidly approaching. PBXT compiles with the latest release of MySQL without any changes!

This has never been the case before. Just search the PBXT code for MYSQL_VERSION_ID, and you will find things like:

#if MYSQL_VERSION_ID < 50114
    XT_RETURN_VOID;
#else
    XT_RETURN(0);
#endif

and, even worse:

#if MYSQL_VERSION_ID < 60000
#if MYSQL_VERSION_ID >= 50124
#define USE_CONST_SAVE
#endif
#else
#if MYSQL_VERSION_ID >= 60005
#define USE_CONST_SAVE
#endif
#endif

The lack of changes that affect pluggable storage engines can only mean that the bug fixes required are diminishing in scope.

And I believe this is a far better gauge of whether GA is close than any other marketing orientated statements! :)

Wednesday, June 04, 2008

PBXT 1.0.03 Alpha has been released!

I have released PBXT 1.0.03 Alpha and it is available for download from http://www.primebase.org/download. I have also posted binary plugins for a few platforms.

If you are building from source I have added a Quick Guide: Building and Installing PBXT from Source, which I hope makes the task really simple. If not, I would appreciate any feedback!

With this version I have completed the implementation of full-durability, and other features that are scheduled for RC and ultimately for the first GA release.

Still to be done is the Windows port which I plan to do before the first Beta release.

Please send any comments, questions, bug reports, etc. directly to me: paul dot mccullagh at primebase dot org.

Thursday, May 01, 2008

PBXT & BLOB Streaming Conference Presentations & Videos

The slides of my presentations at the MySQL Conference & Expo 2008 are now available for download. Videos of the presentations have been uploaded to YouTube:

Inside the PrimeBase XT Storage Engine

Presentation: pbxt-uc-2008.pdf
Videos: Part 1/7, Part 2/7, Part 3/7, Part 4/7, Part 5/7, Part 6/7, Part 7/7

Introduction to the BLOB Streaming Project

Presentation: mybs-uc-2008.pdf
Videos: Part 1/5, Part 2/5, Part 3/5, Part 4/5, Part 5/5

With this link you will find all the videos at once. If you watch the movies, then it may help to look at the PDF presentation slides at the same time, because the video quality is "not ideal" :)