PrimeBase XT: 2008

Thursday, December 18, 2008

PBXT goes RC!

With all the booha about MySQL not being ready for GA, it makes me almost afraid to announce, ahem, ... and PBXT is, ehr, RC.

It has been just over a year now since I started developing the fully durable version of PBXT. Before that, PBXT was Beta. After that, it was Alpha again.

Now we have 2 solid Beta versions behind us, Vladimir and I have fixed all known bugs for this version, including quite a number of foreign key bugs. We have all 259 mysql-test-run tests that were adapted for PBXT (and a bunch of our own) running through without any errors on 4 platforms: Mac OS X, Linux 32-bit and 64-bit, and Windows. Our buildbot is giving us a green light, at last!

Besides this we have done crash tests, load tests and crash and load tests (I mean recovery)! And maybe most important, we have it ticking away in a very demanding OEM product called TeamDrive. And it is doing it 20% to 30% faster than the "most commonly used" transactional storage engine.

Then we also have PBXT 1.0.07 RC compiling and running with MySQL 5.1.30, MySQL 6.0.8 and Drizzle! And it compiles on Linux, Windows, Mac OS X, FreeBSD, netbsd, OpenSolaris and Solaris (last patch pending on this one), whew!

So what's next?

Well next stop is GA, and I would like to have it done before the MySQL conference. Heard that one before? Nah ;)

Seriously though, we are not planning to add anymore features to this version so there is only one way to stop us: by testing and reporting bugs! Right here: https://bugs.launchpad.net/pbxt

Would be much appreciated! :)

BTW, the version is available, as usual, from http://www.primebase.org/download, or get it straight from Launchpad.net:

bzr branch lp:pbxt/1.0.07-rc

Monday, December 15, 2008

xtstat: Tells you exactly what PBXT is doing!

I have created a new tool, called xtstat, for analyzing the performance of the PBXT storage engine.

The way it works is simple. PBXT now counts all kinds of things: transactions committed and rolled back, statements executed, records read and written, tables and indexes scanned, bytes read, written and flushed to various types of files: record, index, data logs, transaction logs, and so on.

A SELECT on the system table PBXT.STATISTICS (or INFORMATION_SCHEMA.PBXT_STATISTICS if PBXT was built inside the MySQL tree) returns the current totals of all these counters. xtstat does a SELECT every second on this table and prints the difference. In this way, you can see how much work PBXT is doing in each area.

There are currently 48 different statistics:

To ensure all this counting does not itself cost any performance, each thread counts for itself, so no locking is required. The SELECT on STATISTICS then sums over all running threads.

The default output of xtstat is 201 characters wide (281 characters are required to display all statistics), but using the --display option you can specify exactly which statistics you would like to look at.

Here is an example, of the default output with some of the middle columns removed:

To display large byte values, such as data read from the data log files (data-in column), xtstat uses K (Kilobytes), M (Megabytes) and G (Gigabytes) to ensure the values don't overflow the column space. Counts like the number of rows inserted (row-ins column) use, t (thousands), m (millions) and b (billions) to keep things lined up.

Using xtstat it is possible to ask questions like:

How efficiently is group commit working?

Notice here that the number of transaction commits in this example (xact-commt column), is larger than the number transaction log flushes (xlog-syncs).

Is their enough index cache?

Displaying the index stats shows that index cache misses (ind-miss column) are at about 30%, and that the server is reading 8MB per second from the index file. So increasing the index cache would result in better performance.

How much time is being spent flushing files?

In the example above you can see that xtstat displays the number of flushes (syncs) and the time spent in flushing in milliseconds (ms and msec columns).

When you build PBXT with make install, xtstat will be installed in the same directory as other MySQL tools such as mysqladmin, mysqldump, etc.

For more information on xtstat, just enter:

$ bin/xtstat --help

All in all, I think xtstat will be very useful in analyzing and tuning the performance of the engine.

Monday, November 10, 2008

PBXT 1.0.06 Beta Released

On friday we released the second Beta version of PBXT. PBXT is a transactional storage engine for MySQL 5.1 and 6.0. You can find out more about the engine at www.primebase.org.

PBXT is pluggable, so it can be built separately from the MySQL tree, and loaded dynamically at runtime using the LOAD PLUGIN statement.

You can download PBXT from here. A "quick guide" to building and installing the plugin is provided. I have also updated the documentation for this version.

There are no major new features in this release because we are working towards the RC version in December. But we wrote some release notes to prove we have been busy :)

There is now also a version of PBXT available for Drizzle. You will find the source code here: https://code.launchpad.net/~drizzle-pbxt/drizzle/pbxt.

By the way, did any of you see this report: Sun releases MySQL 5.1?! It's dated 7 Nov, but no sign of the new release on the MySQL website ... to bad.

Tuesday, September 30, 2008

PBXT moves to Launchpad

It's been a week or 2 and some of you may already have heard that PBXT has moved from Sourceforge to Lauchpad.net: https://launchpad.net/pbxt.

There are several very good reasons for the move, not the least of which is that MySQL has already moved to Launchpad, and Drizzle is there too. It simply makes sense for a storage engine like PBXT to be on the same platform.

And check this out, Stewart Smith has already ported PBXT to Drizzle. You will find the tree here: PBXT in Drizzle. I will be pulling Stewart's changes back into the PBXT tree. Creating new branches, merging branches and generally contributing to projects is easy on Launchpad.

Besides this, Launchpad has great tools for bug reporting, planning, Q&A and managing releases which we plan to use. In general, I find these tools are much better integrated than those on Sourceforge. For example it is easy to attach a branch which fixes a bug to the bug report.

Jay Pipes has written some excellent articles on getting started with Launchpad:

A Contributor's Guide to Launchpad.net - Part 1 - Getting Started

A Contributor's Guide to Launchpad.net - Part 2 - Code Management

As Jay explains, making a contribution is done in a few easy steps: create a branch, make your changes, push the branch back to Launchpad and request a merge into the project. That's it!

Give it a try, Vladimir and I will certainly be glad to have your help. :)

Monday, September 01, 2008

PBXT Beta Version Released!

I am pleased to announce that the Beta version of PBXT has just been released. You can download the source code of the storage engine from www.primebase.org/download. I have also updated the documentation for this version.

Configuring and building the engine is easier than ever now. To configure PBXT all you have to do is specify the path to the MySQL source code tree (after building MySQL), for example:

./configure --with-mysql=/home/foo/mysql/mysql-5.1.26-rc

The PBXT configure command will retrieve all required options from the MySQL build. For example whether to do a debug or optimized build and where to install the plugin are determined automatically, depending on how you configured MySQL.

This was a source of some mistakes when building the plugin, so I think it is really cool!

So what's next?

My goal is a RC (release candidate) version before the end of the year. Considering the stability of the new Beta, I think this is realistic.

The main work is testing, performance tuning, and fixing all those bugs you are about to find as you give PBXT a spin, right? :)

Besides, the size of the PBXT programming team will soon double! But more about that later...

Another thing I would love to do soon is a Drizzle version of PBXT. This has one significant advantage. If I discover a bottleneck in Drizzle, while performance tuning the engine, a patch for the problem in the server will probably be accepted fairly quickly.

But first I need to move PBXT to launchpad where all the music is playing these days!

Saturday, August 02, 2008

New PBXT Release 1.0.04 Improves Performance

Lets face it, when it comes to storage engines, performance is everything. But then again, so is stability and data integrity!

So as a developer of an engine, which should you concentrate on first: performance, stability or data integrity?

I know there are not many that have to deal with this stuff, but here is my advice anyway: go for performance first.

The reason is simple, significant performance tuning can have a serious affect on both stability and data integrity. And this means you need to repeat a lot of the debugging and testing you did before.

For example one of the optimizations I made for 1.0.04 required a number of changes to the index cache. One thing was to make the LRU (least recently used) list global, it was segment based before. During the change I copy-pasted a "lru" pointer instead of a "mru" pointer :(

The result was not a crash, but the engine lost cache pages! So I only noticed the problem when a test just ran to slowly. When I got it up in the debugger, I noticed that the engine was flushing the index constantly, and this was because it was running on only 4 cache pages! All-in-all that typo cost me a half a day of debugging.

Anyway, there is still more to be done in way of optimization, but so far I am happy with the results. Here is a comparison between 1.0.04 and the previous version of PBXT:

This test was done on a 2-core machine using sysbench-0.4.7 running various selects on a table with 1M rows.

As you can see performance of the 1.0.03 version breaks completely at 4 threads. However, although 1.0.04 performance is significantly better (10 times faster at 4 threads), it also degrades substantially.

So why is this?

Well that is the thing that prompted me to have a look at the performance of MySQL itself, which I reported here: Mutex contention and other bottlenecks in MySQL.

Suffice to say that at 16 threads, MySQL is hanging 43% of the time in a mutex in open_table(), and 45% of the time in a mutex in lock_table(). And the solution is ... on its way down ... Drizzle :)

As usual you can download the latest version from www.primebase.org/download or checkout using svn directly from SourceForge.net. Give it a spin...

Wednesday, July 23, 2008

Drizzle goes back to the Roots

Will Drizzle (Brian, Monty, Mark, MontyT, and others ...) become a cloudburst? I think so, and here is why...

First a simple question: what made diverse systems such as PHP, the HTTP protocol and memcached so popular?

Answer: ease of use, simplicity, speed and scalability.

And what made the original version of MySQL so popular? Well, exactly the same things.

Drizzle goes back to the roots, concentrating on what made the use of MySQL so widespread in the first place.

You could say, with 5.0, MySQL lost its way while introducing many complex features: stored procedures, triggers, views, query cache, etc.

So why did MySQL add these features? I see two reasons:

Popular opinion: It is a simple fact that analysts, journalists and, in particular, investors, refused to take MySQL seriously unless it "grew up", and gained all the features that a mature database should have. As a venture capital financed company heading for IPO its hard to ignore popular opinion.

To compete with Oracle: MySQL management believed (understandably) that MySQL would not make it unless it competed head-to-head with the industry leader. Characteristic of this was the effort to run SAP on MySQL.

And what came of all this?

Two years ago already MySQL gave up trying to compete directly with Oracle. Back then Martin Mickos stated MySQL's mission as follows: "to become the best online database in the world". And all efforts to run SAP, including MaxDB, have also been dropped since then.

But at least the critics have been silenced! And let's face it, Sun would never have paid $1B for a "toy" database. And still today, these heavy duty features are important for Sun's effort to sell MySQL into the corporate IT space.

However, this leaves a void to be filled by Drizzle: a lightweight database that scales for demanding Web 2.0 applications and Cloud computing. By concentrating on core functionality I believe Drizzle can really make progress in this space. Just one example: developers don't have to worry whether the query cache breaks scalability on each release.

So what can I learn from this?

So far I have resisted adding features such as savepoints and 2-phase commit to PBXT, but I was thinking I would have to do this stuff at some stage. Well, I am not so sure anymore... :)

Monday, July 14, 2008

Mutex contention and other bottlenecks in MySQL

Over the last few weeks I have been doing some work on improving the concurrency performance of PBXT. The last Alpha version (1.0.03) has quite a few problems in this area.

Most of the problems have been with r/w lock and mutex contention but, I soon discovered that MySQL has some serious problems of it's own. In fact, I had to remove some of the bottlenecks in MySQL in order to continue the optimization of PBXT.

The result for simple SELECT performance is shown in the graph below.

Here you can see that the gain is over 60% for 32 or more concurrent threads. Both results show the performance with the newly optimized version of PBXT. The test is running on a 2.16 MHz dual core processor, so I expect an even greater improvement on 4 or 8 cores. The query I ran for this test is of the form SELECT * FROM table WHERE ID = ?.

So what did it do to achieve this? Well first of all, as you will see below, I cheated in some cases. I commented out or avoided some locks that were a bit too complicated to solve properly right now. But in other cases, I used solutions that can actually be taken over, as-is, by MySQL. In particular, the use of spinlocks.

All-in-all though, my intension here is just to demonstration the potential for concurrency optimization in MySQL.

Optimization 1: LOCK_plugin in plugin_foreach_with_mask()

The LOCK_plugin mutex in plugin_foreach_with_mask() is the first bottleneck you hit in just about any query. In my tests with 32 threads it takes over 60% of the overall execution time.

In order to get further with my own optimizations, I commented out the pthread_mutex_lock() and pthread_mutex_lock() calls in this function, knowing that the lock is only really needed if plug-ins are installed or uninstalled. However, later I needed to find a better solution (see below).

Optimization 2: LOCK_grant in check_grant()

After removing the above bottleneck I hit a wall in check_grant(). pthread_rwlock_rdlock() was taking 50%, and pthread_rwlock_unlock() was taking 45.6% CPU time! Once again I commented out the calls rw_rdlock(&LOCK_grant) and rw_unlock(&LOCK_grant) in check_grant() to get around the problem.

In order to really eliminate this lock, MySQL needs to switch to a different type of read/write lock. 99.9% of the time only a read lock is required because a write lock is only required when loading and changing privileges.

For similar purposes, in PBXT, I have invented a special type of read/write lock that requires almost zero time to gain a read lock ... hmmmm ;)

Optimization 3: Mutex in LOCK and UNLOCK tables

I then discovered that 51.7% of the time was taken in pthread_mutex_lock() called from thr_lock() called from open_and_lock_tables().
And, 44.5% of the time was taken in thread_mutex_lock() called from thr_unlock() called from mysql_unlock_tables().

Now this is a tough nut. The locks used here are used all over the place, but I think they can be replaced with a spinlock to good effect (see below). I did not try this though. Instead I used LOCK TABLES in my test code, to avoid the calls to LOCK and UNLOCK tables for every query.

Optimization 4: LOCK_plugin in plugin_unlock_list()

Once again the LOCK_plugin is the bottleneck, this time taking 94.7% of the CPU time in plugin_unlock_list(). This time I did a bit of work. Instead of commenting it out, I replaced LOCK_plugin with a spinlock (I copied and adapted the PBXT engine implementation for the server).

This worked to remove the bottleneck because LOCK_plugin is normally only held for a very short time. However, when a plugin is installed or unstalled this lock will be a killer and some more work probably needs to be done here.

Optimization 5: pthread_setschedparam()

I was a bit shocked to find pthread_setschedparam() was now taking 17% of the CPU time required to execute the SELECT. This call can be easily avoided by first checking to see if the schedule parameter needs to be changed at all. For the moment, I commented the call out.

Of course, the more optimized the code is, the worse such a call becomes. After all other optimizations pthread_setschedparam() CPU time increases to 52.6%!

Optimization 6: LOCK_thread_count in dispatch_command()

The LOCK_thread_count mutex in dispatch_command() is next in line with 96.1% of the execution time.

Changing this to a spinlock completely removes the bottleneck.

Optimization 7: LOCK_alarm in thr_end_alarm() and thr_alarm()

my_net_read() calls my_real_read() which calls the functions thr_end_alarm() and thr_alarm(). At this point in the optimization these 2 calls required 99.5% of the CPU time between them. Replacing LOCK_alarm with a spinlock fixed this problem.

Conclusion:

Without too much effort it is possible to make a huge improvement to the threading performance of MySQL. The fact that such bottlenecks have not yet been investigated may be due the fact that MySQL currently has no performance analysis team.

Following the last optimization, execution time was divided as follows:

25.8% of the time in net_end_statement(), which hangs in net_flush()
32.8% of the time in my_net_read()
7.6% in ha_pbxt::index_read(), this is the time spent in the engine
32.2% in init_sql_alloc() which waits on the spinlock in malloc()

From this you can see that the optimization is almost optimal because the program is spending almost 60% of its time waiting on the network.

However, it is also clear where the next optimization would come from. Remove the call to malloc() in init_sql_alloc() which is called by open_tables(). This could be done by reusing the block of memory required by the thread, from call to call.

Ultimately, the goal of optimizing for scale like this is to bring the code to the point that it is either network, CPU, or disk bound. Only then will the end-user really see an improvement in performance as the hardware is upgraded.

I think I have shown that it is worth putting some effort into such optimizations. Even more so as multi-core systems become more and more commonplace.

Friday, June 13, 2008

PBXT compiles without change under MySQL 5.1.25!

OK, now I know that the GA version of 5.1 is rapidly approaching. PBXT compiles with the latest release of MySQL without any changes!

This has never been the case before. Just search the PBXT code for MYSQL_VERSION_ID, and you will find things like:

#if MYSQL_VERSION_ID < 50114
    XT_RETURN_VOID;
#else
    XT_RETURN(0);
#endif

and, even worse:

#if MYSQL_VERSION_ID < 60000
#if MYSQL_VERSION_ID >= 50124
#define USE_CONST_SAVE
#endif
#else
#if MYSQL_VERSION_ID >= 60005
#define USE_CONST_SAVE
#endif
#endif

The lack of changes that affect pluggable storage engines can only mean that the bug fixes required are diminishing in scope.

And I believe this is a far better gauge of whether GA is close than any other marketing orientated statements! :)

Wednesday, June 04, 2008

PBXT 1.0.03 Alpha has been released!

I have released PBXT 1.0.03 Alpha and it is available for download from http://www.primebase.org/download. I have also posted binary plugins for a few platforms.

If you are building from source I have added a Quick Guide: Building and Installing PBXT from Source, which I hope makes the task really simple. If not, I would appreciate any feedback!

With this version I have completed the implementation of full-durability, and other features that are scheduled for RC and ultimately for the first GA release.

Still to be done is the Windows port which I plan to do before the first Beta release.

Please send any comments, questions, bug reports, etc. directly to me: paul dot mccullagh at primebase dot org.

Thursday, May 01, 2008

PBXT & BLOB Streaming Conference Presentations & Videos

The slides of my presentations at the MySQL Conference & Expo 2008 are now available for download. Videos of the presentations have been uploaded to YouTube:

Inside the PrimeBase XT Storage Engine

Presentation: pbxt-uc-2008.pdf
Videos: Part 1/7, Part 2/7, Part 3/7, Part 4/7, Part 5/7, Part 6/7, Part 7/7

Introduction to the BLOB Streaming Project

Presentation: mybs-uc-2008.pdf
Videos: Part 1/5, Part 2/5, Part 3/5, Part 4/5, Part 5/5

With this link you will find all the videos at once. If you watch the movies, then it may help to look at the PDF presentation slides at the same time, because the video quality is "not ideal" :)

Tuesday, April 22, 2008

Sun is serious about Open Source and the MySQL Community

In probably the best move by Sun during the whole MySQL Conference and Expo, Rich Green and Jonathan Schwartz turned up at the Community Dinner on the Sunday night before the conference.

As we walked into the restaurant I saw a face that I thought was familiar. Jonathan and Rich were standing outside the restaurant talking. However, only when we got inside did I hear Jay saying that that was Jonathan Schwartz.

So just before we all took our places, and while we were trying to work out how we were going to organize payment for the dinner, Rich and Jonathan turned up and quickly ended the discussion. Rich said his credit card would be good for the tab. So thanks to Sun for that!

But besides good food and plenty to drink, it was a great opportunity to talk and ask some questions that have been on my mind since the acquisition of MySQL by Sun. I have expressed these concerns on this blog, and they can be summarized as follows:

How important is open source, and in particular the MySQL community to Sun?

Both Rich and Jonathan were able to give me an adequate answer to this question. I will summarize this in my own words.

Sun bought MySQL to expand its business and influence in the open source world. So the MySQL community is the key to this.

I believe this means that Sun is not interested in commercializing any parts of the MySQL server, and here I am referring to the massive discussion that has resulted from the announcement MySQL to launch new features only in MySQL Enterprise on Jeremy Cole's blog. After all, it is clear that MySQL's bottom line (although profitable) makes no difference to Sun. They are interested in access to the over 10 million users of MySQL to sell services and hardware, those things that Sun already does well.

It is the MySQL's task to expand the user base, not endanger it. So I think we will see a change of strategy in the coming weeks and months.

And I can add the following: from what I have seen of it, MySQL's enterprise offering is really a great package without having to add a proprietary version of the server. It has everything a serious user of MySQL wants: 24 hour support, monitoring tools, design tools, service packs and priority bug fixing. And with Sun's backing, nobody doubts anymore that they can deliver this service.

Jonathan and Rich clearly demonstrated their support for the MySQL community by coming to the dinner. Besides clearing up some important questions, it was a great photo op.:

You may have seen this photo already on Ronald's blog. The picture is of Jonathan and I with the PrimBase Technologies conference T-shirt. If you look closely you will see another little detail. I have a dolphin in my pocket! I wonder if that has any symbolic meaning...

Jonathan tells a great story on his blog. But what is significant is the picture of Monty he posted, who is wearing a shirt that says "my free software runs your company". We have every reason to believe Jonathan fully supports this sentiment. So note that the T-shirt does not say "my partially free software ..."!

Oh, and in the picture of Monty, do you recognize the shirt of the person standing next to him? Since I generally only wear a shirt once, we know that this picture was also taking at the Community Dinner.

Friday, April 11, 2008

BLOB Streaming presentation at the MySQL Conference

My presentation on BLOB Streaming at the MySQL Conference next week will be very practical.

I have made quite a few graphics to show how it works, and plan to demonstrate the current version of the BLOB Streaming engine.

"To BLOB or not to BLOB?" is a common question in the database world. There are advantages and disadvantages to both sides. I'll be explaining why I believe that the "BLOB Repository" (a central component of the BLOB Streaming Architecture) combines the advantages of both approaches.

Check it out:

An Introduction to BLOB Streaming for MySQL Project
3:05pm - 3:50pm Wednesday, 04/16/2008
Ballroom A

Tuesday, April 08, 2008

Replication is dead, long live Replication!

Brian Aker has found general agreement with his post: "The Death of Read Replication".

Arjen Lentz says "I think Brian is right...", and Frank Mash confirmed: "what Brian says about replication, caching and memcached is very true".

Just like Video killed the Radio Star it looks like maybe Memcached killed the Replication Hierarchy!

But of course, Brian and others are talking about replication for scaling reads.

In my session on PBXT next week at the conference I will be talking about how we plan to use synchronous replication to produce an HA solution for MySQL at the engine level.

I will also discuss how some flexibility in the PBXT architecture makes it possible to actually scale writes efficiently as mentioned by Arjen in his blog.

So don't miss it:

Inside the PBXT Storage Engine
10:50am - 11:50am Thursday, 04/17/2008
Ballroom G

Wednesday, April 02, 2008

Welcome Ronald! Great to have you on board!

If you've been following his blog, then you will already know that Ronald Bradford has joined PrimeBase Technologies. We are very pleased to have him on board! As many know, Ronald has always been very active in the MySQL community as far as his job has made this possible.

Ironically during his time at MySQL he was less present in the community than before. When we discussed our plans for PrimeBase with him, Ronald was interested because it was an opportunity to return to a more active role in the community. I am very glad that this motivation was understood by almost everyone at MySQL and we are all looking forward to seeing and hearing more from Ronald.

But, of course, Ronald is not "just a pretty face" ;) He will be helping us to design and specify our open source products (including Blob Streaming). Ronald's extensive experience with both MySQL and end-users will contribute significantly to what we produce.

Ronald will also be helping us to refine our business model. We want all PrimeBase software to be open and free, so we've been thinking hard about how we can make this possible. All this makes it a very exciting time for us, and we will be talking more about of our plans in the days and weeks to come.

Of course, Ronald and I will be at the MySQL conference, so be sure to look us up!

Friday, March 14, 2008

New version and a new home for PBXT!

I have just released the first fully durable version of PBXT. Because of the amount of new code I have reverted PBXT to Alpha status. This version, 1.0-alpha, can be downloaded from: http://www.primebase.org/download.

Oh, which reminds me: PBXT now has a new home at http://www.primebase.org, so take a look around! I have actually found a bit of time to write some documentation. Right now the documentation describes building, installation, and the PBXT system parameters. Future additions will include information on performance tuning and a road map for PBXT development.

But there is more to the new home than just a new web-site. The PBXT project is now owned and funded by PrimeBase Technologies, an open source software development company. So altogether this is a important step forward on the road to my goal which is to make PBXT a significant contribution to the MySQL community and business/eco-system.

Besides full durability, the latest release includes the following improvements:

Calculation of index statistics as required by the optimizer (execute FLUSH TABLES to refresh the statistics).
New system variables: pbxt_log_cache_size, pbxt_log_file_threshold, pbxt_transaction_buffer_size and pbxt_checkpoint_frequency (details here).
Implementation of SELECT FOR UPDATE, which performs row-level locking to prevent concurrent updates.
Group commit: increases update throughput by committing multiple transactions concurrently.
Support for SHOW ENGINE PBXT STATUS, which displays information about memory usage.

What this release does not have is an option to relax durability. The transaction log is always flushed on commit. I plan to add a system parameter shortly that will allow you, in the spirit of the original version, to trade performance for durability if this suites your application.

Even better would be to be able to specify this per table. Now if only MySQL would allow engines to specify custom table attributes...

Friday, February 08, 2008

PBXT & DBT2: Dubugging C/C++ 101

Yesterday I starting testing PBXT using the DBT2 benchmark. Following the implementation of durability and SELECT FOR UPDATE for the engine I was more interested in the benchmark as a test for stability and concurrency than performance. I was not disappointed...

Which bug first?

Well I immediately ran into 3 bugs. Isn't it funny how bugs often come in batches, which leaves you thinking: "Oh sh.. where do I start?". Here's my advice: start with the bug that is most likely to disappear if you fix the others!

A simple example, you have 2 bugs: an unexpected exception is occurring, and you're loosing memory. First look for the memory loss, because it may disappear when you fix the exception (because you may be loosing memory in the error handler).

Take things one problem at time:

Another thing: once you have decided for one of the bugs, stick with it (no matter how hard it gets) to the bitter end! Thrashing around will build frustration!

So what happened with the DBT2 test? I started the test and immediately noticed that the engine was throwing "duplicate key" errors (it was too much to hope that this behavior was intended). Next I hit an assertion that claimed that a semaphore was not initialized (but I knew the semaphore was initialized). Finally, on restart after the assertion failed the engine crashed on recovery, in the clib memory manager (not a good sign!).

So were to start? Taking my own advice I quickly secured the state of the database before the restart, and confirmed that I could repeat the restart crash. So that one could wait for later.

The duplicate key error seemed be a fairly stable repeat, so I took a closer look at the semaphore problem. Here I noticed that the assertion was failing because the check bytes that indicate that the semaphore was initialized had been overwritten, not a happy situation!

Make the bug quick and easy to repeat:

This bug was also difficult to repeat, I had to restore a fresh environment to get it to repeat consistently. So this is where I started.

But before we go on: make sure, in such a situation, that you can repeat the bug as quickly and easily as possible. Eliminate as many manual steps as you can, it will save time in the long run. For example, in this case I wrote a line of shell commands to delete and copy in the database to provide the correct starting point for repeating the bug.

Check your last bug fix first!

Unfortunately this bug turned out to be the result of a short laps of concentration during my last bug fix. But I did not notice the error during my testing of the big fix and so I moved on to DBT2. When the error occurred during the DBT2 test, I did not relate the problem to my last bug fix.

If I had, I would have found the problem quick enough by a simple code read of the bug fix again. This has happened to me before, so my advice is: check your last bug fix, even when the new error does not seem to be related!

Debugging C/C++ 101, 3 lessons:

Conveniently for my little refresher course, each of the 3 bugs proved to touch on a different aspect of C/C++ debugging:

Bug 1. Using an uninitialized pointer.

For goodness sake, if you suspect this, then compile you program with optimization on, and the warning for "uninitialized variables" enabled. I didn't do this, and I may have saved myself a lot of time. Anyway, this does not always work (for example if you used '&'). Unfortunately, if the compiler does not help, there is no easy way to find these bugs.

Debugging method: Probing

I call the method I used to find this error "probing". The idea is to write a special piece of check code which tests for the memory overwrite. The semaphore that was being overwritten was not global, but I added some code when it was initialized to set a global pointer to the semaphore. Then I wrote a little check (or probe) function which tested to see if the check bytes were still OK.

Next I spread calls to the check function around my program, trying to close in on the point were things go wrong. When doing this you have 2 difficulties to deal with:

Finding the right thread - If you are probing the wrong thread, then you get very miss-leading results. For example, I started by adding the probes to the engine API functions. When the probe was failing on the entry point it took me a while to realize that the problem must be in the PBXT background threads. So, using the probe try to first isolate the thread(s) that are causing the corruption.

This meant removing all probes from the engine API functions and placing them in the background threads, starting with the main loops. Then by elimination I managed to narrow the problem down to one particular thread.

Dealing with disappearing repeatability - The problem with this kind of bug is that it is really shy! As soon as you start to probe it, it disappears. I mean the changes made to the program change the executions so that the bug does not repeat.

At this stage it is very tempting to leave the debug statements in the code and declare the bug as fixed! But, alas, the bug is still there, it has just moved on to overwrite some other part of memory in some quite little corner.

Here I can give the following advice: When the bug disappears always return to the last repeat point and try taking smaller steps this time.

Another thing: approach the corruption point from the bottom. By this I mean, close the probe in from a point after the corruption has occurred. This is because if the corruption is due an uninitialized stack value, then as you move the probe towards the corruption point from the top, the probe disturbs the state of the stack.

As I mentioned above, when I found this bug it turned out to be a result of the last bug fix, bummer :(

Bug 2. Overwriting memory.

Next I decided to look at the crash on startup. This bug caused a crash in the memory manager. This is most often due to a memory overwrite which has wiped out some of the management data stored per block by the memory manager.

Fortunately I have to right tools to deal with this problem.

Debugging method: Scanning Memory

Its like "don't leave home without it": don't start C/C++ program without a debug memory manager that does at least the following:

Adds and checks headers and footers on every block of memory allocated.
Wipes out blocks that are freed (for example set all bytes to 0xFE).
Always moves a block of memory that is reallocated.
Records every block allocated, and notes the line number and file where allocated.
Checks on shutdown that all memory has been freed, and reports blocks not freed.

This can also be done for objects allocated in C++ by overriding the delete and new operators.

Now using the fact that I have a list of all allocated pointers I have implemented a function which scans all allocated pointers and checks the headers and footers to tell me if anything is corrupted.

So to find the recovery crash I added the scan call to some of the loops that do recovery and soon managed to narrow things down and find the point were the corruption was occurring.

Note that with this method it may not even be necessary to do such a search. One call to the scan routine tells me which block has been corrupted. If it was a simple overwrite of the end of the block, then my debug memory manager will tells me which block it was, and were it was allocated. This may be enough to find the problem.

In my case it turned out that I had taken a pointer to a block and then some sub-function reallocated the block. But this is why it is so important that the debug memory manager always moves blocks on realloc(). If it had not done this, I probably never would have noticed the bug until it happened in some production situation (ugh!).

Bug 3. Concurrency problems.

I was worried about the 3rd bug which was causing an unexpected "duplicate key" error, because I was afraid it might be a conceptual problems. This fear stems from the fact that there are indeed serious conceptual problems involving MVCC and SELECT FOR UPDATE (which requires locking), but fortunately, my fear was unfounded, and it turned out to be just a normal bug, whew!

I new this bug must be related to concurrency because I had tested all aspects of the row level locking in a simple controlled environment.

Debugging method: Trace it

The way to find concurrency problems is to trace them. The better your trace, the easier it is to find the bug. I think it is almost impossible to find a concurrency bug just using the debugger (unless you have a deadlock, for example). This is because in the debugger you only have a snapshot of the situation, and you don't see the interaction between the threads.

In my case the duplicate key error turned out to be the result of a SELECT FOR UPDATE that failed earlier.

There is, of course a big problem with tracing and concurrency. Sometimes the error is robust, and sticks around while you bombard it with printf() statements. Mine was more of the shy type where the timing was really critical, and it disappeared when I added print statements.

In this case, I also have the right tool for the job. It is a trace function which records the information only in RAM, in one huge block of memory which rolls over if necessary. It is worth also taking the bit of extra time to make the trace function handle printf() type syntax, so that it is as easy to using as printf() itself.

What I did was I set a breakpoint at the spot in my code where the duplicate key error is generated. At this point in the code, I built in a call to my "trace dump" function. This function dumps the trace information which I have collected so far to a file.

Then I can examine the trace to find out how I got to this point, and I can also use the debugger to examine the threads and find other information I need.

Now some advice on trace code:

Firstly, never build in trace code that you think you might need! This is a waste of time if it is never used, and it clutters the code unnecessarily.
Secondly, when faced with a problem that needs to be traced, do not waste to much time or thought trying to guess what information you will need. Initially, just get something out there. Examining the trace is the best way to decide what information is missing.
Finally, when you are done, and you have found your bug, you will probably feel quite attached to you new trace code and not want to part with it. Don't worry, I understand, and I am not going now tell you that you have to remove it :) Well, not all of it, anyway.

Really, you have to be very critical and decide what parts of the trace are good in general, and what parts helped you just find this bug. That stuff must go. And the other stuff: take a bit of time to clean it up, and make sure it can only be enabled in debug mode! Ever have to recall a version because you forgot a trace in it? Hmmm...

Well in the end I was very happy with my trace code. It allowed me to pinpoint the bug in SELECT FOR UPDATE, and I added a GOTCHA to my code which you can search for as soon as I get this version released (soon I hope).

OK, so this has been quite a long post, thanks for sticking with me :)

Thursday, January 17, 2008

Good move, congratulations MySQL and Sun!

Its already a day old, but the news is as hot as ever. Sun will acquire MySQL before the end of the year.

Congratulation to MySQL and Sun!

And well done to all who were involved in making this deal, in particular, those I know personally: Marten, Monty, David, Zack and Kaj!

As I mentioned to Kaj, I am sure that MySQL has a very bright future under the wings of Sun. A deal for $1 billion made in 5 weeks can only mean both sides are extremely motivated to make it work.

I have just 3 concerns:

I hope that the MySQL web-site will not disappear into the Sun web-site like the proverbial needle in a haystack! Sun's download page alone is as big as the MySQL web-site ;) I would like to see a mysql.sun.com, where we can find our way around easily.
And the second is similar to the first but relates to the people. There is a massive difference between dealing with a company that has 400 employees, and one with 34000! I hope that this deal will not affect the access we have to the decision makers, the community support team and the developers.
Thirdly, Sun wants to sell MySQL to enterprise customers, but I want to be involved in a database that is there for everyone. I am concerned that the non-paying customers, which is a large part of the community, may become neglected.

I know that Kaj and many others at MySQL will be working hard to alleviate these concerns, so thanks in advance. Time will tell how successful they are. But they can be sure, in this quest they have my full support!