Wednesday, July 23, 2008

Drizzle goes back to the Roots

Will Drizzle (Brian, Monty, Mark, MontyT, and others ...) become a cloudburst? I think so, and here is why...

First a simple question: what made diverse systems such as PHP, the HTTP protocol and memcached so popular?

Answer: ease of use, simplicity, speed and scalability.

And what made the original version of MySQL so popular? Well, exactly the same things.

Drizzle goes back to the roots, concentrating on what made the use of MySQL so widespread in the first place.

You could say that, with 5.0, MySQL lost its way, introducing many complex features: stored procedures, triggers, views, the query cache, etc.

So why did MySQL add these features? I see two reasons:

Popular opinion: It is a simple fact that analysts, journalists and, in particular, investors refused to take MySQL seriously unless it "grew up" and gained all the features that a mature database should have. As a venture-capital-financed company heading for an IPO, it's hard to ignore popular opinion.

To compete with Oracle: MySQL management believed (understandably) that MySQL would not make it unless it competed head-to-head with the industry leader. Characteristic of this was the effort to run SAP on MySQL.

And what came of all this?

MySQL gave up trying to compete directly with Oracle two years ago already. Back then, Mårten Mickos stated MySQL's mission as follows: "to become the best online database in the world". And all efforts to run SAP, including MaxDB, have since been dropped.

But at least the critics have been silenced! And let's face it, Sun would never have paid $1B for a "toy" database. And still today, these heavy-duty features are important for Sun's effort to sell MySQL into the corporate IT space.

However, this leaves a void to be filled by Drizzle: a lightweight database that scales for demanding Web 2.0 applications and Cloud computing. By concentrating on core functionality, I believe Drizzle can really make progress in this space. Just one example: developers don't have to worry about whether the query cache breaks scalability with each release.

So what can I learn from this?

So far I have resisted adding features such as savepoints and 2-phase commit to PBXT, but I was thinking I would have to do this stuff at some stage. Well, I am not so sure anymore... :)

Monday, July 14, 2008

Mutex contention and other bottlenecks in MySQL

Over the last few weeks I have been doing some work on improving the concurrency performance of PBXT. The last Alpha version (1.0.03) has quite a few problems in this area.

Most of the problems have been with r/w lock and mutex contention, but I soon discovered that MySQL has some serious problems of its own. In fact, I had to remove some of the bottlenecks in MySQL in order to continue the optimization of PBXT.

The result for simple SELECT performance is shown in the graph below.

Here you can see that the gain is over 60% for 32 or more concurrent threads. Both results show the performance with the newly optimized version of PBXT. The test was run on a 2.16 GHz dual-core processor, so I expect an even greater improvement on 4 or 8 cores. The query I ran for this test is of the form SELECT * FROM table WHERE ID = ?.
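For the record, each benchmark thread does roughly the following (a sketch, not my actual test harness; the table name and connection details are placeholders):

    #include <mysql.h>
    #include <cstring>

    /* Roughly what each benchmark thread does: prepare the statement
       once, then execute it over and over with varying IDs. */
    void benchmark_thread(int iterations)
    {
        MYSQL *con = mysql_init(NULL);
        mysql_real_connect(con, "localhost", "user", "pass", "test",
                           0, NULL, 0);

        MYSQL_STMT *stmt = mysql_stmt_init(con);
        const char *query = "SELECT * FROM t1 WHERE ID = ?";
        mysql_stmt_prepare(stmt, query, strlen(query));

        int id = 0;
        MYSQL_BIND bind;
        memset(&bind, 0, sizeof(bind));
        bind.buffer_type = MYSQL_TYPE_LONG;
        bind.buffer = &id;
        mysql_stmt_bind_param(stmt, &bind);

        for (int i = 0; i < iterations; i++) {
            id = i % 1000;          /* spread the lookups over the table */
            mysql_stmt_execute(stmt);
            /* ... fetch and discard the result rows ... */
            mysql_stmt_free_result(stmt);
        }
        mysql_stmt_close(stmt);
        mysql_close(con);
    }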

So what did I do to achieve this? Well, first of all, as you will see below, I cheated in some cases: I commented out or avoided some locks that were a bit too complicated to solve properly right now. But in other cases, I used solutions that could actually be taken over, as-is, by MySQL. In particular, the use of spinlocks.

All in all though, my intention here is just to demonstrate the potential for concurrency optimization in MySQL.
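Since spinlocks come up again and again below, here is the basic pattern I am talking about (a minimal sketch in C++, not the actual PBXT implementation):

    #include <atomic>

    /* A minimal test-and-set spinlock. It only pays off for locks that
       are held very briefly, so that spinning is cheaper than putting
       the waiting thread to sleep and waking it up again. */
    struct Spinlock {
        std::atomic_flag locked = ATOMIC_FLAG_INIT;

        void lock() {
            while (locked.test_and_set(std::memory_order_acquire))
                ;   /* spin; a production version would yield after N tries */
        }
        void unlock() {
            locked.clear(std::memory_order_release);
        }
    };

Usage is the same as a mutex: lock(), do a tiny amount of work, unlock(). The win over a pthread mutex is that a briefly contended lock never puts the waiting thread to sleep, which saves the cost of the sleep/wake-up cycle.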

Optimization 1: LOCK_plugin in plugin_foreach_with_mask()

The LOCK_plugin mutex in plugin_foreach_with_mask() is the first bottleneck you hit in just about any query. In my tests with 32 threads it takes over 60% of the overall execution time.

In order to get further with my own optimizations, I commented out the pthread_mutex_lock() and pthread_mutex_unlock() calls in this function, knowing that the lock is only really needed when plug-ins are installed or uninstalled. However, later I needed to find a better solution (see below).

Optimization 2: LOCK_grant in check_grant()

After removing the above bottleneck I hit a wall in check_grant(): pthread_rwlock_rdlock() was taking 50% and pthread_rwlock_unlock() 45.6% of the CPU time! Once again I commented out the calls rw_rdlock(&LOCK_grant) and rw_unlock(&LOCK_grant) in check_grant() to get around the problem.

In order to really eliminate this lock, MySQL needs to switch to a different type of read/write lock: 99.9% of the time only a read lock is required, because a write lock is only needed when privileges are loaded or changed.

For similar purposes, in PBXT, I have invented a special type of read/write lock that requires almost zero time to gain a read lock ... hmmmm ;)
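I won't give away the PBXT implementation here, but the general idea behind such a lock can be sketched. Each reader flags itself in its own cache line, so taking a read lock never fights over a shared cache line with the other readers, and the (rare) writer waits for all the reader flags to clear. A simplified version follows; how threads are assigned their slot numbers is left out:

    #include <atomic>

    enum { MAX_THREADS = 64 };

    /* Each reader slot lives in its own cache line, so setting the flag
       does not bounce a shared cache line between the reader threads. */
    struct FastRWLock {
        struct alignas(64) Slot { std::atomic<int> reading{0}; };
        Slot slots[MAX_THREADS];
        std::atomic<int> writer{0};

        void read_lock(int thr) {
            for (;;) {
                slots[thr].reading.store(1, std::memory_order_seq_cst);
                if (writer.load(std::memory_order_seq_cst) == 0)
                    return;             /* common case: no writer active */
                /* A writer is active: back off and wait for it to finish. */
                slots[thr].reading.store(0, std::memory_order_seq_cst);
                while (writer.load(std::memory_order_relaxed) != 0)
                    ;
            }
        }
        void read_unlock(int thr) {
            slots[thr].reading.store(0, std::memory_order_release);
        }
        void write_lock() {
            int expected = 0;           /* one writer at a time */
            while (!writer.compare_exchange_weak(expected, 1,
                                                 std::memory_order_seq_cst))
                expected = 0;
            for (int i = 0; i < MAX_THREADS; i++)
                while (slots[i].reading.load(std::memory_order_seq_cst) != 0)
                    ;                   /* wait for the readers to drain */
        }
        void write_unlock() {
            writer.store(0, std::memory_order_release);
        }
    };

The price is paid on the write side, which has to scan all the slots. But since privileges change very rarely, that is exactly the right trade-off for LOCK_grant.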

Optimization 3: Mutex in LOCK and UNLOCK tables

I then discovered that 51.7% of the time was taken in pthread_mutex_lock() called from thr_lock(), called from open_and_lock_tables().
And 44.5% of the time was taken in pthread_mutex_lock() called from thr_unlock(), called from mysql_unlock_tables().

Now this is a tough nut. The locks involved here are used all over the place, but I think they could be replaced with spinlocks to good effect (see below). I did not try this though. Instead, I used LOCK TABLES in my test code to avoid the calls to LOCK and UNLOCK tables for every query.

Optimization 4: LOCK_plugin in plugin_unlock_list()

Once again the LOCK_plugin is the bottleneck, this time taking 94.7% of the CPU time in plugin_unlock_list(). This time I did a bit of work. Instead of commenting it out, I replaced LOCK_plugin with a spinlock (I copied and adapted the PBXT engine implementation for the server).

This worked to remove the bottleneck because LOCK_plugin is normally only held for a very short time. However, when a plugin is installed or uninstalled this lock will be a killer, and some more work probably needs to be done here.

Optimization 5: pthread_setschedparam()

I was a bit shocked to find that pthread_setschedparam() was now taking 17% of the CPU time required to execute the SELECT. This call can easily be avoided by first checking whether the scheduling parameters need to be changed at all. For the moment, I commented the call out.
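For example, the last parameters set could simply be remembered for each thread (a sketch; the names are mine, not MySQL's):

    #include <pthread.h>

    /* Remember the scheduling parameters last set for a thread, and only
       call pthread_setschedparam() when they actually change. */
    struct ThreadSched {
        int policy = -1;        /* -1 means "nothing set yet" */
        int priority = -1;
    };

    static void set_sched_if_changed(pthread_t thr, ThreadSched *ts,
                                     int policy, int priority)
    {
        if (ts->policy == policy && ts->priority == priority)
            return;                     /* common case: nothing to do */

        struct sched_param param;
        param.sched_priority = priority;
        if (pthread_setschedparam(thr, policy, &param) == 0) {
            ts->policy = policy;        /* remember for next time */
            ts->priority = priority;
        }
    }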

Of course, the more optimized the code is, the worse such a call becomes. After all the other optimizations, the CPU time taken by pthread_setschedparam() increases to 52.6%!

Optimization 6: LOCK_thread_count in dispatch_command()

The LOCK_thread_count mutex in dispatch_command() is next in line with 96.1% of the execution time.

Changing this to a spinlock completely removes the bottleneck.

Optimization 7: LOCK_alarm in thr_end_alarm() and thr_alarm()

my_net_read() calls my_real_read(), which calls the functions thr_end_alarm() and thr_alarm(). At this point in the optimization these two calls required 99.5% of the CPU time between them. Replacing LOCK_alarm with a spinlock fixed this problem.

Conclusion:

Without too much effort it is possible to make a huge improvement to the threading performance of MySQL. That such bottlenecks have not yet been investigated may be due to the fact that MySQL currently has no performance analysis team.

Following the last optimization, execution time was divided as follows:

25.8% of the time in net_end_statement(), which blocks in net_flush()
32.8% of the time in my_net_read()
7.6% in ha_pbxt::index_read(), which is the time spent in the engine
32.2% in init_sql_alloc(), which waits on the spinlock in malloc()

From this you can see that the optimization is almost optimal because the program is spending almost 60% of its time waiting on the network.

However, it is also clear where the next optimization would come from: removing the call to malloc() in init_sql_alloc(), which is called by open_tables(). This could be done by reusing a block of memory per thread, from call to call.
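A sketch of what I mean (the function name is mine): each thread keeps one block that only ever grows and is handed out again on every call:

    #include <cstdlib>
    #include <cstddef>

    /* One reusable memory block per thread. After a short warm-up the
       block is big enough and malloc() is never called again. */
    struct ThreadBlock {
        void  *mem;
        size_t size;
    };

    static thread_local ThreadBlock tl_block = { NULL, 0 };

    void *get_statement_block(size_t needed)
    {
        if (tl_block.size < needed) {   /* grow: rare after warm-up */
            free(tl_block.mem);
            tl_block.mem  = malloc(needed);
            tl_block.size = tl_block.mem ? needed : 0;
        }
        return tl_block.mem;            /* reused from call to call */
    }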

Ultimately, the goal of optimizing for scale like this is to bring the code to the point that it is either network, CPU, or disk bound. Only then will the end-user really see an improvement in performance as the hardware is upgraded.

I think I have shown that it is worth putting some effort into such optimizations. Even more so as multi-core systems become more and more commonplace.