PrimeBase XT: scaling

One little bit of wisdom I would like to pass on:

If a program runs fast with 20 threads, that does not mean it will run fast with 50. And if it runs fast with 50, it does not mean that it will run fast with 100, and if it runs fast with 100 ... don't bet on it running fast with 200 :)

In my last blog I discussed some improvement to the performance of PBXT running the DBT2 benchmark. Despite the overall significant increase in performance I noted a drop off at 32 threads that indicated a scaling problem. For the last couple of weeks I have been working on this problem and I have managed to fix it:

As before, this test was done using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database. The test is memory bound and does not test the affects of checkpointing.

PBXT Baseline is the code revision indicated as PBXT 1.0.08 in my last blog. PBXT 1.0.07 is the current PBXT GA release version. PBXT 1.0.08 is the latest revision of the PBXT trunk. The baseline graph shows the extent of the scaling problem of the last version.

The latest version is over 20 times faster than PBXT 1.0.07 and 140% faster than the previous version. But most important is the fact that performance remains almost constant as the number of threads increases.

My thanks also to InnoDB which, looking at it positively, offers an excellent measure of how well you are doing! :) It looks like PBXT now actually scales better than InnoDB for this type of test.

So what has changed?

Basically I have made 2 changes, one major and one smaller but significant change. The first change, which got PBXT running faster with 50 threads has to do with conflict handling.

As I mentioned before DBT2 causes a lot of row level conflicts. This is especially the case as the number threads increase. In fact, at any given time during the test with 100 threads (performance results above), 80 of the threads are waiting for row locks. (Of the remaining 20, 4 are waiting for network I/O, and the rest are doing the actual work!)

The result is, if the handling of these conflicts is not optimal the engine looses a lot of time. Which you can clearly see from both the baseline and 1.0.07 results reported above.

To fix this I completely re-wrote the row-level conflict handling. Code paths are now much shorter for row-lock/update detection and handling and threads are now notified directly when they can continue.

The other change I made involved the opening and closing of tables when the MySQL open table cache is too small. This is something that really killed performance starting at about 100 threads. PBXT was doing quite a bit of unnecessary stuff on open and close table, which was fairly easy to move out.

So now that the scaling is good up to 200 threads, should I assume that performance will also be good for 400 threads? Of course it is! Well, at least until I test it... :)

Over the last few weeks I have been doing some work on improving the concurrency performance of PBXT. The last Alpha version (1.0.03) has quite a few problems in this area.

Most of the problems have been with r/w lock and mutex contention but, I soon discovered that MySQL has some serious problems of it's own. In fact, I had to remove some of the bottlenecks in MySQL in order to continue the optimization of PBXT.

The result for simple SELECT performance is shown in the graph below.

Here you can see that the gain is over 60% for 32 or more concurrent threads. Both results show the performance with the newly optimized version of PBXT. The test is running on a 2.16 MHz dual core processor, so I expect an even greater improvement on 4 or 8 cores. The query I ran for this test is of the form SELECT * FROM table WHERE ID = ?.

So what did it do to achieve this? Well first of all, as you will see below, I cheated in some cases. I commented out or avoided some locks that were a bit too complicated to solve properly right now. But in other cases, I used solutions that can actually be taken over, as-is, by MySQL. In particular, the use of spinlocks.

All-in-all though, my intension here is just to demonstration the potential for concurrency optimization in MySQL.

Optimization 1: LOCK_plugin in plugin_foreach_with_mask()

The LOCK_plugin mutex in plugin_foreach_with_mask() is the first bottleneck you hit in just about any query. In my tests with 32 threads it takes over 60% of the overall execution time.

In order to get further with my own optimizations, I commented out the pthread_mutex_lock() and pthread_mutex_lock() calls in this function, knowing that the lock is only really needed if plug-ins are installed or uninstalled. However, later I needed to find a better solution (see below).

Optimization 2: LOCK_grant in check_grant()

After removing the above bottleneck I hit a wall in check_grant(). pthread_rwlock_rdlock() was taking 50%, and pthread_rwlock_unlock() was taking 45.6% CPU time! Once again I commented out the calls rw_rdlock(&LOCK_grant) and rw_unlock(&LOCK_grant) in check_grant() to get around the problem.

In order to really eliminate this lock, MySQL needs to switch to a different type of read/write lock. 99.9% of the time only a read lock is required because a write lock is only required when loading and changing privileges.

For similar purposes, in PBXT, I have invented a special type of read/write lock that requires almost zero time to gain a read lock ... hmmmm ;)

Optimization 3: Mutex in LOCK and UNLOCK tables

I then discovered that 51.7% of the time was taken in pthread_mutex_lock() called from thr_lock() called from open_and_lock_tables().
And, 44.5% of the time was taken in thread_mutex_lock() called from thr_unlock() called from mysql_unlock_tables().

Now this is a tough nut. The locks used here are used all over the place, but I think they can be replaced with a spinlock to good effect (see below). I did not try this though. Instead I used LOCK TABLES in my test code, to avoid the calls to LOCK and UNLOCK tables for every query.

Optimization 4: LOCK_plugin in plugin_unlock_list()

Once again the LOCK_plugin is the bottleneck, this time taking 94.7% of the CPU time in plugin_unlock_list(). This time I did a bit of work. Instead of commenting it out, I replaced LOCK_plugin with a spinlock (I copied and adapted the PBXT engine implementation for the server).

This worked to remove the bottleneck because LOCK_plugin is normally only held for a very short time. However, when a plugin is installed or unstalled this lock will be a killer and some more work probably needs to be done here.

Optimization 5: pthread_setschedparam()

I was a bit shocked to find pthread_setschedparam() was now taking 17% of the CPU time required to execute the SELECT. This call can be easily avoided by first checking to see if the schedule parameter needs to be changed at all. For the moment, I commented the call out.

Of course, the more optimized the code is, the worse such a call becomes. After all other optimizations pthread_setschedparam() CPU time increases to 52.6%!

Optimization 6: LOCK_thread_count in dispatch_command()

The LOCK_thread_count mutex in dispatch_command() is next in line with 96.1% of the execution time.

Changing this to a spinlock completely removes the bottleneck.

Optimization 7: LOCK_alarm in thr_end_alarm() and thr_alarm()

my_net_read() calls my_real_read() which calls the functions thr_end_alarm() and thr_alarm(). At this point in the optimization these 2 calls required 99.5% of the CPU time between them. Replacing LOCK_alarm with a spinlock fixed this problem.

Conclusion:

Without too much effort it is possible to make a huge improvement to the threading performance of MySQL. The fact that such bottlenecks have not yet been investigated may be due the fact that MySQL currently has no performance analysis team.

Following the last optimization, execution time was divided as follows:

25.8% of the time in net_end_statement(), which hangs in net_flush()
32.8% of the time in my_net_read()
7.6% in ha_pbxt::index_read(), this is the time spent in the engine
32.2% in init_sql_alloc() which waits on the spinlock in malloc()

From this you can see that the optimization is almost optimal because the program is spending almost 60% of its time waiting on the network.

However, it is also clear where the next optimization would come from. Remove the call to malloc() in init_sql_alloc() which is called by open_tables(). This could be done by reusing the block of memory required by the thread, from call to call.

Ultimately, the goal of optimizing for scale like this is to bring the code to the point that it is either network, CPU, or disk bound. Only then will the end-user really see an improvement in performance as the hardware is upgraded.

I think I have shown that it is worth putting some effort into such optimizations. Even more so as multi-core systems become more and more commonplace.

PrimeBase XT

Wednesday, March 25, 2009

Solving the PBXT DBT2 Scaling Problem

Monday, July 14, 2008

Mutex contention and other bottlenecks in MySQL