Monday, July 14, 2008

Mutex contention and other bottlenecks in MySQL

Over the last few weeks I have been doing some work on improving the concurrency performance of PBXT. The last Alpha version (1.0.03) has quite a few problems in this area.

Most of the problems have been with r/w lock and mutex contention, but I soon discovered that MySQL has some serious problems of its own. In fact, I had to remove some of the bottlenecks in MySQL in order to continue the optimization of PBXT.

The result for simple SELECT performance is shown in the graph below.

Here you can see that the gain is over 60% for 32 or more concurrent threads. Both results show the performance with the newly optimized version of PBXT. The test was run on a 2.16 GHz dual-core processor, so I expect an even greater improvement on 4 or 8 cores. The query I ran for this test is of the form SELECT * FROM table WHERE ID = ?.

So what did I do to achieve this? Well, first of all, as you will see below, I cheated in some cases. I commented out or avoided some locks that were a bit too complicated to solve properly right now. But in other cases, I used solutions that can actually be taken over, as-is, by MySQL. In particular, the use of spinlocks.

All in all though, my intention here is just to demonstrate the potential for concurrency optimization in MySQL.

Optimization 1: LOCK_plugin in plugin_foreach_with_mask()

The LOCK_plugin mutex in plugin_foreach_with_mask() is the first bottleneck you hit in just about any query. In my tests with 32 threads it takes over 60% of the overall execution time.

In order to get further with my own optimizations, I commented out the pthread_mutex_lock() and pthread_mutex_unlock() calls in this function, knowing that the lock is only really needed when plug-ins are installed or uninstalled. However, later I needed to find a better solution (see below).
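
Roughly, the idea behind this cheat looks something like the sketch below. The names are made up and this is not the actual server code; as written it is also not race-free, which is exactly why it is a cheat and not a fix: the mutex is only taken while a plugin install or uninstall is in progress, and the common read path skips it entirely.

/* Sketch only: hypothetical names, not the real plugin_foreach_with_mask().
   The mutex is taken only while plugin DDL is running; the common read
   path skips it. This mirrors the cheat above and is not a race-free fix. */
#include <pthread.h>

static pthread_mutex_t LOCK_plugin_sketch = PTHREAD_MUTEX_INITIALIZER;
static volatile int plugin_ddl_running = 0;  /* set by INSTALL/UNINSTALL PLUGIN */

void plugin_foreach_sketch(void **plugins, int count, void (*visit)(void *))
{
  int locked = plugin_ddl_running;
  if (locked)
    pthread_mutex_lock(&LOCK_plugin_sketch);   /* slow path, rarely taken */

  for (int i = 0; i < count; i++)
    visit(plugins[i]);                         /* fast path runs unlocked */

  if (locked)
    pthread_mutex_unlock(&LOCK_plugin_sketch);
}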

Optimization 2: LOCK_grant in check_grant()

After removing the above bottleneck I hit a wall in check_grant(). pthread_rwlock_rdlock() was taking 50% and pthread_rwlock_unlock() was taking 45.6% of the CPU time! Once again I commented out the calls rw_rdlock(&LOCK_grant) and rw_unlock(&LOCK_grant) in check_grant() to get around the problem.

In order to really eliminate this lock, MySQL needs to switch to a different type of read/write lock. 99.9% of the time only a read lock is required, because a write lock is only needed when privileges are loaded or changed.

For similar purposes, in PBXT, I have invented a special type of read/write lock that requires almost zero time to gain a read lock ... hmmmm ;)
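
To give an idea of what such a lock can look like, here is a sketch using GCC atomic builtins. It is not the PBXT implementation: readers pay a single atomic increment on the fast path, and the rare writer raises a flag and then waits for the readers to drain.

/* Sketch of a read-biased r/w lock, not the PBXT code. Readers do one atomic
   add on the fast path; writers set a flag, then wait for readers to drain.
   Real code needs memory barriers on the plain reads and should yield
   instead of busy-spinning. */
typedef struct {
  volatile int readers;  /* readers currently inside */
  volatile int writer;   /* non-zero while a writer holds or wants the lock */
} biased_rwlock;

static void br_read_lock(biased_rwlock *l)
{
  for (;;) {
    __sync_fetch_and_add(&l->readers, 1);      /* optimistic fast path */
    if (!l->writer)
      return;                                  /* no writer active: done */
    __sync_fetch_and_sub(&l->readers, 1);      /* back off for the writer */
    while (l->writer)
      ;                                        /* spin until the writer is gone */
  }
}

static void br_read_unlock(biased_rwlock *l)
{
  __sync_fetch_and_sub(&l->readers, 1);
}

static void br_write_lock(biased_rwlock *l)
{
  while (__sync_lock_test_and_set(&l->writer, 1))
    ;                                          /* exclude other writers */
  while (l->readers)
    ;                                          /* wait for readers to drain */
}

static void br_write_unlock(biased_rwlock *l)
{
  __sync_lock_release(&l->writer);
}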

Optimization 3: Mutex in LOCK and UNLOCK tables

I then discovered that 51.7% of the time was taken in pthread_mutex_lock() called from thr_lock() called from open_and_lock_tables().
And 44.5% of the time was taken in pthread_mutex_lock() called from thr_unlock() called from mysql_unlock_tables().

Now this is a tough nut. The locks taken here are used all over the place, but I think they can be replaced with a spinlock to good effect (see below). I did not try this though. Instead I used LOCK TABLES in my test code, to avoid the calls to LOCK and UNLOCK tables for every query.
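
For reference, the test loop did something like the following. This is only a sketch using the MySQL C API; the table name, the literal ID and the connection handling are placeholders, not the actual benchmark code.

/* Sketch of the test-harness trick: hold the table lock for the whole run so
   thr_lock()/thr_unlock() are not hit for every single SELECT. Error handling
   and connection setup omitted; the table name is a placeholder. */
#include <mysql.h>

void run_select_loop(MYSQL *con, int iterations)
{
  mysql_query(con, "LOCK TABLES test_tab READ");
  for (int i = 0; i < iterations; i++)
  {
    mysql_query(con, "SELECT * FROM test_tab WHERE ID = 42");
    MYSQL_RES *res = mysql_store_result(con);
    if (res)
      mysql_free_result(res);
  }
  mysql_query(con, "UNLOCK TABLES");
}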

Optimization 4: LOCK_plugin in plugin_unlock_list()

Once again the LOCK_plugin is the bottleneck, this time taking 94.7% of the CPU time in plugin_unlock_list(). This time I did a bit of work. Instead of commenting it out, I replaced LOCK_plugin with a spinlock (I copied and adapted the PBXT engine implementation for the server).

This worked to remove the bottleneck because LOCK_plugin is normally only held for a very short time. However, when a plugin is installed or uninstalled this lock will be a killer, and some more work probably needs to be done here.
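
The spinlock itself is nothing exotic. The sketch below (GCC builtins, not the actual PBXT or server code) shows the basic shape: a test-and-set loop that spins on a plain read so the cache line is not hammered while waiting.

/* Minimal spinlock sketch (not the PBXT or MySQL code) using GCC builtins. */
typedef struct { volatile int locked; } simple_spinlock;

static void spin_lock(simple_spinlock *s)
{
  while (__sync_lock_test_and_set(&s->locked, 1))  /* returns the old value */
  {
    while (s->locked)
      ;                    /* read-only spin to avoid cache-line ping-pong */
  }
}

static void spin_unlock(simple_spinlock *s)
{
  __sync_lock_release(&s->locked);  /* store 0 with release semantics */
}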

Optimization 5: pthread_setschedparam()

I was a bit shocked to find pthread_setschedparam() was now taking 17% of the CPU time required to execute the SELECT. This call can easily be avoided by first checking whether the scheduling parameters need to be changed at all. For the moment, I commented the call out.

Of course, the more optimized the code is, the worse such a call becomes. After all the other optimizations, the CPU time of pthread_setschedparam() increases to 52.6%!
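
The check itself is trivial. Something along these lines would do it; this is only a sketch, with the cached priority passed in explicitly rather than stored in the connection structure as the server would do.

/* Sketch of the check: only touch the scheduler when the priority actually
   changes. In the server the cached value would live with the connection. */
#include <pthread.h>
#include <sched.h>

void set_priority_if_changed(pthread_t thr, int *cached_priority, int new_priority)
{
  struct sched_param param;
  int policy;

  if (*cached_priority == new_priority)
    return;                                    /* nothing to do, skip the call */

  if (pthread_getschedparam(thr, &policy, &param) == 0)
  {
    param.sched_priority = new_priority;
    if (pthread_setschedparam(thr, policy, &param) == 0)
      *cached_priority = new_priority;         /* remember what we set */
  }
}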

Optimization 6: LOCK_thread_count in dispatch_command()

The LOCK_thread_count mutex in dispatch_command() is next in line with 96.1% of the execution time.

Changing this to a spinlock completely removes the bottleneck.

Optimization 7: LOCK_alarm in thr_end_alarm() and thr_alarm()

my_net_read() calls my_real_read() which calls the functions thr_end_alarm() and thr_alarm(). At this point in the optimization these 2 calls required 99.5% of the CPU time between them. Replacing LOCK_alarm with a spinlock fixed this problem.

Conclusion:

Without too much effort it is possible to make a huge improvement to the threading performance of MySQL. That such bottlenecks have not yet been investigated may be due to the fact that MySQL currently has no performance analysis team.

Following the last optimization, execution time was divided as follows:

25.8% of the time in net_end_statement(), which hangs in net_flush()
32.8% of the time in my_net_read()
7.6% in ha_pbxt::index_read(), this is the time spent in the engine
32.2% in init_sql_alloc() which waits on the spinlock in malloc()

From this you can see that the result is close to optimal, because the program is spending almost 60% of its time waiting on the network.

However, it is also clear where the next optimization would come from: removing the call to malloc() in init_sql_alloc(), which is called by open_tables(). This could be done by reusing the block of memory required by the thread from call to call.
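
A per-thread pool along these lines would do it. This is just a sketch with made-up names, not MySQL's existing allocator: blocks are chained while a statement runs, and at the end of the statement one block is kept around for the next statement instead of freeing everything, so the common case never calls malloc().

/* Sketch of a reusable per-thread statement pool (hypothetical names, not
   MySQL's allocator). Blocks are chained while a statement runs; at reset
   one block is kept for reuse. Alignment handling is omitted. */
#include <stdlib.h>

typedef struct arena_block {
  struct arena_block *next;
  size_t size;
  size_t used;
  /* payload follows this header */
} arena_block;

typedef struct { arena_block *head; } stmt_arena;

static arena_block *block_new(size_t size)
{
  arena_block *b = malloc(sizeof(arena_block) + size);
  if (b) { b->next = NULL; b->size = size; b->used = 0; }
  return b;
}

static void *arena_alloc(stmt_arena *a, size_t len)
{
  if (!a->head || a->head->used + len > a->head->size)
  {
    /* current block missing or full: chain a new one in front */
    arena_block *nb = block_new(len > 4096 ? len : 4096);
    if (!nb)
      return NULL;
    nb->next = a->head;
    a->head = nb;
  }
  void *p = (char *)(a->head + 1) + a->head->used;
  a->head->used += len;
  return p;
}

static void arena_reset(stmt_arena *a)
{
  if (!a->head)
    return;
  arena_block *keep = a->head;       /* keep the newest block for reuse */
  for (arena_block *b = keep->next; b; )
  {
    arena_block *next = b->next;
    free(b);
    b = next;
  }
  keep->next = NULL;
  keep->used = 0;
}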

Ultimately, the goal of optimizing for scale like this is to bring the code to the point that it is either network, CPU, or disk bound. Only then will the end-user really see an improvement in performance as the hardware is upgraded.

I think I have shown that it is worth putting some effort into such optimizations. Even more so as multi-core systems become more and more commonplace.

12 comments:

Anonymous said...

Most of these issues are already being worked on and MySQL does have a performance analysis team.

Anonymous said...

Hi!

In a week I have something hilarious to share with you :)

BTW you missed the nasty lock contention around the query cache... but unless you have implemented it you probably have not noticed it.

Dozens more are available as well.

Cheers,
-Brian

Mark Callaghan said...

I think you hit http://bugs.mysql.com/bug.php?id=34409 for LOCK_plugin. Eventually, one of the 5.1 RCs will fix it.

Other issues include:
* there are no mutex contention stats. SHOW MUTEX STATS displays the count of threads that sleep on an InnoDB mutex. Above InnoDB there are no stats.

* there are no spin locks. Some of these mutexes are held for a short time and it makes sense to spin waiting for the lock rather than going to sleep immediately.

Mark Callaghan said...

I would love to see my_net_read rewritten to not set an alarm. The code today for net read and write calls setsockopt to put the socket in non-blocking mode and tries the read or write; if that fails, the socket is put into blocking mode, an alarm is set and the read or write is retried. Beyond the locking overhead for alarms there are a lot of system calls used here. For many systems (Linux) this could be replaced by a read/write with a timeout.
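
A read with a timeout, as described here, can be done roughly like this on Linux and most Unixes; this is a sketch, not the my_net_read code.

/* Sketch: a socket read with a timeout via SO_RCVTIMEO instead of alarms
   and switching the socket between blocking and non-blocking mode. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_sec)
{
  struct timeval tv;
  tv.tv_sec  = timeout_sec;
  tv.tv_usec = 0;
  if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0)
    return -1;
  return recv(fd, buf, len, 0);   /* returns -1 with errno EAGAIN on timeout */
}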

Anonymous said...

In fact, for any system my_net_read could be rewritten with a simple select and non-blocking read/write.
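
That approach would look roughly like the sketch below: wait with select() and then do a non-blocking read. The MSG_DONTWAIT flag is Linux/BSD specific; a portable version would set O_NONBLOCK once on the socket instead.

/* Sketch: select() with a timeout followed by a non-blocking read, another
   way to avoid the alarm machinery. Not the my_net_read code. */
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

ssize_t select_then_read(int fd, void *buf, size_t len, int timeout_sec)
{
  fd_set readable;
  struct timeval tv = { timeout_sec, 0 };

  FD_ZERO(&readable);
  FD_SET(fd, &readable);

  int ready = select(fd + 1, &readable, NULL, NULL, &tv);
  if (ready <= 0)
    return ready;                            /* 0 = timeout, -1 = error */
  return recv(fd, buf, len, MSG_DONTWAIT);   /* non-blocking read */
}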

Anonymous said...

Paul, did you try to ./configure --with-fast-mutexes ?
You get mutexes with spinlocks that way.

LOCK_plugin issue is BUG#34409
pthread_attr_setschedparam issue is BUG#35164

Anonymous said...

out of interest, what profiling tools were you using?

Mark Callaghan said...

We got big speedups from linking with tcmalloc from Google perftools to replace malloc. With that, we were able to get rid of a custom malloc cache used by InnoDB (guarded by one mutex).

Paul McCullagh said...

Hi Sergei,

No, I did not compile --with-fast-mutexes. That would be interesting to try.

I used the same compile options used to build the MySQL binary distribution.

Any ideas why --with-fast-mutexes is not used by default?

Paul McCullagh said...

Hi Stewart,

I use Mac OS X for development so getting a profile of a running process is really easy.

I used the Activity Monitor which has a "Sample Process" option.

Paul McCullagh said...

Hi Anonymous,

Please excuse my ignorance.

I know that Kelly Long is a Performance Architect, and does great work for Falcon, but I didn't know there was a performance analysis team as such.

It would be great to know who's on the team...

NMMM.NU said...

The default value for --with-fast-mutexes is enabled, so even if you skip it, you will get them.