Friday, March 06, 2009

Improving PBXT DBT2 Performance

DBT2, with over 40% conflicts, is a very challenging benchmark, especially for an MVCC-based engine. As a result, it is not a test that an engine is automatically good at. InnoDB has been extensively optimized for DBT2, and it shows.

For the last few weeks I have had the opportunity to focus on PBXT DBT2 performance for the first time. I started with a memory bound DBT2 test and the current state of this work is illustrated below.

These results were achieved using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database.

The dip at 32 threads is left as an exercise for the reader :) Patches will be accepted!

So what were the major changes that led to this improvement?

Don't Wait Too Long!


When I began the optimizations, PBXT was only using 120% CPU (i.e. just over 1 core), while InnoDB was using 440% (i.e. about 4.5 cores). I noticed that the pauses I was using in some situations were way too long (even at 1/1000 of a second). For example, shortly before commit, PBXT waits if some other transaction may soon commit, in order to improve group commit efficiency.

If the pause is too long, the thread waits around even after the condition to continue has been fulfilled. So I changed these pauses to a yield. However, I have noticed that even a yield causes some threads to wait too long, so I am considering putting in a small amount of spinning.
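To make the idea concrete, here is a minimal sketch in C of a spin-then-yield wait. This is not the actual PBXT code: the flag, the spin limit and the function names are made up purely for illustration.

#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 50   /* hypothetical tuning value */

/* Set by the committing thread once the group commit flush is done. */
static atomic_bool group_commit_done;

static void wait_for_group_commit(void)
{
    int spins = 0;

    while (!atomic_load_explicit(&group_commit_done, memory_order_acquire)) {
        if (spins++ < SPIN_LIMIT)
            continue;        /* brief spin: cheap when the wait is short */
        sched_yield();       /* then yield instead of sleeping for 1ms */
    }
}

The point is that the waiter reacts as soon as the condition becomes true, instead of sleeping through a fixed 1/1000 of a second after the commit it was waiting for has already happened.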

Anyway, after that change, PBXT performance was still only 50% of InnoDB.

Too Many memcpy's

The next problem was the number of memcpy's in the standard index operations: search, insert and delete. PBXT was using the index cache like a disk, with a read and write call interface. Such calls involve a buffer and a memcpy on every access. On average 8K was transferred per call, which is significant.

To get rid of the memcpy's I had to change the way the indexing system and the index cache work together. This was a major change. Instead of using a buffer and a copy, an index operation now "pins" the index cache page and accesses it directly.
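Roughly speaking, the difference looks like this (a simplified sketch; the structures and names are not PBXT's real index cache API):

#include <stdint.h>
#include <string.h>

#define INDEX_PAGE_SIZE (16 * 1024)

typedef struct index_page {
    int     pin_count;               /* > 0: the page may not be evicted */
    uint8_t data[INDEX_PAGE_SIZE];
} index_page;

/* Old style: every index operation copies the page out of the cache. */
static void read_index_page(index_page *cached, uint8_t *buffer)
{
    memcpy(buffer, cached->data, INDEX_PAGE_SIZE);   /* ~8K per call */
}

/* New style: pin the page and work on the cached data in place. */
static uint8_t *pin_index_page(index_page *cached)
{
    cached->pin_count++;     /* in real code this needs a lock or an atomic */
    return cached->data;
}

static void unpin_index_page(index_page *cached)
{
    cached->pin_count--;
}

While the pin count is non-zero the cache may not evict or move the page, so the operation can work on the page without any copy at all.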

Unfortunately I didn't see the improvement that I expected after this change because I ran straight into the next problem...

Updating an Index in Parallel

Threads were now waiting on an exclusive lock required to perform modifications to an index. I was using an exclusive lock because it is simply the easiest way to keep indices consistent during an update.

To fix this I found a way to modify an index in parallel with other readers and writers in over 90% of index update cases. This was one of the most complex changes, although the idea behind the solution is relatively straightforward.

But, I have decided I'm not going to say how I did it here ... for that you will have to attend my (plug) talk at the User's Conference! He he :)

The Cost of Index Scans

At this stage PBXT was running 3 of the 5 procedures used by DBT2 (slightly) faster than InnoDB. The remaining problem involved the index scan, something that InnoDB is pretty good at.

In order to scan an index, PBXT was making a copy of each index page, and then stepping through the items. A copy of the page was required because after each step the engine returns a row to MySQL. So, all in all, the time taken to scan a page was too long to keep the index cache page pinned.

To avoid making a copy of each page scanned I implemented index cache page "handles". An index scanner now gets a handle to the index page it is scanning. The handles are "copy-on-write", so changes made to the index page by other threads are not seen by the scanner.
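Here is a rough sketch in C of what a copy-on-write page handle could look like (again, the names and structure are illustrative assumptions, not the actual PBXT implementation):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define INDEX_PAGE_SIZE (16 * 1024)

typedef struct page_handle {
    uint8_t *data;          /* points into the index cache while unmodified */
    uint8_t *private_copy;  /* allocated lazily, only if a writer changes the page */
} page_handle;

/* The scanner opens a handle instead of copying the page up front. */
static void handle_open(page_handle *h, uint8_t *cached_page)
{
    h->data = cached_page;
    h->private_copy = NULL;
}

/* Called before a writer modifies the cached page: the open handle gets
 * its own copy, so the scanner continues to see the old page contents. */
static void handle_copy_on_write(page_handle *h)
{
    if (!h->private_copy) {
        h->private_copy = malloc(INDEX_PAGE_SIZE);
        memcpy(h->private_copy, h->data, INDEX_PAGE_SIZE);
        h->data = h->private_copy;
    }
}

static void handle_close(page_handle *h)
{
    free(h->private_copy);
    h->private_copy = NULL;
    h->data = NULL;
}

The copy is only made when a page actually changes during a scan, so in the common case the scanner never pays for a memcpy at all.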

Work in Progress...

So DBT2 performance of PBXT is now more or less on par with InnoDB for memory bound tests. There are some scaling issues which I will look at next, and I have not yet investigated the effects of a disk bound load and checkpointing.

I also had a quick look at the mysqlslap performance following the optimizations. Some of it is great and some of it is "interesting". But I think I'll leave that for another blog...

11 comments:

rpbouman said...

Paul, congrats!

great results ;)

Looking forward to hearing about parallel index updates at the UC.

Roland

Anonymous said...

PBXT is already a little rockstar :-)
How is the HA going? I read the paper from the MySQL Conference 2008, so I'm curious.

Giuseppe Maxia said...

Well done, Paul!
It seems to me that, by looking into DBT2 performance, you have found and fixed some general purpose bottlenecks.

Looking forward to seeing the next enhancements!

Giuseppe

ryan said...

Out of curiosity, was InnoDB in your test set to flush at 0, 1, or 2?

Either way some very promising results!

Ronald Bradford said...

A 100%+ improvement in any performance analysis is a good result.

It's a shame we have to rely on standard tests such as DBT2 for comparison when these don't always represent the real world.

It would be good for the readers to also know the specifics of the SSD in use, as there is some variance between providers.

Paul McCullagh said...

Hi Erkan,

About the HA solution. The first step is asynchronous engine level replication.

We hope to be able to show something in this area by the 3rd quarter this year.

So we still have a way to go ... :)

Paul McCullagh said...

Hi Roland & Giuseppe!

Thanks to both of you for your support.

I look forward to meeting again at the conference...

Paul McCullagh said...

Hi Ryan,

innodb_flush_log_at_trx_commit was set to 1.

Both PBXT and InnoDB are flushing every commit in the test, i.e. full durability for all transactions.

The TPMs are quite high, largely due to the SSD's high flush rate.

Paul McCullagh said...

Hi Ronald!

You are absolutely right about DBT2. It's an old benchmark which doesn't have all that much to do with most applications being run on MySQL today.

With your field experience you know this better than me.

But I have been told it is an important industry standard, so "I just got to do it" :)

Regarding SSDs, you are right, there is a huge difference between drives today.

The one I am using is a fairly old (> 1 year) 32GB Mtron drive. I don't know the exact characteristics but the newer drives, for example from Intel, are a lot faster.

Unknown said...

hi Paul,
I tried to build and run dbt2 from Launchpad (https://code.launchpad.net/~dbt2-core-dev/dbt2/trunk), and found that script/run_workload.sh calls src/driver with a wrong argument.
The argument "-d ${CLIENT_HOSTNAME}" is rejected by src/driver.
How do I fix the problem, Paul?
Thank you.

Paul McCullagh said...

Hi Peter,

Must be a bug. I have been using --database=[db-name] to indicate the database.

Note that by default the dbt2 build has the STANDALONE switch enabled. This makes things a lot simpler.

The driver calls the database directly and does not use the "terminal server".

This makes it possible to start the test without using the script, for example:

mkdir output
mkdir output/report
mkdir output/driver
src/driver -l 30 -wmin 1 -wmax 5 -w 5 -sleep 0 -tpw 8 -outdir output/driver -ktd 0 -ktn 0 -kto 0 -ktp 0 -kts 0 -ttd 0 -ttn 0 -tto 0 -ttp 0 -tts 0 --socket=/tmp/mysql.sock --database=dbt2_pbxt_w5 -dbc 16
src/analyze_dbt_2 output
cat output/report/mix.log.summary