Wednesday, March 25, 2009

Solving the PBXT DBT2 Scaling Problem

One little bit of wisdom I would like to pass on:

If a program runs fast with 20 threads, that does not mean it will run fast with 50. And if it runs fast with 50, it does not mean that it will run fast with 100, and if it runs fast with 100 ... don't bet on it running fast with 200 :)

In my last blog I discussed some improvements to the performance of PBXT running the DBT2 benchmark. Despite the overall significant increase in performance, I noted a drop-off at 32 threads that indicated a scaling problem. For the last couple of weeks I have been working on this problem, and I have managed to fix it:

As before, this test was done using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database. The test is memory bound and does not test the effects of checkpointing.

PBXT Baseline is the code revision indicated as PBXT 1.0.08 in my last blog. PBXT 1.0.07 is the current PBXT GA release version. PBXT 1.0.08 is the latest revision of the PBXT trunk. The baseline graph shows the extent of the scaling problem of the last version.

The latest version is over 20 times faster than PBXT 1.0.07 and 140% faster than the previous version. But most important is the fact that performance remains almost constant as the number of threads increases.

My thanks also to InnoDB which, looking at it positively, offers an excellent measure of how well you are doing! :) It looks like PBXT now actually scales better than InnoDB for this type of test.

So what has changed?

Basically I have made two changes: one major, and one smaller but still significant. The first change, which got PBXT running faster with 50 threads, has to do with conflict handling.

As I mentioned before, DBT2 causes a lot of row-level conflicts. This is especially the case as the number of threads increases. In fact, at any given time during the test with 100 threads (performance results above), 80 of the threads are waiting for row locks. (Of the remaining 20, 4 are waiting for network I/O, and the rest are doing the actual work!)

The result is that, if the handling of these conflicts is not optimal, the engine loses a lot of time, which you can clearly see from both the baseline and 1.0.07 results reported above.

To fix this I completely rewrote the row-level conflict handling. Code paths are now much shorter for row-lock/update detection and handling, and threads are now notified directly when they can continue.
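The idea of notifying waiters directly, rather than having them poll, can be sketched with a condition variable. This is a minimal illustration, not PBXT's actual code; the `row_lock_t` type and function names are hypothetical:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical sketch: a row-lock waiter sleeps on a condition variable
 * and is woken directly by the lock holder, instead of polling with a
 * fixed-length sleep. Names are illustrative, not PBXT's real API. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  free_cond;
    bool            locked;
} row_lock_t;

void row_lock_acquire(row_lock_t *lock) {
    pthread_mutex_lock(&lock->mutex);
    while (lock->locked)                    /* sleep until the holder signals us */
        pthread_cond_wait(&lock->free_cond, &lock->mutex);
    lock->locked = true;
    pthread_mutex_unlock(&lock->mutex);
}

void row_lock_release(row_lock_t *lock) {
    pthread_mutex_lock(&lock->mutex);
    lock->locked = false;
    pthread_cond_signal(&lock->free_cond);  /* wake one waiter directly */
    pthread_mutex_unlock(&lock->mutex);
}
```

With direct signaling, a waiter continues as soon as the conflicting transaction finishes, rather than after an arbitrary sleep interval.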

The other change I made involved the opening and closing of tables when the MySQL open table cache is too small. This is something that really killed performance starting at about 100 threads. PBXT was doing quite a bit of unnecessary stuff on open and close table, which was fairly easy to move out.
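One common way to make handler open/close cheap, which I'll sketch here as an assumption (this is not necessarily what PBXT does), is to keep the expensive per-table state in a shared, reference-counted structure, so reopening a table after a cache eviction only bumps a counter:

```c
#include <stdbool.h>

/* Hypothetical sketch: expensive per-table setup runs once and is kept in
 * a shared structure; handler open/close just adjusts a reference count.
 * Names and fields are illustrative, not PBXT's real code. */
typedef struct {
    int  ref_count;
    int  init_count;     /* how many times the expensive setup actually ran */
    bool initialized;
} table_share_t;

void share_open(table_share_t *share) {
    if (!share->initialized) {
        /* expensive setup (read file headers, index metadata, ...) simulated */
        share->init_count++;
        share->initialized = true;
    }
    share->ref_count++;
}

void share_close(table_share_t *share) {
    share->ref_count--;   /* real teardown deferred until drop/shutdown */
}
```

Under this pattern, MySQL churning its table cache at high thread counts costs only a counter increment per open instead of a full re-initialization.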

So now that the scaling is good up to 200 threads, should I assume that performance will also be good for 400 threads? Of course it is! Well, at least until I test it... :)

Friday, March 06, 2009

Improving PBXT DBT2 Performance

DBT2, with over 40% conflicts, is a very challenging benchmark, especially for an MVCC-based engine. As a result, it is not a test that an engine is automatically good at. InnoDB has been extensively optimized for DBT2, and it shows.

For the last few weeks I have had the opportunity to focus on PBXT DBT2 performance for the first time. I started with a memory bound DBT2 test and the current state of this work is illustrated below.

These results were achieved using MySQL 5.1.30 on an 8 core, 64-bit, Linux machine with an SSD drive and a 5 warehouse DBT2 database.

The dip at 32 threads is left as an exercise for the reader :) Patches will be accepted!

So what were the major changes that led to this improvement?

Don't Wait Too Long!

When I began the optimizations, PBXT was only using 120% CPU (i.e. just over 1 core), while InnoDB uses 440% (i.e. about 4.4 cores). I noticed that the pauses I was using in some situations were way too long (even at 1/1000 of a second). For example, shortly before commit, PBXT waits if some other transaction may soon commit, in order to improve group commit efficiency.

If the pause is too long the program waits around even after the condition to continue has been fulfilled. So I changed these pauses to a yield. However, I have noticed that even a yield causes some threads to wait too long, so I am considering putting in a small amount of spinning.

Anyway, after that change, PBXT performance was still only 50% of InnoDB.

Too Many memcpy's

The next problem was the number of memcpy's in the standard index operations: search, insert and delete. PBXT was using the index cache like a disk with a read and write call interface. Such functions involve a buffer and a memcpy on every access. On average 8K was transferred per call, which is significant.

To get rid of the memcpy's I had to change the way the indexing system and the index cache work together. This was a major change. Instead of an index operation using a buffer and a copy it now "pins" the index cache page, and accesses the index cache page directly.
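The difference between the two interfaces can be sketched like this. The types and names are hypothetical, chosen just to contrast a copying read call with page pinning:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (not PBXT's real API): a read-call interface copies
 * the whole page into a caller buffer, while a pin interface hands back a
 * pointer into the cache and bumps a pin count so the page cannot be
 * evicted while it is in use. */
#define PAGE_SIZE 8192

typedef struct {
    uint8_t data[PAGE_SIZE];
    int     pin_count;
} cache_page_t;

/* Old style: one memcpy of ~8K per access. */
void cache_read(cache_page_t *page, uint8_t *buffer) {
    memcpy(buffer, page->data, PAGE_SIZE);
}

/* New style: pin the page and work on its bytes in place. */
uint8_t *cache_pin(cache_page_t *page) {
    page->pin_count++;          /* eviction must skip pinned pages */
    return page->data;
}

void cache_unpin(cache_page_t *page) {
    page->pin_count--;
}
```

The cost of the pinning approach is that cache eviction and concurrent writers must respect the pin count, which is exactly why this was a major change to how the indexing system and the cache cooperate.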

Unfortunately I didn't see the improvement that I expected after this change because I ran straight into the next problem...

Updating an Index in Parallel

Threads were now waiting on an exclusive lock required to perform modifications to an index. I was using an exclusive lock because it is simply the easiest way to keep indices consistent during an update.

To fix this I found a way to modify an index in parallel with other readers and writers in over 90% of index update cases. This was one of the most complex changes, although the idea behind the solution is relatively straightforward.

But, I have decided I'm not going to say how I did it here ... for that you will have to attend my (plug) talk at the User's Conference! He he :)

The Cost of Index Scans

At this stage PBXT was running 3 of the 5 procedures used by DBT2 (slightly) faster than InnoDB. The remaining problem involved the index scan, something that InnoDB is pretty good at.

In order to scan an index, PBXT was making a copy of each index page, and then stepping through the items. A copy of the page was required because after each step the engine returns a row to MySQL. So, all-in-all, the time taken to scan a page is too long to pin the index cache page.

To avoid making a copy of each page scanned I implemented index cache page "handles". An index scanner now gets a handle to the index page it is scanning. The handles are "copy-on-write", so changes to the index page are transparent to the scanner.
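A copy-on-write handle scheme of this kind can be sketched as below. This is my own minimal illustration under stated assumptions, not PBXT's implementation; lifetime management of the detached copy is omitted for brevity:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of copy-on-write page handles: a scanner's handle
 * keeps pointing at the original bytes; an updater that sees outstanding
 * handles copies the page first and modifies the copy, so the scan is
 * unaffected. Freeing the detached copy is omitted here. */
#define PAGE_SIZE 8192

typedef struct {
    uint8_t *data;
    int      handle_count;
} index_page_t;

typedef struct {
    uint8_t *data;               /* stable view for the scanner */
} page_handle_t;

page_handle_t page_get_handle(index_page_t *page) {
    page->handle_count++;
    return (page_handle_t){ page->data };
}

/* Before modifying, detach any scanners by copying the page. */
void page_modify(index_page_t *page, size_t offset, uint8_t value) {
    if (page->handle_count > 0) {
        uint8_t *copy = malloc(PAGE_SIZE);
        memcpy(copy, page->data, PAGE_SIZE);
        page->data = copy;       /* handles keep seeing the old bytes */
        page->handle_count = 0;  /* old copy now belongs to the handles */
    }
    page->data[offset] = value;
}
```

The win is that the common case (no concurrent update of a scanned page) costs nothing beyond a counter, and a copy is only made when a writer actually collides with a scanner.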

Work in Progress...

So DBT2 performance of PBXT is now more or less on par with InnoDB for memory bound tests. There are some scaling issues which I will look at next, and I have not yet investigated the effects of a disk bound load and checkpointing.

I also had a quick look at the mysqlslap performance following the optimizations. Some of it is great and some of it is "interesting". But I think I'll leave that for another blog...