Monday, July 14, 2008

Mutex contention and other bottlenecks in MySQL

Over the last few weeks I have been doing some work on improving the concurrency performance of PBXT. The last Alpha version (1.0.03) has quite a few problems in this area.

Most of the problems have been with r/w lock and mutex contention, but I soon discovered that MySQL has some serious problems of its own. In fact, I had to remove some of the bottlenecks in MySQL in order to continue the optimization of PBXT.

The result for simple SELECT performance is shown in the graph below.

Here you can see that the gain is over 60% for 32 or more concurrent threads. Both results show the performance with the newly optimized version of PBXT. The test is running on a 2.16 GHz dual core processor, so I expect an even greater improvement on 4 or 8 cores. The query I ran for this test is of the form SELECT * FROM table WHERE ID = ?.

So what did I do to achieve this? Well first of all, as you will see below, I cheated in some cases. I commented out or avoided some locks that were a bit too complicated to solve properly right now. But in other cases, I used solutions that can actually be taken over, as-is, by MySQL. In particular, the use of spinlocks.

All-in-all though, my intention here is just to demonstrate the potential for concurrency optimization in MySQL.

Optimization 1: LOCK_plugin in plugin_foreach_with_mask()

The LOCK_plugin mutex in plugin_foreach_with_mask() is the first bottleneck you hit in just about any query. In my tests with 32 threads it takes over 60% of the overall execution time.

In order to get further with my own optimizations, I commented out the pthread_mutex_lock() and pthread_mutex_unlock() calls in this function, knowing that the lock is only really needed if plug-ins are installed or uninstalled. However, later I needed to find a better solution (see below).

Optimization 2: LOCK_grant in check_grant()

After removing the above bottleneck I hit a wall in check_grant(). pthread_rwlock_rdlock() was taking 50%, and pthread_rwlock_unlock() was taking 45.6% CPU time! Once again I commented out the calls rw_rdlock(&LOCK_grant) and rw_unlock(&LOCK_grant) in check_grant() to get around the problem.

In order to really eliminate this lock, MySQL needs to switch to a different type of read/write lock. 99.9% of the time only a read lock is required because a write lock is only required when loading and changing privileges.

For similar purposes, in PBXT, I have invented a special type of read/write lock that requires almost zero time to gain a read lock ... hmmmm ;)
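To illustrate how a read lock can be made very cheap, here is a minimal sketch of one well-known approach: each reader thread owns a private flag, and only the (rare) writer pays for checking all of them. This is not the PBXT lock, whose design is not described in this post. All names (demo_rwlock_t, DEMO_MAX_READERS, the slot parameter) are hypothetical, and the sketch assumes every reader thread has been given its own slot number.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_MAX_READERS 64   /* assumed upper bound on reader threads */

typedef struct {
    atomic_bool     writer;                     /* a writer wants or holds the lock */
    atomic_bool     reading[DEMO_MAX_READERS];  /* one flag per reader slot */
    pthread_mutex_t write_mutex;                /* serializes writers */
} demo_rwlock_t;

void demo_rw_init(demo_rwlock_t *l)
{
    atomic_init(&l->writer, false);
    for (int i = 0; i < DEMO_MAX_READERS; i++)
        atomic_init(&l->reading[i], false);
    pthread_mutex_init(&l->write_mutex, NULL);
}

/* Read lock: one store and one load when no writer is active. */
void demo_rw_read_lock(demo_rwlock_t *l, int slot)
{
    for (;;) {
        atomic_store(&l->reading[slot], true);
        if (!atomic_load(&l->writer))
            return;
        /* A writer is active: back off and wait until it is done. */
        atomic_store(&l->reading[slot], false);
        while (atomic_load(&l->writer))
            sched_yield();
    }
}

void demo_rw_read_unlock(demo_rwlock_t *l, int slot)
{
    atomic_store(&l->reading[slot], false);
}

/* Write lock: announce the writer, then wait for every reader to leave. */
void demo_rw_write_lock(demo_rwlock_t *l)
{
    pthread_mutex_lock(&l->write_mutex);
    atomic_store(&l->writer, true);
    for (int i = 0; i < DEMO_MAX_READERS; i++)
        while (atomic_load(&l->reading[i]))
            sched_yield();
}

void demo_rw_write_unlock(demo_rwlock_t *l)
{
    atomic_store(&l->writer, false);
    pthread_mutex_unlock(&l->write_mutex);
}
```

The trade-off is that the write lock becomes expensive, which only pays off when writes are as rare as privilege changes. A production version would probably also pad the reader flags onto separate cache lines.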

Optimization 3: Mutex in LOCK and UNLOCK tables

I then discovered that 51.7% of the time was taken in pthread_mutex_lock() called from thr_lock() called from open_and_lock_tables().
And, 44.5% of the time was taken in pthread_mutex_lock() called from thr_unlock() called from mysql_unlock_tables().

Now this is a tough nut. The locks used here are used all over the place, but I think they can be replaced with a spinlock to good effect (see below). I did not try this though. Instead I used LOCK TABLES in my test code, to avoid the calls to LOCK and UNLOCK tables for every query.

Optimization 4: LOCK_plugin in plugin_unlock_list()

Once again the LOCK_plugin is the bottleneck, this time taking 94.7% of the CPU time in plugin_unlock_list(). This time I did a bit of work. Instead of commenting it out, I replaced LOCK_plugin with a spinlock (I copied and adapted the PBXT engine implementation for the server).

This worked to remove the bottleneck because LOCK_plugin is normally only held for a very short time. However, when a plugin is installed or uninstalled this lock will be a killer and some more work probably needs to be done here.
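For readers who have not seen one, a spinlock can be sketched in a few lines of C11. This is only an illustration of the technique, assuming a compiler with <stdatomic.h>; it is not the PBXT implementation mentioned above, and the demo_ names are made up for the example.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical spinlock type; not the PBXT or MySQL code. */
typedef struct {
    atomic_flag held;
} demo_spinlock_t;

void demo_spin_init(demo_spinlock_t *sl)
{
    atomic_flag_clear(&sl->held);
}

/* Returns 1 if the lock was acquired, 0 if somebody else holds it. */
int demo_spin_trylock(demo_spinlock_t *sl)
{
    return !atomic_flag_test_and_set_explicit(&sl->held, memory_order_acquire);
}

void demo_spin_lock(demo_spinlock_t *sl)
{
    unsigned spins = 0;

    while (!demo_spin_trylock(sl)) {
        /* Spin for a while, then yield so a preempted holder can run. */
        if (++spins % 1024 == 0)
            sched_yield();
    }
}

void demo_spin_unlock(demo_spinlock_t *sl)
{
    atomic_flag_clear_explicit(&sl->held, memory_order_release);
}
```

The lock never puts the waiting thread to sleep in the kernel, which is why it only makes sense when the protected section is very short.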

Optimization 5: pthread_setschedparam()

I was a bit shocked to find pthread_setschedparam() was now taking 17% of the CPU time required to execute the SELECT. This call can be easily avoided by first checking to see if the schedule parameter needs to be changed at all. For the moment, I commented the call out.

Of course, the more optimized the code is, the worse such a call becomes. After all other optimizations pthread_setschedparam() CPU time increases to 52.6%!
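The check described above could look something like the following sketch. It assumes the priority is only ever changed through this one helper, so that a per-thread cached value stays accurate; the helper and variable names are hypothetical and do not exist in the MySQL source.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical per-thread cache of the last scheduling values set. */
static _Thread_local int demo_cached_policy = -1;    /* -1: nothing set yet */
static _Thread_local int demo_cached_priority = -1;
static _Thread_local unsigned long demo_sched_calls = 0;  /* for illustration */

/* Returns 0 on success (or when no change was needed), else an error code. */
int demo_set_priority(int policy, int priority)
{
    struct sched_param param = {0};
    int err;

    if (policy == demo_cached_policy && priority == demo_cached_priority)
        return 0;                 /* unchanged: skip the expensive call */

    param.sched_priority = priority;
    demo_sched_calls++;
    err = pthread_setschedparam(pthread_self(), policy, &param);
    if (err == 0) {
        demo_cached_policy = policy;
        demo_cached_priority = priority;
    }
    return err;
}
```

With this in place, a thread that asks for the same priority on every query only pays for the first call.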

Optimization 6: LOCK_thread_count in dispatch_command()

The LOCK_thread_count mutex in dispatch_command() is next in line with 96.1% of the execution time.

Changing this to a spinlock completely removes the bottleneck.

Optimization 7: LOCK_alarm in thr_end_alarm() and thr_alarm()

my_net_read() calls my_real_read() which calls the functions thr_end_alarm() and thr_alarm(). At this point in the optimization these 2 calls required 99.5% of the CPU time between them. Replacing LOCK_alarm with a spinlock fixed this problem.

Conclusion:

Without too much effort it is possible to make a huge improvement to the threading performance of MySQL. The fact that such bottlenecks have not yet been investigated may be due to the fact that MySQL currently has no performance analysis team.

Following the last optimization, execution time was divided as follows:

25.8% of the time in net_end_statement(), which hangs in net_flush()
32.8% of the time in my_net_read()
7.6% in ha_pbxt::index_read(), this is the time spent in the engine
32.2% in init_sql_alloc() which waits on the spinlock in malloc()

From this you can see that the optimization is almost optimal because the program is spending almost 60% of its time waiting on the network.

However, it is also clear where the next optimization would come from. Remove the call to malloc() in init_sql_alloc() which is called by open_tables(). This could be done by reusing the block of memory required by the thread, from call to call.
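One way to reuse memory from call to call is a small per-thread arena that is allocated once and simply reset after each statement. The sketch below shows the idea only; it assumes a fixed block size and hypothetical demo_arena_ names, and it is not MySQL's own allocator API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread arena, kept alive between statements. */
typedef struct {
    char  *base;       /* block allocated once and reused */
    size_t capacity;
    size_t used;
} demo_arena_t;

/* Allocate the block once, e.g. when the thread starts. Returns 0 on success. */
int demo_arena_init(demo_arena_t *a, size_t capacity)
{
    a->base = malloc(capacity);
    a->capacity = a->base ? capacity : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Hand out memory from the kept block; NULL means "fall back to malloc()". */
void *demo_arena_alloc(demo_arena_t *a, size_t size)
{
    const size_t align = _Alignof(max_align_t);
    void *p;

    if (size > a->capacity - a->used)
        return NULL;
    p = a->base + a->used;
    /* Round up so that the next allocation stays suitably aligned. */
    size = (size + align - 1) / align * align;
    a->used = size > a->capacity - a->used ? a->capacity : a->used + size;
    return p;
}

/* Called at the end of a statement: no free(), no malloc() next time. */
void demo_arena_reset(demo_arena_t *a)
{
    a->used = 0;
}

void demo_arena_destroy(demo_arena_t *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = a->used = 0;
}
```

Because the arena belongs to one thread, no lock is needed, so the wait on the malloc() spinlock disappears for the common case.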

Ultimately, the goal of optimizing for scale like this is to bring the code to the point that it is either network, CPU, or disk bound. Only then will the end-user really see an improvement in performance as the hardware is upgraded.

I think I have shown that it is worth putting some effort into such optimizations. Even more so as multi-core systems become more and more commonplace.

Friday, June 13, 2008

PBXT compiles without change under MySQL 5.1.25!

OK, now I know that the GA version of 5.1 is rapidly approaching. PBXT compiles with the latest release of MySQL without any changes!

This has never been the case before. Just search the PBXT code for MYSQL_VERSION_ID, and you will find things like:
#if MYSQL_VERSION_ID < 50114
XT_RETURN_VOID;
#else
XT_RETURN(0);
#endif
and, even worse:
#if MYSQL_VERSION_ID < 60000
#if MYSQL_VERSION_ID >= 50124
#define USE_CONST_SAVE
#endif
#else
#if MYSQL_VERSION_ID >= 60005
#define USE_CONST_SAVE
#endif
#endif
The lack of changes that affect pluggable storage engines can only mean that the bug fixes required are diminishing in scope.

And I believe this is a far better gauge of whether GA is close than any other marketing orientated statements! :)

Wednesday, June 04, 2008

PBXT 1.0.03 Alpha has been released!

I have released PBXT 1.0.03 Alpha and it is available for download from http://www.primebase.org/download. I have also posted binary plugins for a few platforms.

If you are building from source I have added a Quick Guide: Building and Installing PBXT from Source, which I hope makes the task really simple. If not, I would appreciate any feedback!

With this version I have completed the implementation of full-durability, and other features that are scheduled for RC and ultimately for the first GA release.

Still to be done is the Windows port which I plan to do before the first Beta release.

Please send any comments, questions, bug reports, etc. directly to me: paul dot mccullagh at primebase dot org.

Thursday, May 01, 2008

PBXT & BLOB Streaming Conference Presentations & Videos

The slides of my presentations at the MySQL Conference & Expo 2008 are now available for download. Videos of the presentations have been uploaded to YouTube:

Inside the PrimeBase XT Storage Engine

Presentation: pbxt-uc-2008.pdf
Videos: Part 1/7, Part 2/7, Part 3/7, Part 4/7, Part 5/7, Part 6/7, Part 7/7

Introduction to the BLOB Streaming Project

Presentation: mybs-uc-2008.pdf
Videos: Part 1/5, Part 2/5, Part 3/5, Part 4/5, Part 5/5

With this link you will find all the videos at once. If you watch the movies, then it may help to look at the PDF presentation slides at the same time, because the video quality is "not ideal" :)

Tuesday, April 22, 2008

Sun is serious about Open Source and the MySQL Community

In probably the best move by Sun during the whole MySQL Conference and Expo, Rich Green and Jonathan Schwartz turned up at the Community Dinner on the Sunday night before the conference.

As we walked into the restaurant I saw a face that I thought was familiar. Jonathan and Rich were standing outside the restaurant talking. However, only when we got inside did I hear Jay saying that that was Jonathan Schwartz.

So just before we all took our places, and while we were trying to work out how we were going to organize payment for the dinner, Rich and Jonathan turned up and quickly ended the discussion. Rich said his credit card would be good for the tab. So thanks to Sun for that!

But besides good food and plenty to drink, it was a great opportunity to talk and ask some questions that have been on my mind since the acquisition of MySQL by Sun. I have expressed these concerns on this blog, and they can be summarized as follows:

How important is open source, and in particular the MySQL community to Sun?

Both Rich and Jonathan were able to give me an adequate answer to this question. I will summarize this in my own words.

Sun bought MySQL to expand its business and influence in the open source world. So the MySQL community is the key to this.

I believe this means that Sun is not interested in commercializing any parts of the MySQL server, and here I am referring to the massive discussion that has resulted from the announcement MySQL to launch new features only in MySQL Enterprise on Jeremy Cole's blog. After all, it is clear that MySQL's bottom line (although profitable) makes no difference to Sun. They are interested in access to the over 10 million users of MySQL to sell services and hardware, those things that Sun already does well.

It is MySQL's task to expand the user base, not endanger it. So I think we will see a change of strategy in the coming weeks and months.

And I can add the following: from what I have seen of it, MySQL's enterprise offering is really a great package without having to add a proprietary version of the server. It has everything a serious user of MySQL wants: 24 hour support, monitoring tools, design tools, service packs and priority bug fixing. And with Sun's backing, nobody doubts anymore that they can deliver this service.

Jonathan and Rich clearly demonstrated their support for the MySQL community by coming to the dinner. Besides clearing up some important questions, it was a great photo op:



You may have seen this photo already on Ronald's blog. The picture is of Jonathan and me with the PrimeBase Technologies conference T-shirt. If you look closely you will see another little detail. I have a dolphin in my pocket! I wonder if that has any symbolic meaning...

Jonathan tells a great story on his blog. But what is significant is the picture of Monty he posted, who is wearing a shirt that says "my free software runs your company". We have every reason to believe Jonathan fully supports this sentiment. So note that the T-shirt does not say "my partially free software ..."!

Oh, and in the picture of Monty, do you recognize the shirt of the person standing next to him? Since I generally only wear a shirt once, we know that this picture was also taken at the Community Dinner.

Friday, April 11, 2008

BLOB Streaming presentation at the MySQL Conference

My presentation on BLOB Streaming at the MySQL Conference next week will be very practical.

I have made quite a few graphics to show how it works, and plan to demonstrate the current version of the BLOB Streaming engine.

"To BLOB or not to BLOB?" is a common question in the database world. There are advantages and disadvantages to both sides. I'll be explaining why I believe that the "BLOB Repository" (a central component of the BLOB Streaming Architecture) combines the advantages of both approaches.

Check it out:

An Introduction to BLOB Streaming for MySQL Project
3:05pm - 3:50pm Wednesday, 04/16/2008
Ballroom A

Tuesday, April 08, 2008

Replication is dead, long live Replication!

Brian Aker has found general agreement with his post: "The Death of Read Replication".

Arjen Lentz says "I think Brian is right...", and Frank Mash confirmed: "what Brian says about replication, caching and memcached is very true".

Just like Video killed the Radio Star it looks like maybe Memcached killed the Replication Hierarchy!

But of course, Brian and others are talking about replication for scaling reads.

In my session on PBXT next week at the conference I will be talking about how we plan to use synchronous replication to produce an HA solution for MySQL at the engine level.

I will also discuss how some flexibility in the PBXT architecture makes it possible to actually scale writes efficiently as mentioned by Arjen in his blog.

So don't miss it:

Inside the PBXT Storage Engine
10:50am - 11:50am Thursday, 04/17/2008
Ballroom G

Wednesday, April 02, 2008

Welcome Ronald! Great to have you on board!

If you've been following his blog, then you will already know that Ronald Bradford has joined PrimeBase Technologies. We are very pleased to have him on board! As many know, Ronald has always been very active in the MySQL community as far as his job has made this possible.

Ironically during his time at MySQL he was less present in the community than before. When we discussed our plans for PrimeBase with him, Ronald was interested because it was an opportunity to return to a more active role in the community. I am very glad that this motivation was understood by almost everyone at MySQL and we are all looking forward to seeing and hearing more from Ronald.

But, of course, Ronald is not "just a pretty face" ;) He will be helping us to design and specify our open source products (including Blob Streaming). Ronald's extensive experience with both MySQL and end-users will contribute significantly to what we produce.

Ronald will also be helping us to refine our business model. We want all PrimeBase software to be open and free, so we've been thinking hard about how we can make this possible. All this makes it a very exciting time for us, and we will be talking more about our plans in the days and weeks to come.

Of course, Ronald and I will be at the MySQL conference, so be sure to look us up!

Friday, March 14, 2008

New version and a new home for PBXT!

I have just released the first fully durable version of PBXT. Because of the amount of new code I have reverted PBXT to Alpha status. This version, 1.0-alpha, can be downloaded from: http://www.primebase.org/download.

Oh, which reminds me: PBXT now has a new home at http://www.primebase.org, so take a look around! I have actually found a bit of time to write some documentation. Right now the documentation describes building, installation, and the PBXT system parameters. Future additions will include information on performance tuning and a road map for PBXT development.

But there is more to the new home than just a new web-site. The PBXT project is now owned and funded by PrimeBase Technologies, an open source software development company. So altogether this is an important step forward on the road to my goal which is to make PBXT a significant contribution to the MySQL community and business/eco-system.

Besides full durability, the latest release includes the following improvements:
  • Calculation of index statistics as required by the optimizer (execute FLUSH TABLES to refresh the statistics).
  • New system variables: pbxt_log_cache_size, pbxt_log_file_threshold, pbxt_transaction_buffer_size and pbxt_checkpoint_frequency (details here).
  • Implementation of SELECT FOR UPDATE, which performs row-level locking to prevent concurrent updates.
  • Group commit: increases update throughput by committing multiple transactions concurrently.
  • Support for SHOW ENGINE PBXT STATUS, which displays information about memory usage.
What this release does not have is an option to relax durability. The transaction log is always flushed on commit. I plan to add a system parameter shortly that will allow you, in the spirit of the original version, to trade durability for performance if this suits your application.

Even better would be to be able to specify this per table. Now if only MySQL would allow engines to specify custom table attributes...

Friday, February 08, 2008

PBXT & DBT2: Debugging C/C++ 101

Yesterday I started testing PBXT using the DBT2 benchmark. Following the implementation of durability and SELECT FOR UPDATE for the engine I was more interested in the benchmark as a test for stability and concurrency than performance. I was not disappointed...

Which bug first?

Well I immediately ran into 3 bugs. Isn't it funny how bugs often come in batches, which leaves you thinking: "Oh sh.. where do I start?". Here's my advice: start with the bug that is most likely to disappear if you fix the others!

A simple example: you have 2 bugs, an unexpected exception is occurring, and you're losing memory. First look for the exception, because the memory loss may disappear when you fix it (you may be losing memory in the error handler).

Take things one problem at a time:

Another thing: once you have decided on one of the bugs, stick with it (no matter how hard it gets) to the bitter end! Thrashing around will build frustration!

So what happened with the DBT2 test? I started the test and immediately noticed that the engine was throwing "duplicate key" errors (it was too much to hope that this behavior was intended). Next I hit an assertion that claimed that a semaphore was not initialized (but I knew the semaphore was initialized). Finally, on restart after the assertion failed the engine crashed on recovery, in the clib memory manager (not a good sign!).

So where to start? Taking my own advice I quickly secured the state of the database before the restart, and confirmed that I could repeat the restart crash. So that one could wait for later.

The duplicate key error seemed to be a fairly stable repeat, so I took a closer look at the semaphore problem. Here I noticed that the assertion was failing because the check bytes that indicate that the semaphore was initialized had been overwritten, not a happy situation!

Make the bug quick and easy to repeat:

This bug was also difficult to repeat; I had to restore a fresh environment to get it to repeat consistently. So this is where I started.

But before we go on: make sure, in such a situation, that you can repeat the bug as quickly and easily as possible. Eliminate as many manual steps as you can, it will save time in the long run. For example, in this case I wrote a line of shell commands to delete and copy in the database to provide the correct starting point for repeating the bug.

Check your last bug fix first!

Unfortunately this bug turned out to be the result of a short lapse of concentration during my last bug fix. But I did not notice the error during my testing of the bug fix and so I moved on to DBT2. When the error occurred during the DBT2 test, I did not relate the problem to my last bug fix.

If I had, I would have found the problem quickly enough by a simple code read of the bug fix again. This has happened to me before, so my advice is: check your last bug fix, even when the new error does not seem to be related!

Debugging C/C++ 101, 3 lessons:

Conveniently for my little refresher course, each of the 3 bugs proved to touch on a different aspect of C/C++ debugging:

Bug 1. Using an uninitialized pointer.

For goodness sake, if you suspect this, then compile your program with optimization on, and the warning for "uninitialized variables" enabled. I didn't do this, and it might have saved me a lot of time. Anyway, this does not always work (for example if you used '&'). Unfortunately, if the compiler does not help, there is no easy way to find these bugs.

Debugging method: Probing

I call the method I used to find this error "probing". The idea is to write a special piece of check code which tests for the memory overwrite. The semaphore that was being overwritten was not global, but I added some code when it was initialized to set a global pointer to the semaphore. Then I wrote a little check (or probe) function which tested to see if the check bytes were still OK.
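A probe of this kind might look like the sketch below. The semaphore type, the check value and the demo_ names are all hypothetical stand-ins for the real PBXT structures; the point is only the pattern of a global pointer set at initialization plus a cheap check that reports where it was called from.

```c
#include <stddef.h>
#include <stdio.h>

#define DEMO_CHECK_BYTES 0xA5A5A5A5u   /* assumed "initialized" marker */

/* Hypothetical stand-in for the semaphore that was being overwritten. */
typedef struct {
    unsigned int check;   /* set on init and expected never to change */
    int          value;
} demo_sem_t;

/* Global pointer to the (non-global) object under suspicion. */
static demo_sem_t *demo_probe_target = NULL;

void demo_sem_init(demo_sem_t *s)
{
    s->check = DEMO_CHECK_BYTES;
    s->value = 0;
    demo_probe_target = s;    /* remember it for the probe */
}

/* Returns 1 if the check bytes are intact, 0 if they have been overwritten. */
int demo_probe(const char *file, int line)
{
    if (demo_probe_target == NULL ||
        demo_probe_target->check == DEMO_CHECK_BYTES)
        return 1;
    fprintf(stderr, "probe failed at %s:%d\n", file, line);
    return 0;
}

/* Sprinkle DEMO_PROBE() through the code to close in on the overwrite. */
#define DEMO_PROBE() demo_probe(__FILE__, __LINE__)
```

In practice it is convenient to set a breakpoint on the fprintf() line, so that the debugger stops at the first probe that fails.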

Next I spread calls to the check function around my program, trying to close in on the point where things go wrong. When doing this you have 2 difficulties to deal with:

Finding the right thread - If you are probing the wrong thread, then you get very misleading results. For example, I started by adding the probes to the engine API functions. When the probe was failing on the entry point it took me a while to realize that the problem must be in the PBXT background threads. So, using the probe try to first isolate the thread(s) that are causing the corruption.

This meant removing all probes from the engine API functions and placing them in the background threads, starting with the main loops. Then by elimination I managed to narrow the problem down to one particular thread.

Dealing with disappearing repeatability - The problem with this kind of bug is that it is really shy! As soon as you start to probe it, it disappears. I mean the changes made to the program change the executions so that the bug does not repeat.

At this stage it is very tempting to leave the debug statements in the code and declare the bug as fixed! But, alas, the bug is still there, it has just moved on to overwrite some other part of memory in some quiet little corner.

Here I can give the following advice: When the bug disappears always return to the last repeat point and try taking smaller steps this time.

Another thing: approach the corruption point from the bottom. By this I mean, close the probe in from a point after the corruption has occurred. This is because if the corruption is due to an uninitialized stack value, then as you move the probe towards the corruption point from the top, the probe disturbs the state of the stack.

As I mentioned above, when I found this bug it turned out to be a result of the last bug fix, bummer :(

Bug 2. Overwriting memory.

Next I decided to look at the crash on startup. This bug caused a crash in the memory manager. This is most often due to a memory overwrite which has wiped out some of the management data stored per block by the memory manager.

Fortunately I have the right tools to deal with this problem.

Debugging method: Scanning Memory

It's like "don't leave home without it": don't start a C/C++ program without a debug memory manager that does at least the following:
  • Adds and checks headers and footers on every block of memory allocated.
  • Wipes out blocks that are freed (for example set all bytes to 0xFE).
  • Always moves a block of memory that is reallocated.
  • Records every block allocated, and notes the line number and file where allocated.
  • Checks on shutdown that all memory has been freed, and reports blocks not freed.
This can also be done for objects allocated in C++ by overriding the delete and new operators.

Now using the fact that I have a list of all allocated pointers I have implemented a function which scans all allocated pointers and checks the headers and footers to tell me if anything is corrupted.
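The list above, together with the scan function, can be sketched in under a hundred lines of C. This is an illustration only and not the PBXT memory manager: the dm_ names and marker values are invented, and the sketch is not thread-safe (a real one would protect the block list with a lock).

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DM_HEAD 0xDEADBEEFu
#define DM_FOOT 0xFEEDFACEu

typedef struct dm_block {
    _Alignas(max_align_t) struct dm_block *next;
    struct dm_block *prev;
    const char      *file;   /* where the block was allocated */
    int              line;
    size_t           size;
    unsigned int     head;   /* header check value */
} dm_block_t;

static dm_block_t *dm_list = NULL;   /* every block currently allocated */

void *dm_alloc(size_t size, const char *file, int line)
{
    unsigned int foot = DM_FOOT;
    dm_block_t *b = malloc(sizeof(*b) + size + sizeof(foot));

    if (!b)
        return NULL;
    b->file = file; b->line = line; b->size = size; b->head = DM_HEAD;
    b->prev = NULL; b->next = dm_list;
    if (dm_list)
        dm_list->prev = b;
    dm_list = b;
    memcpy((char *)(b + 1) + size, &foot, sizeof(foot));   /* footer */
    return b + 1;
}

static int dm_block_ok(const dm_block_t *b)
{
    unsigned int foot;

    memcpy(&foot, (const char *)(b + 1) + b->size, sizeof(foot));
    return b->head == DM_HEAD && foot == DM_FOOT;
}

/* Check all allocated blocks; returns the number that are corrupted. */
int dm_scan(void)
{
    int bad = 0;

    for (dm_block_t *b = dm_list; b; b = b->next)
        if (!dm_block_ok(b)) {
            fprintf(stderr, "corrupt block (%zu bytes) from %s:%d\n",
                    b->size, b->file, b->line);
            bad++;
        }
    return bad;
}

void dm_free(void *p)
{
    dm_block_t *b;

    if (!p)
        return;
    b = (dm_block_t *)p - 1;
    if (!dm_block_ok(b))
        fprintf(stderr, "freeing corrupt block from %s:%d\n", b->file, b->line);
    if (b->prev) b->prev->next = b->next; else dm_list = b->next;
    if (b->next) b->next->prev = b->prev;
    /* Wipe the block; an optimizing build may drop this store. */
    memset(b, 0xFE, sizeof(*b) + b->size + sizeof(unsigned int));
    free(b);
}

/* Always moves the block, so stale pointers show up immediately. */
void *dm_realloc(void *p, size_t size, const char *file, int line)
{
    void *q = dm_alloc(size, file, line);

    if (q && p) {
        size_t old = ((dm_block_t *)p - 1)->size;
        memcpy(q, p, old < size ? old : size);
        dm_free(p);
    }
    return q;
}

/* Call on shutdown; returns the number of blocks never freed. */
int dm_report_leaks(void)
{
    int n = 0;

    for (dm_block_t *b = dm_list; b; b = b->next, n++)
        fprintf(stderr, "leak: %zu bytes from %s:%d\n", b->size, b->file, b->line);
    return n;
}
```

Callers would normally go through macros such as `#define DM_ALLOC(n) dm_alloc((n), __FILE__, __LINE__)` so that the file and line are recorded automatically.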

So to find the recovery crash I added the scan call to some of the loops that do recovery and soon managed to narrow things down and find the point where the corruption was occurring.

Note that with this method it may not even be necessary to do such a search. One call to the scan routine tells me which block has been corrupted. If it was a simple overwrite of the end of the block, then my debug memory manager will tell me which block it was, and where it was allocated. This may be enough to find the problem.

In my case it turned out that I had taken a pointer to a block and then some sub-function reallocated the block. But this is why it is so important that the debug memory manager always moves blocks on realloc(). If it had not done this, I probably never would have noticed the bug until it happened in some production situation (ugh!).

Bug 3. Concurrency problems.

I was worried about the 3rd bug which was causing an unexpected "duplicate key" error, because I was afraid it might be a conceptual problem. This fear stems from the fact that there are indeed serious conceptual problems involving MVCC and SELECT FOR UPDATE (which requires locking), but fortunately, my fear was unfounded, and it turned out to be just a normal bug, whew!

I knew this bug must be related to concurrency because I had tested all aspects of the row level locking in a simple controlled environment.

Debugging method: Trace it

The way to find concurrency problems is to trace them. The better your trace, the easier it is to find the bug. I think it is almost impossible to find a concurrency bug just using the debugger (unless you have a deadlock, for example). This is because in the debugger you only have a snapshot of the situation, and you don't see the interaction between the threads.

In my case the duplicate key error turned out to be the result of a SELECT FOR UPDATE that failed earlier.

There is, of course, a big problem with tracing and concurrency. Sometimes the error is robust, and sticks around while you bombard it with printf() statements. Mine was more of the shy type where the timing was really critical, and it disappeared when I added print statements.

In this case, I also have the right tool for the job. It is a trace function which records the information only in RAM, in one huge block of memory which rolls over if necessary. It is worth also taking the bit of extra time to make the trace function handle printf() type syntax, so that it is as easy to use as printf() itself.

What I did was I set a breakpoint at the spot in my code where the duplicate key error is generated. At this point in the code, I built in a call to my "trace dump" function. This function dumps the trace information which I have collected so far to a file.
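A RAM-only trace with printf() syntax and a dump function can be sketched as follows. This is not the PBXT trace code: the buffer size, the 256-byte line limit and the demo_ names are assumptions made for the example, and a tiny spin guard is used so that several threads can trace at once.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>

#define DEMO_TRACE_SIZE (64 * 1024)   /* assumed size; "huge" in real life */

static char        demo_trace_buf[DEMO_TRACE_SIZE];
static size_t      demo_trace_pos = 0;       /* next write offset */
static int         demo_trace_wrapped = 0;   /* set once the buffer rolls over */
static atomic_flag demo_trace_guard = ATOMIC_FLAG_INIT;

/* printf()-style trace that only touches memory, never the disk. */
void demo_trace(const char *fmt, ...)
{
    char    line[256];
    va_list ap;
    int     n;

    va_start(ap, fmt);
    n = vsnprintf(line, sizeof(line), fmt, ap);
    va_end(ap);
    if (n < 0)
        return;
    if ((size_t)n >= sizeof(line))
        n = (int)sizeof(line) - 1;            /* message was truncated */

    while (atomic_flag_test_and_set(&demo_trace_guard))
        ;                                     /* very short critical section */
    for (int i = 0; i < n; i++) {
        demo_trace_buf[demo_trace_pos++] = line[i];
        if (demo_trace_pos == DEMO_TRACE_SIZE) {
            demo_trace_pos = 0;
            demo_trace_wrapped = 1;
        }
    }
    atomic_flag_clear(&demo_trace_guard);
}

/* Write the collected trace to a file, oldest data first. Returns 0 on success. */
int demo_trace_dump(const char *path)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    if (demo_trace_wrapped)
        fwrite(demo_trace_buf + demo_trace_pos, 1,
               DEMO_TRACE_SIZE - demo_trace_pos, f);
    fwrite(demo_trace_buf, 1, demo_trace_pos, f);
    return fclose(f) == 0 ? 0 : -1;
}
```

A call such as `demo_trace("lock row %d by thread %d\n", row, thread)` costs little more than a memory copy, which is what keeps a timing-sensitive bug from hiding.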

Then I can examine the trace to find out how I got to this point, and I can also use the debugger to examine the threads and find other information I need.

Now some advice on trace code:
  • Firstly, never build in trace code that you think you might need! This is a waste of time if it is never used, and it clutters the code unnecessarily.
  • Secondly, when faced with a problem that needs to be traced, do not waste too much time or thought trying to guess what information you will need. Initially, just get something out there. Examining the trace is the best way to decide what information is missing.
  • Finally, when you are done, and you have found your bug, you will probably feel quite attached to your new trace code and not want to part with it. Don't worry, I understand, and I am not going to tell you now that you have to remove it :) Well, not all of it, anyway.
Really, you have to be very critical and decide what parts of the trace are good in general, and what parts helped you just find this bug. That stuff must go. And the other stuff: take a bit of time to clean it up, and make sure it can only be enabled in debug mode! Ever have to recall a version because you forgot a trace in it? Hmmm...

Well in the end I was very happy with my trace code. It allowed me to pinpoint the bug in SELECT FOR UPDATE, and I added a GOTCHA to my code which you can search for as soon as I get this version released (soon I hope).

OK, so this has been quite a long post, thanks for sticking with me :)

Thursday, January 17, 2008

Good move, congratulations MySQL and Sun!

It's already a day old, but the news is as hot as ever. Sun will acquire MySQL before the end of the year.

Congratulations to MySQL and Sun!

And well done to all who were involved in making this deal, in particular, those I know personally: Marten, Monty, David, Zack and Kaj!

As I mentioned to Kaj, I am sure that MySQL has a very bright future under the wings of Sun. A deal for $1 billion made in 5 weeks can only mean both sides are extremely motivated to make it work.

I have just 3 concerns:
  • I hope that the MySQL web-site will not disappear into the Sun web-site like the proverbial needle in a haystack! Sun's download page alone is as big as the MySQL web-site ;) I would like to see a mysql.sun.com, where we can find our way around easily.

  • And the second is similar to the first but relates to the people. There is a massive difference between dealing with a company that has 400 employees, and one with 34000! I hope that this deal will not affect the access we have to the decision makers, the community support team and the developers.

  • Thirdly, Sun wants to sell MySQL to enterprise customers, but I want to be involved in a database that is there for everyone. I am concerned that the non-paying customers, which is a large part of the community, may become neglected.
I know that Kaj and many others at MySQL will be working hard to alleviate these concerns, so thanks in advance. Time will tell how successful they are. But they can be sure, in this quest they have my full support!

Thursday, December 20, 2007

Making PBXT Fully Durable

Until now PBXT has been ACId (with a lower-case d). This is soon to change as I have had some weeks to work on a fully durable version of the transactional engine (http://www.primebase.com/xt).

My first concern in making PBXT fully durable was to what extent I would have to abandon the original "write-once" design. While there are a number of ways to implement durability, the only method used by databases (as far as I know) is the write-ahead log.

The obvious advantage of this method is that all changes can be flushed at once. However, this requires that all data be written twice: once to the log and after that, to the database itself.

My solution to this problem is a compromise, but I think it is a good one. In a nutshell: short records are written twice, and long records are written once.

If a transaction writes only short records, then one flush will suffice to commit it. Because the records are short, contention on the write-ahead log is at a minimum. If a transaction writes any long records, most of the data will be written once to a data log (as opposed to a transaction log). Contention for writing on the data log is zero (because each writer has its own data log), but two flushes are required to commit the transaction.

By doing this I have saved other transactions having to wait while a certain transaction copies a large amount of data to the transaction log. Although the transaction log uses a double buffering system, this will still cause a hold up.

In summary: if you have a transaction which writes large records, then it will basically just hold up itself, and not everybody else.

Another innovation I have introduced to reduce contention on the transaction log is an "operation sequence number".

Normally operations must be synchronized on the transaction log to ensure consistency. For example, the allocation of a block must be written to the log before usage. But this means all threads need to lock the transaction log when performing an operation.

Instead of doing this, I issue a unique sequence number for each operation done on a table. The operations are then written to the log in batches without concern about the order.

The process that then applies the changes in the log to the database sorts the operations by sequence number before they are applied. This is also done on restart, during recovery.
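The sequence-number idea can be sketched roughly as below. The record layout and the demo_ names are hypothetical (the real PBXT log format is not shown in this post), a single counter stands in for the per-table counters, and wrap-around of the counter is ignored.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical log record: only the fields needed for the illustration. */
typedef struct {
    unsigned long seq;   /* operation sequence number */
    int           op;    /* made-up operation code */
} demo_log_op_t;

static atomic_ulong demo_next_seq = 1;

/* Called by any thread for each operation, without locking the log. */
unsigned long demo_new_op_seq(void)
{
    return atomic_fetch_add(&demo_next_seq, 1);
}

static int demo_cmp_seq(const void *a, const void *b)
{
    const demo_log_op_t *x = a;
    const demo_log_op_t *y = b;

    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Before applying a batch read from the log (also during recovery),
 * put the operations back into the order in which they were issued. */
void demo_sort_batch(demo_log_op_t *ops, size_t count)
{
    qsort(ops, count, sizeof(*ops), demo_cmp_seq);
}
```

The only shared state touched on the hot path is one atomic counter, which is much cheaper than holding a lock on the whole transaction log while writing.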

Thursday, November 08, 2007

BLOB Streaming presentation at the Hamburg MySQL User Group

I have just posted the presentation that I gave at the Hamburg MySQL User Group last Tuesday. You can download the presentation here.

I have added a few slides on advanced topics: backup, replication and the distributed repository, which did not actually make it into my talk. However, these topics came up in the discussion over a few drinks afterwards.

Thanks to Lenz for the opportunity to present the BLOB Streaming Project and to those that were there for the good feedback.

As Lenz said, it was a "pretty technical crowd". For example, it did not go unnoticed that a malicious client could launch a denial of service attack by establishing many upload connections that fill up the server's file system. Although unreferenced BLOBs of this type are deleted from the repository after 5 minutes, this is still a serious threat for anyone who exposes the MyBS HTTP port to the internet.

To prevent this it may be necessary to limit uploads to clients with specific IP addresses (which could be specified by a system variable). Lenz suggested using HTTP-based authentication such as digest access authentication. Any other ideas would be welcome.

Another question was whether a BLOB could be deleted while it is being downloaded from the repository. Although BLOBs are not locked while they are downloaded, I have just realized that this is not a problem. The BLOB remains in the repository after deletion until the compactor thread removes it by deleting the repository file that contains the BLOB. And this is only done once all readers have released the repository file.
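A minimal sketch of why deletion during a download is safe, assuming a simple reader count per repository file. The `RepositoryFile` class and its methods are hypothetical and much simpler than the real repository.

```python
class RepositoryFile:
    """One repository file holding several BLOBs (simplified)."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)  # blob id -> data
        self.deleted = set()      # ids marked deleted, data still present
        self.readers = 0          # downloads currently using this file
        self.removed = False

    def open_reader(self, blob_id):
        if self.removed or blob_id in self.deleted:
            raise KeyError(blob_id)
        data = self.blobs[blob_id]
        self.readers += 1
        return data

    def close_reader(self):
        self.readers -= 1

    def delete(self, blob_id):
        # Deleting only marks the BLOB; its data stays in the file.
        self.deleted.add(blob_id)

    def compact(self):
        """Remove the file if no reader holds it.

        Returns the surviving BLOBs (which a real compactor would move to
        another file), or None if the file could not be removed yet.
        """
        if self.readers > 0 or not self.deleted:
            return None
        survivors = {k: v for k, v in self.blobs.items() if k not in self.deleted}
        self.blobs.clear()
        self.removed = True
        return survivors


repo_file = RepositoryFile({"b1": b"hello"})
data = repo_file.open_reader("b1")  # download starts
repo_file.delete("b1")              # BLOB deleted mid-download
blocked = repo_file.compact()       # reader still active, nothing removed
repo_file.close_reader()            # download finishes
survivors = repo_file.compact()     # now the file can go
```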

I have submitted this talk under the heading An Introduction to BLOB Streaming for MySQL Project as a proposal for the MySQL Conference & Expo 2008. And, if it is approved, I will also be presenting Inside the PBXT Storage Engine at the conference. Ronald mentioned that there have been nearly 300 submissions, so I will be quite lucky to get both talks approved! :)

Friday, October 19, 2007

New PBXT/MyBS release enables JDBC-based BLOB streaming!

This is quite a milestone for me! At last it is possible to actually do some practical work with the BLOB streaming engine (MyBS)!

For this release I have completed changes to the MySQL Connector/J 5.0.7, to allow BLOB data to be transparently stored and retrieved from the MyBS BLOB repository. The new version of the driver is called MySQL Connector/J SE (streaming enabled).

Uploading a BLOB is as simple as using setBinaryStream() or setBlob() on INSERT or UPDATE. By using getBinaryStream() or getBlob() after a SELECT you get direct access to the data stream coming from the repository. More information and some examples are provided in the documentation at: http://www.blobstreaming.org/documentation.

To try this out you need to install the latest versions of PBXT and MyBS. Both are available from: http://www.blobstreaming.org/download.

Binary versions of the storage engines are also available for MySQL 5.1.22 running on 32-bit Linux and x86 Mac OS X. The modified version of the JDBC source code is included in the MyBS source code distribution, but the driver can also be downloaded here.

I have included a small test program, TestJDBC.java, as part of the JDBC driver. So once you have installed the engines, you can test BLOB streaming as follows:

java -cp mysql-connector-java-5.0.7se-bin.jar TestJDBC

TestJDBC connects to a local MySQL server, creates a PBXT table and tests INSERT and SELECT of rows containing BLOBs. The program also serves as an example of how to do BLOB streaming with JDBC, but this is all pretty much standard stuff.

To get started quickly, the most important things to note are:
  • Set EnableBlobStreaming=true in your JDBC connection URL.
  • Streamable BLOBs can only be stored in LONGBLOB columns in PBXT tables.
  • Use setBinaryStream(), setAsciiStream() or setBlob() and specify the length to upload a BLOB.
A streamable BLOB is a BLOB where the data is stored in the MyBS BLOB repository and a reference is inserted into the row. If you use setBinaryStream() on INSERT, for example, but specify a length of -1, then the JDBC driver reverts to the default (non-streaming enabled) behavior which is to store the data directly in the table. The data will be returned correctly on SELECT, but is not streamable.

As usual, any comments, questions or bug reports can be sent directly to me: paul-dot-mccullagh-at-primebase-dot-com. Make sure you put the word PBXT or MyBS in the e-mail title to make it through my spam filter! :)

Tuesday, September 25, 2007

PBXT & MyBS at the MySQL Developer Meeting in Heidelberg

I was glad to have the opportunity to join the MySQL developers in Heidelberg for a few days, so thanks to MySQL for the invitation. In between great food, quite a few beers and a number of boat trips we managed to get a significant amount of work done!

In what could be considered a follow-up to the engine summit at Google following the MySQL User's conference, I joined Calvin Sun, Brian Aker, Jeffrey Pugh, Monty and others from MySQL and the engine developing community to discuss things concerning storage engines.

One of the main topics of the meeting was features and other changes to the MySQL front-end as required by the engines. Some of the requirements (such as an interface to the MySQL optimizer) would really require huge changes, but most agree that freely defined attributes on tables, columns and indexes would be very useful (and relatively easy to implement). Monty would like error handling on the commit call to be added ASAP, but Jeffrey said that's a feature, not a bug fix, and so for MySQL 5.1 it's a no-go. It will be interesting to see who wins that one! I would also like to have engine defined, custom data types. My most pressing problem: how can I indicate that a BLOB column is streamable?

I presented the ideas behind the BLOB Streaming engine to the connector developers and we discussed how the PHP and JDBC connectors could be extended to support BLOB Streaming. Mark Matthews, responsible for JDBC, showed me the spot in the code where the ResultSet would handle a MyBS data stream. Mark also pointed out that either JDBC will need to be able to upload a BLOB without specifying which table it is going into, or the JDBC driver will have to parse the SQL statement. Hmm, ... I should have realized this before!?

I am also looking forward to discussing things further with Andrey Hristov, developer of the mysqlnd PHP Connector, after he has tried out the new engine. Making the BLOB streaming functionality easily available to PHP developers will be a great step forward.

I was also glad to be able to meet with Mats Kindahl, whose experience on the MySQL replication team is very useful to the BLOB streaming project. His main concern is to maintain the flexibility of the system, as he points out in his blog. He suggested a more loosely coupled system, for example using database triggers instead of the MyBS server-side API calls. While flexibility is important, I want to avoid too many moving parts, and make sure that the basic setup is simple. We both agreed that an embedded scripting language (à la MySQL Proxy) may be a good compromise.

In a bit of time between sessions Stewart Smith and I took a look at adding the BLOB streaming functionality into the NDB cluster engine. We didn't get all that far with our quick hack, but we both saw that it could be done relatively easily. The potential for the combination of MySQL cluster and BLOB streaming is huge.

Altogether it is very helpful to any developer in the community to have such concentrated access to the MySQL developers as is possible at the internal developers' conference. This is a great offer on the part of MySQL, and I can only imagine that they will have to continue to limit the number of external developers that can be accommodated at these meetings.

So my recommendation: try to book a ticket as early as possible for next year!

Tuesday, August 28, 2007

MySQL Camp: a Secret Tip?

Where can you get access to some of the most informed people from MySQL and the community, for free?

The answer: at MySQL Camp. Throw in free breakfast and lunch, plus the chance to influence the session topics, and you have quite a package deal.

So it is strange that so few people took up the offer in New York this year!?

My talk was about the BLOB Streaming engine, MyBS, and I have posted the slides: Presentation - MySQL Camp 2007: The BLOB Streaming Project.

OK, so I got pretty much ragged about the name, MyBS. Why, I was asked, did I name it that? Jay even suggested a session to find a new name for the engine! Thanks, Jay, very considerate of you... :)

But it was quite unnecessary, because I really can't see what the problem is. I think the name is cool. Uhm, totally ... cool.

Friday, July 27, 2007

BLOB streaming engine (MyBS), version 0.5 Alpha released!

With some effort just before my holiday, I have managed to complete the release of the next version of MyBS, the BLOB streaming engine for MySQL.

This version includes all the basic functionality required to stream BLOB data in and out of MySQL tables.

The main features are:
  • Uploading of BLOB data directly into the database using HTTP PUT or GET methods.

  • Downloading of BLOB data directly from the database using HTTP GET.

  • BLOB size may exceed 4GB - theoretical BLOB size limit of 256 Terabytes.

  • BLOBs are stored in a repository which manages references from other storage engine tables.

  • BLOBs are referenced by a URL.

  • URLs referencing BLOBs in the repository have a unique access code, for security.

  • The theoretical maximum repository size is 4 Zettabytes (2^72 bytes) per database.

  • The server-side streaming API allows any storage engine to store BLOB data in the repository.

  • MyBS system tables provide a view of the BLOBs and associated references in the repository.
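A quick check of the size figures in the list above, assuming binary units (the reading under which 2^72 bytes is exactly 4 zettabytes). That 256 terabytes then corresponds to a 48-bit size is a derived observation, not something stated in the release notes.

```python
TIB = 2**40  # binary terabyte
ZIB = 2**70  # binary zettabyte

max_blob_bytes = 256 * TIB
max_repo_bytes = 2**72

# 256 TiB is 2**48 bytes, so a BLOB size fits in 48 bits.
blob_size_bits = max_blob_bytes.bit_length() - 1

# 2**72 bytes expressed in binary zettabytes.
repo_in_zib = max_repo_bytes // ZIB

# The 4 GB limit mentioned in the list is what a 32-bit size allows.
four_gib = 2**32
```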
MyBS works together with the PBXT transactional storage engine, version 0.9.88, which supports the MyBS streaming API. Both engines can be downloaded from: http://www.blobstreaming.org/download.

Documentation for MyBS is also available. It includes details about all features so far, and some examples of use: http://www.blobstreaming.org/documentation.

If you try out the new engine, I'd like to hear from you. Any comments, questions and bug reports can be sent directly to me.

Tuesday, July 17, 2007

The MyBS Engine and the BLOB Repository

After some consideration I have decided to move the BLOB repository from PBXT to MyBS (§). This has the advantage that any engine that does not have its own BLOB repository (or is otherwise not suitable for storing large amounts of BLOB data) can reference BLOBs in the MyBS BLOB repository.

(§) MyBS stands for "BLOB Streaming for MySQL". The BLOB Streaming engine is a new storage engine for MySQL which allows you to stream media data directly in and out of the database. More info at www.blobstreaming.org.

Let's look at an example of this. Assume my standard example table:
CREATE TABLE notes_tab (
n_id int PRIMARY KEY,
n_text longblob
) ENGINE=PBXT;
And assume we have a file called blob_eg.txt with the contents "This is a BLOB Streaming upload test".

Firstly, I can upload a BLOB to the MyBS BLOB Repository using the HTTP PUT method:

% curl -T blob_eg.txt http://localhost:8080/test/notes_tab
test/1-326-4891cdae-0


Here I uploaded a BLOB to the repository and specified the database, test, and the table, notes_tab. The URL returned, test/1-326-4891cdae-0, is the reference to the BLOB in the BLOB repository, returned by MyBS. Note that the BLOB is not yet in the table (to store the BLOB directly in the table, I would have to specify a column and a condition which identifies a particular row in the table).

However, the BLOB is already stored in the database, and I can download it as follows:

% curl http://localhost:8080/test/1-326-4891cdae-0
This is a BLOB Streaming upload test


Since the BLOB is not yet referenced by a table, the MyBS BLOB repository sets a timer. If the BLOB is not retained (reference count incremented) within a certain amount of time it is removed from the BLOB repository.
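The retain-or-expire behaviour can be sketched as follows. This is an illustration only: the `BlobRepository` class and its methods are hypothetical, and the 300-second timeout is borrowed from the 5 minutes mentioned in the November 2007 post, so whether it applies to this version is an assumption.

```python
class BlobRepository:
    def __init__(self, timeout_seconds=300.0):
        self.timeout = timeout_seconds
        self._blobs = {}  # reference -> {"data", "refs", "uploaded"}

    def upload(self, ref, data, now):
        # A freshly uploaded BLOB has no references yet; the clock starts here.
        self._blobs[ref] = {"data": data, "refs": 0, "uploaded": now}

    def retain(self, ref):
        # Called when a table row starts referencing the BLOB.
        self._blobs[ref]["refs"] += 1

    def expire(self, now):
        """Remove unreferenced BLOBs whose timer has run out."""
        stale = [
            ref
            for ref, blob in self._blobs.items()
            if blob["refs"] == 0 and now - blob["uploaded"] >= self.timeout
        ]
        for ref in stale:
            del self._blobs[ref]
        return stale

    def __contains__(self, ref):
        return ref in self._blobs


repo = BlobRepository()
repo.upload("test/1-326-4891cdae-0", b"This is a BLOB Streaming upload test", now=0)
repo.upload("test/orphan", b"never inserted into a table", now=0)
repo.retain("test/1-326-4891cdae-0")  # e.g. triggered by the INSERT

too_early = repo.expire(now=299)
expired = repo.expire(now=300)
```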

To actually insert the BLOB into the table you just insert the BLOB reference, for example:

mysql> insert notes_tab values (1, "test/1-326-4891cdae-0");

On the MySQL server the notes_tab table engine will call the MyBS engine (using the server-side BLOB Streaming API) and retain the test/1-326-4891cdae-0 BLOB reference. So I can now download the BLOB by referencing the table, column and row as follows:

% curl http://localhost:8080/test/notes_tab/n_text/n_id=1
This is a BLOB Streaming upload test


Note: this example will only work with MyBS 0.5 (www.blobstreaming.org/download) or later. Coming soon!

Thursday, June 28, 2007

PBXT: Top 5 wishes of a Storage Engine

In response to Ronald's challenge in Top 5 wishes for MySQL, here is my top 5 wish list. However, it makes sense for me to put a slightly different spin on the top 5 series, and write from a storage engine developer's perspective.

1. A generic engine test suite

A set of mysql-test-run test scripts and results that are intended to be run by all engines. The tests will verify basic functionality and compatibility, and form the basis for an engine certification process.

2. Internal APIs

PBXT already has to call into MySQL to open .frm files, and transform path and file names. The BLOB Streaming engine will need to access user privilege information. Other engines use the cross-platform functionality provided by mysys. What we need is a number of official, well-defined APIs for the various pieces of MySQL internal functionality.

3. Customizable table and column attributes

Specialized engines require specialized information. Right now, this information is being packed into table and column comments (hack, hack, ...).

4. Push-down restrict and join conditions

This is a big one for engines in general. Many engines are being created that can do certain searches better than the MySQL query processor. However, for the optimizer to know whether to push down a condition or not will probably require a better performance metric.

5. Custom data types

SQL-92 has the concept of a domain, which is basically a named data type. This could be used as the basis for custom data types provided by a storage engine, made available in the form of a new domain.

And without numbering them, let me slip in a few more wishes. How about MySQL community project development hosted on MySQLForge, complete with integration into the MySQL bug tracking system?! And I have heard that this may also be possible: PBXT and other GPL community engines on the MySQL Community distribution :)

Tuesday, June 26, 2007

First release of the BLOB Streaming engine for MySQL

I have just released the first version of the BLOB Streaming engine for MySQL (MyBS). You can download the source code of the engine from http://www.blobstreaming.org/download. Pluggable binaries for MySQL 5.1.19 (32-bit Linux and Mac OS X) are also available.

To install the plug-in, copy libmybs.so to the /usr/local/mysql/lib/mysql directory, connect to your server using mysql, and enter:

mysql> install plugin MyBS soname "libmybs.so";

This version allows you to download BLOBs that are already stored in the database using HTTP. The URL is specified as follows:

http://mysql-host-name:8080/database/table/blob-column/condition

Where condition has the form: column1=value1&column2=value2&...

I gave an example of this in my previous blog: "GET"ing a BLOB from the database with the BLOB Streaming Engine

8080 is the default port, which can be set using the mybs_port system variable on the mysqld command line. For example: mysqld --mybs_port=8880
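For client code, the URL format above can be assembled with a small helper. This is a sketch: `blob_url` is a hypothetical function, and whether MyBS percent-decodes the URL is an assumption (for plain identifiers and numeric values the encoding changes nothing).

```python
from urllib.parse import quote, urlencode


def blob_url(host, database, table, column, condition, port=8080):
    """Build http://host:port/database/table/blob-column/condition."""
    path = "/".join(quote(part, safe="") for part in (database, table, column))
    # condition is a mapping of column -> value, joined as c1=v1&c2=v2&...
    query = urlencode(condition, quote_via=quote)
    return f"http://{host}:{port}/{path}/{query}"


url = blob_url("localhost", "test", "notes_tab", "n_text", {"n_id": 1})
custom_port_url = blob_url(
    "localhost", "test", "notes_tab", "n_text", {"n_id": 1, "n_rev": 2}, port=8880
)
```

The `port` argument only needs changing when the server was started with a different mybs_port value.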

In order for BLOB streaming to work you also need PBXT version 0.9.87 which is streaming enabled. Streaming enabled simply means the engine supports the MyBS server-side streaming API.

This version of PBXT is also available from www.blobstreaming.org, or from Sourceforge.net.

Note that this version is currently only for use behind the firewall because the HTTP access is unrestricted.

The next step will be to enable the uploading of BLOBs using the HTTP PUT method, and the implementation of basic security.