Friday, March 14, 2008

New version and a new home for PBXT!

I have just released the first fully durable version of PBXT. Because of the amount of new code I have reverted PBXT to Alpha status. This version, 1.0-alpha, can be downloaded from: http://www.primebase.org/download.

Oh, which reminds me: PBXT now has a new home at http://www.primebase.org, so take a look around! I have actually found a bit of time to write some documentation. Right now the documentation describes building, installation, and the PBXT system parameters. Future additions will include information on performance tuning and a road map for PBXT development.

But there is more to the new home than just a new web-site. The PBXT project is now owned and funded by PrimeBase Technologies, an open source software development company. So altogether this is an important step forward on the road to my goal: to make PBXT a significant contribution to the MySQL community and business eco-system.

Besides full durability, the latest release includes the following improvements:
  • Calculation of index statistics as required by the optimizer (execute FLUSH TABLES to refresh the statistics).
  • New system variables: pbxt_log_cache_size, pbxt_log_file_threshold, pbxt_transaction_buffer_size and pbxt_checkpoint_frequency (details here).
  • Implementation of SELECT FOR UPDATE, which performs row-level locking to prevent concurrent updates.
  • Group commit: increases update throughput by committing multiple transactions concurrently.
  • Support for SHOW ENGINE PBXT STATUS, which displays information about memory usage.
What this release does not have is an option to relax durability. The transaction log is always flushed on commit. I plan to add a system parameter shortly that will allow you, in the spirit of the original version, to trade durability for performance if this suits your application.

Even better would be to be able to specify this per table. Now if only MySQL would allow engines to specify custom table attributes...

Friday, February 08, 2008

PBXT & DBT2: Debugging C/C++ 101

Yesterday I started testing PBXT using the DBT2 benchmark. Following the implementation of durability and SELECT FOR UPDATE for the engine, I was more interested in the benchmark as a test of stability and concurrency than of performance. I was not disappointed...

Which bug first?

Well I immediately ran into 3 bugs. Isn't it funny how bugs often come in batches, which leaves you thinking: "Oh sh.. where do I start?". Here's my advice: start with the bug that is most likely to disappear if you fix the others!

A simple example: you have 2 bugs, an unexpected exception is occurring and you're losing memory. First look for the memory loss, because it may disappear when you fix the exception (you may be leaking memory in the error handler).

Take things one problem at a time:

Another thing: once you have decided on one of the bugs, stick with it (no matter how hard it gets) to the bitter end! Thrashing around will only build frustration!

So what happened with the DBT2 test? I started the test and immediately noticed that the engine was throwing "duplicate key" errors (it was too much to hope that this behavior was intended). Next I hit an assertion which claimed that a semaphore was not initialized (but I knew the semaphore was initialized). Finally, on restart after the assertion failure, the engine crashed during recovery, in the C library memory manager (not a good sign!).

So where to start? Taking my own advice, I quickly secured the state of the database before the restart, and confirmed that I could repeat the restart crash. So that one could wait for later.

The duplicate key error seemed to be a fairly stable repeat, so I took a closer look at the semaphore problem. Here I noticed that the assertion was failing because the check bytes that indicate that the semaphore was initialized had been overwritten. Not a happy situation!

Make the bug quick and easy to repeat:

This bug was also difficult to repeat: I had to restore a fresh environment to get it to repeat consistently. So this is where I started.

But before we go on: make sure, in such a situation, that you can repeat the bug as quickly and easily as possible. Eliminate as many manual steps as you can; it will save time in the long run. For example, in this case I wrote a line of shell commands to delete and copy in the database to provide the correct starting point for repeating the bug.

Check your last bug fix first!

Unfortunately this bug turned out to be the result of a short lapse of concentration during my last bug fix. But I did not notice the error while testing that bug fix, and so I moved on to DBT2. When the error occurred during the DBT2 test, I did not relate the problem to my last bug fix.

If I had, I would have found the problem quickly enough simply by re-reading the code of the bug fix. This has happened to me before, so my advice is: check your last bug fix, even when the new error does not seem to be related!

Debugging C/C++ 101, 3 lessons:

Conveniently for my little refresher course, each of the 3 bugs proved to touch on a different aspect of C/C++ debugging:

Bug 1. Using an uninitialized pointer.

For goodness sake, if you suspect this, then compile your program with optimization on and the warning for "uninitialized variables" enabled. I didn't do this, and doing so might have saved me a lot of time. Admittedly, this does not always work (for example, if you take the address of the variable with '&'). Unfortunately, if the compiler does not help, there is no easy way to find these bugs.
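
To illustrate, this is the kind of bug the compiler can catch (a made-up fragment; with GCC the flag is -Wuninitialized, or -Wmaybe-uninitialized in newer versions, and optimization must be on for the data-flow analysis to run):

/* example.c - compile with: gcc -O2 -Wall -Wuninitialized -c example.c */
#include <stdio.h>

int example(int have_msg)
{
    const char *msg;            /* not initialized on every path... */

    if (have_msg)
        msg = "hello";

    return printf("%s\n", msg); /* ...so the compiler warns here */
}

/* Note: had the code passed &msg to some function, the compiler would
 * assume msg was initialized, and the warning would be lost. */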

Debugging method: Probing

I call the method I used to find this error "probing". The idea is to write a special piece of check code which tests for the memory overwrite. The semaphore that was being overwritten was not global, so I added some code to set a global pointer to the semaphore when it was initialized. Then I wrote a little check (or probe) function which tested whether the check bytes were still OK.
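
Something like this minimal sketch (the names and the check-byte value are invented for the example, not PBXT's actual code):

#include <assert.h>
#include <string.h>

#define SEMA_CHECK  "SEMA"          /* check bytes written on init */

typedef struct MySema {
    char check[4];                  /* the bytes that were being overwritten */
    /* ... the real semaphore state ... */
} MySema;

static MySema *watched_sema;        /* global pointer for the probe */

void sema_init(MySema *s)
{
    memcpy(s->check, SEMA_CHECK, 4);
    watched_sema = s;               /* make the semaphore reachable */
}

/* The probe: call this from anywhere to test for the overwrite. */
void probe_sema(void)
{
    if (watched_sema)
        assert(memcmp(watched_sema->check, SEMA_CHECK, 4) == 0);
}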

Next I spread calls to the check function around my program, trying to close in on the point where things go wrong. When doing this you have 2 difficulties to deal with:

Finding the right thread - If you are probing the wrong thread, you get very misleading results. For example, I started by adding the probes to the engine API functions. When the probe failed right at the entry points, it took me a while to realize that the problem must be in the PBXT background threads. So, using the probe, first try to isolate the thread(s) that are causing the corruption.

This meant removing all probes from the engine API functions and placing them in the background threads, starting with the main loops. Then by elimination I managed to narrow the problem down to one particular thread.

Dealing with disappearing repeatability - The problem with this kind of bug is that it is really shy! As soon as you start to probe it, it disappears. I mean the changes made to the program alter the execution so that the bug no longer repeats.

At this stage it is very tempting to leave the debug statements in the code and declare the bug fixed! But, alas, the bug is still there; it has just moved on to overwrite some other part of memory in some quiet little corner.

Here I can give the following advice: when the bug disappears, always return to the last repeat point and try taking smaller steps this time.

Another thing: approach the corruption point from the bottom. By this I mean, close the probe in from a point after the corruption has occurred. The reason is that if the corruption is due to an uninitialized stack value, then as you move the probe towards the corruption point from the top, the probe itself disturbs the state of the stack.

As I mentioned above, when I found this bug it turned out to be a result of the last bug fix, bummer :(

Bug 2. Overwriting memory.

Next I decided to look at the crash on startup. This bug caused a crash in the memory manager. This is most often due to a memory overwrite which has wiped out some of the management data stored per block by the memory manager.

Fortunately I have the right tools to deal with this problem.

Debugging method: Scanning Memory

Its like "don't leave home without it": don't start C/C++ program without a debug memory manager that does at least the following:
  • Adds and checks headers and footers on every block of memory allocated.
  • Wipes out blocks that are freed (for example set all bytes to 0xFE).
  • Always moves a block of memory that is reallocated.
  • Records every block allocated, and notes the line number and file where allocated.
  • Checks on shutdown that all memory has been freed, and reports blocks not freed.
This can also be done for objects allocated in C++ by overriding the delete and new operators.

Now, using the fact that I have a list of all allocated pointers, I have implemented a function which scans all allocated blocks and checks the headers and footers to tell me if anything has been corrupted.
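
Below is a minimal sketch of the idea (all names and guard values are invented; a real version would also need a mutex around the registry, record the allocating file and line, and verify on shutdown that the registry is empty):

#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_HEAD  0xDEADBEEFu
#define GUARD_FOOT  0xFEEDFACEu
#define MAX_BLOCKS  100000

typedef struct DbgHeader {
    unsigned guard;                 /* header check value */
    size_t   size;                  /* user size, locates the footer */
} DbgHeader;

static DbgHeader *blocks[MAX_BLOCKS];   /* registry of live blocks */
static int        block_count;

void *dbg_malloc(size_t size)
{
    DbgHeader *h = malloc(sizeof(DbgHeader) + size + sizeof(unsigned));
    unsigned   foot = GUARD_FOOT;

    h->guard = GUARD_HEAD;
    h->size  = size;
    memcpy((char *)(h + 1) + size, &foot, sizeof(foot));
    blocks[block_count++] = h;
    return h + 1;
}

void dbg_free(void *p)
{
    DbgHeader *h = (DbgHeader *)p - 1;

    assert(h->guard == GUARD_HEAD);
    memset(p, 0xFE, h->size);           /* wipe the freed contents */
    for (int i = 0; i < block_count; i++) {
        if (blocks[i] == h) {
            blocks[i] = blocks[--block_count];  /* drop from registry */
            break;
        }
    }
    free(h);
}

/* The scan: walk every live block and check both guards. */
void dbg_scan_memory(void)
{
    for (int i = 0; i < block_count; i++) {
        DbgHeader *h = blocks[i];
        unsigned   foot;

        assert(h->guard == GUARD_HEAD);
        memcpy(&foot, (char *)(h + 1) + h->size, sizeof(foot));
        assert(foot == GUARD_FOOT);
    }
}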

So to find the recovery crash I added the scan call to some of the loops that do recovery, and soon managed to narrow things down and find the point where the corruption was occurring.

Note that with this method it may not even be necessary to do such a search. One call to the scan routine tells me which block has been corrupted. If it was a simple overwrite of the end of the block, then my debug memory manager will tell me which block it was, and where it was allocated. This may be enough to find the problem.

In my case it turned out that I had taken a pointer into a block, and then some sub-function reallocated the block. This is why it is so important that the debug memory manager always moves blocks on realloc(). If it had not done this, I probably never would have noticed the bug until it happened in some production situation (ugh!).
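
Reduced to its essence, the bug pattern looked something like this (hypothetical code):

#include <stdlib.h>

/* A sub-function that may grow, and therefore move, the caller's block. */
static char *append_byte(char *buf, size_t *len, char c)
{
    buf = realloc(buf, *len + 1);   /* the debug manager ALWAYS moves here */
    buf[*len] = c;
    *len += 1;
    return buf;
}

int main(void)
{
    size_t len = 16;
    char  *buf = calloc(1, len);
    char  *field = buf + 10;        /* pointer into the block */

    buf = append_byte(buf, &len, 'x');

    *field = 'y';   /* BUG: if realloc() moved the block, field is stale.
                       A debug manager that always moves, and wipes the
                       old block, makes this fail immediately in testing. */
    free(buf);
    return 0;
}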

Bug 3. Concurrency problems.

I was worried about the 3rd bug, which was causing an unexpected "duplicate key" error, because I was afraid it might be a conceptual problem. This fear stems from the fact that there are indeed serious conceptual problems involving MVCC and SELECT FOR UPDATE (which requires locking). But fortunately my fear was unfounded, and it turned out to be just a normal bug. Whew!

I knew this bug must be related to concurrency because I had tested all aspects of the row-level locking in a simple, controlled environment.

Debugging method: Trace it

The way to find concurrency problems is to trace them. The better your trace, the easier it is to find the bug. I think it is almost impossible to find a concurrency bug just using the debugger (unless you have a deadlock, for example). This is because in the debugger you only have a snapshot of the situation, and you don't see the interaction between the threads.

In my case the duplicate key error turned out to be the result of a SELECT FOR UPDATE that failed earlier.

There is, of course, a big problem with tracing and concurrency. Sometimes the error is robust, and sticks around while you bombard it with printf() statements. Mine was more the shy type: the timing was really critical, and it disappeared when I added print statements.

In this case, I also have the right tool for the job. It is a trace function which records the information only in RAM, in one huge block of memory which rolls over if necessary. It is also worth taking a bit of extra time to make the trace function handle printf()-style syntax, so that it is as easy to use as printf() itself.

What I did was set a breakpoint at the spot in my code where the duplicate key error is generated. At this point in the code, I built in a call to my "trace dump" function, which dumps the trace information collected so far to a file.
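
A stripped-down sketch of such a trace facility (invented names, not PBXT's actual code; a real version would also have to make trace_pos safe for concurrent writers):

#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define TRACE_SIZE  (16 * 1024 * 1024)  /* one huge block of RAM */

static char   trace_buf[TRACE_SIZE];
static size_t trace_pos;

/* printf-style trace that writes only to RAM, barely disturbing timing. */
void xt_trace(const char *fmt, ...)
{
    char    line[256];
    va_list ap;
    int     len;

    va_start(ap, fmt);
    len = vsnprintf(line, sizeof(line), fmt, ap);
    va_end(ap);
    if (len <= 0)
        return;
    if (len > (int)sizeof(line) - 1)
        len = (int)sizeof(line) - 1;    /* output was truncated */
    if (trace_pos + (size_t)len > TRACE_SIZE)
        trace_pos = 0;                  /* roll over when full */
    memcpy(trace_buf + trace_pos, line, (size_t)len);
    trace_pos += (size_t)len;
}

/* Wired into the breakpoint/error path: dump everything collected so far. */
void xt_dump_trace(const char *path)
{
    FILE *f = fopen(path, "w");

    if (f) {
        fwrite(trace_buf, 1, trace_pos, f);
        fclose(f);
    }
}

Calls to xt_trace() can then be scattered through the suspect code paths, with xt_dump_trace() built in at the point where the error is detected.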

Then I can examine the trace to find out how I got to this point, and I can also use the debugger to examine the threads and find other information I need.

Now some advice on trace code:
  • Firstly, never build in trace code that you think you might need! This is a waste of time if it is never used, and it clutters the code unnecessarily.
  • Secondly, when faced with a problem that needs to be traced, do not waste too much time or thought trying to guess what information you will need. Initially, just get something out there. Examining the trace is the best way to decide what information is missing.
  • Finally, when you are done, and you have found your bug, you will probably feel quite attached to your new trace code and not want to part with it. Don't worry, I understand, and I am not now going to tell you that you have to remove it :) Well, not all of it, anyway.
Really, you have to be very critical and decide which parts of the trace are good in general, and which parts just helped you find this particular bug. That stuff must go. And the other stuff: take a bit of time to clean it up, and make sure it can only be enabled in debug mode! Ever had to recall a version because you forgot a trace in it? Hmmm...

Well, in the end I was very happy with my trace code. It allowed me to pinpoint the bug in SELECT FOR UPDATE, and I added a GOTCHA to my code which you can search for as soon as I get this version released (soon, I hope).

OK, so this has been quite a long post, thanks for sticking with me :)

Thursday, January 17, 2008

Good move, congratulations MySQL and Sun!

It's already a day old, but the news is as hot as ever: Sun will acquire MySQL before the end of the year.

Congratulations to MySQL and Sun!

And well done to all who were involved in making this deal, in particular, those I know personally: Marten, Monty, David, Zack and Kaj!

As I mentioned to Kaj, I am sure that MySQL has a very bright future under the wings of Sun. A deal for $1 billion made in 5 weeks can only mean both sides are extremely motivated to make it work.

I have just 3 concerns:
  • I hope that the MySQL web-site will not disappear into the Sun web-site like the proverbial needle in a haystack! Sun's download page alone is as big as the MySQL web-site ;) I would like to see a mysql.sun.com, where we can find our way around easily.

  • And the second is similar to the first, but relates to the people. There is a massive difference between dealing with a company that has 400 employees and one with 34,000! I hope that this deal will not affect the access we have to the decision makers, the community support team and the developers.

  • Thirdly, Sun wants to sell MySQL to enterprise customers, but I want to be involved in a database that is there for everyone. I am concerned that the non-paying users, who make up a large part of the community, may be neglected.
I know that Kaj and many others at MySQL will be working hard to alleviate these concerns, so thanks in advance. Time will tell how successful they are. But they can be sure, in this quest they have my full support!

Thursday, December 20, 2007

Making PBXT Fully Durable

Until now PBXT has been ACId (with a lower-case d). This is soon to change as I have had some weeks to work on a fully durable version of the transactional engine (http://www.primebase.com/xt).

My first concern in making PBXT fully durable was to what extent I would have to abandon the original "write-once" design. While there are a number of ways to implement durability, the only method used by databases (as far as I know) is the write-ahead log.

The obvious advantage of this method is that all changes can be flushed at once. However, this requires that all data be written twice: once to the log and after that, to the database itself.

My solution to this problem is a compromise, but I believe it is a good one. In a nutshell: short records are written twice, and long records are written once.

If a transaction writes only short records, then one flush will suffice to commit it. Because the records are short, contention on the write-ahead log is at a minimum. If a transaction writes any long records, most of the data will be written once to a data log (as opposed to a transaction log). Contention for writing on the data log is zero (because each writer has its own data log), but two flushes are required to commit the transaction.

By doing this I have saved other transactions from having to wait while a particular transaction copies a large amount of data to the transaction log. Although the transaction log uses a double buffering system, such a copy would still cause a hold-up.

In summary: if you have a transaction which writes large records, then it will basically just hold itself up, and not everybody else.
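
In code terms, the commit path might look roughly like the sketch below (the size threshold, types and functions are all invented stubs for illustration, not PBXT's actual interfaces):

#include <stdio.h>

typedef struct Record      { size_t size; } Record;
typedef struct Log         { const char *name; } Log;
typedef struct Transaction { Log *my_data_log; int used_data_log; } Transaction;

static Log txn_log = { "transaction-log" };

#define SHORT_RECORD_MAX 1024          /* invented threshold */

static void log_append(Log *log, const Record *rec)
{ printf("append %zu bytes to %s\n", rec->size, log->name); }

static void log_flush(Log *log)
{ printf("flush %s\n", log->name); }

void write_record(Transaction *xact, const Record *rec)
{
    if (rec->size <= SHORT_RECORD_MAX) {
        /* Short record: written to the shared transaction log now, and
         * a second time to the database itself later. One flush commits. */
        log_append(&txn_log, rec);
    } else {
        /* Long record: written once, to this writer's private data log,
         * so nobody else waits while it is copied. Only a small
         * reference goes into the transaction log. */
        log_append(xact->my_data_log, rec);
        xact->used_data_log = 1;
    }
}

void commit(Transaction *xact)
{
    if (xact->used_data_log)
        log_flush(xact->my_data_log);  /* first flush: the long-record data */
    log_flush(&txn_log);               /* second (or only) flush: the commit */
}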

Another innovation I have introduced to reduce contention on the transaction log is an "operation sequence number".

Normally operations must be synchronized on the transaction log to ensure consistency. For example, the allocation of a block must be written to the log before usage. But this means all threads need to lock the transaction log when performing an operation.

Instead of doing this, I issue a unique sequence number for each operation done on a table. The operations are then written to the log in batches without concern about the order.

The process that then applies the changes in the log to the database sorts the operations by sequence number before they are applied. This is also done on restart, during recovery.
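
The core of the idea, sketched in C (again the names are invented, and this is a simplification of what the engine actually does):

#include <stdatomic.h>
#include <stdlib.h>

typedef struct Operation {
    unsigned long seq_no;       /* unique per-table operation sequence number */
    /* ... description of the change ... */
} Operation;

typedef struct Table {
    atomic_ulong next_seq;      /* issued without locking the transaction log */
} Table;

/* Writer side: stamp the operation; it can then be written to the log
 * in a batch, in any order, without synchronizing on the log. */
void prepare_op(Table *tab, Operation *op)
{
    op->seq_no = atomic_fetch_add(&tab->next_seq, 1);
}

/* Applier/recovery side: sort a batch by sequence number before applying. */
static int cmp_seq(const void *a, const void *b)
{
    const Operation *x = *(const Operation *const *)a;
    const Operation *y = *(const Operation *const *)b;

    return (x->seq_no > y->seq_no) - (x->seq_no < y->seq_no);
}

void apply_batch(Operation **ops, size_t count)
{
    qsort(ops, count, sizeof(Operation *), cmp_seq);
    /* ... now apply each operation to the table, in sequence order ... */
}

The trick is that the correct order is re-established by sorting, not by making every writer lock the log.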

Thursday, November 08, 2007

BLOB Streaming presentation at the Hamburg MySQL User Group

I have just posted the presentation that I gave at the Hamburg MySQL User Group last Tuesday. You can download the presentation here.

I have added a few slides on advanced topics: backup, replication and the distributed repository, which did not actually make it into my talk. However, these topics came up in the discussion over a few drinks afterwards.

Thanks to Lenz for the opportunity to present the BLOB Streaming Project and to those that were there for the good feedback.

As Lenz said, it was a "pretty technical crowd". For example, it did not go unnoticed that a denial of service attack could be launched by a malicious client that establishes many upload connections and fills up the server's file system. Although unreferenced BLOBs of this type are deleted from the repository after 5 minutes, this is still a serious threat for anyone that exposes the MyBS HTTP port to the internet.

To prevent this it may be necessary to limit uploads to clients with specific IP addresses (which could be specified by a system variable). Lenz suggested using HTTP-based authentication such as digest access authentication. Any other ideas would be welcome.

Another question was whether a BLOB could be deleted while it is being downloaded from the repository. Although BLOBs are not locked while they are downloaded, I have since realized that this is not a problem. The BLOB remains in the repository after deletion until the compactor thread removes it by deleting the repository file that contains the BLOB. And this is only done once all readers have released the repository file.

I have submitted this talk under the heading An Introduction to BLOB Streaming for MySQL Project as a proposal for the MySQL Conference & Expo 2008. And, if it is approved I will also be presenting Inside the PBXT Storage Engine at the conference. Ronald mentioned that there have been nearly 300 submissions so I will be quite lucky to get both talks approved! :)

Friday, October 19, 2007

New PBXT/MyBS release enables JDBC-based BLOB streaming!

This is quite a milestone for me! At last it is possible to actually do some practical work with the BLOB streaming engine (MyBS)!

For this release I have completed changes to the MySQL Connector/J 5.0.7, to allow BLOB data to be transparently stored and retrieved from the MyBS BLOB repository. The new version of the driver is called MySQL Connector/J SE (streaming enabled).

Uploading a BLOB is as simple as using setBinaryStream() or setBlob() on INSERT or UPDATE. By using getBinaryStream() or getBlob() after a SELECT you get direct access to the data stream coming from the repository. More information and some examples are provided in the documentation at: http://www.blobstreaming.org/documentation.

To try this out you need to install the latest versions of PBXT and MyBS. Both are available from: http://www.blobstreaming.org/download.

Binary versions of the storage engines are also available for MySQL 5.1.22 running on 32-bit Linux and x86 Mac OS X. The modified version of the JDBC source code is included in the MyBS source code distribution, but the driver can also be downloaded here.

I have included a small test program, TestJDBC.java, as part of the JDBC driver. So once you have installed the engines, you can test BLOB streaming as follows:

java -cp mysql-connector-java-5.0.7se-bin.jar TestJDBC

TestJDBC connects to a local MySQL server, creates a PBXT table and tests INSERT and SELECT of rows containing BLOBs. The program also serves as an example of how to do BLOB streaming with JDBC, but this is all pretty much standard stuff.

To get started quickly, the most important things to note are:
  • Set EnableBlobStreaming=true in your JDBC connection URL.
  • Streamable BLOBs can only be stored in LONGBLOB columns in PBXT tables.
  • Use setBinaryStream(), setAsciiStream() or setBlob() and specify the length to upload a BLOB.
A streamable BLOB is a BLOB where the data is stored in the MyBS BLOB repository and a reference is inserted into the row. If you use setBinaryStream() on INSERT, for example, but specify a length of -1, then the JDBC driver reverts to the default (non-streaming enabled) behavior which is to store the data directly in the table. The data will be returned correctly on SELECT, but is not streamable.

As usual, any comments, questions or bug reports can be sent directly to me: paul-dot-mccullagh-at-primebase-dot-com. Make sure you put the word PBXT or MyBS in the e-mail title to make it through my spam filter! :)

Tuesday, September 25, 2007

PBXT & MyBS at the MySQL Developer Meeting in Heidelberg

I was glad to have the opportunity to join the MySQL developers in Heidelberg for a few days, so thanks to MySQL for the invitation. In between great food, quite a few beers and a number of boat trips we managed to get a significant amount of work done!

In what could be considered a follow-up to the engine summit at Google following the MySQL Users Conference, I joined Calvin Sun, Brian Aker, Jeffrey Pugh, Monty and others from MySQL and the engine developer community to discuss matters concerning storage engines.

One of the main topics of the meeting was features and other changes to the MySQL front-end required by the engines. Some of the requirements (such as an interface to the MySQL optimizer) would really require huge changes, but most agreed that freely defined attributes on tables, columns and indexes would be very useful (and relatively easy to implement). Monty would like error handling on the commit call to be added ASAP, but Jeffrey said that's a feature, not a bug fix, and so for MySQL 5.1 it's a no-go. It will be interesting to see who wins that one! I would also like to have engine-defined custom data types. My most pressing problem: how can I indicate that a BLOB column is streamable?

I presented the ideas behind the BLOB Streaming engine to the connector developers and we discussed how the PHP and JDBC connectors could be extended to support BLOB streaming. Mark Matthews, responsible for JDBC, showed me the spot in the code where the ResultSet would handle a MyBS data stream. Mark also pointed out that either JDBC will need to upload a BLOB without specifying which table it is going into, or the JDBC driver will have to parse the SQL statement. Hmm, ... I should have realized this before!?

I am also looking forward to discussing things further with Andrey Hristov, developer of the mysqlnd PHP Connector, after he has tried out the new engine. Making the BLOB streaming functionality easily available to PHP developers will be a great step forward.

I was also glad to be able to meet with Mats Kindahl, whose experience on the MySQL replication team is very useful to the BLOB streaming project. His main concern is to maintain the flexibility of the system, as he points out in his blog. He suggested a more loosely coupled system, for example using database triggers instead of the MyBS server-side API calls. While flexibility is important, I want to avoid too many moving parts, and make sure that the basic setup is simple. We both agreed that an embedded scripting language (à la MySQL Proxy) may be a good compromise.

In a bit of time between sessions Stewart Smith and I took a look at adding the BLOB streaming functionality into the NDB cluster engine. We didn't get all that far with our quick hack, but we both saw that it could be done relatively easily. The potential for the combination of MySQL cluster and BLOB streaming is huge.

Altogether it is very helpful to any developer in the community to have such concentrated access to the MySQL developers as is possible at the internal developer's conference. This is a great offer on the part of MySQL, and I can only imagine that they will have to continue to limit the number of external developers that can be accommodated at these meetings.

So my recommendation: try to book a ticket as early as possible for next year!

Tuesday, August 28, 2007

MySQL Camp: a Secret Tip?

Where can you get access to some of the most informed people from MySQL and the community, for free?

The answer: at MySQL Camp. Throw in free lunch and breakfast, and the chance to influence the session topics, and you have quite a package deal.

So it is strange that so few people took up the offer in New York this year!?

My talk was about the BLOB Streaming engine, MyBS, and I have posted the slides: Presentation - MySQL Camp 2007: The BLOB Streaming Project.

OK, so I pretty much got ragged about the name, MyBS. Why, I was asked, did I name it that? Jay even suggested a session to find a new name for the engine! Thanks, Jay, very considerate of you... :)

But it was quite unnecessary, because I really can't see what the problem is. I think the name is cool. Uhm, totally ... cool.

Friday, July 27, 2007

BLOB streaming engine (MyBS), version 0.5 Alpha released!

With some effort just before my holiday, I have managed to complete the release of the next version of MyBS, the BLOB streaming engine for MySQL.

This version includes all the basic functionality required to stream BLOB data in and out of MySQL tables.

The main features are:
  • Uploading of BLOB data directly into the database using HTTP PUT or GET methods.

  • Downloading of BLOB data directly from the database using HTTP GET.

  • BLOB size may exceed 4GB; the theoretical BLOB size limit is 256 terabytes.

  • BLOBs are stored in a repository which manages references from other storage engine tables.

  • BLOBs are referenced by a URL.

  • URLs referencing BLOBs in the repository have a unique access code, for security.

  • The theoretical maximum repository size is 4 Zettabytes (2^72 bytes) per database.

  • The server-side streaming API allows any storage engine to store BLOB data in the repository.

  • MyBS system tables provide a view of the BLOBs and associated references in the repository.
MyBS works together with the PBXT transactional storage engine, version 0.9.88, which supports the MyBS streaming API. Both engines can be downloaded from: http://www.blobstreaming.org/download.

Documentation for MyBS is also available. It includes details about all features so far, and some examples of use: http://www.blobstreaming.org/documentation.

If you try out the new engine, I'd like to hear from you. Any comments, questions and bug reports can be sent directly to me.

Tuesday, July 17, 2007

The MyBS Engine and the BLOB Repository

After some consideration I have decided to move the BLOB repository from PBXT to MyBS (§). This has the advantage that any engine that does not have its own BLOB repository (or is otherwise not suitable for storing large amounts of BLOB data) can reference BLOBs in the MyBS BLOB repository.

(§) MyBS stands for "BLOB Streaming for MySQL". The BLOB Streaming engine is a new storage engine for MySQL which allows you to stream media data directly in and out of the database. More info at www.blobstreaming.org.

Let's look at an example of this. Assume my standard example table:
CREATE TABLE notes_tab (
n_id int PRIMARY KEY,
n_text longblob
) ENGINE=PBXT;
And assume we have a file called blob_eg.txt with the contents "This is a BLOB Streaming upload test".

Firstly, I can upload a BLOB to the MyBS BLOB Repository using the HTTP PUT method:

% curl -T blob_eg.txt http://localhost:8080/test/notes_tab
test/1-326-4891cdae-0


Here I uploaded a BLOB to the repository and specified the database, test, and the table, notes_tab. The URL returned, test/1-326-4891cdae-0, is the reference to the BLOB in the BLOB repository, returned by MyBS. Note that the BLOB is not yet in the table (to store the BLOB directly in the table, I would have to specify a column and a condition which identifies a particular row in the table).

However, the BLOB is already stored in the database, and I can download it as follows:

% curl http://localhost:8080/test/1-326-4891cdae-0
This is a BLOB Streaming upload test


Since the BLOB is not yet referenced by a table, the MyBS BLOB repository sets a timer. If the BLOB is not retained (its reference count incremented) within a certain amount of time, it is removed from the BLOB repository.

To actually insert the BLOB into the table you just insert the BLOB reference, for example:

mysql> insert notes_tab values (1, "test/1-326-4891cdae-0");

On the MySQL server the notes_tab table engine will call the MyBS engine (using the server-side BLOB Streaming API) and retain the test/1-326-4891cdae-0 BLOB reference. So I can now download the BLOB by referencing the table, column and row as follows:

% curl http://localhost:8080/test/notes_tab/n_text/n_id=1
This is a BLOB Streaming upload test


Note: this example will only work with MyBS 0.5 (www.blobstreaming.org/download) or later. Coming soon!

Thursday, June 28, 2007

PBXT: Top 5 wishes of a Storage Engine

In response to Ronald's challenge in Top 5 wishes for MySQL, here is my top 5 wish list. However, it makes sense for me to put a slightly different spin on the top 5 series, and write from a storage engine developer's perspective.

1. A generic engine test suite

A set of mysql-test-run test scripts and results that are intended to be run by all engines. The tests will verify basic functionality and compatibility, and form the basis for an engine certification process.

2. Internal APIs

PBXT already has to call into MySQL to open .frm files, and transform path and file names. The BLOB Streaming engine will need to access user privilege information. Other engines use the cross-platform functionality provided by mysys. What we need is a number of official, well-defined APIs to various MySQL internal functionality.

3. Customizable table and column attributes

Specialized engines require specialized information. Right now, this information is being packed into table and column comments (hack, hack, ...).

4. Push-down restrict and join conditions

This is a big one for engines in general. Many engines are being created that can do certain searches better than the MySQL query processor. However, for the optimizer to know whether to push down a condition or not will probably require a better performance metric.

5. Custom data types

SQL-92 has the concept of a domain, which is basically a named data type. This could be used as the basis for custom data types provided by a storage engine, made available in the form of a new domain.

And without numbering them, let me slip in a few more wishes. How about MySQL community project development hosted on MySQLForge, complete with integration into the MySQL bug tracking system?! And I have heard that this may also be possible: PBXT and other GPL community engines on the MySQL Community distribution :)

Tuesday, June 26, 2007

First release of the BLOB Streaming engine for MySQL

I have just released the first version of the BLOB Streaming engine for MySQL (MyBS). You can download the source code of the engine from http://www.blobstreaming.org/download. Pluggable binaries for MySQL 5.1.19 (32-bit Linux and Mac OS X) are also available.

To install the plug-in copy libmybs.so to the /usr/local/mysql/lib/mysql directory, connect to your server using mysql, and enter:

mysql> install plugin MyBS soname "libmybs.so";

This version allows you to download BLOBs that are already stored in the database using HTTP. The URL is specified as follows:

http://mysql-host-name:8080/database/table/blob-column/condition

Where condition has the form: column1=value1&column2=value2&...

I gave an example of this in my previous blog: "GET"ing a BLOB from the database with the BLOB Streaming Engine

8080 is the default port, which can be set using the mybs_port system variable on the mysqld command line. For example: mysqld --mybs_port=8880

In order for BLOB streaming to work you also need PBXT version 0.9.87 which is streaming enabled. Streaming enabled simply means the engine supports the MyBS server-side streaming API.

This version of PBXT is also available from www.blobstreaming.org, or from Sourceforge.net.

Note that this version is currently only for use behind a firewall, because the HTTP access is unrestricted.

The next step will be to enable the uploading of BLOBs using the HTTP PUT method, and the implementation of basic security.

Tuesday, June 05, 2007

"GET"ing a BLOB from the database with the BLOB Streaming Engine

Current plans call for the use of the HTTP protocol to upload and retrieve BLOBs to and from the database. The BLOB Streaming Engine makes this possible by integrating a lightweight HTTP server directly into the MySQL server.

I am currently working on an alpha implementation of this, and would like to give a short demonstration of what is possible using this system.

We start by creating a table using any streaming enabled storage engine (a streaming enabled storage engine is an engine that supports the server-side streaming API):
use test;
CREATE TABLE notes_tab (
n_id INTEGER PRIMARY KEY,
n_text BLOB
) ENGINE=pbxt;
INSERT notes_tab VALUES (1, "This is a BLOB streaming test!");

Now assuming the MySQL server is on the localhost, and the BLOB streaming engine has been set to port 8080, you can open your browser, and enter this URL:

http://localhost:8080/test/notes_tab/n_text/n_id=1

With the following result displayed in the browser:

This is a BLOB streaming test!

So without even doing a SELECT, you can GET a BLOB directly out of the database!

Note that there is no need for the BLOB in the database to be explicitly "streamable" for this to work.

Friday, May 18, 2007

The Scalable BLOB Streaming discussion so far

Having discussed BLOB streaming for MySQL with a number of people I have gathered quite a bit of input on the subject.

So here are some details of the issues raised so far:

Server-side API

One of the first things that Brian Aker pointed out to me was that the server-side API must make it possible to use the sendfile() system call. The call does direct disk to network transfer and can speed up delivery of a BLOB stream significantly.

The server-side API must also include a mechanism to inform the streaming enabled engine that an upload is complete. Assuming the streaming protocol used is HTTP, there are 2 ways of determining the end of an upload: either the connection is closed when the transfer is complete, or the length of the BLOB data is specified in an HTTP header.

ODBC Issues

Monty was concerned about compatibility with the MySQL ODBC driver. The ODBC function SQLPutData() is used to send BLOB data, and SQLGetData() is used to retrieve BLOB data. Both functions transfer data in chunks. The implementation of these functions would have to be made aware of streamable BLOB values.

This can probably be done in a way that is transparent to the user. It may be necessary to designate certain database types as streaming BLOBs. One suggestion is to add 2 new types to MySQL: LONGBIN and LONGCHAR. These keywords are not yet used by MySQL.

However, on the ODBC side the data types LONGVARBINARY and LONGVARCHAR are already mapped to the MySQL types BLOB and TEXT respectively. But this simply means that it would be transparent to an ODBC application whether a BLOB is being streamed, or retrieved using the standard client/server connection.

Upload Before or After INSERT

There is some debate about how exactly a BLOB should be inserted. The 2 main possibilities are:

1) Upload the BLOB. The upload returns a handle (or key). Perform the INSERT setting the BLOB column to the key value.

2) Perform an INSERT, but specify in a CREATE_BLOB() function the size of the BLOB that is to be inserted into the column. The caller then SELECTs back the inserted data. In the BLOB column the caller finds a handle. The handle can then be used to upload the data.

For method (1) it was also suggested that the client application specify a name, instead of using a handle returned by the server. The name could then be used as input to a "create_blob" function which is specified in the INSERT statement. The advantage of this is that the text of the insert statement is readable.

One advantage of method (1), as pointed out by Monty, is that the client does not have to wait for the upload to complete before doing the INSERT. The server could wait for the upload to complete. This would require a client side generated identifier for each BLOB uploaded.

Note that method (2) requires use of a transaction to prevent the row with the BLOB from being selected before the BLOB data has been uploaded.

Storage of BLOBs

Stewart Smith suggested that the BLOBs be stored by the streaming engine itself. This is something to be considered, and it probably will be necessary in most scale-out scenarios. Otherwise, the current plan is that each engine that is streaming BLOB enabled (i.e. that supports the server-side streaming API) will store its BLOBs itself.

Scaling Writes

Also mentioned in regard to scaling is the fact that read scale-out is usually more important than write scale-out. However, there are web-sites that have a heavy write load. To be investigated is how NDB could be used in combination with the streaming engine to scale out the uploading of BLOBs.

mysqldump & mysql

As pointed out by Monty, the output of mysqldump is a readable stream. So any changes to mysqldump in order to support streamable BLOBs should ensure that this continues to be the case. This is related to the question of how to upload BLOBs using the mysql client. One idea would be to provide a command in mysql to upload a file or a block of binary data or text.

Security

In some cases it may be necessary to check whether a client is authorized to download or upload a BLOB. The most straightforward method would be for the server to issue authorization tokens which are submitted with an upload or download. These tokens would expire after a certain amount of time, or when the associated server connection is closed. However, if the streaming protocol is to be HTTP, then we need to investigate the possibility of using standard HTTP authentication when retrieving and sending BLOBs.

In-place Updates

In-place updates of BLOBs are allowed by some SQL servers (for example, using the UPDATETEXT and WRITETEXT functions supported by MS SQL Server). In this case the streaming enabled engine would be responsible for locking, or using some other method to ensure that concurrent reads of the BLOB remain consistent.

PHP Upload

Georg Richter, who is responsible for the PHP Native Driver at MySQL, noted that using BLOB streaming, data could be uploaded from the Web directly into the database. The PHP extension to the standard MySQL client software should include a function to make this possible.

ALTER TABLE

Georg also pointed out that it should be possible to convert BLOBs currently stored in MySQL tables into streamable BLOBs using ALTER TABLE. This would enable developers to quickly try out streaming BLOBs and test the performance characteristics, in combination with their applications.

These are the main issues raised so far. Any further ideas, comments and questions are welcome! :)

Tuesday, May 01, 2007

PBXT and the MySQL Conference & Expo 2007

The conference is over, and it was a great week, but pretty exhausting! From what I saw, and from what I have heard, the sessions were of a very high standard this year. I have come back with a number of new ideas and quite a few things I would like to implement in PBXT. One of these is the new online backup API for MySQL, which will enable you to make a snapshot backup of data stored by all engines, and do point-in-time recovery.

For me the week started off with the BoF on scalable BLOB streaming on Tuesday evening. The BoF was well attended and there was significant interest in the topic. I will be reporting on some of the issues discussed soon.

On Wednesday morning I presented: PrimeBase XT - Design and Implementation of a Transactional Storage Engine. I was pleased with the number of questions, interest and feedback I received.

When I mentioned to Mårten Mickos that someone had said it was the best session they had heard so far at the conference, Mårten told me that they pay guys to say this to the speakers so that they feel good! So I must thank MySQL for this very thoughtful gesture - joke, of course ;)

Thanks also to Sheeri Kritzer for video recording the talk; I hope to get a link to that soon. In the meantime I have posted my slides to the PBXT home page:

http://www.primebase.com/xt/download/pbxt_mysql_uc_2007.pdf

After that I also took part in 2 lightning round sessions: Top MySQL Community Contributors and State of the Partner Engines. Both sessions were interesting in their diversity. One of the top contributors, Ask Bjørn Hansen, described how meeting his goal of filing a bug a week was a lot easier in the early days of MySQL. More recently, however, he has been helped by the current state of the GUI tools!

During the Partner Engines session, it came as a surprise to some people in the audience that not all engines are free. Actually, there is a wide spectrum, ranging from GPL through partially free to highly priced. In fact, one of the developers of a data-warehousing engine found the question of whether it might be free quite amusing. Solid has not decided to what extent its high-availability offering will be commercial, and the Amazon S3 engine is free, but the service behind it is not. So that's something to watch out for in general.

I took the opportunity during the Partner Engines session to mention our plans for the future of PBXT, which involve building a scalable BLOB streaming infrastructure for MySQL. This is relevant to a number of engine developers, as we will be providing an open system which will enable all engines to stream BLOBs directly in and out of the server. So please check out blobstreaming.org for more details.

Thursday, April 19, 2007

What makes the design of PBXT similar to MyISAM?

PBXT is a transactional storage engine, but what does the design have in common with MyISAM?

I'll be answering this and other questions during my session at the MySQL Users Conference next week:


PrimeBase XT: Design and Implementation of a Transactional Storage Engine
Date: Wednesday, April 25
Time: 10:45am - 11:45am
Location: Ballroom F


So be sure to check it out! :)

Monday, April 09, 2007

PBXT 0.9.86 the MySQL Conference 2007 release!

I have just released PrimeBase XT 0.9.86, which will be my last release before the MySQL User Conference this year. The most significant change in this version is the reduction of the number of data logs used per table. This, and a number of other modifications, makes PBXT fit to handle databases with thousands of tables.

If you would like to learn more about the development and design of PBXT then join me for my session at the MySQL User Conference in Santa Clara, on Wed, April 25, 10:45am - 11:45am, Ballroom F:

PrimeBase XT: Design and Implementation of a Transactional Storage Engine

Looking ahead, I would also like to invite all who are interested to the "Scalable BLOB Streaming Infrastructure" BoF, which is scheduled for Tuesday, April 24, 8:30pm - 9:30pm.

This will be an informal discussion of our plans for the implementation of scalable BLOB streaming in MySQL, and how this all fits together with PBXT and other engines.

We would like as much input as possible at this early stage, so I hope you can make it!

I will also be taking part in a number of lightning rounds so check out this page for a complete list of my sessions:

I'm looking forward to seeing you all there!

Friday, March 23, 2007

PBXT and a scalable BLOB streaming infrastructure

As Kaj Arnö already mentioned in his blog, we have a vision for MySQL and it is called a scalable BLOB streaming infrastructure. Our plan is to build this into and around the MySQL architecture with the help of MySQL and the community.

It is a "big picture" idea and in this way a response to Robin Schumacher's question: The MySQL Vision - What do you see?. But at the same time it is very relevant and practical in the context of the Web 2.0 world.

The design of the system includes a client-side library which extends the existing MySQL client API, a stream based communications protocol and a scalable back-end which is (at least partially) linked into the MySQL server. In short, we want to make it possible to put BLOBs of any size in the database.

The system will be open, and available to all engines running under MySQL. Of course, I believe PBXT has the ideal design and performance characteristics to support this new infrastructure.

Many say that their programs run fine with the BLOBs stored in the file system. And this is true, but what about transactional consistency, backup and replication? And what about scale-out for the storage and delivery of this data? Simply put, we know from our customers in the print and web publishing business, who need this functionality, that these are significant issues.

And, when you think about it, media streaming has a very promising future. I believe it's becoming more important for an increasing number of developers. So this is more about opening up new opportunities and extending the reach of MySQL into new areas.

We have a pretty good idea of the basic design of the system, but this is just the starting point. The design must reflect the requirements of the users and developers, so our first step is to get as many as possible in the MySQL community together who have an interest in this technology.

We want to know what features and characteristics of such a system are important to you. What are your requirements of such a system?

To this end I am arranging a BoF on scalable BLOB streaming at the MySQL user conference next month. This will be a great opportunity to get together and discuss the subject with all who are interested, so I hope you will join me.

Of course you are welcome to contact me any time in this regard by e-mail at paul-dot-mccullagh-at-primebase-dot-com.

Monday, March 19, 2007

PBXT does Windows!

On Friday I posted PBXT version 0.9.85 which concludes my initial work on porting the engine to Windows NT/XP.

I have built a MySQL 5.1 executable (mysqld-nt.exe) to make it easy to try it out. The binary package can be downloaded from http://www.primebase.com/xt. Instructions on how to install are given in the README file.

If you would like to build it yourself, then download the source code from SourceForge.net and follow the instructions in the windows-readme.txt file.

Unfortunately, MySQL 5.1 does not support runtime loading of storage engines on Windows, as it does on Linux and Mac OS X. Last I heard, this feature will also be a bit slow in coming, and is scheduled for 5.2. So if there is anybody else out there who would like to see this feature sooner, let your voices be heard!

On the whole, the port to Windows was fairly straightforward. I missed atomic "seek-and-read/write" functions like pread and pwrite under Windows. But the only real source of problems was the pthread stuff. PBXT uses a bit more of the pthread library than the MySQL server does, so I had to add implementations for several functions.

I would have liked to simply link one of the available pthread libraries for Windows (such as pthreads-win32), but MySQL already implements some of the functions, which could cause link errors. In general, I think MySQL should consider using an LGPL library such as pthreads-win32.

But this is a minor point; PBXT for Windows is running anyway...

Thursday, February 08, 2007

The first binary distribution of PBXT!

It's quite an effort to compile and test 3 versions of MySQL on 4 different machines, but that is what I have done for the first binary distribution of the PrimeBase XT pluggable storage engine.

Of course, it's worth the effort, because this is the "holy grail" of the pluggable storage engine strategy: namely, binary plug-ins that work with the binary distributions prepared by MySQL and others in the community.

And by creating an installer script I have made it easier than ever to install the plug-in. So after you have downloaded the binary package from http://www.primebase.com/xt, all you have to do is:

1. Start your MySQL server.

2. Enter: sudo ./install

The installer will automatically determine the version of the server running and install the corresponding plug-in.

The package includes plug-ins for MySQL 5.1.14, 5.1.15 and 5.2.0 on Linux (32/64-bit) and Mac OS X (x86/PowerPC). Nevertheless you may wonder why the download is 22MB in size.

The reason is that I had to include patched versions of mysqld with the plug-ins for MySQL 5.1.14 and 5.2.0. The patch fixes a bug that causes MySQL to crash when restarted after a plug-in has been installed. The installer generates an uninstall script which restores the unpatched mysqld and removes the plug-in.

The bug is fixed in MySQL 5.1.15, so in the future the package should be much smaller, even when it includes more platforms.