Friday, December 17, 2010

HandlerSocket: Why did our version not take off?

There is quite a buzz about HandlerSocket built into the latest Percona Server. I agree with Henrik that this is a brilliant idea that is going to go very far!

But I did the same thing 2.5 years ago with the BLOB Streaming Engine. In this blog I explain how you can retrieve data out of the database using the BLOB Streaming Engine and a simple URL of the form: http://mysql-host-name:8080/database/table/blob-column/condition

Where condition has the form: column1=value1&column2=value2&...
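The URL scheme above is simple enough to sketch in a few lines. This is just an illustration of building such a retrieval URL; the host, database, table, and column names are hypothetical placeholders, not part of any real installation.

```python
# Build a BLOB Streaming Engine retrieval URL of the form
# http://host:port/database/table/blob-column/condition
# where condition is column1=value1&column2=value2&...
from urllib.parse import urlencode

def pbms_url(host, database, table, blob_column, conditions, port=8080):
    condition = urlencode(conditions)  # column1=value1&column2=value2&...
    return f"http://{host}:{port}/{database}/{table}/{blob_column}/{condition}"

url = pbms_url("mysql-host-name", "shop", "products", "image",
               {"product_id": 42})
print(url)
# http://mysql-host-name:8080/shop/products/image/product_id=42
```

Any HTTP client can then fetch the data, which is exactly the point: no SQL layer, no connector library, just a URL.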

Now I have to ask myself the question: why did we not manage to generate more enthusiasm for the idea?

Many agree that we can learn more from failure than success, so here is my list of top reasons for this particular failure:
  1. Every idea has its time. In the last 2 years the awareness of NoSQL solutions has grown a lot, making RESTful and non-transactional storage and retrieval much better known and generally acceptable.

  2. We had no platform on which to launch the idea. Without a server distribution a plug-in does not have a chance of real exposure (this was not obvious back when we started making plug-ins). Percona Server and MariaDB now present such a platform. This is great for the whole community, so support them! :)

  3. Our software had not been proven in production. And this is one reason why building software based on an idea, instead of an actual project requirement, is quite likely to fail.

  4. We did it with PBXT and not with the mainstream storage engine that everyone is already using. The really exciting thing about HandlerSocket is that you can use it to grab data in your existing database. This will allow it to spread like wildfire in a dry forest.

  5. It is obvious to me that we at PrimeBase have a marketing problem! We have no clue how to get a message across to the public. It is really quite sad, and great technology like PBXT engine-level replication and BLOB streaming may die because of this. The following points also show a lack of marketing skills - so next time you see him, hug your marketing guy! ;)

  6. By using the name BLOB Streaming Engine, we did not make it clear that this works for all kinds of data, not just BLOBs. (OK, MyBS was a terrible name - PBMS not much better - but "HandlerSocket" will prove that success has nothing to do with the name!)

  7. We did not show benchmarks. For me it was obvious that retrieval would be significantly faster if it did not have to go through the SQL interface. Besides, as a developer I know you can easily manipulate benchmark results, so I am reluctant to present them (although I do) for my own software.

For me, as a developer, it is very important that my software gets used. This is why I can understand why there is open source, and why we give away software for free.

But to developers it is not always obvious that giving it away for free does not automatically mean it will get used. So to my hacking compatriots: I hope this list will help you to do things better!

P.S. Congratulations Oracle on release of 5.5!

Thursday, November 18, 2010

My Presentation at the DOAG 2010

Yesterday I presented PBXT: A Transactional Storage Engine for MySQL at the German Oracle User Group Conference (DOAG) in Nuremberg. A number of people asked for the slides, so here is the link.

The talk was scheduled to be in English, but since I had a German-only audience I presented in German. There was quite a bit of interest, particularly in the Engine Level replication built into PBXT 2.0.

As Ronny observed, this feature can be used effectively for many tasks, including online backup and maintaining a hot standby. All this with the addition of a "small" feature:

The Master could initially stream the entire database over to the Slave before actual replication begins. This would also make it extremely easy to set up replication.

A brilliant idea, but a good 3 months work...

Friday, July 16, 2010

PBXT 1.5.02 Beta adds 2nd Level Cache

As many probably already know, PBXT is the first MySQL Storage Engine to use a log-based architecture. Log-based means that data that would normally first be written to the transaction log, and then to the database tables, is just written to the log, and the log becomes part of the database.

The result is that data is only written once, and is always written sequentially. The advantage when writing is obvious, but there is a downside (as always). The data is written to disk in write order, which is seldom the order in which it is retrieved. So this results in a lot of random disk reads when accessing the data later.
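The log-based idea can be sketched in miniature: rows are appended once, sequentially, and an index maps each row to its offset in the log. Later reads then seek to scattered offsets, which is the random-read cost described above. This is a toy illustration, not PBXT's actual on-disk format.

```python
# Minimal append-only log: sequential writes, random reads.
import io

log = io.BytesIO()   # stands in for the on-disk data log
index = {}           # row id -> (offset, length)

def write_row(row_id, data):
    offset = log.seek(0, io.SEEK_END)   # always append: writes are sequential
    log.write(data)
    index[row_id] = (offset, len(data))

def read_row(row_id):
    offset, length = index[row_id]      # retrieval order != write order,
    log.seek(offset)                    # so this seek is effectively random
    return log.read(length)

write_row(1, b"alpha")
write_row(2, b"beta")
print(read_row(2))   # b'beta'
```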

Placing the data logs on a Solid State Drive would solve this problem, because SSDs have no seek time. But the problem with this solution is that SSDs are still way too expensive to base all your storage needs on such hardware.

The solution: an SSD-based 2nd Level Cache.

Using an SSD-based 2nd Level Cache you can store the most commonly accessed parts of your database on SSD for a reasonable price. For example, if you have a Terabyte database, you can cache about 16% (160 GB) of it on SSD for around $400. This can significantly affect the performance of your system.

With this thought in mind, I have just released PBXT 1.5.02 Beta, which implements a 2nd level cache for the data logs. How this works is illustrated below.

Data written to the data log is also written to the main-memory-based Data Log Cache. Once the Data Log Cache is full, pages need to be freed up when new data arrives. Pages that are freed from the Data Log Cache are written to the 2nd Level Cache.

Now, when Data Log records are read, PBXT will first look for the corresponding page in the Data Log Cache. If the page is not there, it will check the 2nd Level Cache before reading from the Data Log itself.
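The read and eviction path described above can be sketched as a small two-level cache. This is a simplified model under my own assumptions (an LRU level-1 cache, a dictionary standing in for the SSD file); the real implementation inside PBXT is of course page- and file-based.

```python
# Level 1: in-memory Data Log Cache (LRU). Level 2: stands in for the
# SSD cache file. The data log itself is only read on a double miss.
from collections import OrderedDict

class TwoLevelCache:
    def __init__(self, l1_pages, read_from_log):
        self.l1 = OrderedDict()       # main-memory Data Log Cache
        self.l2 = {}                  # stands in for the SSD cache file
        self.l1_pages = l1_pages
        self.read_from_log = read_from_log

    def get(self, page_no):
        if page_no in self.l1:                   # level-1 hit
            self.l1.move_to_end(page_no)
            return self.l1[page_no]
        page = self.l2.get(page_no)              # level-2 hit?
        if page is None:
            page = self.read_from_log(page_no)   # fall back to the data log
        self.l1[page_no] = page
        if len(self.l1) > self.l1_pages:         # evict oldest L1 page to L2
            old_no, old_page = self.l1.popitem(last=False)
            self.l2[old_no] = old_page
        return page

calls = []   # track actual data log reads
cache = TwoLevelCache(2, lambda n: calls.append(n) or f"page{n}")
cache.get(1); cache.get(2); cache.get(3)   # page 1 is evicted to level 2
print(cache.get(1), calls)   # page1 now comes from level 2; calls == [1, 2, 3]
```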

PBXT 1.5 is available for download from primebase.org, or you can check out lp:pbxt/1.5 from Launchpad using bazaar. The documentation has also been updated for 1.5.

Using the 2nd level cache is easy. It is controlled by 3 system variables:
  • pbxt_dlog_lev2_cache_file - the name and path of the file in which the data is stored.
  • pbxt_dlog_lev2_cache_size - the size of the 2nd level cache.
  • pbxt_dlog_lev2_cache_enabled - set to 1 to enable the 2nd level cache.
It also makes sense to set a higher value for the Data Log Cache, using the pbxt_data_log_cache_size variable, which has a default value of 16MB.
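For illustration, a my.cnf fragment tying these variables together might look like this. The file path and sizes are hypothetical examples, not recommendations, and the exact value syntax accepted by your build may differ:

```ini
# Example only: enable the SSD-based 2nd level cache for the data logs.
[mysqld]
pbxt_data_log_cache_size=256MB
pbxt_dlog_lev2_cache_file=/ssd/pbxt-lev2-cache.dat
pbxt_dlog_lev2_cache_size=160GB
pbxt_dlog_lev2_cache_enabled=1
```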

Of course it will be interesting to do some benchmarks on this implementation. But that will have to wait until after my holiday! I will be away until late August, but if you decide to test the new version, be sure to let me know.

Friday, July 02, 2010

MySQL Track at the DOAG 2010 Conference

The DOAG 2010 Conference + Exhibition is to be held from the 16th to 18th November in Nuremberg this year. DOAG stands for "Deutsche ORACLE-Anwendergruppe", in English: the German Oracle User's Group.

We will be adding a MySQL track to the conference this year, much like Ronald and Sheeri did for the ODTUG Kaleidoscope 2010. Volker Oboda (of PrimeBase Technologies) is organizing the track and I will be helping to review the submissions. More information is available in German on Volker's MySQL Blog.

So, if you are planning to be in the area, please consider submitting a talk. The deadline for submissions was 30 June, but it has been extended until 10 July for the MySQL track. Talks in English are welcome!

We are looking forward to playing an active part in the German-speaking Oracle community. The size alone is something to marvel at: the DOAG Conference draws over 2500 participants, which is larger than the MySQL Conference (but maybe not for long!).

Friday, June 11, 2010

An Overview of PBXT Versions

If you follow PBXT development you may have noticed a number of different versions of the engine have been mentioned in various talks and blogs.

There is actually a consistent strategy behind all this, which I would like to explain here.

PBXT 1.0 - Current: 1.0.11-3 Pre-GA

Launchpad: lp:pbxt

This is the current PBXT production release. It is stable in all tests and environments in which it is currently in use.

The 1.0.11 version of the engine is available in MariaDB 5.1.47.

PBXT 1.1 - Stability: RC

Launchpad: lp:pbxt/1.1

PBXT 1.1 implements memory resident (MR) tables. These tables can be used for fast, concurrent access to non-persistent data.

1.1 also adds parallel checkpointing. To do this, PBXT starts multiple threads to flush several tables at once during a checkpoint.

This version is feature complete. Unless someone is interested in using MR tables in production, my plan is to leave 1.1 at the RC level and concentrate development on PBXT 1.5 and 2.0.

PBXT 1.1 is part of Drizzle.

PBXT 1.5 - Current: 1.5.01 Beta

Launchpad: lp:pbxt/1.5

PBXT 1.5 changes how the data logs are written, which makes the engine much faster, depending on the database schema.

Previously each user thread wrote its own data log. In version 1.5 the data logs are written the same way the transaction log is written. This means that group commit is implemented for the data logs.

I have also added a data log cache, which can help significantly if your data has hot spots.

The log-based architecture of PBXT makes it possible to write Terabytes of data without degrading performance. But, as the amount of data increases, garbage collection and random read speed can become a problem. I am currently focusing on solving these problems in 1.5.

PBXT 2.0 - Stability: Alpha

Launchpad: lp:pbxt/2.0

The major feature in PBXT 2.0 is engine level replication (ELR). This is an extremely efficient form of replication, while being fully transactional and reliable.

ELR works by transferring changes directly from the PBXT transaction and data logs to the PBXT engine on the slave. This means the binary log does not need to be written or flushed, which can greatly increase the speed of the master server (up to 10x in some tests).

Currently the replication does not handle database schema changes, but it works and is ready for testing.

Setting Priorities

PBXT is a free, open source project which is largely funded by a big name database company.

Nevertheless, I am not bound as to how I set priorities, which means I usually focus on what is important to those using and testing the engine.

Now that you have an overview of what's happening in the PBXT world, let me know if you have a problem that PBXT might fix. I'd be happy to hear from you... :)

Wednesday, May 12, 2010

PBXT 1.0.11 Pre-GA Released!

I have just released PBXT 1.0.11, which I have titled "Pre-GA". Going by our internal tests, and by all instances of PBXT in production and in testing by the community, this is a GA version!

However, although PBXT has thousands of instances in production, it is not yet used in very diverse applications. So I am waiting for wider testing and usage before removing the "Pre" prefix.

You can download the source code from primebase.org, or pull it straight from Launchpad. Here are instructions on how to compile and build the engine with MySQL. PBXT builds with MySQL 5.1.46 GA, and earlier 5.1 versions.

If you don't want to compile it yourself, PBXT 1.0.11 will soon be available in the 5.1.46 release of MariaDB. And, for the more adventurous, PBXT 1.1 is included in Drizzle.

A complete list of all the changes in this version is in the release notes.

If you are testing PBXT and have any questions send me an e-mail. I will be glad to help.

And, oh yes. If you are looking for development or production support for MySQL/MariaDB and PBXT then please write to: support-at-primebase-dot-org.

We are working together with Percona and Monty Program Ab to provide the service level you require.

Monday, April 19, 2010

Stuck in the US of A

As far as I know, nobody who was at the MySQL User Conference and lives in Europe has made it back home yet!

Please leave a comment on this blog as soon as you get home. I am interested to know...

My flight was yesterday, so I have the worst prospects. I am booked on a flight for next week Wednesday (10 days delay)! No joke! :(

Friday, April 16, 2010

The other Oracle ACE Director

While the choices of Ronald Bradford and Sheeri Cabral were natural for Oracle ACE Director, my own nomination was perhaps a bit of a surprise. Well, it was to me anyway.

Those of you at the conference may have noticed that I had no (super-cool) ACE Director jacket when I was called up on the stage...

Well that was because the jacket was too big, and I had already returned it to Lenz for it to be exchanged.

Unfortunately I can't return the shoes because they are too big for me as well...

Wednesday, April 14, 2010

Slides of the PBXT Presentation

Here are the slides to my talk yesterday: A Practical Guide to the PBXT Storage Engine.

For anyone who missed my talk, I think it is worth going through the slides, because they are fairly self-explanatory.

If there are any questions, please post them as a comment to the blog. I will be glad to answer :)

Friday, April 09, 2010

PBXT at the MySQL User Conference 2010

At this year's User Conference I have some interesting results to present. But more than anything else, my talk will explain how you can really get the most out of the engine. The design of PBXT makes it flexible, but this provides a lot of options. What tools are available to help you make the right decisions? I will explain.

Every design has trade-offs. How does this work out in practice for PBXT? And how can you take advantage of the strengths of the storage engine? I will explain in:

A Practical Guide to the PBXT Storage Engine
Paul McCullagh
2:00pm - 3:00pm Tuesday, 04/13/2010
Ballroom E

Don't miss it! :)

Wednesday, March 17, 2010

PBXT Engine Level replication works!

I have been talking about this for a while; now, at last, I have found the time to get started! Below is a picture from my 2008 MySQL User Conference presentation. It illustrates how engine level replication works, and also shows how this can be ramped up to provide a multi-master HA setup.


What I now have running is the first phase: asynchronous replication, in a master/slave configuration. The way it works is simple. For every slave in the configuration the master PBXT engine starts a thread which reads the transaction log, and transfers modifications to a thread which applies the changes to PBXT tables on the slave.

Where to get it

I have pushed the changes that do this trick to PBXT 2.0 on Launchpad. The branch to try out is lp:pbxt/2.0.

Getting started

Setup of the replication is dead easy. Assuming you already have a PBXT database, what you need to do is the following:

1. Copy the Master data: Shut down the MySQL server and make a complete copy of the data directory.

2. Set up a Slave server: Set up a second MySQL server using the copy of the data directory.

3. Declare the Slave: Create a text file called slaves, in the data/pbxt directory of the master server, with the following entry:
[slave]
name=slave-thread-name
host=host-name-of-slave
port=37656
slave-thread-name is any name you like, and is used to identify the replication thread running on the master. host-name-of-slave is the host name or IP address of the slave MySQL server. 37656 is the default port used by the PBXT slave engine to receive replication changes.

4. Enable replication: On the master server set pbxt_enable_replication=1, and on the slave server set pbxt_enable_replication=2. Also make sure that both servers have different server IDs (system parameter: server_id).

5. Start both servers: Replication will begin immediately if the slave server is started before the master server; otherwise replication will begin after a minute (see below).
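Putting steps 3 and 4 together, the files involved might look like this. The server IDs, slave name, and host address are examples only:

```ini
# data/pbxt/slaves on the master (one [slave] section per slave):
[slave]
name=slave1
host=192.168.0.2
port=37656

# my.cnf on the master:
[mysqld]
server_id=1
pbxt_enable_replication=1

# my.cnf on the slave:
[mysqld]
server_id=2
pbxt_enable_replication=2
```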

How it works


PBXT engine level replication, unlike MySQL replication, pushes changes to the slave. For every entry in the data/pbxt/slaves file, PBXT starts a thread (the supplier thread). The thread connects to the slave on the given address, and pushes the changes to an applier thread run by the PBXT engine on the slave side. If any error occurs, the supplier thread on the master will pause, and then try again in a minute.

On connect the supplier thread requests the global transaction ID (GID) of the last transaction committed on the slave. The applier determines the GID of the last transaction by searching backwards through its own transaction logs.

Replication is row-based, and fairly low level. Changes refer to the PBXT internal row and table IDs. The row data is transferred in the same format used to store the information on disk. This makes the replication extremely efficient. The supplier thread does not even have to read the log from disk if it is fairly up-to-date, because PBXT already caches the last changes to the transaction log for use by the writer and the sweeper threads.
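The handshake and push loop described above can be sketched in miniature. Everything here is illustrative (the names, the GID handshake method, the in-memory "log"); the real protocol is internal to the engine and works on binary log records, not strings.

```python
# Toy supplier/applier: on connect the supplier asks the applier for the
# GID of its last committed transaction, then pushes every later change.
def run_supplier(transaction_log, applier):
    last_gid = applier.last_committed_gid()   # handshake on connect
    for gid, change in transaction_log:
        if gid > last_gid:                    # only push what the slave lacks
            applier.apply(gid, change)

class ToyApplier:
    def __init__(self, committed_gid):
        self.gid = committed_gid
        self.applied = []
    def last_committed_gid(self):
        return self.gid
    def apply(self, gid, change):
        self.applied.append((gid, change))
        self.gid = gid

log = [(1, "ins r1"), (2, "upd r1"), (3, "del r2")]
slave = ToyApplier(committed_gid=1)           # slave already has GID 1
run_supplier(log, slave)
print(slave.applied)   # [(2, 'upd r1'), (3, 'del r2')]
```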

Probably the most important thing about this type of replication is that it (theoretically) has almost no effect on the "foreground" activity on the master machine. I am interested to find out whether this really is the case.

What's next?

Replication of DDL changes is not implemented yet. So if you do an ALTER TABLE or any other such operation, replication will stop, and it has to be restarted by copying the data directory over to the slave again.

After DDL changes, the next step is to add synchronous replication, as illustrated above. This requires waiting for a commit from the slave before continuing. Latency in this case can be kept to a minimum by sending transactions to the slave before they have been committed on the master.

I believe this would then provide the basis for an extremely simple (and efficient) HA solution based on MySQL.

Friday, February 26, 2010

Embedded PBXT is Cool

Martin Scholl (@zeit_geist) has started a new project based on the PBXT storage engine: EPBXT - Embedded PBXT! In his first blog he describes how you can easily build the latest version: Building Embedded PBXT from bzr.

The interesting thing about this project is that it exposes the "raw" power of the engine. Some basic performance tests show this really is the case.

At the lowest level, PBXT does not impose any format on the data stored in tables and indexes. When running as a MySQL storage engine it uses the MySQL native row and index formats. Theoretically it would be possible to expose this in an embedded API. The work Martin is doing goes in at this level. The wrapper around the engine determines the data types, data sizes, row and index format. Comparison operations for the data types are also supplied by the embedded code or user program.

This flexibility will make it possible for an application to store its own data very efficiently. As Martin suggested, it would also be possible to use Google's protobufs for the row format. This would eliminate the need for an ALTER TABLE for many types of changes to a table's definition!

Of course, EPBXT is still some way from realizing this vision, and Martin has some very specific problems he wants to solve with the development. However, judging by his command of the code within such a short time, this is going to be a project to watch in the future!

Monday, February 08, 2010

Ken we will miss you!

What does it take for someone fiercely loyal to a company to suddenly leave? Ken Jakobs, Oracle employee number 18, a man who sincerely loves the company, has resigned! The only reason I can think of is an extreme snub!

I must say, I am very disappointed. The prospect of Ken running MySQL was a light at the end of the tunnel for the community. Why? Because Ken is a MySQL insider! He knows the project, he knows the community.

As an engine developer I have come to know Ken well over the last 4 years. He led the InnoDB team and is largely responsible for the improvements made to the engine since the Oracle acquisition. At the yearly Engine Summit he was always professional and constructive in his suggestions, with a deep technical knowledge of the subject. His track record shows that he has always kept his word with regard to Oracle's intentions for InnoDB, and I would trust him to do the same with MySQL.

Goodbye Ken. This is great loss for both the MySQL community and Oracle!