Friday, September 30, 2011

What's happened to the MySQL Storage Engine Vendor Advisory Board?

As most of you know, the Engine Vendor Advisory Board was set up by Oracle under terms that Oracle specified for itself when acquiring MySQL. I am referring to point number 8 on the 10-point list that played a major role in quieting nervous bureaucrats in Europe.

PrimeBase Technologies is a member of the Board and so far we have heard nothing of a meeting this year.

Has the Board been quietly disbanded?

If so, what does this mean for the other promises on Oracle's list...

UPDATE: We are in contact with Oracle concerning this. I will keep you all posted. Hopefully it was just a communications error.

Tuesday, April 12, 2011

PBXT "Secrets" at the MySQL Conference

In my presentation tomorrow at the MySQL Conference I plan to talk about some aspects of PBXT that I have never spoken about before. Here are the details of the presentation:

Update on the PBXT Storage Engine
10:50am Wednesday, 04/13/2011
Location: Ballroom D

Of course nothing about the engine is really a secret, if you are prepared to read the code. But who does that, right? I am pretty sure that not even developers of other engines have spent much time (if any) on that.

But really, there are some gems stuck away in those X 1000 lines of code, and I plan to pick out a few tomorrow and show them to you. So don't miss it! :)

Friday, December 17, 2010

HandlerSocket: Why did our version not take off?

There is quite a buzz about HandlerSocket built into the latest Percona Server. I agree with Henrik that this is a brilliant idea that is going to go very far!

But I did the same thing 2.5 years ago with the BLOB Streaming Engine. In this blog I explain how you can retrieve data out of the database using the BLOB Streaming Engine and a simple URL of the form: http://mysql-host-name:8080/database/table/blob-column/condition

Where condition has the form: column1=value1&column2=value2&...
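To make the URL layout concrete, here is a minimal sketch of how a client might assemble such a request. The helper name build_blob_url and the sample database, table, and column names are invented for this example, and URL-encoding the values is my assumption; only the path layout and port 8080 come from the description above.

```python
from urllib.parse import quote, urlencode


def build_blob_url(host, database, table, blob_column, conditions, port=8080):
    """Build http://host:port/database/table/blob-column/condition.

    `conditions` maps column names to values and is rendered as
    column1=value1&column2=value2&...
    """
    # Escape each path component so unusual names cannot break the path.
    path = "/".join(quote(part, safe="") for part in (database, table, blob_column))
    condition = urlencode(conditions)
    return f"http://{host}:{port}/{path}/{condition}"


# Hypothetical schema: database "shop", table "products", BLOB column "image".
url = build_blob_url("mysql-host-name", "shop", "products", "image",
                     {"id": 42, "lang": "en"})
print(url)
```

The resulting URL can then be fetched with any HTTP client, which is the whole point: no SQL connection is needed to get the data out.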

Now I have to ask myself the question: why did we not manage to generate more enthusiasm for the idea?

Many agree that we can learn more from failure than success, so here is my list of top reasons for this particular failure:
  1. Every idea has its time. In the last 2 years the awareness of NoSQL solutions has grown a lot, making RESTful and non-transactional storage and retrieval much better known and generally acceptable.

  2. We had no platform on which to launch the idea. Without a server distribution a plug-in does not have a chance of real exposure (this was not obvious back when we started making plug-ins). Percona Server and MariaDB now present such a platform. This is great for the whole community, so support them! :)

  3. Our software had not been proven in production. And this is one reason why building software based on an idea, instead of an actual project requirement is quite likely to fail.

  4. We did it with PBXT and not a mainstream storage engine that everyone is already using. The really exciting thing about HandlerSocket is that you can use it to grab data in your existing database. This will allow it to spread like wildfire in a dry forest.

  5. It is obvious to me that we at PrimeBase have a marketing problem! We have no clue how to get a message across to the public. It is really quite sad, and great technology like PBXT engine-level replication and BLOB streaming may die because of this. The following points also show lack of marketing skills - so next time you see him, hug your marketing guy! ;)

  6. By using the name BLOB Streaming Engine, we did not make it clear that this works for all kinds of data, not just BLOBs. (OK, and MyBS was a terrible name - PBMS not much better - but "HandlerSocket" will prove that it has nothing to do with the name!)

  7. We did not show benchmarks. For me it was obvious that retrieval would be significantly faster if it did not have to go through the SQL interface. Besides, as a developer I know you can easily manipulate benchmark results, so I am reluctant to present them (although I do) for my own software.

For me, as a developer, it is very important that my software gets used. This is why I can understand why there is open source, and why we give away software for free.

But to developers it is not always obvious that giving it away for free does not automatically mean it will get used. So to my hacking compatriots: I hope this list will help you to do things better!

P.S. Congratulations Oracle on the release of 5.5!

Thursday, November 18, 2010

My Presentation at the DOAG 2010

Yesterday I presented PBXT: A Transactional Storage Engine for MySQL at the German Oracle User Group Conference (DOAG) in Nuremberg. A number of people asked for the slides, so here is the link.

The talk was scheduled to be in English, but since I had a German-only audience I presented in German. There was quite a bit of interest, particularly in the Engine Level replication built into PBXT 2.0.

As Ronny observed, this feature can be used effectively for many tasks, including for online backup and maintaining a hot-standby. This all with the addition of a "small" feature:

The Master could initially stream the entire database over to the Slave before actual replication begins. This would also make it extremely easy to set up replication.

A brilliant idea, but a good 3 months' work...

Friday, July 16, 2010

PBXT 1.5.02 Beta adds 2nd Level Cache

As many probably already know, PBXT is the first MySQL Storage Engine to use a log-based architecture. Log-based means that data that would normally first be written to the transaction log, and then to the database tables, is just written to the log, and the log becomes part of the database.

The result is that data is only written once, and is always written sequentially. The advantage when writing is obvious, but there is a downside (as always). The data is written to the disk in write order, which is seldom the order in which the data is retrieved. So this results in a lot of random reads to the disk when accessing the data later.

Placing the data logs on a Solid State Drive would solve this problem, because SSDs have no seek time. But the problem with this solution is that SSDs are still way too expensive to base all your storage needs on such hardware.

The solution: an SSD-based 2nd Level Cache.

Using an SSD-based 2nd Level Cache you can store the most commonly accessed parts of your database on SSD for a reasonable price. For example, if you have a Terabyte database, you can cache about 15% (160 GB) of it on SSD for around $400. This can significantly affect the performance of your system.

With this thought in mind, I have just released PBXT 1.5.02 Beta, which implements a 2nd level cache for the data logs. How this works is illustrated below.

Data written to the data log is also written to the main-memory-based Data Log Cache. Once the Data Log Cache is full, pages need to be freed up when new data arrives. Pages that are freed from the Data Log Cache are written to the 2nd Level Cache.

Now, when the Data Log records are read, PBXT will read the corresponding page from the Data Log Cache. If the page is not already in the cache, it will first check to see if the page is in the 2nd Level Cache, before reading from the Data Log itself.
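The read and write paths described above can be modelled in a few lines. This is a toy sketch, not PBXT code: the class name, the LRU eviction policy, and the decision to copy a page back into memory after a 2nd level hit are all my assumptions, and plain dictionaries stand in for the SSD file and the data log.

```python
from collections import OrderedDict


class TwoLevelLogCache:
    """Toy model: in-memory Data Log Cache, 2nd Level Cache, then the log."""

    def __init__(self, data_log, mem_pages, lev2_pages):
        self.data_log = data_log      # page number -> data (stand-in for disk)
        self.mem = OrderedDict()      # Data Log Cache, oldest entry first
        self.lev2 = OrderedDict()     # stand-in for the SSD-based cache
        self.mem_pages = mem_pages
        self.lev2_pages = lev2_pages

    def _put_mem(self, page_no, data):
        self.mem[page_no] = data
        self.mem.move_to_end(page_no)
        # Pages freed from the memory cache are handed to the 2nd level cache.
        while len(self.mem) > self.mem_pages:
            old_no, old_data = self.mem.popitem(last=False)
            self.lev2[old_no] = old_data
            self.lev2.move_to_end(old_no)
            while len(self.lev2) > self.lev2_pages:
                self.lev2.popitem(last=False)

    def write(self, page_no, data):
        # Data goes to the log and, at the same time, to the memory cache.
        self.data_log[page_no] = data
        self._put_mem(page_no, data)

    def read(self, page_no):
        """Return (data, source), where source names the tier that answered."""
        if page_no in self.mem:
            self.mem.move_to_end(page_no)
            return self.mem[page_no], "memory"
        if page_no in self.lev2:
            data = self.lev2[page_no]
            self._put_mem(page_no, data)
            return data, "lev2"
        data = self.data_log[page_no]
        self._put_mem(page_no, data)
        return data, "log"


cache = TwoLevelLogCache({}, mem_pages=2, lev2_pages=1)
for n in range(4):
    cache.write(n, f"page-{n}")
for n in (3, 1, 0):
    print(n, cache.read(n)[1])
```

With these tiny sizes, page 3 is still in memory, page 1 has been pushed down to the 2nd level, and page 0 has fallen out of both caches and must come from the log itself.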

PBXT 1.5 is available for download from primebase.org, or you can check out lp:pbxt/1.5 from Launchpad using bazaar. The documentation has also been updated for 1.5.

Using the 2nd level cache is easy. It is controlled by 3 system variables:
  • pbxt_dlog_lev2_cache_file - the name and path of the file in which the data is stored.
  • pbxt_dlog_lev2_cache_size - the size of the 2nd level cache.
  • pbxt_dlog_lev2_cache_enabled - set to 1 to enable the 2nd level cache.
It also makes sense to set a higher value for the Data Log Cache, using the pbxt_data_log_cache_size variable, which has a default value of 16MB.

Of course it will be interesting to do some benchmarks on this implementation. But that will have to wait until after my holiday! I will be away until late August, but if you decide to test the new version, be sure to let me know.

Friday, July 02, 2010

MySQL Track at the DOAG 2010 Conference

The DOAG 2010 Conference + Exhibition is to be held from the 16th to 18th November in Nuremberg this year. DOAG stands for "Deutsche ORACLE-Anwendergruppe", in English: the German Oracle User's Group.

We will be adding a MySQL track to the conference this year, much like Ronald and Sheeri did for the ODTUG Kaleidoscope 2010. Volker Oboda (of PrimeBase Technologies) is organizing the track and I will be helping to review the submissions. More information is available in German on Volker's MySQL Blog.

So, if you are planning to be in the area, please consider submitting a talk. The deadline for submissions was 30 June, but has been extended until 10 July for the MySQL track. Talks in English are welcome!

We are looking forward to playing an active part in the German speaking Oracle community. Just the size is something to wonder about. The DOAG Conference draws over 2500 participants, which is larger than the MySQL Conference (but maybe not for long!).

Friday, June 11, 2010

An Overview of PBXT Versions

If you follow PBXT development you may have noticed a number of different versions of the engine have been mentioned in various talks and blogs.

There is actually a consistent strategy behind all this, which I would like to explain here.

PBXT 1.0 - Current: 1.0.11-3 Pre-GA

Launchpad: lp:pbxt

This is the current PBXT production release. It is stable in all tests and environments in which it is currently in use.

The 1.0.11 version of the engine is available in MariaDB 5.1.47.

PBXT 1.1 - Stability: RC

Launchpad: lp:pbxt/1.1

PBXT 1.1 implements memory resident (MR) tables. These tables can be used for fast, concurrent access to non-persistent data.

1.1 also adds parallel checkpointing. To do this, PBXT starts multiple threads to flush several tables at once during a checkpoint.

This version is feature complete. Unless someone is interested in using MR tables in production, my plan is to leave 1.1 at the RC level and concentrate development on PBXT 1.5 and 2.0.

PBXT 1.1 is part of Drizzle.

PBXT 1.5 - Current: 1.5.01 Beta

Launchpad: lp:pbxt/1.5

PBXT 1.5 changes how the data logs are written, which makes the engine much faster, depending on the database schema.

Previously each user thread wrote its own data log. In version 1.5 the data logs are written the same way the transaction log is written. This means that group commit is implemented for the data logs.

I have also added a data log cache which can help significantly if your data has hot spots.

The log-based architecture of PBXT makes it possible to write Terabytes of data without degrading performance. But, as the amount of data increases, garbage collection and random read speed can become a problem. I am currently focusing on solving these problems in 1.5.

PBXT 2.0 - Stability: Alpha

Launchpad: lp:pbxt/2.0

The major feature in PBXT 2.0 is engine level replication (ELR). This is an extremely efficient form of replication, while being fully transactional and reliable.

ELR works by transferring changes directly from the PBXT transaction and data logs to the PBXT engine on the slave. This means the binary log does not need to be written or flushed, which can greatly increase the speed of the master server (up to 10x in some tests).

Currently the replication does not handle database schema changes, but it works and is ready for testing.

Setting Priorities

PBXT is a free, open source project which is largely funded by a big name database company.

Nevertheless, I am not bound as to how I set priorities, which means I usually focus on what is important to those using and testing the engine.

Now that you have an overview of what's happening in the PBXT world, let me know if you have a problem that PBXT might fix. I'd be happy to hear from you... :)

Wednesday, May 12, 2010

PBXT 1.0.11 Pre-GA Released!

I have just released PBXT 1.0.11, which I have titled "Pre-GA". Going by our internal tests, and all instances of PBXT in production and testing by the community, this is a GA version!

However, although PBXT has 1000's of instances in production, it is not used in very diverse applications. So I am waiting for wider testing and usage before removing the "Pre" prefix.

You can download the source code from primebase.org, or pull it straight from Launchpad. Here are instructions how to compile and build the engine with MySQL. PBXT builds with MySQL 5.1.46 GA, and earlier 5.1 versions.

If you don't want to compile it yourself, PBXT 1.0.11 will soon be available in the 5.1.46 release of MariaDB. And, for the more adventurous, PBXT 1.1 is included in Drizzle.

A complete list of all the changes in this version is in the release notes.

If you are testing PBXT and have any questions send me an e-mail. I will be glad to help.

And, oh yes. If you are looking for development or production support for MySQL/MariaDB and PBXT then please write to: support-at-primebase-dot-org.

We are working together with Percona and Monty Program Ab to provide the service level you require.

Monday, April 19, 2010

Stuck in the US of A

As far as I know, nobody who was at the MySQL User Conference and lives in Europe has made it back home yet!

Please leave a comment on this blog as soon as you get home. I am interested to know...

My flight was yesterday, so I have the worst prospects. I am booked on a flight for next week Wednesday (10 days delay)! No joke! :(

Friday, April 16, 2010

The other Oracle ACE Director

While the choice of Ronald Bradford and Sheeri Cabral was natural for Oracle ACE Director, my own nomination was perhaps a bit of a surprise. Well, it was to me anyway.

Those of you at the conference may have noticed that I had no (super-cool) ACE Director jacket when I was called up on the stage...

Well that was because the jacket was too big, and I had already returned it to Lenz for it to be exchanged.

Unfortunately I can't return the shoes because they are too big for me as well...

Wednesday, April 14, 2010

Slides of the PBXT Presentation

Here are the slides to my talk yesterday: A Practical Guide to the PBXT Storage Engine.

For anyone who missed my talk, I think it is worth going through the slides, because they are fairly self-explanatory.

If there are any questions, please post them as a comment to the blog. I will be glad to answer :)

Friday, April 09, 2010

PBXT at the MySQL User Conference 2010

At this year's User Conference I have some interesting results to present. But more than anything else, my talk will explain how you can really get the most out of the engine. The design of PBXT makes it flexible, but this provides a lot of options. What tools are available to help you make the right decisions? I will explain.

Every design has trade-offs. How does this work out in practice for PBXT? And how can you take advantage of the strengths of the storage engine? I will explain in:

A Practical Guide to the PBXT Storage Engine
Paul McCullagh
2:00pm - 3:00pm Tuesday, 04/13/2010
Ballroom E

Don't miss it! :)

Wednesday, March 17, 2010

PBXT Engine Level replication, works!

I have been talking about this for a while, and now at last I have found the time to get started! Below is a picture from my 2008 MySQL User Conference presentation. It illustrates how engine level replication works, and also shows how this can be ramped up to provide a multi-master HA setup.


What I now have running is the first phase: asynchronous replication, in a master/slave configuration. The way it works is simple. For every slave in the configuration the master PBXT engine starts a thread which reads the transaction log, and transfers modifications to a thread which applies the changes to PBXT tables on the slave.

Where to get it

I have pushed the changes that do this trick to PBXT 2.0 on Launchpad. The branch to try out is lp:pbxt/2.0.

Getting started

Setup of the replication is dead easy. Assuming you already have a PBXT database, what you need to do is the following:

1. Copy the Master data: Shut down the MySQL server and make a complete copy of the data directory.

2. Set up a Slave server: Set up a second MySQL server using the copy of the data directory.

3. Declare the Slave: Create a text file called slaves, in the data/pbxt directory of the master server, with the following entry:
[slave]
name=slave-thread-name
host=host-name-of-slave
port=37656
slave-thread-name is any name you like, and is used to identify the replication thread running on the master. host-name-of-slave is the host name or IP address of the slave MySQL server. 37656 is the default port used by the PBXT slave engine to receive replication changes.

4. Enable replication: On the master server set pbxt_enable_replication=1, and on the slave server set pbxt_enable_replication=2. Also make sure that both servers have different server IDs (system parameter: server_id).

5. Start both servers: Replication will begin immediately if the slave server is started before the master server, otherwise replication will begin after a minute (see below).

How it works


PBXT engine level replication, unlike MySQL replication, pushes changes to the slave. For every entry in the data/pbxt/slaves file, PBXT starts a thread (the supplier thread). The thread connects to the slave on the given address, and pushes the changes to an applier thread run by the PBXT engine on the slave side. If any error occurs, the supplier thread on the master will pause, and then try again in a minute.

On connect the supplier thread requests the global transaction ID (GID) of the last transaction committed on the slave. The applier determines the GID of the last transaction by searching backwards through its own transaction logs.

Replication is row-based, and fairly low level. Changes refer to the PBXT internal row and table IDs. The row data is transferred in the same format used to store the information on disk. This makes the replication extremely efficient. The supplier thread does not even have to read the log from disk if it is fairly up-to-date, because PBXT already caches the last changes to the transaction log for use by the writer and the sweeper threads.
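The handshake between supplier and applier can be sketched roughly as follows. This is a simplified model for illustration only: the class and method names are invented, the logs are plain Python lists rather than PBXT log files, and I am assuming that GIDs increase in commit order.

```python
class Applier:
    """Slave side: applies pushed transactions to its own log."""

    def __init__(self, log=None):
        # Each entry is (gid, changes), kept in commit order.
        self.log = list(log or [])

    def last_committed_gid(self):
        # Look at the end of the local log for the newest transaction.
        return self.log[-1][0] if self.log else 0

    def apply(self, gid, changes):
        self.log.append((gid, changes))


class Supplier:
    """Master side: pushes everything the slave has not seen yet."""

    def __init__(self, master_log):
        self.master_log = master_log

    def push_to(self, applier):
        # On connect, ask the slave where it left off.
        start = applier.last_committed_gid()
        sent = 0
        for gid, changes in self.master_log:
            if gid > start:
                applier.apply(gid, changes)
                sent += 1
        return sent


master_log = [(1, "insert row 10"), (2, "update row 10"), (3, "delete row 7")]
slave = Applier(log=[(1, "insert row 10")])
supplier = Supplier(master_log)
print(supplier.push_to(slave), slave.last_committed_gid())
```

Because the slave reports its own position, the master does not need to keep any per-slave state: after a reconnect it simply asks again and continues from there.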

Probably the most important thing about this type of replication is that it (theoretically) has almost no effect on the "foreground" activity on the master machine. I am interested to find out if this really is the case.

What's next?

Replication of DDL changes is not implemented yet. So if you do an ALTER TABLE or any other such operation, replication will stop, and has to be restarted by copying the data directory over to the slave again.

After DDL changes the next step is to add synchronous replication, as illustrated above. This requires waiting for a commit from the slave before continuing. Latency in this case can be kept to a minimum by sending transactions to the slave before they have been committed on the master.

I believe this would then provide the basis for an extremely simple (and efficient) HA solution based on MySQL.

Friday, February 26, 2010

Embedded PBXT is Cool

Martin Scholl (@zeit_geist) has started a new project based on the PBXT storage engine: EPBXT - Embedded PBXT! In his first blog he describes how you can easily build the latest version: Building Embedded PBXT from bzr.

The interesting thing about this project is that it exposes the "raw" power of the engine. Some basic performance tests show this really is the case.

At the lowest level, PBXT does not impose any format on the data stored in tables and indexes. When running as a MySQL storage engine it uses the MySQL native row and index formats. Theoretically it would be possible to expose this in an embedded API. The work Martin is doing goes in at this level. The wrapper around the engine determines the data types, data sizes, row and index format. Comparison operations for the data types are also supplied by the embedded code or user program.

This flexibility will make it possible for an application to store its own data very efficiently. As Martin suggested, it would also be possible to use Google's protobufs for the row format. This would eliminate the need to use an ALTER TABLE for many types of changes to a table's definition!

Of course, EPBXT is still a way from realizing this vision, and Martin has some very specific problems he wants to solve with the development. However, judging by his command of the code within such a short time, this is going to be a project to watch in the future!

Monday, February 08, 2010

Ken we will miss you!

What does it take for someone, fiercely loyal to a company, to suddenly leave? Ken Jacobs, Oracle employee number 18, a man that sincerely loves the company, has resigned! The only reason I can think of is an extreme snub!

I must say, I am very disappointed. The prospect of Ken running MySQL was a light at the end of the tunnel for the community. Why? Because Ken is a MySQL insider! He knows the project, he knows the community.

As an engine developer I have come to know Ken well over the last 4 years. He led the InnoDB team and is largely responsible for the improvements made to the engine since the Oracle acquisition. At the yearly Engine Summit he was always professional and constructive in his suggestions, with a deep technical knowledge of the subject. His track record shows that he has always kept his word with regard to Oracle's intentions with InnoDB, and I would trust him to do the same with MySQL.

Goodbye Ken. This is a great loss for both the MySQL community and Oracle!

Thursday, December 31, 2009

PBXT 1.0.10, New Year Release!

I have just released PBXT 1.0.10 RC4. The sources can be downloaded from primebase.org, or from Launchpad.

The major feature in this release is the implementation of the pbxt_flush_log_at_trx_commit system variable. Similar to the InnoDB equivalent, this variable allows you to determine the level of durability of transactions.

This is a trade-off: by decreasing durability, the speed of database update operations can be increased.

The default setting is 1, which means full durability: the transaction log is flushed on every transaction commit.

Setting the variable to 2 reduces durability, by just writing the log on transaction commit (no flush is done). In this case, transactions can only be lost if the entire server machine goes down (for example a power failure).

The lowest level of durability is 0. In this case the transaction log is not written on transaction commit. Transactions can be lost if the server crashes.

In the case of 2 and 0, the engine flushes the transaction log at least once per second. So only transactions executed within the last second can be lost.
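The difference between the three settings comes down to when the log is written and when it is flushed. The sketch below models that with an ordinary file; it is not how PBXT implements it. The class and method names are invented, and the once-per-second flush is represented by a background_tick() method that a real engine would call from a background thread.

```python
import os
import tempfile
import time


class TransactionLog:
    """Toy log with the three durability levels described above."""

    def __init__(self, path, flush_log_at_trx_commit=1):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.mode = flush_log_at_trx_commit
        self.buffer = []
        self.writes = 0
        self.flushes = 0
        self.last_flush = time.monotonic()

    def _write(self):
        # Hand buffered records to the operating system (no flush to disk).
        if self.buffer:
            os.write(self.fd, b"".join(self.buffer))
            self.buffer.clear()
            self.writes += 1

    def _flush(self):
        self._write()
        os.fsync(self.fd)
        self.flushes += 1
        self.last_flush = time.monotonic()

    def commit(self, record):
        self.buffer.append(record)
        if self.mode == 1:      # full durability: write and flush
            self._flush()
        elif self.mode == 2:    # write only: survives a server crash
            self._write()
        # mode 0: the record stays in memory until the next background flush

    def background_tick(self):
        # For levels 0 and 2, flush at least once per second.
        if self.mode != 1 and time.monotonic() - self.last_flush >= 1.0:
            self._flush()

    def close(self):
        self._flush()
        os.close(self.fd)


def demo(mode):
    """Commit three records and report (writes, flushes) done at commit time."""
    with tempfile.TemporaryDirectory() as tmp:
        log = TransactionLog(os.path.join(tmp, "xlog"), mode)
        for _ in range(3):
            log.commit(b"trx\n")
        counts = (log.writes, log.flushes)
        log.close()
        return counts


for mode in (1, 2, 0):
    print(mode, demo(mode))
```

At level 1 every commit costs a flush, at level 2 every commit costs only a write, and at level 0 a commit touches nothing but memory.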

Ironically, PBXT started life as a "partially durable" storage engine (level 2 according to the description above). Almost exactly 2 years ago I started the implementation of full durability. It has taken a while to build in the original "feature" :)

The main reason for doing this has been the Mac version, and our work with TeamDrive. On the Mac the fsync() operation is a fake. To do a true flush to disk you have to call fcntl(of->of_filedes, F_FULLFSYNC, 0). Problem is, the real flush is incredibly slow (about 20 times slower than fsync), but necessary to avoid any corruption.
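For anyone who wants to experiment with the difference, the same call can be made from a script. The following is a small sketch, not taken from the PBXT sources: the function name full_flush is mine, and it falls back to a plain fsync on platforms where Python's fcntl module has no F_FULLFSYNC constant or where the call is refused.

```python
import fcntl
import os
import tempfile


def full_flush(fd):
    """Flush fd to stable storage, preferring F_FULLFSYNC where available.

    Returns the name of the method that was actually used.
    """
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)  # defined on macOS
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported for this file; fall back below
    os.fsync(fd)
    return "fsync"


with tempfile.TemporaryFile() as f:
    f.write(b"commit record")
    f.flush()  # push Python's own buffer to the OS first
    method = full_flush(f.fileno())
    print(method)
```

Timing a loop of full_flush() calls against a loop of os.fsync() calls on a Mac is an easy way to see the cost of the real flush for yourself.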

The advantage of a lot of applications like TeamDrive is that they can tolerate a lower level of durability. So we can look forward to an even speedier TeamDrive in the future :)

I would love to hear from anyone testing the new version. Bugs can be reported on Launchpad, as usual.

Happy New Year to you all!

Monday, December 14, 2009

Monty's appeal is selfless!

What many people don't get is that Monty's appeal to the MySQL community to help save MySQL is really quite selfless.

The fact is, Monty's own company, Monty Program Ab, stands to benefit the most from bad stewardship of MySQL by Oracle.

If Oracle slows and closes up development, rejects community contributions and creates a commercial version of MySQL, then Monty Program's MariaDB fork will become very popular, very quickly.

Which would translate into income for Monty Program Ab as customers come to his company for additions, features and bug fixes that they need to secure their own production.

What Monty is concerned about is the commercial vendors of MySQL (one of which Monty Program is not).

These vendors either:
  • OEM MySQL and integrate it into a commercial software or hardware product, or
  • they produce a closed source (or dual-license) storage engine, which is sold with a commercial version of MySQL.
Oracle could kill both businesses, and this is Monty's main concern. As Monty explained in a phone call this morning: he sees the existence of commercial/dual-license vendors of MySQL as very important to the long-term survival of "his baby".

Of course Oracle cannot prevent 3rd parties from continuing to offer consulting, support and training for MySQL. But close sourcing and vigorous enforcement of trademarks can make things very difficult for such companies.

Unfortunately Oracle's latest concessions may not be enough to satisfy investors in MySQL based technology either, because there is no guarantee of what happens after 5 years.

Wednesday, November 11, 2009

The EU's real problem: MySQL and Oracle do not compete!

I think that most people are missing the point, Oracle included. The main objection of the EU is not that Oracle is swallowing up a major competitor.

To understand this you have to read between the lines of the EU decision:

"The regulators see a major conflict of interest in the world's largest commercial database company owning its largest open-source competitor"

This should actually read: "the world's largest commercial database company owning the largest open-source database"

The database market is divided into 2 parts: the back-office and the online world.

And now you know what I am going to say ... Oracle has a near monopoly in back-office and MySQL has a monopoly in online applications.

So let's do a little maths:

Let us assume that back-office and online applications divide the database market into 2 equal parts, that Oracle owns 60% of the back-office, and that MySQL owns 90% of the online world.

This means that Oracle controls 30% (60% of 50%) of the entire database market today, but after the acquisition this number will be 75% (30% + 90% of 50%).
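Spelled out step by step, the sum looks like this. The variable names are mine, and the input percentages are the assumed figures from above, not measured market data.

```python
# All figures are percentages of the whole database market.
back_office_share = 50      # assumed: half the market is back-office
online_share = 50           # assumed: the other half is online
oracle_in_back_office = 60  # assumed: Oracle's share of back-office
mysql_in_online = 90        # assumed: MySQL's share of online

oracle_today = back_office_share * oracle_in_back_office // 100
mysql_today = online_share * mysql_in_online // 100
combined = oracle_today + mysql_today

print(oracle_today, mysql_today, combined)
```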

Something to think about.

Friday, September 18, 2009

The mysterious Storage Engine Independent Test Suite

Recently Mark observed that we now all need a storage engine independent test suite, Sun included! Well, as far as I know, there is such a thing at Sun, sort of. Apparently it has been used to test PBXT and other engines, but I've heard it is not in good enough shape to be released.

But my question is, why not release it anyway? We could turn it into an engine community project. I believe there are enough engine developers out there to get this moving forward.

The secret is to start small, and just get a few tests to run with all engines. Then additional tests can be added step by step. Engines need a way to specify that they want to skip a test entirely (e.g. transactional tests), and it should be easy to customize results for various engines.

An example of a simple and elegant solution can be found in Drizzle. As Monty Taylor mentioned in a comment to Mark's blog: "We have some patches to test-run in Drizzle to allow running the whole test suite with a specified storage engine".

I think it has been long enough. This could be a good opportunity to start a Sun/Community project, something like Drizzle. In other words, get something out there, even if it is incomplete, and let the community also take a large part of the responsibility.

Friday, September 11, 2009

PBXT 1.0.09 RC3 implements XA and online backup

I have just released PBXT 1.0.09 RC3. Besides bug fixes (details in the release notes), this version includes 2 Beta features:
  • XA/2-Phase Commit support
  • Native online backup Driver
XA support has been around MySQL for quite a while, and we all know of its usefulness, for example when sharding. So I was surprised to find a bug in the XA recovery: Bug #47134. Contrary to what is reported, the crash can also occur when using XA with just the default engines installed, so watch out for that one (the good news: the bug fix is simple).

Online backup is really cool! I have heard that it may soon be released in a coming version of 5.4, so let's hope that this is true.

In a little test, I did a backup of a 10GB database in 49.26 seconds! Admittedly this was on a system with 4 15K drives in a RAID 0 configuration. But that is still fantastic, considering the tables are not even locked during this time!

The database itself took 19 min. 56 sec. to generate. A complete restore took only 14 min. 29 sec.

But, it gets even better....

I have been working on PBXT 1.1, where I have done a number of things to improve the I/O performance of the engine.

In the same test as above, run with PBXT 1.1, the time to generate the database was 9 min. 35 sec., and the time to restore was 6 min. 18 sec.! (Time to generate the backup was identical.)

PBXT 1.1 is available directly from Launchpad here: lp:~pbxt-core/pbxt/staging, if you are interested in trying it out. 1.1 also has full support for memory based tables.

The new release candidate (PBXT 1.0.09) can be downloaded from primebase.org/download. It is also available from Launchpad as the rc3 series: lp:pbxt/rc3.

Please report bugs here.

Any feedback is welcome! You can use Launchpad questions or the PBXT mailing list for this purpose.