Tuesday, June 05, 2007

"GET"ing a BLOB from the database with the BLOB Streaming Engine

Current plans call for the use of the HTTP protocol to upload and retrieve BLOBs to and from the database. The BLOB Streaming Engine makes this possible by integrating a lightweight HTTP server directly into the MySQL server.

I am currently working on an alpha implementation of this, and would like to give a short demonstration of what is possible using this system.

We start by creating a table using any streaming enabled storage engine (a streaming enable storage engine, is an engine that supports the server-side streaming API):
use test;
CREATE TABLE notes_tab (
n_id INTEGER PRIMARY KEY,
n_text BLOB
) ENGINE=pbxt;
INSERT notes_tab VALUES (1, "This is a BLOB streaming test!");

Now assuming the MySQL server is on the localhost, and the BLOB streaming engine has been set to port 8080, you can open your browser, and enter this URL:

http://localhost:8080/test/notes_tab/n_text/n_id=1

With the following result:



So without even doing a SELECT, you can GET a BLOB directly out of the database!

Note that there is no need for the BLOB in the database to be explicitly "streamable" for this to work.

12 comments:

Anonymous said...

Wooohooo! Wicked cool, Paul!

Paul McCullagh said...

Thanks Jay! A response like that makes the effort worthwhile! :)

Anonymous said...

This is very interesting. Especially for user uploads, it would be nice to junk the asset server and just hit the DB, assuming it's fast.

How would horizontal scaling work? Would each DB shard somehow register itself as a proxy listener? Any thoughts?

Unknown said...

Nice idea, but I'm curious about how the MIME type and character set for the response will be set, and how will security be handled?

Paul McCullagh said...

Hi Anon,

Yes, one of the goals of the BLOB Streaming Project is to enable upload of BLOB data directly into the database, without any intermediate storage.

With regard to horizontal scaling: are you referring to registration with a reverse proxy? With the idea that the proxy would take the hits and then distribute them amongst the available database servers?

Paul McCullagh said...

Hi Richard,

Funny, I have been wondering about that too! :)

Yes, I am not so sure myself. This project is still very much in the planning stage. Although I am working on an alpha version, this only includes components that I am fairly sure of at the moment.

So I welcome any input and ideas you may have, especially on the subject of security.

Here are my thoughts on the matter so far:

There are various levels to consider here: database level, application level and web-server level.

Web-server level security is probably the easiest to achieve. This can be done, at least initially, by using a proxy (like the Apache mod_proxy).

Database level security requires that a request be related to a particular database connection. In this way, an engine can determine if the database user has SELECT (for example) privileges on the column for which the BLOB is requested.

Application level security is probably the most tricky. It would involve knowing whether an application session is still active, and which BLOBs have been authorized for viewing. One idea would involve encoding an expiry time into the link.

With regard to MIME types and character sets (i.e "meta" data associated with a BLOB):

The character set is already part of the column definition, my idea at the moment is to put the rest of the meta data in the column comment.

I would be glad to hear if you have any suggestions or comments in this regard!

Anonymous said...

What if we take this further and allows read access to any data, returnable in JSON? That would provide a very nice REST access method into the data without having to go through the overhead of logging in and dealing with a SQL statement.

Sheeri K. Cabral said...

looks great! I can't wait to see the application (& source code) and play with it.

One question, though -- would it be possible to have some kind of a randomly picked unique identifier? Imagine YouTube running this, what's to stop folks from retriving record 1, then record 2, then record 3.....

Paul McCullagh said...

Hi Francois,

As you have noted, the BLOB streaming engine turns MySQL into a "RESTful" application. I am sure this will please Roy Fielding, and fits perfectly into the MySQL Web 2.0 strategy!

What you are now suggesting is that we enable retrieval of entire records (or a selection of records) in JSON. Then, as you say, an application could just GET the data it wants out of the database (it could also use PUT to insert, POST to update and DELETE to ... delete records, for that matter).

Excellent idea! :) I will make sure that the server-side streaming API makes this possible!

Paul McCullagh said...

Hi Sheeri,

I hope to post the first version within a month or 2. So we be able to try it out and kick some ideas around at the MySQL Camp this year!

You are right about the sequential IDs. It would make it easy for someone on the internet to retrieve all the BLOBs out of the database. This is another good point which belongs under the heading of security.

One possible solution is to restrict search access to an alternative key, which is generated randomly. For example, we could add a column to the notes_tab table, as follows:

CREATE TABLE notes_tab (
n_id INTEGER PRIMARY KEY,
n_random INTEGER UNIQUE KEY,
n_text BLOB
) ENGINE=pbxt;

A trigger could be used to generate a random number and insert this value in n_random on INSERT.

Then we need to be able to restrict HTTP access to the columns notes_tab.n_random and notes_tab.n_text. This could be done by defining an "anonymous_streaming" user. If no other authorization is given, then this user is assumed.

So anonymous web access will be forced to use the following URL (for example) for the notes_tab table:

http://localhost:8080/test/notes_tab/
n_text/n_random=327156188

Mark Bowman said...

Hi,

I am a PHP Developer by trade, but came across this project because I am tasked with a project to create a video based web site.

The new MySQL Engine seem great and is a very interesting avenue to go down. One thing that as an App Developer I always look at is Access Control.

From what I can see any BLOB can be viewed by anyone with the URL www.mysite.com:8080/database/table/data

If I wanted to sell a subscription service or a music store, how would we restrict access for a whole range of reasons such as paying for services, subscription based downloads, stopping other sites using the videos etc etc.

Under normal MySQL conditions I can restrict access to certain database resources with privileges that are set when the connection to the database is set-up in a PHP Script. I can further Wrap calls to the database in Access Control rules, denying access to the database even further.

How does the privilege system work in relation to the Blob Repository? It very much looks like a "anyone can use it" which seems very dangerous to me. It seems that there isn't any database level or application level of controls in place, unless I am really missing something.

For me it would be important that the access to the Blob Repository could be controlled at the application level.

Paul McCullagh said...

Hi Mark,

In the latest version of the BLOB streaming engine (PBMS), access using the primary key condition is no longer possible.

Retrieving a BLOB is now only possible if you have a BLOB ID. The BLOB IDs are large and randomized so that it is not possible to guess what ID a BLOB has.

The only way to get a BLOB ID is to first SELECT this information from the database, for which normal database access privileges are required.

You can find more details on Barry's blog: http://bpbdev.blogspot.com.

Basically access to the repository is currently not restricted in any way. This means that anyone can upload a BLOB, and anyone can download a BLOB if they have the correct ID.

BLOBs that are uploaded in this way will be deleted after a certain timeout, because they are not referenced by the database.

However, due to possible DOS attacks we recommend that the repository be placed behind a firewall.

Then a proxy (for example Apache) should be setup to access the repository from the internet. Only download of BLOBs should be allowed via the proxy.

Upload will only be done by your application server behind the firewall.