February 15, 2008

Using MySQL as a filesystem

Author: Ben Martin

With MySQLfs you can store a filesystem inside a MySQL relational database. MySQLfs breaks up the byte content of files that you store in its filesystem into tuples in the database, which allows you to store large files in the filesystem without requiring the database to support extremely large BLOB fields. With MySQLfs you can throw a filesystem into a MySQL database and take advantage of whatever database backup, clustering, and replication setup you have to protect your MySQLfs filesystem.

MySQLfs does not appear to be offered in binary packages for Fedora, openSUSE, or Ubuntu, so you must build it from source. To build MySQLfs you'll need to install mysql-devel and fuse-devel packages first.
(FUSE allows a normal program to expose a filesystem through the Linux kernel. This means that you can immediately use a FUSE filesystem with any application.) If you are missing some devel packages then the errors generated by configure might be a little bit cryptic. For example, on my Fedora 8 64-bit machine, I was at first unable to link with mysqlclient_r because FUSE development packages were not found; this error generates a single -L in the linker flags, which makes linking a mysqlclient test program fail. The build also defaults to using -Wall -Werror, which will cause the build to fail at the first warning generated. You can fix these issues by compiling without -Werror and making sure that you have all the necessary development packages installed. Once you have built MySQLfs from sources, the make install command simply places MySQLfs into /usr/local/bin.

The commands below set up a MySQL database to use with MySQLfs. Note that make install does not install the schema.sql file anywhere on your filesystem by default, so you'll have to copy that schema someplace yourself so you can set up more MySQLfs databases later.

# mysql -p
mysql> CREATE DATABASE mysqlfs;
mysql> GRANT SELECT, INSERT, UPDATE, DELETE ON mysqlfs.* TO mysqlfs@"%" IDENTIFIED BY 'foobar';
mysql> FLUSH PRIVILEGES;
mysql> Bye
# mysql -uroot -p mysqlfs

The database schema consists of three tables: tree, inodes, and data_blocks. The tree table maps an inode to its parent inode. Part of the table is shown below.

In the schema shown below I have imported a file with the path guten/alice13a.txt. The filesystem naming hierarchy is represented in the tree table, the alice13a.txt file is stored in the guten directory, which is created under the root of the MySQLfs filesystem. The inode table contains all the information that one would expect from a stat(2) call. The data_blocks table is used to store the actual file byte content.

mysql> select * from tree;
+-------+--------+----------------------+
| inode | parent | name |
+-------+--------+----------------------+
| 1 | NULL | / |
| 2 | 1 | guten |
| 3 | 2 | alice13a.txt |
...
mysql> select * from inodes;
+-------+-------+---------+-------+-----+-----+------------+------------+------------+----------+
| inode | inuse | deleted | mode | uid | gid | atime | mtime | ctime | size |
+-------+-------+---------+-------+-----+-----+------------+------------+------------+----------+
| 1 | 0 | 0 | 16877 | 0 | 0 | 1201155861 | 1201155861 | 1201155861 | 0 |
| 2 | 0 | 0 | 16888 | 500 | 500 | 1200108244 | 1200108244 | 1201156234 | 0 |
| 3 | 0 | 0 | 33184 | 500 | 500 | 1200108239 | 1200108239 | 1201156234 | 153477 |
...
mysql> describe data_blocks;
+-------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| inode | bigint(20) | NO | PRI | | |
| seq | int(10) unsigned | NO | PRI | | |
| data | blob | YES | | NULL | |
+-------+------------------+------+-----+---------+-------+

Because FUSE allows you to mount filesystems that can behave in strange ways, the ability to mount custom FUSE filesystems is normally limited to users who have been explicitly added to the fuse group. The below session shows an initial denied attempt to mount a FUSE filesystem and the commands to remedy the situation. I had to log in again after adding myself to the fuse group. The initial MySQLfs filesystem and database is empty, but the df command shows that the kernel knows that the filesystem is mounted. The fusermount command unmounts the user-created FUSE filesystem at the end of the session.

$ mkdir ~/mysqlfs
$ mysqlfs -ohost=localhost -odatabase=mysqlfs ~/mysqlfs
* Opening logfile 'mysqlfs.log': OK
fuse: failed to open /dev/fuse: Permission denied
$ su -l
root# usermod -a -G fuse ben
root# exit
$ exit
...
$ id
uid=500(ben) gid=500(ben) groups=492(fuse), ...
$ mysqlfs -ohost=localhost -odatabase=mysqlfs ~/mysqlfs
* Opening logfile 'mysqlfs.log': OK
$ ls -ld mysqlfs
drwxr-xr-x 1 root root 0 2008-01-24 16:24 mysqlfs
$ df -h mysqlfs .
Filesystem Size Used Avail Use% Mounted on
mysqlfs 0 0 0 - /home/ben/mysqlfs
/dev/sdc3 16G 4.4G 11G 31% /home
$ ls -l mysqlfs
total 0
$ fusermount -u mysqlfs

Now that we can have set up the ability to mount MySQL as a filesystem, let's copy some files into the MySQL database and interact with the filesystem. In the session below, I copy some text files from Project Gutenberg into MySQLfs and verify that when read back they have the same MD5 checksum as the originals. I then copy the Linux source tarball into MySQLfs to get an idea of how well MySQLfs handles a 44MB file.

Copying the kernel from /tmp to a backup file in /tmp took about 2 seconds on a cold disk cache. Copying the kernel into a MySQLfs took about 20 seconds. Performing the same test again right away, in order to have the source file in disk cache, the /tmp backup took about 0.3 seconds, while the MySQLfs copy still took 20 seconds. This indicates that write performance is the major bottleneck in MySQLfs. After I read back the kernel, MySQL seemed to cache some data, which made response time faster as I performed more tests. This could make MySQLfs an interesting option if you have a filesystem that is read often, as you could take advantage of MySQL caching.

~]$ cp -av /.../guten ./mysqlfs/
`/.../guten' -> `./mysqlfs/guten'
`/.../guten/alice13a.txt' -> `./mysqlfs/guten/alice13a.txt'
`/.../guten/boysw10.txt' -> `./mysqlfs/guten/boysw10.txt'
`/.../guten/dmoro11.txt' -> `./mysqlfs/guten/dmoro11.txt'
~]$ cd ~/mysqlfs/guten
guten]$ ls -l
-rw-r----- 1 ben ben 153477 2008-01-12 13:23 alice13a.txt
-rw-rw---- 1 ben ben 48923 2008-01-12 13:23 boysw10.txt
-rw-rw---- 1 ben ben 259214 2008-01-12 13:23 dmoro11.txt
guten]$ md5sum *
135e06ad31b169065bccbf03ec7236f2 alice13a.txt
7dd30f1b37e32cdb5d21fe992bbf248d boysw10.txt
87c05f11193c0e05b3d0dec0808a0450 dmoro11.txt
guten]$ md5sum /.../guten/*
135e06ad31b169065bccbf03ec7236f2 /.../guten/alice13a.txt
7dd30f1b37e32cdb5d21fe992bbf248d /.../guten/boysw10.txt
87c05f11193c0e05b3d0dec0808a0450 /.../guten/dmoro11.txt
guten]$ cd ..
mysqlfs]$ time cp /tmp/linux-2.6.23.tar.bz2 .
real 0m16.278s
user 0m0.006s
sys 0m0.531s
mysqlfs]$ time cat linux-2.6.23.tar.bz2 >/dev/null
real 0m0.502s
user 0m0.004s
sys 0m0.035s
mysqlfs]$ time dd if=linux-2.6.23.tar.bz2 of=/tmp/junk bs=1024 count=1024
1048576 bytes (1.0 MB) copied, 0.0200973 s, 52.2 MB/s

real 0m0.058s
user 0m0.003s
sys 0m0.013s
mysqlfs]$ time dd if=linux-2.6.23.tar.bz2 of=/tmp/junk bs=1024 count=1024 skip=9000
1048576 bytes (1.0 MB) copied, 0.0214207 s, 49.0 MB/s

real 0m0.031s
user 0m0.001s
sys 0m0.011s

Although it is one of the nastiest things that one can do to a FUSE filesystem, I decided to run the Bonnie++ filesystem benchmark suite against MySQLfs. I conducted these tests inside a virtual machine, which might introduce a slight linear slowdown due to the virtualization, but both the MySQL database and /tmp are stored on the same virtual disk device so relative performance should be valid.

As you can see from the results below, MySQLfs was about one-tenth as fast as /tmp/bonnie (ext3) at running Bonnie++ (version 1.03a-7). That indicates you might not want to use MySQLfs to directly store a maildir. Bonnie++ is designed to work against native kernel filesystems and not FUSE, so these figures should be taken with a rather large grain of salt.

$ /usr/sbin/bonnie++ -d /tmp/bonnie

Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
v8tsrv 2G 18155 31 16726 5 13338 6 26207 46 74527 24 9840 144
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 24878 94 +++++ +++ +++++ +++ 29018 99 +++++ +++ +++++ +++

$ /usr/sbin/bonnie++ -d ~/mysqlfs/bonnie

Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
v8tsrv 2G 2615 5 1207 1 1323 1 2143 2 2363 0 138.1 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 186 2 505 1 296 2 209 2 441 1 282 2

Categories:

  • Databases
  • System Administration
Click Here!