Documentation

SeznamFS is a distributed binlogging filesystem based on FUSE. It works similarly to MySQL replication: it creates a binary log containing all write operations and, acting as a master, provides that log to slaves. Every server has its own server ID, so it is possible to use master-master replication (with the same limitations as MySQL master-master replication) or multi-master circular replication. A normal filesystem (directory) is used as the storage backend and files are stored 1:1, so files can be read directly from the storage directory instead of through the mountpoint (which has higher overhead).

Current version

The current version is 0.2.x. At this time, SeznamFS packages exist only for Debian GNU/Linux 4.0 (Etch). They depend on the fuse-utils (version >= 2.7.0), libfuse2 (version >= 2.7.0), libcfgparser0 (version >= 1.0.2) and libdbglog (version >= 1.4.6) packages. SeznamFS has been tested on the x86, x86_64 and sparc architectures.

How does it work

SeznamFS has two components: SeznamFS itself and an administration console. SeznamFS (/usr/bin/seznamfs) runs in multiple threads. The administration console (/usr/bin/seznamfsadmin) is a simple console for basic manipulation of the server.

When SeznamFS is mounted, it begins to listen on the given port (if [master]::BindAddress and [master]::BindPort are set), forks to the background, starts the service thread, the master threads (if [master]::BindAddress and [master]::BindPort are set) and the slave thread (if [slave]::MasterHost and [slave]::MasterPort are set), and begins to serve syscalls from FUSE.
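The rule for which threads start at mount time can be illustrated with a short Python sketch. It is only an illustration under assumptions: the real configuration syntax is defined by libcfgparser and is not shown here, so an INI-style layout is assumed, and the addresses, host name and port numbers are hypothetical. Only the section and option names ([master]::BindAddress, [master]::BindPort, [slave]::MasterHost, [slave]::MasterPort) come from the text above.

```python
import configparser

# Hypothetical sample configuration; the values are made up for illustration.
SAMPLE = """
[master]
BindAddress = 0.0.0.0
BindPort = 5000

[slave]
MasterHost = master.example.com
MasterPort = 5000
"""


def enabled_roles(text: str) -> dict:
    """Report which roles would be enabled by an INI-style configuration."""
    cfg = configparser.ConfigParser()
    cfg.optionxform = str  # keep option names case-sensitive
    cfg.read_string(text)

    def all_set(section: str, *options: str) -> bool:
        # A role is enabled only when every listed option has a non-empty value.
        return all(
            cfg.get(section, option, fallback="").strip() for option in options
        )

    return {
        "master": all_set("master", "BindAddress", "BindPort"),
        "slave": all_set("slave", "MasterHost", "MasterPort"),
    }


print(enabled_roles(SAMPLE))
```

With both sections filled in, as in the sample, the server would act as a master and as a slave at once, which is the arrangement master-master replication relies on.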

Performance

We used an IBM 3650 (2x dual-core Xeon 5130 @ 2.00GHz) and an IBM 3550 (2x quad-core Xeon E5405 @ 2.00GHz) for performance testing. We compared write performance with that of a plain Ext3 filesystem and observed the replication delay.

How to get it to work

Configuration

Work with administration console

Internals

Communication protocol

The communication protocol is a client-server binary protocol that uses TCP. All data longer than 1 byte is big-endian.

The slave periodically connects to the master and reads binary log data. If the slave is not in sync, the master sends a binlog entry immediately; the slave replicates it and reads the next binary log entry. When the slave is in sync, the master waits up to 5 seconds and then replies with SYNC status. If any write operation is performed during these 5 seconds, the master sends the binlog entry immediately.

When the slave gets a connection refused error, it tries again. After 10 unsuccessful tries the slave begins to sleep for 1 second between tries; after 20 unsuccessful tries it begins to sleep for 5 seconds between tries.
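The reconnect schedule described above can be sketched in Python as follows. This is a minimal illustration, not the SeznamFS implementation: the function names, the `connect` callable and the `max_attempts` parameter are hypothetical, and the exact boundaries (whether the 1-second sleep starts after the 10th or the 11th failure) are an assumption.

```python
import time
from typing import Callable, Optional


def retry_delay(failed_attempts: int) -> int:
    """Seconds to sleep before the next try, given the failures so far."""
    if failed_attempts >= 20:
        return 5
    if failed_attempts >= 10:
        return 1
    return 0  # the first retries happen immediately


def connect_with_retry(
    connect: Callable[[], object],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: Optional[int] = None,
) -> object:
    """Call connect() until it succeeds, backing off on refused connections."""
    failed = 0
    while True:
        try:
            return connect()
        except ConnectionRefusedError:
            failed += 1
            # Hypothetical safety limit; by default the loop retries forever.
            if max_attempts is not None and failed >= max_attempts:
                raise
            delay = retry_delay(failed)
            if delay:
                sleep(delay)
```

Passing `sleep` as a parameter keeps the schedule testable without real waiting; a caller could supply, for example, `lambda: socket.create_connection((host, port))` as `connect`.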

Binary log

All data is in big-endian byte order.
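As an illustration of what big-endian encoding means in practice, the following Python sketch packs and unpacks a 32-bit unsigned integer with the most significant byte first. The helper names and the choice of a 32-bit field are hypothetical; they do not describe the actual binlog entry layout, which is not given here.

```python
import struct


def pack_u32_be(value: int) -> bytes:
    """Encode an unsigned 32-bit integer in big-endian byte order."""
    return struct.pack(">I", value)


def unpack_u32_be(data: bytes) -> int:
    """Decode a 4-byte big-endian buffer into an unsigned integer."""
    return struct.unpack(">I", data)[0]


encoded = pack_u32_be(0x01020304)
print(encoded.hex())           # most significant byte comes first
print(unpack_u32_be(encoded))  # round-trips to the original value
```

Because the byte order is fixed rather than host-dependent, the same log bytes are read identically on little-endian (x86, x86_64) and big-endian (sparc) machines.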