Documentation
SeznamFS is a distributed binlogging filesystem based on FUSE. It works similarly to MySQL replication: it creates
a binary log containing all write operations and, acting as a master, provides it to slaves.
Every server has its own server ID, so it is possible to use master-master replication
(with the same limitations as MySQL master-master replication) or multi-master round (ring) replication.
A normal filesystem directory is used as the storage backend and files are stored 1:1, so it is possible
to read files directly from the storage directory instead of reading them through the mountpoint (which
has higher overhead).
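Because files are stored 1:1, a path under the mountpoint can be translated to the same file in the storage directory. The following is a minimal sketch of that mapping; the mountpoint `/mnt/seznamfs` and the helper name `storage_path` are assumptions made for illustration, and the storage path is the one used in the sample configuration. It is meant for reads only, since writes made directly to the storage directory would presumably not pass through the binary log.

```python
from pathlib import PurePosixPath

# Hypothetical locations: the mountpoint is an assumption for this example,
# the storage directory is the value from the sample [main]::Storage setting.
MOUNTPOINT = PurePosixPath("/mnt/seznamfs")
STORAGE = PurePosixPath("/var/lib/seznamfs/storage")


def storage_path(mounted_path, mountpoint=MOUNTPOINT, storage=STORAGE):
    """Map a path under the mountpoint to the same file in the storage directory."""
    # relative_to() raises ValueError when the path is not under the mountpoint.
    relative = PurePosixPath(mounted_path).relative_to(mountpoint)
    return storage / relative


print(storage_path("/mnt/seznamfs/img/a.jpg"))
```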
Current version
The current version is 0.2.x. At this time, SeznamFS packages exist only for Debian GNU/Linux 4.0 (Etch).
SeznamFS depends on the fuse-utils package (version >= 2.7.0), the libfuse2 package (version >= 2.7.0), the libcfgparser0
package (version >= 1.0.2) and the libdbglog package (version >= 1.4.6).
SeznamFS was tested on the x86, x86_64 and sparc architectures.
How does it work
SeznamFS has two components: SeznamFS itself and the administration console.
SeznamFS (/usr/bin/seznamfs) runs in threads:
- Worker threads - threads created by FUSE that serve syscalls from FUSE, read from and write to the storage
directory and, for write operations, also log to the binary log
- Master threads - threads listening on the given address and port, serving requests from slaves
and administration consoles
- Service thread - thread used to restart master threads
- Slave thread - thread connecting to the master, reading entries from its binary log and
replicating changes to the local storage and also to the local binlog
The administration console (/usr/bin/seznamfsadmin) is a simple console for basic server
management. At this time it can:
- Show slave status (replication status)
- Start/stop slave
- Reconfigure slave (master hostname, master port, master binlog number, master binlog position,
maximal replication speed)
- Reconfigure master (thread count, allowed/denied hosts)
- Reconfigure filesystem (read-only/read-write)
When SeznamFS is mounted, it begins to listen on the given port (if [master]::BindAddress and [master]::BindPort are set),
forks to the background, starts the service thread and the master threads (if [master]::BindAddress and [master]::BindPort are set),
starts the slave thread (if [slave]::MasterHost and [slave]::MasterPort are set) and begins to serve syscalls from FUSE.
Performance
We used an IBM 3650 (2x dual-core Xeon 5130 @ 2.00GHz) and an IBM 3550 (2x quad-core Xeon E5405 @ 2.00GHz) for performance
testing. We compared write performance with that of a plain Ext3 filesystem and watched the replication delay.
- 39000 images, 1.3 GB total
- 50 files, each 10 MB large (random data)
- 2 GB file (random data)
How to get it to work
Configuration
- Section [main]
# ============================================================================
# Section [main]
# ============================================================================
# Storage : Where to store local files
# LogFile : Log filename
# LogMask : Log mask
# ReadOnly : Whether filesystem should be read-only (on/off)
# ============================================================================
[main]
Storage = /var/lib/seznamfs/storage
LogFile = /var/log/seznamfs/seznamfs.log
LogMask = I3W2E1F1
ReadOnly = off
- Section [binlog]
# ============================================================================
# Section [binlog]
# ============================================================================
# ServerId : Server ID, servers must have unique IDs
# Filename : Binlog filename (will be suffixed with log number)
# Filesize : Binlog file size (+/-) in megabytes
# KeepFiles : Count of binlog files to keep (0 = keep all)
# ============================================================================
[binlog]
ServerId = 1
Filename = /var/log/seznamfs/log.bin
Filesize = 10
KeepFiles = 0
- Section [master]
# ============================================================================
# Section [master]
# ============================================================================
# BindAddress : Address to bind, all = *
# BindPort : Port to bind
# StatusFile : Master status filename
# ThreadCount : Maximum number of clients (slaves + other utilities);
#               should be at least the number of clients + 1
# AllowAddresses : Comma-separated list of allowed IP addresses
# ============================================================================
[master]
BindAddress = 127.0.0.1
BindPort = 15000
ThreadCount = 8
StatusFile = /var/log/seznamfs/seznamfs-master.status
AllowAddresses = 127.0.0.1
- Section [slave]
# ============================================================================
# Section [slave]
# ============================================================================
# StatusFile : Slave status filename
# MasterHost : Address or hostname of master server
# MasterPort : Port which master server listens on
# MaxKilobytesPerSecond : Maximal size of data replicated per second
# ============================================================================
[slave]
MasterHost = 127.0.0.1
MasterPort = 15000
StatusFile = /var/log/seznamfs/seznamfs-slave.status
MaxKilobytesPerSecond = 0
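The master and slave roles are enabled by the presence of [master]::BindAddress/[master]::BindPort and [slave]::MasterHost/[slave]::MasterPort. A minimal sketch of checking which roles a configuration would enable is shown here. It uses Python's `configparser` as a stand-in, which is an assumption: SeznamFS itself reads its configuration with libcfgparser, and the helper name `roles` is hypothetical.

```python
import configparser

# Shortened copy of the sample configuration above.
SAMPLE = """
[main]
Storage = /var/lib/seznamfs/storage
ReadOnly = off

[binlog]
ServerId = 1

[master]
BindAddress = 127.0.0.1
BindPort = 15000

[slave]
MasterHost = 127.0.0.1
MasterPort = 15000
"""


def roles(config):
    """Return which replication roles the configuration would enable."""
    def has(section, *keys):
        # A role counts as enabled only if every key is present and non-empty.
        return all(config.get(section, key, fallback="").strip() for key in keys)

    return {
        "master": has("master", "BindAddress", "BindPort"),
        "slave": has("slave", "MasterHost", "MasterPort"),
    }


config = configparser.ConfigParser(interpolation=None)
config.read_string(SAMPLE)
print(roles(config))  # {'master': True, 'slave': True}
```

A server configured this way acts as both master and slave, which is the building block for master-master setups where each server has a unique [binlog]::ServerId.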
Working with the administration console
- Administration console is started with command
  seznamfsadmin [HOST (default 127.0.0.1) [PORT (default 15000)]]
- To change host or port when already in console, use command
  USE [HOST (default 127.0.0.1) [PORT (default 15000)]]
- Show slave status:
  SHOW SLAVE STATUS
- Start slave:
  START SLAVE
- Stop slave:
  STOP SLAVE
- Change configuration:
  SET {VARIABLE} {VALUE} [{VARIABLE} {VALUE}] ...
  - Following variables can be configured:
    - READONLY (ON/OFF)
    - MASTER_HOST, MASTER_PORT
    - MASTER_LOGNO, MASTER_LOGPOS
    - THREAD_COUNT
    - MAX_KILOBYTES_PER_SECOND
    - SKIP_COUNTER
- Show configuration:
  SHOW {VARIABLE}
  - Following variables can be shown:
    - READONLY (ON/OFF)
    - MASTER_HOST, MASTER_PORT
    - MASTER_LOGNO, MASTER_LOGPOS
    - THREAD_COUNT
    - MAX_KILOBYTES_PER_SECOND
    - SKIP_COUNTER
- List connected slaves:
  LIST SLAVES
- Allow/deny host:
  ALLOW {HOST}
  DENY {HOST}
- Exit console:
  EXIT
  or
  QUIT
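When console commands are generated by a script, the SET syntax above can be assembled from variable/value pairs. This is only a sketch: the helper name `set_command` is hypothetical, the variable names are the ones listed for SET, and how the console treats values containing spaces is not documented here, so the example passes values through unchanged.

```python
# Variable names as listed above for the SET command.
SETTABLE = {
    "READONLY", "MASTER_HOST", "MASTER_PORT", "MASTER_LOGNO", "MASTER_LOGPOS",
    "THREAD_COUNT", "MAX_KILOBYTES_PER_SECOND", "SKIP_COUNTER",
}


def set_command(**variables):
    """Build one SET line from keyword arguments, keeping their order."""
    if not variables:
        raise ValueError("SET needs at least one variable")
    parts = ["SET"]
    for name, value in variables.items():
        name = name.upper()
        if name not in SETTABLE:
            raise ValueError(f"unknown variable: {name}")
        parts += [name, str(value)]
    return " ".join(parts)


print(set_command(master_host="10.0.0.1", master_port=15000))
# SET MASTER_HOST 10.0.0.1 MASTER_PORT 15000
```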
Internals
Communication protocol
The communication protocol is a client-server binary protocol over TCP. All fields
longer than 1 byte are in big endian.
Request:
- 8 bytes - magic (SeznamFS)
- 8 bytes - following data size
- 1 byte - command ID
- 8 bytes - client time (timestamp)
- n bytes - request data
Reply:
- 8 bytes - magic (SeznamFS)
- 8 bytes - following data size
- 1 byte - command ID
- 1 byte - status code
- 8 bytes - server time (timestamp)
- n bytes - reply data
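The request and reply layouts above map directly onto fixed-size big-endian headers. The sketch below frames a request and splits a reply using Python's `struct` module. Several points are assumptions not stated in the layout: integers are treated as unsigned, the timestamp is taken as seconds since the Unix epoch, and "following data size" is interpreted as the number of bytes after the size field itself. The function names are hypothetical.

```python
import struct
import time

MAGIC = b"SeznamFS"                      # exactly 8 bytes
REQUEST_HEAD = struct.Struct(">8sQBQ")   # magic, size, command ID, client time
REPLY_HEAD = struct.Struct(">8sQBBQ")    # magic, size, command ID, status, server time
SIZE_PREFIX = 16                         # magic + size field


def pack_request(command_id, data=b"", now=None):
    """Frame a request (assumes size counts every byte after the size field)."""
    timestamp = int(time.time()) if now is None else now
    size = REQUEST_HEAD.size - SIZE_PREFIX + len(data)
    return REQUEST_HEAD.pack(MAGIC, size, command_id, timestamp) + data


def unpack_reply(frame):
    """Split a complete reply frame into (command_id, status, server_time, data)."""
    magic, size, command_id, status, server_time = REPLY_HEAD.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if len(frame) != SIZE_PREFIX + size:
        raise ValueError("size field does not match frame length")
    return command_id, status, server_time, frame[REPLY_HEAD.size:]


# COMMAND_NOP has ID 0 and carries no data: 25 header bytes in total.
frame = pack_request(0)
print(len(frame))  # 25
```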
Commands by ID:
- COMMAND_NOP (0) - No Operation
  - data: none
  - reply:
    - status = 0 (OK)
    - data: none
- COMMAND_READ_BINLOG - Read entry from binary log
  - data:
    - 8 bytes - binlog number
    - 8 bytes - binlog position
  - reply:
    - status = 0 (OK)
      - data:
        - 8 bytes - last binlog number on master
        - 8 bytes - last binlog position on master
        - n bytes - binlog entry data
    - status = 1 (SYNC)
    - status = 2 (NEXT)
    - status = 3 (ERROR)
- COMMAND_START_SLAVE, COMMAND_STOP_SLAVE - Start or stop slave
  - data: none
  - reply:
    - status = 0 (OK)
    - data: none
- COMMAND_GET_SLAVE_STATUS - Get slave status
  - data: none
  - reply:
    - status = 0 (OK)
    - data:
      - n bytes - slave status data (SeznamFs_t::SlaveStatus_t::Data_t)
- COMMAND_SETUP - Reconfigure running SeznamFS
  - data:
    - 1 byte - variable count
    - n bytes - variables:
      - 8 bytes - variable ID (connector.h)
      - 8 bytes - variable value data size
      - o bytes - variable data
  - reply:
    - status = 0 (OK):
    - status = 3 (ERROR):
- COMMAND_ALLOW, COMMAND_DENY - Allow/deny host
  - data:
    - n bytes - hostname or IP address to allow/deny
  - reply:
- COMMAND_GET_VARIABLE - Get configuration variable
  - data:
  - reply:
    - data:
      - 1 byte - value type - int(0) or string(1)
      - 8 bytes - integer value / n bytes - string value
- COMMAND_LIST_SLAVES - List connected slaves
  - data: none
  - reply:
    - data:
      - 8 bytes - slave count
      - n bytes - slave addresses:
        - 8 bytes - address length
        - o bytes - address
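The per-command data layouts can be packed and parsed the same way as the headers. The sketch below covers the COMMAND_READ_BINLOG request data, its OK reply data, and the COMMAND_LIST_SLAVES reply data. It assumes unsigned integers and ASCII-encoded addresses, neither of which is stated above, and the function names are hypothetical. Numeric command IDs other than COMMAND_NOP are not listed here, so the sketch deals with the data portion only.

```python
import struct


def pack_read_binlog(binlog_number, binlog_position):
    """Request data for COMMAND_READ_BINLOG: two 8-byte big-endian integers."""
    return struct.pack(">QQ", binlog_number, binlog_position)


def unpack_read_binlog_ok(data):
    """Reply data for status 0 (OK): last number, last position, entry bytes."""
    last_number, last_position = struct.unpack_from(">QQ", data)
    return last_number, last_position, data[16:]


def unpack_slave_list(data):
    """Reply data for COMMAND_LIST_SLAVES: count, then length-prefixed addresses."""
    (count,) = struct.unpack_from(">Q", data, 0)
    offset, addresses = 8, []
    for _ in range(count):
        (length,) = struct.unpack_from(">Q", data, offset)
        offset += 8
        address = data[offset:offset + length]
        if len(address) != length:
            raise ValueError("truncated address")
        addresses.append(address.decode("ascii"))  # encoding is an assumption
        offset += length
    return addresses
```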
The slave connects to the master periodically and reads binary log data. If the slave is not in sync, the master sends
a binlog entry immediately; the slave replicates it and reads the next binary log entry. When the slave is in sync,
the master waits up to 5 seconds and then replies with SYNC status. If any write operation is
performed during these 5 seconds, the master sends the binlog entry immediately.
When the slave gets a connection refused error, it tries again. After 10 unsuccessful tries
the slave begins to sleep for 1 second between tries; after 20 unsuccessful tries it begins
to sleep for 5 seconds between tries.
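The reconnect schedule described above can be sketched as a small delay function plus a retry loop. The exact boundaries are an assumption: here the 1-second sleep starts once 10 tries have failed and the 5-second sleep once 20 have failed. The names `retry_delay` and `connect_with_retry` are hypothetical, and `connect` stands for any callable that raises `ConnectionRefusedError` on failure.

```python
import time


def retry_delay(failed_tries):
    """Seconds to sleep before the next connection attempt."""
    if failed_tries >= 20:
        return 5
    if failed_tries >= 10:
        return 1
    return 0


def connect_with_retry(connect, sleep=time.sleep):
    """Call connect() until it stops raising ConnectionRefusedError."""
    failed = 0
    while True:
        try:
            return connect()
        except ConnectionRefusedError:
            failed += 1
            sleep(retry_delay(failed))


print([retry_delay(n) for n in (1, 10, 20)])  # [0, 1, 5]
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.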
Binary log
All data is in big endian.
Entry:
- 8 bytes - magic (SeznamFS)
- 8 bytes - entry size
- 1 byte - server ID
- 8 bytes - server time
- 1 byte - function ID
- 1 byte - function parameter count
- n bytes - function parameters
  - 8 bytes - parameter size
  - o bytes - parameter data
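A binlog entry with this layout can be encoded and decoded as follows. This is a sketch under stated assumptions: integers are unsigned, "entry size" is taken to count the bytes after the size field, and the function ID used in the usage note is made up, since the function ID table is not given here. The decoder walks the parameters by count, so it does not depend on how the entry size is defined.

```python
import struct

MAGIC = b"SeznamFS"
# magic, entry size, server ID, server time, function ID, parameter count
ENTRY_HEAD = struct.Struct(">8sQBQBB")


def pack_entry(server_id, server_time, function_id, params):
    """Encode one entry; params is a list of bytes objects."""
    body = b"".join(struct.pack(">Q", len(p)) + p for p in params)
    size = ENTRY_HEAD.size - 16 + len(body)  # assumption: bytes after size field
    head = ENTRY_HEAD.pack(MAGIC, size, server_id, server_time,
                           function_id, len(params))
    return head + body


def unpack_entry(buf, offset=0):
    """Decode one entry at offset; return (entry dict, offset of next entry)."""
    magic, size, server_id, server_time, function_id, count = \
        ENTRY_HEAD.unpack_from(buf, offset)
    if magic != MAGIC:
        raise ValueError("bad magic")
    pos = offset + ENTRY_HEAD.size
    params = []
    for _ in range(count):
        (length,) = struct.unpack_from(">Q", buf, pos)
        pos += 8
        params.append(bytes(buf[pos:pos + length]))
        pos += length
    entry = {"size": size, "server_id": server_id, "server_time": server_time,
             "function_id": function_id, "params": params}
    return entry, pos
```

For example, `unpack_entry(pack_entry(1, 100, 7, [b"/a.txt", b"hello"]))` returns the original fields together with the offset at which the next entry would start, so a log file can be read by calling `unpack_entry` repeatedly.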