Nexedi has developed for about ten years a distributed transactional replicated object store powered by MariaDB: NEO. We use NEO to store all kinds of data: records for ERP5 or large arrays of floating points for Wendelin big data platform. NEO, ERP5 and Wendelin are implemented entirely in python.
NEO is an implementation of ZODB with features such as multi-master replication. ZODB is an ACID key-value DB that preserves history by default. ZODB's simplest backend is a simple logging file.
NEO's architecture consists of:
NEO storage nodes are the only nodes that persist data. Storage nodes rely on MariaDB database. You can see the code of our MariaDB backend at https://lab.nexedi.com/nexedi/neoppod/blob/master/neo/storage/database/mysqldb.py
The _setup method contains the schema of tables:
Since ZODB is key-value database, it translates in NEO terms as:
NEO is an ACID database with 2-phase commit and we do 2 or 3 SQL commits per transaction:
The _connect method disables autocommit. The performance would be catastrophic if we didn't minimize the number of fsyncs.
Currently, each NEO storage node accesses through a Unix socket to an independent MariaDB server. We use MariaDB mainly to store BLOBs. But, as part of ERP5 and Wendelin's architecture, we also use MariaDB independently of NEO as a kind of separate datawarehouse that it used to index certain specific properties hidden in each BLOB.
We focus in this document mainly on NEO and the storage of BLOBs in MariaDB.
We have compared in 2017 the ability of the different storages of MariaDB to handle a large quantity of BLOBs. Here are the results:
We also found that all MariaDB storage engines failed to optimize property index selection and even to take into account hints for forced index selection.
Here are some details from the experiences we have conducted.
Most of them are reported upstream and aren't described here. They are listed at the end of this document.
It seems that in the case of NEO, optimizer issues aren't as severe as might have been feared:
However, optimizer issues can remain problematic for queries other than SELECT: DELETE/UPDATE offer no way to specify index hints.
Besides NEO, we do experience sub-optimal index selection within boolean fulltext search as described in MDEV-7250.
This concerns the unlockTransaction method mentioned above. Sometimes, when a NEO DB restarts (maybe not only after a NEO failure) and checks for transactions to recover, it sees lines in ttrans/tobj for transactions that were committed long time ago. This results in integrity errors because NEO moves these lines again to trans/obj, which already have them.
As shown above, the move from ttrans/tobj to trans/obj is really done in a transactional way. A row must never exist in 2 tables at the same time.
It's so hard to reproduce the issue (it never happened to me) that I don't even know if restarting MariaDB matters.
Due to the difficulty to configure RocksDB, we use it with default options. In particular, this means that all experiments were done without compression enabled. We think that MariaDB should provide a way for applications to enable it on their own, like the compression option of TokuDB when creating table.
Compression is important for obj and trans, and it's enabled for TokuDB.
The superiority of TokuDB over InnoDB in terms of performance is quite well-known (e.g. https://www.percona.com/blog/2012/09/27/three-ways-that-fractal-tree-indexes-improve-ssd-for-mysql/). To name a few, here are some advantages of TokuDB that we could observe:
Nexedi did a comparison between RocksDB, TokuDB and InnoDB, on a machine with SSD and 16GB RAM, and innodb_buffer_pool_size/tokudb_cache_size/rocksdb_block_cache_size at 2GB (InnoDB would have been faster with bigger values but it does not matter: we're going to have DB of several hundred GB, and we'll be far from having everything in RAM).
The template for MariaDB configuration resulted in:
socket = /srv/slapgrid/slappart38/var/run/mariadb.sock
datadir = /srv/slapgrid/slappart38/srv/mariadb
tmpdir = /srv/slapgrid/slappart38/tmp
pid_file = /srv/slapgrid/slappart38/var/run/mariadb.pid
log_error = /srv/slapgrid/slappart38/var/log/mariadb_error.log
slow_query_log_file = /srv/slapgrid/slappart38/var/log/mariadb_slowquery.log
init_file = /srv/slapgrid/slappart38/etc/mariadb_initial_setup.sql
log_warnings = 1
### Enables TokuDB
plugin-load = ha_tokudb
## The following settings come from ERP5 configuration.
max_allowed_packet = 128M
query_cache_size = 32M
innodb_locks_unsafe_for_binlog = 1
# Some dangerous settings you may want to uncomment temporarily
# if you only want performance or less disk access.
#innodb_flush_log_at_trx_commit = 0
#innodb_flush_method = nosync
#innodb_doublewrite = 0
#sync_frm = 0
# Extra parameters.
rocksdb_block_cache_size = 2G
tokudb_cache_size = 2G
innodb_buffer_pool_size = 2G
innodb_log_file_size = 80M
innodb_file_per_table = 1
# Force utf8 usage
collation_server = utf8_unicode_ci
character_set_server = utf8
socket = /srv/slapgrid/slappart38/var/run/mariadb.sock
user = root
In this test, we import a lot of data and automatically deduplicate it thanks to NEO. This tests gives a good idea of the write performance of the storage engine.
126 GB is the size of the source DB file, which includes metadata. Size of raw data:
In this test, we change the topology of a NEO cluster by redistributing data over available nodes: adding nodes is a way to increase available space. A NEO database is split into a fixed number of partitions (NEO term, i.e. not to be confused with MariaDB partition) and some of them are moved to other nodes. A partition on a node is a cell, cells are first copied ("tweak" below) and then dropped from source nodes.
The actual deletion of dropped cells is still an experimental feature:
At last, we use OPTIMIZE TABLES so that MariaDB really frees space.
A first comparison between InnoDB and TokuDB was done in February 2017 (at that time, NEO didn't use index hints). One result to keep is that InnoDB performed really bad with innodb_file_per_table=0, to a point it could not end. After the first tweak from 1 to 2 nodes, the index stats were too bad, and OPTIMIZE/ANALYZE didn't help. Optimizing the first DB took 7h17 for nothing.
Tests were repeated recently to also compare with RocksDB (MariaDB 10.2.8), with worse results although the hardware is the same. Maybe the SSD has aged.
A few notes about deduplication:
We have contacted MariaDB core developer team about the different sub-optimal behaviours that we discussed and have received a very positive and constructive reply.
Most behaviours are either being fixed or already fixed. They can be tracked through the bug tracker of MariaDB:
Reported to Facebook:
The last one would explain the OOM reported during the tests.
Here is a list of ongoing, planned or considered improvements: