Nexedi has developed for about ten years a distributed transactional replicated object store powered by MariaDB: NEO. We use NEO to store all kinds of data: records for ERP5 or large arrays of floating points for Wendelin big data platform. NEO, ERP5 and Wendelin are implemented entirely in python.
NEO is an implementation of ZODB with features such as multi-master replication. ZODB is an ACID key-value DB that preserves history by default. ZODB's simplest backend is a simple logging file.
NEO's architecture consists of:
- storage nodes: their role is to access or store data in a sharded and redundant way;
- master nodes: their role is to ensure consistency of transactions;
- control nodes: their role is to control the operation of the cluster.
NEO storage nodes are the only nodes that persist data. Storage nodes rely on MariaDB database. You can see the code of our MariaDB backend at https://lab.nexedi.com/nexedi/neoppod/blob/master/neo/storage/database/mysqldb.py
The _setup method contains the schema of tables:
- config, pt: these 2 tables are so small that we can ignore them
- trans: transaction metadata (1 row per transaction)
- obj: object metadata (1 row per object)
- data: raw data (we do data deduplication so several rows of obj can refer to the same row in data)
- bigdata: in addition to max_allowed_packet, both InnoDB (by configuration) and TokuDB (hardcoded) have a maximum row size limit, so the biggest records are split here
- ttrans/tobj, which are similar to trans/obj, is a temporary place for metadata being committed (they're a replay log, and they're empty when there's no transaction being committed)
Since ZODB is key-value database, it translates in NEO terms as:
- key maps to obj.oid
- value maps to data.value
- tid is transaction id
NEO is an ACID database with 2-phase commit and we do 2 or 3 SQL commits per transaction:
- a first one (tpc_vote) to make sure that all the data to commit is actually stored durably somewhere
- a second one (tpc_finish) to mark the transaction as committed (ideally, that would only flip 1 bit): lockTransaction method
- after a successful commit, the storage nodes are notified that metadata can be moved from ttrans/tobj to trans/obj, making the new transaction readable: this is done in the unlockTransaction method, and since there's no urgency to commit after that (we're able to replay the transaction in case of crash), we wait at most 1 second in case that another transaction would be committed before
The _connect method disables autocommit. The performance would be catastrophic if we didn't minimize the number of fsyncs.
Currently, each NEO storage node accesses through a Unix socket to an independent MariaDB server. We use MariaDB mainly to store BLOBs. But, as part of ERP5 and Wendelin's architecture, we also use MariaDB independently of NEO as a kind of separate datawarehouse that it used to index certain specific properties hidden in each BLOB.
We focus in this document mainly on NEO and the storage of BLOBs in MariaDB.
MariaDB storage engines: InnoDB, TokuDB, RocksDB
We have compared in 2017 the ability of the different storages of MariaDB to handle a large quantity of BLOBs. Here are the results:
- InnoDB: good data integrity but poor scalability;
- TokuDB: good scalability but data integrity issues;
- RocksDB: good data integrity, mixed results about performance.
We also found that all MariaDB storage engines failed to optimize property index selection and even to take into account hints for forced index selection.
Here are some details from the experiences we have conducted.
Most of them are reported upstream and aren't described here. They are listed at the end of this document.
It seems that in the case of NEO, optimizer issues aren't as severe as might have been feared:
- FORCE INDEX hints would work as expected and they have been added everywhere if the code.
- The only queries with potential full scan despite appropriate index could be rewritten differently. The workaround is enough for the moment.
However, optimizer issues can remain problematic for queries other than SELECT: DELETE/UPDATE offer no way to specify index hints.
Besides NEO, we do experience sub-optimal index selection within boolean fulltext search as described in MDEV-7250.
TokuDB: DELETE sometimes do nothing
This concerns the unlockTransaction method mentioned above. Randomly and very rarely, when a NEO DB restarts and checks for transactions to recover, it sees lines in ttrans/tobj for transactions that were committed long time ago. This results in integrity errors because NEO moves these lines again to trans/obj, which already have them.
As shown above, the move from ttrans/tobj to trans/obj is really done in a transactional way. A row must never exist in 2 tables at the same time.
It even happened for a DB with internal replication: the data of 2 MariaDB instances are modified identically, and the bug happened for only 1 of them.
We don't know if DELETE really did nothing or if lines are resurrected. Further investigation, with an external tool watching the ttrans table every minute, showed that a deleted line was still there ~21 seconds later. And this line was alone in the table when it should have been deleted.
MariaDB 10.1.25 (tokudb_version: 5.6.36-82.0) is the most recent version for which we got this bug.
Reported to Percona at https://jira.percona.com/browse/PS-4291
Due to the difficulty to configure RocksDB, we use it with default options. In particular, this means that all experiments were done without compression enabled (in fact, that's because we built MariaDB without Snappy). We think that MariaDB should provide a way for applications to enable it on their own, like the compression option of TokuDB when creating table.
Compression is important for obj and trans, and it's enabled for TokuDB.
The superiority of TokuDB over InnoDB in terms of performance is quite well-known (e.g. https://www.percona.com/blog/2012/09/27/three-ways-that-fractal-tree-indexes-improve-ssd-for-mysql/). To name a few, here are some advantages of TokuDB that we could observe:
- good compression, which means 2 things: less disk space used and more efficient disk cache
- less fragmentation (optimizations for HDD tend to be useful for SSD)
- better query plans (I mean, probably better stats)
- faster queries even for the same query plan, and IIRC, even if all data is in cache
- less configuration required: it's horrible that one have to be an expert in InnoDB so that the engine is close to optimal configuration, given the machine specifications and actual load
Nexedi did a comparison between RocksDB, TokuDB and InnoDB, on a machine with SSD and 16GB RAM, and innodb_buffer_pool_size/tokudb_cache_size/rocksdb_block_cache_size at 2GB (InnoDB would have been faster with bigger values but it does not matter: we're going to have DB of several hundred GB, and we'll be far from having everything in RAM).
The template for MariaDB configuration resulted in:
socket = /srv/slapgrid/slappart38/var/run/mariadb.sock
datadir = /srv/slapgrid/slappart38/srv/mariadb
tmpdir = /srv/slapgrid/slappart38/tmp
pid_file = /srv/slapgrid/slappart38/var/run/mariadb.pid
log_error = /srv/slapgrid/slappart38/var/log/mariadb_error.log
slow_query_log_file = /srv/slapgrid/slappart38/var/log/mariadb_slowquery.log
init_file = /srv/slapgrid/slappart38/etc/mariadb_initial_setup.sql
log_warnings = 1
### Enables TokuDB
plugin-load = ha_tokudb
## The following settings come from ERP5 configuration.
max_allowed_packet = 128M
query_cache_size = 32M
innodb_locks_unsafe_for_binlog = 1
# Some dangerous settings you may want to uncomment temporarily
# if you only want performance or less disk access.
#innodb_flush_log_at_trx_commit = 0
#innodb_flush_method = nosync
#innodb_doublewrite = 0
#sync_frm = 0
# Extra parameters.
rocksdb_block_cache_size = 2G
tokudb_cache_size = 2G
innodb_buffer_pool_size = 2G
innodb_log_file_size = 80M
innodb_file_per_table = 1
# Force utf8 usage
collation_server = utf8_unicode_ci
character_set_server = utf8
socket = /srv/slapgrid/slappart38/var/run/mariadb.sock
user = root
Test 1: importing 126 GB of data to a NEO DB with 1 storage node
In this test, we import a lot of data and automatically deduplicate it thanks to NEO. This tests gives a good idea of the write performance of the storage engine.
126 GB is the size of the source DB file, which includes metadata. Size of raw data:
- before NEO compression: 120 GB
- after NEO compression: 26.2 GB
- after deduplication: 23.7 GB
Test 2: balancing from 1 node to 2 and then to 4
In this test, we change the topology of a NEO cluster by redistributing data over available nodes: adding nodes is a way to increase available space. A NEO database is split into a fixed number of partitions (NEO term, i.e. not to be confused with MariaDB partition) and some of them are moved to other nodes. A partition on a node is a cell, cells are first copied ("tweak" below) and then dropped from source nodes.
Test 3: reclaim free space on the first 2 nodes
The actual deletion of dropped cells is still an experimental feature:
- it doesn't scale for big databases (1 DELETE per partition per table)
- it's not done in background: it only happens when the cluster is going to be operational
At last, we use OPTIMIZE TABLES so that MariaDB really frees space.
A first comparison between InnoDB and TokuDB was done in February 2017 (at that time, NEO didn't use index hints). One result to keep is that InnoDB performed really bad with innodb_file_per_table=0, to a point it could not end. After the first tweak from 1 to 2 nodes, the index stats were too bad, and OPTIMIZE/ANALYZE didn't help. Optimizing the first DB took 7h17 for nothing.
Tests were repeated recently to also compare with RocksDB (MariaDB 10.2.8), with worse results although the hardware is the same. Maybe the SSD has aged.
|time spent / resulting size (GB)
|import 126GB of data
||20h24 / 50.6
||29h22 / 40.6
||13h14 / 33.1
||9h14 / 76.6
||14h20 / 59
||5h09 / 51.8
||5h12 / 102
||6h41 / 80
||2h27 / 70.8
|deletion of dropped cells
||69h16 / 102
||2h30 / 93.3
||68m14 / 51.7
||10m32 / 66.7
- The SQL queries were completed in 15h55 but InnoDB remained busy in background.
- Out-Of-Memory killed but not really a problem: we already know that NEO must delete data in a smarter way.
- SHOW STATUS reports 26.8 GB free (1.9 GB free before tweak).
A few notes about deduplication:
- It's internal to each node, so it is normal that the DB is bigger after being tweak.
- Before the tweak, 15% of records are deduplicated.
- Each deduplicated record put RocksDB at a disadvantage, due to MDEV-13739.
|import 126GB of data
Ongoing effort with MariaDB team
We have contacted MariaDB core developer team about the different sub-optimal behaviours that we discussed and have received a very positive and constructive reply.
Most behaviours are either being fixed or already fixed. They can be tracked through the bug tracker of MariaDB:
Reported to Facebook:
The last one would explain the OOM reported during the tests.
Future improvements of NEO'architecture
Here is a list of ongoing, planned or considered improvements:
- use MariaDB embedded (considered);
- use MariaDB embedded's ability to access BLOB through pointer and reduce number of copies (considered);
- reimplement NEO in golang (ongoing);
- run automated scalability tests over 1 PB of data (ongoing).