From b3dd4ec4c1cfeca1f8fd8bb2c829af69377bcda8 Mon Sep 17 00:00:00 2001 From: Magnus Ahltorp Date: Mon, 18 Jan 2016 17:37:10 +0100 Subject: Added preliminary permdb format specification --- doc/Makefile | 2 +- doc/permdb.md | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 142 insertions(+), 1 deletion(-) create mode 100644 doc/permdb.md diff --git a/doc/Makefile b/doc/Makefile index 05b987d..5f5b1b6 100644 --- a/doc/Makefile +++ b/doc/Makefile @@ -1,4 +1,4 @@ -ALL = db.html +ALL = db.html permdb.html all: $(ALL) diff --git a/doc/permdb.md b/doc/permdb.md new file mode 100644 index 0000000..bb73903 --- /dev/null +++ b/doc/permdb.md @@ -0,0 +1,141 @@ +permdb format +============= + +The permdb format is one of the formats for key-value stores. + +Each permdb key-value store has two files, the data file and the index +file. + +The data file has the same name as the base name of the store. It +contains the keys and the data for each entry. It is append-only and +contains enough information for fast integrity detection and rollback. + +The index file has the name of the data file with ".idx" added. It +contains an index tree referring to the entries in the data file. It +only contains partial key information, and cannnot be used on its own +to verify the existence or non-existence of a key. It is append-only +and does not support rollback, instead the whole file is removed if +corruption is detected. + +All keys in a permdb key-value store are of the same length, and +expected to be distributed somewhat evenly across the key space. + +Parameters +---------- + +- blocksize [32-bit] +- q (bits per tree level) [32-bit], where 2^q = children per index node +- keylength [32-bit] + + +Data file +--------- + +The data file is written as a series of commits. Each commit is +fsynced before writing the next commit. Therefore, if a commit is not +fully fsynced, it must be the last one. This means that the whole data +file can be checked by just checking the last commit. The verification +information of a commit is placed in a trailer. + +To make it possible to store data directly on a block device, the data +file must never contain more than *blocksize - 1* contiguous zero +bytes. Therefore, the data in an entry is chunked. + +All values are in big-endian byte order. + +- file + - file cookie [64-bit] = 0xd53551ba539a4252 + - parameters + - commits + - commit + - commit + - ... + +- commit + - entries + - entry + - entry + - ... + - padding to 4-byte boundary + - length [32-bit] + - checksum = SHA-256 of whole commit except checksum and cookie + - commit cookie [64-bit] = 0x2b05eed61b5af550 + +- entry + - entry cookie [64-bit] = 0xe7c1cdc2ba3dc77c + - key = (keylength bytes) + - number of chunks [16-bit] + - chunks + - chunklength [16-bit] + - data + + +Index file +---------- + +The index file stores a tree where the leaf nodes are offsets into the +data file. Since the file is append-only, a write operation has to +write all parts of the tree that changes, including the root node. The +root node is written last, which means that it is easy to find by +searching backwards from the end of file. The index file accumulates +garbage as more entries are written to it, but can be removed and +re-created to optimize its size, since it contains no original +information. + +All values are in host-endian byte order. + +- file + - file cookie [64-bit] = 0xb7e16b02ba8a6d1b + - commits + - commit + - commit + - ... + +- commit + - nodes + - node + - node + - ... + - length [64-bit] + - checksum = SHA-256 of whole commit except checksum and cookie + - commit cookie [64-bit] = 0x2fb1778c74a402e4 + +- node + - node cookie [64-bit] = 0x2e0f555ad73210d1 + - children + - child 0 + - child 1 + - ... + - child q^2-1 + +- child [64-bit] + + Any of these: + 1. If value is 0x0000000000000000: An empty node + (nothing exists beyond this node). + 2. If MSB = 0: Intermediate node. An offset into the index file. + 3. If MSB = 1: Leaf node. An offset into the data file. + +### Tree + +The last node in any commit is the root node for that commit. Each +level corresponds to q bits of the key, starting with the root level +at the MSB of the first byte. + + +Integrity detection +------------------- + +1. Check cookie. +2. Read last commit using length field in trailer. +3. Verify checksum. + + +Recovery (only for data file) +----------------------------- + +1. Search backwards for cookie. +2. The cookie is always 4-byte aligned. +3. Check that commit and truncate file after last complete commit. + + -- cgit v1.1