summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--doc/Makefile2
-rw-r--r--doc/permdb.md141
2 files changed, 142 insertions, 1 deletions
diff --git a/doc/Makefile b/doc/Makefile
index 05b987d..5f5b1b6 100644
--- a/doc/Makefile
+++ b/doc/Makefile
@@ -1,4 +1,4 @@
-ALL = db.html
+ALL = db.html permdb.html
all: $(ALL)
diff --git a/doc/permdb.md b/doc/permdb.md
new file mode 100644
index 0000000..bb73903
--- /dev/null
+++ b/doc/permdb.md
@@ -0,0 +1,141 @@
+permdb format
+=============
+
+The permdb format is one of the formats for key-value stores.
+
+Each permdb key-value store has two files, the data file and the index
+file.
+
+The data file has the same name as the base name of the store. It
+contains the keys and the data for each entry. It is append-only and
+contains enough information for fast integrity detection and rollback.
+
+The index file has the name of the data file with ".idx" added. It
+contains an index tree referring to the entries in the data file. It
+only contains partial key information, and cannnot be used on its own
+to verify the existence or non-existence of a key. It is append-only
+and does not support rollback, instead the whole file is removed if
+corruption is detected.
+
+All keys in a permdb key-value store are of the same length, and
+expected to be distributed somewhat evenly across the key space.
+
+Parameters
+----------
+
+- blocksize [32-bit]
+- q (bits per tree level) [32-bit], where 2^q = children per index node
+- keylength [32-bit]
+
+
+Data file
+---------
+
+The data file is written as a series of commits. Each commit is
+fsynced before writing the next commit. Therefore, if a commit is not
+fully fsynced, it must be the last one. This means that the whole data
+file can be checked by just checking the last commit. The verification
+information of a commit is placed in a trailer.
+
+To make it possible to store data directly on a block device, the data
+file must never contain more than *blocksize - 1* contiguous zero
+bytes. Therefore, the data in an entry is chunked.
+
+All values are in big-endian byte order.
+
+- file
+ - file cookie [64-bit] = 0xd53551ba539a4252
+ - parameters
+ - commits
+ - commit
+ - commit
+ - ...
+
+- commit
+ - entries
+ - entry
+ - entry
+ - ...
+ - padding to 4-byte boundary
+ - length [32-bit]
+ - checksum = SHA-256 of whole commit except checksum and cookie
+ - commit cookie [64-bit] = 0x2b05eed61b5af550
+
+- entry
+ - entry cookie [64-bit] = 0xe7c1cdc2ba3dc77c
+ - key = (keylength bytes)
+ - number of chunks [16-bit]
+ - chunks
+ - chunklength [16-bit]
+ - data
+
+
+Index file
+----------
+
+The index file stores a tree where the leaf nodes are offsets into the
+data file. Since the file is append-only, a write operation has to
+write all parts of the tree that changes, including the root node. The
+root node is written last, which means that it is easy to find by
+searching backwards from the end of file. The index file accumulates
+garbage as more entries are written to it, but can be removed and
+re-created to optimize its size, since it contains no original
+information.
+
+All values are in host-endian byte order.
+
+- file
+ - file cookie [64-bit] = 0xb7e16b02ba8a6d1b
+ - commits
+ - commit
+ - commit
+ - ...
+
+- commit
+ - nodes
+ - node
+ - node
+ - ...
+ - length [64-bit]
+ - checksum = SHA-256 of whole commit except checksum and cookie
+ - commit cookie [64-bit] = 0x2fb1778c74a402e4
+
+- node
+ - node cookie [64-bit] = 0x2e0f555ad73210d1
+ - children
+ - child 0
+ - child 1
+ - ...
+ - child q^2-1
+
+- child [64-bit]
+
+ Any of these:
+ 1. If value is 0x0000000000000000: An empty node
+ (nothing exists beyond this node).
+ 2. If MSB = 0: Intermediate node. An offset into the index file.
+ 3. If MSB = 1: Leaf node. An offset into the data file.
+
+### Tree
+
+The last node in any commit is the root node for that commit. Each
+level corresponds to q bits of the key, starting with the root level
+at the MSB of the first byte.
+
+
+Integrity detection
+-------------------
+
+1. Check cookie.
+2. Read last commit using length field in trailer.
+3. Verify checksum.
+
+
+Recovery (only for data file)
+-----------------------------
+
+1. Search backwards for cookie.
+2. The cookie is always 4-byte aligned.
+3. Check that commit and truncate file after last complete commit.
+
+