From 4a86c60c415cc9b0790bb6bf1a8dcac6f802cae8 Mon Sep 17 00:00:00 2001
From: Magnus Ahltorp
Date: Thu, 11 Sep 2014 17:04:23 +0200
Subject: New architecture

---
 doc/design.txt | 235 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 157 insertions(+), 78 deletions(-)

diff --git a/doc/design.txt b/doc/design.txt
index 29ca0a4..0a57e70 100644
--- a/doc/design.txt
+++ b/doc/design.txt
@@ -1,89 +1,168 @@
-catlfish design (in Emacs -*- org -*- mode)
+-*- markdown -*-
+
+Overview
+========
 
 This document describes the design of catlfish, an implementation of
 a Certificate Transparency (RFC6962) log server.
 
-We have
-- a primary database storing x509 certificate chains [replicating r/o
-  copies to a number of frontend nodes?]
-- a hash tree kept in RAM
-- one secondary database per frontend node, storing the most recently
-  submitted data
-- a cluster of backend nodes with an elected leader which periodically
-  updates the primary db with data from the secondary db's
-- a number of frontend nodes accepting http requests, updating
-  secondary db's and reading from local r/o copy of the primary db
-- a private key used for signing SCT's and STH's, kept (in HSM:s) on
-  backend nodes
-
-Backend nodes
-- are either asleep, functioning as storage only
-or
-- store submitted cert chains in persistent media
-- have write access to the primary database holding cert chains
-- periodically append new cert chains to the hash tree and sign the
-  tree head
-
-Frontend nodes
-- reply to the http requests specified in RFC 6962
-- write submitted cert chains to their own, single master, secondary
-  database
-- have read access to (a local copy of) the primary database
-- defer signing of SCT's (and STH's) to backend nodes
-
-The primary database
-- stores cert chains and their corresponding SCT's
-- is indexed on a running integer (primary) and a hash of the cert
-  chain (secondary)
-- runs on backend nodes
-- is persistently stored on disk on several other backend nodes in
-  separate data centers
-- grows with 5 GB per year, based on 5,000 3 kB submissions per day
-- max size is 300 GB, based on 100e6 certificates
-
-The secondary databases
-- store cert chains, unordered, between hash tree generation
-- run on frontend nodes
-- are persistently stored on disk on several other frontend nodes
-- are typically kept in RAM too
-- max size is around 128 MB, based on 10 submissions (á 3 kB) per
-  second for an hour
-
-Scaling, performance, estimates
-- submissions: less than 0.1 qps, based on 5,000 submissions per day
-- monitors: 6 qps, based on 100 monitors
-- auditors: 8,000 qps, based on 2.5e9 browsers visiting 100 sites
+
+
+  +------------------------------------------------+
+  | front end nodes                                |
+  +------------------------------------------------+
+    ^            |                  |
+    |            v                  v
+    |    +---------------+   +---------------+
+    |    | storage nodes |   | signing nodes |
+    |    +---------------+   +---------------+
+    |            ^                  ^
+    |            |                  |
+  +------------------------------------------------+
+  | merge master                                   |
+  +------------------------------------------------+
+    ^                     |
+    |                     v
+    |    +----------------------------------+
+    |    | merge slaves                     |
+    |    +----------------------------------+
+    |                     ^
+    |                     |
+  +-------------------+
+  | merge-repair node |
+  +-------------------+
+
+
+Design assumptions
+------------------
+* The database grows by 5 GB per year, based on 5,000 3 kB
+  submissions per day.
+* Max size is 300 GB, based on 100e6 certificates.
+* Submissions: less than 0.1 qps, based on 5,000 submissions per day.
+* Monitors: 6 qps, based on 100 monitors.
+* Auditors: 8,000 qps, based on 2.5e9 browsers visiting 100 sites
   (with a 1y certificate) per month (assuming a single combined
   request for doing get-sth + get-sth-consistency + get-proof-by-hash)
 
+
 Open questions
+--------------
-- What's a good MMD? Google seems to sign a new tree after 60-90
+* What's a good MMD? Google seems to sign a new tree after 60-90
   minutes (early 2014). They don't promise an MMD but aim to sign at
   least once a day.
-A picture
-
-+-----------------------------------------------+
-| front end nodes                               |
-+-----------------------------------------------+
-  ^  ^           ^                            ^
-  |  |           |                            |
-  |  v           |                            |
-  |  short term  long term                    |
-  |  cert db     cert db copy                 |
-  |              ^                            |
-  |              |                            v
-+-----------------------------------------------+
-| tree makers | mergers | signers              |
-+-----------------------------------------------+
-     ^                      ^
-      \                     |
-       \                    v
-        -------------   long term
-                         cert db
-
-[TODO: Update terms in text or picture so they match:
-secondary database == short term cert db
-primary database == long term cert db
-backend nodes == box with tree makers, mergers and signers]
-[TODO: Move the picture to the top of the document.]
+
+Terminology
+===========
+
+CC = Certificate Chain
+CT log = Certificate Transparency log
+
+Front-end node
+==============
+
+* Handles all http requests.
+* Has a complete copy of the published data locally.
+* Read requests are answered directly by reading local files
+  and calculating the answers.
+* Add requests are validated and then sent to all storage
+  nodes. At the same time, a signing request is sent to one or
+  more of the signing nodes. When responses have been received
+  from a predetermined number of storage nodes and one signing
+  response has been received, a response is sent to the client
+  (see the sketch after this list).
+* Has an inward-facing API with the entry points SendLog(Hashes),
+  MissingEntries() (returns a list of hashes), SendEntry(Entry),
+  SendSTH(STH), CurrentPosition().
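+
+A minimal sketch of this add-request fan-out follows. It is an
+illustration only, not the catlfish implementation: the client
+objects, their method names, and the quorum size of two are
+assumptions made here for concreteness.
+
+```python
+# Sketch: fan a validated entry out to storage and signing nodes in
+# parallel, and answer the client once a storage quorum and one
+# signing response have arrived.
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
+STORAGE_QUORUM = 2  # the "predetermined number" of storage responses
+
+def handle_add_request(entry, storage_nodes, signing_node):
+    with ThreadPoolExecutor(max_workers=len(storage_nodes) + 1) as pool:
+        # SendEntry(Entry) to every storage node ...
+        stores = [pool.submit(n.send_entry, entry) for n in storage_nodes]
+        # ... and one signing request at the same time.
+        sct = pool.submit(signing_node.sign_entry, entry)
+        acks = 0
+        for done in as_completed(stores):
+            acks += 1 if done.result() else 0
+            if acks >= STORAGE_QUORUM:
+                break  # enough storage nodes have persisted the entry
+        return sct.result()  # the SCT sent back to the client
+
+class StubStorage:  # stand-ins so the sketch runs as-is
+    def send_entry(self, entry):
+        return True
+
+class StubSigner:
+    def sign_entry(self, entry):
+        return b"sct-for-" + entry
+
+print(handle_add_request(b"chain", [StubStorage()] * 3, StubSigner()))
+```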
+
+
+Storage node
+============
+
+* Stores certificate chains and SCTs.
+* Has a write API SendEntry(Entry) that stores the certificate chain
+  in a database, indexed by its hash, and then appends the hash to a
+  list NewEntries.
+* Takes reasonable measures to ensure that data is in permanent
+  storage before sending a response.
+* When seeing a new STH, moves the variable start to the index of the
+  first unpublished hash.
+* Has a read API FetchNewEntries() which returns
+  NewEntries[start...length(NewEntries)-1].
+
+
+Signing node
+============
+
+* Has the signing key for the log.
+* Signs SCTs for the front-end nodes and tree heads for the merge
+  master on request.
+
+
+Merging node
+============
+
+* The master is determined by configuration.
+* The other merging nodes are called "slaves".
+* The master has two phases, merging and distributing.
+
+Merging (master)
+----------------
+
+* Fetches CCs by calling FetchNewEntries() on storage node i,
+  where i = 0...(n-1).
+* Determines the order of the new entries in the CT log.
+* Sends the entries to the slaves.
+* Calculates the tree head and asks a signing node to sign it.
+* When a majority of the slaves have acknowledged the entries,
+  compares the calculated tree head to the tree heads of the slaves.
+  If they match, considers the additions to the CT log final and
+  begins the distributing phase.
+
+Merging (slave)
+---------------
+
+* Receives entries from the master. The node must be certain
+  that the request comes from the current master, and not
+  an old one.
+* Takes reasonable measures to ensure that data is in
+  permanent storage.
+* Calculates the new tree head and returns it to the master.
+
+Distributing
+------------
+
+* Performs the following steps for all front-end nodes (see the
+  sketch after this list):
+  * Fetches curpos by calling CurrentPosition().
+  * Calls SendLog() with the hashes of CCs from curpos to newpos,
+    the new size of the log.
+  * Fetches missing_entries, a list of hashes for the CCs that the
+    front-end node does not have, by calling MissingEntries().
+  * For each hash in missing_entries, uploads the CC by calling
+    SendEntry(CC).
+  * Sends the STH with the SendSTH(STH) call.
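+
+A sketch of this loop, assuming that every front-end node is wrapped
+in a client object exposing the five inward-facing entry points named
+earlier, and that log_hashes and entries_by_hash are local merge
+master state (both names are invented for the illustration):
+
+```python
+# Sketch: bring each front-end node up to newpos, then publish the STH.
+def distribute(frontends, log_hashes, entries_by_hash, sth):
+    newpos = len(log_hashes)  # position after the last merged entry
+    for frontend in frontends:
+        curpos = frontend.current_position()          # CurrentPosition()
+        frontend.send_log(log_hashes[curpos:newpos])  # SendLog(Hashes)
+        # Upload only the certificate chains this node lacks.
+        for missing in frontend.missing_entries():    # MissingEntries()
+            frontend.send_entry(entries_by_hash[missing])  # SendEntry(CC)
+        frontend.send_sth(sth)                        # SendSTH(STH)
+```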
+
+
+Merge-repair node
+=================
+
+* There is only one of these nodes.
+* When this node detects that an STH has not been published
+  in t seconds, it begins the automatic repair process.
+
+Automatic repair process
+------------------------
+
+* Turn off all reachable merge nodes.
+* If a majority of the merge nodes cannot be reached,
+  die and report.
+* Fetch the CT log order from the merge nodes.
+* Determine the latest version of the log.
+* Select a new master.
+* Change the configuration of the merge nodes so that
+  they know who the new master is.
+* Start all merge nodes.
+* If any of these steps fails, die and report.
+* If all steps succeed, die and report anyway. The automatic
+  repair process must not be restarted without manual
+  intervention. (The whole sequence is sketched below.)
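+
+A sketch of the repair sequence; the merge-node client objects and
+their methods are invented for the illustration, "latest version of
+the log" is assumed to mean the longest fetched log order, and "die
+and report" is modelled as reporting followed by SystemExit:
+
+```python
+# Sketch: stop the merge nodes, promote the node with the freshest
+# log, reconfigure, restart, and terminate in every case.
+def automatic_repair(merge_nodes, report):
+    try:
+        # Turn off all reachable merge nodes; stop() is True on success.
+        stopped = [node for node in merge_nodes if node.stop()]
+        if len(stopped) <= len(merge_nodes) // 2:
+            raise RuntimeError("majority of merge nodes unreachable")
+        # Take the longest fetched log order as the latest version.
+        orders = {node: node.fetch_log_order() for node in stopped}
+        new_master = max(orders, key=lambda node: len(orders[node]))
+        for node in stopped:
+            node.set_master(new_master)  # rewrite the configuration
+        for node in stopped:
+            node.start()
+    except Exception as failure:  # any failed step: die and report
+        report("automatic repair failed: %s" % failure)
+        raise SystemExit(1)
+    report("repair done; restarting requires manual intervention")
+    raise SystemExit(0)  # die and report anyway
+```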