Disaster Recovery
=================

For unexpected reasons, a significant number [#crash]_ of CCF nodes may become unavailable. In this catastrophic scenario, operators and members can recover transactions that were committed on the crashed service by starting a new network.

The disaster recovery procedure is costly (e.g. the :term:`Service Identity` certificate will need to be re-distributed to clients) and should only be staged once operators are confident that the service will not heal by itself. In other words, the recovery procedure should only be staged once a majority of nodes no longer consistently report one of them as their primary node.

.. tip:: See :ccf_repo:`tests/infra/health_watcher.py` for an example of how a network can be monitored to detect a disaster recovery scenario.

.. note:: From 4.0.9/5.0.0-dev2 onwards, secret sharing used for ledger recovery relies on a much simpler implementation that requires no external dependencies. Note that while the code still accepts shares generated by the old code for now, it only generates shares with the new implementation. As a result, a DR attempt that would downgrade the code to a version that pre-dates this change, after having previously picked it up, would not succeed if a reshare had already taken place.

Overview
--------

The recovery procedure consists of two phases:

1. Operators should retrieve one of the ledgers of the previous service and re-start one or several nodes in ``recover`` mode. The public transactions of the previous network are restored and the new network established.

2. After agreeing that the configuration of the new network is suitable, members should vote to accept the recovery of the network and, once this is done, submit their recovery shares to initiate the end of the recovery procedure. See :ref:`here ` for more details.

.. note:: Before attempting to recover a network, it is recommended to make a copy of all available ledger and snapshot files.

.. tip:: See :ref:`build_apps/run_app:Sandbox recovery` for an example of the recovery procedure using the CCF sandbox.

Establishing a Recovered Public Network
---------------------------------------

To initiate the first phase of the recovery procedure, one or several nodes should be started with the ``Recover`` command in their config file (see also the sample recovery configuration file :ccf_repo:`recover_config.json `):

.. code-block:: bash

    $ cat /path/to/config/file
      ...
      "command": {
        "type": "Recover",
        ...
        "recover": {
          "initial_service_certificate_validity_days": 1
        }
      ...
    $ /opt/ccf/bin/js_generic --config /path/to/config/file

Each node will then immediately restore the public entries of its ledger (``ledger.directory`` and ``ledger.read_only_ledger_dir`` configuration entries). Because deserialising the public entries present in the ledger may take some time, operators can query the progress of the public recovery by calling :http:GET:`/node/state`, which returns the version of the last signed recovered ledger entry. Once the public ledger is fully recovered, the recovered node automatically becomes part of the public network, allowing other nodes to join the network.

The recovery procedure can be accelerated by specifying a valid snapshot file created by the previous service in the directory specified via the ``snapshots.directory`` configuration entry. If specified, the ``recover`` node will automatically recover the snapshot and the ledger entries following that snapshot, which in practice should take a fraction of the total time required to recover the entire historical ledger.

The state machine for the ``recover`` node is as follows:

.. mermaid::

    graph LR;
        Uninitialized-- config -->Initialized;
        Initialized-- recovery -->ReadingPublicLedger;
        ReadingPublicLedger-->PartOfPublicNetwork;
        PartOfPublicNetwork-- member shares reassembly -->ReadingPrivateLedger;
        ReadingPrivateLedger-->PartOfNetwork;

.. note:: The ledgers of different nodes may differ slightly in length, since some transactions may not yet have been fully replicated. It is preferable to use the ledger of the node that was primary before the service crashed. If the latest primary node of the defunct service is not known, it is recommended to `concurrently` start as many nodes as previously existed in ``recover`` mode, each recovering the ledger of one defunct node. Once all nodes have completed the public recovery procedure, operators can query the highest recovered signed seqno (as per the response to the :http:GET:`/node/state` endpoint) and select this ledger to recover the service. Other nodes should be shut down and new nodes restarted with the ``join`` option.

Similarly to the normal join protocol (see :ref:`operations/start_network:Adding a New Node to the Network`), other nodes are then able to join the network.

.. warning:: After recovery, the identity of the network has changed. The new service certificate ``service_cert.pem`` must be distributed to all existing and new users.

The state machine for the ``join`` node is as follows:

.. mermaid::

    graph LR;
        Uninitialized-- config -->Initialized;
        Initialized-- join -->Pending;
        Pending-- poll status -->Pending;
        Pending-- trusted -->PartOfPublicNetwork;

Summary Diagram
---------------

.. mermaid::

    sequenceDiagram
        participant Operators
        participant Node 0
        participant Node 1
        participant Node 2

        Operators->>+Node 0: recover
        Node 0-->>Operators: Service Certificate 0
        Note over Node 0: Reading Public Ledger...

        Operators->>+Node 1: recover
        Node 1-->>Operators: Service Certificate 1
        Note over Node 1: Reading Public Ledger...

        Operators->>+Node 0: GET /node/state
        Node 0-->>Operators: {"last_signed_seqno": 50, "state": "readingPublicLedger"}
        Note over Node 0: Finished Reading Public Ledger, now Part of Public Network
        Operators->>Node 0: GET /node/state
        Node 0-->>Operators: {"last_signed_seqno": 243, "state": "partOfPublicNetwork"}

        Operators->>+Node 1: GET /node/state
        Node 1-->>Operators: {"last_signed_seqno": 36, "state": "readingPublicLedger"}
        Note over Node 1: Finished Reading Public Ledger, now Part of Public Network
        Operators->>Node 1: GET /node/state
        Node 1-->>Operators: {"last_signed_seqno": 203, "state": "partOfPublicNetwork"}

        Note over Operators, Node 1: Operators select Node 0 to start the new network (243 > 203)

        Operators->>+Node 1: shutdown

        Operators->>+Node 2: join
        Node 2->>+Node 0: Join network (over TLS)
        Node 0-->>Node 2: Join network response
        Note over Node 2: Part of Public Network

Once operators have established a recovered crash-fault tolerant public network, the existing members of the consortium :ref:`must vote to accept the recovery of the network and submit their recovery shares `.

Sealing-based Recovery (Experimental)
-------------------------------------

Sealing-based recovery aims to minimise operator intervention during disaster recovery, by first automating the recovery decision protocol (deciding which node has the best ledger to recover) and then allowing the chosen node to automatically recover the ledger secrets using previously locally sealed secrets. Together these features allow a network to automatically recover from a crash without requiring operators to manually inspect the ledgers and intervene in the recovery process, while still ensuring that the recovered ledger is the most up-to-date one available.

Recovery Decision Protocol
~~~~~~~~~~~~~~~~~~~~~~~~~~

At a high level, the recovery decision protocol allows recovering nodes to discover which node has the most up-to-date ledger and automatically recover the network using that ledger.
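
As an illustration of what "most up-to-date ledger" means, the sketch below reproduces the manual comparison from the summary diagram (243 > 203): given the responses of several recovering nodes to ``GET /node/state``, it picks the node reporting the highest ``last_signed_seqno``. It assumes that this single value is the only selection criterion, which may be a simplification of what the protocol actually compares. The ``pick_best_ledger`` helper and the node names are hypothetical and not part of CCF.

```python
from typing import Any, Mapping


def pick_best_ledger(states: Mapping[str, Mapping[str, Any]]) -> str:
    """Return the name of the node with the highest recovered signed seqno.

    `states` maps a (hypothetical) node name to the JSON body returned by
    GET /node/state, for example:
    {"last_signed_seqno": 243, "state": "partOfPublicNetwork"}
    """
    # Only consider nodes that have finished reading their public ledger;
    # a node still in "readingPublicLedger" reports a partial seqno.
    ready = {
        name: body
        for name, body in states.items()
        if body["state"] == "partOfPublicNetwork"
    }
    if not ready:
        raise ValueError("no node has finished public recovery yet")
    return max(ready, key=lambda name: ready[name]["last_signed_seqno"])


if __name__ == "__main__":
    states = {
        "node0": {"last_signed_seqno": 243, "state": "partOfPublicNetwork"},
        "node1": {"last_signed_seqno": 203, "state": "partOfPublicNetwork"},
    }
    print(pick_best_ledger(states))  # node0
```

Nodes that are still reading their public ledger are deliberately ignored, since their reported seqno is not yet final.
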
The protocol completes with a node choosing to `transition-to-open`, and so requires another mechanism to unseal and recover the private ledger.

The protocol uses three phases to ensure that, so long as the hosts and the network between them are sufficiently healthy, forks are prevented and the most up-to-date ledger is recovered. Specifically, the protocol ensures this so long as all nodes restart, have full network connectivity, and the on-disk ledgers of a majority of nodes contain every committed transaction. This is a strong but reasonable requirement, and greatly simplifies the protocol. To ensure progress even when these requirements are not met, the protocol also includes a fallback path that advances through the phases after a timeout. This fallback cannot prevent forks or data loss, but allows the service to recover and make progress even in unhealthy conditions. We refer to the healthy case as the "election path" and the other as the "failover path".

In the election path, nodes first gossip with each other, learning of the ledgers of other nodes. Once they have heard from every node, they vote for the node with the best ledger. If a node receives votes from a majority of nodes, it invokes `transition-to-open` and notifies the other nodes to restart and join it. This path is illustrated below, and is guaranteed to succeed if all nodes can communicate and no timeouts trigger.

.. mermaid::

    sequenceDiagram
        participant N1
        participant N2
        participant N3

        Note over N1, N3: Gossip
        N1 ->> N2: Gossip(Tx=1)
        N1 ->> N3: Gossip(Tx=1)
        N2 ->> N3: Gossip(Tx=2)
        N3 ->> N2: Gossip(Tx=3)

        Note over N1, N3: Vote
        N2 ->> N3: Vote
        N3 ->> N3: Vote

        Note over N1, N3: Open/Join
        N3 ->> N1: IAmOpen
        N3 ->> N2: IAmOpen
        Note over N1, N2: Restart
        Note over N3: Transition-to-open
        Note over N3: Local unsealing
        Note over N3: Open
        N1 ->> N3: Join
        N2 ->> N3: Join

If failover is enabled, each phase has a timeout, after which the node will advance to the next phase regardless of whether it meets the requirements to do so. For example, the election path requires all nodes to communicate to advance from the gossip phase to the vote phase. However, if any node fails to recover, the election path is stuck. In this case, after a timeout, nodes will advance to the vote phase regardless of whether they have heard from all nodes, and vote for the best ledger they have heard of at that point.

Unfortunately, this can lead to multiple forks of the service if different nodes cannot communicate with each other and time out. Hence, we recommend setting the timeout substantially higher than the highest expected recovery time, to minimise the chance of this happening. To audit whether timeouts were used to open the service, consult the ``public:ccf.gov.recovery_decision_protocol.open_kind`` table, which tracks this.

This failover path is illustrated below.

.. mermaid::

    sequenceDiagram
        participant N1
        participant N2
        participant N3

        Note over N1, N3: Gossip
        N2 ->> N3: Gossip(Tx=2)
        N3 ->> N2: Gossip(Tx=3)
        Note over N1: Timeout
        Note over N3: Timeout

        Note over N1, N3: Vote
        N1 ->> N1: Vote
        N3 ->> N3: Vote
        N2 ->> N3: Vote

        Note over N1, N3: Open/Join
        Note over N1: Transition-to-open
        Note over N3: Transition-to-open

If the network fails during reconfiguration, each node will use its latest known configuration to recover.
Since reconfiguration requires votes from a majority of nodes, the latest configuration should recover using the election path; however, nodes in the previous configuration may recover using the failover path.

Local Sealing
~~~~~~~~~~~~~

When sealing-based recovery is enabled, each node generates an RSA key pair (the "recovery key pair") during join. The private key is encrypted (sealed) using an AES-GCM key derived from the SNP ``DERIVED_KEY``, and the public key along with the sealed private key is stored in the ``public:ccf.gov.nodes.sealed_recovery_keys`` table.

During normal operation, whenever the ledger secret changes, or a node joins the network, the system also shuffles "sealed shares". The primary generates a fresh ledger secret wrapping key, encrypts the ledger secret with that key, and stores a sealed copy of the wrapping key for each trusted node with a sealed recovery public key.

During recovery, if the node was previously part of the network and has the same CPU, measurement, and policy, it can bypass the need for member recovery shares by re-deriving the sealing key, unsealing its recovery private key, decrypting the sealed wrapping key, and using that to unwrap the ledger secret.

The following diagram illustrates the key hierarchy and encryption relationships:

.. mermaid::

    flowchart TB
        subgraph SNP["SNP PSP"]
            DK["DERIVED_KEY"]
            VCEK
            Measurement
            Policy["UserData (Policy)"]
            TCB
            VCEK --> DK
            Measurement --> DK
            Policy --> DK
            TCB --> DK
        end

        subgraph KG["Key Generation"]
            subgraph Sealing Key
                SK["Sealing Key<br/>(HKDF)"]
                Label["Label:<br/>CCF AMD Local Sealing Key"]
                DK -->|ikm| SK
                Label -->|info| SK
            end
            RSA["Recovery Key<br/>(RSA Key Pair)"]
            PubKey["Public Key"]
            PrivKey["Private Key"]
            RSA --> PubKey
            RSA --> PrivKey
            LS["Ledger secret"]
            LSWK["Ledger secret wrapping key"]
        end

        subgraph Sealed["Store: nodes.sealed_recovery_keys"]
            SPK["Sealed Private Key<br/>(AES-GCM encrypted)"]
            SK -->|key| SPK
            PrivKey --> SPK
            StoredPubKey["Public Key (plaintext)"]
            PubKey --> StoredPubKey
        end

        subgraph Shares["Store: internal.sealed_shares table"]
            WLS["Wrapped Ledger Secret<br/>(AES-GCM encrypted)"]
            LSWK -->|key| WLS
            LS --> WLS
            EWK["Encrypted Wrapping Key<br/>(per-node, RSA-OAEP encrypted)"]
            StoredPubKey -->|key| EWK
            LSWK --> EWK
        end

        subgraph Recovery["Recovery Process"]
            UPK["Unsealed Private Key"]
            SK -->|key| UPK
            SPK --> UPK
            UWK["Unsealed Wrapping Key"]
            UPK -->|key| UWK
            EWK --> UWK
            ULS["Unsealed Ledger Secret"]
            UWK -->|key| ULS
            WLS --> ULS
        end

Configuration
~~~~~~~~~~~~~

Setting the ``sealing_recovery`` field in the configuration enables local sealing: the current node seals ledger secrets into the ledger, and a future recovering node attempts to unseal these secrets using the supplied ``sealing_recovery.location.name``.

Additionally, the ``sealing_recovery.recovery_decision_protocol`` field can be set to enable the recovery decision protocol and configure its parameters.

.. code-block:: json

    {
      "sealing_recovery": {
        "location": {
          "name": "",
          "address": ""
        },
        "recovery_decision_protocol": {
          "expected_locations": [
            { "name": "", "address": "" },
            { "name": "", "address": "" }
          ],
          "failover_timeout": "2000ms"
        }
      }
    }

Setting ``sealing_recovery.recovery_decision_protocol.failover_timeout`` to ``0ms`` disables failover timers.

Notes
-----

- Operators can track the number of times a given service has undergone the disaster recovery procedure via the :http:GET:`/node/network` endpoint (``recovery_count`` field).

.. rubric:: Footnotes

.. [#crash] When using CFT as the consensus algorithm, CCF tolerates up to `(N-1)/2` crashed nodes (where `N` is the number of trusted nodes constituting the network) before having to perform a recovery procedure. For example, in a 5-node network, no more than 2 nodes are allowed to fail for the service to be able to commit new transactions.
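
As a quick check of the crash-tolerance bound given in the footnote above, the sketch below evaluates ``(N-1)/2`` with integer division for a few network sizes. The ``max_crashed_nodes`` and ``needs_disaster_recovery`` helpers are hypothetical and only illustrate the arithmetic; they are not part of CCF.

```python
def max_crashed_nodes(n_trusted: int) -> int:
    """Largest number of crashed nodes a CFT network of n_trusted nodes tolerates."""
    if n_trusted < 1:
        raise ValueError("a network needs at least one trusted node")
    # A majority must remain available, so at most floor((N - 1) / 2) may crash.
    return (n_trusted - 1) // 2


def needs_disaster_recovery(n_trusted: int, n_crashed: int) -> bool:
    """True when more nodes have crashed than the CFT bound allows."""
    return n_crashed > max_crashed_nodes(n_trusted)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"N={n}: tolerates {max_crashed_nodes(n)} crashed node(s)")
```

For the 5-node example in the footnote, ``max_crashed_nodes(5)`` evaluates to 2, so a third crash is the point at which the service can no longer commit new transactions.
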