The much-anticipated Cosmos Network will go live at the beginning of March, ushering in an era of the Internet of Blockchains: hundreds of scalable blockchain ledgers, hosting performant Dapps, that can send tokens to each other effortlessly.
Our team of six at Chorus One spent the better part of 2018 building a Cosmos validator, and we are happy to share some of the validation capabilities we have deployed on Cosmos test networks. Similar capabilities shall also be extended to other networks in the future, starting with the Loom Network, where we are already live and validating.
In Proof-of-Stake networks, validators maintain the blockchain and try to earn the maximum reward for their customers (also known as 'delegators') while taking the least risk. On the Cosmos Network, staking is highly incentivized, and those participating in securing the blockchain stand to earn considerable interest on their holdings of Atom (the cryptocurrency native to the Cosmos Network). On the downside, validators can accrue penalties if their system exhibits malicious behavior or suffers from extended downtime. In order to maximize risk-adjusted returns, our technical system needs to:
- Minimize downtime: The higher the uptime of a validator, the greater the financial returns made by the validator for its customers.
- Develop secure internal processes to handle cryptographic keys and assets: while a validator does not take custody of customer funds, the private keys that identify a validator on the Cosmos Network are extremely high-value objects. Their compromise could trigger business losses for the validator, and losses of up to 20% for its customers.
We have built various components and processes to achieve these objectives, as described below.
Downtime can occur due to the physical failure of critical servers that process new blocks and sign new blocks on behalf of the validator.
We've built a highly available architecture in which all the components of our validator (validating machines, logging machines, monitoring machines, machines that enhance network connectivity, and so on) are duplicated in the United States and in the UK. In this article, we call the components operating in the United States 'the US leg' and those operating in the UK 'the UK leg'. On both sides of the Atlantic, we use multiple data centers to further reduce the risk of either leg suffering an outage.
In a highly available system, some components are harder to duplicate across the Atlantic than others. One of the trickiest is the validating server. This machine ultimately decides what gets signed, and it must never sign two contradictory blocks at the same position in the blockchain. But once two validating machines sit on opposite sides of the Atlantic, how does one ensure that they never make this kind of mistake?
We've developed a custom system, the Chorus active/active coordinator, for this purpose. In a later blog post, we will open source and explain this component. We believe the coordinator will be of enormous utility to the Cosmos ecosystem and result in qualitatively better validators.
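The invariant that any such system has to protect can be illustrated with a minimal sketch. This is not the Chorus coordinator, which this post does not describe; `DoubleSignGuard` and `may_sign` are hypothetical names, and the sketch assumes a single process that tracks only block height, whereas a real validator would also have to track consensus rounds and share this state safely between both legs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SignState:
    """The last position at which a signature was approved."""
    height: int
    block_hash: str


class DoubleSignGuard:
    """Hypothetical guard: refuses to approve two different blocks at one height."""

    def __init__(self) -> None:
        self._last: Optional[SignState] = None

    def may_sign(self, height: int, block_hash: str) -> bool:
        last = self._last
        if last is not None:
            if height < last.height:
                # Never sign at a position older than one already signed.
                return False
            if height == last.height:
                # Re-approving the identical block is harmless;
                # a different block at the same height is a double sign.
                return block_hash == last.block_hash
        # New, higher position: record it before approving the signature.
        self._last = SignState(height, block_hash)
        return True
```

The hard part in an active/active setup is not this check but keeping the recorded state consistent across two machines on different continents, which is exactly what a coordinator has to solve.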
A second, unrelated cause of downtime is software failure, particularly in the critical validating machines. Cosmos is a young project, so such failures are to be expected. We've instituted logging, monitoring, analytics, and root cause analysis systems to catch these failures and learn from the experience.
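As one illustration of what such monitoring might compute, the sketch below tracks the fraction of recent blocks a validator signed and flags when it drops below a threshold. The class name, the window size, and the threshold are all hypothetical; they are neither Cosmos network parameters nor a description of our actual monitoring stack.

```python
from collections import deque


class UptimeMonitor:
    """Hypothetical sliding-window monitor over signed/missed blocks."""

    def __init__(self, window: int = 100, min_signed_ratio: float = 0.95) -> None:
        # Only the most recent `window` observations are kept.
        self._window = deque(maxlen=window)
        self._min_signed_ratio = min_signed_ratio

    def record(self, signed: bool) -> None:
        """Record whether the validator signed the latest block."""
        self._window.append(signed)

    def signed_ratio(self) -> float:
        """Fraction of blocks signed within the window (1.0 if empty)."""
        if not self._window:
            return 1.0
        return sum(self._window) / len(self._window)

    def should_alert(self) -> bool:
        """True when the signed ratio falls below the configured minimum."""
        return self.signed_ratio() < self._min_signed_ratio
```

A monitor like this only detects that blocks are being missed; finding out why still depends on the logs and root cause analysis mentioned above.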
Handling cryptographic keys securely
The secure generation, backup, and operational use of cryptographic keys is an old and hard problem. We followed NIST guidelines to build our internal key management system.
Some of the key features of our key management processes are:
- Separation of the related duties into different roles inside the company: cryptographic officers generate sensitive key material, cryptographic auditors ensure the faithful execution of key ceremonies, cryptographic administrators prepare software for key handling, and key material custodians store keys in multi-signature configurations.
- The use of Hardware Security Modules for validation signing. These are devices specialized for key storage and signing on servers. Think of them as glorified hardware wallets for servers.
- Geographically distributed key storage and key recovery. No dependence on a single person for recovery (unlike the recent Quadriga episode).
- No manual work to generate backup and recovery keys. All operations that handle keys are performed using precise, repeatable software on air-gapped machines.
Title picture credit: Holger Link on Unsplash.