by Meher Roy
The Game of Stakes (GoS) has been the most exhilarating experience for me as the technical lead of Chorus One. There were moments of celebration, times of sheer terror, countless hours of hard work and sleepless nights. Before mainnet begins, we want to look back on GoS and share some of our learnings. We hope they’ll help people look under the hood of what it is like to run a serious validator on Cosmos, and that the lessons will be valuable for those following in our path.
In this first article, we cover the system we ran in GoS, the unique challenges it brought, and how we scrambled to recover from our downtime episodes. In a later article, we will cover our efforts to grapple with hard forks and network attacks.
What must a validator do well?
At first glance, the work of a validator looks simple. Run some open source code, beautifully built by the Cosmos team, on a server with high uptime. Verify transactions, sign blocks and make sure the chain progresses. What is difficult about that?
It turns out, there are many different prerequisites to deliver on this:
- Uptime: The ultimate aim is 100% uptime. GoS had a crazy amount of inflation! Validators started with 15,000 stake and ended up with ~2.5 million. Being down for one hour in GoS was the equivalent of being down for half a month on the Cosmos main network.
- Key security: Validation keys must be generated, stored and used securely. This requires the creation of ceremonies, usage of special hardware, audits etc. We wanted to build something so secure that people could delegate tokens worth tens of millions to us and we wouldn’t lose sleep at night. Managing keys is not a novel challenge and the National Institute of Standards and Technology has published a comprehensive (120 pages long!) set of guidelines for how to do it. We implemented this standard and developed custom software for key management. We’ll write more about this in another post.
- Monitoring and Alerting: Automated systems need to be put in place to monitor the performance of the validator and alert team members when something goes wrong - a digital fire alarm.
- On call schedule and operations manual: When issues arise, team members have to be available 24/7 to troubleshoot them quickly.
- Logging and analysis: After immediate recovery from issues, one needs the ability to diagnose what went wrong and institute preventive action.
- Analyze the validator’s financial performance: In the end, a key performance parameter is validator earnings. A great way to identify problems or opportunities for improvement is to track the real-time relative performance of all validators. Say we see that in the last 3 hours, validator X had revenues 3.4% higher than ours - we want to know about this immediately, understand the root cause and take action. To do this, a validator needs to build an analytics system, define clear metrics to track and set up alerts around those.
- Respond to change: GoS organizers threw in a lot of changes, hard forks, and encouraged network attacks during the game. Each of these requires the validator to understand the changes, make adjustments and implement them rapidly.
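The relative-performance idea above can be sketched in a few lines: given each validator’s earnings over a window, flag any peer that outperformed us beyond a threshold. This is an illustrative sketch, not our actual analytics system - the function name, data shapes and numbers are made up for the example.

```python
# Hypothetical sketch of relative-performance alerting: compare our
# validator's earnings over a window against every peer and flag any
# that outperformed us by more than a threshold.

def outperformers(earnings, ours, threshold_pct=3.0):
    """Return (validator, pct_diff) pairs for peers whose earnings
    exceeded ours by more than threshold_pct over the window."""
    flagged = []
    for validator, amount in earnings.items():
        if validator == ours:
            continue
        pct_diff = (amount - earnings[ours]) / earnings[ours] * 100
        if pct_diff > threshold_pct:
            flagged.append((validator, round(pct_diff, 1)))
    return sorted(flagged, key=lambda v: -v[1])

window = {"chorus-one": 1000.0, "validator-x": 1034.0, "validator-y": 990.0}
print(outperformers(window, "chorus-one"))  # validator-x is 3.4% ahead of us
```

An alerting hook would then page the team whenever this list is non-empty, so a revenue gap like the 3.4% example above is investigated within hours rather than weeks.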
Preparing for the Game of Stakes
Our story starts in September, about 3 months before GoS actually started. We were on our quarterly team retreat, and set two GoS-related objectives:
- Deploy our High Availability (HA) setup to guarantee high uptime.
- Develop one or two offensive attacks for use against other validators.
The reader might wonder: what is a high-availability (HA) setup? This is an ingenious piece of our validator architecture. A validator needs a variety of servers to do different things - servers for signing, for monitoring, for logging, servers to scan for vulnerabilities etc. In total, our system coordinates 10+ machines for basic validation.
We chose to duplicate all of these 10+ machines across the Atlantic. For any validator component, say monitoring, one of our machines resides on the US East Coast and another in the UK. That way, if either machine goes down, the overall system does not lose that functionality. We can tolerate the entirety of mainland Europe suffering a power failure and still continue validating. A fault-tolerant validator!
Many readers will be familiar with the concept of double-signing: a validator signing two different blocks at the same height. The protocol considers this an attack, and it leads to severe financial penalties. But double-signing can also happen accidentally if one runs a redundant system. Our big goal was to have a fully redundant system while completely eliminating the risk of double-signing arising from that redundancy.
We decided to design an active-active architecture. It looks as follows: we have two validating servers across the Atlantic. Both machines have access to the same Chorus One validator key and work concurrently, but a ‘coordinator’ prevents the servers in the UK and US from ever signing different messages at the same block height.
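The core guarantee the coordinator provides can be sketched very simply: before either signer signs, it must claim the (height, round, step) slot from a single coordinator, which only approves a second request for the same slot if the message matches byte-for-byte. This is an illustration of the idea, not Chorus One’s actual coordinator protocol.

```python
# Minimal sketch of a double-sign guard for an active-active setup.
# The first signer to claim a consensus slot wins it; a concurrent
# signer may only sign the exact same message, never a conflicting one.
import hashlib

class SignCoordinator:
    def __init__(self):
        self.approved = {}  # (height, round, step) -> message digest

    def request_sign(self, height, round_, step, message: bytes) -> bool:
        slot = (height, round_, step)
        digest = hashlib.sha256(message).hexdigest()
        if slot not in self.approved:
            self.approved[slot] = digest      # first signer claims the slot
            return True
        return self.approved[slot] == digest  # re-signing the same msg is safe

coord = SignCoordinator()
assert coord.request_sign(100, 0, "precommit", b"block-A")      # US leg signs
assert coord.request_sign(100, 0, "precommit", b"block-A")      # UK leg, same msg: OK
assert not coord.request_sign(100, 0, "precommit", b"block-B")  # conflict: denied
```

In a real deployment the coordinator’s state would itself need to be durable and consistent across failures - that is where most of the engineering effort goes.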
We spent many months developing this and, to our knowledge, there is only one other team that has built an architecture that works in a similar way. We thank the Cosmos and Iqlusion teams for their work on the tmkms, which we built upon. Prior to GoS, we were only able to test our highly available setup in the test network for one week, but we were set on running GoS with the same infrastructure. In the end, there were upsides and downsides to that, but it provided invaluable lessons for the mainnet.
On January 4, Brian Crain manually observed that our validator was down: both legs - the US and Europe - had stopped signing. Certus One and Staking Facilities had jointly executed a mempool attack that night, and I assumed the root cause was related to the attack. One of our engineers was on annual leave, and the other was in bed - it was 1 am his time.
Until then, our primary focus had been to deliver a secure, performant, highly available validator. With last-minute changes to Tendermint, Cosmos and tmkms (including from our own team) to facilitate this, other facets of our system did not receive the attention they deserved: our monitoring system was not fully operational yet, we had no centralized logging and no on-call rotation.
Our ability to anticipate problems, react quickly when issues arose and subsequently determine a root cause was hugely restricted as a result. In total, our validator was offline for 4 hours, during which we dropped to around 10th place in the rankings. To this day, we are still not 100% sure what caused both nodes to fail on that fateful night.
The following week, 7-11 January, saw a concerted effort to increase our ability to observe and understand our systems. Monitoring checks and alerting improved daily, and work began on an analytics system that would let us observe not only our own performance but also analyze it relative to the rest of the network. We didn’t yet fully trust our alerting and worried about downtime, so we kept checking on our validator every few hours, day and night.
The week thereafter, 12-20 January, brought another big challenge. Chorus One is a distributed team, and we meet physically every quarter. During this week, team members from all over the world gathered in Egypt for our retreat. We planned our travels, sleep schedules and meetings so that somebody would always be available to respond to issues with our validator. We had a successful week, and it boosted our confidence.
After all this hard work, on Saturday January 19, we went for a tour of the Giza pyramids and the Egyptian Museum. We spent the evening smoking shisha and playing backgammon right by the Nile. It was a perfect and idyllic day as a reward for our travails in Game of Stakes!
Never let your guard down
Internet connectivity is generally poor in Egypt - mobile Internet works, but coverage is sketchy. Unbeknownst to us, our shisha evening was leading us to disaster: our validator was down, but bad mobile connections meant our team members hadn’t been alerted. It took a hurried Uber ride through Cairo’s crazy traffic for our engineer, Joe Bowman, to find a decent internet connection and get us back up. That’s how our position slipped to the mid-teens - in a shisha bar on the Nile, playing backgammon. What a way to go down :).
We doubled down on our alerting efforts and finally had an automated system that would phone team members when the validator was down. I can’t express my relief - no longer would we need to compulsively check block explorers!
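The heart of such a downtime alert is simple: sample the node’s reported block height (Tendermint exposes this via its RPC `/status` endpoint) and page someone when the height stops advancing for too long. The sketch below shows only that stall-detection logic, with hypothetical names; the actual paging was handled by a phone-call alerting service.

```python
# Sketch of a block-height stall detector: feed it height samples and
# it reports when the chain view has not advanced within a deadline.
# In production the samples would come from the node's RPC /status
# endpoint and a positive result would trigger a phone-call alert.
import time

class StallDetector:
    def __init__(self, deadline_secs=60):
        self.deadline = deadline_secs
        self.last_height = None
        self.last_advance = None

    def observe(self, height, now=None) -> bool:
        """Record a height sample; return True if the node looks stalled."""
        now = time.monotonic() if now is None else now
        if self.last_height is None or height > self.last_height:
            self.last_height = height
            self.last_advance = now
            return False
        return (now - self.last_advance) > self.deadline

detector = StallDetector(deadline_secs=60)
assert not detector.observe(500, now=0)    # first sample
assert not detector.observe(501, now=30)   # chain progressing
assert not detector.observe(501, now=60)   # stuck, but still within deadline
assert detector.observe(501, now=120)      # stalled: page the on-call engineer
```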
This alerting system made our response times to incidents much better. With one exception, team members responded to issues within 15 minutes. Even in that one exception - a downtime episode on Jan 28 - alerting worked perfectly: I was personally troubleshooting within 5 minutes, but lacked the knowledge to actually fix the problem.
However, we faced a different problem: why was our system going down so frequently (~7 times in a month) in the first place? We knew this had to be some kind of software failure, since our HA solution protects us against network and hardware failures. Judging from the uptime patterns of other validators, the problem was also likely specific to us. Our suspicion therefore centered on the active-active/high-availability changes we had made to the Tendermint KMS.
Adding more verbose logging output to the KMS showed that our downtime incidents were triggered by the KMS sending the Cosmos application messages it was unable to handle, causing Cosmos to terminate the connection to the KMS. It transpired that some of our high-availability changes to the KMS, coupled with a Tendermint buffer handling bug, resulted in these spurious messages being broadcast. This issue has been addressed in the latest Tendermint release, and we therefore expect a marked improvement in our system’s reliability.
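The failure mode above - a connection torn down on the first unparseable message - illustrates a general design choice in stream consumers: a strict reader terminates the session on any unknown frame, while a tolerant reader drops the frame and carries on. The sketch below is a toy illustration of that trade-off; the message types and framing are invented and real Tendermint/KMS traffic is length-prefixed protobuf.

```python
# Toy illustration of strict vs. tolerant frame handling. A strict
# consumer (like the behaviour we hit) aborts the whole session on the
# first unknown frame; a tolerant one skips it and keeps signing.

KNOWN_TYPES = {"sign_vote", "sign_proposal", "ping"}

def process_stream(frames, strict=True):
    """Consume (msg_type, payload) frames; return the handled types.
    strict=True mimics terminating the connection on an unknown frame."""
    handled = []
    for msg_type, payload in frames:
        if msg_type not in KNOWN_TYPES:
            if strict:
                raise ConnectionError(f"unknown frame: {msg_type}")
            continue  # tolerant mode: drop the spurious frame, carry on
        handled.append(msg_type)
    return handled

frames = [("sign_vote", b"..."), ("garbage", b"\x00"), ("ping", b"")]
assert process_stream(frames, strict=False) == ["sign_vote", "ping"]
```

Tolerance has its own risks - silently dropping frames can mask real protocol bugs - which is presumably why the fix landed in Tendermint itself rather than in more forgiving consumers.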
The final run
Our learnings and efforts over the past 45 days have tremendously improved our capabilities as a validator. Our performance in the final leg of Game of Stakes, GoS6, was a great demonstration of the progress made over the last 8 weeks. We had a near perfect uptime record in the final days and were also able to handle creative network attacks without pain.
We ended the game at position #17 in the rankings by stake. Whilst the rank is not quite what we had envisaged at the start of this process, GoS gave us the chance to battle-test our innovative setup and improve our operational readiness for mainnet. Once we move past these early innovation issues, we are confident the payback from our design will be very large in the long run.
It would have been very tempting to approach Game of Stakes with a mindset of ‘win at any cost’ and compromise on security in order to improve latency, as some validators did. However, this was never an option for Chorus One - we have invested significant time, effort and money in producing the most secure and reliable infrastructure, supported by well-rehearsed best-practice processes, to ensure we can provide the best staking service to our customers. Game of Stakes was the perfect opportunity to test, practice and improve upon our work and to be confident in helping secure Cosmos as the mainnet takes off. We’re grateful to the Cosmos team, and especially Zaki Manian, for doing an amazing job running GoS. It’s also been a great pleasure to work with the outstanding Cosmos validator community. Our confidence in Cosmos has never been higher.
If you want to discuss Game of Stakes and Cosmos further, stop by our Chorus One Telegram and say Hi. And, of course, we’re happy to answer any questions about delegating to Chorus One.