One of the challenges we have faced as a Celo validator is keeping up with all the information that comes up in Celo's Discord discussions. This is especially true for smaller validators whose portfolios include several networks. To help everyone stay in touch with what is going on in the Celo validator scene, and to contribute to the validator and broader Celo community, we have decided to publish the Celo Discord Validator Digest. Here are the notes for the period of 14-27 September 2020.
Discussions
September 22 incident
On September 22, 13 validators went down for more than 12 blocks, with some of them staying offline for longer. While cLabs identified the problem and published an incident report about it, there was also some input from the validator community:
@zviad | WOTrust | celovote.com: Just a heads up, there was a huge spike in memory use, so that could be one of the reasons why a lot of validators crashed.
If you were running close to the max memory limit, it could have easily caused OOM => which leads to a crash => which leads to a corrupt chain in some cases.
Memory use on my proxy nodes went up by >1.5GB in the span of a few minutes (this was at 2:30 am).
@Francesco | Simply VC: After yesterday, I'm upgrading our proxies from 8GB to 12GB memory. Our active proxy at the time of the tx overload had 8gb and ran out of memory, corrupting the blockchain state.
@zviad | WOTrust | celovote.com: I would say, to be safe, you are probably better off going to 16GB of memory directly on both proxy and validator nodes. Memory use is likely only going to increase, and considering there is sort of a memory leak going on right now too, you are better off having more buffer.
...
We run a 16GB proxy node and a 32GB validator node, and also key rotate to a fresh host once every 2-3 months. Even with all that, the proxy node can creep up to 8-10GB of memory use.
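For Docker-based setups like the ones discussed in this digest, a lightweight way to keep an eye on that kind of memory creep is to sample container usage and, optionally, cap it explicitly. A minimal sketch only (the container name celo-proxy matches the commands quoted later in this digest; adjust it and the limits to your own deployment):

# snapshot of current memory use for the proxy container
docker stats --no-stream --format "{{.Name}}: {{.MemUsage}} ({{.MemPerc}})" celo-proxy
# optionally cap the container so a runaway process is OOM-killed in isolation
# rather than taking the whole host down (pick a limit below total host RAM)
docker update --memory 12g --memory-swap 12g celo-proxy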
Later, pranay published an update on the cause of the issue:
We’re still investigating the issue, but the current understanding is that the sudden burst of txs (around ~4000 at the same time) caused an OOM on several machines, causing a restart which then led to a corrupted DB. There is a fix for the chain corruption issue which will land in the next client release (1.0.2), but the underlying issue with the tx pool is still being investigated. Some of y’all have mentioned a possible memory leak, which may be related.
Were the txs spam that tried to put the network on hold, or were they legit txs?
With regards to whether this was intentional spam to disrupt the network, as far as we can tell, it was not. It was an exchange deploying many hot wallets all at once, in transactions similar to this one: https://explorer.celo.org/tx/0x8d7923e4bbc4477e76b1c168865ee4ab7decf39979c071e99db54f82eea35c5d/internal_transactions. It seems to be a one-off, not malicious intent, so we are not concerned about a repeat. In the meantime, validators are recommended to bump the RAM on their proxies and validator nodes to 12GB to avoid a similar event in the near future while we investigate the tx pool issue. Will follow up as soon as we understand the root cause.
BTW, one silver lining is that proxies functioned as intended, by taking the hit and protecting their validators. Multiproxy is being finalized, and should ship soon, which will help mitigate this type of event in the future as well.
Some community members expressed surprise at the fact that some validators use lower RAM limits:
@ag: Why even talk about OOM? The Celo daemon is not suddenly eating tons of RAM; it sometimes eats a bit more than a tiny amount. Get 32-64 GB + add notifications at 50%/40%/30%/20%/whatever% RAM left = forget about OOM.
Maybe I'm not catching something? Why not just get more RAM? (and wait for fixes)
@zviad | WOTrust | celovote.com: Tbf, considering the Celo validator payout, 16GB of RAM is still very cheap on both the validator and the proxy. My setup is basically what @ag describes: I have an alert at 60% utilization and just key rotate to a fresh host at that point. I have to do key rotations once every 2-3 months, which is useful for other security reasons anyway.
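A bare-bones version of the alerting @ag and zviad describe can be wired up with a cron-driven shell check. This is only a sketch: the threshold, the schedule, and the mail command at the end are placeholders for whatever alerting channel you actually use.

#!/bin/sh
# warn when available memory drops below 40% of total (i.e. >60% in use)
THRESHOLD=40
avail_pct=$(free | awk '/^Mem:/ { printf "%d", $7 / $2 * 100 }')
if [ "$avail_pct" -lt "$THRESHOLD" ]; then
    # placeholder: swap in your real notification hook (PagerDuty, Telegram, etc.)
    echo "Only ${avail_pct}% RAM available" | mail -s "celo node memory alert" ops@example.com
fi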
That, in turn, led to the idea of stress-testing the network regularly to encourage all validators to use more rigorous node configurations:
@warfollowsme | Celomap.io: I do not know who initiated this stress action, but it seems to me that we should say thanks to this person. And I think we need to start doing this stress test once a month/week at a random time. This will help validators a lot in setting up the best config for their server infrastructure. Getting OOM with 4-6k txs in the mempool is unacceptable for validators in a network with such rewards.
@Thylacine | PretoriaResearchLab: Ideally, we should be using Baklava for stress tests?
@warfollowsme | Celomap.io: Yes, of course, for the first tests for sure. But the mainnet also needs to be tested for resilience to more transactions. The most that should happen is a slowdown in transaction processing, not validators dropping out.
@zviad | WOTrust | celovote.com: Here is one suggestion: we can issue a ~1000 CELO bounty from the mainnet community fund for anyone who can take down Baklava. To make things more formal, we can have a process for it, i.e. whoever plans to run any sort of large-scale stress test on Baklava needs to announce it in the #baklava channel to be eligible for the bounty, and they should also say how much Baklava CELO they will use for the attack.
...
Maybe this is something @yaz would be interested in formalizing/organizing ^^. If we want people other than cLabs employees to contribute more actively on the technical side, there need to be proper incentive/compensation setups for it. A few examples:
Various bounties for stress-testing Baklava (e.g. some reward for shutting down the network for ~20-30 minutes, a larger reward for breaking consensus completely for some time period, etc.).
Bounties for various GitHub issues. There are still plenty of open issues and CIP suggestions, but non-cLabs folks most likely won't put in the work to implement them unless they are actually compensated for it. The community can decide on compensation for different issues.
There is plenty of money in the community fund, and we can't expect people to put in extra development time for free to improve the platform.
@pranay: Using the community fund to incentivize bounties for stress testing Baklava is an awesome idea. I think the first step would be to create a meta-proposal to figure out a process for tapping into the community fund. I drafted a post explaining what the community fund is, and I think this is a great reason to expedite getting that out. Will share it out soon.
Useful info
For those building something on top of Celo:
@medha | cLabs: Updated versions of the following packages were released today:
@celo/base@0.0.3
@celo/utils@0.1.20
@celo/contractkit@0.4.14
@celo/celocli@0.0.57
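To pin these exact versions in a project, the usual npm commands apply (library packages go into the project, while celocli is typically installed globally):

npm install @celo/base@0.0.3 @celo/utils@0.1.20 @celo/contractkit@0.4.14
npm install -g @celo/celocli@0.0.57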
Some useful commands from daithi for looking at node peers and removing them:
docker exec celo-proxy geth --exec "admin.peers" attach
should give you a list of peers
Once you've found the peer, it's:
docker exec celo-proxy geth --exec "admin.removePeer('$enode_url')" attach
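Putting the two together, a session might look like this. The enode URL below is a placeholder; substitute the one you spot in the admin.peers output:

# list connected peers and pull out their enode URLs
docker exec celo-proxy geth --exec "admin.peers" attach | grep enode
# drop the offending peer (use the real URL from the output above)
enode_url="enode://<node-id>@<ip>:<port>"
docker exec celo-proxy geth --exec "admin.removePeer('$enode_url')" attach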
The new release of celo-blockchain will contain new monitoring metrics:
@timmoreton | cLabs: There'll be new metrics landing in the celo-blockchain 1.0.2 release, including one for consecutive missed blocks. https://docs.celo.org/validator-guide/monitoring#validator-health-metrics
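Once you are on a release that includes them, the metrics can be inspected the usual geth way. A sketch only, assuming the node was started with --metrics and --pprof so the debug endpoint is served on the default port 6060 (see the monitoring docs linked above for the exact flags and metric names):

# dump all metrics and filter for anything related to missed blocks
curl -s http://localhost:6060/debug/metrics | grep -i miss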
Community
zviad | WOTrust | celovote.com published a gist with a script for monitoring downtime on the Celo network.
Brad | Polychain open sourced a tool for keeping track of Celo balances over time:
Our finance team uses this for the monthly reporting requirements, as well as helping us keep track of total staking rewards for our validators: https://gitlab.com/polychainlabs/celo-accountant - hopefully others find it useful! (and thanks to @zviad | WOTrust | celovote.com for a couple pointers along the way!)
The next Validator Happy Hour time:
@Deepak | cLabs: Oct 8 (Thursday) at 6pm PT / 10 am (Seoul time) / 9 am (Shanghai) - tentative - as this time we want to make it more Asia-friendly.
https://thecelo.com/ now shows the attestation stats for validators on the main page, as well as more detailed data at https://thecelo.com/?tab=attestations.
yaz has joined cLabs as "Partner, Developer Relations with a focus on all things Validators!"
Like what we do? Support our validator group by voting for it!