Celo Discord Validator Digest #13

Solution to upgrade validator nodes, proposal to bump the minimumClientVersion to 1.0.1, useful info, and new community channels and contributions.

One of the challenges we have faced as a Celo validator is keeping up with all the information that comes up in Celo's Discord discussions. This is especially true for smaller validators whose portfolios include several networks. To help everyone stay in touch with what is going on in the Celo validator scene, and to contribute to the validator and broader Celo community, we have decided to publish the Celo Discord Validator Digest. Here are the notes for the period of 20 July - 2 August 2020.


Discussions

Solution to upgrade validator nodes

cLabs has started collecting feedback on a solution for upgrading validator nodes without incurring a downtime penalty, which prompted a discussion among validators:

@Joshua | cLabs: Hello. I'm working on a solution to upgrade validators without incurring a downtime penalty. The two simplest solutions that we've come up with are improving geth's restart and reconnection time, and adding an RPC to have a validator start/stop mining at a certain block to allow for hot swapping an upgraded validator in without a key rotation. If you have a preference please fill out this form: https://forms.gle/pxjPWcWnU43QG3VC8 or dm me.

@Viktor Bunin | Bison Trails: Are you open to recommendations around relaxing the 12 block miss requirement?

It would be nice to never miss any blocks, but zero downtime upgrades can be onerous from a process / coordination perspective for professional validators.

Given the costs on operators, is there enough value being added to the chain's performance from a zero downtime upgrade as opposed to a 30-50 block miss (as an example)?

@Joshua | cLabs: I'd prefer under 12 blocks, but relaxing it is an option. One of the questions on the survey is trying to ask what time budget geth has from startup to signing again. One part of relaxing would be looking at network impact, the other is looking at how long an upgrade takes, with info on what each step takes (time to kill geth cleanly, time to restart vm/container, time for geth to start up, time for geth to start signing).

@keefertaylor: Would we accomplish the same incentives if we increased the 12 block window but also increased the minimum penalty?

It seems like a fair amount of validators went the route of just upgrading and taking the downtime hit this last round and there was ultimately no (obviously notable) effect on the network's ability to commit blocks.

@Thylacine | PretoriaResearchLab: Re: upgradeability, I don't understand how everyone didn't just key rotate to their backup server for the docker image update. Are people not running validators with a fully synced separate server in a different location or cloud region? As far as I understood from the messaging, we didn't need to update immediately and had the time to wait for an epoch boundary...?

@syncnode (George Bunea): Rotating key is not something I like for example. Also we have seen cases of incomplete rotations which is totally undesirable. I have a fully synced backup server and I was one of the first validators that tested miner start and stop functionality. For me it is a way better method of updating.

Unfortunately as geth works now you can’t be sure that the miner start works in less than 12 blocks.

But still I find it to be a better option than a key rotation due to flexibility of the process.

@Peter [ChainLayer.io]: Imo key rotation is and should be a security feature, not an ops feature. Also epoch endings are not everywhere inside office hours so this forces you to run standby services for regular maintenance, which imo people now will neglect. And lastly as we’ve unfortunately seen, key rotation can go wrong for various reasons. It adds an extra point of failure for what could be a simple restart.

But multiple proxies/threshold signatures are the real solution here I guess.
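
As a rough illustration of the time budget Joshua describes, the sketch below adds up the duration of each upgrade step and converts the total into missed blocks. The 12-block window comes from the discussion; the 5-second block time and all step timings are our assumptions for illustration, not measured values.

```python
import math

# Assumption: a block time of about 5 seconds. The 12-block window is the
# figure mentioned in the discussion above.
BLOCK_TIME_SECONDS = 5
DOWNTIME_WINDOW_BLOCKS = 12


def blocks_missed(step_seconds, block_time=BLOCK_TIME_SECONDS):
    """Estimate how many blocks pass while the validator is not signing."""
    total = sum(step_seconds.values())
    # Round up: a partially elapsed block counts as missed in this sketch.
    return math.ceil(total / block_time)


def fits_window(step_seconds, window_blocks=DOWNTIME_WINDOW_BLOCKS):
    """True if the restart stays under the downtime window."""
    return blocks_missed(step_seconds) < window_blocks


# Hypothetical timings (seconds) for the four steps listed in the discussion.
restart_steps = {
    "kill_geth_cleanly": 10,
    "restart_container": 15,
    "geth_startup": 20,
    "start_signing": 10,
}

print(blocks_missed(restart_steps), fits_window(restart_steps))
```

With these made-up numbers the restart takes 55 seconds, which is 11 blocks and only just inside the window. That is why a slow geth startup alone can push an otherwise clean upgrade into penalty territory.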

Proposal to bump the minimumClientVersion to 1.0.1

On July 28, cLabs submitted a governance proposal that would make running version 1.0.1 of the client mandatory. This, however, raised some concerns from the community:

@ag: From proposal 00010P text i see:

Before this proposal is executed at least ⅔ of the validators must be running 1.0.1 already. If not, the network would stall. At present time, this condition is already met.

What if a validator has updated the proxy version but not the validator version? As far as I know it will be visible as 1.0.1 in celostats and look like it's updated, but in reality only part of the update is done. What if 50% (for example) of validators are running a 1.0.1 proxy with a 1.0.0 validator? Will the network stall because such proxies will start to reject their validators?

@victor | cLabs: Validators that are running 1.0.0 would automatically shut down. So it is true that if >1/3 of validators are running 1.0.0, the network would stall. That's a good point and we should definitely make sure this is not the case before moving forward.

@ag: Security-wise - how bad is it to run a validator 1.0.0 with proxy 1.0.1? proxy cant fall to the bug that was fixed, and validator knows only 1 peer - proxy so it cant fall a victim of an attack. Do I get it right?

Not advocating against update of validator. Just curious if I understood the attack in a right way.

@victor | cLabs: It is true that a direct peering connection is needed to execute the attack, so a 1.0.0 validator behind a 1.0.1 proxy is technically shielded. We would recommend against relying on this.

@ag: So currently it looks like this: most proxies are updated and network looks updated (with celostats for example, or looking through peers list), but it’s uncertain if minimum number of validators are running 1.0.1 and it’s uncertain if the network is safe to bump the version. Do I get it right?

@victor | cLabs: Right, so from celostats we can see that all validators (who report to celostats) report running version 1.0.1. What we can tell from that with certainty is that the proxies are up to date. What needs to be confirmed now is that the validator nodes are also up to date. Although this is likely the case, it's important that we know for certain network stability will not be affected.

@ag: I will honestly disclose myself: my validator is still running 1.0.0. This can't be seen with celostats. I don’t know whether the proxy is reporting its validator's version or maybe gossiping it to other peers, but anyway, it is not seen from celostats as I do appear as 1.0.1. I’ll update my version before it’s too late but that's not the point. The point is: the current network status (version-wise) is not known for sure if we are relying only on celostats. I’ve spoken to at least 2 validators whose validators are still not updated but are seen as 1.0.1 because the proxy is updated.

tldr: maybe we need some method to be sure 2/3 are running 1.0.1 before flipping the switch?
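
The gap @ag describes can be sketched as follows: the two-thirds check passes when it only sees proxy versions (the celostats view) but fails once the validator versions behind those proxies are counted. The data below is hypothetical, and the "at least ⅔" threshold follows the wording of the proposal text quoted above; this is an illustration, not a tool that queries the network.

```python
def parse_version(version):
    """Turn '1.0.1' into (1, 0, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upgraded_share_ok(versions, minimum="1.0.1"):
    """True if at least 2/3 of the given nodes run `minimum` or newer."""
    if not versions:
        return False
    min_v = parse_version(minimum)
    upgraded = sum(1 for v in versions if parse_version(v) >= min_v)
    # Integer form of upgraded / total >= 2/3, avoiding float rounding.
    return 3 * upgraded >= 2 * len(versions)


# Hypothetical validator set: every proxy is upgraded, half the validators are not.
nodes = [
    {"proxy": "1.0.1", "validator": "1.0.1"},
    {"proxy": "1.0.1", "validator": "1.0.0"},
    {"proxy": "1.0.1", "validator": "1.0.0"},
    {"proxy": "1.0.1", "validator": "1.0.1"},
]

proxy_view = upgraded_share_ok([n["proxy"] for n in nodes])
validator_view = upgraded_share_ok([n["validator"] for n in nodes])
print(proxy_view, validator_view)
```

In this example the proxy view reports the network as ready while the validator view does not, which is the uncertainty raised in the discussion.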

Following the discussion, cLabs backed away from the proposal and asked approvers not to approve it:

@Henry | cLabs: ... Earlier, cLabs submitted a governance proposal to bump the minimum client version to v1.0.1 to enforce a critical security fix (https://github.com/celo-org/celo-blockchain/releases/tag/celo-v1.0.1). However, through our conversations with node operators, developers, and partners, we have realized that some community members are not ready to upgrade their nodes within the short timeline. In order to prevent unexpected interruptions to services and operations that depend on the Celo network, cLabs has recommended approvers not to approve. If the proposal is not approved, there will be no action required. In the case that this proposal is approved, cLabs recommends the community to delay this proposal by voting “NO”. When more stakeholders have upgraded their nodes, cLabs will re-introduce this proposal.


Useful info

  • ReleaseGold contracts have a little-known behavior that can make cUSD rewards inaccessible:

@Henry | cLabs: This is to let you know that if all the CELO is withdrawn from a ReleaseGold contract serving the validator or validator group, the cUSD rewards provided to that validator group will be inaccessible.

Here’s why:

  • Let's say there’s a ReleaseGold contract serving as the validator (group) address, and it's been receiving cUSD rewards.

  • The beneficiary decides to withdraw all CELO from the ReleaseGold contract.

  • The ReleaseGold contract now executes selfdestruct.

  • Since the ReleaseGold contract has selfdestructed the transfer(address to, uint256 value) method cannot be called to move the cUSD.

    The cLabs team is working on upgrading the CLI so that a warning is issued if there’s a cUSD balance left when withdrawing all CELO.

    In the meantime, please make sure to withdraw from your validator (group) ReleaseGold contract all cUSD prior to withdrawing all vested CELO.
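
The ordering issue in the steps above can be illustrated with a toy model. This is not the ReleaseGold contract or its API: the class and method names are hypothetical, and the only behavior modeled is the one Henry describes, namely that withdrawing the last CELO destroys the contract and leaves any remaining cUSD unreachable.

```python
class ToyReleaseGold:
    """Hypothetical stand-in for a ReleaseGold contract, for illustration only."""

    def __init__(self, celo, cusd):
        self.celo = celo
        self.cusd = cusd
        self.destroyed = False

    def transfer_cusd(self, amount):
        if self.destroyed:
            raise RuntimeError("contract selfdestructed; cUSD is inaccessible")
        if amount > self.cusd:
            raise ValueError("insufficient cUSD balance")
        self.cusd -= amount
        return amount

    def withdraw_celo(self, amount):
        if self.destroyed:
            raise RuntimeError("contract selfdestructed")
        if amount > self.celo:
            raise ValueError("insufficient CELO balance")
        self.celo -= amount
        if self.celo == 0:
            # Modeled behavior: withdrawing all CELO triggers selfdestruct.
            self.destroyed = True
        return amount


def safe_full_withdrawal(contract):
    """Move cUSD out first, then withdraw all CELO."""
    cusd = contract.transfer_cusd(contract.cusd)
    celo = contract.withdraw_celo(contract.celo)
    return celo, cusd


print(safe_full_withdrawal(ToyReleaseGold(celo=100, cusd=50)))
```

Reversing the two calls in `safe_full_withdrawal` would raise on the cUSD transfer, which mirrors the stuck-rewards situation described above.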

  • @syncnode (George Bunea) and several other validators noticed that validator rewards were decreasing even as validator scores increased. This appears to be expected protocol behavior:

@syncnode (George Bunea): Hello! There is something that intrigues me a little regarding the rewards for the last epochs. While the score of my validator increased, the amount of paid rewards decreased (example below): epoch 97 reward paid 202.037 cUSD, epoch 98 reward paid 201.99 cUSD, epoch 99 reward paid 201.94 cUSD.

I am wondering if there is a rounding or potentially some more complex issue somewhere.

Any ideas?

This behavior is not only for my validator.

@zviad | WOTrust | celovote.com: Validator Epoch rewards also get adjusted by current “rewards multiplier”.

https://docs.celo.org/celo-codebase/protocol/proof-of-stake/epoch-rewards/validator-rewards

“The protocol's overall spending vs target of epoch rewards”.
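
To see how a reward can fall while the score rises, here is a minimal sketch that assumes a simplified formula: reward = maximum payment × score × rewards multiplier. It leaves out other factors such as group commission, and the payment, scores and multipliers below are hypothetical values chosen for illustration, not the actual figures for epochs 97-99.

```python
def validator_epoch_reward(max_payment, score, rewards_multiplier):
    """Simplified epoch reward: payment scaled by score and the multiplier."""
    return max_payment * score * rewards_multiplier


# Hypothetical values: the score goes up slightly, the multiplier goes down.
earlier = validator_epoch_reward(205.0, score=0.985, rewards_multiplier=1.0005)
later = validator_epoch_reward(205.0, score=0.986, rewards_multiplier=0.9990)

print(round(earlier, 3), round(later, 3), later < earlier)
```

A small drop in the multiplier outweighs a small gain in score, which is consistent with @zviad's explanation.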


Community

  • Pretoria Research Lab's block map has seen a few optimizations:

@Thylacine | PretoriaResearchLab: ... I've added a new performance optimisation. Instead of pulling the entire 100-300 blocks from my API every couple of seconds (this is really slow and always lags the current block by 1-2 blocks), after the first reload I'm only pulling one block at a time, and programmatically reconstructing the payload client-side. This saves a lot of time, and saves you a lot of download MB. Changes are live now, and it should be a bit more current and easier on your network.
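
The optimisation described above can be sketched as a sliding window: load the full range once, then fetch only new blocks and rebuild the payload locally. This is our illustration of the general technique, not the block map's actual code; the `fetch_range` and `fetch_block` callables and the fake API are hypothetical.

```python
from collections import deque


class BlockWindow:
    """Keeps the latest `size` blocks, fetching only what is new."""

    def __init__(self, fetch_range, fetch_block, size=100):
        self._fetch_range = fetch_range
        self._fetch_block = fetch_block
        self._blocks = deque(maxlen=size)

    def load(self, latest):
        """Initial (expensive) load of the whole window."""
        start = max(0, latest - self._blocks.maxlen + 1)
        self._blocks.clear()
        self._blocks.extend(self._fetch_range(start, latest))

    def advance(self, latest):
        """Fetch only blocks newer than the last one already held."""
        if not self._blocks:
            self.load(latest)
            return
        next_number = self._blocks[-1]["number"] + 1
        for number in range(next_number, latest + 1):
            # deque(maxlen=...) drops the oldest block automatically.
            self._blocks.append(self._fetch_block(number))

    def payload(self):
        """Payload reconstructed client-side from the held blocks."""
        return list(self._blocks)


# Fake API that records calls, standing in for real HTTP requests.
calls = []


def fake_fetch_range(start, end):
    calls.append(("range", start, end))
    return [{"number": n} for n in range(start, end + 1)]


def fake_fetch_block(number):
    calls.append(("block", number))
    return {"number": number}


window = BlockWindow(fake_fetch_range, fake_fetch_block, size=5)
window.load(10)
window.advance(11)
print([b["number"] for b in window.payload()], calls)
```

After the initial load, each refresh costs one small request per new block instead of a full re-download of the window.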


Like what we do? Support our validator group by voting for it!