Celo Discord Validator Digest #19
Migration of nodes to 1.1.0, useful info and community updates
One of the challenges we have faced as a Celo validator is keeping up with all the information that comes up in Celo's Discord discussions. This is especially true for smaller validators whose portfolios include several networks. To help everyone stay in touch with what is going on in the Celo validator scene, and to contribute to the validator and broader Celo community, we have decided to publish the Celo Discord Validator Digest. Here are the notes for the period of 12-25 October 2020.
Discussions
Migration of nodes to 1.1.0
The previous discussion of upgrading Celo nodes to 1.1.0 continued in the last two weeks:
@Marc | LetzBake!: @Thylacine | PretoriaResearchLab I have a follow-up question to what you said concerning "Migration of Baklava nodes to 1.1.0"
"Key rotation on the first rotation to a new or backup server works fine, as the celo folder / nodekey matches everyone's address book. Key rotation on subsequent rotations to an existing server, where you are burning the previous signing key (because you can't authorize the same key twice), requires you to delete the nodekey file in the celo folder. Keep the chainstate if you don't want to resync from genesis. How does that sound?"
So before rotating to the updated 1.1.0 node, the nodekey on that node should be deleted, then the node restarted and a new signing key authorized, and then the updated node should work fine?
@Thylacine | PretoriaResearchLab: Well, my understanding is this is only necessary if you are rotating to a new signing key, on a node that was originally synced with a different key (so it's your second or Nth rotation to this server). If you haven't key rotated to this node before, you should be able to simply update the docker images while it's not elected, restart the node, and then authorize/key rotate to that key.
@Marc | LetzBake!: I'm usually rotating to a new signing key between two validator-proxy sets to perform updates on the set that is validating at a given time. So do I understand correctly that the method described in your quote should work without node failure after upgrading to 1.1.0?
@Dee | Usopp.club: Up to now, I have always rotated to a new machine. Now I'm wondering: if I rotate to one of my previous machines, do I need to change its fixed IP, since the issue seems to be with the nodekey/IP pair? Is deleting the nodekey enough, or is it better to change the IP as well?
@Thylacine | PretoriaResearchLab: @Marc | LetzBake! I'm doing the same thing, rotating between two sets of validator-proxies. I had this issue where the backup wouldn't come back online after creating a new signing key and restarting the validator container (constant disconnects and errors in logs, the "leveldb" message etc) after the critical update we had a couple of months ago.
At the time we hadn't discussed it in detail in the #baklava channel, and I simply dropped the entire chainstate and resynced from scratch, which fixed the issue. I believe now that I could have simply stopped the container and deleted the nodekey from the /celo folder, and would not have had to resync. I'm not 100% sure though, so don't put your signatures on the line to test it out. Monitor the newly created signing key validator to make sure it's synced before authorizing it, of course. I don't have a backup set for Baklava but can try it out on my next mainnet rotation when I upgrade images.
@Dee | Usopp.club - since resyncing from genesis on the same fixed IP worked for me, I doubt you need to rotate IP addresses as well. If you go back and follow the discussion in #baklava, you can see the "address book" explanation. I think if you clear the address book and restart, it should be fine.
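For reference, here is a minimal sketch of the rotation-to-an-existing-server flow described above, assuming a dockerized node with its datadir mounted at $PWD; the container name is illustrative:

```bash
# Stop the validator container before touching its datadir
docker stop celo-validator

# Remove only the nodekey so a fresh one is generated on restart;
# keep the chaindata so the node doesn't resync from genesis
rm $PWD/celo/nodekey

# Restart, wait until the node is fully synced, then authorize the new signing key
docker start celo-validator
```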
@BisonD: I would be interested in getting some guidance from the cLabs team on the 1.1.0 release. In GitHub it is marked as "latest-release", though there is only a `baklava` tag on it in the docker repo. The `mainnet` tag is still on the previous release (though I can assume it will move soon?). The docker release is also not in the typical /celo-node repo, but only in /geth ... is this a new structure of releases, or will it be promoted to /celo-node at some point? The baklava docs now have different images for writing genesis and running the node. Will this be the same for a mainnet rollout?
@Mariano | cLabs: Sorry for the confusion! To answer your questions:
We are using geth instead of celo-node, since celo-node was "geth + genesis.json + bootnodes file". That's not needed any more (the binary now has the testnets' information built in). Still, we are in the process of changing the name.
Tagging follows the distribution guidelines: we start with baklava, then alfajores, then mainnet, and we'll update docs & docker tags accordingly.
...
You don't need to do `geth init`. You can run `geth` directly. So, for `baklava`, I've removed all instances of `geth init` calls and replaced them with `geth ... --baklava`.
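In other words (a hedged sketch; the paths are illustrative, only the `--baklava` flag is from the discussion):

```bash
# Old flow: initialize the datadir from a distributed genesis file first
# geth init /celo/genesis.json --datadir /root/.celo
# geth --datadir /root/.celo

# New flow in 1.1.0: network parameters are built into the binary,
# so a single flag selects the network and no init step is needed
geth --baklava
```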
Several validators asked about the deadline for migrating mainnet nodes to the new version of celo-blockchain:
@Bmass | Goods & Services: Sorry if this has already been mentioned, but when should mainnet nodes be upgraded to 1.1.0 by?
@yaz: My recommendation is to wait a bit until a mainnet docker image for 1.1.0 is ready to pull. We want to make sure everyone has upgraded on baklava and that no tests are running. Having said that, one can build the binary themselves and run 1.1.0, but this isn't recommended by cLabs...
@mbay2002 | Qoor: Honestly I haven't updated on Baklava because I saw a lot of back-and-forth here about the documentation for baklava having some problems (IIRC there was something about needing to create a "baklava" subdirectory). In any case, the signal-to-noise ratio of the 1.1.0 discussion for baklava got pretty low, so I did not update.
Question: can I "simply" pull the new image to my baklava nodes and restart my containers? Or is there more to it than that?
@yaz: The problems arose because we didn't update the baklava docs in time for the new release (a security vulnerability in a node package in the monorepo delayed the process a bit), so users were referring to the outdated baklava docs. It wasn't an ideal situation, and we will coordinate the docs rollout with future releases to avoid this issue again. My bad. As for pulling the new image: you can, just make sure to update the image name, as it has been changed: https://docs.celo.org/getting-started/baklava-testnet/running-a-full-node-in-baklava#celo-networks
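A one-liner along those lines, using the image name from the linked docs page (check the docs for the current name before pulling):

```bash
# The image moved from the celo-node repo to geth; pull it under the new name
docker pull us.gcr.io/celo-org/geth:baklava
```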
This also led to questions about new flags that are used in the new version:
@Thylacine | PretoriaResearchLab: What about this new flag `--baklava`? Is this required? And will there be an equivalent version on mainnet like `--mainnet`? (Once everything is ready)
@yaz: Correct. You don't need to init with a genesis.json file. You can just use `--baklava`. [For mainnet,] no flag [is] needed, it'll default to mainnet.
@Thylacine | PretoriaResearchLab: OK I got it working. Here's what I did:
1) `account new` as normal - creates `keystore` under `.celo`
2) Store `.password` under `.celo` as normal
3) Run container with `--baklava` to get the correct genesis/bootnodes, but with `--datadir /root/.celo` to not create a subfolder `.celo/baklava`.
If we want to be consistent the entire way through with the new folder structure, we could:
1) `account new` with the `--baklava` flag? <not sure if this works, didn't try>
2) Store `.password` under `$PWD/baklava/.password`, which will be docker-mounted to `/root/.celo/baklava/.password` <this folder will not exist yet unless account new works with the `--baklava` flag?>
3) Start the container with no overridden datadir, but with the `--baklava` flag.
Sound about right?
@Joshua | cLabs: Account new doesn't support the `--baklava` flag. You'd use `--datadir` or `--keystore` (probably keystore, b/c it follows the default datadir path) to put the key in `datadir/baklava/keystore` if you wanted to run with no datadir but with `--baklava`
@yaz: The default behavior in geth, iirc, is to avoid having one overwrite their existing ~/.ethereum default directory if they try to run more than one network from the same instance. So if you do `$ geth`, then cancel it and do `$ geth --ropsten`, the subdirectory prevents overwriting the chaindata in the default directory location. But I do sympathize that it causes frustration for all of you, and we will be updating our docs to reflect that.
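A quick illustration of that upstream behavior which the `--baklava` flag mirrors (directory paths are standard go-ethereum defaults):

```bash
geth            # mainnet: chaindata under ~/.ethereum
geth --ropsten  # ropsten: chaindata under ~/.ethereum/ropsten, so the two never collide
```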
Useful info
If you are not running an attestation node on Baklava yet, @timmoreton | cLabs kindly suggested that it might be a good idea:
I'm working on getting attestations flowing on baklava. I have updated all of the cLabs validators and they're now all running 1.0.5. I'll also be deploying the Attestation Bot probably later today to start generating a little attestation load. So if you're not currently running an Attestation Service on baklava, or you're running an old one, then please use this as a great opportunity to stop testing in prod
...
Also, please see my previous message about getting TLS set up -- would love to see every attestation service URL starting with `https://`
...
Attestation bot is now running on baklava! It's hoping to make an attestation request every 20mins. If you can, please deploy/update your service so it's successful
...
A few updates on attestation service:
1) Now that `celocli` 0.0.58 is released, I'd encourage everyone to upgrade baklava to Attestation Service 1.0.5. `celocli identity:test-attestation-service` now has a `--provider` option you can use to test a particular provider (pass it `twilio` or `nexmo`). It'll also wait after sending you the SMS to check that the attestation service receives the delivery status callback.
2) In the near future, there will be a need to require TLS for Attestation Services. Valora already has to pass requests via a proxy because Android security profiles restrict non-TLS traffic. Now we are looking at making SMSes a regular 6-8 digit security code, which then gets redeemed at the attestation service for the full signed message. This is likely to land in a 1.0.6 release in a few weeks, and in that case we wouldn't want to use an unencrypted connection even from proxy to attestation service. I know many of you already run TLS for this service, but if you're not, please look at enabling it soon.
3) 1.0.5 also supports HA setups: e.g. get a load balancer using `/healthz` as a health check, then have two VMs behind it, each hosting an Attestation Service instance and a full node, with a DB running separately, e.g. Cloud SQL on GCP. I'd love to understand what help (if any!) you need getting there - for example, if a lot of people use the example Terraform templates in celo-monorepo, whether it'd be helpful adding this pattern there. LMK!
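Two quick checks along these lines; the hostname is illustrative, and any `celocli` arguments beyond the `--provider` flag mentioned above are left to `--help`:

```bash
# Health endpoint a load balancer can poll; a healthy instance returns HTTP 200
curl -i https://attestation.example.com/healthz

# Exercise a specific SMS provider end to end (see --help for required arguments)
celocli identity:test-attestation-service --provider twilio
```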
If you are trying to spin up an attestation node and the command `celocli releasegold:authorize --contract $CELO_VALIDATOR_RG_ADDRESS --role attestation --signature 0x$CELO_ATTESTATION_SIGNER_SIGNATURE --signer $CELO_ATTESTATION_SIGNER_ADDRESS` results in the error `Error: Unable to parse signature`, check your celocli version. The latest version, `0.0.59`, is the cause of the error; downgrading to a previous version solves the problem. Source.
Attention docker users, from @mbay2002 | Qoor:
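A possible workaround, assuming celocli was installed globally via npm:

```bash
# Pin celocli to the previous release until the 0.0.59 regression is fixed
npm install -g @celo/celocli@0.0.58
```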
For those of you running docker.io on Ubuntu, careful doing any system updates/upgrades. There is a security update available for https://ubuntu.com/security/notices/USN-4589-2 which will kill running containers if applied.
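One way to defer that update until a maintenance window (so running containers aren't killed mid-flight) is to hold the package; this is a generic apt technique, not advice from the thread:

```bash
# Keep docker.io out of unattended/system upgrades for now
sudo apt-mark hold docker.io

# Release the hold when you're ready to upgrade and restart containers
sudo apt-mark unhold docker.io
```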
There's now a Mainnet Metadata Crawler bot running in the Celo Discord that checks the health of validators' attestation services:
@timmoreton | cLabs: Hey folks, as you can see, we now have a bot that checks Mainnet Attestation Services every 30-60 mins and will report any failures here. If you have your /healthz endpoint visible, it'll report failures seen on that health check. If your /healthz endpoint is not visible, it'll not report that as an error. However... as a quick hack, this bot was not smart enough to avoid repeat notifications, so please don't make me implement that and just resolve any issues promptly.
A Celo Governance Proposal had to be rejected due to a typo diligently spotted by @zviad | WOTrust | celovote.com, and cLabs published an incident report about how that happened.
For those who faced the `Metered peer count reached the limit` error:
@Or | cLabs: FYI, for those of you who have run into "Metered peer count reached the limit" in the past, there is a github issue and I've updated it with my findings: https://github.com/celo-org/celo-blockchain/issues/1172 . The TLDR is that it was due to a bug in the metered peer count itself rather than in the actual number of peers, but the race condition the bug depends on was occurring frequently due to a separate bug that has since been fixed upstream (the fix is not yet merged into Celo).
Community
The Celo Community Fund proposal from @Patrick|Validator.Capital|Moola is getting some traction. Anyone is free to contribute:
@Deepak | cLabs: RE: Celo Community Fund Proposal, I have started putting together a plan on how we can move this forward. I had a detailed conversation with @Patrick|Validator.Capital|Moola on how to create one unified proposal. I think it will help us in the long run if we put some extra effort in upfront to come up with a good plan. I have created a plan after talking to Patrick. We do need your help with comments or contributions to some of the tasks - https://www.notion.so/CCF-Proposal-Planning-da023db081624252a9ffb3fdbc6979e8. Let's use this Notion page to work together.
Felix | chorus.one published a staking guide on how to non-custodially vote for validator groups on Celo using Chorus One's Anthem and a Ledger device: https://medium.com/chorus-one/celo-on-anthem-ledger-staking-guide-4aa15195b83f
Like what we do? Support our validator group by voting for it!