Celo Discord Validator Digest #25
Twilio downtime, using a different set of proxies for a backup validator, useful info, and community updates
One of the difficulties we have faced as a Celo validator is keeping up with all the information that comes up in Celo's Discord discussions. This is especially true for smaller validators whose portfolios include several networks. To help everyone stay in touch with what is going on in the Celo validator scene and contribute to the validator and broader Celo community, we have decided to publish the Celo Discord Validator Digest. Here are the notes for the period of 14 February – 14 March 2021.
Discussions
Twilio downtime
On February 26, Twilio's SMS services went down for approximately 1.5 hours, leaving several validators' attestation services inactive, which led to a discussion of attestation request completion rates:
Cody | cLabs: This is a good opportunity for all @Validators to make sure they have multiple SMS providers configured (Twilio, Nexmo, MessageBird) to ensure an outage from a single provider doesn't impact the completion rate.
https://docs.celo.org/validator-guide/attestation-service#sms-providers
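Configuring several providers means the attestation service can fall through to the next one when a send fails. Below is a minimal TypeScript sketch of that fallback idea only; the SmsProvider interface and provider objects are made up for illustration and are not the attestation service's actual code.

```typescript
// Illustrative sketch of SMS provider fallback; not the attestation service's actual code.
interface SmsProvider {
  name: string;
  send(to: string, body: string): Promise<void>; // rejects on delivery failure
}

// Providers are tried in the configured order (e.g. Twilio, Nexmo, MessageBird),
// so an outage at a single provider does not sink the overall completion rate.
async function sendWithFallback(
  providers: SmsProvider[],
  to: string,
  body: string
): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      await provider.send(to, body);
      return provider.name; // report which provider ended up delivering
    } catch (err) {
      errors.push(`${provider.name}: ${(err as Error).message}`);
    }
  }
  throw new Error(`all SMS providers failed: ${errors.join("; ")}`);
}
```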
syncnode (George Bunea): Nexmo is a pain... Instead of wasting time with their setup process I prefer doing anything else.
Twilio is currently experiencing some technical issues. They will solve the issues and will be back up, eventually. Also, in my case I've seen a lot of people trying to register using landline phone numbers. Of course, landline numbers have no SMS capabilities. Months ago I raised the possibility of creating a form of council to review specific situations and manually approve or dismiss failed attestation requests.
It's very important to understand that in these types of interconnections there will always be corner cases that cannot be addressed in a fully automated way, and we will eventually need a form of manual review for specific situations.
Cody | cLabs: Hey George, we're continuing to work with SMS providers to make the onboarding process as painless as possible. If you have any specific feedback about Nexmo, I'm happy to push on this on our end as well. It is important that Attestation Services have provider redundancy, both for high availability and to prevent centralization. Phone number verification is a core feature of the Celo protocol as part of the mobile-first vision, so it's important that we keep making progress toward making it reliable and decentralized.
Do you have any pointers to details on how this council will work? Would love to review it and see if it should be included as a CIP.
Ideally the caller (Valora) would detect first that the number is not a valid mobile (ex. landline) and not make the request in the first place. I'll create a ticket to track this on our end.
Here's the ticket to detect landlines in Valora: https://app.zenhub.com/workspaces/wallet-backlog-board-60092d006a84e3001d0d5936/issues/celo-org/celo-monorepo/7294
victor | cLabs: Instead of setting a static target completion rate, my current mindset is to make it more dynamic. I made a proposal along those lines in https://github.com/celo-org/celo-proposals/issues/144#issuecomment-786828982. I'm not confident that this proposal is the answer, but it represents a slightly different direction that I think is worth discussing.
Bart | chainvibes: Not many have all 3 providers configured, but are there any statistics that show a preference in terms of completion rate? For example, if I place MessageBird as the primary provider, would this give better results than Twilio?
Cody | cLabs: Good question! We don't have enough data from MessageBird yet to compare. At the moment it looks like Twilio is a bit better than Nexmo; however it may be due to the fact that more folks have Nexmo as a backup (and backups are more likely to fail if a primary fails). We have a random provider option, but we need a way to track when this is specified so that we can do a fair comparison.
...
... I wanted to start a discussion around best practices for Attestation Service infra reliability. I personally have some thoughts but wanted to hear from you all to find out what's working and what's not. One of the biggest pains with uptime is full-node syncing. Because the attestation signer key is currently managed by the full node, it's not possible to fall back to Forno. We have a ticket to address this (below), but there may be ways to protect against this before the change is rolled out. https://app.zenhub.com/workspaces/cap-backlog-board-600598462807be0011921c65/issues/celo-org/celo-monorepo/7343
One possibility is to run two instances of the attestation service behind an active/passive load balancer. Both instances can run full nodes with the AS signing key unlocked. Active/passive load balancing is important if not using a shared database, since there is some per-session state tracked by the service that will become inconsistent under round-robin load balancing. The other benefit of this setup is that you have a fully synced full node that can be used as a backup if your validator's chain state ever becomes corrupted.
Any other thoughts or ideas?
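One building block for such an active/passive setup is a health check the load balancer can poll, marking an instance unhealthy while its local full node is still syncing or has stalled. Here is a rough TypeScript sketch of that idea, assuming a full node with JSON-RPC enabled on localhost:8545; the port, threshold, and health endpoint are illustrative assumptions, not part of the attestation service.

```typescript
// Minimal health endpoint for an active/passive load balancer; illustrative only.
// Assumes a Celo full node with JSON-RPC enabled on localhost:8545 (adjust as needed).
import http from "node:http";

const NODE_RPC = "http://localhost:8545";
const MAX_BLOCK_AGE_SECONDS = 60; // arbitrary threshold for "the node has stalled"

async function rpc(method: string, params: unknown[] = []): Promise<any> {
  const res = await fetch(NODE_RPC, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
  });
  return (await res.json()).result;
}

async function nodeIsHealthy(): Promise<boolean> {
  // eth_syncing returns false once the node has caught up with the chain head.
  if ((await rpc("eth_syncing")) !== false) return false;
  // Also reject a node whose head block has stopped advancing.
  const head = await rpc("eth_getBlockByNumber", ["latest", false]);
  const ageSeconds = Date.now() / 1000 - parseInt(head.timestamp, 16);
  return ageSeconds < MAX_BLOCK_AGE_SECONDS;
}

// The LB polls this endpoint; a non-200 answer means "send traffic to the other instance".
http
  .createServer(async (_req, res) => {
    try {
      res.writeHead((await nodeIsHealthy()) ? 200 : 503).end();
    } catch {
      res.writeHead(503).end();
    }
  })
  .listen(8080);
```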
chorus-joe: Can we run multiple nodes with a shared DB? Or are there likely to be locking issues? Because A/A is way more reliable than A/P. Or, worst case, can the attestation service grab a DB lock on startup (and the 'second' node sit 'waiting' on the lock), with the node without the lock returning a non-200 response to the LB? I want to remove the human element, because I'm soft and squishy.
timmoreton | cLabs: Definitely recommended to have multiple nodes behind a load balancer with a shared DB as described here: https://docs.celo.org/validator-guide/attestation-service#deployment-architecture
So the LB round-robins between two attestation service instances, spread over two VMs, each of which has a full node on the same VM. Both then use the same DB with a second replica. In this setup, any one AS instance, full node, VM, or DB replica can fail and you are still up.
...
Even if a "failure" delivery receipt comes in and hits the instance other than the one that served the request, the DB state is shared, and that instance will trigger the retransmit.
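The shared database is what makes the receipt handling instance-agnostic: whichever instance the provider's delivery-status webhook happens to hit can look the attempt up and trigger the retransmit. The sketch below is only a conceptual illustration of that flow; the table and column names are invented and do not reflect the attestation service's actual schema.

```typescript
// Conceptual sketch of instance-agnostic delivery-receipt handling behind a
// round-robin LB with a shared DB. Table and column names are invented for illustration.
import { Client } from "pg";

const db = new Client({ connectionString: process.env.DATABASE_URL });
const ready = db.connect(); // both AS instances point at the same database

// Called by whichever instance the SMS provider's status webhook happens to hit.
export async function onDeliveryReceipt(messageId: string, status: "delivered" | "failed") {
  await ready;
  // Look the attempt up in the shared DB, regardless of which instance originally sent it.
  const { rows } = await db.query(
    "UPDATE sms_attempts SET status = $2 WHERE provider_message_id = $1 RETURNING phone_number, attempt",
    [messageId, status]
  );
  if (rows.length === 0 || status === "delivered") return;

  // On failure, either instance can trigger the retransmit, because the per-attestation
  // state lives in the database rather than in one instance's memory.
  const { phone_number, attempt } = rows[0];
  await retransmitWithNextProvider(phone_number, attempt + 1);
}

// Placeholder for the provider-fallback send sketched earlier in this digest.
async function retransmitWithNextProvider(phoneNumber: string, attempt: number): Promise<void> {
  // e.g. pick providers[attempt % providers.length] and send again
}
```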
Using a different set of proxies for a backup validator
spa | swiftstaking.com asked whether it is mandatory for a backup validator in a multi-proxy setup to use the same set of proxies:
spa | swiftstaking.com: Why do hot-swap validator instances have to use the same proxies? I tried it on Baklava, and it also works with a separate proxy<->validator, proxy2<->validator2 setup.
Bart | chainvibes: I've tested that also. It seems to work, except Joshua told me that it's been bundled with multi-proxy, so built and tested together. I think it will be more seamless when connected to the same proxy. With separate proxies, the hot standby has to wait for/send some peering messages to discover the enodes of other validators, and you'll miss more blocks.
spa | swiftstaking.com: When I tried for the first time (separate proxies), the passive validator when activated was missing a lot of signatures. But a day after that when I tried again, it worked flawlessly. So maybe the peering happens regardless of multi-proxy.
I am worried about using the same proxies because of the following use case. The validator node's network is interrupted (you lose the connection to the server completely and cannot shut it down). You enable the backup to be primary and it starts signing. Then the network issue is resolved and you start double signing (not sure if this happens). When you use the same set of proxies, you cannot easily block the offline validator (maybe firewall rules would help, but that is not a real solution).
@Joshua | cLabs can you share some more information about how hotswap works? Is there some failsafe against double signing?
Joshua | cLabs: At a high level, a validator is configured as a primary or replica and also has a start block and a stop block. Validators do not communicate with each other, so in the case where a validator with no stop block disconnects, you make a second validator the primary, and then the first validator comes back online, you will double sign. If you want to have an HA setup with hotswap, you'll need a distributed lock manager to determine which node is responsible for which set of blocks. You'd generally want to set a stop block (though there's some trickery that might mostly work without setting it). It's a little tricky: hotswap is not HA by itself, but it could be made into HA. Here is a k8s chart to implement HA on top of hotswap: https://github.com/mantalabs/validator-elector. Note that I have not read the code, so you should understand it before using it on mainnet.
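For reference, a planned handover with hotswap boils down to giving the old primary a stop block and the replica a matching start block, so their signing ranges never overlap. The sketch below assumes the istanbul_stopValidatingAtBlock and istanbul_startValidatingAtBlock RPC methods described in the hotswap documentation (verify the exact names against your celo-blockchain version) and that both nodes are reachable; the unreachable-primary case is exactly where a distributed lock manager such as validator-elector comes in.

```typescript
// Sketch of a hotswap handover driven by the istanbul_* RPCs described in the hotswap docs.
// The method names are an assumption here; verify them against your celo-blockchain version.
// In a real HA setup a distributed lock manager (e.g. validator-elector) would drive this.
const OLD_PRIMARY_RPC = "http://validator-1:8545"; // node being demoted (placeholder URL)
const NEW_PRIMARY_RPC = "http://validator-2:8545"; // replica being promoted (placeholder URL)

async function rpc(url: string, method: string, params: unknown[] = []): Promise<any> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
  });
  const json = await res.json();
  if (json.error) throw new Error(`${method}: ${json.error.message}`);
  return json.result;
}

// Hand over at a block boundary: the old primary stops validating at `block` and the
// replica starts at `block`, so their signing ranges never overlap (no double signing).
async function handover(block: number) {
  await rpc(OLD_PRIMARY_RPC, "istanbul_stopValidatingAtBlock", [block]);
  await rpc(NEW_PRIMARY_RPC, "istanbul_startValidatingAtBlock", [block]);
}

// Example: schedule the handover a little ahead of the current head.
async function main() {
  const head = parseInt(await rpc(NEW_PRIMARY_RPC, "eth_blockNumber"), 16);
  await handover(head + 10);
}

main().catch(console.error);
```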
spa | swiftstaking.com: ... I wanted to understand the need for reusing the same set of proxies. Why is it mentioned in the docs as a prerequisite?
Joshua | cLabs: For proxies there are two parts. 1 is maintaining the stability of your validator's enode address. That means that when you switch nodes, other validators don't have to re-establish a connection. 2 is populating the enode table of your replica, which currently works better behind the same set of proxies. Having a full enode table stops the replica from having to find the addresses of the other validators when it starts up.
For 2, there are a couple of things that I think could be changed to improve it, but they're still vague ideas floating around in my head. For 1, we'd need to test that doing the external swap does not result in unacceptable downtime. It's possible that it doesn't, but it's harder to guarantee.
...
The proxy enode is the external enode address (probably the better term), and a single validator can have multiple of them in a multi-proxy setup.
External validators do care which proxy is making the connection.
A replica would have a different enode, but because it is behind a proxy, the proxy's enode serves as the replica's (and primary's) external enode, so other validators only see the proxies.
...
Other validators cannot distinguish between validator setups with any number of proxies and any number of replicas.
A validator has an address (from private key) and n proxies. It assigns each external validator to one proxy. It then gossips out the external enode for a specific validator (encrypted to that validator). The external validator then initiates a connection to the proxy.
...
If the validator recognizes a proxy as going down, it will update external validators with the new address they should be using.
spa | swiftstaking.com: In an active/passive setup using the same multi-proxy: does the passive validator use the same proxy as the active one? Is its enode table populated the same way as the active one's?
Joshua | cLabs: With a replica using the same proxies as the primary, the replica queries (through the proxy) for the enodes of external validators to populate its enode table. It's the same process that the primary used at the start.
spa | swiftstaking.com: But why is this different for a replica with its own set of proxies?
Joshua | cLabs: It's part of how the announce protocol works, specifically how the replica queries and learns the enodes of other validators.
...
There's this CIP (https://github.com/celo-org/celo-proposals/blob/master/CIPs/cip-0005.md) on direct announce, but it's a bit out of date. You'd probably have to read the source in backend/announce.go. These are the message types that deal with announce/proxies: https://github.com/celo-org/celo-blockchain/blob/68e2ab747b40ce215a916cde7fdc77613366dcd0/consensus/istanbul/protocol.go#L53-L58 (delegate sign is just for celostats, though, and fwd message forwards a message from a validator to its proxies)
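Putting Joshua's description together: the validator pins each external validator to one of its proxies, gossips that proxy's enode (encrypted to that validator), and re-announces when a proxy dies. The TypeScript sketch below is purely conceptual bookkeeping to make that concrete; the real logic lives in celo-blockchain's istanbul backend (backend/announce.go), not in anything like this.

```typescript
// Conceptual sketch of the proxy-assignment bookkeeping described above; the real logic
// lives in celo-blockchain's istanbul backend (backend/announce.go), not in code like this.
type Address = string; // external validator address
type Enode = string;   // externally reachable enode URL of one of our proxies

class ProxyAssignments {
  private assignment = new Map<Address, Enode>();

  constructor(private proxies: Enode[]) {}

  // Each external validator is pinned to exactly one proxy; that proxy's enode is what
  // gets gossiped out (encrypted to that validator) as our external enode.
  assign(validator: Address): Enode {
    const existing = this.assignment.get(validator);
    if (existing) return existing;
    if (this.proxies.length === 0) throw new Error("no proxies available");
    const proxy = this.proxies[this.assignment.size % this.proxies.length];
    this.assignment.set(validator, proxy);
    return proxy;
  }

  // When a proxy goes down, move its validators onto the surviving proxies; the caller
  // would then gossip a fresh (encrypted) announce message to each affected validator.
  reassignFrom(downProxy: Enode): Address[] {
    this.proxies = this.proxies.filter((p) => p !== downProxy);
    const affected = [...this.assignment.entries()]
      .filter(([, proxy]) => proxy === downProxy)
      .map(([validator]) => validator);
    for (const validator of affected) {
      this.assignment.delete(validator);
      this.assign(validator);
    }
    return affected;
  }
}
```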
Useful info
A note from aslawson | cLabs on Nexmo's 10DLC requirement:
Received some direction from Nexmo and added it to the notion support page https://www.notion.so/clabsco/SMS-Provider-Validator-Support-Template-e168d45219e844e8a826c1ccefb5a06a#797033879adf469d9cbe1ee56b1d0e61 for convenience:
The brand information is only accessed/seen by Vonage (aka Nexmo) and the carrier registration partners downstream. In other words, the end user never sees this so validators do not need to coordinate with the "Celo" brand. In fact, the brand must be unique to an account, so only one account could register "Celo"...Please create your own brand name and campaign (TBD — will provide screenshots and flow details in documentation).
If you need a mainnet or Baklava snapshot, alexandruast | stakesystems.io is there to help:
I included Celo mainnet and Baklava in our fast sync service via rsync for chain data. Snapshots are created every day at 06:00 UTC and 18:00 UTC: https://www.notion.so/Stake-Systems-Fast-Sync-Service-5cb0dffb78174d3494b93f87d242939d
Community
Cody | cLabs published a proposal on Attestation Node Incentives:
Hey @Validators ✅ the revised version of CIP 32 (Attestation Node Incentives) is in a PR here. It's pretty different from the original slashing proposal thanks to the feedback from everyone. Please take a look and weigh in!
PR: https://github.com/celo-org/celo-proposals/pull/161/files
Discussion: https://github.com/celo-org/celo-proposals/issues/144
It looks like Mozilla is considering joining Celo as a validator:
Thylacine | PretoriaResearchLab: Are Mozilla starting a staking / validator lab?
benoit | mozilla: It's only research for now.
Like what we do? Support our validator group by voting for it!