Celo Discord Validator Digest #9
Reserve balancing, oracle fixing, reset of the Baklava network, foundation votes, missing blocks, useful info and new community channels
One of the challenges we face as a Celo validator is keeping up with all the information that comes up in Celo's Discord discussions. This is especially true for smaller validators whose portfolios include several networks. To help everyone stay in touch with what is going on in the Celo validator scene and to contribute to the validator and broader Celo community, we have decided to publish the Celo Discord Validator Digest. Here are the notes for the period of 15-21 June 2020.
Discussions
Reserve rebalancing
There was a discussion on how the stability reserve would rebalance itself:
@zviad | WOTrust | celovote.com: So, as I understand it, the "unfrozen" part of the reserve is what powers the on-chain exchange, i.e. when you buy CELO for cUSD on-chain, the CELO comes out of the unfrozen part of the reserve. So how does rebalancing of the reserve into other crypto assets work? Will someone take part of the "frozen" part of the reserve and sell it on an outside exchange to buy other cryptocurrencies? Or is that rebalancing done using the "unfrozen" CELO too?
...
Also ... all sources say that the reserve has 120m CELO; however, everything I have used to check the balance on-chain shows me that the balance is actually 110m. Is 10m somewhere else, or accidentally missing?
reserve.getReserveGoldBalance() -> returns 110m
Manually checking the reserve + other addresses shows a total of ~110m:
0x9380fa34fd9e4fd14c06305fd7b6199089ed4eb9 - 105m
0x246f4599eFD3fA67AC44335Ed5e749E518Ffd8bB - 1.7m
0x298FbD6dad2Fc2cB56d7E37d8aCad8Bf07324f67 - 3.47m
@marek | cLabs: Yes! The reserve entity has started diversifying the reserve into ETH and BTC as per the Celo white paper. It is in the process of creating a tracking website to provide greater visibility into the overall state of the reserve.
@zviad | WOTrust | celovote.com: so that means it is the "unfrozen" part of the reserve that gets converted into other currencies, right? One reason I am asking is that this has a practical consequence for the size of the buckets that the Celo exchange's uniswap-style protocol uses. In this case, the CELO bucket size won't be 1% of 40m but more like 1% of ~30m, since the actual "unfrozen" CELO in the reserve will be 30m and not 40m. (Which is still fine, I just wanted to confirm that there wasn't some sort of a mixup happening.)
@aslawson | cLabs: ... The rebalancing will be done regularly with the unfrozen portion -- it is exchanged for other crypto assets on exchange platforms. The freezing is meant to enforce gradual reserve rebalancing.
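For reference, the bucket arithmetic zviad mentions is simple. Here is a minimal sketch in Go, with illustrative numbers and a hypothetical celoBucketSize helper of our own; the real reserveFraction parameter lives in exchange.sol:

package main

import "fmt"

// celoBucketSize computes the on-chain exchange's CELO bucket as a fixed
// fraction of the unfrozen reserve.
func celoBucketSize(unfrozenReserve, reserveFraction float64) float64 {
	return unfrozenReserve * reserveFraction
}

func main() {
	const reserveFraction = 0.01 // illustrative 1%
	fmt.Println(celoBucketSize(40_000_000, reserveFraction)) // 400000 CELO before rebalancing
	fmt.Println(celoBucketSize(30_000_000, reserveFraction)) // 300000 CELO after 10m is diverted
}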
Oracle fixing (continued)
The previous week's discussion of the oracle-fixing threat extended into the past week:
@Roman | cLabs: The uniswap-style stability mechanism was chosen to reduce the depletion potential of the Celo reserve during times of imprecise oracle rates (whether they are off because of timing issues and/or because they are being manipulated). In the worst-case scenario of a malicious actor somehow gaining full control over the oracle rate, the malicious actor could set the CELO price very close to zero and then try to buy all CELO out of the uniswap bucket for little cUSD. Since the CELO bucket is set to a small fraction of the reserve (see the reserveFraction parameter in exchange.sol) on every bucket update (about every 5 minutes initially), the attacker would have to repeat this procedure quite often to deplete the CELO reserve to a larger degree.
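A quick back-of-the-envelope check (ours, not from the discussion) confirms the order of magnitude. Assuming, for illustration, a CELO bucket of 1% of the unfrozen reserve (the reserveFraction), buckets refreshed every 5 minutes, and an attacker who empties the entire bucket on every refresh:

package main

import (
	"fmt"
	"math"
)

func main() {
	const reserveFraction = 0.01 // assumed bucket size: 1% of the unfrozen reserve
	const updateMinutes = 5.0    // assumed bucket refresh interval

	// After n refreshes, the remaining reserve is (1 - reserveFraction)^n.
	// Solve for the n that leaves only 10% of the reserve:
	n := math.Log(0.10) / math.Log(1-reserveFraction)
	fmt.Printf("~%.0f refreshes, i.e. ~%.0f minutes\n", n, n*updateMinutes)
	// Prints ~229 refreshes, i.e. ~1146 minutes -- in line with zviad's
	// "~1000 minutes if not more" estimate below.
}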
@zviad | WOTrust | celovote.com: Agreed that if a malicious actor is trying to deplete the reserve of CELO, they would need ~1000 minutes if not more. However, an attacker can do much more damage instantly if they take out near-infinite cUSD from the reserve instead of trying to deplete the CELO. They can set the price of cUSD to be very large, since afaik there is no limit on the amount of cUSD you can take out from the exchange if you can control the price. They exchange whatever CELO they have for close to infinite cUSD, take that cUSD to various outside exchanges that support trading cUSD directly (assuming that will happen in the future), exchange that cUSD for other crypto coins on those outside exchanges and cash out. This can be done with only 1 malicious majority update from the oracles, and probably within just a few minutes, assuming those outside exchanges don't have some complex protection mechanisms to detect a flood of new coins and block and freeze the user. (This will only get easier to do if there is on-chain trading of other cryptos like BTC, ETH and so on against cUSD directly, since it will be much harder to block a user from trading with their maliciously minted cUSD.)
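To see why this direction is so much worse, here is a minimal constant-product sketch (our own illustration; the bucket numbers and the spread-free cusdOut helper are assumptions, not Exchange.sol code). Because the cUSD bucket is sized from the oracle rate, the cUSD paid out scales linearly with that rate, so a single malicious rate update is enough:

package main

import "fmt"

// cusdOut returns how much cUSD a constant-product exchange pays for
// celoIn, given the current bucket sizes (spread ignored for simplicity).
func cusdOut(celoBucket, cusdBucket, celoIn float64) float64 {
	k := celoBucket * cusdBucket
	return cusdBucket - k/(celoBucket+celoIn)
}

func main() {
	celoBucket := 400_000.0 // assumed: 1% of a 40m unfrozen reserve
	for _, rate := range []float64{2, 2_000, 2_000_000} { // oracle rate: cUSD per CELO
		cusd := cusdOut(celoBucket, celoBucket*rate, 100_000)
		fmt.Printf("oracle rate %10.0f -> %16.0f cUSD for 100k CELO\n", rate, cusd)
	}
}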
@Roman | cLabs: Yes, that's a great observation and underscores the important role of oracle security in the stability mechanism. To mitigate the oracle risk further, an oracle rate circuit breaker on the smart contract level was proposed in the past (see PR https://github.com/celo-org/celo-monorepo/pull/1490). Such a circuit breaker would rule out extreme manipulations of the oracle rate and thus mitigate the described risk. The above PR did not get merged in the end, as the design implications were not fully satisfying, but I expect some form of a circuit breaker to be added in the future. Additionally, it was discussed whether the size of the cUSD tank should be constrained to be smaller than X% of the total cUSD supply and, from my perspective, that is still a good candidate for one of the early improvement proposals.
@zviad | WOTrust | celovote.com: That makes sense. Looking forward to future improvements to the oracle contracts; I think there is definitely some room for improvement both in the security (or decentralization) aspect and in the predictability aspect. (A small example of predictability issues: https://github.com/celo-org/celo-monorepo/issues/4096.) Tbf, afaik other big stablecoins (like USDC, or especially USDT) also have a similar potential attack vector: if someone were to hack the admin keys or masterMinter keys for USDC, they could issue unlimited funds or do whatever they want with the USDC supply too.
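As a footnote on the circuit-breaker idea, one possible shape for it is sketched below (our illustration only; the design in PR 1490 differed and was never merged): reject any rate update that moves more than a fixed tolerance away from the last accepted rate, so that a single malicious update cannot push the price to an extreme.

package main

import (
	"errors"
	"fmt"
)

const maxDeviationBps = 500 // assumed tolerance: 5% per update

// checkRate accepts a new oracle rate only if it stays within the
// tolerance band around the previously accepted rate.
func checkRate(prev, next float64) error {
	low := prev * (1 - maxDeviationBps/10_000.0)
	high := prev * (1 + maxDeviationBps/10_000.0)
	if next < low || next > high {
		return errors.New("rate update outside circuit-breaker band")
	}
	return nil
}

func main() {
	fmt.Println(checkRate(2.00, 2.04)) // <nil>: a 2% move passes
	fmt.Println(checkRate(2.00, 0.01)) // rejected: a near-zero rate is blocked
}

The obvious trade-off is that a genuinely sharp market move would trip such a breaker too, which is presumably part of why the original design was not fully satisfying.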
Reset of the Baklava network
Soon, cLabs plans to reset the Baklava network to make its code compatible with that of the mainnet. There were, however, concerns over whether ReleaseGold (RG) contract addresses would be used to fund all the genesis addresses from the previous network:
@victor | cLabs: We are definitely planning on funding all the genesis validator addresses, and any faucet requests, directly in the genesis block for the reset. (And not through a ReleaseGold contract.) Does that help?
@Rob | Polychain: I don't think so. Our Baklava validators weren't funded in the genesis block. Can you maintain a repo (or spreadsheet?) of "registered" or Celo-approved Baklava validator addresses that you fund with each reset? That way any validator can get their validating addresses saved for funding.
@victor | cLabs: If you requested a faucet, it is recorded and that address will be funded in the new genesis block. If you have a different address or set of addresses you'd like funded, DM me and I'll add them to the list to include in the new genesis block.
@syncnode (George Bunea): How about the rest of us, who have never asked for a faucet but were involved in the network launch? Will we get tokens in our initial addresses?
@victor | cLabs: Yup. Current plan is to directly fund the beneficiary address (i.e. without a new ReleaseGold contract). Will that work?
@syncnode (George Bunea): Sure. That will be a change compared with the mainnet, though.
@victor | cLabs: That's true. I guess it's operationally easier to work without ReleaseGold, but our guides currently assume it.
@syncnode (George Bunea): And since the documentation is lacking the setup process for non-RG addresses, there will be plenty of questions. Agreed, it's way easier, and in the future we will probably get rid of them when all the tokens have actually been released.
@zviad | WOTrust | celovote.com: It is unlikely we will ever get rid of RG addresses, because validator uptime is tied to an address, and it is also pretty much impossible to ever change a validator group address. So I expect everyone who started with an RG address will stay with an RG address forever.
For operational testing, I think it is better to start with ReleaseGold addresses tbh, so people can test the same scripts that they use on mainnet (for example for key rotation, which people would want to test continuously).
Easiest might be to give people ReleaseGold contracts with fully liquid/released amounts, so if someone wants to take the gold out and use it directly, they can just do that.
@victor | cLabs: It's not going to be possible to mirror it 100%, in the sense that the state will be different, so there will be differences that we won't be able to address. As far as the use of ReleaseGold goes, I like the idea of using a ReleaseGold contract with 0 vesting time so funds can be withdrawn if desired, and I'll include that in the process.
Foundation votes
Last week, the Celo Foundation announced the submission period for validators who seek to receive foundation votes. However, there were questions about how the votes can be obtained by currently unelected groups:
@ag: Question about applications for votes (I'm in the 1-25 cohort but have been discussing with some wannabe groups): does filling in the application mean that the group should already be registered, with 10k CELO locked for validator group creation? If the answer is no, then how will the foundation filter out those who are not ready to invest 10k CELO?
@Henry | cLabs: We had a meeting and concluded that requiring registration might cause an incumbency bias, so we're removing the requirement to be registered. Will release more details on how to gauge non-registered groups soon!
Missing blocks (continued)
Previous discussions regarding the reasons why some validators regularly miss blocks continued last week:
@chris-chainflow: I'm still working to troubleshoot missed blocks and discovered a curious pattern. For about 8-9 hrs I was missing every XXXX88 & XXXX89 block; now that pattern has reset and I'm missing every XXXX45 & XXXX46 block. The blocks at each position come from the same miners every time, e.g. XXXX88 from proposer A and XXXX89 from proposer B, then XXXX45 from miner C and XXXX46 from miner D. Does anyone have an idea what might be causing this? It seems strange that I'd lose connectivity with different miner pairs on such a regular basis.
@victor | cLabs: In general, connectivity issues with the first of those two proposers are the cause. Without a stable connection to the proposer, it cannot receive your signature to include in the parent aggregated seal (i.e. how uptime for the previous block is recorded), which results in the immediate block being "missed", and you cannot receive the block proposal to sign, which results in the next block being "missed".
Any number of root causes could result in pair-wise communication failure, so it's hard to say without a lot of digging into each particular case. It's been happening to multiple validator pairs, but I don't think there has been any issue discovered in the client that is contributing. It's unclear to me whether there is an underlying client issue or not, but it seems likely this is a common network issue. The client could definitely be improved to report this issue, for example through a metric, and potentially to adopt some mitigations.
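The regularity of the pattern makes sense given how Istanbul-style BFT selects proposers. A simplified sketch of our own (we assume 100 elected validators and a plain round-robin; the actual implementation also shuffles the ordering per epoch, which would explain the pattern occasionally "resetting"):

package main

import "fmt"

// proposerIndex maps a block number to the validator expected to propose it
// under a plain round-robin policy.
func proposerIndex(blockNumber uint64, numValidators int) int {
	return int(blockNumber % uint64(numValidators))
}

func main() {
	for _, b := range []uint64{10088, 10188, 10288, 10089} {
		fmt.Printf("block %d -> proposer index %d\n", b, proposerIndex(b, 100))
	}
	// Every XXXX88 block maps to the same proposer (and every XXXX89 to its
	// successor), so a bad link to one proposer pair shows up as missing
	// exactly those block-number suffixes, over and over.
}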
@Peter [ChainLayer.io]: so far I've been able to narrow it down to this error message in the logging:
Jun 09 20:02:09 prod-celo-validator2a geth[690]: INFO [06-09|20:02:09.017] Disconnecting from static or trusted peer id=c6d885a191bc5efe conn=trusted-inbound static=ValidatorPurpose trusted=ValidatorPurpose reason="too many peers" remoteRequested=true err="too many peers"
If I look up that peer, I can confirm that I'm indeed not connected to it, and it does match the blocks that I was missing. @victor | cLabs missing that first block would only happen, though, if you don't meet the initial quorum on the aggregate seal of the previous proposer, which could be a simple latency issue, correct?
Regarding missing the block proposal to sign, does that mean that only the proposer sends that proposal and it's not propagated through other nodes? I.e. if you are not connected to the block proposer, you'll always miss that complete block and can't even make it into the parent aggregate?
...
I understood from @trevor | cLabs that that message indicates the remote is refusing you because of "too many peers", but I haven't been able to find out why, because afaik validator connections don't adhere to the maxpeers setting.
Both locations in the code that can throw that error actually check if the node is trusted or static first. Being a validator, it "should" be trusted... so somehow that other node doesn't recognize you as a validator. It might have something to do with a certificate error I've been seeing as well, but that was intermittent.
@victor | cLabs: The aggregate seal for a given block is not unique, and each validator can potentially have a different valid aggregated seal in its local block database. My understanding is that if a validator is actively participating in consensus, it will receive commits from all its peers and construct a valid block before it is propagated to them over the network. As a result, the aggregated seal any validator sees will only include signatures from the peers it is connected to, which is why, although the proposer has a quorum of signatures for the past block, that quorum will not include any validator it is not connected to. (I have not read this code in a bit, so I may not have all the details right though.)
I believe the full proposal, including transactions needed to calculate the state root, is only sent out by the proposer and all other messages only contain the block hash, but I may be mistaken.
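To make the parent-seal bookkeeping concrete, here is a toy sketch (types and names are ours, not celo-blockchain's): each node can only aggregate the commit signatures that actually reached it, and the seal records the signers in a bitmap.

package main

import (
	"fmt"
	"math/big"
)

// buildSealBitmap marks the validators whose commit signatures reached
// this node; only they appear in its version of the aggregated seal.
func buildSealBitmap(received []int) *big.Int {
	bitmap := new(big.Int)
	for _, idx := range received {
		bitmap.SetBit(bitmap, idx, 1)
	}
	return bitmap
}

func main() {
	// Suppose the next proposer only heard commits from validators 0, 1
	// and 3 because validator 2's link to it was down.
	seal := buildSealBitmap([]int{0, 1, 3})
	fmt.Printf("validator 2 in parent seal: %v\n", seal.Bit(2) == 1) // false
	// If this seal lands in the canonical block, validator 2 is scored as
	// having "missed" the block even though it was up and signing.
}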
@Peter [ChainLayer.io]: ... So it seems to be the combination of these three error messages that is causing the issue, then (where the issue is not being able to connect to certain validators):
Jun 14 13:11:32 prod-celo-validator1a geth[744]: INFO [06-14|13:11:32.331] Error sending all version certificates func=RegisterPeer err="write tcp IP:30303->IP:PORT: use of closed network connection"
Jun 14 13:12:37 prod-celo-validator1a geth[744]: INFO [06-14|13:12:37.708] Error sending all version certificates func=RegisterPeer err="shutting down"
Jun 14 13:12:37 prod-celo-validator1a geth[744]: INFO [06-14|13:12:37.703] Disconnecting from static or trusted peer id=b3ed97f60505f010 conn=trusted-inbound static=ValidatorPurpose trusted=ValidatorPurpose reason="too many peers" remoteRequested=true err="too many peers"
Jun 14 13:12:40 prod-celo-validator1a geth[744]: INFO [06-14|13:12:40.955] Disconnecting from static or trusted peer id=b3ed97f60505f010 conn=trusted-staticdial static=ValidatorPurpose trusted=ValidatorPurpose reason="too many peers" remoteRequested=true err="too many peers"
My 2 cents, based on what you just told me, is that somehow the certificate exchange between validators can fail. This causes the "other" side not to see you as a trusted validator, which will in turn cause these lines to throw the too-many-peers error (since it's not a trusted peer):
isStaticOrTrusted := p.Peer.Info().Network.Trusted || p.Peer.Info().Network.Static
if !isStaticOrTrusted && pm.peers.Len() >= pm.maxPeers && p.Peer.Server != pm.proxyServer {
	return p2p.DiscTooManyPeers
}
and/or
func (srv *Server) CheckPeerCounts(peer *Peer) error {
	switch {
	case peer.Info().Network.Trusted || peer.Info().Network.Static:
		return nil
	// KJUE - Remove the peerOp not nil check after restoring peer check in server.go
	case srv.peerOp != nil && (srv.PeerCount() > srv.MaxPeers):
		return DiscTooManyPeers
	case srv.inboundCount() > srv.maxInboundConns():
		return DiscTooManyInboundPeers
	default:
		return nil
	}
}
...
(Btw, two days ago it magically stopped for me, so currently I'm hardly missing any blocks anymore, likely because the number of peers seems to have gone down for some reason).
...
A theory I had was that the signatures could actually be discarded by the next proposer if a signature arrives there before that proposer receives the block proposal itself, causing them to go missing in the parent aggregate. But that was before I knew that you need to be connected to the proposer to actually get that block proposal... it might still be worth looking into, though.
...
(That theory came from this line of code https://github.com/celo-org/celo-blockchain/blob/96f69040267fa4c6aec666766660e5dc1e1c7ce3/consensus/istanbul/backend/handler.go#L115 and the fact that a lot of people reported seeing "Got a consensus message signed by a non validator" messages.)
I was wondering if that "backlog feature" actually works, since the proxy seems to be configured not to forward any messages to the validator for which it cannot checkValidatorSignature (which it shouldn't be able to do for a block in the future). It also seems the proxy in that case does a hard disconnect of the connected node, since it thinks the node is malicious or something.
...
... This was a while back, but then the "other" validator saw a huge amount of these lines in his logging:
ERROR[05-25|04:55:26.210] Got a consensus message signed by a non validator.
INFO [05-25|04:55:26.210] Disconnecting from static or trusted peer id=6ee8d4b57ffdb359 conn=trusted-staticdial static=ValidatorPurpose trusted=ValidatorPurpose reason="useless peer" remoteRequested=false err="useless peer"
So if that validator "thought" that of your validator, and the time between that error and them proposing was short, you would always be disconnected from that validator when they proposed a block.
This was during one of those "missing block storms" we had a while back (and haven't seen in a while, thank god).
Btw this could actually cause that
Jun 14 13:12:37 prod-celo-validator1a geth[744]: INFO [06-14|13:12:37.708] Error sending all version certificates func=RegisterPeer err="shutting down"
message on my side, couldn't it?
@victor | cLabs: That is really interesting. It's true that if the remote peer was experiencing an issue where it thinks your validator is not elected, it would disconnect and cause the behavior we are looking for, so it could certainly be related. It might also cause the issue where the dropped node cannot reconnect, because the remote peer would not recognize their certificate as valid. And you might see that error, ya. More generally, the error sending version certificates with
err="shutting down"
can occur any time a node disconnects, so this would be one of those cases.
@Peter [ChainLayer.io]: So maybe not all validators have a correct idea of who the active validators are, then, for some reason. That would be an entirely different problem, but it gives me something to look for.
That would be quite simple to see, though. If both validators that are having issues connecting compare the output of
geth --exec "istanbul.valEnodeTableInfo" attach
that would be a good start, I guess, right?
@victor | cLabs: Ya, the valEnodeTable is definitely useful information in these cases!
Useful info
Hummingbot.io has launched its arbitrage trading strategy for Celo, which "allows any CELO holder to earn arbitrage profits while contributing to cUSD stability".
Community
ansonlau3 created a WeChat group for Chinese validators, while Liviu | Easy 2 Stake created an unofficial Celo Telegram channel for Romania.
Like what we do? Support our validator group by voting for it!