Post Mortem — No slashing protection incident (SSV Testnet)

Summary

On August 25th an SSV validator was slashed; the Blox team was informed by one of the testnet operators (OneInfra). The slashing was caused by two attestations that were broadcast for the same slot with different block roots (a double attestation).
The slashing was primarily caused by the complete absence of slashing protection at the SSV node level. This is a known issue that was taken into consideration: the testnet was launched with a placeholder signer, to be refactored at a later stage.

A side discovery was a vulnerability in the SSV QBFT implementation: a QBFT instance does not decide if a quorum of commit messages is received with a delay. This caused the committee to continue to an additional round of messages and decide in the next round, even though enough commit messages had been received in the previous round.

Slashing incident details

Network messages data
10:19:28.459–10:19:29.009: At epoch 34,812 validator 216474 had an attestation duty which was executed by the SSV committee at sequence number 445 by node ids 2(Everstake), 3(Lighthouse), 4(RockX).

10:19:31.469–10:25:34.579: Node 1(OneInfra) received a decided message for sequence 445 which triggered a re-sync, causing a delayed duty execution for epoch 34,812 (by 3 seconds).
By the time node 1 started its duty for epoch 34,812, it started with sequence number 446.
As the other nodes had finished the epoch’s duty, none of them were actively processing messages for sequence 446, leaving node 1 “on its own”.
While nodes 2, 3 and 4 didn’t process these messages, they did receive and cache all of node 1’s future messages.
Node 1 was the leader for round 1 sequence 446 but timed out for several rounds as no one was processing its messages.

10:27:40.256–10:27:43.475: When nodes 2,3 and 4 started execution for epoch 34,813 (6 minutes later) they all started it at sequence 446.
At this point all nodes were on sequence 446. Node 1 with data from epoch 34,812 and the rest with data from epoch 34,813.
Nodes 2 and 3 started processing node 1’s messages. At round 1 node 1 was the leader.
As nodes 2 and 3 processed the pre-prepare message for round 1 (sequence 446), they both sent prepare messages building on node 1’s pre-prepare.
Nodes 2 and 3 both achieved prepare quorum for round 1 with messages from nodes 2,3 and 1.
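
The quorum arithmetic behind this step can be sketched as follows. This is a minimal Python sketch, assuming the standard QBFT fault model of n = 3f + 1 nodes; the function names are hypothetical and are not taken from the SSV codebase.

```python
def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1 (assumed fault model)."""
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Number of matching messages needed for a quorum: 2f + 1."""
    return 2 * max_faulty(n) + 1


def has_quorum(signers: set, n: int) -> bool:
    """True once messages from enough distinct signers were collected."""
    return len(signers) >= quorum_size(n)


# The 4-operator committee from this incident tolerates f = 1 faulty node,
# so prepare messages from nodes 1, 2 and 3 already form a quorum.
print(quorum_size(4))            # 3
print(has_quorum({1, 2, 3}, 4))  # True
```

With 4 operators, any 3 matching messages are enough, which is why nodes 2 and 3 could reach a prepare quorum on node 1’s data without node 4.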

10:27:43.475–10:28:01.685: Node 4 joins the rest at round 2. All nodes time out and move to round 3.
At round 3 node 3 is the leader, sending a pre-prepare value to all other nodes.
Because node 3 prepared at round 1 with node 1’s data, it must send that data in future rounds.
All nodes receive node 3’s pre-prepare message for round 3 with the data prepared at round 1 (the attestation data for the previous epoch).
All 3 nodes (2, 3 and 4) manage to decide on round 3, broadcasting an attestation that has the previous epoch’s data.
The SSV committee broadcasts a double attestation vote, causing the validator to get slashed.
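
The re-proposal rule that forced node 3 to carry round 1’s data into round 3 can be sketched roughly as below. This is a simplified Python illustration of the QBFT rule that a new leader re-proposes the value from the highest prepared round reported in the round-change messages. The RoundChange type and choose_proposal function are hypothetical names, and node 4’s unprepared state is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoundChange:
    # Hypothetical, simplified round-change message.
    sender: int
    prepared_round: Optional[int]  # None if the sender never prepared
    prepared_value: Optional[str]


def choose_proposal(round_changes: List[RoundChange], own_input: str) -> str:
    """Value a new leader proposes after collecting a quorum of round changes."""
    prepared = [rc for rc in round_changes if rc.prepared_round is not None]
    if not prepared:
        # Nobody prepared earlier: the leader is free to use fresh data.
        return own_input
    # Otherwise the value from the highest prepared round must be re-proposed.
    return max(prepared, key=lambda rc: rc.prepared_round).prepared_value


round_changes = [
    RoundChange(sender=2, prepared_round=1, prepared_value="attestation data, epoch 34,812"),
    RoundChange(sender=3, prepared_round=1, prepared_value="attestation data, epoch 34,812"),
    # Assumption for illustration: node 4 had not prepared anything.
    RoundChange(sender=4, prepared_round=None, prepared_value=None),
]
print(choose_proposal(round_changes, own_input="attestation data, epoch 34,813"))
```

In this sketch the leader’s own, current-epoch input is ignored as soon as any round-change message reports a prepared value, which mirrors how the stale data ended up being decided.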

This sequence of events was possible because there is no slashing protection on the nodes’ signers. Had there been one, they would have rejected node 1’s pre-prepare data for round 1 (seq 446) and moved on with the current data for the current epoch.
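
As an illustration of the kind of check that was missing, here is a minimal Python sketch of signer-side slashing protection for attestations. The double-vote and surround-vote conditions follow the slashable attestation conditions of the Ethereum consensus specification; the AttestationData fields are simplified, the SlashingProtectedSigner class is hypothetical, and the slot, roots and source epochs are illustrative values, not data from the incident.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AttestationData:
    # Simplified stand-in for the consensus-layer AttestationData container.
    slot: int
    beacon_block_root: str
    source_epoch: int
    target_epoch: int


def is_slashable(prev: AttestationData, new: AttestationData) -> bool:
    """True if signing `new` after `prev` would be a slashable offence."""
    # Double vote: different data for the same target epoch.
    double_vote = prev != new and prev.target_epoch == new.target_epoch
    # Surround vote: one attestation's source/target span encloses the other's.
    surround = (
        (prev.source_epoch < new.source_epoch and new.target_epoch < prev.target_epoch)
        or (new.source_epoch < prev.source_epoch and prev.target_epoch < new.target_epoch)
    )
    return double_vote or surround


@dataclass
class SlashingProtectedSigner:
    # Hypothetical signer that remembers everything it has agreed to sign.
    history: List[AttestationData] = field(default_factory=list)

    def try_sign(self, data: AttestationData) -> bool:
        if any(is_slashable(prev, data) for prev in self.history):
            return False  # refuse: signing would be slashable
        if data not in self.history:
            self.history.append(data)
        return True


# Illustrative values only (not the real slot or block roots).
first = AttestationData(slot=1_113_990, beacon_block_root="0xaaaa", source_epoch=34_811, target_epoch=34_812)
late = AttestationData(slot=1_113_990, beacon_block_root="0xbbbb", source_epoch=34_811, target_epoch=34_812)
current = AttestationData(slot=1_114_022, beacon_block_root="0xcccc", source_epoch=34_812, target_epoch=34_813)

signer = SlashingProtectedSigner()
print(signer.try_sign(first))    # True
print(signer.try_sign(late))     # False
print(signer.try_sign(current))  # True
```

A signer of this kind on nodes 2, 3 and 4 would have refused the second attestation for epoch 34,812 while still signing the attestation for epoch 34,813.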

Vulnerability in the QBFT SSV Implementation

In sequence 446 it was discovered that 2 qualified commit quorums were achieved (in rounds 2 and 3). QBFT is specifically designed to NOT allow such events; this was caused by a vulnerability found in the implementation code.

Commit quorum condition from https://arxiv.org/pdf/2002.03613.pdf

Whenever 2f+1 commit messages are received by a node, it should stop its running instance and decide. No further message processing should happen afterwards.
In our SSV implementation we index incoming messages by identifier, sequence number and round. As a result, late commit messages are not processed if the round has moved forward, e.g., a commit message from round 2 won’t be processed if the node has moved to round 3.

The code above is the main message-processing handler; line 3 shows how round numbers are involved in popping messages out of the queue for processing.
A fix should remove the round number from the message index and also remove the round check in the commit message pipeline.
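
To make the difference concrete, the following Python sketch replays a late commit message under both behaviours. It is a simplified model, not the SSV implementation: the CommitTracker class, its index_by_round flag and the replay helper are hypothetical, and the quorum of 3 assumes a 4-node committee.

```python
from collections import defaultdict


class CommitTracker:
    """Hypothetical commit-message store for a single QBFT instance."""

    def __init__(self, quorum: int, index_by_round: bool):
        self.quorum = quorum
        self.index_by_round = index_by_round
        self.current_round = 1
        self.decided = None
        # (round, value) -> set of signer ids
        self.commits = defaultdict(set)

    def on_commit(self, signer: int, round_: int, value: str) -> None:
        if self.decided is not None:
            return  # once decided, no further processing
        if self.index_by_round and round_ != self.current_round:
            return  # problematic behaviour: a late commit is never processed
        self.commits[(round_, value)].add(signer)
        # A quorum only counts commits for the same round and value.
        if len(self.commits[(round_, value)]) >= self.quorum:
            self.decided = (round_, value)


def replay(index_by_round: bool):
    tracker = CommitTracker(quorum=3, index_by_round=index_by_round)
    tracker.current_round = 2
    tracker.on_commit(signer=2, round_=2, value="v")
    tracker.on_commit(signer=3, round_=2, value="v")
    tracker.current_round = 3  # local timeout, node moves on
    tracker.on_commit(signer=4, round_=2, value="v")  # arrives late
    return tracker.decided


print(replay(index_by_round=True))   # None
print(replay(index_by_round=False))  # (2, 'v')
```

With the round check in place the third round 2 commit is dropped and the instance keeps running; without it the instance decides on round 2 as soon as the late message arrives.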

Important Questions

  1. Is this a problem with SSV or its security? NO. This was caused by the absence of slashing protection, which will not be the case on mainnet.
  2. Were any funds lost? NO, we are strictly on testnet.
  3. How can this be prevented in the future? Once a proper remote signer with slashing protection is implemented, this will not happen. We are working with the leading validator client implementation teams to integrate into existing VC and remote signer clients.

CEO @ bloxstaking.com and blox.io. Developing trustless staking products for eth2.0.