A Practical Verification Mechanism for Decentralized GPU Sharing

The verification mechanism is critical to every decentralized network: a vulnerability in the mechanism can lead to the collapse of the entire network.

In a decentralized GPU sharing network, individual GPU holders contribute their spare computing power to the network in exchange for profit, and users pay to run their AI tasks, such as Stable Diffusion image generation, on that computing power. If GPU holders could send back random computation results without being caught, users would keep getting bad results for their AI tasks, they would soon stop using the system, and the network would eventually be abandoned by everyone.

Note that we cannot rely on the users to discover the cheating, since a user could also cheat by reporting an erroneous computation for every task to save money. The verification can only be performed by the system itself, or more precisely, by the consensus mechanism of the decentralized network.

Verification is much harder in a decentralized system than in a centralized one. The verification rules must be open to the public, since they are executed by all the consensus participants in the network. This gives attackers the opportunity to examine every detail in search of vulnerabilities. And the rules are fixed before any attack happens: there is no chance to manually inspect the attacking behavior and adjust the rules before real damage occurs.

Quite a lot of researchers and companies have worked on various schemes and related technologies. At Crynux, we have been working on Zero-Knowledge Proofs (ZKP) for quite a long time, to verify the actual execution of AI tasks. The bottleneck so far is still the speed of ZKP generation for large neural networks, or the lack of a practical contribution measurement scheme that survives the randomness and the attacks of the long-running gradient descent process.

But if we put some limits on the supported AI tasks, we can already implement a practical scheme that works on large-scale neural networks in a large-scale decentralized network.

More specifically, the limits we put on the task are that its inputs must be public and its execution must be deterministic: if the same task is executed by multiple nodes, they should all get the same result. For example, training on private datasets from other nodes is not supported, since the private data cannot be revealed to others. But apart from tasks involving private data, most tasks are supported: Stable Diffusion training tasks are supported, as well as image generation tasks, and so are LLM fine-tuning and inference tasks.

The idea is actually quite simple: for a given computation task, the network randomly selects 3 nodes to run the task simultaneously and compares the results returned by the nodes. If one node returns a result different from the other two, that node is cheating. Each node must have a certain amount of tokens staked on the network before receiving tasks, and if the node is caught cheating, the staked tokens are slashed immediately.
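
For illustration, here is a minimal sketch of this select-compare-slash loop in Python. The `Node` class, its fields, and the `verify` function are all hypothetical, not the actual Crynux implementation:

```python
import random
from dataclasses import dataclass

@dataclass
class Node:
    """A staked node; all names and fields here are illustrative only."""
    name: str
    stake: float
    honest: bool = True
    slashed: bool = False

    def run(self, task: str):
        # An honest node returns the deterministic task result (represented
        # here by a hash of the task); a cheater returns a random value.
        return hash(task) if self.honest else random.random()

def verify(nodes: list[Node], task: str) -> list[Node]:
    """Select 3 nodes at random, run the task on each, and slash any node
    whose result disagrees with the other two."""
    selected = random.sample(nodes, 3)
    results = [n.run(task) for n in selected]
    for i, node in enumerate(selected):
        others = [r for j, r in enumerate(results) if j != i]
        # A node is cheating if the other two agree and it differs.
        if others[0] == others[1] and results[i] != others[0]:
            node.slashed = True  # the staked tokens are slashed immediately
    return selected
```

If all three results differ, no node can be singled out; how such ties are handled is a design decision beyond this sketch.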

The design consumes 3 times the computing power actually needed to produce the result. But since it is spare computing resources we are talking about, it will still be cheaper than cloud-hosted GPUs. And we can further cut the overhead down to 2 times in most cases with a small optimization in the workflow.

For the attacker, it becomes a calculation of the probabilities of winning and losing. The variable the attacker controls is the number of (fake) nodes he possesses in the network. The attacker can start as many nodes as he wants, as long as he has enough tokens to stake. Possessing more nodes gives the attacker a higher probability of winning, since it becomes more likely that two or more of his nodes are selected for a single task. In that case, he can cheat the system by submitting two identical fake results.

The probability of a successful attack (the attacker getting 2 or more of his nodes selected for a task) can be calculated as:

\[ p = \frac{\binom{d}{2}\binom{h}{1} + \binom{d}{3}}{\binom{d+h}{3}} \]

Where \( h \) is the number of honest nodes, and \( d \) is the number of dishonest nodes the attacker possesses. The first term in the numerator counts the selections containing exactly 2 dishonest nodes, and the second counts those where all 3 are dishonest.
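
As a quick sanity check, the formula is easy to evaluate with Python's `math.comb`:

```python
from math import comb

def cheat_probability(h: int, d: int) -> float:
    """Probability that 2 or more of the 3 selected nodes are dishonest,
    given h honest nodes and d dishonest nodes in the network."""
    return (comb(d, 2) * comb(h, 1) + comb(d, 3)) / comb(d + h, 3)

print(cheat_probability(10, 90))  # ~0.9742, matching the first table below
```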

And the attacker's expected income from cheating is given by:

\[ E = p \cdot k - (1-p) \cdot s \]

Where \( k \) is the price of the task, and \( s \) is the amount of staked tokens.

By increasing the staked amount \( s \), we can drive the expectation \( E \) down to zero or even below. If \( E \) is below zero, there is no benefit in attacking the system by starting more fake nodes: the attack is more likely to lose the attacker money than to earn it.

Let's look at some concrete numbers.

For easier understanding, assume the price of executing a task is $0.1, rather than a number of tokens. And let's say only 10% of the nodes in the network are honest. Setting the target \( E \) to zero, the required stake is then:

\[ s = \frac{p \cdot k}{1-p} \]
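
Putting the two formulas together, a short script (reusing the hypothetical `cheat_probability` helper from above) reproduces the tables below:

```python
from math import comb

def cheat_probability(h: int, d: int) -> float:
    return (comb(d, 2) * comb(h, 1) + comb(d, 3)) / comb(d + h, 3)

def required_stake(h: int, d: int, k: float = 0.1) -> float:
    """The stake s that makes the attacker's expected income E exactly zero."""
    p = cheat_probability(h, d)
    if p == 1:  # e.g. a single honest node: cheating always succeeds
        return float("inf")
    return p * k / (1 - p)

for total in (100, 1_000, 10_000, 100_000):
    honest = total // 10  # 10% of the nodes are honest
    print(f"{total:>7} nodes: stake ${required_stake(honest, total - honest):.2f}")
```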

Note that in practice we might double the required stake, to further discourage gambling attempts by attackers.

Here's a table showing the required stake at different network sizes:

| No. total | No. honest | No. dishonest | Prob. of cheating success | $ required to stake |
| --- | --- | --- | --- | --- |
| 100 | 10 | 90 | 0.974211503 | 3.777697842 |
| 1000 | 100 | 900 | 0.972216505 | 3.499259211 |
| 10000 | 1000 | 9000 | 0.972021605 | 3.474186444 |
| 100000 | 10000 | 90000 | 0.97200216 | 3.471704109 |

As the table shows, the probability of successful cheating depends only on the proportion of honest nodes in the network, not on the network size. And even at such a low proportion of honest nodes, the required stake is surprisingly low.

Let's decrease the proportion of honest nodes even further, down to 1%. Here's what happens:

| No. total | No. honest | No. dishonest | Prob. of cheating success | $ required to stake |
| --- | --- | --- | --- | --- |
| 100 | 1 | 99 | 1 | - |
| 1000 | 10 | 990 | 0.999731174 | 371.8879113 |
| 10000 | 100 | 9900 | 0.999704911 | 338.781133 |
| 100000 | 1000 | 99000 | 0.999702291 | 335.7985534 |

The required stake is still not prohibitive. And note that we are assuming all the dishonest nodes belong to a single attacker. In the real world, if there are multiple attackers who do not conspire to submit the same fake result, each attacker's nodes act as honest nodes from the others' perspective, which further reduces the risk of a successful attack.

The staking-based verification mechanism lays a solid foundation for the decentralized GPU sharing network, but there are still ways to attack the network from the engineering side.

For example, the attacker could monitor the Blockchain for computation result submissions and submit the same result as the other nodes in the task. Another attack is to submit fake results only when the attacker finds that two of his nodes are selected for the same task, and simply abort the task in all other cases to avoid being slashed.
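
A common countermeasure to the result-copying attack, though not detailed in this post, is a commit-reveal scheme: each node first publishes only a salted hash of its result, and reveals the plain result after all commitments are on chain. A minimal sketch, with hypothetical function names:

```python
import hashlib
import secrets

def commit(result: bytes) -> tuple[str, bytes]:
    """Phase 1: publish only a salted hash of the result on chain, so
    other nodes cannot copy the result before everyone has committed."""
    salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + result).hexdigest()
    return digest, salt  # the digest goes on chain; the salt stays private

def check_reveal(digest: str, salt: bytes, result: bytes) -> bool:
    """Phase 2: once all commitments are in, each node reveals (salt,
    result) and the network checks the reveal against the commitment."""
    return hashlib.sha256(salt + result).hexdigest() == digest
```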

The engineering attacks can be solved with careful system design and implementation. But some attacks are inevitable for a growing system, and several rounds of testnet runs in a controlled environment will help resolve the potential problems.

Stay tuned for what's coming next!