ZKP: the Last Piece in the Puzzle of Data Trading
The development of artificial intelligence has created a growing hunger for data, and with it a market for data.
However, several problems stand in the way of data trading. Unlike traditional merchandise, data can be copied at no cost. A buyer who has purchased the data could immediately resell copies at a lower price, leaving the original data holder with little chance to sell it again. Another problem is that the personal information contained in the data makes it inappropriate to sell in plain text.
Privacy-Preserving Computation (PPC) technologies address these problems by trading computation results instead of the original data. Buyers pay for a computation on the data and receive its result without ever seeing the original data. Multiple types of computation are supported, such as AI model training and statistical analytics.
PPC is a fundamental building block in the infrastructure of the next-gen data trading platform, but on its own it is not enough to secure the whole data trading process, since the buyer has no way to verify the computation result without seeing the original data. This is where ZKP fits in.
ZKP is short for Zero-Knowledge Proof, a cryptographic technique that lets us prove to others that we know something without revealing what it is.
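To make the idea concrete, here is a minimal sketch of one classic zero-knowledge proof of knowledge, the Schnorr protocol, made non-interactive with the Fiat-Shamir heuristic. It is not the scheme used later in this article. The group parameters (P = 23, Q = 11, G = 2) are toy values assumed purely for readability and are far too small to be secure, and the function names are hypothetical.

```python
import hashlib
import secrets

# Toy group: G = 2 generates a subgroup of prime order Q = 11 in Z_23*.
# These values are illustrative assumptions and offer no real security.
P, Q, G = 23, 11, 2


def challenge(public_key: int, commitment: int) -> int:
    """Derive the challenge from a hash (Fiat-Shamir), replacing the verifier's random choice."""
    digest = hashlib.sha256(f"{G}|{public_key}|{commitment}".encode()).digest()
    return int.from_bytes(digest, "big") % Q


def prove(secret: int) -> tuple[int, int, int]:
    """Prove knowledge of `secret` such that public_key = G^secret mod P."""
    public_key = pow(G, secret, P)
    nonce = secrets.randbelow(Q)  # fresh randomness masks the secret in the response
    commitment = pow(G, nonce, P)
    c = challenge(public_key, commitment)
    response = (nonce + c * secret) % Q
    return public_key, commitment, response


def verify(public_key: int, commitment: int, response: int) -> bool:
    """Check G^response == commitment * public_key^c (mod P); the secret is never used."""
    c = challenge(public_key, commitment)
    return pow(G, response, P) == (commitment * pow(public_key, c, P)) % P
```

The verifier only ever sees the public key, the commitment and the response. Proving that a whole computation was executed correctly requires much heavier machinery, but the prover and verifier roles are the same.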
ZKP can be used to verify the computation result without revealing the original data. Together with blockchain, ZKP is the last piece in the data trading puzzle.
To introduce the last piece in detail, we have to describe the whole picture first. In the next part of this article, we will give a concrete scheme that demonstrates how ZKP is used to verify a logistic regression computation task in the data trading scenario.
Towards the Next-Gen Data Trading Infrastructure
Step 1. Computation result trading using PPC.
AI model training and statistical analytics cover most of the use cases of data, and both can be treated as a sequence of complex computation steps. This inspires the shift from “data trading” to “computation result trading”.
Instead of paying for the original data, the buyer now sends the computation steps to several data holders, who perform the computation on their own data and get partial results. Privacy-Preserving Computation is used to sum up all the partial results to get the final result, in a way that no partial result of a single data holder will be revealed to others.
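As an illustration of how partial results can be summed without exposing any of them, here is a minimal sketch based on additive secret sharing, one of the building blocks of MPC. It simulates all data holders in a single process. The modulus and function names are assumptions made for this example, and partial results are assumed to be non-negative integers (real-valued results such as gradients would first need a fixed-point encoding).

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible sum


def split_into_shares(value: int, n_parties: int) -> list[int]:
    """Split `value` into n random-looking shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(partial_results: list[int]) -> int:
    """Sum the holders' partial results without any holder revealing its own value."""
    n = len(partial_results)
    # Holder i splits its value and sends share j to holder j.
    all_shares = [split_into_shares(v, n) for v in partial_results]
    # Each holder j publishes only the sum of the shares it received.
    published = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Adding the published sums yields the total of all partial results.
    return sum(published) % MODULUS
```

For example, `secure_sum([3, 5, 7])` returns 15, while each individual share and each published sum looks like a random number on its own.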
Several technologies implement PPC in different ways, such as Secure Multi-Party Computation (MPC), Federated Learning/Federated Analytics (FL/FA), Fully Homomorphic Encryption (FHE) and Differential Privacy (DP). We will not go into the details of these technologies, since they all serve the same purpose in the data trading infrastructure. Some of them can replace one another, and some can be combined to complement each other.
Step 2. Computation result verification using ZKP.
The verification problem in PPC
Once computation result trading is implemented using PPC, a new problem arises: how can the buyer be sure that the computation was correctly performed on the data he requested?
Because the buyer cannot see the data, he has lost the ability to verify the computation himself. What if the data holder just returns fake results without executing the computation task?
Even if the computation is actually performed, how could the buyer be sure that the data used is not randomly generated?
In summary, computation verification and original data verification are the two problems we must solve to implement a reliable data trading platform.
Contribution proof is a promising method that addresses the verification of both the computation and the data at the same time. The idea is that if each participant's share of the contribution to a converged model can be measured (and proved), and the buyer is satisfied with the accuracy of the model, he no longer needs to care about the data and the computation, since he has already got what he wanted. There is a fair amount of research on contribution proofs. However, we are still far from a method that works in practice, especially one that scales to a large number of participants and produces proofs without revealing a participant's own gradients.
Computation verification using ZKP
For a given computation task, the buyer can generate a ZKP scheme consisting of a generation function and a verification function. The generation function is passed to the data holder together with the computation task.
The data holder, besides executing the computation task, also executes the ZKP generation function on his data, which outputs a proof. He then sends the result and the proof to the buyer.
The buyer feeds the proof into his ZKP verification function. If the verification passes, he knows that the result is truly the output of his computation task, which completes the computation verification. The input of the computation remains hidden from him.
Original data sourcing using ZKP
How about the original data verification?
Technically, there is no way to tell directly whether the data is fake, because fake data differs from real data only in its statistical distribution, and there is no reference distribution to compare against.
But if we take a step back, even though the original data cannot be verified immediately, we can leave evidence of which data was used in a computation task. By calculating the hash of the data and exposing it in the ZKP, the data holder gives the buyer a proof that specific data (identified by the hash) was used in the computation task, and the buyer may later request an audit of a portion of the original data.
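One way to make a later audit of "a portion" of the data possible is to expose a Merkle root instead of a single flat hash, so that any individual record can be checked against the root without disclosing the others. The article only requires a data hash; the Merkle tree, the SHA-256 choice, the domain-separation prefixes and the function names below are assumptions made for this sketch.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(records: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, leaves first; odd-sized levels reuse their last node."""
    if not records:
        raise ValueError("at least one record is required")
    level = [h(b"leaf:" + r) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [h(b"node:" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(records: list[bytes]) -> bytes:
    """The fingerprint the data holder would expose for the whole data set."""
    return merkle_levels(records)[-1][0]


def merkle_proof(records: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes needed to recompute the root from the record at `index`."""
    proof = []
    for level in merkle_levels(records)[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        # Store the sibling and whether the current node is the left child.
        proof.append((level[index ^ 1], index % 2 == 0))
        index //= 2
    return proof


def verify_record(record: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Audit one disclosed record against the previously exposed root."""
    node = h(b"leaf:" + record)
    for sibling, node_is_left in proof:
        node = h(b"node:" + (node + sibling if node_is_left else sibling + node))
    return node == root
```

During an audit, the data holder would disclose the requested records together with their proofs, and the buyer would call `verify_record` with the root he received alongside the ZKP.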
Additionally, if the data holders are asked to record the hashes of new or modified data on the blockchain on a daily basis, before any computation task arrives, then once a task finishes the buyer can verify that the data used in it were generated well before the task was submitted. This removes the possibility of data holders fabricating data specifically for the task.
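The timestamp check itself is simple. The sketch below uses an in-memory dictionary as a stand-in for the blockchain, mapping each recorded data hash to the time it was recorded. The ledger layout, the integer timestamps and the function names are assumptions for illustration, not a real chain API.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hash that the data holder records on the ledger for a batch of data."""
    return hashlib.sha256(data).hexdigest()


def registered_before(ledger: dict[str, int], data_hash: str, task_time: int) -> bool:
    """True if `data_hash` was recorded strictly before the task was submitted."""
    recorded_at = ledger.get(data_hash)
    return recorded_at is not None and recorded_at < task_time


# Hypothetical ledger: the holder recorded one batch at time 100.
ledger = {fingerprint(b"batch-2024-01-01"): 100}
```

With this ledger, a task submitted at time 200 would accept the recorded batch, while data whose hash is missing or was recorded after the task submission would be rejected.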
Original data sourcing is a weakened form of data verification. Combined with other mechanisms such as auditing and a reputation system, it can solve the data verification problem in the long run.
Step 3. Filling the gap of trust using Blockchain.
Middleman standing between the counterparties of data trading
We now have a computation method and a verification method, but that is still not enough to make trading happen until we solve the counterparty risk in the trading process.
If the buyer pays first, he faces the risk that the data holders will not send him the result after payment. He therefore prefers that the data holders compute first, and to pay them only after he has verified the computation result.
The situation is similar for the data holder, who prefers to get paid before spending his computing power on the task.
With each side waiting for the other to move first, no trade can happen at all.
The solution is to introduce a middleman, who is trusted by both parties. The middleman promises to both parties that he will verify the computation and transfer the money honestly.
The trading process starts with the buyer sending the computation task, the ZKP generator and the money to the middleman. The middleman holds the money temporarily and forwards the task and the ZKP generator to the data holder.
The data holder, knowing that the middleman will transfer the money to him as long as he passes the ZKP verification, starts the computation, generates the ZKP using the generator, and sends the result and the ZKP back.
The middleman, after verifying the ZKP, sends the money to the data holder and the computation result to the buyer. The trade completes successfully.
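The middleman's role in the steps above can be sketched as a small escrow object. The class, its method names and the pluggable `verify_proof` callback are hypothetical; the callback stands in for the buyer's ZKP verification function. Refunding the buyer when verification fails is an assumption of this sketch, since the article only describes the successful path.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Escrow:
    """Hypothetical middleman that holds the payment until the proof is verified."""
    verify_proof: Callable[[bytes, bytes], bool]  # (result, proof) -> accepted?
    balances: dict[str, int] = field(default_factory=dict)
    held: int = 0

    def deposit(self, buyer: str, amount: int) -> None:
        """The buyer hands over the payment together with the task."""
        if self.balances.get(buyer, 0) < amount:
            raise ValueError("insufficient funds")
        self.balances[buyer] -= amount
        self.held += amount

    def settle(self, buyer: str, holder: str, result: bytes, proof: bytes) -> Optional[bytes]:
        """Pay the holder and release the result if the proof verifies; otherwise refund."""
        amount, self.held = self.held, 0
        if self.verify_proof(result, proof):
            self.balances[holder] = self.balances.get(holder, 0) + amount
            return result  # delivered to the buyer
        self.balances[buyer] = self.balances.get(buyer, 0) + amount
        return None
```

A usage example with a placeholder verifier (not a real ZKP check):

```python
escrow = Escrow(verify_proof=lambda result, proof: proof == b"valid",
                balances={"buyer": 100})
escrow.deposit("buyer", 40)
delivered = escrow.settle("buyer", "holder", b"model", b"valid")
```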
Middleman implemented using the Blockchain
The middleman can be implemented as a program that executes all of the middleman's steps automatically. Such a system can serve many buyers and data holders, providing trust for all transactions as a centralized platform, as long as the platform itself is trusted by all users.
But what if the platform is not trustworthy? What if the platform conspires with the data holders to cheat buyers, or vice versa?
The blockchain provides stronger trust by replacing one-man decision making with consensus among multiple participants. Think of the blockchain system as a consortium of many middlemen, where decisions are made by voting. If there are more honest middlemen than dishonest ones, the system is safe.
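The voting intuition can be sketched as a simple majority rule. This is a deliberately simplified model with hypothetical function names: honest middlemen vote according to the true verification outcome and dishonest ones vote the opposite. Practical Byzantine fault tolerant consensus protocols typically require a stronger condition, such as more than two thirds of the participants being honest.

```python
def consortium_accepts(votes: list[bool]) -> bool:
    """Accept when strictly more than half of the middlemen vote to accept."""
    return sum(votes) * 2 > len(votes)


def simulate(proof_is_valid: bool, n_honest: int, n_dishonest: int) -> bool:
    """Honest nodes report the true outcome; dishonest nodes report the opposite."""
    votes = [proof_is_valid] * n_honest + [not proof_is_valid] * n_dishonest
    return consortium_accepts(votes)
```

With three honest and two dishonest middlemen the consortium reaches the correct decision in both cases, whereas with two honest and three dishonest middlemen a valid proof is rejected.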
The role of the middlemen is played by the buyers and the data holders themselves, who host the blockchain nodes and thereby form the infrastructure of next-gen data trading. A data holder can simply start a node to continuously sell his data, and buyers can run tasks on data from the whole network.