Data Validation

Why is data validation needed? A quick answer: to prevent users from submitting malicious data.

An AI model needs large amounts of clean, high-quality data for training and fine-tuning. The philosophy is to build an incentive system for qualified data contributors, who help the model grow progressively stronger; conversely, users who submit malicious data are punished.

To achieve this goal, our framework applies to platforms where decentralized networks agree on a shared sequence of computations. A smart contract holds this shared code: it contains data fields and interacts with new code and events via its method calls. An on-chain computation is one performed inside a smart contract; its input and result are usually stored on the blockchain. In contrast, an off-chain computation can be done locally on the client's machine and does not necessarily need to be public.

In conventional legal systems, violating an agreement may result in a penalty or fine. Enforcing a penalty via a smart contract is complicated because a user cannot be forced to make a payment. Instead, many solutions in the blockchain space require users to “stake” deposits that they can reclaim later if they obey the rules. Similar to those systems, we propose staking a deposit to simplify some incentive mechanisms for submitting new data.

(1) The Incentive Mechanism validates the transaction; sometimes, a “stake” or monetary deposit is required.

(2) The DataHandler stores data and meta-data onto the blockchain. This ensures it is accessible for all future uses, not limited to this smart contract.

(3) The machine learning model is updated according to predefined training algorithms. In addition to adding data, anyone can query the model for predictions, and the incentive mechanism may be triggered to provide users with rewards.
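
A minimal Python sketch of how a single submission might flow through these three components is shown below. The class and function names (IncentiveMechanism, DataHandler, Model, add_data) are illustrative assumptions; in a real deployment these would be smart-contract code rather than Python.

    import time

    class IncentiveMechanism:
        """(1) Validates the transaction; may require a stake/deposit."""
        def require_deposit(self, amount: float, minimum: float) -> None:
            if amount < minimum:
                raise ValueError("deposit below the required stake")

    class DataHandler:
        """(2) Stores data and meta-data so future uses are not limited to this contract."""
        def __init__(self):
            self.records = []
        def store(self, sender, x, y, deposit):
            self.records.append({"sender": sender, "x": x, "y": y,
                                 "deposit": deposit, "time": time.time()})

    class Model:
        """(3) Updated by a predefined training algorithm; anyone can query it."""
        def update(self, x, y):
            pass  # e.g., one online-learning step on (x, y)
        def predict(self, x):
            return 0  # placeholder prediction

    def add_data(im, dh, model, sender, x, y, deposit, minimum):
        im.require_deposit(deposit, minimum)  # (1) incentive check
        dh.store(sender, x, y, deposit)       # (2) persist data and meta-data
        model.update(x, y)                    # (3) train on the new sample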

Here is an example of two Tweets related to the topic "Bitcoin":

  1. The first one, in English, is straightforward for the model, even though it contains the hashtag #Bitcoin.

  2. In the second one, the word "Flatbread" refers to Bitcoin; the connection is not explicit in English, but it is a very well-known nickname in Chinese.

In this case, both Tweets will pass the prediction check performed by the validation node. Both will then be labeled as BTC and added to the dataset used to fine-tune the shared updatable models. The more data is collected and validated, the stronger the model becomes, since it is fine-tuned on the newly added data.

The Staking-Based Self-Assessment Mechanism

In an ideal framework, fines or penalties could be imposed for submitting inaccurate data. A prevalent method for assessing data quality is peer validation, a technique widely recognized in traditional crowdsourcing models. However, enforcing penalties after submission via smart contracts presents particular challenges, so a deposit mechanism is integrated at the point of data contribution to streamline the process of imposing penalties.

A specially deployed model, called h, is critical for validating newly submitted data. Note that this model must undergo initial training to ensure it can categorize data inputs with a reasonable degree of precision. The process encompasses several key steps:

  • Model Deployment: The introduction of a model h, which has been pre-trained on a subset of data, to the validation process.

  • Deposit Requirement: A deposit is mandated for every data submission, encapsulating the data x and its associated label y. This ensures that the data and its metadata are securely recorded on the blockchain, promoting an environment of accountability and quality in data contributions.

  • Refund: After a time t has passed, and if the current model h still agrees with the originally submitted classification, i.e., if h(x) == y, the contributor can have their entire deposit d returned (see the sketch after this list).

– It is now assumed that (x, y) is “good” data.

– The successful return of the deposit should be recorded in a tally of points for the wallet address.

  • Take: A contributor who has already had data validated in the Refund stage can locate a data point (x, y) for which h(x) is not equal to y and request to take a portion of the deposit originally given when (x, y) was submitted.

If the submitted sample (x, y) is incorrect or invalid, then within time t other contributors should submit (x, y′), where y′ is the correct, or at least generally preferred, label for x and y′ is not equal to y. This is similar to how one typically expects bad edits to popular Wikipedia articles to be corrected promptly.
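
The Refund rule above can be made concrete with a minimal Python sketch. This is an illustration under assumed names, not the framework's contract code: records are plain dictionaries, h is the deployed model's prediction function, and the points tally stands in for n(c), the per-contributor count of refunded "good" data.

    def can_refund(record, h, now, t):
        """Deposit is reclaimable once time t has passed and h still agrees."""
        return now - record["time"] >= t and h(record["x"]) == record["y"]

    def refund(record, h, now, t, points):
        """Return the full deposit d if eligible, and tally the refund for n(c)."""
        if can_refund(record, h, now, t):
            sender = record["sender"]
            points[sender] = points.get(sender, 0) + 1  # success recorded per wallet
            payout, record["deposit"] = record["deposit"], 0.0
            return payout  # (x, y) is now assumed to be "good" data
        return 0.0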

Time to Wait for Refund: Within the smart contract framework, t serves as a crucial temporal parameter, establishing how long contributors must wait before initiating a refund claim on their deposit. It is imperative that t be long enough to give other network participants the opportunity to offer corrective submissions with alternate labels, should they identify discrepancies in the data. For instance, setting t to a minimum of one week could facilitate this process. This delay is particularly crucial for models that exhibit lower sensitivity, providing ample time to accumulate the diverse range of samples necessary for adapting the model to new scenarios.

Models characterized by high sensitivity present a unique challenge, as they may permit the premature claiming of refunds for inaccurately submitted data, potentially before corrective actions by other contributors are possible. To mitigate this risk, such models necessitate a significantly higher deposit requirement, aimed at deterring rapid, malicious submissions. Rigorous testing and careful consideration should precede the determination of t, ensuring it is optimally set to balance model sensitivity against the need for data accuracy.

The parameter t need not be static. Its duration could be dynamically adjusted based on various factors, such as the nature of the data sample, submission frequency, or the model's confidence in the data's accuracy. For instance, if a model can quantify the likelihood of a submission's correctness, P(h(x) = y), this probability metric could justify a reduction in t when the model's confidence in the submission's validity is high, suggesting that subsequent changes are improbable.
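
Purely as an illustration of such dynamic adjustment, the sketch below shrinks t linearly as the model's confidence P(h(x) = y) grows. The one-week ceiling, one-day floor, and linear interpolation are assumptions, not values prescribed by the mechanism.

    ONE_DAY = 24 * 60 * 60  # seconds

    def refund_wait_time(confidence: float,
                         t_max: float = 7 * ONE_DAY,
                         t_min: float = 1 * ONE_DAY) -> float:
        """Interpolate between t_max (low confidence) and t_min (high confidence)."""
        confidence = min(max(confidence, 0.0), 1.0)  # clamp P(h(x) = y) to [0, 1]
        return t_max - (t_max - t_min) * confidence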

Varying Deposit: The implementation of a deposit requirement serves multiple objectives within the system:

  • It injects value into the ecosystem, rewarding participants who contribute accurate data, thereby motivating the submission of high-quality information.

  • It acts as a deterrent against the overly frequent modification of the model, maintaining the stability and reliability of the system.

  • It curtails the influx of spam, defined here as incorrect or invalid data submissions, enhancing the overall data integrity.

To realize these objectives, a specific equation is employed to ensure that it becomes prohibitively expensive for contributors to submit a high volume of updates within a brief timeframe. This approach aims to offer users of the model's prediction function a more uniform and reliable experience. An illustrative example of this principle is the expectation of consistent responses from a personal assistant device to identical voice commands issued multiple times within a day, such as a request to play the news.
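
The exact equation is deployment-specific, and the sketch below shows only one plausible choice, not the framework's prescribed formula: the required deposit scales up whenever updates arrive faster than the refund-wait time t, making rapid bursts of submissions prohibitively expensive.

    def required_deposit(d_min: float, t: float, seconds_since_last_update: float) -> float:
        """Require at least d_min; scale up when updates arrive faster than t."""
        seconds_since_last_update = max(seconds_since_last_update, 1.0)  # avoid division by zero
        return max(d_min, d_min * t / seconds_since_last_update)

Under this choice, a second update submitted after only a tenth of t has elapsed would require roughly ten times the minimum deposit.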

Taking Another’s Deposit: We introduce guidelines for a contributor reporting “bad” data to take some of the deposit from the original contributor, c. Note that contributed data and meta-data about it can be found in the data handler or emitted events.

First, some definitions:

• Let r(cr, d) be the reward the reporting contributor, cr, receives for reporting data (x, y) with deposit d.

• Let n(c) be the number of data samples for which contributor c received a refund (assumed good data).

Guidelines:

• h(x) != y: The current model disagrees with the label. So, it is assumed the data is “bad”.

• n(cr) > 0: The reporter should have already had data refunded. This ensures that they have hopefully already submitted “good” data before they can try to profit from the system.

• cr != c: The reporter cannot be the original contributor. Otherwise, contributors can easily attempt to reclaim their deposit for “bad” data.

• The reward should be shared amongst “good” contributors.

– This protects against Sybil attacks, where a contributor uses a second account to take back their entire deposit. They can still claim back some of their reward from another account, but they will have to wait and get refunded for some “good” data using that other account.

• r(cr, d) > ε > 0: The reward should be at least some minimal value to cover potential transaction costs.

• The data handler must keep track of the remaining deposit that can be claimed, dr ≤ d.

• Since n(c) changes over time, the reward ratio changes while reporters claim their share of d. Therefore, it is possible that some reporters get a smaller proportion of d. We discuss some possible solutions to this in III-C5.
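
These guidelines can be combined into a single check, sketched below under assumed data structures: records are dictionaries, n maps each contributor to their count of refunded samples, and total_good is the sum of n over all contributors. Sharing d in proportion to n(cr) is one assumed way to satisfy the Sybil-resistance guideline; EPSILON stands for the minimal reward ε.

    EPSILON = 1e-3  # minimal reward to cover transaction costs (assumed units)

    def report_bad_data(record, h, reporter, n, total_good):
        """Let reporter cr take part of the remaining deposit dr for bad (x, y)."""
        if h(record["x"]) == record["y"]:
            raise ValueError("h(x) == y: the model agrees, data is not 'bad'")
        if n.get(reporter, 0) <= 0:
            raise ValueError("n(cr) == 0: the reporter has no refunded data yet")
        if reporter == record["sender"]:
            raise ValueError("cr == c: the reporter is the original contributor")
        # Share d among "good" contributors in proportion to n(cr), so a
        # second (Sybil) account cannot drain the whole deposit at once.
        reward = record["deposit"] * n[reporter] / max(total_good, 1)
        reward = max(reward, EPSILON)            # r(cr, d) > ε > 0
        reward = min(reward, record["deposit"])  # never exceed remaining dr
        record["deposit"] -= reward              # data handler tracks dr ≤ d
        return reward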

Biasing the Model: Under the proposed system, contributors may predominantly submit data that agrees with the current model's predictions (h(x) = y) at the time of submission, anticipating that the model will still agree when the refund period arrives. Such a strategy may inadvertently induce a confirmation bias within the model, skewing it towards reaffirming the data it has previously encountered. Although contributors must pay a transaction fee, and thus incur a nominal cost for both depositing and retrieving their refund, this does not entirely mitigate the risk of biased data submission.

The selection of the model and its training methodology thus becomes paramount, necessitating a strategic approach to the acceptance and processing of data submissions. It is essential for the system's architect to implement mechanisms capable of identifying and mitigating the influence of redundant or overly similar data entries, which could compromise the diversity and representativeness of the dataset. To this end, the Incentive Mechanism (IM) has the authority to reject submissions that excessively replicate previously submitted data, ensuring the model's continuous exposure to a broad spectrum of information.
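
As a hedged illustration of such filtering, the sketch below rejects exact duplicates by hashing each (x, y) pair. A production IM would more likely use a similarity measure over feature representations, but the hash makes the accept/reject flow concrete.

    import hashlib

    seen_digests = set()

    def accept_submission(x: str, y: int) -> bool:
        """Reject (x, y) if an identical pair was already submitted."""
        digest = hashlib.sha256(f"{x}|{y}".encode()).hexdigest()
        if digest in seen_digests:
            return False  # replicates previously submitted data
        seen_digests.add(digest)
        return True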

Preventing Lock-ups: This section discusses ways to avoid funds getting “locked up” or “stuck inside” the smart contract. It is possible that contributors omit to collect their refunds, or that reporters do not take their portion of the deposit, leaving value stuck inside the contract. To avoid this, we introduce two new parameters:

• tc: The time the creator must wait before taking the remaining refund (dr) for a specific contribution, where tc > t. This also incentivizes creators to deploy a model, since they may get the chance to claim a significant portion of d. Contracts may want to enforce that tc is much greater than t, the time to wait before attempting a refund, giving contributors even more time to get their deposit back and preventing the creator from taking too much (tc ≫ t).

• ta: The time anyone must wait before taking the remaining refund (dr), where ta ≥ tc > t. This is used if the creator omits taking the “stuck” value from the contract.

Indeed, there can be more variants of these, such as a value, td, for data contributors with refunded submissions (n(c) > 0), where ta ≥ td ≥ tc. These measures keep funds circulating through the system, preventing unclaimed value from accumulating inside the smart contract.
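
The resulting claim windows can be summarized in one function, sketched below with assumed role checks; elapsed is the time since (x, y) was submitted, and n again maps contributors to their count of refunded samples.

    def may_take_remaining(elapsed, caller, contributor, creator, n,
                           t, t_c, t_d, t_a):
        """Decide whether caller may take the remaining deposit dr right now."""
        assert t < t_c <= t_d <= t_a  # windows open in this order
        if caller == contributor:
            return elapsed >= t    # the normal refund window
        if caller == creator:
            return elapsed >= t_c  # creator may sweep unclaimed value
        if n.get(caller, 0) > 0:
            return elapsed >= t_d  # contributors with refunded submissions
        return elapsed >= t_a      # finally, anyone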

In the contemporary digital ecosystem, the computational capabilities of end-user devices such as smartphones, tablets, and laptops remain largely underutilized. Within this framework, users engage with the network by executing dApps that run specific algorithms on these devices. This engagement facilitates data validation across the network while contributing to the iterative refinement of the underlying models. Furthermore, each validation node is assigned a reputation score based on its historical contributions of computational resources and its duration of network participation. Nodes with higher reputations are entrusted with larger tasks and, correspondingly, receive more substantial rewards. This mechanism ensures a meritocratic distribution of tasks and incentives, reinforcing the network's integrity and efficiency.
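
The text leaves the reputation formula unspecified; the sketch below is one assumed way to blend the two stated factors, compute contributed and time in the network, into a single score. The weights and normalization caps are arbitrary illustrations.

    def reputation(compute_hours: float, days_in_network: float,
                   w_compute: float = 0.7, w_tenure: float = 0.3) -> float:
        """Score in [0, 1]; higher-reputation nodes get larger tasks and rewards."""
        compute_part = min(compute_hours / 1000.0, 1.0)  # cap at 1,000 hours
        tenure_part = min(days_in_network / 365.0, 1.0)  # cap at one year
        return w_compute * compute_part + w_tenure * tenure_part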