How OpenLedger, which raised tens of millions of dollars, is reshaping the distribution of data value
In an era where data quality is paramount, whoever can solve the problem of data value distribution will be able to attract the best data resources.
Written by: Haotian
Is data labeling, that “hard and tiring” grunt work, quietly becoming a hot commodity? @OpenledgerHQ, which has raised over $11.2 million in funding led by Polychain, is targeting the long-ignored pain point of “data value distribution” with its PoA + Infini-gram mechanism. Let’s look at it from a technical perspective:
To be honest, the biggest “original sin” of the current AI industry is the unfair distribution of data value. OpenLedger’s PoA (Proof of Attribution) aims to build a “copyright tracking system” for data contributions.
Specifically, data contributors upload content to domain-specific DataNets, and each data point is permanently recorded along with contributor metadata and a content hash.
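To make the idea concrete, here is a minimal sketch in Python of what such a DataNet record could look like; the field names, the SHA-256 hash, and the flat JSON layout are my own assumptions for illustration, not OpenLedger’s actual schema.

```python
import hashlib
import json
import time

def make_datanet_record(contributor_id: str, content: str, domain: str) -> dict:
    """Hypothetical DataNet entry: the raw content is hashed, and the hash is
    stored alongside contributor metadata so the data point can be referenced
    later during attribution. Field names are illustrative only."""
    return {
        "domain": domain,                # which domain-specific DataNet this belongs to
        "contributor": contributor_id,   # who gets credited (and paid) for this data point
        "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        "timestamp": int(time.time()),
    }

record = make_datanet_record("contributor_42", "Q: ... A: ...", "medical-qa")
print(json.dumps(record, indent=2))
```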
After a model is trained on these datasets, attribution happens at inference time, when the model generates output: PoA traces which data points influenced that output by analyzing match spans or influence scores, and these records determine each contributor’s proportional share.
When a model earns fees through inference, PoA ensures those revenues are distributed in proportion to each contributor’s influence, creating a transparent, fair, and on-chain reward mechanism.
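A minimal sketch of that proportional payout logic, assuming PoA has already produced a per-contributor influence score for a given output; the function name and the settlement unit are hypothetical, not part of OpenLedger’s documented interface.

```python
def distribute_inference_fee(fee: float, influence_scores: dict[str, float]) -> dict[str, float]:
    """Split a single inference fee among contributors in proportion to the
    influence attributed to their data points (scores assumed non-negative)."""
    total = sum(influence_scores.values())
    if total == 0:
        return {c: 0.0 for c in influence_scores}
    return {c: fee * s / total for c, s in influence_scores.items()}

# One inference call generated a fee of 0.10 (in whatever unit the chain settles in);
# PoA attributed most of the output's influence to contributor_a's data.
payouts = distribute_inference_fee(0.10, {"contributor_a": 0.7, "contributor_b": 0.2, "contributor_c": 0.1})
print(payouts)  # {'contributor_a': 0.07, 'contributor_b': 0.02, 'contributor_c': 0.01}
```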
In other words, PoA tackles the fundamental contradiction of the data economy. The old logic was simple and crude: AI companies obtained massive amounts of data for free, commercialized their models, and made a fortune, while data contributors received nothing. PoA achieves “data privatization” through technical means, allowing each data point to generate clear economic value.
I think that once this conversion from “free-riding mode” to “distribution according to contribution” lands, the incentive logic for contributing data will change completely.
Moreover, PoA uses a tiered strategy to handle attribution for models of different scales: for small models with millions of parameters, the influence of each data point can be estimated by analyzing the model’s influence functions, which is computationally manageable; for medium and large models, that approach becomes computationally infeasible, and this is where Infini-gram comes in.
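A toy illustration of that tiered routing; the parameter cutoff below is purely an illustrative guess, not a threshold OpenLedger has published.

```python
# Hypothetical routing for a tiered attribution strategy: small models get
# influence-function analysis, larger ones fall back to corpus matching.

SMALL_MODEL_PARAM_LIMIT = 100_000_000  # assumed cutoff for illustration only

def choose_attribution_method(num_parameters: int) -> str:
    if num_parameters <= SMALL_MODEL_PARAM_LIMIT:
        return "influence_functions"   # tractable: estimate how each training point shifts the model
    return "infini_gram_matching"      # internals too costly to analyze: match outputs against the corpus

print(choose_attribution_method(5_000_000))      # influence_functions
print(choose_attribution_method(7_000_000_000))  # infini_gram_matching
```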
So what exactly is Infini-gram? The problem it tackles sounds almost impossible: precisely tracing each output token of a medium-to-large black-box model back to its source in the training data.
Traditional attribution methods rely mainly on analyzing a model’s influence functions, but they essentially break down for large models. The reason is simple: the larger the model, the more complex its internal computations, and the cost of the analysis grows explosively until it becomes computationally infeasible, which is a non-starter for commercial deployment.
Infini-gram takes a completely different approach: since the model’s internals are too complex to analyze, it searches for matches directly in the original training data. It builds a suffix-array index and uses dynamically selected longest matching suffixes instead of traditional fixed-window n-grams. In simple terms, when the model outputs a sequence, Infini-gram finds, for each token’s context, the longest exact match in the training data.
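To show the core idea, here is a deliberately naive sketch of longest-suffix matching; the real Infini-gram queries a suffix-array index over the tokenized corpus, so it avoids the linear scan used below, which is only for illustration.

```python
def longest_suffix_match(context: list[str], corpus: list[list[str]]) -> tuple[int, int]:
    """For a generated token's context, find the longest suffix of that context
    that appears verbatim somewhere in the training corpus.
    Returns (match_length, document_index), or (0, -1) if nothing matches.
    A real system would query a suffix-array index instead of scanning."""
    for length in range(len(context), 0, -1):       # try the longest suffix first
        suffix = context[-length:]
        for doc_idx, doc in enumerate(corpus):
            for start in range(len(doc) - length + 1):
                if doc[start:start + length] == suffix:
                    return length, doc_idx
    return 0, -1

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["dogs", "sat", "on", "the", "porch"]]
print(longest_suffix_match(["cat", "sat", "on", "the"], corpus))  # (4, 0): document 0 contains the full suffix
```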
The resulting performance figures are impressive: on a corpus of 1.4 trillion tokens, queries return in roughly 20 milliseconds, and the index stores just 7 bytes per token, which works out to roughly 10 TB of index for the full corpus. More importantly, it achieves precise attribution without inspecting the model’s internal structure or running complex computations. For AI companies that treat their models as trade secrets, this is practically a tailor-made solution.
Bear in mind that existing data attribution solutions are either inefficient, imprecise, or require access to the model’s internals. Infini-gram strikes a balance across all three dimensions.
In addition, I find OpenLedger’s concept of DataNets, datasets that live on-chain, particularly clever. Unlike traditional one-off data transactions, DataNets let data contributors keep earning a share of revenue whenever their data is used in inference.
In the past, data annotation was tedious work with meager, one-time pay. Now it becomes an asset that generates ongoing income, under a completely different incentive logic.
While most AI+Crypto projects are still working on relatively mature directions such as compute leasing and model training, OpenLedger has chosen to tackle the hardest problem of all: data attribution. This technology stack may well redefine the supply side of AI data.
After all, in an era where data quality reigns supreme, whoever solves the problem of data value distribution will attract the highest-quality data resources.
To sum up, the combination of OpenLedger’s PoA and Infini-gram not only addresses the technical challenges but, more importantly, offers the whole industry a new logic for value distribution.
As the compute arms race cools and competition over data quality heats up, this kind of technical route will certainly not be the only one. Expect this track to see multiple solutions competing in parallel: some focused on attribution accuracy, some on cost efficiency, others on ease of use, each exploring its own path toward optimal data value distribution.
Which one ultimately wins will depend on whether it can attract enough data providers and developers.