Decentralized Data Collection and Labeling at Scale

Feb 14, 2025

By Joules Barragan, Yingyi Hu | Sahara AI

Executive Summary

The explosive growth of Generative AI (GenAI) has created an unprecedented need for high-quality, labeled data—the foundation for model training, RAG pipelines, validation, and fine-tuning. Traditional data labeling approaches, while effective at smaller scales, face challenges when adapting to the demands of modern AI development. 

These challenges stem from the diverse and specialized nature of today’s AI projects, which often require equally diverse and specialized datasets. Creating and labeling this data demands significantly more thought, effort, and precision. The diversity of these needs makes it difficult for any single labeler or team to handle every type of task effectively.

Decentralization addresses these issues by providing access to a broader pool of contributors, making it possible to meet varied labeling demands while ensuring diversity in perspectives and expertise. However, decentralization also introduces challenges around maintaining trust, quality, and efficiency. Addressing these issues is critical to creating a decentralized ecosystem capable of meeting the diverse and growing needs of modern AI projects.

Sahara AI’s Data Services Platform introduces a first-of-its-kind decentralized alternative, leveraging distributed contributors to perform data collection and annotation at scale. In our initial POC (Season 1), more than 10,000 global participants completed labeling tasks over the course of a month, with decentralized peer review, incentive mechanisms, and quality assurance processes implemented to ensure data integrity and reliability. The results of this POC demonstrated that decentralized data annotation is not only viable but scalable, efficient, and capable of delivering high accuracy:

  • Decentralized peer review achieved 92% accuracy in internal QA, highlighting its scalability and effectiveness in data collection and labeling.

  • Only 83% of simple research tasks and 67% of more in-depth research tasks passed decentralized peer review, demonstrating the peer review system’s ability to filter out poor submissions and maintain the accuracy and reliability needed for meaningful datasets.

  • While technical and labor-intensive tasks had an acceptance rate of only 10%, they still yielded tens of thousands of high-value datapoints.

This report explores the key results and insights from Season 1 of the Data Services Platform, as well as the broader implications for decentralized AI data labeling.

Optimizing Accuracy, Scalability, and Efficiency in Decentralized Data Collection and Labeling

Unlike traditional systems, decentralized data collection systems rely on contributors from diverse regions and expertise levels. This creates several challenges that need to be addressed for decentralized data collection to become a viable large-scale alternative:

  • Ensuring Quality: Distributed contributors may have varying knowledge and accuracy rates, making it critical to implement effective quality assurance processes.

  • Scalability: Managing thousands of contributors without sacrificing quality or speed requires dynamic task allocation and efficient review mechanisms.

  • Incentive Alignment: Structuring rewards that encourage high-quality contributions rather than quantity is essential to long-term success.

  • Fraud Mitigation: Distributed systems are vulnerable to automated, low-effort or malicious submissions that aim to exploit reward systems, requiring robust detection mechanisms.

To address these issues, Sahara AI has implemented multi-layered validation systems, peer reviews, and dynamic reward structures designed to align contributor efforts with quality outcomes. Our validation process for submitted datapoints was designed as follows (a simplified code sketch of the full flow appears after the list):

  1. Automated Quality Screening: Initial quality control is performed by machine learning models designed to flag duplicate, incomplete, or inconsistent submissions. These automated checks help reduce manual review workloads and ensure that only potentially valid data progresses further.

  2. Decentralized Peer Review: Data submissions that pass automated screening are reviewed by other contributors through a decentralized peer review mechanism. A majority consensus determines whether a submission is accepted or dropped. This process ensures scalability while benefiting from diverse perspectives.

  3. Task-Specific Machine Review:  For tasks with well-defined criteria—such as determining whether a jailbreak attempt succeeds or fails—machine review can provide precise, consistent evaluations. These tasks often involve binary outcomes or objective benchmarks, making them ideal for automated processing. When applicable, machine review can act as a "gold standard," minimizing the need for human intervention, scaling effortlessly to handle large datasets, and maintaining high accuracy.

  4. In-House Human QA: A randomly selected subset of accepted peer-reviewed submissions undergoes manual review by the Sahara AI team to measure overall accuracy and identify any patterns of low-quality or fraudulent contributions. This layer serves as a benchmark to improve future validation processes.
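
For concreteness, below is a minimal Python sketch of how a single submission might flow through these four layers. The data structures and parameters here (Submission, REVIEWERS_PER_ITEM, the reviewer.vote interface, QA_SAMPLE_RATE) are illustrative assumptions rather than the platform’s actual implementation; the sketch only shows how automated screening, majority-vote peer review, optional machine review, and random QA sampling chain together.

```python
import random
from dataclasses import dataclass

REVIEWERS_PER_ITEM = 5   # assumed number of peer reviewers sampled per submission
QA_SAMPLE_RATE = 0.05    # assumed fraction of accepted items sent to in-house QA

@dataclass
class Submission:
    contributor_id: str
    task_id: str
    payload: dict           # the labeled datapoint itself
    status: str = "pending"

def automated_screening(sub, seen_payloads):
    """Stage 1: flag duplicate, incomplete, or inconsistent submissions."""
    if not sub.payload or any(v in (None, "") for v in sub.payload.values()):
        return False                                    # incomplete
    key = tuple(sorted(sub.payload.items()))
    if key in seen_payloads:
        return False                                    # duplicate
    seen_payloads.add(key)
    return True

def peer_review(sub, reviewers):
    """Stage 2: majority consensus among randomly sampled peer reviewers."""
    panel = random.sample(reviewers, REVIEWERS_PER_ITEM)
    votes = [reviewer.vote(sub) for reviewer in panel]  # assumed boolean vote interface
    return sum(votes) > len(votes) / 2                  # accepted only on majority approval

def machine_review(sub, rule):
    """Stage 3: rule-based check for tasks with objective, binary criteria."""
    return rule(sub.payload)

def run_pipeline(submissions, reviewers, rule=None):
    """Screen every submission, route it to peer or machine review,
    then sample a subset of accepted items for in-house human QA (stage 4)."""
    seen, accepted = set(), []
    for sub in submissions:
        if not automated_screening(sub, seen):
            sub.status = "rejected"
            continue
        passed = machine_review(sub, rule) if rule else peer_review(sub, reviewers)
        sub.status = "accepted" if passed else "rejected"
        if passed:
            accepted.append(sub)
    qa_sample = random.sample(accepted, int(len(accepted) * QA_SAMPLE_RATE)) if accepted else []
    return accepted, qa_sample
```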

Given the decentralized nature of this approach, we anticipated several key challenges, particularly the risk of participants colluding to submit and approve low-quality work for mutual benefit. To address these risks, we integrated the following safeguards:

  • Pre-task Qualification Quizzes: Contributors were required to pass task-specific knowledge assessments, ensuring that only participants with relevant expertise were allowed to annotate or review data.

  • Dynamic Incentive Structures: Reward mechanisms were designed to prioritize accuracy by granting higher rewards for reliable annotations and reviews while applying penalties for incorrect submissions, such as partial or permanent bans from the platform.

Tasks were divided into categories based on complexity, with participants incentivized through a tiered reward system: 

  • Beginner Tasks included simple research-based labeling, such as answering questions related to smart contracts, dapp development, and styling advice (e.g., best date outfits).

  • Intermediate Tasks required more in-depth research, like identifying top AI influencers on Twitter or researching cryptocurrency investment strategies and selecting ideal first-date gifts.

  • Advanced Tasks involved jailbreaking common AI models like Qwen and LLaMA or designing AI personas.

  • Expert Tasks included more sophisticated red team challenges, such as jailbreaking common AI models to produce explicit or adult content.

The more complex the task, the higher the reward, both to reflect the added time needed to complete higher-difficulty tasks and to reward contributors with more specialized knowledge. All rewards were issued as Sahara Points, and only accepted datapoints were rewarded.
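
As a rough illustration of how an acceptance-gated, tiered reward scheme like this could be settled, the sketch below maps task difficulty to a Sahara Points payout and tracks rejections for penalties. The specific point values, the rejection cap, and the ban rule are invented for illustration and are not the actual Season 1 parameters.

```python
# Illustrative only: point values, rejection cap, and ban rule are assumptions.
TIER_REWARDS = {            # Sahara Points per *accepted* datapoint (hypothetical values)
    "beginner": 10,
    "intermediate": 25,
    "advanced": 60,
    "expert": 120,
}
REJECTION_LIMIT = 20        # assumed number of rejected submissions before penalties apply

def settle_rewards(contributor, submissions):
    """Credit points for accepted datapoints only; rejected work counts toward penalties."""
    earned, rejected = 0, 0
    for sub in submissions:
        if sub["status"] == "accepted":
            earned += TIER_REWARDS[sub["tier"]]
        else:
            rejected += 1
    contributor["points"] = contributor.get("points", 0) + earned
    contributor["banned"] = rejected >= REJECTION_LIMIT
    return contributor
```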

Unlike beginner and intermediate tasks, advanced and expert tasks were evaluated using machine reviews rather than decentralized peer reviews due to their technical complexity and need for precise evaluation criteria.

Key Findings:

Decentralized peer review achieved 92% accuracy in internal QA, proving its scalability and effectiveness

An analysis of the research and knowledge-based tasks revealed that 92% of datapoints accepted through decentralized peer review passed in-house quality assurance (QA) checks. This demonstrates that decentralized peer review can act as an effective first-layer filter for data quality, even at scale, as contributors are motivated to provide accurate assessments when properly incentivized.

These results indicate that the combination of decentralized peer review with complementary validation mechanisms creates a system that balances scalability and quality. Decentralized peer review, by design, allows rapid processing of large volumes of data, while the addition of automated checks and human oversight ensures that low-quality submissions are minimized. The Data Services Platform’s structured incentive system further aligns participants’ behavior with the goal of high-quality outputs.

The success of this small-scale POC highlights the potential of decentralized peer review as a scalable and cost-effective alternative to traditional centralized data annotation. By reducing reliance on expensive, centralized QA teams, this model enables AI projects to achieve high-quality data annotation through a decentralized framework, setting the foundation for scalable, distributed AI data collection.

Beyond scalability and cost efficiency, the success of decentralized peer review in Season 1—supported by 10,000 participants from diverse backgrounds and regions—proves that anyone with internet access can meaningfully contribute to the AI economy. This inclusive model enabled global contributors, regardless of location or expertise, to participate in data annotation and AI development. 

83% of simple research tasks and 67% of more in-depth research tasks passed decentralized peer review

Season 1 demonstrated strong performance in research-based tasks, with 83% of simple research task submissions and 67% of more in-depth research task submissions passing decentralized peer review. These tasks, ranging from basic information gathering to more complex, research-intensive challenges, showcase the effectiveness of the peer review system and the importance of properly incentivizing contributors.

Simple research tasks (beginner tasks) involved basic information retrieval and labeling, such as answering common questions about crypto or everyday advice like choosing the best date outfit. 83% of submissions passed peer review, and in-house QA confirmed a 94% accuracy rate for accepted submissions; the high acceptance rate is likely due to the accessible nature of the topics, which required common knowledge or subjective responses that were easy to answer and review consistently.

More in-depth research tasks (intermediate tasks) required contributors to perform more thorough investigations and critical evaluations. Examples include identifying top AI influencers on Twitter (now X), researching effective cryptocurrency investment strategies, and selecting ideal first-date gifts based on various parameters. These tasks were more demanding, leading to a 67% peer review acceptance rate. However, in-house QA confirmed an 88% accuracy rate for accepted submissions, indicating that the peer review system effectively identified and rejected low-quality or incomplete responses.

This data highlights that when contributors are properly incentivized, they consistently deliver quality outputs, even for more challenging tasks. Simple research tasks naturally yielded higher acceptance rates due to their accessibility, while more in-depth research tasks required more rigorous evaluation but still produced high-quality contributions. The peer review system’s ability to filter out poor submissions ensures that accepted datapoints maintain the accuracy and reliability needed for meaningful datasets.

As Sahara AI scales participation to 100,000 contributors for Season 2, we have further refined the annotation and peer review process to eliminate low-quality contributors earlier on.

While technical and labor-intensive tasks had an acceptance rate of only 10%, they still yielded tens of thousands of high-value datapoints

Technical and labor-intensive tasks in Season 1 required contributors to perform highly specialized work. For advanced tasks, this involved creating jailbreaking prompts for large AI models like Qwen and LLaMA or designing AI personas. Expert tasks, on the other hand, involved advanced adversarial prompt generation, including creating explicit or boundary-pushing prompts for some of the most common LLMs. Despite their complexity and stringent review criteria resulting in only a 10% overall acceptance rate, these tasks still successfully produced more than 24,000 high-value datapoints essential for testing AI model safety and robustness.

The high volume of submissions (239,126 datapoints for advanced tasks, the highest among all task types) coupled with the complexity of the tasks naturally resulted in lower acceptance rates overall. These tasks attracted a large number of contributors due to the high payout in Sahara Points. While exams were required to access these tasks, Season 1 allowed broad participation without restrictions based on domain-specific expertise, contributing to the lower acceptance rates.

Advanced and expert tasks were reviewed using machine review instead of decentralized peer review due to the technical and binary nature of the evaluation criteria. The goal of the tasks was to determine whether the jailbreaking prompts succeeded or failed—an objective, rule-driven outcome that did not require subjective interpretation or human consensus. Machine review was more suitable for this purpose because it ensured consistent, scalable, and efficient processing of large submission volumes while applying strict, predefined rules to assess outcomes. In contrast, peer review—typically valuable for tasks requiring diverse human perspectives—was unnecessary for these straightforward evaluations.
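
As an illustration of what such a rule-driven machine review could look like, the sketch below accepts a jailbreak submission only if the target model’s response is not a refusal. The refusal heuristic and the query_model callable are placeholders; the actual evaluation criteria used in Season 1 are not described in this report.

```python
# Hypothetical machine review for jailbreak tasks: a submission is accepted only if the
# target model actually produced disallowed content (i.e., the jailbreak succeeded).
REFUSAL_MARKERS = (
    "i can't help with that",
    "i cannot assist",
    "i'm sorry, but",
)

def is_refusal(response: str) -> bool:
    """Crude placeholder for a refusal/safety classifier."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def machine_review_jailbreak(prompt: str, query_model) -> bool:
    """Binary outcome: did the jailbreak prompt bypass the model's safeguards?

    `query_model` is an assumed callable that sends the prompt to the target
    LLM (e.g., Qwen or LLaMA) and returns its text response.
    """
    response = query_model(prompt)
    return not is_refusal(response)
```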

The lower acceptance rates for these tasks (10%) reflect the difficulty of curating high-quality, domain-specific datasets, not issues with data quality. Many of these tasks required contributors to generate edge-case adversarial inputs designed to test the boundaries of LLMs, making strict review necessary to filter out noise and maintain high data integrity. The goal was not to accept a high percentage of submissions but to ensure that accepted datapoints were relevant, accurate, and valuable. This approach helps build datasets critical for stress testing AI models, enhancing their safety, robustness, and resistance to exploitation. The curated adversarial prompts that passed review serve as high-impact datapoints essential for improving model behavior under extreme conditions.

Improving performance and scalability for specialized tasks requires annotators with domain-specific expertise (e.g., music, engineering, security). Only qualified contributors should be able to engage in these complex tasks, ensuring that both the quantity and quality of accepted datapoints continue to improve.

What’s Next: Scaling Decentralized Data Labeling

The first phase of Sahara AI’s Data Services Platform proves that decentralized data collection and labeling can achieve high-quality results at scale. The next step is to expand from 10,000 contributors in Season 1 to 100,000 contributors in Season 2 to further refine these processes before the open release of the Data Services Platform.

Season 2 is now live. As we expand to 100,000 contributors, we have:

  • Introduced more advanced task segmentation for specialized data labeling.

  • Refined our automated verification models to enhance quality control.

  • Released multi-modal annotation capabilities to support text, image, and audio datasets.

To improve data quality and platform efficiency, we’ve also enhanced the banning mechanism for labeling tasks. Labelers are now banned earlier if their performance makes it mathematically impossible to meet the required accuracy threshold. For instance, if a task requires 80% accuracy, a user making two errors in the first five data points will be immediately disqualified. (A minimal sketch of this rule appears after the list below.) The benefits of this are twofold:

  • Faster removal of underperforming contributors ensures higher-quality datasets.

  • Clearer, immediate feedback for contributors on task performance.
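
The early-ban rule can be expressed as a simple check: disqualify a labeler as soon as their error count makes the required accuracy unreachable, even if every remaining datapoint were answered correctly. The sketch below assumes a fixed number of datapoints per task, which is not specified in this report, so the total parameter and the worked example are illustrative.

```python
def should_ban(errors: int, total: int, threshold: float = 0.8) -> bool:
    """Ban once the accuracy threshold is mathematically out of reach.

    Even if every remaining datapoint were labeled correctly, the best
    achievable accuracy is (total - errors) / total.
    """
    return (total - errors) / total < threshold

# Example with an assumed 9-datapoint task and an 80% threshold: two errors cap the best
# achievable accuracy at 7/9 ≈ 78%, so the ban fires as soon as the second error occurs.
print(should_ban(errors=2, total=9))   # True  -> disqualified immediately
print(should_ban(errors=1, total=9))   # False -> 8/9 ≈ 89% is still reachable
```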

By implementing these refinements, we aim to maintain the highest standards for task completion while improving the overall experience for everyone involved.

The decentralization of AI data services marks a major step forward in AI development, proving that decentralized data labeling is not only viable but also scalable, cost-effective, and inclusive. We look forward to sharing the data that comes out of Season 2.