Case Study: MyShell Scales Data Collection and Labeling to Improve Voice Models with Sahara AI
Sep 26, 2024
The Challenge: Meeting High Demands for Diverse Audio Data
MyShell AI, a decentralized AI platform connecting consumers, creators, and open-source researchers, set out to create cutting-edge text-to-speech (TTS) and voice cloning models. To do this, MyShell needed high-quality, multilingual, accent-diverse audio data delivered quickly and efficiently. However, they encountered several key challenges before partnering with Sahara AI:
Vendor Sourcing: Finding data vendors capable of delivering accent-specific audio at scale was difficult.
High Costs and Inefficiencies: Data labeling processes were costly and slow, impacting quality.
Delays in Model Training: Lengthy feedback loops hindered rapid model adaptation and improvements.
These obstacles limited MyShell’s ability to experiment with new model architectures and elevate their AI-native applications.
Enter Sahara AI.
"MyShell's commitment to open-source model development has found a strong ally in Sahara AI's precise data labeling services through their Sahara Data platform. Their contribution is a cornerstone of our vision for accessible AI. Together, we're forging a path toward innovation and open collaboration."
— MyShell Team
The Solution: Delivering High-Quality Data Collection and Labeling at Scale
Sahara AI’s Sahara Data platform provided MyShell with a comprehensive solution across three key projects. Using decentralized, AI-driven data collection and filtering, we enabled MyShell to gather high-quality, diverse datasets efficiently and at scale.
Project 1: Audio Samples Collection – Short Sentences
Sahara AI delivered 11,980 audio samples of short sentences in various English accents, including:
English with Chinese accent
English with American accent
English with Indian accent
English with British accent
This allowed MyShell to begin training their voice models with a wide variety of global accents.
Project 2: Audio Samples Collection – Long Text
To further enhance MyShell’s model capabilities, Sahara AI provided long-text audio samples in multiple languages and accents, ensuring diversity in voice data:
English with Chinese accent: 13,000 samples
English with American accent: 18,000 samples
English with Indian accent: 14,000 samples
English with Australian accent: 3,000 samples
English with British accent: 2,000 samples
English with German accent: 13,003 samples
Chinese: 14,068 samples
This wide range of data empowered MyShell to train their models for more global applications.
Project 3: Data Filtering
Sahara AI’s human-in-the-loop infrastructure enabled the filtering of over 180,000 audio samples across various languages. We carefully evaluated and refined the data so that only the highest-quality samples were used. The samples filtered included:
French: 47,678 samples
Spanish: 50,876 samples
German: 40,190 samples
Russian: 46,238 samples
This allowed MyShell to focus on model accuracy without compromising on data quality.
The Outcome: Over 2 Million Downloads and Thousands of GitHub Stars
With Sahara Data’s decentralized data collection and filtering, MyShell was able to significantly improve their model training process. Key outcomes included:
Faster Model Training: MyShell adaptively trained and improved their TTS and voice cloning models using real-time data, dramatically reducing time to market.
Open Source Success: The collaboration led to the successful development and open-sourcing of VoiceClone and MeloTTS, which garnered thousands of GitHub stars and over 2 million downloads on Hugging Face.
Transform Your AI Strategy with Sahara Data
Sahara Data is designed to meet the most challenging training data demands. Whether through decentralized infrastructure or on-premise deployment, Sahara Data provides a privacy-preserving, AI-centered, and human-in-the-loop approach that ensures high-value datasets for AI training.
Sahara Data by the numbers:
31+ Enterprise Clients
35+ Countries served
45+ Languages & Dialects covered
150+ Partner providers
30,000+ Vetted AI Trainers
Why Choose Sahara Data?
Automatic Labeling: Proprietary AI models handle labeling, matching human-level performance in mainstream tasks.
Human-in-the-Loop Refinement: Human experts refine and verify labels to ensure top-tier data quality.
Continuous Learning: Models learn from human input, improving labeling accuracy over time.
Through this optimized collaboration, Sahara AI helped MyShell gather the precise, high-quality datasets needed for efficient and cost-effective model training, a key factor in their project's success.
If you’re ready to scale your data collection and improve your AI models with Sahara Data, contact us today to discuss how our platform can support your AI training needs.