Feb 2, 2026
Protege, the platform for proprietary AI training data, today announced a new partnership with Sunain, a multimodal data company that collects audio, video, gameplay, and egocentric data through a distributed global contributor network. Through this partnership, Sunain's datasets will be made available for AI training and evaluation via the Protege platform.
Here’s a summary of the types of unique data that Sunain and Protege have unlocked for AI model development:
Audio
Natural, multi-speaker conversational audio across underrepresented languages and dialects. The collection includes emotion-rich speech, expert conversations, code-switching between languages, and longitudinal recordings that capture how relationships and speech patterns evolve over time. This data is designed for training expressive, real-world speech and conversational AI systems.
Video
High-quality real-world video data with dense annotations. The collection includes video editing pairs (raw footage alongside edited versions) and cinematic-style clips. This data is structured for video generation, instruction-following, and visual reasoning research.
Gameplay
Large-scale gameplay video spanning over a million clips across hundreds of game titles, with both frame-level action/control signals and HUD-free video-text annotations. Games provide a unique environment for capturing action-dense interactions and decision-making in real time. This data supports training of interactive agents, world models, and video generation systems.
Egocentric
Multi-view egocentric home data captured via head-mounted and wrist-mounted cameras. The collection focuses on real human manipulation and household tasks, with task segmentation and hand keypoint annotations. This data is designed for embodied AI, robotics, and dexterous manipulation research.
Partnering for Scalable Human Behavior Data
Across all modalities, Sunain's approach centers on scale through people: collecting real, unscripted human behavior via a global contributor network rather than relying on synthetic or simulated alternatives.
"Sunain exists to make the AI economy work for everyone, not just experts," said Shahbaz Magsi, co-founder of Sunain. "We’re excited to partner with Protege to continue to connect people everywhere with AI training opportunities and have a direct role in building the foundations of AI."
"Authentic human data across multiple modalities remains one of the hardest things to source at scale," said Grant Murphy-Herndon, Protege’s GM of New Verticals. "Sunain's contributor network captures the diversity of languages, dialects, and behaviors that models need to generalize in the real world. We're glad to make this data accessible to AI teams through the Protege platform."
About Sunain
Sunain operates what it calls The Human Data Network, a platform where contributors across 30+ countries and 50+ languages and dialects record conversations, play games, and capture real-world activities in exchange for compensation. The company's core thesis is that authentic, unscripted human behavior remains a critical input for training AI systems, particularly in modalities like speech, video, and embodied interaction.
Rather than relying on actors, simulations, or synthetic generation, Sunain's datasets are sourced directly from real people in real environments, capturing naturalistic behaviors that are difficult to replicate in controlled settings. This includes conversational dynamics between people who know each other, emotional expression, regional dialects, and physical interactions with everyday objects.
About Protege
Protege is the trusted source for finding and sharing AI training data, enabling seamless and compliant data exchange. By empowering data holders and connecting them with AI developers, Protege supports the creation of thoughtful AI solutions. Protege's scientific and strategic approach allows AI teams to quickly discover and license a wide array of curated datasets across industries, shortening the time needed to obtain AI-ready data for model development.
To learn more about how your organization can unlock new revenue by ethically licensing your content for AI, or access datasets purpose-built for AI, fill out our partner information form or contact the Protege team at contact@withprotege.ai.

