Protege AI: Navigating Training Data, Privacy and Ethics

Oct 9, 2024

In a recent interview, Bobby Samuels, co-founder of Protege, shed light on the company's mission to solve one of AI's most pressing challenges: access to high-quality training data. Samuels, with a background in data connectivity and privacy from LiveRamp and Datavant, launched Protege in February 2024 to address what he sees as the "biggest bottleneck for AI."

"The biggest bottleneck for AI is enabling the sharing and accessing of training data," Samuels explained. Protege aims to build a platform that facilitates this across various industries, starting with healthcare. 

Healthcare: A Strategic Starting Point

Protege's decision to begin with healthcare is strategic, leveraging Samuels' industry knowledge and existing relationships. "We've built the richest set of training data in healthcare," Samuels noted. This approach also positions Protege to tackle one of the most sensitive areas in data privacy.

Given the sensitive nature of healthcare data, Protege has placed privacy at the forefront of its operations. "Privacy is central to what we do. We started a privacy review before we wrote a line of code," Samuels emphasized. All data on Protege's platform is de-identified, adhering to HIPAA regulations and ethical standards. 

This commitment to privacy extends beyond mere compliance. Samuels stated, "The way that we operate is to assume that everything anyone ever says or writes is public information. We're in a space that rightfully has more scrutiny, and we need to be thoughtful, prudent, and patient-centric in what we do."

Sustainable Data Business in the AI Era

Protege is tackling the challenge of building a sustainable data business in an era where many AI training data deals are one-time transactions. Samuels outlined their approach: "Our take is that if you are using data for ongoing training, you should continue to pay for that data." 

This model aims to align the value received from using the data with the value provided by data sources. Samuels acknowledged the difficulties in enforcing such agreements but emphasized the importance of contractual guarantees and careful vetting of customers.

Addressing the current state of data aggregation, often characterized by an "act now, ask forgiveness later" approach, Samuels distanced Protege from such practices. "We are very much data source centric. We are only successful if our data partners are incredibly successful, and so both from a commercial perspective but also from a moral perspective, we want to make sure data holders are adequately compensated for their data assets," he stated.

Samuels predicted that while past aggressive data practices might have been commercially successful, the landscape is changing. "We really strongly believe that data holders should be compensated for their data, and that's something we care deeply about." 

AI's Impact on Traditional Data-as-a-Service Companies

Discussing AI's impact on traditional data-as-a-service (DaaS) companies, Samuels outlined three potential outcomes: 

1. Continuity in deterministic analysis
2. Increased value of existing data due to new AI applications
3. Newfound value in previously underutilized data

"The North star that I have on that one is Reddit," Samuels explained. "If you rewind 10 years ago, Reddit data was probably an academic project. Now it is a core part of Reddit's valuation story to the market because of how they can use that for training data." 

While Protege operates in the AI space, Samuels clarified that AI is not deeply integrated into their core operations yet. "I think of ourselves as a network business that is in AI versus an AI business that has a network," he said. The company uses AI tools like ChatGPT for tasks such as brainstorming and data analysis but maintains a balance between AI and human input. 

The Future of AI and Healthcare

Looking ahead, Samuels expressed optimism about AI's potential impact on healthcare outcomes. "What I hope is we get to a world where you go to a doctor and they say, okay, great. You have these treatments, you have this history, you have this genetic makeup. Here's a treatment that makes sense. And I hope we get there and I think we can," he envisioned.

Samuels acknowledged the current limitations of AI, particularly in areas requiring high accuracy. "I don't think that there's basically anything that folks should do without reviewing," he cautioned, emphasizing the need for human oversight in AI-generated outputs.

As Protege navigates the complex landscape of AI training data, it's clear that the company is taking a thoughtful, privacy-centric approach. By focusing on sustainable data practices and ethical considerations, Protege aims to play a crucial role in advancing AI capabilities across industries, starting with healthcare. As the AI field continues to evolve rapidly, companies like Protege will likely play a pivotal role in shaping how data is sourced, used, and valued in the development of AI technologies.