The Urgent Need for More Training Data

Sep 6, 2024

Many AI prognosticators point to a lack of training data as a fundamental bottleneck to developing AI models; articles from the Economist, the NY Times, and the WSJ cover how we’re close to exhausting publicly available training data.

But the reality is that we’re nowhere close to running out of training data; we’re just close to running out of publicly available data. The vast majority of humanity’s useful data isn’t public; it’s proprietary or semi-proprietary. From textbooks to medical records to contracts to spreadsheets, massive troves of data are currently unusable for training.

This data typically lives within organizations that have never made their data available before; with the rise of AI, they’re now considering opening up access to that data for model training. Examples of companies beginning to license semi-proprietary data are everywhere (Vox Media and OpenAI, Yelp and Perplexity, Informa and Microsoft, Shutterstock and Meta).

So what’s the holdup in getting this data online? Here are the top eight reasons I’ve seen why such valuable data isn’t available for AI today. To enable continued rapid gains in AI, these are the problems that will need solutions.

  1. Data storage. Today, some data that would be useful for AI is stored in formats inaccessible to any model. The storage rooms of hundreds of media companies hold decades of footage on old 8mm film strips piled in boxes. Lab shelves across the country are filled with microscope slides containing sections of cancerous tumors that could help physicians diagnose or treat deadly diseases. There’s work required to bring this data online and get it into a format ready for use by model developers.

  2. Business cases. Even when companies sense they have valuable data on their hands or in their storage rooms, it can be difficult to make the business case for converting that data into the right format. The costs can be high: digitizing an hour of 8mm film runs about $100, and a local TV broadcaster could have hundreds of thousands of hours of film archived (a back-of-envelope sketch of that math follows the list). Sometimes companies have a hard time even knowing where the converted data would be stored and how it would be monetized, and therefore what the cost would be. And on the other side, it’s difficult to assess demand in a new AI market. Without a clear business case, inaction prevails.

  3. Discovery. Even if companies are ready to monetize their data, they face the challenge of finding and connecting with the right buyers. The AI market is vast and fragmented, making it difficult for data holders to identify where demand lies. Take textbooks as an example: there is a well-established market connecting publishers with schools, but while textbook publishers may hear of interest in their content from AI companies, it is difficult to connect with the right customers in an entirely new buyer group. Unlike the business-case problem, where the challenge is justifying the value of converting data, the discovery problem is one of visibility and reach: ensuring that a data holder’s offerings are seen by potential buyers who are actively seeking specific types of data.

  4. Personnel. Some companies have data assets that would enable models to solve new problems, but they don’t have the organizational resources to respond to or pursue those sales opportunities. If you want to license your data, you need a salesperson to talk with potential buyers; you need a contract, which means you need counsel who can provide legal language and guidance; and you need technical resources to fulfill the deal terms and get the data to the buyer. It takes time and organizational momentum to find and onboard those people.

  5. Risk. Some companies are waiting on the sidelines until the legal ramifications of data use in AI crystallize. Established case law in other areas of business means companies can operate there with informed decisions about risk; from a legal perspective, much of the AI space today is “unknown unknowns.” Who bears responsibility if a model leaks data used in training? Who bears responsibility if a model recommends an action that damages the end user? Can a company license data generated by people before they had the chance to opt in or out of an AI use case? Lawyers have indicated that some companies are waiting for the first lawsuits to settle before entering the AI space.

  6. Contracting. One of the biggest challenges in AI data today is the contracting process itself. Because there is virtually no legal precedent to build on, each AI data deal is bespoke; companies come in with different expectations and understandings about intellectual property, liability, value, and other key terms. Resolving all of that is expensive. So far, companies like Reddit and News Corp have secured deals with model developers worth tens of millions of dollars annually, which justifies the expense of a custom contracting process. Standardized deal terms and procedures could lower the friction associated with deals.

  7. Labeling. Some datasets are used to build models that differentiate (is it this disease or not, is it this customer problem or not), and those models need datasets that include metadata. It’s not enough to just give a self-driving car millions of pictures of traffic lights; the model needs to know which pictures include red lights, yellow lights, or green lights, so it knows when to stop and when to go (see the sketch after this list). The practice of labeling data can take millions of hours, which can hold up a data deal or model development.

  8. Privacy. Companies with data are wary of models leaking data used in their training, revealing to a model’s end user things that should remain confidential. These data holders want to do right by the people and companies who trusted them with the data in the first place. Model developers are working on protections here, but for a data supplier new to AI, a lot of trust has to be built from scratch before they are comfortable with the company, the use case, and the protections in place for their data. Where applicable, that also includes proper de-identification of the data (a minimal sketch follows below).
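
To make the business-case math in item 2 concrete, here is a back-of-envelope sketch. The per-hour rate comes from the figure above; the archive size is an assumed example, not data from any real broadcaster.

```python
# Back-of-envelope digitization cost for a hypothetical broadcaster archive.
# COST_PER_HOUR_USD reflects the ~$100/hour figure above; archive_hours is
# an assumed example, not a real broadcaster's inventory.
COST_PER_HOUR_USD = 100
archive_hours = 200_000  # "hundreds of thousands of hours"

digitization_cost = COST_PER_HOUR_USD * archive_hours
print(f"Estimated digitization cost: ${digitization_cost:,}")  # $20,000,000
```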
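
To illustrate the labeling problem from item 7, here is a minimal sketch of the difference between raw images and labeled records; the file names, field names, and labels are made up for illustration.

```python
# Raw data alone: the model has pixels but no ground truth to learn from.
raw_images = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]

# Labeled data: each image carries the metadata a model needs to learn
# when to stop and when to go. These records are hypothetical.
labeled_images = [
    {"file": "frame_0001.jpg", "light": "red"},
    {"file": "frame_0002.jpg", "light": "green"},
    {"file": "frame_0003.jpg", "light": "yellow"},
]

for record in labeled_images:
    action = "stop" if record["light"] in ("red", "yellow") else "go"
    print(record["file"], record["light"], "->", action)
```

Producing those labels for millions of images is the slow, expensive step; the loop above only works because someone attached the "light" field to every frame first.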
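
And for the de-identification mentioned in item 8, here is a deliberately minimal sketch of the idea: strip direct identifiers and replace the name with a salted pseudonym before any record leaves the data holder. The field names and salt are hypothetical, and real de-identification regimes (e.g., HIPAA Safe Harbor) require far more than this.

```python
import hashlib

SALT = "example-salt"  # in practice, a secret held by the data supplier

def deidentify(record: dict) -> dict:
    """Drop direct identifiers and coarsen quasi-identifiers (illustrative only)."""
    pseudonym = hashlib.sha256((SALT + record["name"]).encode()).hexdigest()[:12]
    return {
        "patient_id": pseudonym,                     # stable pseudonym, not the name
        "diagnosis": record["diagnosis"],            # the content a model would learn from
        "age_band": f"{record['age'] // 10 * 10}s",  # coarsened instead of exact age
    }

print(deidentify({"name": "Jane Doe", "ssn": "000-00-0000",
                  "age": 47, "diagnosis": "melanoma"}))
```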

I see companies trying to push through this friction today, as we partner with early adopters in the space. It is our vision that the Protege platform can help establish norms that reduce friction in each of these areas, opening up more data for better and faster problem-solving in AI.