The Battle for AI Training Data

As AI companies strive to build ever-more powerful models, the sourcing of training data has become a growing legal and ethical minefield.

Good day, humans.

The Battle for AI Training Data has begun

In the evolving landscape of artificial intelligence (AI), the hunger for data to fuel AI models has sparked a battle among the giant corporations looking to dominate this new field. Meanwhile, the legal ambiguities surrounding data sourcing for AI training are slowly coming to light, raising crucial questions about accountability and transparency within the industry.

Today's Newsletter includes:

Giant corporations are fighting over data all over the world.

The Battle for AI Training Data

AI models' insatiable appetite for large data sources has led to a growing tension between technological progress and data privacy. News reports have shed light on the questionable practices some AI companies are employing to feed their voracious algorithms.

According to The Wall Street Journal, the internet is too small for AI: companies are hitting a wall in their efforts to collect high-quality training data, as the available data proves insufficient for their needs. The New York Times reports that, in response, companies like OpenAI, Google, and Meta have resorted to less-than-ethical methods to source this data.

One prominent example is OpenAI, the company behind ChatGPT. To train its advanced language model GPT-4, OpenAI reportedly used over a million hours of YouTube videos, a move that falls into a murky legal area regarding AI and copyright.

Similarly, Adobe has come under scrutiny for how it trained Firefly, its AI text-to-image generator. While Adobe has long promoted Firefly as an "ethical AI" tool, trained primarily on its licensed stock imagery, a Bloomberg report suggests that the training data also included images generated with competitor Midjourney, whose own data-sourcing practices are not entirely transparent.

In response to the growing scrutiny, major AI companies are rushing to make deals for data access:

  • OpenAI has forged partnerships with media outlets such as Le Monde and Prisa Media.

  • Reddit and Google entered into a deal, reportedly worth about $60 million a year, that gives Google access to Reddit’s API to train its generative AI models.

  • The Associated Press has licensed part of its archives to OpenAI.

  • Shutterstock, the stock photo archive, has signed a six-year deal with OpenAI to provide training data, which includes access to its photo, video, and music databases.

These collaborations not only provide access to high-quality data but also signal a commitment to ethical data sourcing and transparency.

The battle for AI training data has become a complex, high-stakes game, where the lines between innovation and overreach are often blurred. As AI companies continue to push the boundaries of what's possible, the need for a robust ethical framework and clear legal guidelines has never been more pressing. The future of AI-driven progress may well hinge on how successfully this delicate balance can be struck.

Generative AI exists because of the transformer

Visualize how LLMs work

The Financial Times published a beautiful article explaining how LLMs (Large Language Models) work, with very easy-to-understand visuals. It's worth a read.

Kaiber has introduced Transform 3.0, an Upgraded Video-to-Video Model

Kaiber presents Transform 3.0

Kaiber AI has released three important updates, accessible to all users:

  • Motion 3.0. Now with audio reactivity, smoother movements and an improved initial image.

  • Photorealistic Model. Create videos with surprising photographic qualities, now in Motion and Transform.

  • Profiles. Claim your username, customize your dashboard and publicly share your favorite Kaiber artworks.

Ideogram has been upgraded

Ideogram has an upgrade

Ideogram announced an update to its 1.0 model that adds negative prompts and Describe, a tool for generating image descriptions, along with improvements in image quality and speed.

Are these notes useful for you? Are they interesting? I would appreciate any feedback you can give me to improve.

Thanks for reading!

See you next week!

Hello 👋 I'm Erik Knobl, Product Designer by day and Generative AI Explorer on weekends. I share my learnings in this newsletter. Consider subscribing to stay in touch.