Shutterstock/JRdes

YouTube videos as AI training material: content creators at a disadvantage

Debora Pape
9.4.2024
Translation: machine translated

The internet offers too few usable texts for training language AI systems. OpenAI therefore resorted to transcribing videos on YouTube, even though YouTube's terms of use do not permit this.

Artificial intelligence (AI) is on everyone's lips, or rather on everyone's screen. AI systems write texts, generate images and videos, compose songs and write code. However, an AI is only as good as the training material it can access: the more material, the better. According to a report in the New York Times, the AI company OpenAI has accessed millions of hours of video material from the YouTube platform for this purpose, although YouTube's guidelines prohibit such access.

Not enough data for further AI training

It has long been clear that those who know how to use AI will secure enormous advantages in the future. Conversely, this means that those who develop the best AI will gain the most lucrative market shares. The major companies in the language AI business, including OpenAI, Google and Meta, are therefore in a neck-and-neck race to develop the best AI.

However, this requires the largest possible pool of training material produced by humans. AI companies are already running their algorithms over all kinds of internet content in order to feed it into their AI systems.

High-quality data such as specialist articles, books, Wikipedia pages and other content that has been created with quality in mind is particularly valuable. According to the AI research organisation Epoch, this content could be used up entirely as training material between 2024 and 2026. Another problem is that much of this content is protected by copyright, but that doesn't stop AI developers from using it anyway.

YouTube videos as an illegal source of training data

In order to obtain more data for its language AI, OpenAI developed the Whisper tool back in 2021, which can transcribe spoken language in YouTube videos. The resulting texts can be used as further training material for the language AI. According to employees cited by the New York Times, around one million hours of videos have been incorporated into the current version of ChatGPT. The criteria by which these videos were selected remain unclear. Compared to the total playing time on YouTube, one million hours is not a lot: according to Statista, around 720,000 hours of new videos were added every day in 2022, so one million hours corresponds to less than a day and a half of uploads.

However, such access is prohibited: YouTube's terms of use state that it is not permitted to "access the service [i.e. YouTube] using automated processes (e.g. robots, botnets or scrapers) [...]". According to the New York Times, OpenAI developers knowingly violated this rule, and Google, which owns YouTube, was aware of it.

However, Google itself is in a difficult position: it has also recognised the potential of YouTube videos and uses them as training material. This is not permitted either, as YouTube does not own the copyright to the videos on its platform. The copyright lies with the content creators who create and upload the videos. YouTube can therefore hardly protest against unauthorised access by OpenAI if the AI of its parent company Google itself uses the content creators' work illegally.

Complaints by copyright holders

The New York Times reported on this new potential copyright infringement by AI companies for a reason. It already sued OpenAI in December 2023 for the unlawful use of its own articles (see https://www.rosepartner.de/blog/urheberrechtsverletzung-durch-ki-training.html). The AI can reproduce these articles, which thus contribute to OpenAI's economic success without financial compensation or mention of authorship.

The use of protected works is becoming a problem for artists, authors and other content creators. According to the New York Times, the US Copyright Office has already received more than 10,000 complaints. However, an initial class action lawsuit by artists has already been rejected by a judge.

There are currently no legal regulations that specifically govern how AI may use copyrighted works.

Header image: Shutterstock/JRdes

Feels just as comfortable in front of a gaming PC as she does in a hammock in the garden. Likes the Roman Empire, container ships and science fiction books. Focuses mostly on unearthing news stories about IT and smart products.