The GeForce GPU giant is scraping 80 years' worth of video every day for AI training to "unlock various downstream applications important to Nvidia".

Leaked documents, including spreadsheets, emails, and chat messages, show that Nvidia used millions of videos from YouTube, Netflix, and other sources to train AI models for use in its Omniverse, autonomous vehicle, and digital avatar platforms.

The extent of the data scraping, surprising yet perhaps not surprising, was reported by 404 Media, which examined the documents. The outlet found that staff on an internal project codenamed Cosmos (the same name as, but distinct from, Nvidia's Cosmos deep learning service) used dozens of virtual machines on Amazon Web Services (AWS) to download a very large number of videos per day, and that Nvidia had accumulated over 30 million URLs in a single month.

Copyright law and usage rights were repeatedly discussed by employees, who found several creative ways to avoid direct infringement. For example, Nvidia used Google's cloud service to download the YouTube-8M dataset.

In a leaked Slack channel discussion, one person stated that they "cleared the download with Google/YouTube in advance and dangled the carrot of downloading using Google Cloud. After all, for 8 million videos they would typically get a lot of ad impressions, revenue they lose when downloading for training."

404 Media asked Nvidia to comment on the legal and ethical aspects of using copyrighted material for AI training, and the company responded that it "fully complies with the letter and spirit of copyright law."

While some datasets allow their use only for academic purposes, and Nvidia does a significant amount of research (both in-house and with other institutions), the leaked material clearly indicates that this data scraping was for commercial purposes.

Of course, Nvidia is not alone in doing this; both OpenAI and Runway have been accused of knowingly using copyrighted material to train AI models. Interestingly, one source of video content that one might expect Nvidia to have no problem using is gameplay footage from its GeForce Now service, but the leaked documents indicate otherwise.

A senior research scientist at Nvidia explained to another employee why: "We don't have stats or video files yet. This is because the infrastructure is not yet set up to capture much live game video & action. There are both technical and regulatory hurdles."

AI models must be trained on billions of data points, and there is no way around this. Some datasets have very clear rules regarding their use, while others have only fairly loose restrictions. The laws on the use of copyrighted material, however, are very clear about what can and cannot be done, even if their application to AI training is not 100% transparent.

It is not just a copyright issue either, as video content often contains personal information. In the U.S., there is no single overriding federal law that directly applies here, but there are many regulations governing the collection and use of personal data. The EU's GDPR, by contrast, clearly defines how such data can be used, even outside the EU.

One might also wonder what would happen if a company like Nvidia were found to have violated various regulations while training its AI models. If its systems are used worldwide, would they be blocked in certain countries?

Regardless of how one feels about AI, it is clear that urgent efforts are needed to increase transparency, especially when copyrighted and personal data is used for commercial purposes. If tech companies are not held accountable, data scraping will continue on an ad hoc basis.
