Despite the rapid growth and popularity of large language models and generative AI, how AI vendors and other organizations train their models has long drawn criticism.
From artists decrying the use of their artwork in art-generating tools , such as Dall-E and Stable Diffusion, to programmers suing GitHub for enabling its Copilot system to scrape their code, many are questioning AI vendors and the data they used to train their LLMs and generative AI systems.
Even Getty Images recently filed a lawsuit against generative AI vendor Stability AI for its use of copyrighted material to train its system.
While some are trying to recoup alleged past losses, others are looking ahead to protect the future use of unauthorized data.
Most recently, on April 18, social news website and forum Reddit updated the terms of its Data API access, developer terms and Ads API to prevent the free use of the data on its platform for commercial reasons.
The forum website now “reserves the right to charge fees for access and use of Reddit Services and Data,” according to the new terms.
The fees will be determined at the sole discretion of the company. Developers cannot use Reddit Services and Data on behalf of a business or as part of a monetized service or product. Developers are also restricted from indirectly and directly monetizing, licensing or leasing any part of the forum’s data or services.
That means AI vendors such as Microsoft and OpenAI, Google, AWS, Stability AI and others now have to pay Reddit for using the social platform’s vast troves of previously free data.
Reddit will now charge companies and vendors seeking to use data from its website solely to train AI models. Data will still be available for free to others looking to build applications that help people use Reddit or who want to use its data for noncommercial purposes.
No longer free
Reddit’s move was sparked by the same usage problems that led companies such as Getty Images to enforce their copyright rules contractually with AI vendors, said Michael Bennett, director of the education curriculum and business leaders for Responsible AI at Northeastern University.
“What’s happening with Reddit is very similar,” he said. “Different language but same concept. They’ve woken up to the fact that AI developers have been using their very rich forums.”
While Reddit may not be able to gain revenue from past use of its forum, the nature of large language models (LLMs) makes it possible for Reddit to still make money from future use of the data on its platform.
For example, the popular ChatGPT system from Microsoft partner OpenAI is a generative model has that been trained with limited data, restricting its knowledge based to the end of 2021. When OpenAI trains ChatGPT for subsequent years, if the vendor wants to use Reddit data, it must pay.
However, the restriction on data will not have an adverse effect on deep-pocketed AI vendors like OpenAI, Bennett said. The prominent vendors have enough capital to absorb the costs of paying data companies like Reddit.
Microsoft and OpenAI did not immediately respond to requests for comments.
For smaller AI vendors, the story is a bit different, Bennett said.
“It may change their calculus on the best way to get access to training data for their LLM,” he said. “But even then, many of the small ones have resources.” Even smaller vendors might have to negotiate a pay structure to get access to data to train their models.
Moreover, vendors likely anticipated that they would only be able to freely access data from Reddit and information websites for a limited time.
“They must have, in their overall strategy, anticipated this and probably prepared for it,” Bennett said.
Cost for consumers
Even if vendors accept that that they must pay for data to train their models, the new paradigm could still limit innovation, said Conversica CEO Jim Kaskade. Conversica is an AI vendor that creates virtual assistants for enterprises. It recently introduced a GPT-based chat product for enterprises.
“Charging for data access could create a barrier to entry for new and innovative projects, especially those in the early stages of development or with limited funding,” Kaskade said.
Michael BennettDirector of business lead and responsible AI, Northeastern University
It could also lead developers to seek different data sources that are free, which may lead to a less diverse source of data, he added.
Moreover, consumers may end up bearing part of the brunt.
“Paying for data access may incentivize some developers to monetize their applications by collecting, storing or selling user data, raising privacy concerns for end-users,” Kaskade said.
In addition to privacy problems, consumers could end up shouldering costs for those AI vendors that can afford to pay.
“Faced with the choice of absorbing the increased costs of paying to access training data like Reddit’s or passing those costs on to consumers, companies will likely opt for the latter,” Bennett said.
However, that cost might not be so high since it will be spread across users, said Forrester Research analyst Andrew Cornwall.
Reddit’s move will likely affect not only AI vendors but also the social platform’s users, Cornwall said. Reddit is also seemingly blocking from not-safe-for-work content that should not be viewed in the presence of an employer from all subreddit discussion threads.
“This will hurt researchers and businesses using generative AI to detect and block pornography,” Cornwall said.
Reddit is also trying to comply with privacy laws with the new change, which also requires messages deleted from Reddit to be completely erased and not appear anywhere else, Cornwall added. Deleting the messages will most likely affect websites such as unddit that show deleted and moderated messages. It might also affect AI vendors because it makes their models harder to train.
“If I’ve trained a model on all of Reddit and then a user deletes a comment, what action must I take? Do I have to rebuild my training set entirely without the deleted comment? It’s not clear that’s even possible with current generative AI,” Cornwall continued.
Ultimately, users can expect a different Reddit — one that is a little more closed.
“This is a move to a more walled garden Reddit,” Cornwall said.
Esther Ajao is a news writer covering artificial intelligence software and systems.