Paresh Dave
Last year Stack Overflow announced that it would charge AI giants for access to content used to train chatbots. The popular Q&A service for coders has now signed up its first customer, Google. This is the start of a new stream of revenue, according to CEO Prashanth Chandrasekar.
The deal is notable because it’s not yet clear how extensively Google and other AI developers will pay for content for AI projects. Millions of books and websites have propelled the development of AI systems, but most publishers have not been compensated, and some are suing over alleged misuse. Many publishers, including Stack Overflow, feel threatened by ChatGPT and other generative AI products that can answer queries that would have previously directed coders their way.
The agreement means Google’s cloud division will use questions and answers from Stack Overflow about Google Cloud services to provide coding assistance and technical support through a version of Google’s Gemini chatbot. Google’s cloud computing customers will also be able to ask questions through Google Cloud’s command-line interface. “Their AI may not have all the answers, and so we have a huge ability to help complete that loop,” Chandrasekar says. “We are the biggest place where community knowledge is curated and validated.”
The Gemini AI system will summarize answers that originate from Stack Overflow in its own words but include the company’s logo, a link back to the original content, and the username of the contributor who provided it. The companies plan to showcase the system at Google Cloud Next, the search company’s annual cloud conference in April, and launch it soon after.
Chandrasekar says there are no significant restrictions on how Google Cloud can use Stack Overflow data, meaning it can be used to train large language models and other AI systems. “Where we want to stand firm on is—nonnegotiable things for us— trust, accuracy, quality, and attribution back to the sources of these AI outputs,” he says.
He declined to say how much Stack Overflow is being paid by Google for the data. “This will be a meaningful commercial offering for us in the near term, medium term, and long term,” Chandrasekar says.
Google and other AI developers have previously gathered data from Stack Overflow and other websites without much notice. As demand for generative AI technologies has surged, and the valuations of the companies developing them have rocketed, the websites supplying the foundational text have begun demanding what they view as their fair share. Fortunately for Stack Overflow, prospective customers have heeded the message, Chandrasekar says. “We’re not having to chase people,” he says.
Stack Overflow data is particularly beneficial to AI systems that generate computer code, which have proven to be popular with software engineers and a significant source of revenue for Microsoft and OpenAI.
The new Stack Overflow deal comes just a week after Google reached a licensing agreement to hoover up data from Reddit, the discussion forums operator, whose content has helped improve chatbots’ ability to converse. Reddit unveiled plans to start charging for data access just before Stack Overflow did last year.
Stack Overflow’s fees for what it is calling OverflowAPI vary based on the type of data provided. Beyond its basic repository of 59 million questions and answers, the site charges more for layers of metadata such as post categories and the voting history of user-submitted answers, patterns in the kinds of questions being asked, and bespoke cuts of data, such as questions about a specific coding language, to help with precision. “It’s more about the level of the data they can reach,” says Chandrasekar. “It’s less about the frequency of the data requests.”
Chandrasekar says internal quality testing demonstrates the value Stack Overflow data can offer. When the company fine-tuned open-source language models from Meta and AI startup Mistral with Stack Overflow data, the accuracy of responses to technical queries improved by 20 percentage points, he says.
The Google agreement will also explore how users of the Gemini for Google Cloud integration can generate new data for Stack Overflow. Users who do not get a satisfying response from the chatbot can send their question to Stack Overflow, where moderators will review it. Once approved, the question will be available for the website’s community of users to answer. Plans for the April demo with Google also include discussions about letting users submit improved answers back to Stack Overflow.