How to Stop Your Data From Being Used to Train AI: A Guide

Matt Burgess and Reece Rogers

If you’ve ever posted something to the internet—a smart remark, an old blog post, a critical review, or a selfie on Instagram—it has most likely been scraped up and used to help train the current generation of generative AI. Large language models like ChatGPT, along with image generators, are powered by vast quantities of our data. And even when that data isn’t powering a chatbot, it can be used for other machine-learning features.

Tech companies have scraped vast swaths of the web to amass the data they say is needed to build generative AI—often with little regard for content creators, copyright laws, or privacy. On top of that, companies sitting on troves of people’s posts are increasingly looking to cash in on the AI gold rush by selling or licensing that information. We’re looking at you, Reddit.

However, as the lawsuits and investigations around generative AI and its opaque data practices accumulate, there have been small strides to give people more control over what happens to what they post online. Some companies now allow individuals and business customers to opt out of having their content used in AI training or being sold for training purposes. Here’s what you can—and can’t—do.

Before we get to how you can opt out, it’s worth setting expectations. Many companies building AI have already crawled the web, so anything you’ve posted is probably already in their datasets. These companies also tend not to disclose what data they have collected, bought, or used to train their systems. According to Niloofar Mireshghallah, a researcher who focuses on AI privacy at the University of Washington, a great deal remains unknown; the overall picture is opaque.

Companies may deliberately make it difficult to opt out of having your data used to train AI, Mireshghallah says. Even when opting out is possible, many people don’t have a clear picture of the permissions they’ve granted or of how their information is being used. And that’s before you factor in the tangle of applicable laws, such as copyright protections and Europe’s stringent privacy rules. Facebook, Google, X, and other companies have written into their privacy policies that they may use your data to train AI.

While there are technical methods that could allow AI systems to “unlearn” or purge data, Mireshghallah says, very little is known about any such processes actually being in use. The options that do exist are often buried or laborious, so getting posts removed from AI training data is likely to be an uphill battle. And where companies are starting to let people opt out of future scraping or data sharing, they almost always opt users in by default.

“Most companies add the friction because they know that people aren’t going to go looking for it,” says Thorin Klosowski, a security and privacy advocate at the Electronic Frontier Foundation. “Opting in would require a deliberate action, as opposed to opting out, where you would need to know it’s there.”


Some companies developing AI tools and machine-learning models do not automatically use customer data to train them. “We do not train our models on user data as a default practice. If the user provides express permission through actions such as sending feedback by clicking a thumbs up or down signal on a Claude output, we may incorporate their prompts and outputs in Claude’s training,” says Jennifer Martinez, a spokesperson for Anthropic. As it stands, the latest version of Claude, the company’s chatbot, was trained primarily on public online data and third-party data, not user data.

While this guide focuses mainly on text opt-outs, artists have been using “Have I Been Trained?”, a service from the startup Spawning, to indicate that their images should not be used for model training. The tool lets people check whether their creations have been scraped and then opt out of future training. “Anything with a URL can be opted out. Our search engine only searches images, but our browser extension allows opting out of any media type,” says Jordan Meyer, cofounder and CEO of Spawning. Stability AI, which developed the text-to-image tool Stable Diffusion, is among the companies that say they respect the system.

The list below covers only companies that currently offer an opt-out. Microsoft’s Copilot, for instance, does not give personal-account users the option to keep their prompts from being used for software improvement. “A percentage of user prompts in Copilot and Copilot Pro responses are used to fine-tune the experience,” says Donny Turnbaugh, a spokesperson for Copilot. “Microsoft takes measures to deidentify data before using it, thus helping protect user identity.” Still, privacy-conscious users may want more control over their data.

If you store files in Adobe’s Creative Cloud, the company may use them to train its machine-learning algorithms. “When we analyze your content for product improvement and development, we first aggregate your content with other content and then use the aggregated content to train our algorithms and improve our products and services,” the company’s FAQ states. This does not apply to files stored only on your device.


If you’re using a personal Adobe account, it’s easy to opt out. Open up Adobe’s privacy page, scroll down to the Content analysis section, and click the toggle to turn it off. For business or school accounts, the opt-out process is not available on the individual level, and you’ll have to reach out to your administrator.

AI services from Amazon Web Services, such as Amazon Rekognition and Amazon CodeWhisperer, may save customer data to improve the company’s tools. Heads-up: this is the most complicated opt-out process in this roundup, so you’ll likely need help from an IT professional at your company or an AWS representative to complete it. As outlined on Amazon’s support page, the process involves enabling the opt-out policy type for your organization, creating a policy, and attaching that policy where necessary, roughly as sketched below.
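For a rough sense of what that policy looks like: AWS Organizations expresses the AI services opt-out as a JSON policy document. The snippet below is a minimal sketch based on the documented policy format; verify the exact keys and values against Amazon’s current documentation before using it.

```
{
  "services": {
    "default": {
      "opt_out_policy": {
        "@@assign": "optOut"
      }
    }
  }
}
```

In practice, an administrator enables the AISERVICES_OPT_OUT_POLICY policy type on the organization’s root, creates a policy with content along these lines, and attaches it to the root, an organizational unit, or an individual account, via either the AWS Organizations console or the CLI.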

For users of Google’s chatbot, Gemini, conversations may sometimes be selected for human review to improve the AI model. Opting out is simple, though. Open Gemini in your browser, click Activity, and select the Turn Off drop-down menu. From there you can turn off Gemini Apps Activity alone, or opt out and delete your conversation data at the same time. In most cases this means future chats won’t be selected for human review, but data that has already been selected is not erased by this process. According to Google’s privacy hub for Gemini, those chats may stick around for three years.

Grammarly does not currently offer an opt-out for personal accounts, but self-serve business accounts can opt out of having their data used to train Grammarly’s machine-learning model. Turn it off by opening Account Settings, clicking the Data Settings tab, and toggling off Product Improvement & Training. If you have a managed business account, a category that includes accounts for classroom education and accounts bought through a Grammarly sales representative, you are automatically opted out of AI model training.

HubSpot, a popular marketing and sales software platform, automatically uses customer data to improve its machine-learning models. Unfortunately, there’s no button to press to turn this off. Instead, you have to email privacy@hubspot.com with a message requesting that the data associated with your account be opted out.

People reveal all sorts of personal information while using a chatbot. OpenAI offers some choices about what happens to what you say to ChatGPT—including letting you request that your content not be used to train its future AI models. “We give users a number of easily accessible ways to control their data, including self-service tools to access, export, and delete personal information through ChatGPT. That includes easily accessible options to opt out from the use of their content to train models,” says Taya Christianson, an OpenAI spokesperson. (The options vary slightly depending on your account type, and data from enterprise customers is not used to train models.)


On its help pages, OpenAI says ChatGPT web users without accounts should navigate to Settings and then uncheck Improve the model for everyone. If you have an account and are logged in through a web browser, select ChatGPT, Settings, Data Controls, and then turn off Chat History & Training. If you’re using ChatGPT’s mobile apps, go to Settings, pick Data Controls, and turn off Chat History & Training. Changing these settings, OpenAI’s support pages say, won’t sync across different browsers or devices, so you need to make the change everywhere you use ChatGPT.

OpenAI is about a lot more than ChatGPT. For its Dall-E 3 image generator, the startup has a form that allows you to send images to be removed from “future training datasets.” It asks for your name, email, whether you own the image rights or are getting in touch on behalf of a company, details of the image, and any uploads of the image(s). OpenAI also says if you have a “high volume” of images hosted online that you want removed from training data, then it may be “more efficient” to add GPTBot to the robots.txt file of the website where the images are hosted.

Traditionally, a website’s robots.txt file—a simple text file that usually sits at websitename.com/robots.txt—has been used to tell search engines, among others, whether they can include your pages in their results. It can now also be used to tell AI crawlers not to scrape what you have published—and AI companies have said they’ll honor the arrangement.
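As a concrete illustration, OpenAI documents GPTBot as the user agent its crawler identifies itself with, and blocking it takes only two lines of robots.txt. A minimal sketch (the Disallow: / rule covers the whole site; narrower paths work too):

```
# Ask OpenAI's GPTBot crawler to stay off the entire site
User-agent: GPTBot
Disallow: /
```

Keep in mind that robots.txt is advisory: it keeps out crawlers that choose to respect it, but it is not an enforcement mechanism.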

Perplexity is a startup that uses AI to help you search the web and find answers to questions. As with most of the other software on this list, you are automatically opted in to having your interactions and data used to further train Perplexity’s AI. Turn this off by clicking your account name, scrolling down to the Account section, and switching off the AI Data Retention toggle.

Quora says it “currently” doesn’t use people’s answers, posts, or comments to train AI. It also hasn’t sold any user data for AI training, a spokesperson says. But it offers an opt-out in case this changes in the future: visit its Settings page, click Privacy, and turn off “Allow large language models to be trained on your content.” Even with this option set, some Quora posts may be used to train LLMs: if you reply to a machine-generated answer, the company’s help pages say, those replies may be used for AI training. And the company points out that third parties may simply scrape its content anyway.

Rev, a voice transcription service that uses both human freelancers and AI to transcribe audio, says it uses data “perpetually” and “anonymously” to train its AI systems. Even if you delete your account, it will still train its AI on that information.

Kendell Kelton, head of brand and corporate communications at Rev, says it has the “largest and most diverse data set of voices,” made up of more than 6.5 million hours of voice recordings. Kelton says Rev does not sell user data to any third parties. The firm’s terms of service say data will be used for training, and that customers are able to opt out. People can opt out of their data being used by sending an email to support@rev.com, its help pages say.

All of those random Slack messages at work might be used by the company to train its models as well. “Slack has used machine learning in its product for many years. This includes platform-level machine-learning models for things like channel and emoji recommendations,” says Jackie Rocca, a vice president of product at Slack who’s focused on AI.

Even though the company does not use customer data to train a large language model for its Slack AI product, Slack may use your interactions to improve the software’s machine-learning capabilities. “To develop AI/ML models, our systems analyze Customer Data (e.g. messages, content, and files) submitted to Slack,” says Slack’s privacy page. As with Adobe, there’s not much you can do on an individual level to opt out if you’re using an enterprise account.


The only real way to opt out is to have your administrator email Slack at feedback@slack.com. The message must have the subject line “Slack Global model opt-out request” and include your organization’s URL. Slack doesn’t provide a timeline for how long the opt-out process takes, but it should send you a confirmation email once it’s complete.

Website-building tool Squarespace has a built-in toggle to stop AI crawlers from scraping the websites it hosts. It works by updating your site’s robots.txt file to tell AI companies the content is off-limits. To block the AI bots, open Settings within your account, find Crawlers, and turn off Artificial Intelligence Crawlers. Squarespace says this should cover the following crawlers: Anthropic, OpenAI’s GPTBot and ChatGPT-User, Google-Extended, and CCBot.

If you use Substack for blog posts, newsletters, or more, the company also has an easy option to apply the robots.txt opt-out. Within your Settings page, scroll to the Publication section and turn on the toggle to Block AI training. Its help page points out: “This will only apply to AI tools that respect this setting.”

Blogging and publishing platform Tumblr—owned by Automattic, which also owns WordPress—says it is “working with” AI companies that are “interested in the very large and unique set of publicly published content” on the wider company’s platforms. This doesn’t include user emails or private content, an Automattic spokesperson says.

Tumblr has a “prevent third-party sharing” option to stop what you publish from being used for AI training or shared with other third parties, such as researchers. If you’re using the Tumblr app, go to account Settings, select your blog, tap the gear icon, select Visibility, and toggle on “Prevent third-party sharing.” Explicit posts, deleted blogs, and blogs that are password-protected or private are not shared with third-party companies in any case, Tumblr’s support pages say.

Like Tumblr, WordPress has a “prevent third-party sharing” option. To turn it on, visit your website’s dashboard, click Settings, then General, scroll to Privacy, and select the Prevent third-party sharing box. “We are also trying to work with crawlers (like https://commoncrawl.org/) to prevent content from being scraped and sold without giving our users choice or control over how their content is used,” an Automattic spokesperson says.

If you host your own website, you can update your robots.txt file to tell AI bots not to scrape your pages. Many news websites don’t allow their articles to be crawled by AI bots. WIRED’s robots.txt file, for example, disallows bots from OpenAI, Google, Amazon, Facebook, Anthropic, and Perplexity, among others. This opt-out isn’t just for publishers, though: any website, big or small, can alter its robots.txt file to exclude AI crawlers. All you need to do is add a Disallow rule for each crawler you want to block; a minimal sketch follows.
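The user-agent tokens in the sketch below (GPTBot and ChatGPT-User for OpenAI, Google-Extended for Google, CCBot for Common Crawl, anthropic-ai for Anthropic, and PerplexityBot for Perplexity) are ones these companies have published, but crawler names change over time, so check each company’s documentation before relying on this list:

```
# robots.txt — ask common AI crawlers not to scrape any page on this site.
# Each block names one crawler's user agent and disallows the whole site ("/").

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Place the file at the root of your domain (for example, websitename.com/robots.txt). As noted above, compliance is voluntary: well-behaved crawlers will honor it, but it does not technically prevent scraping.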
