Protecting Your Privacy: How to Stop Your Data From Being Used to Train AI


Everything you’ve shared online—be it an embarrassing tweet, an old blog entry, an enthusiastic restaurant review, or a fuzzy Instagram photo—has likely been collected and incorporated into the training datasets for the latest wave of generative AI.

Tools based on large language models, like ChatGPT, along with image generators, are driven by extensive collections of our data. Even if your information isn’t being used for a chatbot or another generative tool, the data you’ve contributed to the numerous servers on the internet might still be utilized for machine-learning applications.

Corporations have harvested massive amounts of content from the internet to obtain the data they assert is essential for developing generative AI—often with minimal consideration for content creators, copyright regulations, or privacy. Increasingly, companies that sit on vast collections of user posts are also looking to capitalize on the AI boom by selling or licensing that data. We’re looking at you, Reddit.

Amid the growing number of lawsuits and inquiries regarding generative AI and its unclear data practices, there are emerging efforts aimed at granting individuals greater control over their online content. Some companies now provide options for both individuals and businesses to exclude their content from being utilized in AI training or sold for that purpose. Here’s a look at what you can—and cannot—do.

Update: This guide received an update in October 2024. We have incorporated additional websites and services into the list and revised some outdated instructions. This article will continue to be updated as the available tools and their respective policies change.

Before delving into the process of opting out, it’s important to set realistic expectations. Numerous companies developing AI have already harvested data from the web, meaning that much of what you have posted may currently reside in their databases. Moreover, AI companies often remain tight-lipped regarding the specific content they have collected, acquired, or used in their training processes. “We honestly don’t know that much,” states Niloofar Mireshghallah, a researcher specializing in AI privacy at the University of Washington. “In general, everything is very black-box.”

Mireshghallah further notes that companies can create obstacles when it comes to opting out of having data utilized for AI training. Even in situations where opting out is possible, many individuals remain unclear about the permissions they have granted or the ways in which their data is utilized. This complexity is compounded by various regulations, including copyright laws and Europe’s robust privacy regulations. Major platforms like Facebook, Google, and X have incorporated modifications into their privacy policies, indicating that they may leverage your data for AI training purposes.

While there are several technical methods available for AI systems to eliminate data or “unlearn,” Mireshghallah points out that knowledge about these processes remains limited. The available options may be obscured or require significant effort. Removing posts from AI training datasets is generally a significant challenge. Although some companies are beginning to let users opt out of future data scraping or sharing, the default is usually that everyone is included unless they take action.

“Most companies introduce this friction because they understand that few people will actively seek it out,” explains Thorin Klosowski, a security and privacy advocate at the Electronic Frontier Foundation. “Opting in is a deliberate choice, whereas opting out happens only if you are aware of its existence.”

More rarely, some organizations building AI tools and machine-learning frameworks do not automatically opt in their customers. “We do not train our models on user-submitted data by default. User prompts and outputs may be used to train Claude only when the user has given explicit permission, such as by clicking a thumbs up or down on a specific Claude output for feedback,” states Jennifer Martinez, a spokesperson for Anthropic. The latest version of the company’s Claude chatbot is therefore built on publicly available information and third-party data (content people have shared elsewhere on the internet) rather than user data.

This guide primarily addresses text opt-outs, but artists are also using “Have I Been Trained?” to indicate that their images should not be used for training. Managed by the startup Spawning, the service lets individuals check whether their works have been scraped and opt out of future training. “Any content with a URL can be opted out. Our search engine is solely focused on images, but our browser extension allows opting out of any type of media,” says Jordan Meyer, the cofounder and CEO of Spawning. Stability AI, the company behind the text-to-image tool Stable Diffusion, is among those that have pledged to respect this framework.

The following list covers companies that currently provide opt-out processes; not every company does. Meta, for instance, offers no such option. “While we don’t currently have an opt-out feature, we’ve integrated tools within our platform that enable users to delete their personal data from conversations with Meta AI in our applications,” explains Emil Vazquez, a representative for Meta. You can find the complete procedure for this process here.

Meanwhile, Microsoft has announced a new opt-out process for Copilot interactions being used in generative AI training, set to be released soon. “Some of the total user inputs in Copilot and Copilot Pro responses are utilized to enhance the service,” mentions Donny Turnbaugh, a spokesperson for the company. “Microsoft implements measures to anonymize data before it is processed, which helps protect consumer identities.” Even though the data may be anonymized—ensuring that no identifiable details remain—privacy-conscious users might still prefer increased control over their data and can opt out when given the option.

If you utilize Adobe’s Creative Cloud for your files, the company might analyze these files to enhance its software. This analysis does not extend to files that are stored solely on your device. Furthermore, Adobe will not use the files to train a generative AI model, with one exception. “We do not analyze your content to train generative AI models unless you choose to submit content to the Adobe Stock marketplace,” states the updated FAQ page of the company.

For those with a personal Adobe account, it’s simple to opt out of content analysis. Access Adobe’s privacy page, scroll to the Content analysis for product improvement section, and toggle the feature off. If you have a business or school account, opting out is automatic.

Amazon Web Services (AWS) offers AI services, such as Amazon Rekognition and Amazon CodeWhisperer, that may utilize customer data to enhance their tools. However, customers can opt out of AI training. The opt-out process has improved recently, making it much easier than it used to be. For a comprehensive guide on how to opt out for your organization, please refer to this support page from Amazon.

Figma, a renowned design software, may employ your data for model training. If you are using an Organization or Enterprise plan, you are automatically opted out of data usage for training. Conversely, users with Starter or Professional accounts are opted in by default. This can be adjusted at the team level by navigating to the settings and accessing the AI tab, where you can disable the Content training feature.

Users of Google’s chatbot, Gemini, may have their conversations selected for human review to enhance the AI model. However, opting out is quite straightforward. Simply launch Gemini in your browser, click on Activity, and choose the Turn Off option from the menu. You can disable the Gemini Apps Activity or opt out entirely, including the deletion of your conversation data. It’s important to note that while this process prevents future chats from being reviewed by humans, any previously selected data will remain. According to Google’s privacy hub for Gemini, these conversations might be retained for up to three years.

Grammarly has recently updated its policies to allow personal accounts to opt out of AI training. You can do this by going to the Account section, selecting Settings, and toggling off the Product Improvement and Training option. If your account is through an enterprise or educational license, you are automatically opted out.

Kate O’Flaherty authored an insightful article for WIRED discussing Grok AI and how to safeguard your privacy on X, the platform that hosts the chatbot. In this scenario, countless users found themselves automatically enrolled in AI training without sufficient warning. If you maintain an account on X, you can opt out of having your data used for Grok’s training purposes by navigating to the Settings and privacy section, then selecting Privacy and safety. Look for the Grok tab and untick the data sharing option.

HubSpot, well-known for its marketing and sales software, utilizes information from its clients to enhance its machine-learning models automatically. Unfortunately, there’s no straightforward option available to disable the use of your data for AI training. Instead, you must send an email to privacy@hubspot.com requesting that your associated data be opted out.

In September, LinkedIn users were taken aback to find out that their data might be utilized to train AI models. “Ultimately, people seek that competitive edge in their careers, and our generative AI services aim to provide them with that support,” stated Eleanor Crum, a spokesperson for LinkedIn.

If you’d like to prevent new LinkedIn posts from being used for AI training, visit your profile, open Settings, and click on Data Privacy. Then turn off the toggle labeled Use my data for training content creation AI models.

When interacting with a chatbot, users often share various personal details. OpenAI offers options regarding how your input is utilized, allowing users the choice to prevent their content from being used in future AI training. “We provide multiple easily accessible methods for users to manage their data, including tools to access, export, and delete personal information in ChatGPT. This includes straightforward options to opt out from using their content for model training,” explains Taya Christianson, a representative from OpenAI. (The availability of these options may differ based on account type, and data from enterprise customers is excluded from model training.)

On its help pages, OpenAI instructs ChatGPT web users who wish to opt out to go to Settings, then Data Controls, and uncheck the box labeled Improve the model for everyone. However, OpenAI encompasses more than just ChatGPT. For its Dall-E 3 image generator, there’s a form to request the removal of images from “future training datasets.” This form requires your name, email, confirmation of image ownership or representation on behalf of a company, image details, and the option to upload the images.

For users with a “high volume” of images available online wishing to exclude them from training data, OpenAI suggests that it could be “more efficient” to incorporate GPTBot into the robots.txt file of the website hosting the images.

Traditionally, a website’s robots.txt file—a basic text file that typically resides at websitename.com/robots.txt—has been utilized to inform search engines and others whether your pages can be included in their search results. This file can now also serve to instruct AI crawlers to refrain from scraping your published material, a request that AI companies have stated they will respect.
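As an illustrative sketch, a robots.txt file that asks a few well-known AI crawlers to stay away while leaving ordinary search crawlers alone might look like this (these user-agent tokens match the ones the crawler operators have published, but check each company’s documentation for the current names before relying on them):

```
# Ask known AI training crawlers not to index any page on the site.
# Compliance is voluntary: a crawler can simply ignore this file.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Everything else (including regular search engine crawlers) stays allowed.
User-agent: *
Allow: /
```

Because a Disallow rule is a request rather than an enforcement mechanism, this only keeps out crawlers whose operators have committed to honoring robots.txt.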

Perplexity is an innovative startup leveraging AI technology to assist users in searching the web and obtaining answers to their inquiries. Similar to other platforms detailed here, users are automatically signed up to allow their interactions and data to contribute to the training of Perplexity’s AI. You can disable this feature by selecting your account name, navigating to the Account section, and deactivating the AI Data Retention option.

Quora says it currently does not use users’ questions, posts, or comments to train AI systems. A representative confirmed that no user data has been sold for this purpose. Nevertheless, users can opt out in case policies change in the future. To do so, navigate to the Settings page, select Privacy, and turn off the “Allow large language models to be trained on your content” setting; users are opted in by default. Even with the opt-out, some Quora content may still be used for training large language models: if users reply to a machine-generated answer, Quora’s help pages say those replies might be included in AI training datasets. The pages also note that third parties may scrape content from the platform regardless of these settings.

Rev, a voice-transcription platform that combines human expertise with AI to convert audio to text, says it uses data indefinitely and anonymously to train its AI systems. Even after account deletion, the company retains the right to continue training its AI on the information previously gathered.

Kendell Kelton, who oversees brand and corporate communications at Rev, says the company has the “largest and most diverse dataset of voices,” totaling over 7 million hours of recorded audio, and that Rev does not sell user data to third parties. Per the firm’s terms of service, user data may be employed for training, but individuals can opt out by emailing support@rev.com, as specified in its help documentation.

All those seemingly random Slack messages exchanged at the workplace could potentially be utilized by the company to enhance its models as well. “Slack has incorporated machine learning into its product for a significant amount of time. This includes platform-wide machine-learning models for features such as recommendations for channels and emojis,” mentions Jackie Rocca, a vice president of product at Slack with a focus on AI.

While the company does not utilize customer data to train a large language model specifically for its Slack AI product, Slack might use your interactions to enhance the software’s machine-learning abilities. This can include aspects such as your messages, content, and files, according to Slack’s privacy page.

If you wish to opt out, the only real method is for your administrator to email Slack at feedback@slack.com. The email must contain the subject line “Slack Global model opt-out request” and include your organization’s URL. Slack doesn’t specify how long the opt-out process takes, but the company should send a confirmation email once it’s completed.

On another note, the website-building tool Squarespace offers a feature that allows users to disable AI crawlers from accessing the websites it hosts. This feature operates by modifying your website’s robots.txt file, which informs AI companies that the content is off-limits. To prevent AI bots from scouring your site, navigate to Settings in your account, locate Crawlers, and select Block known artificial intelligence crawlers. The company suggests this will be effective against numerous crawlers, including Anthropic AI, Applebot-Extended, CCBot, Claude-Web, cohere-ai, FacebookBot, Google Extended, GPTBot, ChatGPT-User, and PerplexityBot.

If you use Substack for your blog posts, newsletters, or similar content, the platform offers a straightforward robots.txt-based opt-out. Navigate to the Settings page, locate the Publication section, and activate the Block AI training toggle. Substack’s help page notes: “This will only apply to AI tools that respect this setting.”

Tumblr, a blogging and publishing platform owned by Automattic (which also owns WordPress), has stated that it is “working with” AI companies interested in the vast and unique assortment of publicly available content across its platforms. However, this does not extend to user emails or private content, as clarified by an Automattic spokesperson.

Tumblr provides a “prevent third-party sharing” setting to help ensure that your published content isn’t utilized for AI training or shared with other third parties, including researchers. If you’re using the Tumblr app, head to account Settings, choose your blog, click on the gear icon, go to Visibility, and toggle the “Prevent third-party sharing” feature. Posts that are explicit, deleted blogs, or those set to password-protected or private are not shared with third parties regardless, as supported by Tumblr’s support pages.

WordPress, like Tumblr, also offers a “prevent third-party sharing” feature. To enable this, access your website’s dashboard, select Settings, go to General, and then navigate to Privacy to check the Prevent third-party sharing box. “We are also trying to work with crawlers (such as commoncrawl.org) to stop content from being scraped and sold without allowing our users some choice or control over how their content is used,” mentioned an Automattic spokesperson.

If you manage your own website, you have the ability to modify your robots.txt file to instruct AI bots not to scrape your pages. Many news organizations prevent their articles from being crawled by AI bots. For instance, WIRED’s robots.txt file prohibits crawling by bots from various sources including Google, Amazon, Facebook, Anthropic, and Perplexity. This choice to opt out is not limited to publishers; any website, regardless of its size, can modify its robots file to keep AI crawlers at bay. Simply adding a disallow command is all that is required, and you can find working examples here.
