Anything you’ve ever posted online—a cringeworthy tweet, an ancient blog entry, an enthusiastic restaurant review, or a blurry Instagram selfie—has very likely been scooped up and used as training data for the current wave of generative AI.
Large language model tools such as ChatGPT, along with image generators, rely on vast amounts of our personal data. And even if your information doesn’t end up in a chatbot or another generative tool, content you’ve posted may be used for other machine-learning purposes.
Technology companies have extensively scoured the internet to compile the data they say is essential for developing generative AI—frequently with scant consideration for content creators, copyright regulations, or privacy concerns. To compound this issue, more companies with extensive collections of user posts are eager to capitalize on the AI trend by selling or licensing that data. We’re looking at you, Reddit.
As scrutiny of generative AI and its murky data practices intensifies, some steps are being taken to give people more control over what happens to what they post online. A number of companies now let individual users and business customers opt out of having their content used in AI training or sold for that purpose. Here’s what you can and can’t do.
Update: This guide was updated in October 2024. New websites and services have been added to the list below, and some outdated directions have been refreshed. We will continue to keep this article up-to-date as tools and their policies change.
Before diving into the opt-out process, it’s worth setting expectations. Many companies building AI have already scraped the web, so anything you’ve posted is probably already in their datasets. AI companies are also secretive about what they have actually collected, purchased, or licensed for training. “We honestly don’t know that much,” says Niloofar Mireshghallah, a researcher who focuses on AI privacy at the University of Washington. “In general, everything is very black-box.”
Mireshghallah points out that the process for opting out of data usage for AI training can be convoluted, and often, even when it is feasible, many individuals lack a “clear idea” of the permissions they have consented to or how their data is being utilized. This complexity is further amplified by various legal considerations, such as copyright protections and Europe’s stringent privacy laws. Companies like Facebook, Google, and X have updated their privacy policies to include clauses that allow them to utilize your data for AI training purposes.
Although there are several technical ways an AI system could remove data or “unlearn” it, Mireshghallah says little is known about the processes actually in place. The options that do exist are often buried or labor-intensive, and getting posts removed from data that has already been used for training is likely a losing battle. Many companies now let users opt out of future scraping or data sharing, but they make inclusion the default, leaving it to users to find the setting and switch it off.
“Most companies create these barriers because they know people tend not to search for them,” states Thorin Klosowski, a security and privacy advocate at the Electronic Frontier Foundation. “Opting in would require a conscious action, unlike opting out, which necessitates being aware of its existence.”
While it’s less common, some companies building AI tools don’t opt their customers in by default. “We do not train our models on user-submitted data by default. We may utilize user prompts and responses to train Claude when the user explicitly permits it, such as by clicking a thumbs up or down on specific Claude outputs for feedback,” says Jennifer Martinez, a spokesperson for Anthropic. In this case, the latest iteration of the company’s Claude chatbot was built using publicly available information online and third-party data—content people have shared elsewhere on the web—but not user data.
This guide primarily addresses opt-outs related to text, but artists have also been utilizing “Have I Been Trained?” to indicate that their images should not be used for training purposes. Operated by the startup Spawning, this service enables individuals to check if their works have been scraped and opt out of any upcoming training. “Any content with a URL is eligible for opting out. Our search engine focuses solely on images, but our browser extension allows opting out of any type of media,” shares Jordan Meyer, cofounder and CEO of Spawning. Stability AI, the company responsible for a text-to-image tool called Stable Diffusion, is among those that have previously announced they will respect these requests.
The list below covers only companies that currently offer ways to opt out. Meta, for example, does not. “While we don’t currently have an opt-out feature, we’ve built in-platform tools that allow people to delete their personal information from chats with Meta AI across our apps,” says Emil Vazquez, a spokesperson for Meta. For detailed steps on that process, you can find more here.
Similarly, Microsoft says an opt-out option for Copilot’s generative AI training is coming but has not yet launched. “A portion of the total number of user prompts in Copilot and Copilot Pro responses are used to fine-tune the experience,” says Donny Turnbaugh, a spokesperson for the company, adding that Microsoft de-identifies data before using it, which helps protect consumer privacy. Even so, privacy-conscious users may want to opt out once the option becomes available.
If your files are saved in Adobe’s Creative Cloud, the company may analyze them to enhance its software offerings. However, this does not pertain to files that are only stored on your device. Adobe will not utilize your files for training a generative AI model, with one exception: “We do not analyze your content to train generative AI models unless you choose to submit content to the Adobe Stock marketplace,” according to the updates on their FAQ page.
For those using a personal Adobe account, it’s simple to opt out of content analysis. To do this, visit Adobe’s privacy page, scroll to the Content analysis for product improvement section, and toggle the option off. If you have a business or school account, you are automatically opted out.
AI offerings from Amazon Web Services, such as Amazon Rekognition and Amazon CodeWhisperer, have the potential to utilize customer data to enhance the capabilities of their tools. However, customers do have the option to decline participation in AI training. This process was previously quite complex but has become more user-friendly recently. For a comprehensive guide on how to opt out on behalf of your organization, you can refer to this support page from Amazon.
Figma, a widely used design tool, might utilize your data for model training. Users with an Organization or Enterprise account are automatically excluded from this. Conversely, those with Starter and Professional accounts are included by default. Team administrators can adjust settings by navigating to the AI tab in settings and disabling the Content training feature.
Conversations with Google’s chatbot, Gemini, may sometimes be selected for human review to improve the AI model. Opting out is straightforward: open Gemini in your web browser, click Activity, and select Turn Off from the drop-down menu. Here you can turn off Gemini Apps Activity entirely, or opt out and delete your conversation history at the same time. While this stops future chats from being reviewed by humans, data that has already been selected may be retained for up to three years, according to Google’s privacy hub for Gemini.
Grammarly has revised its policies, allowing personal account holders the ability to opt out of AI training. To do this, navigate to Account, then Settings, and switch off the Product Improvement and Training option. If your account is through an enterprise or educational institution, it will already be set to opt out automatically.
Kate O’Flaherty shared an insightful article for WIRED discussing Grok AI and the crucial issue of privacy on X, the platform that hosts the chatbot. It highlights a scenario where countless users awoke one day to discover they had been automatically enrolled in AI training with very little notification. If you maintain an account on X, you can opt out of your data being utilized for Grok’s training by navigating to the Settings and privacy section, then selecting Privacy and safety. Open the Grok tab and untick the option for data sharing.
HubSpot, a popular marketing and sales platform, uses customer data to improve its machine-learning models by default. Unfortunately, there is no button to flip to turn this off. Instead, you must email privacy@hubspot.com and ask that the data tied to your account be opted out.
Users of the career networking site LinkedIn were surprised to learn in September that their data might be used to train AI models. “At the end of the day, people want that edge in their careers, and our gen-AI services assist in providing that support,” says Eleanor Crum, a spokesperson for LinkedIn.
You can opt out of having new LinkedIn posts utilized for AI training by accessing your profile and then opening the Settings. Proceed to Data Privacy and toggle off the option labeled Use my data for training content creation AI models.
Individuals tend to share various personal details while interacting with a chatbot. OpenAI offers multiple options regarding the handling of information shared with ChatGPT, which includes the choice to prevent future AI models from being trained on that content. “We provide users with several easy-to-access tools to manage their data, which include self-service options to access, export, and delete personal information via ChatGPT. This includes straightforward methods to opt out from using their content for training models,” explains Taya Christianson, a representative from OpenAI. (These options may vary slightly depending on the type of account, and data from enterprise customers is not utilized for model training).
According to its help pages, OpenAI mentions that web users of ChatGPT who wish to opt out should go to Settings, then Data Controls, and uncheck Improve the model for everyone. OpenAI’s focus extends beyond just ChatGPT. For instance, with its Dall-E 3 image generator, there is a form available for removing images from “future training datasets.” This form inquires about your name, email, whether you hold the image rights or are representing a company, details regarding the image, and allows for uploads of the image(s).
If you possess a significant number of images online that you want excluded from training data, OpenAI suggests that adding GPTBot to the robots.txt file of the hosting website may be a “more efficient” method.
Historically, a website’s robots.txt file—a basic text file typically located at websitename.com/robots.txt—has been utilized to inform search engines and other entities about the inclusion of your pages in their results. Now, it can also serve the purpose of instructing AI crawlers not to scrape your published content, and AI firms have indicated that they will adhere to this guideline.
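As a rough sketch of what this looks like in practice, a site owner who wanted to turn away OpenAI’s crawler entirely could add two lines to robots.txt. (GPTBot is the user-agent token OpenAI documents for its crawler; compliance is voluntary on the crawler’s side.)

```txt
# Tell OpenAI's GPTBot crawler not to fetch any pages on this site
User-agent: GPTBot
Disallow: /
```

Replacing `/` with a narrower path, such as `/images/`, would restrict only that section of the site while leaving the rest crawlable.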
Perplexity is a startup using AI to improve online search and answer users’ questions. As with the other platforms on this list, you are automatically opted in to having your interactions and data used to train Perplexity’s AI. To opt out, click your account name, scroll down to the Account section, and turn off the AI Data Retention toggle.
Quora says it “currently” does not use answers, posts, or comments from users for AI training, and a spokesperson says it has not sold user data for AI training either. Even so, the platform offers an opt-out in case that changes: go to the Settings page, click Privacy, and turn off the “Allow large language models to be trained on your content” setting, which is on by default. Some Quora posts may still be used to train AI regardless: if you reply to a machine-generated answer, the company’s help pages say those replies may be used for training, and they note that third parties could scrape Quora’s content anyway.
Rev, a transcription service that combines human freelancers and AI for audio transcription, discloses that it utilizes user data “perpetually” and “anonymously” for its AI training purposes. Even after an account is deleted, the information may still contribute to AI training.
Kendell Kelton, who oversees brand and corporate communications at Rev, notes that the company possesses the “largest and most diverse dataset of voices,” aggregating over 7 million hours of recorded audio. Kelton asserts that Rev does not sell user data to outside parties. Their terms of service clarify that data is used for training, while customers have the option to opt-out. Users wishing to prevent their data from being utilized can do so by emailing support@rev.com, as stated in the help resources.
Numerous random Slack messages exchanged in the workplace may be utilized by the company to enhance its models. “For many years, Slack has integrated machine learning into its platform. This encompasses machine-learning models at the platform level for functions like emoji and channel suggestions,” explained Jackie Rocca, a vice president of product at Slack focusing on AI.
Although the company does not use customer data to train a large language model for its Slack AI product, your interactions may still be used to improve the software’s machine-learning capabilities. This may include your messages, files, and other content, as described on Slack’s privacy page.
The primary method to opt out involves having your administrator contact Slack at feedback@slack.com. The email should have the subject line “Slack Global model opt-out request” and include your organization’s URL. While Slack does not specify how long the opt-out procedure will take, it typically sends a confirmation email once completed.
The website creation platform Squarespace features a toggle designed to prevent AI crawlers from scraping the websites it hosts. This function updates your website’s robots.txt file, informing AI companies that the content is restricted. To obstruct the AI bots, navigate to Settings in your account, locate Crawlers, and choose Block known artificial intelligence crawlers. This feature is intended to be effective against several crawlers, including Anthropic AI, Applebot-Extended, CCBot, Claude-Web, cohere-ai, FacebookBot, Google Extended, GPTBot, ChatGPT-User, and PerplexityBot.
If you publish articles or newsletters on Substack, the platform offers a simple robots.txt-based way to opt out. Go to the Settings page, find the Publication section, and switch on the Block AI training toggle. As the help pages note: “This will only apply to AI tools that respect this setting.”
Similarly, the blogging and publishing platform Tumblr, which is owned by Automattic (the parent company of WordPress), has stated it is “collaborating with” AI companies that are “interested in the vast and unique publicly available content” on its platforms. This excludes user emails or any private material, as confirmed by a spokesperson from Automattic.
Tumblr offers a feature to prevent third-party sharing, which allows you to stop your published content from being used to train AI, as well as being shared with external parties like researchers. If you use the Tumblr app, you can access this feature by going to your account Settings, selecting your blog, clicking the gear icon, choosing Visibility, and toggling the “Prevent third-party sharing” option. Posts marked as explicit, deleted blogs, and content that is password-protected or private are not shared with third-party companies, per Tumblr’s support information.
In line with this, WordPress also has a prevent third-party sharing feature. To enable it, simply go to your website’s dashboard, select Settings, then General, and proceed to Privacy, where you can check the Prevent third-party sharing option. “We are also attempting to collaborate with crawlers (like commoncrawl.org) to prohibit content from being scraped and sold without providing our users choice or control over their content usage,” a representative from Automattic mentioned.
If you manage your own website, you can modify your robots.txt file to instruct AI bots not to scrape your pages. Many news websites restrict AI bots from crawling their articles. For instance, WIRED’s robots.txt file blocks crawling by bots from platforms like Google, Amazon, Facebook, Anthropic, and Perplexity, among others. This option is available not only to publishers; any website, regardless of size, can modify its robots file to exclude AI crawlers. All it takes is the addition of a disallow command, and you can find working examples here.
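As an illustration, a robots.txt that turns away several widely used AI crawlers at once might look like the sketch below. These user-agent tokens are ones the respective companies have published for their crawlers, but robots.txt is an honor system: a bot that ignores the file can still scrape your pages.

```txt
# Disallow common AI training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Each `User-agent` line names a crawler, and the `Disallow: /` beneath it asks that crawler to skip every path on the site.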