OpenAI is keeping a tight lid on the inner workings of its newest AI model. Since releasing the “Strawberry” family of models last week, which includes o1-preview and o1-mini and is billed as having reasoning capabilities, the company has been sending warning emails and threatening bans to users who try to probe how the models actually work.
Unlike earlier models such as GPT-4o, the o1 models are designed to work through a step-by-step problem-solving process before answering. When users pose a question to an o1 model in ChatGPT, a chain of thought is displayed, but it is a sanitized interpretation produced by a second AI model; the raw thought process is hidden from users.
This story originally appeared on Ars Technica, a trusted source for technology news and analysis that shares a parent company, Condé Nast, with WIRED.
The allure of hidden information has prompted hackers and security researchers to try to coax o1 into revealing its raw chain of thought using techniques such as jailbreaking and prompt injection. Early successes have been reported, but nothing has yet been conclusively confirmed.
Along the way, OpenAI is watching through the ChatGPT interface, and the company reportedly comes down hard on any attempt to probe o1’s reasoning, even among users who are merely curious.
One user on X reported (with confirmation from others, including Scale AI prompt engineer Riley Goodside) receiving a warning email after using the phrase “reasoning trace” in conversation with o1. Others say the warning is triggered simply by asking ChatGPT about the model’s “reasoning” at all.
The warning email from OpenAI says such requests were flagged for violating policies against circumventing safeguards or safety measures. “Please halt this activity and ensure you are using ChatGPT in accordance with our Terms of Use and our Usage Policies,” it reads, warning that further violations may result in loss of access to “GPT-4o with Reasoning,” an internal name for the o1 model.
Marco Figueroa, who manages Mozilla’s GenAI bug bounty programs, posted about the OpenAI warning on X last Friday, complaining that it hampers his ability to do constructive red-teaming safety research on the model. “I was too lost focusing on #AIRedTeaming to realize that I received this email from @OpenAI yesterday after all my jailbreaks. I’m now on the get banned list!!!” he wrote.
In a post titled “Learning to Reason with LLMs” on OpenAI’s blog, the company explains why it values hidden chains of thought: they offer a valuable monitoring opportunity, giving OpenAI a window into the model’s internal thought process. The company argues that these chains are most useful when left raw and uncensored, though that may conflict with its commercial interests.
“For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user,” according to the company. “However, allowing the model complete freedom to express uncensored thoughts is essential, thus we cannot impose any restrictive training regarding policy compliance or user preferences on these thoughts. Additionally, making an unaligned chain of thought visible to users is undesirable.”
OpenAI decided against showing these raw chains of thought to users, citing the need to retain an unfiltered feed for its own use, user experience considerations, and “competitive advantage.” The company acknowledges the decision has drawbacks. “We attempt to compensate by training the model to reiterate any valuable insights from the thought process in its responses,” it writes.
On the subject of “competitive advantage,” independent AI researcher Simon Willison voiced his concerns in a write-up on his personal blog. He interprets the move as an effort to keep other models from training against the reasoning work OpenAI has invested in.
It’s an open secret in the AI industry that researchers regularly use outputs from OpenAI’s GPT-4, and GPT-3 before it, to train new AI models that often end up as competitors, even though the practice violates OpenAI’s terms of service. The unfiltered output of o1’s reasoning would be highly valuable training data for competitors hoping to build similar “reasoning” models.
Willison argues that OpenAI’s tight control over the inner workings of its models is a setback for community transparency. “I’m not at all happy about this policy decision,” he wrote. “As a developer working on large language models, interpretability and transparency mean everything to me. The concept of executing a complex prompt and not seeing the details of how it was processed feels like a significant regression.”