Why OpenAI Is Holding Back the Release of Its Advanced Human Voice Replication Technology

By Benj Edwards, Ars Technica

Voice synthesis has come a long way since 1978’s Speak & Spell toy, which once wowed people with its state-of-the-art ability to read words aloud in an electronic voice. Now, using deep-learning AI models, software can not only create realistic-sounding voices but also convincingly imitate existing voices from small samples of audio.

Along those lines, OpenAI this week announced Voice Engine, a text-to-speech AI model that creates synthetic voices from a 15-second segment of recorded audio. The company has posted audio samples of Voice Engine in action on its website.


Once a voice is cloned, a user can input text into Voice Engine and get an AI-generated voice result. However, OpenAI is not ready to widely release its technology. The company originally planned to launch a pilot program that would let developers sign up for a Voice Engine API, but it reevaluated that plan because of ethical concerns and decided to limit the technology’s availability.

“In line with our approach to AI safety and our voluntary commitments, we are choosing to preview but not widely release this technology at this time,” the company states. “We hope this preview of Voice Engine emphasizes its potential and also encourages the need to strengthen societal resilience against the challenges introduced by increasingly realistic generative models.”

Voice cloning technology is not particularly new; several AI voice synthesis models have appeared since 2022, and the tech is readily available in the open-source community through packages like OpenVoice and XTTSv2. What is notable is the prospect of OpenAI making its own voice tech widely available, and the company’s hesitance to fully release it may be the bigger story.
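To give a sense of how accessible this kind of cloning already is, here is a minimal sketch using Coqui’s open-source TTS package and its XTTS v2 model; the file names are placeholders, and the exact API may vary between package versions:

```python
# A rough illustration of open-source voice cloning with Coqui TTS (XTTS v2).
# File names here are placeholders; the package's API may differ by version.
from TTS.api import TTS

# Load the multilingual XTTS v2 model (weights are downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Use a short reference clip of the target voice to speak brand-new text.
tts.tts_to_file(
    text="This sentence was never spoken by the person in the reference clip.",
    speaker_wav="reference_clip.wav",  # a few seconds of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```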

OpenAI says the benefits of its voice technology include providing reading assistance through natural-sounding voices, giving creators global reach by translating content while retaining native accents, supporting nonverbal individuals with personalized speech options, and helping patients recover their own voices after speech-impairing conditions.

But it also means that anyone with 15 seconds of someone’s recorded voice could effectively clone it, and that has obvious implications for potential misuse. Even if OpenAI never widely releases its Voice Engine, the ability to clone voices has already caused trouble in society through phone scams where someone imitates a loved one’s voice and election campaign robocalls featuring cloned voices from politicians like Joe Biden.

Researchers and reporters have also shown that voice-cloning technology can be used to break into bank accounts that rely on voice authentication (such as Chase’s Voice ID). That prompted US Senator Sherrod Brown of Ohio, chair of the Senate Committee on Banking, Housing, and Urban Affairs, to send a letter to the CEOs of several major banks in May 2023 asking about the security measures they are taking to counteract AI-powered risks.

OpenAI recognizes that the tech might cause trouble if broadly released, so it is initially trying to work around those issues with a set of rules. The company has been testing the technology with a group of select partners since last year. For example, video synthesis company HeyGen has been using the model to translate a speaker’s voice into other languages while keeping the same vocal sound.

To use Voice Engine, each partner must agree to terms of use that prohibit “the impersonation of another individual or organization without consent or legal right.” The terms also require that partners acquire informed consent from the people whose voices are being cloned, and they must also clearly disclose that the voices they produce are AI-generated. OpenAI is also baking a watermark into every voice sample that will assist in tracing the origin of any voice generated by its Voice Engine model.

For now, OpenAI is showcasing its technology but is not prepared to accept the potential societal turmoil a widespread deployment might create. Instead, the company has framed its announcement as an effort to responsibly raise awareness of a capability that already exists elsewhere.

“Our approach to a wider launch is careful and informed, considering the potential misuse of synthetic voices,” the company stated. “We aim to initiate a conversation about the careful introduction of synthetic voices and how our society can adjust to these new potentials. Based on these discussions and the outcomes of these preliminary trials, we’ll have a clearer understanding of how and whether to distribute this technology on a larger scale.”

Consistent with its stated intention to introduce the technology gradually, OpenAI set forth three suggestions for societal adaptation in a blog post: phasing out voice-based authentication for bank accounts, educating the public about the potential for deceptive AI content, and accelerating the development of techniques for tracing the origin of audio content, “so it’s always evident when you’re interacting with a real person or an AI.”

OpenAI also proposes that future voice-cloning services verify that the original speaker knowingly consented to having their voice added, and that they maintain a list of voices that are off-limits for cloning, such as those resembling well-known individuals. Of course, that kind of filtering could inadvertently exclude anyone whose voice naturally resembles a celebrity’s or a US president’s.

OpenAI says it developed its Voice Engine technology in late 2022, and many people have already been using a version of it with preset (not cloned) voices in two places: the spoken conversation mode in the ChatGPT app, released in September, and OpenAI’s text-to-speech API, which debuted in November of last year.
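For readers who want to see what that existing API looks like, here is a minimal sketch using the openai Python package and one of the preset voices; it assumes an OPENAI_API_KEY in the environment, and it does not expose any of Voice Engine’s cloning capability:

```python
# A minimal sketch of OpenAI's existing text-to-speech API with a preset voice.
# Assumes the `openai` Python package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Generate speech using a preset (not cloned) voice.
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # one of the built-in voices
    input="Voice Engine's cloning features are not part of this public API.",
)

# The response body is the audio itself (MP3 by default).
with open("speech.mp3", "wb") as f:
    f.write(response.content)
```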

Amid all the voice-cloning competition out there, OpenAI says that Voice Engine is notable for being a “small” AI model (how small, exactly, we do not know). But having been developed in 2022, it almost feels late to the party. And it may not be perfect in its cloning ability. Previous user-trained text-to-speech models, like those from ElevenLabs and Microsoft, have struggled with accents that fall outside their training data.

For now, Voice Engine remains a limited release to select partners.

This story originally appeared on Ars Technica.
