My Experience with a Next-Gen AI Assistant: Prepare To Be Astounded!

Will Knight

The most famous virtual valets around today—Siri, Alexa, and Google Assistant—are a lot less impressive than the latest AI-powered chatbots like ChatGPT or Google Bard. When the fruits of the recent generative AI boom get properly integrated into those legacy assistant bots, they will surely get much more interesting.

To get a preview of what’s next, I took an experimental AI voice helper called vimGPT for a test run. When I asked it to “subscribe to WIRED,” it got to work with impressive skill, finding the correct web page and accessing the online form. If it had access to my credit card details I’m pretty sure it would have nailed it.

Although hardly an intelligence test for a human, buying something online on the open web is a lot more complicated and challenging than the tasks that Siri, Alexa, or the Google Assistant typically handle. (Setting reminders and getting sports results are so 2010.) It requires making sense of the request, accessing the web to find the correct site, then correctly interacting with the relevant page or forms. My helper correctly navigated to WIRED’s subscription page and even found the form there—presumably impressed by the prospect of receiving all WIRED’s entertaining and insightful journalism for only $1 a month—but fell at the final hurdle because it lacked a credit card. (vimGPT uses a version of Google’s open source browser, Chromium, that doesn’t store user information.) In my other experiments, the agent proved adept at searching for funny cat videos and finding cheap flights.

vimGPT is not a product in development but an experimental open source program built by Ishan Shah, a lone developer. Still, you can bet that Apple, Google, and others are running similar experiments with a view to upgrading Siri and other assistants. vimGPT is built on GPT-4V, the multimodal version of OpenAI’s famous language model. By visually analyzing a webpage, it can determine what to click on or type more reliably than text-only software, which has to make sense of the web by untangling messy HTML. “A year from now, I would expect the experience of using a computer to look very different,” says Shah, who says he built vimGPT in only a few days. “Most apps will require less clicking and more chatting, with agents becoming an integral part of browsing the web.”

Shah is not the only person who believes that the next logical step after chatbots like ChatGPT is agents that use computers and roam the Web. Ruslan Salakhutdinov, a professor at Carnegie Mellon University who was Apple’s director of AI research from 2016 to 2020, believes that Siri and other assistants are in line for an almighty AI upgrade. “The next evolution is going to be agents that can get useful tasks done,” Salakhutdinov says. Hooking Siri up to AI like that powering ChatGPT would be useful, he says, “but it will be so much more impactful if I ask Siri to do stuff, and it just goes and solves my problems for me.”

Salakhutdinov and his students have developed several simulated environments designed for testing and honing the skills of AI helpers that can get things done. They include a dummy ecommerce website, a mocked-up version of a Reddit-like message board, and a website of classified ads. This virtual testing ground for putting agents through their paces is called VisualWebArena.


Tales from this testing ground suggest that AI agents will be able to do impressive things in the near future that will make digital life much easier. A model can, for instance, look at a photo of someone wearing a sweater, then hunt through ecommerce listings for similar garments below a certain price and add the cheapest to a person’s shopping cart. In another example, an agent informed that a person no longer wants to see posts from a particular user on a Reddit-like site can work out how to navigate the site’s settings to hide posts from the offending individual.

The tricky part is that there are also plenty of mishaps. In their tests, the CMU team found that AI agents could complete a sophisticated task around 16 percent of the time, while humans managed it 88 percent of the time. The failures are often mundane, such as an agent getting lost on a website and falling into an endless browsing loop. But sometimes they look more like misbehavior, as when an agent accidentally loads dozens of items into a customer’s shopping cart or befriends a troublesome user on a social platform. Perhaps it’s just as well that I can’t currently give vimGPT my payment details.

One reason the CMU environments are useful is that AI agents can wreak havoc inside them without doing any real damage. Logging those incidents helps researchers understand how well agents can perform a given task and how they fail. Salakhutdinov says that letting agents run wild in environments like VisualWebArena could also allow them to learn from their triumphs and disasters, much as simulations have been used to train game-playing machine-learning algorithms into champion-beating aces like Alphabet’s AlphaGo.

Salakhutdinov says he has no inside knowledge of what Apple is working on now but expects the company to be actively developing agents. “Each of the big tech companies—Apple, Microsoft, Google—has divisions essentially working in that arena,” he says.
