Anthropic’s Ambitious Vision: Empowering AI Agents to Control Your Computer

People have gradually come to terms with chatbots that appear to think for themselves. The next significant shift may require us to trust artificial intelligence with managing our computers as well.

Anthropic, a rising competitor to OpenAI, recently revealed that it has trained its AI model, Claude, to perform a variety of tasks on a computer, such as searching the internet, launching applications, and using the mouse and keyboard to navigate and enter text.

“We are on the brink of a new age where a model can utilize all the tools at a person’s disposal to accomplish tasks,” states Jared Kaplan, Anthropic’s chief science officer and an associate professor at Johns Hopkins University.

Kaplan showcased a pre-recorded demonstration to WIRED wherein a “tool-using” version of Claude was tasked with planning a trip to witness the sunrise at the Golden Gate Bridge alongside a friend. In response to the request, Claude opened the Chrome browser, searched for pertinent information on Google, such as the best viewpoint and optimal timing, and then utilized a calendar application to schedule an event to share with a friend. (However, it did not provide additional guidance, such as the quickest route to take.)

In a second demonstration, Claude was tasked with creating a basic website to showcase itself. In an uncanny turn of events, the AI input a text prompt into its own web interface to generate the required code. It then used Visual Studio Code, a widely used code editor from Microsoft, to craft a straightforward website, and opened a text terminal to spin up a simple web server for testing the site. The result was a charming, 1990s-inspired landing page for the AI model. When the user asked it to rectify an issue on the site, the AI returned to the editor, pinpointed the problematic snippet of code, and removed it.

Mike Krieger, the chief product officer at Anthropic, mentions that the company aspires for these AI agents to streamline routine office work, allowing people to enhance their productivity in other areas. “What would you do if you eliminated a significant number of hours spent copying and pasting or doing tedious tasks?” he contemplates. “I would probably spend more time playing guitar.”

Starting today, Anthropic is making the capabilities of its most advanced multimodal large language model, Claude 3.5 Sonnet, available through its application programming interface (API). The company has also unveiled a new and improved version of a smaller model, Claude 3.5 Haiku.

While demonstrations of AI agents can be impressive, ensuring the technology performs consistently and without frustrating (or expensive) errors in real-world applications can be challenging. Current models exhibit human-like conversational abilities and can answer queries with remarkable skill, forming the foundation of chatbots like OpenAI’s ChatGPT and Google’s Gemini. They can also execute tasks on computers when supplied with simple commands, by accessing the computer screen as well as input devices such as keyboards and trackpads, or via low-level software interfaces.
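To make the mechanics concrete, here is a minimal sketch of what a request to the API with computer control enabled might look like. It builds (but does not send) a Messages API request body; the tool type string, tool name, and display-dimension fields follow Anthropic's beta documentation at the time of writing and should be treated as illustrative assumptions rather than a definitive integration.

```python
import json


def build_computer_use_request(prompt: str) -> dict:
    """Assemble (but do not send) a Messages API request body that
    would give Claude 3.5 Sonnet a virtual 1024x768 screen to act on.

    Field names here mirror Anthropic's beta computer-use docs and
    may change; this is a sketch, not a guaranteed contract.
    """
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [
            {
                # Beta computer-use tool: lets the model request
                # screenshots, mouse moves, clicks, and keystrokes.
                "type": "computer_20241022",
                "name": "computer",
                "display_width_px": 1024,
                "display_height_px": 768,
            }
        ],
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_computer_use_request(
    "Find the best spot to watch the sunrise at the Golden Gate Bridge."
)
print(json.dumps(request, indent=2))
```

In a real integration, the client sends this body with the beta feature enabled, then runs a loop: the model replies with tool-use actions (take a screenshot, click at coordinates, type text), the client executes each action on the actual machine, and the results are fed back until the task completes.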

Anthropic recently announced that Claude outshines other AI agents on multiple important benchmarks, including SWE-bench, which evaluates an agent's software development capabilities, and OSWorld, which assesses an agent's proficiency at operating a computer's graphical interface. These claims have not yet been independently verified. According to Anthropic, Claude successfully completes tasks in OSWorld 14.9 percent of the time. While this success rate is far below that of humans, who typically score around 75 percent, it still surpasses the performance of leading AI agents, such as OpenAI's GPT-4, which manages approximately 7.7 percent.

Furthermore, Anthropic says that several organizations are already testing the agent-optimized version of Claude. Among them are Canva, which is using Claude to streamline design and editing processes, and Replit, which employs the AI for programming tasks. Other early adopters include The Browser Company, Asana, and Notion.

Ofir Press, a postdoctoral researcher at Princeton University and contributor to SWE-bench, notes that agentic AI often lacks the foresight needed for effective planning and frequently has difficulty recovering from mistakes. He emphasizes that demonstrating their utility requires achieving robust performance on challenging and realistic benchmarks, such as efficiently planning various trips for users and arranging all necessary bookings.

Kaplan adds that Claude has shown an impressive ability to troubleshoot certain errors. For example, when encountering a terminal error while attempting to launch a web server, the model was able to adjust its command to resolve the issue. Additionally, it recognized the need to enable popups when it encountered obstacles while browsing online.

Several technology companies are currently in a competitive race to create AI agents as they strive for market leadership. It may not be long before countless users have these agents readily available. Microsoft, having invested over $13 billion in OpenAI, has announced that it is testing agents designed to operate on Windows computers. Meanwhile, Amazon, which has made significant investments in Anthropic, is looking into how agents could suggest and eventually purchase items for its customers.

Sonya Huang, a partner at the venture capital firm Sequoia who specializes in AI firms, notes that despite the buzz around AI agents, many businesses are simply rebranding existing AI-powered tools. In a conversation with WIRED prior to the Anthropic announcements, she remarked that the technology currently performs best when focused on specific areas like coding tasks. “It’s essential to select problem spaces where the model’s failure is acceptable,” she states. “These are the areas where true agent-native companies will emerge.”

One significant hurdle with agentic AI is that mistakes can pose graver issues than nonsensical chatbot responses. Anthropic has set certain limits on what Claude can accomplish, such as restricting its capacity to utilize a person’s credit card for purchases.

If errors can be sufficiently mitigated, comments Press from Princeton University, users may begin to perceive AI—and technology as a whole—in a radically different light. “I’m incredibly enthusiastic about this new era,” he shares.
