An ‘undo-and-retry’ mechanism for agents

By @IBMResearch
Publication Date: 2025-11-12 20:00:00

The convenience of the cloud can come with risks, as a wave of recent outages has shown. By one estimate, the average cost of an unplanned IT outage is now $14,000 per minute, up nearly 10% from 2022.

Rising costs have put more pressure than ever on site reliability engineers (SREs) to resolve incidents quickly. But with new servers coming online faster than there are engineers to keep them running safely, cloud providers have looked with hope toward AI.

Today’s AI tools for IT operations, known colloquially as “AIOps,” mostly help SREs perform triage: spotting symptoms and narrowing down suspected points of failure. Operators don’t trust AI agents enough to let them fix incidents directly. Without an auditable trail and a way to roll back unsuccessful moves, operators are unlikely to delegate this last mile of a response to an AI.
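To make “an auditable trail and a way to roll back unsuccessful moves” concrete, here is a minimal Python sketch of the general idea. It is not the IBM/UIUC mechanism, which this excerpt does not describe. The names `Action`, `AuditedExecutor`, and `undo_and_retry` are hypothetical, and the sketch assumes every remediation step can be paired with an inverse that restores the prior state. Each applied step is logged; if a health check fails, the steps are undone in reverse order and the next candidate plan is tried.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """Hypothetical remediation step paired with an inverse that reverts it."""
    name: str
    apply: Callable[[dict], None]
    undo: Callable[[dict], None]


@dataclass
class AuditedExecutor:
    """Applies actions to a shared state dict and records every event."""
    state: dict
    trail: list = field(default_factory=list)     # audit log of (event, action name)
    _applied: list = field(default_factory=list)  # uncommitted actions, newest last

    def run(self, action: Action) -> None:
        action.apply(self.state)
        self._applied.append(action)
        self.trail.append(("apply", action.name))

    def rollback(self) -> None:
        # Undo in reverse order so the most recent change is reverted first.
        while self._applied:
            action = self._applied.pop()
            action.undo(self.state)
            self.trail.append(("undo", action.name))

    def commit(self) -> None:
        # Keep the applied actions and stop tracking them for rollback.
        for action in self._applied:
            self.trail.append(("commit", action.name))
        self._applied.clear()


def undo_and_retry(
    executor: AuditedExecutor,
    plans: list[list[Action]],
    is_healthy: Callable[[dict], bool],
) -> Optional[int]:
    """Try each candidate plan in order; return the index of the first that
    passes the health check, or None if every plan had to be rolled back."""
    for i, plan in enumerate(plans):
        for action in plan:
            executor.run(action)
        if is_healthy(executor.state):
            executor.commit()
            return i
        executor.rollback()
    return None
```

For example, with `state = {"replicas": 2}` and two hypothetical plans that scale to 3 and then to 5 replicas, a health check of `replicas >= 4` rolls back the first plan, commits the second, and leaves a trail of apply, undo, apply, and commit events that an operator can audit afterward.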

A novel safety guarantee proposed by researchers at IBM and the University of Illinois Urbana-Champaign (UIUC) could be the first step toward solving…