I spent a day at the Cloud Intelligence / AIOps Workshop at ASPLOS 2026 in Pittsburgh. It was one of the better technical days I’ve had in a while — not because everything was resolved, but because the field is clearly at an inflection point, and the conversations reflected that.

The talk that stuck with me most came at the end of the day, from Ariane Lanier at Meta. She said something I’m going to paraphrase: you aren’t deciding whether agents can work on your systems — they’re already there and being used. The question isn’t whether to let them in. It’s what you’re going to do now that they are.

That framing landed differently for me than it might for a researcher. I work in Azure Observability. I’m both a user of AIOps tooling — maintaining large-scale services where ops quality is the job — and a close partner with the teams building it, since observability is the foundation everything else runs on. The “agents are already here” moment wasn’t abstract. It was recognizable.


The trough, and what’s climbing out of it

Vyas Sekar from CMU offered a characterization I think is accurate: AIOps is in a Trough of Disillusionment. The field has invested heavily in applying AI to operations, and the returns have been underwhelming relative to the hype. A benchmark result from the OpenRCA paper put the best model at 12.5% accuracy on cloud root cause analysis tasks. That’s a number that gets cited as humbling.

I’d push back slightly on the humbling read. The OpenRCA benchmark captures a 30-minute window of signals — a real constraint for RCA, where the relevant context often sits in the hours before an incident. And the models evaluated at ICLR ‘25 submission time are already meaningfully behind what’s available now. From where I sit, the capability picture has shifted substantially in the last three months. The Trough of Disillusionment framing was accurate when most of these talks were written, and I think it’s still fair — but we’re on the cusp of moving on to the Slope of Enlightenment.

What I think is still true: people aren’t yet getting value out of the capabilities that exist. Sekar’s framing here was useful — the problem isn’t that we need new algorithms, it’s that we need better systems. The AI isn’t missing; the infrastructure, context, and operational design to make it effective are. Several talks demonstrated exactly that gap, and pointed at what closing it might look like.


The accountability gap is a velocity problem

The panel and Lanier’s keynote both circled around a question that I think is underappreciated: when an agent causes an incident, who owns it?

Right now, the answer is: human engineers do. And that’s rational. You don’t hand over control of something you’ll be blamed for, especially when the agent’s decision-making is opaque. The result is a meaningful drag on adoption — not because people don’t believe agents can help, but because the accountability infrastructure doesn’t exist yet to share ownership in a sensible way.

This is partly a technical problem (auditability, explainability, verifiable outputs) and partly an organizational one (incentive structures, change management, risk calibration). Lanier was right to flag the people challenges as critical even while setting them out of scope — in practice, they may be harder than the technical ones.


The infrastructure is behind

Lanier’s central argument was that the missing half of AIOps isn’t better AI — it’s better infrastructure. Clean APIs. Clear errors. Safe boundaries. Tools designed for agents rather than retrofitted from human-facing ones. She made the point that tokens spent on tool discovery aren’t spent on actual operations, and that agent-friendly errors can make a measurable difference in effectiveness.
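To make that concrete, here’s a toy sketch of my own (none of these names or payload fields come from the talk) contrasting a human-oriented failure with the kind of structured, actionable error an agent can act on without burning tokens on rediscovery:

```python
# Hypothetical illustration. A human-oriented tool might surface a failure as
# an opaque string: "Error: deployment failed. Check the logs." An
# agent-friendly tool returns structured detail the agent can act on directly.

def deploy(service: str, region: str) -> dict:
    """Toy deploy call that fails with an agent-friendly error payload."""
    return {
        "ok": False,
        "error_code": "QUOTA_EXCEEDED",  # stable, machine-matchable code
        "message": f"Region {region} is at its core quota for this subscription.",
        "remediation": [  # concrete next actions, not "check the logs"
            "Retry in region westus3, which has available quota.",
            "Or request a quota increase before retrying.",
        ],
        "retryable": True,
    }
```

The difference is small to implement and large in effect: the agent spends its next step acting on the remediation list instead of guessing at log locations.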

I think this is right, and I think this infrastructure is nascent industry-wide. The framing I found most useful: Identity → Observability → Guardrails, in that order. You can’t have accountability without identity. You can’t have guardrails without observability. The sequencing matters.
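As a sketch of why that ordering composes, here’s a minimal toy example (all names invented, not from the keynote). Identity is attached to every action, identified actions form the observable audit stream, and guardrails are evaluated against that attributed stream:

```python
# Minimal sketch of the Identity -> Observability -> Guardrails layering.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentIdentity:
    agent_id: str      # who is acting: the prerequisite for accountability
    on_behalf_of: str  # the human or team that delegated authority

@dataclass
class AgentAction:
    identity: AgentIdentity
    tool: str
    args: dict
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

AUDIT_LOG: list[AgentAction] = []  # observability: an attributed action stream

def guardrail_ok(action: AgentAction) -> bool:
    # Guardrails are only meaningful because actions are identified and
    # observable; e.g., rate-limit one specific agent, not all traffic.
    prior = [a for a in AUDIT_LOG if a.identity.agent_id == action.identity.agent_id]
    return len(prior) <= 100 and action.tool != "delete_production_db"

def execute(action: AgentAction) -> None:
    AUDIT_LOG.append(action)  # record the attempt before acting
    if not guardrail_ok(action):
        raise PermissionError(f"guardrail blocked {action.tool}")
    ...  # dispatch to the actual tool here
```

Strip any layer out and the one above it collapses: without identity the log can’t attribute, and without the log the guardrail has nothing to evaluate.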


An unresolved debate worth watching

One honest note: there was a live disagreement in the room about declarative vs. imperative approaches to agent-operated infrastructure, and it didn’t get resolved. One keynote argued that imperative gives way to declarative as the natural evolution for agents; another position pushed in the opposite direction. The debate will likely stay open until the field has more production experience with agent-operated systems.
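For readers who weren’t in the room, here’s my own toy contrast of the two styles (not from either keynote): imperative, where the agent owns each step, versus declarative, where the agent emits desired state and a reconciler converges toward it.

```python
# Toy contrast: imperative vs. declarative control of the same scaling task.

class ToyCluster:
    def __init__(self) -> None:
        self.replicas: dict[str, int] = {}

    def count(self, svc: str) -> int:
        return self.replicas.get(svc, 0)

    def add(self, svc: str) -> None:
        self.replicas[svc] = self.count(svc) + 1

# Imperative: the agent issues each step, and owns each step.
def scale_imperative(cluster: ToyCluster, svc: str, target: int) -> None:
    while cluster.count(svc) < target:
        cluster.add(svc)

# Declarative: the agent states intent; a reconciler owns the steps.
def reconcile(cluster: ToyCluster, desired: dict) -> None:
    while cluster.count(desired["service"]) < desired["replicas"]:
        cluster.add(desired["service"])

cluster = ToyCluster()
scale_imperative(cluster, "frontend", 3)                      # agent-driven steps
reconcile(cluster, {"service": "frontend", "replicas": 12})   # intent-driven
```

The accountability question from earlier cuts differently in each style: imperative makes every step attributable to the agent, while declarative shifts ownership of the steps to the reconciler and leaves the agent owning only the intent.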


Where this leaves things

AIOps is real. It’s in production. The question of whether agents belong in your systems has been answered by the engineers who are already using them. The question now is whether the infrastructure, the accountability models, and the organizational readiness can catch up to the capability.

Based on what I heard at ASPLOS, and on what I’ve seen in my own work over the last few months: we’re on the cusp of climbing out of the Trough of Disillusionment. Not because the hard problems are solved, but because the capability gap that made those problems feel impossible has closed considerably. The work ahead is the harder, slower kind — infrastructure, trust, accountability. That’s where the field needs to focus.