Model Update
2026-05-28
Hugging Face Blog
ITBench-AA: Frontier Models Score Below 50%
A new benchmark called ITBench-AA has sent shockwaves through the AI industry by revealing that even the most advanced frontier models score below 50% on agentic enterprise IT tasks. Developed by a consortium of researchers and enterprise IT leaders, ITBench-AA is the first standardized benchmark specifically designed to evaluate AI agents in real-world IT operations scenarios.
The benchmark tests AI agents on tasks such as incident response, system configuration, network troubleshooting, and compliance auditing. Despite the rapid progress seen in general language understanding and coding, the results show that current AI systems struggle with the complexity, ambiguity, and multi-step reasoning required in enterprise IT environments. The highest-performing model achieved only 47% accuracy, while most fell below 40%.
These findings highlight significant gaps in AI capabilities that are critical for enterprise adoption. IT operations involve nuanced decision-making, understanding of legacy systems, and adherence to strict security protocols, all areas where AI still falls short. The benchmark creators argue that the low scores are not a failure but a necessary wake-up call for the industry.
The implications are clear: while AI excels at narrow, well-defined tasks, true agentic autonomy in enterprise IT remains elusive. Researchers are now using ITBench-AA to guide development, focusing on areas like long-term memory, error recovery, and cross-system coordination. For IT leaders, the benchmark provides a realistic assessment of what AI can and cannot do today, helping set appropriate expectations.
As investment in AI-powered IT operations grows, ITBench-AA serves as a crucial reality check. It underscores that while the potential is enormous, we are still in the early stages of building AI agents that can reliably manage the complex, messy reality of enterprise infrastructure.