How to scale and optimize IT operations using agentic AI

By SomaSekhar Pamidi, Harry Keir Hughes

09 Oct, 2025

Insights

ITOps teams are overwhelmed, with ITIL processes taking at least 80% of current ops teams’ efforts, based on engagements Infosys has with our clients.
Fighting fires in this way limits agility, increases operational costs, and hinders an organization’s ability to focus on strategic innovation.
AI agents offer a range of capabilities to transform ITOps from troubleshooter to value driver, including enhancing customer experience and streamlining support.
With agentic AI, the future operating model for ITOps is both hybrid and intelligent, a combination of humans, traditional AI, agentic AI, and automation.
To get ahead, organizations should prioritize use cases, establish an agentic AI hub, integrate existing automation capabilities, develop AI expertise, focus on data, and implement robust monitoring and feedback loops.


IT operations (ITOps) is the frontline firefighting force behind every organization — proactively identifying and resolving issues, managing the life cycle of infrastructure and application components, and maintaining a robust security posture against evolving internal and external business threats.

And it’s a huge investment, often consuming 40% to 80% of IT budgets. For years, productivity gains, improved user experience, and enhanced service delivery have primarily come through deterministic automation.

Different ITOps models have been tested over the decades, from integrated operations to platform engineering, and most recently, site reliability engineering (SRE). Each has tapped into the latest tech trends to bring down operational expenses and increase stability and scalability. SRE stands out, cutting costs by up to 40% and speeding incident response.

But only recently has AI — and agentic AI more particularly — been used to optimize ITOps, with most initiatives in the pilot stage as of August 2025. Our AI Business Value Radar report identifies ITOps as an optimal use case for AI, with high levels of user acceptance, viability, and transformation potential (Figure 1).

Figure 1. ITOps use cases have high selection rates and are very viable

Source: Infosys

However, before ITOps — and the rest of the business — can benefit from introducing agentic AI, a thorough transformation of data foundation, operating model, strategy, and change management is required. With these elements in place, organizations can get the full benefits of agentic AI, including more intuitive, adaptive IT processes that bridge the gap between rigid automation and human-like reasoning.

The problem with IT operations

ITOps executives still speak of shift-left operations (resolving issues earlier in the service delivery process) and end-to-end stack monitoring, which looks at the full software engineering stack primarily through an automation lens. Enterprises, especially those with a hybrid cloud stack, also use machine learning (ML) to perform pattern analysis and resolve recurrent issues. These organizations stick to Information Technology Infrastructure Library (ITIL) frameworks to align IT services with business needs. In this landscape, ITOps is often there simply to keep the business humming without major incidents ruining reputations.

With hybrid cloud, unified visibility and control of IT architectures and processes is always needed, but rarely achieved.

ITOps staff suffer “alert fatigue” and even burnout, while more strategic ITOps initiatives are left unaddressed.

This operational strain is evident in the allocation of team efforts across ITIL processes, which include incident management, change management, problem management, risk management, and demand management. These processes alone take at least 80% of current ops teams’ efforts, based on engagements Infosys has with our clients. Through in-person interviews with executives within these client engagements, the following breakdowns have been derived:

Incident management (50%+ effort): The volume and complexity of incidents in IT platforms necessitate a significant time investment, from quick fixes (between 15 and 30 minutes) to major outages that require extensive SME involvement (from four to 40 hours, on average). This impacts business continuity and team productivity. Further, cross-functional teams need to come together to resolve most of these incidents, sometimes with 30 to 40 people working to resolve just one issue.
Change management (20%+ effort): The multi-step, multi-stakeholder process for implementing ITIL changes across a business leads to lengthy timelines of between two and 20 days per change, slowing down critical updates and deployments. Change requesters and change advisory board members are often overwhelmed with the volume and accuracy needed.
Problem management (10%+ effort): The need to identify and resolve root causes of recurring issues consumes resources over extended periods that can range from one to four weeks per problem, indicating underlying systemic challenges.
Risk management (10%+ effort): Proactive efforts to manage vulnerabilities and ensure security across the infrastructure demand significant and ongoing attention.
Demand management (10%+ effort): Fulfilling user requests for new resources and deployments adds to the operational workload, diverting focus from strategic initiatives.
Other activities: Essential supporting tasks further contribute to the overall operational burden.

Fighting fires in this way limits agility, increases operational costs, and hinders an organization’s ability to focus on strategic innovation. Matters are further complicated by the need to add other important ITIL processes such as knowledge management, capacity management, and event management. This often pushes overall team support effort beyond 100% of capacity, which usually means a dip in quality and cost control.

A new agentic AI-driven operating model

Hybrid cloud operations are shifting from just maintaining infrastructure to delivering greater business value. This is driven by AI solutions that observe the environment, process the input, and achieve specific objectives — otherwise known as agentic AI. These agentic solutions offer a range of capabilities, including enhancing customer experience, streamlining support, and troubleshooting with AI-assisted support engineers.

With agentic AI, the future operating model for ITOps is both hybrid and intelligent, a combination of humans, traditional AI, agentic AI, and automation.

Importantly, this operating model isn’t a replacement, but a complement to investments already made on automation, observability, and SRE operations.

Here, smart catalogs (service directories that automate, personalize, and streamline IT requests) and AI assistants become the interface for all operation team queries (Figure 2). The agent hub holds the tools and skills needed to provide autonomous, predictive, and adaptive solutions. Most importantly, humans step up as engineers, defining the processes and building the agents, along with the tasks, skills, and automation required to empower the agents.

Together, these advancements are driving hybrid cloud operations toward an AI-first approach — where AI plays a central role in all aspects of service delivery, enabling businesses to be more agile, responsive, and customer centric.

Figure 2. A new operating model for IT operations

Source: Infosys

Specifically for ITIL processes, AI has a big part to play.

Incident management: The agentic incident management life cycle is structured around four key actions. First, the agent defines the incident by analyzing issues and learning the context through semantic interpretation. Next, it discovers the scope of the incident by identifying affected assets and assessing their criticality to the business. The agent then isolates the cause by validating assets and eliminating potential failure points. Finally, it resolves the incident by applying established solutions or procedures, with escalation to human engineers for complex scenarios. Based on our work thus far, explicitly addressing these four stages offers potential efficiency gains of between 70% and 80% in the incident management process of IT support operations.
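The four stages described above (define, discover, isolate, resolve) can be sketched as a simple pipeline. This is a minimal illustration rather than a production design: the `ASSETS` inventory, the `RUNBOOKS` table, and all function names are hypothetical, and plain keyword matching stands in for the semantic interpretation an AI agent would perform.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical asset inventory; in practice this might come from a CMDB.
ASSETS = {
    "payments-db": {"keywords": {"payment", "database"}, "criticality": "high", "healthy": False},
    "payments-api": {"keywords": {"payment", "api"}, "criticality": "high", "healthy": True},
    "intranet-wiki": {"keywords": {"wiki"}, "criticality": "low", "healthy": True},
}

# Hypothetical established procedures, keyed by the asset found at fault.
RUNBOOKS = {"payments-db": "fail over to the standby replica"}


@dataclass
class Incident:
    description: str
    affected_assets: list = field(default_factory=list)
    cause: Optional[str] = None
    status: str = "new"


def define(incident):
    """Stage 1: interpret the report (keyword matching stands in for semantic analysis)."""
    return {word.strip(".,").lower() for word in incident.description.split()}


def discover(incident, terms):
    """Stage 2: find affected assets, listing business-critical ones first."""
    hits = [name for name, asset in ASSETS.items() if asset["keywords"] & terms]
    incident.affected_assets = sorted(hits, key=lambda n: ASSETS[n]["criticality"] != "high")


def isolate(incident):
    """Stage 3: eliminate healthy assets; a single remaining one is the likely cause."""
    failing = [n for n in incident.affected_assets if not ASSETS[n]["healthy"]]
    incident.cause = failing[0] if len(failing) == 1 else None


def resolve(incident):
    """Stage 4: apply a known procedure, otherwise escalate to a human engineer."""
    incident.status = "resolved" if incident.cause in RUNBOOKS else "escalated"


def handle(description):
    incident = Incident(description)
    terms = define(incident)
    discover(incident, terms)
    isolate(incident)
    resolve(incident)
    return incident
```

In this sketch, an incident whose cause cannot be narrowed to a single asset with a known runbook is escalated, mirroring the human handoff for complex scenarios.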

Change management: Agentic AI helps change requesters by guiding them through the creation of requests, including risk assessments and implementation or rollback plans. It then aids technical managers in reviewing these requests for technical soundness and completeness. The change advisory board (CAB) benefits from AI's ability to analyze change requests against metrics such as risk assessment and cross-domain overlaps, leading to better decisions. Finally, AI implements the change and performs post-implementation analysis and updates relevant systems for completeness. In all, this promises significant time and resource savings for technical and process teams, potentially between 50% and 60%.

Problem management: Agentic AI can conduct trend analysis to pinpoint recurring incident patterns, leading to the automated creation of problem tickets for further examination. Agents are also good at root cause analysis (RCA), correlating data from IT service management (ITSM) systems, email correspondence, chat logs, and system components, resulting in detailed RCA reports. Also, with its reflective capacity, agentic AI can generate knowledge bases and best practices derived from successful problem resolutions. This strategic initiative reduces operational effort by between 50% and 60% and reduces the burden of repetitive failures.
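As a rough illustration of the trend analysis step, the sketch below counts incidents by service and category and proposes a problem ticket once a pattern recurs often enough. The record fields (`service`, `category`), the function name, and the threshold of three are assumptions made for the example; a real agent would draw on ITSM data and richer correlation.

```python
from collections import Counter


def propose_problem_tickets(incidents, threshold=3):
    """Return draft problem tickets for (service, category) pairs seen at least `threshold` times."""
    counts = Counter((i["service"], i["category"]) for i in incidents)
    return [
        {"title": f"Recurring {category} incidents on {service}", "occurrences": n}
        for (service, category), n in counts.most_common()
        if n >= threshold
    ]


# Illustrative incident history (made-up data).
history = [
    {"service": "checkout", "category": "timeout"},
    {"service": "checkout", "category": "timeout"},
    {"service": "search", "category": "timeout"},
    {"service": "checkout", "category": "disk"},
    {"service": "checkout", "category": "timeout"},
    {"service": "search", "category": "timeout"},
]

tickets = propose_problem_tickets(history)
```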

Risk management: Significant effort goes into researching suitable cybersecurity solutions and assessing the organization’s compliance status. Here, AI can help by triaging vulnerabilities and automatically mapping them to corresponding solutions, drawing from original equipment manufacturer (OEM) patch releases or known workarounds. This helps maintain the environment's security posture. Agents can also be used in the patch deployment life cycle, helping with scheduling, execution, and verification, along with rollback capabilities, reducing the amount of work support staff do on weekends and off-hours.
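The triage-and-mapping idea can be sketched as follows, assuming vulnerabilities arrive with an identifier and a severity label. The identifiers, the `patches` and `workarounds` lookups, and the severity ordering are all hypothetical placeholders, not a real vulnerability feed or vendor catalog.

```python
# Assumed severity ranking; lower number means handled first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def triage(vulnerabilities, patches, workarounds):
    """Order vulnerabilities by severity and map each to a patch, a workaround, or human review."""
    plan = []
    for vuln in sorted(vulnerabilities, key=lambda v: SEVERITY_ORDER[v["severity"]]):
        vid = vuln["id"]
        if vid in patches:
            action, detail = "patch", patches[vid]
        elif vid in workarounds:
            action, detail = "workaround", workarounds[vid]
        else:
            action, detail = "needs_review", None  # no known fix: hand to a person
        plan.append({"id": vid, "action": action, "detail": detail})
    return plan


# Made-up example inputs.
found = [
    {"id": "VULN-003", "severity": "low"},
    {"id": "VULN-001", "severity": "critical"},
    {"id": "VULN-002", "severity": "high"},
]
plan = triage(
    found,
    patches={"VULN-001": "vendor patch 1.2.3"},
    workarounds={"VULN-003": "disable legacy endpoint"},
)
```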

Demand management: Agentic AI can provide more accurate demand forecasts by analyzing business trends, historical consumption, and real-time metrics. This is already happening in many of our client organizations, where bringing agentic AI together with traditional machine learning is transforming forecasting, especially in use cases where a wide variety of data is available. This data also fuels optimization strategies for existing server, storage, and network resources. The AI can then proactively automate resource scaling based on these insights, ensuring that IT capacity aligns with business needs. While service request fulfillment (where self-service is not present) still largely relies on human expertise, AI can provide procedural support in responding to requests. This data-centric augmentation can lead to an estimated reduction in human effort of between 40% and 50%.
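To make the forecast-then-scale loop concrete, here is a minimal sketch in which a moving average stands in for the forecasting model. The window size, the 20% headroom, and the per-instance capacity are illustrative assumptions; the forecasting described above would combine business trends, historical consumption, and real-time metrics.

```python
import math


def forecast_next(history, window=3):
    """Forecast the next period's demand as the mean of the most recent `window` observations."""
    if not history:
        raise ValueError("history must contain at least one observation")
    recent = history[-window:]
    return sum(recent) / len(recent)


def recommend_instances(history, unit_capacity, headroom=0.2, window=3):
    """Translate the forecast into an instance count, adding headroom for spikes."""
    target = forecast_next(history, window) * (1 + headroom)
    return math.ceil(target / unit_capacity)


# Made-up requests-per-second history, with each instance assumed to serve 50 rps.
rps_history = [100, 120, 140, 160, 180]
instances = recommend_instances(rps_history, unit_capacity=50)
```

With these inputs the forecast is 160 rps, the target with headroom is about 192 rps, and the recommendation rounds up to four instances.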

Other activities: IT operations teams' time is often consumed by repetitive tasks such as generating reports, status updates, follow-up communications, data gathering, and knowledge base updates, which can be taken over by AI agents. By offloading routine work to AI, ITOps teams gain the freedom to focus on complex troubleshooting and proactive system management, leading to a more engaged and effective workforce, which our research has shown increases the success and acceptance of AI use cases.

Other use cases for ITOps include predicting potential incidents by correlating events, logs, and traces, and proactively applying remediation to avoid disruption. Agentic AI is also being used to assist SRE engineers in developing automation scripts, identifying hotspots for these engineers to prioritize and rectify.

Six things to do now

Intelligently applying AI agents to ITOps, along with other AI, is transformational. Organizations can concentrate on delivering friendly user experiences, while strengthening the resilience and availability of vital IT infrastructure, at lower cost.

Prioritize use cases strategically: To start, conduct workshops involving IT operations teams, business stakeholders, and end users. These workshops should identify pain points, recurring issues, and areas with high manual effort or service requests that take a long time to implement. Then analyze and quantify potential return on investment (ROI) for each identified use case. Prioritize use cases based on factors such as potential ROI, feasibility of implementation, data availability, and strategic alignment with business goals.
Establish an agentic AI hub: Develop a strategy for leveraging agentic AI solutions. Evaluate the suitability of public, private, or hybrid agentic AI deployments based on data sensitivity, compliance requirements, existing infrastructure, and cost considerations. You will also need to establish guidelines for selecting the appropriate agentic AI technology stack based on the requirements of each use case. Consider building a centralized agentic hub — a team or center of excellence — to oversee the development, deployment, governance, and scaling of agentic AI solutions across IT operations.
Integrate existing automation capabilities: Audit existing automation tools and scripts within IT operations to identify opportunities to integrate AI and augment this existing automation. Agentic AI can orchestrate workflows involving a range of automation tools, provide intelligent decision-making within automated processes, and handle exceptions more gracefully through prompt engineering techniques. This strategy avoids siloing AI initiatives and builds on past investments.
Develop AI expertise: Implementing and managing agentic AI solutions requires new skills. Enterprises need to invest in training and upskilling their IT staff in areas such as AI/ML fundamentals, prompt engineering, data science basics, AI ethics, and agentic AI frameworks (for example, LangChain, Semantic Kernel, and the AI foundries offered by different cloud providers). Consider hiring specialized AI/ML engineers and data scientists to lead and support these initiatives. Domain experts and data scientists, along with new AI-skilled teams, will be the pillars of a successful ITOps transformation.
Focus on the data foundation: Agentic AI needs good data that is accessible, clean, well-governed, and relevant to the use case. This includes ensuring data integration across IT systems (monitoring tools, ITSM, databases, and logs) and implementing data quality measures. Large language models can assist here due to their reasoning capabilities. There is also a need to establish appropriate data access controls and privacy protocols.
Implement robust monitoring and feedback loops: Continuously monitor the performance and effectiveness of the deployed AI solutions. Establish feedback loops to capture insights from IT staff and end users on AI's performance and identify areas for improvement. Regularly evaluate the ROI of agentic AI initiatives and iterate on the models and workflows based on performance data. The goal of the transformation should be to make things easier for all involved.
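The first recommendation above, prioritizing use cases on ROI, feasibility, data availability, and strategic alignment, can be sketched as a weighted score. The weights, the 1-to-5 rating scale, and the example ratings are assumptions chosen for illustration; each organization would set its own through the workshops described.

```python
# Assumed weights for the four prioritization factors (they sum to 1.0).
WEIGHTS = {
    "roi": 0.40,
    "feasibility": 0.25,
    "data_availability": 0.20,
    "strategic_alignment": 0.15,
}


def score(use_case):
    """Weighted sum of a use case's 1-to-5 ratings on each factor."""
    return sum(weight * use_case[factor] for factor, weight in WEIGHTS.items())


def prioritize(use_cases):
    """Return use cases ordered from highest to lowest score."""
    return sorted(use_cases, key=score, reverse=True)


# Made-up workshop ratings.
candidates = [
    {"name": "capacity forecasting", "roi": 4, "feasibility": 2, "data_availability": 2, "strategic_alignment": 4},
    {"name": "incident triage", "roi": 5, "feasibility": 4, "data_availability": 4, "strategic_alignment": 5},
    {"name": "patch scheduling", "roi": 3, "feasibility": 5, "data_availability": 3, "strategic_alignment": 3},
]

ranked = [u["name"] for u in prioritize(candidates)]
```

A simple linear score keeps the ranking easy to explain to stakeholders; the same structure also supports re-scoring as monitoring and feedback data accumulate.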

But it doesn’t all happen at once

According to Forrester, “In 2025, technology leaders will triple the adoption of AI for IT operations, providing contextual data to augment human judgment, automatically remediate incidents, and improve business outcomes.”

Despite its promise, making agentic AI work in ITOps isn’t easy. It demands a considered transformation journey, one that many of our clients have just started on.

Not all tasks will benefit equally from AI augmentation. Many struggle to define specific, achievable use cases with measurable outcomes and to quantify the potential benefits, such as reduced downtime, improved efficiency, improved mean-time-to-resolution, or enhanced user satisfaction.

And a lack of clear ROI can make it difficult to secure investment and executive support for agentic AI initiatives. Returns don’t appear overnight. And like any transformation, benefits build gradually as the technology matures.

Governance is nonnegotiable. Success depends on clear frameworks and trust in agentic AI systems. This includes defining roles and responsibilities for AI oversight; implementing mechanisms for monitoring AI agent behavior and decision-making; continuous training of AI models; establishing clear escalation paths for AI-related errors or complex situations; and ensuring compliance with relevant regulations and ethical guidelines – subjects we have addressed in detail in our report, Responsible enterprise AI in the agentic era.

Without robust governance, concerns about accountability, bias, and unintended consequences can hinder the deployment and acceptance of agentic AI in critical IT operations.

When governance is built into ITOps, organizations develop the critical speed and impact needed for this transformation, one that will, with time, achieve a significant improvement in user experience, personalization, and operational effectiveness.
