Insights
- As agentic AI grows in capability, testing is crucial to ensure safety, reliability, ethical behavior, and goal alignment, minimizing risks in real-world deployments.
- However, most organizations lack a comprehensive testing and evaluation framework fine-tuned for agentic AI, limiting confidence in agentic outcomes.
- One approach is to evaluate AI agents at both the system and component level, using both black box evaluation and white box evaluation techniques.
- Black box evaluation focuses on the external behavior of the AI system, while white box evaluation looks under the hood at the internal mechanics of the underlying generative intelligence.
- Implementing this approach also means defining success early on, embedding evaluation into workflows, and involving humans as key players in AI assurance.
Agentic systems are being embedded into business processes. Cisco, for example, has implemented an agentic AI system that can automatically detect potential supply chain disruptions and adjust inventory levels accordingly, resulting in a significant reduction in inventory costs and improved delivery times. UPS has developed ORION (On-Road Integrated Optimization and Navigation), an agentic system that determines and updates the most efficient delivery routes in real time, resulting in a significant reduction in delivery miles and fuel costs. A global accounting company that Infosys worked with has developed a multiagent system to assist auditors in expense vouching, saving several million dollars annually by removing mundane work and providing deeper, more comprehensive audit coverage.
Why traditional risk models are no longer sufficient
Agentic AI is growing in capability. Systems are operating autonomously, reasoning and making decisions, and taking actions to achieve specific goals with minimal to no human intervention. Agentic systems are increasingly capable of learning from interactions and experiences, adapting to new situations, and improving performance over time.
But agentic systems can exhibit nondeterministic behavior and explore novel, unproven strategies to achieve user-specified goals. While these capabilities open new possibilities, they introduce vulnerabilities and quality concerns beyond those of traditional AI (Figure 1). Many of these have been written about in our Tech Navigator Agentic AI journal, including unpredictable agent outcomes (a single input can give wildly different outputs without effective guardrails), excessive costs at training and inference, changes in the environment within which agents operate, and goal divergence and misalignment with human values such as equity and the avoidance of bias.
Agentic AI testing is therefore crucial to ensure safety, reliability, ethical behavior, and goal alignment, minimizing risks in real-world deployments.
However, deterministic evaluation methods built around binary pass/fail outcomes and designed for traditional rule-based AI systems often fall short. Most benchmarking tools focus on accuracy, efficiency, or short-term contexts, and fail to capture the nuances of adaptive behavior or the risks related to autonomy, goal divergence, variable use of tools and application programming interfaces (APIs), and other probabilistic outcomes.
Further, the lack of a comprehensive testing and evaluation framework fine-tuned for agentic AI makes it difficult to gain confidence in agentic outcomes. This results in lower adoption and could even lead organizations to abandon AI projects before deployment. In our recent report on responsible AI in the agentic AI era, we found that 86% of executives familiar with agentic AI believe it will pose additional risks and compliance challenges for their businesses.
Figure 1. Four types of risk businesses face from agentic AI
Source: Infosys
But get AI assurance right, and organizations can deploy agentic AI safely and reliably, satisfying regulatory requirements and using responsible AI as a growth driver for their business, as we also lay out in the same report.
Black box and white box testing
Agents should be evaluated at both the design and production stages of deployment. Organizations should use a comprehensive and rigorous evaluation framework at both the system and component levels. These evaluation processes should incorporate both black box evaluation, which focuses on external behavior and outputs without any insight into internal logic or code, and white box evaluation, which looks at the internal mechanics of the underlying generative intelligence.
Black box evaluation should focus on establishing statistical thresholds. These are cutoff points or minimum scores the agent should meet when tested. For example, accuracy thresholds might mean getting at least 95 out of 100 tasks correct, while safety thresholds might mean the agent must not cause harm in more than one out of 10,000 cases. Another method is to track the consistency of the agent, or the probability that it will give the same answer for a slightly different prompt, while also ensuring that the answer is coherent — in other words, that all retrieved content fits together logically and answers don’t contradict each other.
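To make this concrete, the sketch below shows one way such thresholds and consistency checks might be wired up. It is a minimal illustration, not a reference implementation: run_agent, the labeled test cases, and the paraphrase sets are hypothetical placeholders, and a production setup would typically compare answers semantically rather than by exact string match.

```python
import statistics
from collections import Counter

ACCURACY_THRESHOLD = 0.95     # e.g., at least 95 out of 100 tasks correct
CONSISTENCY_THRESHOLD = 0.90  # agreement across slightly reworded prompts

def evaluate_black_box(run_agent, test_cases, paraphrases):
    """Black box check: only inputs and outputs are inspected.

    run_agent(prompt) -> str is a hypothetical callable wrapping the agent.
    test_cases is a list of (prompt, expected_answer) pairs.
    paraphrases maps each prompt to a list of slightly reworded versions.
    """
    # Accuracy against a statistical threshold
    correct = sum(run_agent(p).strip() == a.strip() for p, a in test_cases)
    accuracy = correct / len(test_cases)

    # Consistency: how often the agent gives the same answer to paraphrases
    per_prompt = []
    for prompt, _ in test_cases:
        answers = [run_agent(q).strip() for q in paraphrases.get(prompt, [prompt])]
        most_common_count = Counter(answers).most_common(1)[0][1]
        per_prompt.append(most_common_count / len(answers))
    consistency = statistics.mean(per_prompt)

    return {
        "accuracy": accuracy,
        "accuracy_pass": accuracy >= ACCURACY_THRESHOLD,
        "consistency": consistency,
        "consistency_pass": consistency >= CONSISTENCY_THRESHOLD,
    }
```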
White box evaluation examines the AI’s internal mechanisms, algorithms, and logic paths to assess reasoning, decision-making, task planning, and collaboration with other agents, as well as interaction with external tools.
Both methods are important. Black box testing builds user confidence by validating software from the end users’ perspective, ensuring that all functions work as intended, while white box testing provides interpretability and technical assurance by detecting bugs, hidden errors, and security vulnerabilities early in the design process, and by providing complete code coverage and root cause analysis for defects.
Assuring agentic AI is not just about evaluating performance and eliminating every risk; it is also about creating structured transparency, establishing accountability for autonomous behaviors, and ensuring that reasoning and evolution in agentic decision-making are well understood. Both black box and white box evaluation help here.
A framework for agentic AI risk management
Organizations should implement both system-level and component-level evaluation metrics to assure against vulnerabilities and risk, provide confidence in quality, and deliver regulatory compliance. Each evaluation method is important here, but its use varies depending on whether a higher- or lower-level understanding of the agent is needed. System-level assurance uses more black box techniques, looking only at the outside of the agentic system to determine whether the outputs are as expected, while component-level assurance relies on white box testing, looking under the hood to verify, step by step, that the agentic system works as intended.
More particularly, system-level evaluation focuses on comprehensive coverage for behavioral and scenario-based testing, validating agent performance and determining whether agentic functionality aligns with user expectations and business goals across different scenarios, including edge cases and stressful situations. It is important to implement adversarial testing to intentionally introduce unusual or malicious inputs to expose vulnerabilities and assess the agent's resilience against attacks or errors. Testing for responsible AI behavior, including reliability, safety, and fairness, is also important, as is evaluating whether the system is resource efficient.
A robust system-level evaluation framework should include the following dimensions and metrics (a short illustrative sketch of how a few of them might be computed follows the list):
- Performance: This should include metrics such as goal success rate (how effectively the AI agent achieves its objectives), latency/response time (average time taken to produce the output), autonomous execution rate (the percentage of tasks completed without human intervention), and overall user satisfaction score.
- Reliability: This should be evaluated based on result consistency (same output for similar tasks or inputs), response relevance score (how relevant the response was), adversarial test case pass rate, and the overall availability of the agentic AI system to serve requests.
- Responsible AI: System-level responsible and accountable behavior should be measured using metrics such as policy adherence rate (to assure outputs adhere to business policies and guardrails), fairness scores, toxicity scores, and hallucination rate (to evaluate whether the response/output is made up).
- Resource efficiency and cost: This evaluation is key in ensuring the agentic system is cost-effective and sustainable. The framework should include metrics such as hardware and software resource usage and cost; large language model (LLM) token usage per task; number of tool calls per task; and scalability (increase in resource usage with data volume growth and/or user load).
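Building on the dimensions above, here is a minimal sketch of how a few of these system-level metrics might be aggregated from per-task telemetry. The TaskResult record and its fields are assumptions about what an agent platform could log, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    # One hypothetical record per task, captured from agent telemetry
    goal_achieved: bool
    needed_human_help: bool
    latency_seconds: float
    llm_tokens: int
    tool_calls: int

def system_level_metrics(results: list[TaskResult]) -> dict:
    """Aggregate per-task telemetry into a few of the dimensions above."""
    n = len(results)
    return {
        "goal_success_rate": sum(r.goal_achieved for r in results) / n,
        "autonomous_execution_rate": sum(not r.needed_human_help for r in results) / n,
        "avg_latency_seconds": sum(r.latency_seconds for r in results) / n,
        "avg_llm_tokens_per_task": sum(r.llm_tokens for r in results) / n,
        "avg_tool_calls_per_task": sum(r.tool_calls for r in results) / n,
    }
```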
Component-level testing, adopting a white box methodology, is key to providing transparency into the inner workings of agentic planning and into how tasks are orchestrated. It also evaluates the reasoning and decision-making process and how the system interacts with its environment, and it ensures that the agent’s learning and evolution are well understood.
This approach should focus on testing individual modules in isolation, including agent profile, memory retrieval, reasoning, planning, tool calling, routing, and agent orchestration. It should draw on detailed logs and records of an agent’s step-by-step actions and decisions. Figure 2 looks more closely at the dimensions and metrics of our component-level white box evaluation framework.
Figure 2. Dimensions and metrics of white box component-level testing
Source: Infosys
Several methodologies and approaches are available to evaluate these metrics and validate the agentic AI system at system and component levels. These include human as a judge, a human-in-the-loop approach where evaluators review responses, give a thumbs-up or thumbs-down, or leave comments for qualitative feedback, and LLM as a judge, which uses an LLM to compare the agent’s output against a ground truth output or desired behavior and provides a score between 0 and 1.
Agent as a judge is also available to evaluate these metrics across various dimensions. This is a framework where one autonomous agent is used to evaluate the performance of another agent, providing a detailed, step-by-step assessment. A judge agent can be provided with context or ground truth datasets for this purpose. Reinforcement learning with human feedback should be implemented to have humans periodically assess the agent’s evaluations and provide comments on its judgment. Over time, this feedback loop helps the agent’s judgment better align with human preferences.
Finally, the automated evaluator, or traditional approach, enables organizations to write their own evaluation logic, such as checking if a response achieves semantic similarity or satisfies a business rule. Custom evaluators let you define tailored criteria, enabling meaningful, situation-specific insights.
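As a simple illustration of such a custom evaluator, the sketch below scores a response against a reference answer and against one made-up business rule (no card-number-like digit strings in the output). The similarity check uses a lexical ratio from the standard library as a stand-in; a real evaluator would more likely compare embeddings for semantic similarity, and the rule shown is purely illustrative.

```python
import difflib
import re

def similarity_evaluator(response: str, reference: str, threshold: float = 0.8) -> dict:
    """Custom evaluator: rough textual similarity against a reference answer.

    difflib gives a simple lexical ratio; production setups would more likely
    compare embedding vectors to capture true semantic similarity.
    """
    score = difflib.SequenceMatcher(None, response.lower(), reference.lower()).ratio()
    return {"score": score, "passed": score >= threshold}

def business_rule_evaluator(response: str) -> dict:
    """Custom evaluator: illustrative business rule banning raw card-like numbers."""
    leaked = bool(re.search(r"\b\d{13,16}\b", response))
    return {"score": 0.0 if leaked else 1.0, "passed": not leaked}
```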
An eight-point plan for implementing responsible agents at scale
Having the right metrics and strategies is important to achieve transparent and explainable agentic AI, but implementing them properly is just as crucial. Here we provide eight imperatives for implementing responsible agents at scale.
- Define success. Be explicit about what constitutes success for your agent. Whether it is achieving a certain accuracy benchmark or meeting a specific response time threshold, clear goals help drive evaluation design.
- Prioritize tracking. Track multiple metrics and balance them, avoiding optimizing for a single metric in isolation. A dashboard that displays all key metrics side by side can help here.
- See where things are headed. Compare current agent performance against a baseline or a previous version. This contextual comparison can highlight improvements or regressions.
- Embed evaluation into workflows. Automate evaluation in the development workflow, ensuring continuous evaluation so you can catch regressions early. For this, integrate evaluation as a regular part of your continuous integration/continuous deployment (CI/CD) or research pipeline (see the sketch after this list).
- Log and version everything. Logging is key in successful agentic testing. When an agent fails or performs suboptimally, detailed logs help pinpoint the issue. Documenting and versioning everything is another rule of thumb. Keep clear records of your evaluation setup, including any changes to test scenarios or success criteria.
- Treat humans as key players. If your agent interacts directly with users, consider mechanisms to gather and log human feedback on the agent’s performance.
- Iterate and refine. The goal is to make agents more reliable and explainable with time. Use evaluation results as guidance for improving the agent, and as new challenges emerge, expand the set of metrics to capture them.
- Drive cultural alignment. Set clear KPIs, such as percentage of autonomous actions reviewed, time to detect drift, and amount of unanticipated behavior. Foster cross-functional collaboration and make assurance a shared responsibility throughout your agentic AI teams.
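The sketch below, referenced in the fourth imperative, shows what a minimal automated gate might look like: it compares the current run’s metrics against a stored baseline file and exits non-zero on regression so that a CI/CD job fails. The file name, tolerance, and metric names are illustrative assumptions rather than part of any specific tooling.

```python
import json
import sys

TOLERANCE = 0.02  # allowed drop in a metric before the pipeline fails

def check_against_baseline(current: dict, baseline_path: str = "baseline_metrics.json") -> int:
    """CI gate: compare current evaluation metrics against the stored baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)

    regressions = {
        name: (baseline[name], value)
        for name, value in current.items()
        if name in baseline and value < baseline[name] - TOLERANCE
    }
    for name, (old, new) in regressions.items():
        print(f"REGRESSION {name}: {old:.3f} -> {new:.3f}")
    return 1 if regressions else 0  # a non-zero exit code fails the CI job

if __name__ == "__main__":
    # Hypothetical current run; in practice these values come from the evaluation suite
    sys.exit(check_against_baseline({"goal_success_rate": 0.93, "consistency": 0.91}))
```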
By doing these eight things, organizations can ensure agentic systems are responsible, safe, and reliable. Infosys’ AI Business Value Radar found that engaged staff deliver the best returns: Organizations with deliberate workforce preparation strategies outshine those that deploy AI without fully supporting their employees. This underscores the importance of the eighth imperative: culture often eats strategy for breakfast, and having the right operating model to fit in with agentic AI assurance is paramount in any successful agentic program.
Building the foundation: A platform-based approach to agentic AI governance
This operating model is product-led and platform-based, as discussed in Infosys’ Responsible enterprise AI in the agentic era.
A platform approach involves creating a safe place for enterprise AI agents to be developed, hosted, and tested. Seamless enterprise scaling therefore requires establishing a centralized assurance platform for scalable oversight.
The agentic AI assurance platform should provide the capabilities discussed in this article, including the ability to define a comprehensive metrics set, support multiple evaluation methodologies, integrate tools that gather low-level traces and logs, simulate real-world scenarios, and perform drift analysis.
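As one illustration of what drift analysis on such a platform could involve, the sketch below compares the distribution of recent evaluation scores against a baseline window and flags a statistically significant shift. The function name, thresholds, and choice of test are assumptions; a simple moving-average comparison would work just as well for a first pass.

```python
from scipy.stats import ks_2samp

def detect_metric_drift(baseline_scores: list[float],
                        recent_scores: list[float],
                        p_threshold: float = 0.05) -> dict:
    """Flag drift when recent evaluation scores no longer look like the baseline."""
    # Two-sample Kolmogorov-Smirnov test on the two score distributions
    statistic, p_value = ks_2samp(baseline_scores, recent_scores)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drift_detected": p_value < p_threshold,
    }
```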
Assurance as a continuous, collaborative strategy
Assurance of agentic AI systems is not a one-time exercise. It is an ongoing commitment that spans design, deployment, and post-deployment governance, and it must be a concerted, collaborative effort.
The assurance and evaluation must continue in production as well as before deployment, and they must be part of a robust governance process.
The process must also encourage multistakeholder collaboration involving ethicists, policymakers, engineers, and end users.
Assuring agentic AI systems is also a business strategy: 48% of companies that implement responsible AI and communicate their efforts experience enhanced brand differentiation. By embedding this level of interpretability, continuous feedback, and safety-first design, organizations can build trustable, auditable agentic systems. This ensures user confidence and readiness for real-world deployment, and paves the way for responsible and widespread adoption of agentic AI systems.
Frequently asked questions
1. What is the main difference between agentic AI and traditional AI?
Traditional AI systems are excellent at pattern recognition, prediction, and automation, but they can’t make independent decisions beyond their programming. In contrast, agentic AI is an autonomous system designed to pursue high-level goals with minimal supervision, independently interpreting those goals, planning, and adapting to new data and unexpected events.
2. What is the single biggest security risk of agentic AI?
Agentic systems going rogue is the biggest threat. Therefore, it is extremely important to have continuous observability and evaluation in production. The evaluation must focus on system-level diagnostics as well as revealing the inner workings of the system through a robust white box approach, as detailed in this article.
3. Can agentic AI operate safely without any human oversight?
Human oversight is not a limitation on agentic innovation but a critical component for ensuring safety and responsible AI behavior. As agentic AI output becomes more predictable and unbiased, taking humans out of the loop becomes a possibility, as we see in the development of autonomous vehicles.
4. How does white box testing for agentic AI really work?
White box testing focuses on auditing and evaluating the inner workings of every agentic AI system component and its system interactions through a detailed execution log. The approach makes testing proactive and helps speed up the discovery of anomalies and performance drift.
5. How can we prevent shadow AI in our organization?
To prevent shadow AI — or the use of AI tools or systems within an organization without oversight, approval, or governance — ensure the agentic AI development lifecycle has built-in controls for continuous evaluation of agentic AI performance; reliability; drift; security; compliance with regulatory and responsible AI requirements; and cost. Define an objective criterion model for onboarding and offboarding of AI agents.
6. What is agentic AI drift and why is it dangerous?
Drift can occur in the intended agent goal and context, as well as through changes in the reasoning model, its environment, and its tools. Drift can significantly degrade agent performance and reliability, impact security posture and compliance, and increase resource usage and costs.
7. What's the first step my company should take to manage agentic AI risks?
Establish a robust agentic AI life cycle management process and invest in tooling for security, compliance, governance control, continuous observability, and evaluation of agentic AI. Select vendors and tools that have a comprehensive, forward-looking roadmap.