What It Takes to Make Agentic AI Work in Retail
Insights
- Poor requirements, not code, are the root cause of most production defects, making AI-driven requirement validation a higher-leverage investment than test automation alone.
- Agentic AI delivers real value when tightly governed, grounded in enterprise data, and constrained by explicit guardrails, not when deployed as open-ended automation.
- Scaled AI adoption depends as much on training, change management, and feedback loops as on model performance or tooling.
In this episode of the Infosys Knowledge Institute Podcast, Dylan Cosper speaks with Prasad Banala, Director of Software Engineering at a large US-based retail organization, about operationalizing agentic AI across the software development lifecycle. Prasad explains how his team applies AI to validate requirements, generate and analyze test cases, and accelerate issue resolution, while maintaining strict governance, human-in-the-loop review, and measurable quality outcomes.
Prasad Banala:
That was the biggest challenge in any organization. If requirements are not right, then whatever manual test cases, automation, or application code we write will not be correct.
So we end up seeing a lot of defects in production. That's where we felt it wasn't enough to use AI to generate the test cases and automation; making sure that the requirements we receive from the product team are right is very, very critical.
Dylan Cosper:
Welcome to the Infosys Knowledge Institute podcast, where business leaders share what they've learned on their technology journey. I'm Dylan Cosper, Infosys Knowledge Institute research program manager. Today, I'm speaking with Prasad Banala, Director of Software Engineering at a large US-based retail organization. Prasad is an accomplished technology leader who's been leading hands-on work with AI agents and quality engineering. Welcome, Prasad.
Prasad Banala:
Hi Dylan, nice to meet you.
Dylan Cosper:
So Prasad, today's discussion centers on how organizations are moving beyond experimentation to responsibly deploying AI at scale. Now, before we get into the specifics, can you share a bit about the work you're leading and what prompted your organization's journey?
Prasad Banala:
I work for a large retailer in North America as a Director of Software Engineering, taking care of quality, performance, and site reliability engineering. My responsibility is to make sure that all of our applications, more than 600 of them, are reliable, scalable, and available all the time. We are continuously exploring what kind of technology we need to implement to increase productivity and to make sure these applications are serving our customers. Our company's primary motive is to serve others, and we want these applications to be available all the time to do that. So my responsibility is to make sure every application is reliable from a quality standpoint, scalable from a performance standpoint, and available from an SRE standpoint.
Dylan Cosper:
What kinds of agent-based projects are you and your team exploring? And why are these valuable for your organization?
Prasad Banala:
At the overall organization level, we are exploring many agentic AI solutions: agentic commerce at the enterprise level, and also building a smart SDLC, from requirements to deployment. How do we automate? How do we apply AI to gain productivity and improve the scalability of features? We have implemented QA pipeline automation using agentic AI. On the use case side, QA really starts with writing manual test cases, and before that, making sure a story is accurate is critical for any development work, manual test cases, or automation.
The agents control all of this: how the story is written, how the story is scored, and how the LLM fits in. We control all of it through agents, and the same goes for manual test case generation and automation generation, which is helping us tremendously. After execution, we used to spend a lot of time analyzing results and producing reports. Now we can produce those reports within hours and share the feedback with our development teams. So we are continuously expanding our use of agents.
This morning, before I joined this meeting, there was a call from my CIO saying, "Prasad, we are seeing a certain issue in production. Can you provide me some information?" So I did the analysis, we found open issues related to that one, and we gave that context to the LLM. It was able to pinpoint the issue, and we could analyze it and give feedback to our CIO very quickly. We use these kinds of use cases heavily, and we are building agents to do this job. Coupons are another example: we are a discount retailer, so coupon testing is very, very critical for us and takes a lot of time. So based on the day-to-day challenges we see in our space, we started implementing AI, and in fact agentic AI, to solve these problems.
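The triage flow Prasad describes maps roughly to the sketch below: pull the related open issues from Jira, then hand only that verified context to an LLM for a likely-cause summary. The Jira instance URL, JQL filter, model name, and helper names are assumptions for illustration, not the team's actual implementation.

```python
# Illustrative sketch of an LLM-assisted incident triage step.
import os
import requests
from requests.auth import HTTPBasicAuth
from openai import OpenAI

JIRA_BASE = "https://example.atlassian.net"   # placeholder Jira instance

def related_open_issues(keyword: str) -> list[dict]:
    """Search Jira for open issues mentioning the production symptom."""
    auth = HTTPBasicAuth(os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])
    jql = f'status != Done AND text ~ "{keyword}" ORDER BY updated DESC'
    resp = requests.get(f"{JIRA_BASE}/rest/api/2/search",
                        params={"jql": jql, "maxResults": 20}, auth=auth, timeout=30)
    resp.raise_for_status()
    return [{"key": i["key"], "summary": i["fields"]["summary"]}
            for i in resp.json()["issues"]]

def triage_summary(symptom: str) -> str:
    """Give the LLM only the verified issue context and ask for a likely cause."""
    issues = related_open_issues(symptom)
    context = "\n".join(f"{i['key']}: {i['summary']}" for i in issues)
    client = OpenAI()                          # assumes OPENAI_API_KEY is set
    return client.chat.completions.create(
        model="gpt-4.1",                       # placeholder model name
        temperature=0.1,                       # low temperature for stable output
        messages=[{"role": "user", "content":
                   f"Production symptom: {symptom}\n\nOpen issues:\n{context}\n\n"
                   "Using ONLY these issues, summarize the most likely cause."}],
    ).choices[0].message.content
```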
Dylan Cosper:
We have to talk about responsibility, right? So for you and your team, what sort of guardrails or controls have you put in place to reduce hallucinations and keep agents safe and reliable?
Prasad Banala:
We have organizational standards and policies on what to develop and what not to develop. The whole process starts with data governance approvals. Once we decide what solution we want to build, we submit it to the data governance and AI governance teams, and they review and approve it. At that point, everything is analyzed from an architecture standpoint: what data we are consuming and what data we are exposing to the LLM. All of that is clearly defined in the architecture diagrams, and the AI governance team reviews it and provides its report. The process starts there. Then, when we start developing, we have to make sure our agents and the LLM are not hallucinating, so we applied a number of anti-hallucination patterns during development.
We also provide grounded context from real, verified data sources like Jira, Confluence, Figma, and documents and attachments we have already verified, making sure the model works from that material rather than pulling in external information and producing hallucinated output.
Likewise, we have implemented many other things for automation, such as setting the LLM temperature low so it gives consistent, correct output for automation. We have also implemented authentication: to access any Jira-related or organization-related documents, users need to provide credentials. We have implemented many guardrails like these.
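The guardrails Prasad lists, grounded context from approved sources, low temperature, and required Jira credentials, can be expressed as simple checks around any LLM call. The allow-list, settings, and function names below are assumptions sketched for illustration, not the team's actual policy code.

```python
# Illustrative guardrail checks: grounded sources only, low temperature,
# and credentials required before touching organization documents.
import os

ALLOWED_SOURCES = {"jira", "confluence", "figma", "verified_docs"}  # grounded context only
LLM_SETTINGS = {"temperature": 0.1}  # low temperature for repeatable automation output

def build_grounded_context(chunks: list[dict]) -> str:
    """Keep only context retrieved from approved, verified systems."""
    grounded = [c for c in chunks if c.get("source") in ALLOWED_SOURCES]
    if not grounded:
        raise ValueError("No verified context available; refusing to call the LLM.")
    return "\n\n".join(c["text"] for c in grounded)

def require_jira_credentials() -> None:
    """Block any Jira-backed request unless credentials are configured."""
    if not (os.environ.get("JIRA_USER") and os.environ.get("JIRA_TOKEN")):
        raise PermissionError("Jira credentials required to access organization documents.")
```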
Dylan Cosper:
I imagine that this has evolved over a decent period of time. And there's probably plans to continue to evolve it as newer agentic capabilities become available. You kind of have to evolve along with the technology itself to remain responsible. In a previous discussion we had, Prasad, we got into the idea of how AI is more than tech.
It's about the people too. So how are you building AI skills across teams, especially for tool adoption in their roles?
Prasad Banala:
Building a tool is not the difficult part. We can put a small team on it with the right technology and the right technical people. The difficult part is making sure teams adopt that particular tool and actually use it.
Once generative AI tools like the early versions of Copilot and GPT-4.1 came onto the market, we started training our team. I have a nearly 200-member team in my portfolio, and we have a very big team across the organization. To get them to adopt these tools, we need to train them.
In my world, we do a lot of automation using Python code, and that automation already reduced some of the manual effort. Now, with AI tools, we are reducing additional manual effort in writing the automation code itself. But at the same time, whatever code the AI generates, my team cannot take it and integrate it directly into our master codebase. There has to be a human in the loop. After the code and the manual test cases are generated, people need to know how to review them, how to evaluate that output, and how to integrate that code with our existing code. We started all these practices early, and we worked with vendors and gave clear instructions to provide training to everyone working in our organization.
People have started small certifications, for example on prompt engineering.
We trained our people, developed small tools step by step, and then got feedback from the team.
Likewise, as we build the latest and greatest tools, we enable our teams to learn and adopt them. That's how we have been able to implement them successfully.
Dylan Cosper:
So talking about test cases, can you walk us through a requirements and testing example? I'd really love to hear how you're using AI to evaluate story quality and determine whether requirements are ready for testing.
Prasad Banala:
Yeah, that was the biggest challenge in any organization. If requirements are not right, then whatever manual test cases, automation, or application code we write will not be correct. So we end up seeing a lot of defects in production. That's where we felt it wasn't enough to use AI to generate the test cases and automation; making sure that the requirements we receive from the product team are right is very, very critical.
That's where we have developed a tool called StorySense.
For example, when we receive a story, there are required components within it: story points, a description, acceptance criteria. Based on those, we give three points for the required components. From a user-centricity standpoint, we give another three points: is the story written around the what, the why, and the who? That clearly defines the story, its background, and its future state, who is going to use it and what it is. The third category is the investment criteria: is the story small enough to develop, small enough to test, estimable, testable, and so on?
We give four points for the investment criteria. We then created the agents in such a way that we control the LLM, the prompts, and the way it generates output: it has to produce the output and the ratings within the boundaries we have defined. That is another guardrail we implemented, to make sure it does not go beyond the limits and the scoring normalization we are asking for.
Based on that, we take the story and give it as input to the agentic tool we built, and it gives us a score: seven out of ten, eight out of ten, nine out of ten.
It also tells us where the story is lacking detail, and it creates a revised version of the story in the format it should follow. It recommends to the product owner: write it in this format, these components are missing, and this is how the story should look. We make sure that anything scoring less than seven is not accepted for testing. That's how we have been able to get accurate requirements and give feedback to product owners: here is what is missing. It is helping us a lot.
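A rough sketch of how a StorySense-style rubric could be enforced in code, using the 3/3/4 point split and the seven-out-of-ten acceptance threshold Prasad mentions. The prompt text, JSON contract, and model name are assumptions; clamping the scores to the rubric maximums mirrors the scoring-normalization guardrail he describes.

```python
# Illustrative story-scoring sketch, not the actual StorySense tool.
import json
from openai import OpenAI

RUBRIC = {
    "required_components": 3,   # story points, description, acceptance criteria
    "user_centricity": 3,       # the what, why, and who of the story
    "investment_criteria": 4,   # small enough to develop/test, estimable, testable
}
ACCEPTANCE_THRESHOLD = 7        # stories scoring below 7/10 are sent back

def score_story(story_text: str) -> dict:
    client = OpenAI()           # assumes OPENAI_API_KEY is set
    prompt = (
        "Score this user story against the rubric and return JSON with keys "
        f"{list(RUBRIC)} plus 'missing_components' (a list of strings). "
        f"Maximum points per key: {RUBRIC}. Score ONLY from the story text.\n\n"
        + story_text
    )
    raw = client.chat.completions.create(
        model="gpt-4.1",                           # placeholder model name
        temperature=0.1,
        response_format={"type": "json_object"},   # force structured output
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    scores = json.loads(raw)
    # Guardrail: clamp each score to its rubric maximum so the LLM cannot
    # exceed the boundaries, then total on a 10-point scale.
    clamped = {k: min(max(int(scores.get(k, 0)), 0), cap) for k, cap in RUBRIC.items()}
    total = sum(clamped.values())
    return {
        "scores": clamped,
        "total": total,
        "ready_for_testing": total >= ACCEPTANCE_THRESHOLD,
        "missing_components": scores.get("missing_components", []),
    }
```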
Dylan Cosper:
I mean, it sounds incredibly beneficial. So when you're deciding where to apply AI, because we've heard stories of people trying to use AI for anything and everything, what's your process for vetting use cases so you're not just forcing AI into problems it shouldn't solve?
Prasad Banala:
It really makes sense, right? We should not spend our effort where it is not required. For example, for simple rule-based logic, say calculating a discount like buy one, get one free, you don't need AI; a simple program is enough. Likewise, if the data quality is not there, it is challenging to implement AI and you won't get the right outputs either. The same goes wherever high risk is involved and you need to make a decision as quickly as possible, or wherever you would be investing more in something that can be achieved at low cost. Where those lines fall differs from organization to organization. But in my mind, when we implement AI, it is there to reduce manual effort and increase productivity.
At the same time, it cannot bypass the human in the loop. And for the sake of implementing AI, we cannot go ahead and apply it to use cases where it is not required.
Dylan Cosper:
What are the metrics or signals that tell you it's working?
Prasad Banala:
It is not just about implementing AI. You need to get feedback. You need to see where the AI is going wrong and whether what we are getting from it is useful to our people.
We have more than 200 people on our team, and we need to know who is using these tools and how they are using them. For example, we started with 50 people, and those 50 people began generating manual test cases with the tool. When people start using it, you need that explainability; that is good practice for any AI development, especially agentic AI development.
Logging is very important. From a guardrail standpoint, we have implemented logging.
Every action you perform, every test case you generate, every call you make to the LLM, whatever the LLM returns, and how you use that data, everything is logged into a database, and on top of it we generate metrics.
We track unique users, how many stories they used to generate test cases, how many test cases were generated, and how many of those test cases they are really using. Say I generate test cases for a user story and the AI gives me 20. The human in the loop decides that 10 are useful and the remaining 10 are not. That means the LLM is giving us unnecessary test cases, and we have to figure out how to fine-tune that problem. Not only that, we also have to evaluate the skill set of the person validating the LLM output: of the 10 test cases they rejected, do they really understand what the LLM is giving? Are they rejecting test cases that are actually correct because they don't have enough knowledge of them? Everything is logged, and we have metrics built on top of it. That way we know how many stories were used, how many manual test cases were generated, and how many of them users are actually able to use. Currently we are able to use about 60% of the test cases the LLM generates.
For the remaining 40%, the metrics give us an opportunity to talk to the person who generated them and to go back and see what the LLM produced. That 40% is where we continuously need to evolve; we want to reach a state where 80 to 90% of whatever the AI generates can be used. So logging is very, very critical to produce all these metrics, and it is one of the guardrails we have.
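A minimal sketch of the logging-and-metrics guardrail described here: record every LLM call along with how many of its test cases the reviewer kept, then roll that up into an acceptance rate to compare against the 80 to 90% target. The table schema and field names are assumptions, not the team's actual database.

```python
# Illustrative logging/metrics guardrail using a local SQLite store.
import sqlite3

conn = sqlite3.connect("ai_usage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS llm_calls (
        user_id TEXT, story_key TEXT, prompt TEXT, response TEXT,
        generated_count INTEGER, accepted_count INTEGER,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def log_generation(user_id, story_key, prompt, response, generated, accepted):
    """Record every LLM call plus how many of its test cases the reviewer kept."""
    conn.execute(
        "INSERT INTO llm_calls (user_id, story_key, prompt, response, "
        "generated_count, accepted_count) VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, story_key, prompt, response, generated, accepted),
    )
    conn.commit()

def acceptance_metrics() -> dict:
    """Roll up usage into an acceptance rate (the 60% figure, tracked toward 80-90%)."""
    users, stories, generated, accepted = conn.execute(
        "SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT story_key), "
        "SUM(generated_count), SUM(accepted_count) FROM llm_calls"
    ).fetchone()
    rate = (accepted or 0) / generated * 100 if generated else 0.0
    return {"unique_users": users, "stories": stories,
            "generated": generated, "accepted": accepted,
            "acceptance_rate_pct": round(rate, 1)}
```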
Dylan Cosper:
Excellent. The metrics feed off the guardrails themselves, the things that are keeping you safe, secure, and reliable. Prasad, thank you so much for joining us today. I've really appreciated the discussion, and thank you for sharing your insights with me and our audience.
Prasad Banala:
I'm very excited, and thank you for inviting me to this podcast and for taking the time.
Dylan Cosper:
You can find more details, show notes, and transcripts in the podcast section at infosys.com/IKI. Thanks to our producers, Christine Calhoun and Yulia De Bari, and our recording engineer, Dode Bigley. I'm Dylan Cosper with the Infosys Knowledge Institute. As always, keep learning and keep sharing.