26 Oct 2021
Over the last few years, organizations have been busy experimenting with ML models for specific use cases and working with data scientists to optimize model accuracy and performance. Now they want to deploy these models at scale, but a whole set of new challenges lies ahead of them!
In this episode, Infosys AI experts Amit Gaonkar and Kaushal Desai tell us why organizations need to think of MLOps in a strategic way, not just as a toolkit to automate the deployment of machine learning models. Listen to learn how a good enterprise MLOps layer, built on sound architecture principles, enables organizations to build future-proof, scalable and responsible Enterprise AI with adaptable governance mechanisms.
Hosted by Abhiram Mahajani, Sales Director, AI and Automation Services, UK and Europe, Infosys
“It (the machine learning model) has to be deployed in a manner that is completely scalable.”
- Amit Gaonkar
“The choices are many and they are evolving quite quickly. That's why we always recommend not sticking with a single partner or vendor, but adopting an enterprise architecture which is flexible, so you can mix and match a lot of these things.”
- Kaushal Desai
“We have actually two things that we want to do within Infosys. To take these offerings to customers, our workforce needs to be enabled on AI, and to that extent, we want to democratize AI as much as possible within Infosys. The second challenge is that we want to harness these machine learning models, which are specific to a particular problem statement.”
- Amit Gaonkar
Show Notes
00:01
What is The Applied AI Podcast?
00:52
Abhiram introduces Amit and Kaushal
01:48
Let's start with just the pure definition of it. So, what does MLOps mean in layman's terms, Amit, if I may ask you?
04:07
Why is this topic so relevant right now? What has changed over the past few years for MLOps to suddenly get into the limelight?
07:15
What are the different stages of this lifecycle? And how exactly is an organization going through this lifecycle today? What sort of tools and skills are being put at play here?
12:00
For example, in layman's terms, I see data science as a skill set, but the skills that are required to do this span much beyond just data science, is what I could understand. So, could you talk a little about that? Or am I wrong? Is it the same skill being extrapolated?
14:21
I think you did mention a telco customer. So can you give a specific example?
17:58
How exactly is Infosys addressing this space? Could you give a quick summary of that?
21:14
There's also this emerging space of ethical AI or responsible AI. Does that have any interlinkage with this topic? Or are these two separate conversations in themselves?
23:07
Abhiram shares how to connect with Infosys Applied AI
Abhiram Mahajani: Hello and welcome to the Infosys Applied AI Podcast. In this show, we host our clients, partners and Infosys Applied AI professionals, who are doing some remarkable work in this exciting space of AI and cognitive automation. We explore what it takes to build successful, scaled AI journeys, and how the industry is evolving to make this a reality. Welcome onboard!
Abhiram Mahajani: Today I'm here to talk about another interesting and rather evolving topic in the space of AI. It is called MLOps, and we're here to demystify this concept. I have with me today two of my colleagues from Infosys, Amit and Kaushal, who are part of our AI Practice Solution Architecture team. Without much ado, I'll let my guests introduce themselves. Amit, over to you.
Amit Gaonkar: Hi Abhiram, nice to be on this session with you. I'm Amit Gaonkar, AVP and Principal Architect at Infosys, responsible for technology solutions for the AI and Automation practice. Over to you, Kaushal.
Kaushal Desai: Hi Abhiram. Hi Amit. I am Kaushal and I'm a Principal Architect working with Amit in the Infosys AI Center of Excellence, where my areas of expertise happen to be MLOps, text analytics, etc.
Abhiram Mahajani: Great! Thank you both for taking time out for this conversation. Let's get started. I've been hearing this term MLOps a lot lately, especially on forums like LinkedIn, where I've seen multiple people talk about it. Let's start with just the pure definition of it. So, what does MLOps mean in layman's terms, Amit, if I may ask you?
Amit Gaonkar: I think it is easier to explain it through an example. One of our customers recently came and said that they had created a machine learning model which can detect fire and smoke in their factories. Typically, when people create such a machine learning model, they take a data set, maybe 20,000 or 30,000 images, and create an object detection or some kind of computer vision model which works pretty well on that data set, and they say they are ready to deploy it in production. Now, this is only the development part; people are mostly worried about the accuracy, or how good the machine learning model is. But when you want to deploy such models in production, there is a whole lot of new challenges and new skills required. For example, suppose you have a model which can detect smoke in a factory, and you want to put it into production. The first thing is that you have to expose it in the form of a service. When a video stream is coming in, you have to look at every frame and figure out whether the model detects a fire or not. So you need to convert your machine learning model into a service. Then there can be a large number of cameras, so it cannot be deployed as Python code on somebody's desktop; it has to be deployed in a manner that is completely scalable. The third thing is that over a period of time you will come across new types of data. You are not talking about a static data set of maybe 20,000 or 30,000 images; you should be able to keep changing the model, so the data has to be dynamically curated and fed into the model. This whole thing, covering what happens before the model is created and after the model is deployed, starting from data, to training the model, to deploying the model, and finally monitoring the model, is covered by the process called MLOps. And to a large extent it needs to be automated.
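To make the "model as a service" step concrete, here is a minimal sketch of one way it could look, assuming a hypothetical single-output PyTorch smoke-detection model saved as smoke_detector.pt and served with FastAPI; the file name, input size, and 0.5 threshold are illustrative assumptions, not the customer's actual setup.

```python
# A minimal, illustrative sketch of exposing a vision model as an HTTP
# service. "smoke_detector.pt", the 224x224 input size, and the 0.5
# threshold are hypothetical placeholders, not the customer's setup.
import io

import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import transforms

app = FastAPI()

# Load the trained model once at startup, not once per request.
model = torch.jit.load("smoke_detector.pt")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@app.post("/detect")
async def detect(frame: UploadFile = File(...)):
    # Each incoming camera frame arrives as an uploaded image.
    image = Image.open(io.BytesIO(await frame.read())).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        score = torch.sigmoid(model(batch)).item()  # assumes a single-logit output
    return {"smoke_detected": score > 0.5, "score": score}
```

A service like this can then be containerized and replicated behind a load balancer so that many camera streams are scored in parallel, which is the scalability concern Amit raises.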
Abhiram Mahajani: Great, great. Thanks, that was very insightful in terms of how you explained the overall life cycle. But to me, it sounds as though this should have always been the focus, right? So why is this topic so relevant right now? What has changed over the past few years for this to suddenly get into the limelight?
Amit Gaonkar: Yeah, that's a very interesting question, and I think we are uncovering that. So far, a lot of people were focused on creating the models and asking whether AI can play a role in their use case, so it was mostly experimental. Now we are seeing that enterprises are getting to a point where they want to actually deploy these models in production. Once you want to deploy a model into production, you end up with a lot of complexities which have to be handled through MLOps. I will just list a few of them. The first thing, as we said, is that taking a model and deploying it into production is quite complex; it requires a different type of skill. There is no way you can create a model every time and then manually go through this entire cycle of creating a container out of it, creating a service out of it, and making it scalable. If this whole thing is done manually, it is very time consuming and very expensive. So complexity is one factor why you need to have MLOps infrastructure. The second part is overall governance. Imagine that you have 10 or 20 models like this that you want to deploy. You need to make sure that all of these models go through a standard set of processes, that there is independent QA done on top of them, and a lot of organizations also have data-related policies, so things like data security and traceability of the model have to be handled. You should not end up in a situation where the model says something on Monday and something else on Wednesday; those kinds of trust-related issues have to be handled. The third thing is that if a model is deployed in production, over a period of time its accuracy may go down, so you need to check whether the model is drifting, particularly when it is exposed to a newer set of data which it has not seen before. So a lot of issues around governance need to be handled, and again, all of these aspects need to be automated. If you combine the complexity and governance, and then think that you might have 1,000 models in future, it is a completely unmanageable problem if this whole MLOps process is not automated. So when you think of this problem not at a use-case level, but as an enterprise-scale problem, that in 18 or 24 months a lot of our customers are going to have 1,000 models in production, then this is the right time to start putting up MLOps infrastructure.
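As a small illustration of the drift check Amit mentions, here is a minimal sketch assuming you have logged the model's prediction scores (or a key input feature) at training time and in production; the two-sample Kolmogorov-Smirnov test and the 0.05 threshold are just one simple, common choice, not a prescribed method.

```python
# Minimal drift-check sketch: compare the score distribution the model
# produced on training data against what it produces on live traffic.
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> bool:
    """True if the live distribution differs significantly from the
    reference (training-time) distribution per a two-sample KS test."""
    _statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Illustrative stand-ins for logged scores; a real system would read these
# from a prediction log or feature store.
train_scores = np.random.beta(2, 5, size=10_000)
prod_scores = np.random.beta(2, 2, size=10_000)  # deliberately shifted

if has_drifted(train_scores, prod_scores):
    print("Possible drift: flag the model for review or retraining.")
```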
Abhiram Mahajani: Got it. So the gist that I take away from your answer is that because we are now entering the space where AI is truly scaling, to achieve and address that scale is where something like this becomes relevant, the same way DevOps became relevant when application deployment started to happen at scale. I hope that's the gist.
Amit Gaonkar: Yes, absolutely.
Abhiram Mahajani: Okay. So Amit, you mentioned the overall life cycle, right? Let's go a little deeper into what this lifecycle, or this methodology, is, and maybe Kaushal, I'll redirect this to you. What are the different stages of this lifecycle? And how exactly is an organization going through this lifecycle today? What sort of tools and skills are being put at play here?
Kaushal Desai: Sure. I think it's a great question, and a great introduction by Amit, and you drew a parallel with DevOps. At the start, I just want to make sure that yes, it's a good analogy, but it's not a perfect analogy, because fundamentally, machine learning development, or any AI algorithm development, has a state, and that state is the data. Every development starts with acquiring certain data, defining a problem statement, using the data to train the model, and monitoring the model once you put it in production. So data is involved right from the word go, and that shapes the whole life cycle. First, acquiring the data: is your data trustworthy? Is your data representative of all your problems, or all your customers? Let's say you are creating a customer survey; is it representative of all kinds of customers? Do you have enough data available? How are you acquiring the data? The second step is developing your own algorithm in a centralized environment, because a lot of machine learning, and especially deep learning, requires expensive resources like GPUs. So how do you, as an organization, optimize this resource utilization in a centralized manner? That is about providing a centralized development environment where you can spin up your own environment, and typically a lot of data science teams like to experiment quite often, so how do you cater to some of those needs? Then, once you deploy a certain methodology and test your models, how do you make sure that your model is repeatable? That means the whole process, right from acquiring the data to producing the same result with similar data on similar hardware: how do you make it repeatable so that you are able to reproduce your results over a period of time? And finally, with machine learning, A/B deployment, blue-green deployment, deploying multiple models for the same problem statement and testing them all together in production becomes absolutely necessary, because you may or may not rely on one model as you go along. So it brings a certain set of challenges all together. Fortunately, or unfortunately, there are a lot of stacks and technologies available, and we work with all kinds of customers, like banks who would like to build a specific platform. For example, one of the customers we are working with wants to design everything on their on-premise OpenShift environment, where they are looking to develop a specific set of services and models that extract data from their documents. So that's a very capability-specific platform which is completely on-premise, and that includes everything from OpenShift, to Seldon, to MinIO, those sorts of technologies which are completely on-premise. Then there are customers who have chosen a path where their data strategy and security strategy are in place with respect to hyperscalers. So, one of the US telco partners we are working with came to us with exactly what Amit described: we have a model, now what? We are working with them to get the data, connect to their enterprise ecosystem, and figure out how this whole deployment works, the automation of deployments, the monitoring of models, the whole gamut of MLOps, and we are trying to implement that on Azure-based cloud services.
So I guess there are different use cases, different organizations, lots of technology; we have also seen customers going for specific vendors, like DataRobot, Iguazio etc. for their own implementation. So yeah, the choices are many and they are evolving quite quickly. That's why we always recommend not sticking with a single partner or vendor, but adopting an enterprise architecture which is flexible, so you can mix and match a lot of these things.
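On the repeatability point Kaushal raises, a common pattern is to pin random seeds and record every run's parameters, data version, and metrics in an experiment tracker. Below is a minimal sketch using MLflow as a stand-in tracker; MLflow, the run name, and the data-version tag are my illustrative assumptions, not the stack named in the episode.

```python
# Sketch of making a training run repeatable: pin the random seeds and
# record the run's parameters, data version, and metrics in a tracker.
import random

import mlflow
import numpy as np

def set_seeds(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    # If a DL framework is used, seed it too, e.g. torch.manual_seed(seed).

set_seeds(42)

with mlflow.start_run(run_name="task-priority-model"):  # hypothetical name
    mlflow.log_param("seed", 42)
    mlflow.log_param("train_data_version", "2021-10-01")  # placeholder tag
    # ... train and evaluate the model here ...
    mlflow.log_metric("val_accuracy", 0.91)  # illustrative value
```

With seeds pinned and every run logged, "produce the same result with similar data on similar hardware" becomes something you can actually verify rather than hope for.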
Abhiram Mahajani: Great, great. Interesting that you mentioned all the different aspects, right from data preparation all the way to deployment and future monitoring. So is it fair to say that, in layman's terms, data science in itself is a skill set, but the skills that are required to do this span much beyond just data science, is what I could understand. Could you talk a little about that? Or am I wrong? Is it the same skill being extrapolated?
Kaushal Desai: No, I think it's a marriage of lots of skills together, if you really ask me, and that's what makes this space more challenging and interesting; its width and depth are mind-boggling. Being a team that works with customers on AI and other technologies, and on implementing these kinds of platforms, we work with anything and everything, from Apache Spark, to Airflow, to Microsoft Azure, to AWS cloud, because the technology options are so varied. Every customer thinks of their own solution differently, depending on where they are in their journey. A lot of customers initially like to adopt hyperscalers, and then sooner or later realize that there is cost involved, and certain constraints in terms of data security and things like that. Now, there are models that you can train on these hyperscalers using their technology, but in such cases you don't own the knowledge of training the model. So a lot of people come back saying, okay, we have this service, it's supposed to do great stuff, and it does in some cases, but now what do we do, because it's only giving 40% coverage. That's where the choice of technology right from the start becomes really important: whether you want to go completely custom and write your own scripts, or use hyperscaler services, and whether you want to use the hyperscaler as a service or as a software platform. So I guess there are lots of choices, and those confusions are often visible. But I'm glad that we are able to help customers on those aspects.
Abhiram Mahajani: Amazing, amazing! And I think you did mention a telco customer that you referred to. Amit started with one example; maybe I'll ask you, Kaushal, to give a specific example of where this is being put into practice.
Kaushal Desai: Sure. I think a lot of the use cases we started with have to do with field automation. A telco has a huge workforce, whether contractors, vendors or their own employees, doing everything from digging a hole, to making a jumper connection to somebody's house, to providing a fiber optic cable. So lots and lots of tasks and lots of workforce are involved. A lot of the use cases we are working on are mainly around workforce optimization. Let's say there is a set of tasks, probably numbering in the millions at any given point of time for a telco. How do you deprioritize some of those tasks so that you are incurring minimum penalties, adhering to the proper SLAs, and meeting customer demands? That requires a completely offline kind of model, where we pull the millions and millions of tasks that get created on a daily basis, look at certain data, and build our model to change the priority of a task. The most common example is that a given task doesn't carry a penalty, because we know what kinds of tasks are like that, so can we deprioritize it or reassign it to somebody else? How do you ensure that one person is not getting too many tasks, or one particular vendor is not getting too many tasks? And how do you monitor whatever SLA measures you track for these vendors? So there was this enterprise data that we needed to collect first. The first question was: we have created a model with all this data from five different databases, now how is it going to work in production? On top of that, the customer was not very clear whether they wanted to do it on-premise or on cloud, and the business was shouting from the rooftops saying, we have a model, why can't it be deployed in production? That's where we came in. We had a few workshops with the customer and asked them what their choices were, what their data security looks like, what their cloud strategy looks like. We started with an on-premise model and did a lot of things on-premise, then realized there were some capacity issues in production as we went along with the testing, so we quickly pivoted to deploy a lot of these things off-premise. So how did we pivot quickly from an on-premise model to an off-premise model, and how did we get the data there? How do you ensure that you are utilizing the rest of the Azure resources and not only things like GPUs? How do you use things like spot instances? All of this while keeping in mind that every time you develop a model, it is managed centrally and gets deployed without much engineering intervention. Obviously that caters to a lot of enterprise processes, but it still preserves the culture of creating a model, managing that model, deploying that model and monitoring the results of that model. I guess that's one of the use cases around that particular area that we have with this particular customer right now.
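To give a flavor of the kind of offline re-prioritization job Kaushal describes, here is a toy sketch; the column names, scoring rule, and sample rows are entirely hypothetical, since the customer's actual model and data sources are not public.

```python
# Toy sketch of an offline re-prioritization job in the spirit of the
# telco example. Column names, the scoring rule, and the sample rows are
# all hypothetical; the real model and data sources are not public.
import pandas as pd

def reprioritize(tasks: pd.DataFrame) -> pd.DataFrame:
    """Score each open task; penalty-free tasks naturally sink down."""
    scored = tasks.copy()
    # Penalty-bearing tasks that are close to an SLA breach float to the top.
    scored["priority_score"] = (
        scored["penalty_per_day"] * 10.0
        + 1.0 / scored["days_to_sla_breach"].clip(lower=1)
    )
    return scored.sort_values("priority_score", ascending=False)

# A nightly job might pull millions of rows from the enterprise databases,
# score them, and write the new priorities back.
tasks = pd.DataFrame({
    "task_id": [101, 102, 103],
    "penalty_per_day": [0.0, 50.0, 20.0],
    "days_to_sla_breach": [10, 2, 5],
})
print(reprioritize(tasks))
```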
Abhiram Mahajani: Got it, thanks for that. Maybe I'll come to you, Amit, to go a little deeper into how exactly Infosys is addressing this space. Could you give a quick summary of that?
Amit Gaonkar: Yeah. While we are a consulting organization in this AI space, and AI is certainly very important from our business perspective, we are also a big consumer of AI. If you look at Infosys, almost every service line now has an AI-based offering embedded into it, whether it is a vertical service line like financial services or manufacturing, or a horizontal service line like validation or cloud. So we have actually two things that we want to do within Infosys. One is to take these offerings to customers; for that, our workforce needs to be enabled on AI, and to that extent, we want to democratize AI as much as possible within Infosys. The second challenge is that we want to harness these machine learning models, which are specific to a particular problem statement, whether it is a horizontal or a vertical problem statement. When you want to do all these activities, the constraint is that the GPU is a very expensive resource. We have maybe 200,000-plus people, and we cannot ask each of them to go and buy or hire a GPU; that is very expensive for us, and it is not a scalable model. The second thing, which you briefly touched on, is the multi-skilled nature of MLOps. We can't have a situation where you need a data engineer, an ML engineer, and a data scientist all working together to create every one of these models; we want to automate that as much as possible. With that objective in mind, we have created an Infosys AI cloud, which is democratizing a lot of this AI within Infosys. We have set up a large GPU cluster, which has been divided into development GPUs, larger GPUs used for training purposes, and another area where you can deploy the models and do the inferencing. On top of this basic GPU and CPU infrastructure, we have created an open-source based MLOps stack. That way, if a data scientist comes with a piece of code, they can just deploy it on that stack, and they don't need the help of a data engineer or an AI engineer to get their model trained, tested and deployed. By doing that, there is a lot of enablement of people. And once people tune their models and feel they are good, they can publish them to a store, so they are available to a larger audience within the organization. If anybody in any corner of Infosys wants to see what models are available, they can quickly go to the store and explore. That way, we are democratizing on the people front, and also creating a lot of good content that can be published to the people who are in front of customers.
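The "publish to a store" idea Amit describes maps onto what open-source tools call a model registry. As a sketch of the pattern (not Infosys' internal store, which isn't public), here is how it might look with the MLflow Model Registry; the run id and model name are hypothetical placeholders.

```python
# Sketch of publishing a tuned model to a shared "store" using the MLflow
# Model Registry as a stand-in. The run id and model name are hypothetical;
# this illustrates the pattern, not Infosys' internal store.
import mlflow

run_id = "abc123"  # hypothetical id of a run that logged a model as "model"
result = mlflow.register_model(f"runs:/{run_id}/model", name="fire-smoke-detector")

# Any team can then discover and load the model by name and version.
loaded = mlflow.pyfunc.load_model(f"models:/fire-smoke-detector/{result.version}")
```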
Abhiram Mahajani: Great. While you were describing this, and during the course of our conversation, I sensed that this also has some interplay with overall AI governance, and with how we manage the accuracy of models, the fairness of those models, and so on. So there's also this emerging space of ethical AI or responsible AI. Does that have any interlinkage with this topic? Or are these two separate conversations in themselves?
Amit Gaonkar: Yeah, I think it is like a sequence. It's like a game where you go from level one to level two, and level two to level three. So far, people have been focused mostly on creating the models. Now people are focused on MLOps and deploying these models at scale. And we see a lot of our customers realizing that the next game is going to be about ethical AI, responsible AI, and there is a lot of curiosity about, again, how to implement it at scale. Responsible AI implementation, to a large extent, is going to be possible only if you have a good MLOps layer in place, because the MLOps layer allows you to create these models in a standardized lifecycle manner, and it also allows you to create a repository of all your models. So if you want to enforce any kind of governance or responsible AI on top of it, MLOps will be the basic foundation that enables you to implement some of those principles. Of course, it is a large topic, and I don't think we can cover it in this session. But certainly that is the next thing that is going to come up in AI after this.
Abhiram Mahajani: Great, great, thanks for that explanation. In fact, I look forward to maybe recording that as our next conversation in this series. But thank you so much, Amit and Kaushal. The topic in itself is pretty vast from the sound of it, and there are multiple things to discuss. But I hope our audience did get a sense of what MLOps is, why exactly it is relevant, and what some of the things are that the industry is doing. Once again, thank you so much for joining, and we'll definitely be in touch. I look forward to speaking to you next on responsible AI. Thank you.
Abhiram Mahajani: We hope you enjoyed this conversation. For more such talks, do subscribe to the Infosys Applied AI Podcast on any of your favorite podcast platforms. To know more about what we do in this space, do visit infosys.com/appliedai and if you happen to have any suggestions or if you feel like joining these conversations, do feel free to write to us at appliedai@infosys.com. Thank you for listening.
Amit is an AVP & Sr. Principal Technology Architect for AI & Automation services at Infosys.
Amit leads an expert group of architects and data scientists who are responsible for building AI-driven solutions in areas such as document digitization, video analytics, IT Ops, MLOps & Responsible AI.
Kaushal is a Principal Technology Architect for AI & Automation services at Infosys.
Kaushal drives various initiatives in AI technologies; his areas of expertise happen to be AI platforms, MLOps and NLP-based AI solutioning.