Insights
- Multicloud deployments happen for many good reasons, but most do not amount to a long-term strategy.
- A cloud-native adoption strategy using guidelines from the Cloud Native Computing Foundation can make multicloud more manageable.
- Properly managing different data types in cloud is a crucial first step.
- Services and functions should be immutable, stateless, and replicable.
- Security, compliance, and a site reliability engineering culture are paramount and universal concerns.
Multicloud is being adopted at a high rate by large organizations. According to a recent survey, about 89% of IT decision-makers globally have a multicloud adoption strategy in place, with around 73% being on hybrid cloud and 16% on public or private multicloud deployments.
But for many organizations Infosys works with, adoption has not been strategic. So, companies are not getting the benefits of resiliency and agility this approach offers.
In truth, a multicloud strategy should only be chosen where it can truly drive business agility and resiliency — and not just as the default architecture for low-cost cloud. Indeed, we believe that companies should aim not primarily to be multicloud but instead to be “cloud agnostic.”
How did we get here?
Most companies adopted multicloud by default — not as part of their strategy. Often, different parts of a business simply chose a cloud provider that best suited their needs.
There are several good reasons to have multiple cloud providers:
- Completely disjointed workloads, such as analytics that is isolated and separate from other processes and only needs data streamed in from other sources.
- A need to achieve geographic or regulatory compliance, which means picking local cloud vendors.
- A focus on speed, building the cloud architecture by picking vendors that best meet the specific needs of each application and service at that time, within the available budget.
- Concerns about being overly reliant on one cloud vendor, and a desire to improve resiliency by spreading risk across multiple providers.
Of all these reasons, only the first is a valid long-term strategy, and that situation is rare among the organizations we deal with. It is much more common for a business to have ended up with a multicloud strategy because of the last three reasons in our list: it needed to achieve regulatory compliance, get into the cloud quickly, or build resilience through a mix of vendors.
Regardless of how a company came to a multicloud architecture, the same challenges remain.
There are operational challenges in constructing and developing code so that it works in more than one place. In practice, this restricts portability between cloud vendors, which limits agility and creates potential cost lock-in.
Governance is also a challenge. It’s difficult to get a single view of access, activity, potential risks, and costs. Security and compliance around data privacy, access privileges, and data breaches become highly federated and distributed. This makes management difficult and creates the potential for loopholes and cracks to emerge in the system, which puts data at risk.
Managing multiple cloud vendors also requires a wider range of skills within your teams. You may need to support multiple application programming interfaces (APIs), deployment mechanisms, and operational methodologies for each of your different cloud providers.
Sorting out the mess
Ultimately, each cloud provider works in a different way, and companies often end up designing their cloud approach around these varying implementations. This results in multiple parallel tracks and difficulty in providing a single layer of insight.
To take control of this complexity, we recommend that companies adopt the Cloud Native Computing Foundation (CNCF) guidelines and apply these across all of their cloud deployments. The CNCF landscape specifies a wide range of applicable components and practices for managing cloud assets, including aspects of platform, provisioning, runtime, application development, workload orchestration, and observability of operations. It provides a practitioner’s view of appropriate tools and methodologies for a cloud-native solution, agnostic of any specific provider.
We have also outlined below five cloud-native principles that address these concerns and help an organization remain cloud agnostic:
1. Appropriate persistence and messaging
The technical construct of a business solution is primarily driven by persistent data, derived data, and the context of the conversation during a user transaction. Persistent data — such as customer information, account details, and preferences — remains relatively constant, while transactional data — such as beneficiary details, transaction amounts, and calculated balances — is mostly contextual and derived, computed as part of workload execution when users transact with the system.
Workloads that access this data can be deployed as a mesh of services, often spanning multiple deployment instances. So, we need to ensure that data is available in a resilient manner, with a predictable guarantee of service continuity. In case of a disaster, the source of data needs to be available to the compute (workload). This forces us to keep copies of the persisted data in multiple zones so that it’s readily available in the event of a zone failure. The storage and referencing of data should ensure there is no duplication or conflict in the relationships, and the master copy of each data entity should always remain local and unique to a specific data store. Business data needs guaranteed persistence, requiring stringent availability measures to be implemented. Active redundancy of data copies is best handled within a dedicated network and hence should always be handled by a single cloud vendor. Spreading related data attributes across different providers makes sourcing and computation highly distributed and requires additional processes to reconcile and keep information consistent. This adds to the complexity and cost of operations and risks data inconsistencies.
When workloads derive additional information from the data and relate it to the transactional conversation in progress on behalf of the user, that conversation context must be maintained in a transient cache so that the service mesh can access it quickly, with the right freshness and recency. This determines how frictionless the interaction is.
When you log in, you have a session, and this session understands your context and your intents. It creates a bundle of information attributes that resides in memory. But this bundle is transient and volatile and must be cached somewhere. Replicating this memory across more than one cache provider leads to high complexity in terms of its freshness, lifetime, and memory residency. The caching policies must be handled efficiently and accurately, otherwise unwanted and erratic outcomes can lead to business inconsistencies. This calls for a single distributed cache spanning multiple availability zones of one provider, rather than splitting the context across multiple cloud providers.
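As a rough sketch of this principle, the snippet below keeps a user’s session context in one shared cache with an explicit time to live, rather than scattering it across providers. It assumes a Redis-compatible cache reachable at a hypothetical host name and uses the open-source redis-py client; the key names and TTL value are illustrative only.

```python
import json

import redis  # open-source redis-py client; any Redis-compatible cache works

# Hypothetical single distributed cache endpoint, replicated across the
# availability zones of one provider (the host name is a placeholder).
cache = redis.Redis(host="session-cache.internal", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 900  # context expires unless refreshed by user activity

def save_session_context(session_id: str, context: dict) -> None:
    """Store the transient conversation context with a time to live."""
    cache.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(context))

def load_session_context(session_id: str) -> dict | None:
    """Fetch the context if it is still fresh; None means the session has expired."""
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None

# Every service instance in the mesh reads and refreshes the same single copy.
save_session_context("abc-123", {"customer_id": 42, "intent": "transfer-funds"})
print(load_session_context("abc-123"))
```

Because the time to live is enforced by the cache itself, freshness and eviction policies stay in one place instead of being reconciled across providers.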
At the same time, messages and events propagated across the participating components of the workload often require traceability and a guarantee of delivery across the points of communication. The messaging is mostly handled through an event hub. The messages, receipt acknowledgements, and audit logs of transfers should span a controlled boundary of the event hub instead of spanning different providers. Recovery procedures and repeat attempts for failed messages are always expected to be idempotent. Hence, event hubs should be localized to a single provider’s availability zones.
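To make the idempotency expectation concrete, here is a minimal, self-contained sketch in plain Python (no particular event hub assumed) of a consumer that remembers processed message IDs so that a redelivered or retried message does not apply the same business action twice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    message_id: str  # unique ID assigned by the producer
    payload: dict

class IdempotentConsumer:
    """Applies each message's action at most once, even if it is redelivered."""

    def __init__(self) -> None:
        # In a real system this set would be a durable store shared by replicas.
        self._processed_ids: set[str] = set()

    def handle(self, message: Message) -> None:
        if message.message_id in self._processed_ids:
            # Duplicate delivery (for example, a retry after a lost acknowledgement): no-op.
            return
        self._apply_business_action(message.payload)
        self._processed_ids.add(message.message_id)

    def _apply_business_action(self, payload: dict) -> None:
        print(f"posting transaction {payload}")

consumer = IdempotentConsumer()
msg = Message(message_id="tx-001", payload={"amount": 100})
consumer.handle(msg)
consumer.handle(msg)  # the redelivered copy is safely ignored
```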
The table below summarizes the importance of various datasets in the cloud and the appropriate methodology for handling them.
Data category | Data characteristics | Data location | Cloud principles |
--- | --- | --- | --- |
Persistent | A representation of an entity that, once created, remains in existence forever. It is augmented by additional attributes that add more meaning to the data. Each entity can maintain cardinality with other entities in an established relationship. For example, a customer and account have unique existence and maintain one-to-many relationships between themselves. Additional transversal relationships may exist to enable data derivations and linking. | Always stationary at its origin. Needs to reside in a repository where relationships can clearly be defined and managed in the form of indexes. Best managed within a data provisioning cloud, singularly accessed by local or remote components. | Preferable to be in a single cloud provider’s virtual machine and replicated across redundant data container zones to ensure availability, restorability, and quick access. |
Context data | Short duration referential data relevant for a defined timeline. It may be in the form of data lookup tables or a user’s tracking data such as session ID. Occasionally, it may comprise derived datasets, such as currency exchange rates or call data records. Generally, this is volatile in nature. | Takes the form of a distributed cache. May be adopted as a content delivery network or as an array of in-memory cache servers. The data collection is generally arranged in the form of name-value pairs and expires after a set period, so that more recently used data gets a residency preference. The primary purpose of the cache is to enable high responsiveness and accumulate related data in a single repository with limited time to live. | Preferable to be in a single cluster zone of a cache utility provider. All resident datasets need to be accessible from referring client systems with very high speed and must contain freshness snapshots of data accurate within the time frame of a user interaction. |
State representation | These are messages that help deliver datasets and information from one point to another using a data packet arranged semantically through data markup methodologies. Each message represents a state of an action pending on it by a consumer process. The producer of the message is delinked from the consumer of the message and actions can be performed at disjointed time frames. When the action is completed, the data relevance ceases to exist and it can be removed or kept for backup only for a short time. | It is a point-in-time repository of data which handles only a collection of specific data types called a message. All messages lying in the repository are rendered irrelevant after all actions destined to be performed on them are completed. The paradigm of send and receive or publish and subscribe are used for inserting or deleting data from the repository. | Messaging systems consist of several components like queues and topics which are in the form of partitions spread across file systems of a given distributed system. Messaging systems maintain redundancy through replication into live duplicate copies and need to be accessible by a message broker system at all times. Hence, messaging systems (for example, Kafka, RabbitMQ, etc.) should be deployed in a single cloud provider. They can be interacted upon by computation systems that can be distributed across different business domains. The consumers and producers of the messages can be spread across a multicloud. |
Inflight (request/response pair) | Data in transit that moves from a client system to another target system, potentially delinked by a proxy. The data is generally secured through encryption while in a public zone. These are instructional in nature and contain relevant queries in a markup form required to narrate the command intent in sufficient detail. This data must be transformed by a computing endpoint and returned to the client system in the reverse direction. | Computational endpoints are the workhorses of the cloud ecosystem. They take the form of services that span multiple clouds and are stateless, and can replicate themselves into clusters to provide the exact same service from each participating endpoint. They are ephemeral and can be initiated and destroyed at will to the extent that no request from a client is pending on them. The loss of inflight data is generally tolerable through retries or resubmissions (a minimal client-side retry sketch follows this table). | These data processing units are distributed as microservices across one or more clusters, which may all be deployed in a single cloud provider or even multiple cloud providers. The traffic to the endpoints is directed by a set of global load balancers (GLB) and proxies called ingress controllers. Ingress controllers are aware of services within a single cloud provider’s cluster. At the same time, GLBs are aware of multiple ingresses, each of which may lie in a different cloud provider. So, the GLB is the routing point of all multicloud services and is the critical component used for multicloud deployments which need to orchestrate heterogeneous services to accomplish a business outcome. |
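As a small illustration of the last row, the sketch below resubmits an in-flight request against more than one ingress endpoint, which is roughly the behavior a global load balancer provides on the caller’s behalf. The endpoint URLs are hypothetical placeholders, and only the Python standard library is used.

```python
import urllib.error
import urllib.request

# Hypothetical ingress endpoints, each fronting the same stateless service in a
# different cloud provider (both URLs are placeholders for illustration).
ENDPOINTS = [
    "https://ingress.cloud-a.example.com/balance",
    "https://ingress.cloud-b.example.com/balance",
]

def call_with_failover(timeout: float = 2.0) -> str:
    """Try each endpoint in turn; a lost in-flight request is simply resubmitted."""
    last_error: Exception | None = None
    for endpoint in ENDPOINTS:
        try:
            with urllib.request.urlopen(endpoint, timeout=timeout) as response:
                return response.read().decode("utf-8")
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # tolerate the loss and move on to the next endpoint
    raise RuntimeError(f"all endpoints failed: {last_error}")
```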
2. Industry-standard orchestration
Services and functions deployed across virtual containers should be immutable, stateless, and replicable. It should be possible to erase or scale them on demand. All dependent parts should be pulled in together to form a self-contained microservice, without relying on conversational context or state.
Control parameters for a service should be open to runtime changes without requiring redeployment. The environment in which the executables run should be portable and agnostic to the various cloud providers’ platform provisioning methodologies. Kubernetes has emerged as the industry leader for workload orchestration. All the major cloud providers support Kubernetes-based container deployment while also offering proprietary virtualization techniques. Adopting Kubernetes containers when designing solutions, rather than a cloud provider’s proprietary compute engine, puts the workload on solid ground for leveraging many cloud providers in a flexible manner. The Kubernetes engine provides standard constructs for the development and deployment of services and makes the workload inherently portable across hybrid deployments and cloud providers.
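As a minimal sketch of runtime-changeable control parameters, the function below re-reads its configuration from an optional mounted file and environment variables on every request, so that an updated configuration (for example, a Kubernetes ConfigMap surfaced as a file or environment variables) takes effect without redeploying the workload. The variable name, file path, and parameter names are illustrative assumptions rather than a prescribed convention.

```python
import json
import os
from pathlib import Path

# Illustrative location only: a Kubernetes ConfigMap is commonly surfaced to the
# container as environment variables or as files mounted on a volume.
CONFIG_FILE = Path(os.environ.get("APP_CONFIG_FILE", "/etc/app/config.json"))

DEFAULTS = {"feature_flags": {}, "request_timeout_seconds": 5}

def current_config() -> dict:
    """Re-read control parameters at call time so changes apply without redeploying."""
    config = dict(DEFAULTS)
    if CONFIG_FILE.exists():
        config.update(json.loads(CONFIG_FILE.read_text()))
    # Individual environment variables can override file-based values.
    if "REQUEST_TIMEOUT_SECONDS" in os.environ:
        config["request_timeout_seconds"] = int(os.environ["REQUEST_TIMEOUT_SECONDS"])
    return config

def handle_request(payload: dict) -> dict:
    cfg = current_config()  # fresh parameters for every request; no state is kept
    # ... stateless processing of the payload would happen here ...
    return {"status": "ok", "timeout_used": cfg["request_timeout_seconds"]}

print(handle_request({"customer_id": 42}))
```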
Articulate infrastructure-as-code and automate the deployment of workloads, ensuring all configurable parameters are centralized in a secure, version-controlled repository. This helps reduce errors around key management and application misconfiguration, and it lowers the pain of managing dynamic aspects by injecting changes into the overall solution.
3. Single-pane observability
Having a real-time view of engineering and operational efficiency is important to prevent wasted capacity and identify weak spots that put reliability at risk. Adopt commercial off-the-shelf tools that support open standard specifications such as OpenTelemetry, for example Dynatrace, ELK, Splunk, or Harness.
But when adopting these, enterprises need to be wary of:
- Service discoverability: How endpoints interact and are configured.
- Application performance: How individual transactions are traced to their completion from the point of submission, and what the volumetrics and turnaround latencies are (see the tracing sketch after this list).
- Workload capacity utilization: Whether the resources allocated are sufficient or opportunities exist to refactor and redistribute to reduce cost while keeping resiliency intact. Efficient controls are needed to avoid cost escalations and uncontrolled consumption of resources. Respective projects and lines of business should have clear visibility of demand, availability, and utilization.
- Service level agreements (SLAs) with service providers: How the multicloud or hybrid environment is functioning with respect to outages, errors, and service quality. As the number of providers increases, the probability of outages is a combinatorial outcome of all impending risks across the participating components. Hence, availability factors will be highly influenced by the SLAs signed with each vendor. The tolerance for error gets more stringent as more providers are onboarded. Detected faults and outlier conditions need to be escalated quickly for resolution.
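As an illustrative sketch of the tracing concern above, the snippet below instruments a single business transaction with the open-source OpenTelemetry Python SDK and exports spans to the console. In practice the exporter would point at whichever observability backend the enterprise has standardized on, and the service and span names here are assumptions for the example.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; a real deployment would typically export
# spans over OTLP to the chosen backend instead of printing them.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payments-service")  # service name is illustrative

def submit_payment(amount: float, currency: str) -> None:
    # One span per transaction captures submission-to-completion latency.
    with tracer.start_as_current_span("submit_payment") as span:
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", currency)
        # ... downstream calls would appear here as child spans ...

submit_payment(250.0, "USD")
```

This requires the opentelemetry-sdk package; because the instrumentation is vendor-neutral, the same traces can be correlated regardless of which cloud provider hosts the workload.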
4. Secure, compliant, defendable
Be vigilant around access and authorization patterns, allowing identified consumers to access the system with ease yet under control. Regulatory requirements on reporting data breaches, protecting personal data, and avoiding denial-of-service threats must be enforced centrally and as a minimum adherence criterion for all participating components in the multicloud.
This requires an ever-alert security and compliance enforcement team, equipped with tools and detection appliances installed across the enterprise. APIs, web access, databases, and access to compute engines must be hardened and constantly observed for emerging threats as new changes are deployed across the operational environment. Multiple security layers must be enforced so that intrusion is made as difficult as possible.
Patch release policies, software upgrades, end-of-life announcement policies, and robust transport layer and authorization policies need to be vetted by the security department. Often, cloud components are rejected from onboarding when their firewall policies are found to be vulnerable or to cause contention for compute resources.
5. Site reliability engineering (SRE) culture
Cloud operations have rarely been successful without a strong “DevSecOps” pipeline, and this needs sponsorship from engineering directors and a conscious effort to retrospectively mine risks and opportunities. The SRE mindset must be ingrained in the fabric of engineering practices.
SRE measures include assessing the maturity of IT systems with respect to the aging of application components and tools, evaluating technical debt across DevOps pipelines, understanding performance indicator measures, and instilling a collaborative culture between teams. Often, enforced directives and an order-taking culture lead to a diminished sense of responsibility and less willingness to explore risks and opportunities. The leadership should evangelize the rationale of actions and decisions and propagate a sense of shared ownership.
Employees and team members must leverage the best available talent and capabilities, understand existing constraints, and take a forward-looking perspective to acquire skills and address challenges as they emerge. When choices and responsibilities converge, there is a greater chance of collaborative gain and success. Teams should come together to contribute their skills toward collective goals rather than the myopic objectives of key performance indicators.
Recommendations: Be cloud-agnostic, not multicloud
Multicloud approaches can bring benefits, but it’s important that companies don’t jump into multicloud unless it makes sense. Always think about cloud adoption as a digital transformation rather than just a lift-and-shift exercise. Align your hybrid cloud strategy with workload patterns to ensure a smooth transition.
Evaluate if the workload is ready for a cloud-native deployment by considering service availability, orchestration, data persistence, operations observability, and security. Adopt the right practices, tools, and frameworks and vet the deployment architecture through a change control board, comprising technology, finance, and operations teams.
If your organization spans multiple geographies, consider regulatory and geographic concerns. Regional data tenancy and reporting regulations determine the available cloud providers, making solution portability paramount. This allows your organization to choose the most relevant partner on demand. So, construct the solution using industry standards and established principles rather than around a specific cloud provider’s features.
Talent and skillsets also matter. Whether through the employee pool or trusted partners, your team must live up to the cloud management and operation requirements. The driving factors are often influenced by the comfort levels of the existing employee base. Instill a sense of camaraderie and collaboration with shared responsibility for maximum success.
Additionally, adopt a set of cloud governance tools that assimilate “DevSecFinOps” into the management layer. These ensure efficient interoperability between participating solution components and productive business outcomes through the observation, measurement, and control of costs and resource utilization. Ensure compliance by tracking security and servicing violations.
Don’t be multicloud, be cloud-agnostic.
