Tech Navigator: Why building the right data architecture is essential for agentic AI

Insights

  • Realization of agentic value is bounded by the underlying data estate.
  • The data infrastructure must support active and live data retrieval, semantic grounding, and state persistence — or the ability to remember, learn, and act consistently over time.
  • To unlock the value of data for AI, the enterprise must build what we call an agentic data stack, established on five foundational pillars: AI data fabric, multimodal management, domain ontology, memory architecture, and autonomous observability.
  • This is about building a unified, cloud‑based data fabric composed of domain‑aligned data products, real‑time activation pipelines, and governed access patterns such as the Model Context Protocol (MCP).
  • To adopt this agentic data stack at scale, enterprises should follow a “crawl, walk, run, scale” implementation process. This helps manage risk and investment as organizations transform into agentic AI-first enterprises.

Organizations are transitioning from the age of generative AI, characterized by large language models (LLMs) that act as passive retrieval and summarization engines, to the era of agentic AI, where bots are designed to perceive complex environments, reason through multistep problems, plan execution paths, and act autonomously to achieve high-level business goals.

However, not all shifts to agentic AI are equal. Industry analysis suggests a divide is forming: a vanguard of approximately 6% of enterprises is already rebuilding workflows around agents as the new operating system of the organization, while the rest struggle to get pilots working at scale, constrained by legacy data architectures and hampered further by poorly defined processes, lax governance, and a lack of strategic know-how.

According to our Enterprise AI Readiness Radar, only 17% of organizations had prepared their data estates adequately for AI at the beginning of 2025, and as noted in recent architectural assessments, 91% of AI models experience quality degradation over time due to stale or fragmented data.

Realization of value is bounded by the underlying data estate. For agents to reason and act correctly, they need low-latency access to high-quality data delivered in a structure they can consume.

Data infrastructure built for human analytics, including dashboards, quarterly reports, and advanced analytics, is not well prepared for autonomous agents, which require millisecond decision loops and continuous context for advanced AI models.

Why agents require an immaculate data estate

To understand the data infrastructure requirements, we must first define the agentic loop, or the way an autonomous agent operates through a cycle of receiving, analyzing, and acting on data (a minimal code sketch follows the list):

  1. Perception: This involves ingesting multimodal signals, including text, logs, video, and audio from the environment.
  2. Reasoning: To reason, agents utilize LLMs and domain ontologies, or maps of enterprise concepts and relationships, to understand the current state relative to a goal.
  3. Planning: This step decomposes the goal into a sequence of executable steps.
  4. Action: Agents then trigger tools, application programming interfaces (APIs), or database writes to change the state of the business environment, including systems, files, and workflows.
  5. Memory: Finally, a successful agent learns and refines itself, using active working memory and long-term knowledge that has been recorded to inform future actions.
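
In code, the loop looks roughly like the sketch below: a minimal, runnable Python skeleton in which perception, reasoning, planning, and action are stubs standing in for real ingestion pipelines, LLM calls, and tool integrations.

```python
# A minimal, illustrative sketch of the agentic loop. Every function here
# is a stub; a real agent would call ingestion, LLM, and tool layers.
from dataclasses import dataclass, field

@dataclass
class Memory:
    episodes: list = field(default_factory=list)

    def recall(self, goal: str) -> list:          # long-term knowledge lookup
        return [e for e in self.episodes if e["goal"] == goal]

    def store(self, episode: dict) -> None:       # record for future actions
        self.episodes.append(episode)

def perceive() -> dict:                           # 1. Perception: ingest signals
    return {"ticket": "server outage reported at 10:00"}

def reason(goal: str, obs: dict, history: list) -> dict:  # 2. Reasoning (stub)
    return {"done": False, "diagnosis": "restart candidate"}

def plan(goal: str, state: dict) -> list:         # 3. Planning: decompose goal
    return ["restart_service", "verify_health"]

def act(step: str) -> str:                        # 4. Action: tool/API call (stub)
    return f"{step}: ok"

def run_agent(goal: str, memory: Memory, max_cycles: int = 3) -> None:
    for _ in range(max_cycles):
        obs = perceive()
        state = reason(goal, obs, memory.recall(goal))
        if state["done"]:
            break
        results = [act(s) for s in plan(goal, state)]
        memory.store({"goal": goal, "obs": obs, "results": results})  # 5. Memory

run_agent("resolve server outage", Memory())
```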

This loop creates huge demand for data infrastructure. Unlike a human analyst who might not mind a 24-hour delay in data freshness, an agent operating in real time requires a live view of the enterprise.

However, several factors obstruct agentic adoption: siloed data that forces agents to act on incomplete information; a lack of API-driven interaction patterns; limited ability to use multimodal data spanning structured and unstructured sources like emails, video footage, or audio recordings; and governance risks, as companies hesitate to deploy agents whose decision-making processes are hard to understand and potentially precarious.

The data infrastructure must support active and live data retrieval, semantic grounding, and state persistence — or the ability to remember, learn, and act consistently over time.

In this way, the data estate becomes the long-term memory and sensory processing center for the digital workforce.

The agentic data architecture

To unlock the value of data for AI, the enterprise must build what we call an agentic data stack. This architecture resolves the complications of silos, unstructured data, and governance by establishing five foundational pillars: AI data fabric, multimodal management, domain ontology, memory architecture, and autonomous observability.

Pillar one: AI data fabric — the composable data layer

Traditional enterprises maintained a clear separation between applications that accessed analytical data and those that accessed transactional data. The perception-reasoning-planning-action loop of agentic applications requires access to both. Hence, agentic enterprises need to evolve their data fabric layer from traditional data warehouse and data access architectures to data-as-a-product (DaaP).

Data curation

DaaP is a methodology for curating data products, each with its own schema, lineage, quality rules, ownership boundaries, service level agreements (SLAs), pipelines, and access policies. This replaces the traditional pattern of aggregating all data into a monolithic lake, allowing the enterprise to preserve distributed ownership while delivering a consistent, agent-ready interface. By standardizing the data product lifecycle, the fabric ensures consistent semantics, eliminates redundant integrations, and accelerates the deployment of new agentic use cases.
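
As an illustration, a data product's contract can be captured declaratively. The sketch below uses a plain Python dataclass with hypothetical field names; real platforms define their own contract schemas.

```python
# Illustrative sketch of a data product contract in the DaaP pattern.
# Field names and values are hypothetical, not a standard.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                     # product identifier
    owner: str                    # accountable domain team
    schema: dict                  # column names and types
    quality_rules: list           # e.g., completeness and freshness checks
    sla_freshness_minutes: int    # maximum acceptable staleness
    access_policies: list         # roles/scopes permitted to query
    lineage: list = field(default_factory=list)  # upstream sources

customer_360 = DataProduct(
    name="customer_360",
    owner="customer-domain-team",
    schema={"customer_id": "string", "lifetime_value": "decimal"},
    quality_rules=["customer_id is unique", "no nulls in customer_id"],
    sla_freshness_minutes=15,
    access_policies=["role:support-agent:read"],
    lineage=["crm.contacts", "billing.invoices"],
)
```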

While agents can operate across data products and domains, it helps to align data products along data domains. This ensures that there is business ownership of data and its lifecycle. Data domains can be aligned by business entity, for example, product or customer, or they can be aligned by consumer function, for example, checkout or payments.

Data domains can leverage virtualization capabilities to create a layer that connects with disparate sources, such as AWS S3 (cloud object storage), Snowflake (a cloud-native data warehouse that stores, processes, and analyzes large volumes of data), or source transaction systems. This layer is curated around the data domain, with data ownership aligned to the respective business owners and managed by distributed, self-contained, and empowered teams under thin, centralized governance.

Data access

To ensure secure, consistent, and LLM‑native access to these data products, the AI data fabric should adopt open standards such as the MCP, enabling each data product to expose governed and permission‑scoped capabilities through a uniform interface. This allows agents to retrieve data, invoke tools, and interact with enterprise systems without custom wiring. Every interaction becomes auditable, policy‑aligned, and role‑aware, dramatically simplifying integration and strengthening governance, effectively turning the data fabric into a tool‑and‑data‑ready substrate for agentic intelligence.
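
A minimal sketch of this access pattern, using the official MCP Python SDK (pip install mcp); the tool name and the query_customer_360 helper are hypothetical stand-ins for a governed data product.

```python
# Sketch of exposing a data product capability over MCP. The SDK calls
# (FastMCP, @mcp.tool, mcp.run) are from the official Python SDK; the
# data product and helper below are hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("customer-360-data-product")

def query_customer_360(customer_id: str) -> dict:
    # Hypothetical stand-in for the product's governed query layer,
    # where permission scoping and audit logging would be enforced.
    return {"customer_id": customer_id, "lifetime_value": 1234.50}

@mcp.tool()
def get_customer_profile(customer_id: str) -> dict:
    """Return the governed customer_360 record for a customer."""
    return query_customer_360(customer_id)

if __name__ == "__main__":
    mcp.run()  # serve the tool over MCP (stdio transport by default)
```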

In this combined form, the AI data fabric becomes both the connectivity backbone and the real-time activation engine of the agentic enterprise, linking siloed operational systems to cloud-hosted data products and providing the governed, high-throughput intelligence needed for safe, scalable autonomous decision-making.

Pillar two: Multimodal management — event intelligence layer

While data abstraction through the unified data fabric simplifies data asset discovery for agents, further value can be unlocked for agentic systems using an event intelligence layer.

Agents must be able to perceive the world through all available modalities, including text, images, audio, video, and events. This means building a robust multimodal indexing, temporal analytics, and retrieval pipeline on an event fabric.

Understanding event timelines, coupled with the ability to access historical insights from the AI data fabric, gives agents the ability to adapt and synthesize events in real time. It unlocks powerful capabilities, such as enabling agents to discover patterns as events occur within the system, uncovering insights that humans would miss. For example, an account that receives a large sum of money through multiple payments in a short period could indicate a scam and be flagged for audit.

Streaming technologies like Kafka and Solace, coupled with Flink, are the backbone for building responsive real-time intelligent systems.
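
As a rough illustration, the sketch below uses the kafka-python client to watch a hypothetical payments topic and flag the rapid-inbound-payments pattern described above; in production, the windowing logic would typically live in Flink rather than in a consumer.

```python
# Sketch of event-stream pattern detection with kafka-python
# (pip install kafka-python). Topic, fields, and thresholds are hypothetical.
import json
from collections import defaultdict, deque
from kafka import KafkaConsumer

WINDOW_SECONDS = 300                 # look-back window per account
FLAG_COUNT, FLAG_TOTAL = 5, 50_000   # >=5 inbound payments totalling >=50k

consumer = KafkaConsumer(
    "payments",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

windows = defaultdict(deque)         # account_id -> recent (ts, amount) pairs

for msg in consumer:
    event = msg.value                # e.g., {"account": "A1", "ts": 1700000000, "amount": 12000}
    win = windows[event["account"]]
    win.append((event["ts"], event["amount"]))
    while win and event["ts"] - win[0][0] > WINDOW_SECONDS:
        win.popleft()                # evict events older than the window
    if len(win) >= FLAG_COUNT and sum(a for _, a in win) >= FLAG_TOTAL:
        print(f"Flag account {event['account']} for audit")  # hand off to an agent
```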

To enable this, diverse data types can be projected into a shared high-dimensional vector space, a mathematical space where the agent represents information so that similar things are close to each other and different things are far apart. This is done using multimodal embedding models like CLIP, ImageBind, and Google multimodal embeddings. An embedding in this context is a numerical vector that captures the data's meaning or features; both text and images can be mapped into this space, and a piece of text and an image that match will have similar vectors.
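
A brief sketch of this shared space, assuming the sentence-transformers library and its public CLIP checkpoint; the image file is hypothetical.

```python
# Sketch of projecting text and an image into one vector space with CLIP,
# via sentence-transformers (pip install sentence-transformers pillow).
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")   # multimodal embedding model

text_vec = model.encode("a forklift blocking a warehouse aisle")
image_vec = model.encode(Image.open("cctv_frame.jpg"))  # hypothetical frame

# Matching text and image land close together in the shared space.
print(float(util.cos_sim(text_vec, image_vec)))
```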

For video and other data where values change over time, the layer shouldn't treat video as a collection of static images, as this loses the causal sequence necessary for reasoning. Instead, the data should be segmented into temporal windows of visual features, audio, and temporal graphs that capture time-based relationships, enabling the agent, given adequate context, to answer questions like “why did the production line stop at 10am?”
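
A simplified sketch of such windowing; the events and timestamps are illustrative, and a real pipeline would also build the temporal graphs mentioned above.

```python
# Sketch of segmenting a multimodal event stream into ordered temporal
# windows, preserving the causal sequence needed for "why" questions.
from itertools import groupby

events = [  # mixed modalities from a production line (illustrative)
    {"ts": 35940, "modality": "video", "feature": "conveyor jam detected"},
    {"ts": 35955, "modality": "audio", "feature": "alarm tone"},
    {"ts": 36000, "modality": "log",   "feature": "line halted by PLC"},
]

def window_key(event: dict, size_s: int = 60) -> int:
    return event["ts"] // size_s     # bucket into 60-second windows

windows = {
    key: sorted(group, key=lambda e: e["ts"])  # keep causal order per window
    for key, group in groupby(sorted(events, key=window_key), key=window_key)
}

# An agent asked "why did the line stop at 10am?" can walk the windows
# leading up to the stop instead of inspecting isolated frames.
print(windows)
```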

Domain data products can be augmented with AI-enriched insights derived from the event intelligence fabric, or these insights can be curated as separate cross-functional data products available to agents.

Pillar three: Domain ontology and knowledge graphs — enterprise language layer

While vectors provide similarity, they lack truth. An LLM can hallucinate connections that don't exist. To resolve the trust gap, agents must be grounded in a domain ontology.

The ontology, or map, acts as a shared language for the enterprise: the vocabulary of business concepts and relationships that all agents can align on, defining what a customer or product really is across sales, marketing, and support. This enables agents to query maps of meaning, increasing intelligence and reducing hallucinations.

In our implementations, a knowledge graph first models the enterprise as a network of entities and relationships: less a database and more a flexible, semantic graph that prevents confusing, say, Apple Inc. with apple, the fruit. It also curbs hallucinations, because agents verify their reasoning against the graph. In our paper, A new approach to explainable AI, a knowledge graph working with a vector RAG database was shown to reduce hallucinations by ensuring the query is semantically similar to the retrieved content; answers score highly when they obey the rules of the domain ontology and are validated against the context provided to the agent. For instance, in supply chain optimization, we use this knowledge graph to block sanctioned relationships, such as product X being shipped to sanctioned country Y, regardless of the LLM's probability score.
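
A minimal sketch of this guardrail, using networkx to stand in for the knowledge graph; the entities and relationship labels are illustrative.

```python
# Sketch of validating a proposed agent action against a knowledge graph
# (pip install networkx). Entities and relations are illustrative.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("product_X", "country_Y", relation="ship_to_requested")
kg.add_edge("country_Y", "sanctions_list", relation="listed_on")

def destination_sanctioned(graph: nx.DiGraph, destination: str) -> bool:
    # Ontology rule: never ship to a destination on the sanctions list.
    return any(
        data.get("relation") == "listed_on"
        for _, _, data in graph.out_edges(destination, data=True)
    )

proposed = {"action": "ship", "product": "product_X", "destination": "country_Y"}
if destination_sanctioned(kg, proposed["destination"]):
    print("Blocked: destination is sanctioned, regardless of the LLM's score")
```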

Pillar four: Agent memory management — agentic knowledge persistence layer

For an agent to act as a coworker, it must remember. Memory is a complex data engineering problem involving storage, retrieval, and state management.

Memory in agentic AI has a distinct taxonomy, which mirrors human cognition, and each memory type has different storage repositories (Figure 1).

Figure 1. The taxonomy of agentic memory

Source: Infosys

As an agent operates, its episodic memory grows to millions of entries. Searching this entire history for every query is slow and expensive, increasing token costs. The solution is to build hierarchical memory (H-MEM) with different layers. For example, in Figure 1, the case of a server failure would require accessing different levels of information: in layer 1, broad classifications are created, for example, “IT support”. In layer 2, specific topics are stored, such as “server outage”. In layer 3, the specific interaction is recorded. In this way, the agent can use an index to query the relevant layer. If the current task is about billing, for instance, the agent doesn't search IT support memories, which reduces noise and improves retrieval precision.
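
A toy sketch of this layered routing; the index structure and entries are illustrative, and production systems would back each layer with real stores.

```python
# Sketch of hierarchical memory (H-MEM) routing: the agent narrows the
# search by layer instead of scanning every episode.
memory_index = {
    "IT support": {                  # layer 1: broad classification
        "server outage": [           # layer 2: specific topic
            "2025-03-01: db-07 failed; restart plus failover resolved it",
        ],                           # layer 3: specific interactions
    },
    "billing": {
        "invoice dispute": [
            "2025-02-11: duplicate invoice INV-204 voided for account A9",
        ],
    },
}

def recall(domain: str, topic: str) -> list:
    # Only the matching branch is searched, reducing noise and token cost.
    return memory_index.get(domain, {}).get(topic, [])

print(recall("IT support", "server outage"))  # billing memories never touched
```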

Importantly, as organizations implement multiagent systems, memory must be shared. This can be done through the Blackboard pattern: a central, shared data store, often a knowledge graph, where agents write their findings and read the state of the world.
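
A minimal sketch of the pattern; in production, the store would typically be a knowledge graph or shared database rather than an in-memory object.

```python
# Sketch of the Blackboard pattern: a shared store where agents post
# findings and read the current state of the world.
class Blackboard:
    def __init__(self) -> None:
        self.state: dict = {}

    def write(self, agent: str, key: str, value: object) -> None:
        self.state[key] = {"value": value, "by": agent}

    def read(self, key: str):
        entry = self.state.get(key)
        return entry["value"] if entry else None

board = Blackboard()
board.write("metric_agent", "latency_anomaly",
            {"service": "checkout", "p99_ms": 2400})
# A root-cause agent picks up the finding without direct coupling:
print(board.read("latency_anomaly"))
```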

Pillar five: Multimodal data observability — the monitoring layer

Traditional application performance monitoring tools like Datadog and New Relic monitor infrastructure, but do not understand agent reasoning. For this, agentic AI requires capturing the agent’s internal monologue.

Observability platforms that capture this chain of thought include Galileo and Arize Phoenix, which show reasoning steps as interactive graphs, allowing engineers to see exactly where the agent went off track. For example, did the agent fail because retrieval returned bad data, or because it hallucinated in its reasoning process?

The gold standard for observability mechanisms is using agents to monitor other agents. Here, metric agents can continuously scan telemetry for anomalies, while root cause agents can step in when an anomaly is detected and find the sources of the error, before handing over to remediation agents, which automatically execute fixes, such as rolling back a bad deployment or restarting a stuck pipeline.
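
An illustrative sketch of this chain, with all detection and remediation logic stubbed out:

```python
# Sketch of agents monitoring agents: a metric agent flags anomalies,
# a root-cause agent localizes them, a remediation agent applies a fix.
def metric_agent(telemetry: list) -> list:
    return [t for t in telemetry if t["error_rate"] > 0.05]  # anomaly scan

def root_cause_agent(anomaly: dict) -> str:
    # In practice: walk reasoning traces and dependency graphs.
    return "bad deployment" if anomaly["deploy_age_min"] < 30 else "stuck pipeline"

def remediation_agent(cause: str) -> str:
    return {"bad deployment": "roll back", "stuck pipeline": "restart"}[cause]

telemetry = [  # illustrative telemetry snapshot
    {"service": "pricing", "error_rate": 0.12, "deploy_age_min": 12},
    {"service": "search",  "error_rate": 0.01, "deploy_age_min": 400},
]
for anomaly in metric_agent(telemetry):
    cause = root_cause_agent(anomaly)
    print(anomaly["service"], "->", cause, "->", remediation_agent(cause))
```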

The agentic maturity model

Enterprises should follow a “crawl, walk, run, scale” implementation process for the agentic data stack. This helps manage risk and investment as organizations transform into agentic AI-first enterprises.

  • Level 1 (crawl). At this stage, copilots support individual productivity, while humans initiate and review every step of the process. Agents should have access to document repositories like SharePoint, with querying handled through vector databases.
  • Level 2 (walk). Here, agents have bounded autonomy, executing specific tools within strict limits. Humans step in when an agent wants to act outside those limits, for example, to approve access to sensitive internal data or customer-visible changes. The data requirement here is access to structured APIs and databases, with a domain ontology for the specific task.
  • Level 3 (run). At this level, the organization starts implementing and orchestrating multiple agents across cross-functional workflows, where agents collaborate and resolve conflicts. Humans check outputs and audit outcomes at the multiagent system level, periodically stress-testing agent performance. The data requirement is also elevated: a unified data fabric, shared memory, and an AI query engine for reasoning.
  • Level 4 (scale). Once at this level, organizations have fully fledged autonomous ecosystems, with significant business transformation and self-optimization through agents. Human oversight shifts to resolving complex and rare cases, such as when outputs are toxic or misleading. These agents set subgoals, optimize their own performance, and manage resources, requiring real-time multimodal data streams and self-healing data pipelines.

The agentic AI-first enterprise

According to Deloitte, close to three-quarters of enterprises are planning to deploy agentic AI by early 2028.

But the transition to agentic AI requires strengthening the enterprise's data foundations. The limiting factor for most enterprises will not be the intelligence of the model, however important that might seem, but the richness, reliability, and real-time availability of the context provided to each model. Agentic systems only perform well when continuously supplied with governed, up-to-date, multimodal enterprise data that they can interpret and act upon safely.

This is about building a unified, cloud-based data fabric composed of domain-aligned data products, real-time activation pipelines, and governed access patterns such as MCP.

Combined with multimodal perception, grounded domain ontologies, robust memory systems, and strong governance controls, agents can then reason with truth, retrieve with precision, and act with safety.

In time, by following our agentic maturity model, you can transform your data estate from a passive archive into a foundation for a dynamic autonomous enterprise.
