This whitepaper explores an approach that leverages large language models to implement Retrieval-Augmented Generation (RAG) for legacy databases without native vector support. The system employs a Text2Query paradigm in which an LLM analyzes user queries and database metadata (schema, categorical values, range constraints, and business rules) to generate appropriate SQL or other database queries. Multiple queries may be generated to obtain comprehensive context from different database perspectives. The retrieved structured data is then processed by the LLM to perform final analysis and generate contextually relevant responses. This method bridges the semantic gap between natural language user intent and rigid database structures without requiring database architecture modifications or vector embeddings.
Modern enterprises face a significant challenge: they possess vast amounts of valuable data locked within legacy database systems that cannot be easily queried using natural language or integrated with modern AI systems. Instead of modifying the underlying database architecture, this white paper describes how large language models (LLMs) can translate natural-language questions into database queries and synthesize answers from the retrieved results.
By implementing Text2Query with LLM-Powered Self-Query Retrieval, enterprises can modernize their data interaction capabilities without the disruption and expense of wholesale system replacement.
Many organizations rely on legacy database systems (like relational or NoSQL databases) that store critical business information. However, accessing this data typically requires knowledge of specific query languages (e.g., SQL) and a deep understanding of the database structure (schema, relationships). This creates a significant barrier for non-technical users who need to ask questions and get insights from this data using natural language.
Retrieval-Augmented Generation (RAG) has emerged as a powerful technique that lets LLMs answer questions using external knowledge sources, reducing errors and providing up-to-date, domain-specific information.
However, applying traditional RAG methods to structured legacy databases is challenging: these methods typically require data to be converted into vector embeddings, which legacy databases do not natively support and which can be impractical to generate and maintain for large, complex schemas.
There is a need for a solution that allows natural language querying of these legacy systems without requiring costly database overhauls or relying on vector embeddings.
While RAG has proven effective for unstructured data (documents, articles, web content), its application to structured legacy databases presents unique challenges, chiefly the absence of native vector support and the impracticality of embedding large, complex schemas.
The Text2Query paradigm represents an alternative approach that leverages the reasoning capabilities of large language models to bridge the gap between natural language queries and database queries. Rather than transforming database contents into vectors, this approach relies on LLMs to generate appropriate database queries based on database metadata such as the schema, categorical values, range constraints, and business rules.
By focusing on query generation rather than data transformation, Text2Query offers a path to RAG implementation for legacy databases without the need for architectural overhauls or vector embeddings.
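The core of this query-generation step can be sketched as prompt construction: pair the user's question with the database metadata described above and ask the LLM for read-only SQL. The metadata structure, prompt wording, and table names below are illustrative assumptions, not a prescribed format, and the actual LLM call is left out.

```python
# Minimal sketch of Text2Query prompt construction. The metadata layout
# (schema/categorical_values/business_rules) and the sample table are
# hypothetical; a real system would pass the result to an LLM client.

def build_text2query_prompt(question: str, metadata: dict) -> str:
    """Assemble an LLM prompt pairing the user's question with
    database metadata (schema, categorical values, business rules)."""
    lines = ["You are a SQL generator for a legacy relational database.",
             "Schema:"]
    for table, columns in metadata["schema"].items():
        lines.append(f"  {table}({', '.join(columns)})")
    if metadata.get("categorical_values"):
        lines.append("Known categorical values:")
        for col, values in metadata["categorical_values"].items():
            lines.append(f"  {col}: {', '.join(values)}")
    if metadata.get("business_rules"):
        lines.append("Business rules:")
        lines.extend(f"  - {rule}" for rule in metadata["business_rules"])
    lines.append("Generate one or more read-only SQL queries answering:")
    lines.append(question)
    return "\n".join(lines)

metadata = {
    "schema": {"orders": ["order_id", "customer_id", "amount", "channel"]},
    "categorical_values": {"channel": ["web", "branch", "mobile"]},
    "business_rules": ["Amounts are stored in EUR cents."],
}
prompt = build_text2query_prompt("What was last month's web revenue?", metadata)
```

Exposing known categorical values and business rules in the prompt is what lets the model resolve ambiguous terms ("web revenue") to concrete column values without vector retrieval.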
The financial services industry, including banks, insurance companies, and investment firms, stands to benefit significantly from this approach. These organizations often operate on legacy systems with vast amounts of structured data. Enabling natural language access to this data can improve customer service, risk analysis, compliance reporting, and operational efficiency.
This approach supports enterprise scalability by handling large datasets and batch processing. It also enhances security by preserving existing database access controls and introducing LLM-specific safeguards, ensuring compliance with enterprise governance policies.
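One LLM-specific safeguard of the kind mentioned above is a validator that rejects any generated statement that is not a single read-only SELECT before it ever reaches the database. This is a minimal sketch layered on top of, not replacing, the database's own access controls; the keyword list is an assumption and deliberately conservative.

```python
import re

# Illustrative guardrail for LLM-generated SQL: accept only a single
# read-only SELECT statement. The forbidden-keyword list is an
# assumption; production systems would also enforce database-level
# permissions and row/column access controls.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|revoke|create)\b",
    re.IGNORECASE,
)

def is_safe_query(sql: str) -> bool:
    statement = sql.strip().rstrip(";")
    if ";" in statement:                     # reject multi-statement payloads
        return False
    if not statement.lower().startswith("select"):
        return False
    return not FORBIDDEN.search(statement)   # reject write/DDL keywords
```

Because the generated query is plain SQL, such checks can run in the orchestration layer before execution, preserving the enterprise governance model described above.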
A large European banking client faced challenges in enabling business analysts to query legacy financial databases. These databases required complex SQL knowledge and had rigid schemas, making it difficult for non-technical users to retrieve insights. The semantic gap between natural language and structured queries led to delays in decision-making and increased reliance on technical teams. Existing approaches such as vector databases proved risky, expensive, and inflexible, and were prone to hallucination in ambiguous or long-context conversations.
The core challenge is the semantic gap between how humans ask questions (natural language) and how legacy databases store and retrieve data (structured queries).
Figure 1: Solution idea - Process flow

While the Text2Query approach provides significant benefits for accessing legacy databases through natural language, several important limitations should be acknowledged. Consider this example of a challenging query:
"Show me the correlation between customer purchase frequency and average order value, segmented by acquisition channel, but only for customers who have made purchases in at least 3 different product categories over the past 2 years, excluding promotional purchases."
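A request like this strains a single generated statement, but the multi-query strategy described earlier can decompose it. The sketch below runs one plausible two-query decomposition against an in-memory SQLite stand-in for the legacy database; the `purchases` table, its columns, and the data are hypothetical, and the date-range and channel-segmentation filters are omitted for brevity.

```python
import sqlite3

# Toy stand-in for a legacy database. Schema and rows are invented
# purely to demonstrate the two-query decomposition.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE purchases (
    customer_id INTEGER, category TEXT, amount REAL, promotional INTEGER
);
INSERT INTO purchases VALUES
    (1, 'books', 20.0, 0),
    (1, 'games', 35.0, 0),
    (1, 'music', 15.0, 0),
    (2, 'books', 10.0, 0),
    (2, 'books', 12.0, 1);
""")

# Query 1: customers with purchases in >= 3 distinct categories,
# excluding promotional purchases.
eligible = [row[0] for row in con.execute("""
    SELECT customer_id FROM purchases
    WHERE promotional = 0
    GROUP BY customer_id
    HAVING COUNT(DISTINCT category) >= 3
""")]

# Query 2: purchase frequency and average order value, restricted to
# the eligible customers found by query 1.
placeholders = ",".join("?" * len(eligible))
stats = con.execute(f"""
    SELECT customer_id, COUNT(*) AS frequency, AVG(amount) AS avg_order
    FROM purchases
    WHERE promotional = 0 AND customer_id IN ({placeholders})
    GROUP BY customer_id
""", eligible).fetchall()
```

The LLM then receives the results of both queries as context and performs the final correlation analysis in the generation step, which is exactly the division of labor the Text2Query paradigm proposes.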
Despite these limitations, the Text2Query approach offers significant advantages: it requires no database architecture modifications or vector embeddings, preserves existing access controls, and scales to enterprise workloads.
The Text2Query approach provides a pragmatic path forward for organizations seeking to modernize data access without the disruption and expense of wholesale system replacement. By bridging the gap between natural language and legacy databases, organizations can unlock the full potential of their existing data assets while laying the groundwork for more advanced AI-driven data systems in the future.
As data continues to grow in both volume and strategic importance, the ability to access and leverage that data through natural interfaces will become an increasingly critical competitive advantage.
To keep yourself updated on the latest technology and industry trends, subscribe to the Infosys Knowledge Institute's publications.