Artificial Intelligence

Enabling Natural Language Query through Large Language Models for Data Analysis

Natural language query (NLQ) empowers business users to analyze data by asking questions in everyday language. Modern business intelligence (BI) tools offer NLQ functionality through proprietary natural language processing (NLP) models.

Insights

  • This paper explores using large language models (LLMs) for dynamic SQL query generation from user questions, a novel NLQ approach. It begins with a brief introduction to NLQ and the importance of LLMs in data and analytics, and gives an overview of the latest developments in text-to-SQL with LLMs.
  • It shares details of a use case that allows data consumers to extract insights directly from enterprise data rather than through a particular BI tool, including details of the tests performed and code snippets.
  • The paper discusses the effectiveness of the in-context learning approach for NLQ in large-scale enterprise data environments and outlines future plans for fine-tuning models with enterprise data context.

Overview of NLQ solutions

Data analysis with natural language expressions has become vital for accessible and user-friendly data analytics. Modern business intelligence (BI) tools, such as Power BI Q&A, Tableau Ask Data, and AWS QuickSight Q, now include NLQ features for self-service BI. Each tool may vary slightly in approach, but their features are comparable and have improved with NLP model advancements.

Typically, the NLQ features these tools offer enable users to retrieve specific subsets of data, perform complex aggregations, compare time periods, filter and sort results, and view trends.

Some BI tools also include features like autocompletion of user questions, synonym curation, and visualization type recommendations to further enhance user experience. As NLQ capabilities advance, users find it easier to extract insights and make data-driven decisions.

LLM use for data and analytics

Since ChatGPT's release by OpenAI in late 2022, generative AI and LLMs have garnered significant interest and investment for diverse use cases. This paper covers a limited array of applications and operationalization techniques, but you can find comprehensive materials on these topics from Infosys and other sources.

As organizations increasingly leverage LLMs to gain insights from their data assets, BI vendors also intensify their efforts to incorporate LLMs into their data and analytics tools. One of the key focus areas for these efforts is enabling NLQ by integrating LLMs, instead of proprietary NLP models. However, the potential use cases extend to various areas, including data sourcing, synthetic data generation, metadata enrichment, and guided analytics.

Another related use case involves generating or augmenting SQL code using LLMs for software or data engineering purposes, directly with tools like ChatGPT or with software engineering tools like GitHub Copilot. Some data platforms have already integrated Copilot for this purpose.

Text-to-SQL with LLMs

Text-to-SQL with LLMs for NLQ is an evolving research field, with academic papers and implementation-oriented materials generated continuously. A key focus is designing comprehensive NLQ solutions using various techniques and tools, with the current state-of-the-art encompassing the following approaches:

  • A custom application to streamline the NLQ use case, involving steps such as generating a prompt by combining the user query with data context provided as data definition language (DDL), calling the LLM application programming interface (API), and executing the SQL in the target database to produce a response.
  • LLM-powered application frameworks such as LangChain, which offer built-in capabilities for structured data analysis. Developers can utilize LangChain components such as SQL Database Chain and SQL Agent to streamline tasks like database connectivity, table description for prompt generation, and direct query execution in databases, reducing coding effort (see the sketch after this list).
  • LLM-specific capabilities, such as OpenAI function calling, which enable developers to create functions that convert NLQs into API calls or database queries.
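For illustration, the following is a minimal sketch of the LangChain route described above. It assumes a LangChain release that exposes SQLDatabase and SQLDatabaseChain (their module locations vary across versions), an OpenAI API key in the environment, and a reachable database connection URI; it is not the implementation used in our experiment.

    # Minimal sketch: let LangChain introspect the database, generate SQL, and execute it
    from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

    # Connection URI is a placeholder; any SQLAlchemy-compatible database works
    db = SQLDatabase.from_uri("snowflake://<user>:<password>@<account>/<database>/<schema>")
    llm = OpenAI(temperature=0)  # deterministic completions suit SQL generation

    db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
    db_chain.run("How many vendors are there from India?")  # natural language in, SQL and answer out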

Another crucial area is improving natural language to SQL conversion through fine-tuning and in-context learning techniques, which enhance NLQ system performance and accuracy and enable more precise and reliable data analysis.

Our experimentation overview

Our experiment focused on integrating LLMs to enable NLQ capabilities for data consumers in a modern data environment, independent of any specific BI tool. We tested this in a data landscape inspired by a real client’s needs. We introduced a conversational UI component and an application to convert user queries into SQL using LLM integration. This implementation follows the “Embed LLM As-is into an Application Frame” pattern from Infosys’ LLM pattern repository. Figure 1 presents a high-level architecture and describes the key components used in our experiment.

Figure 1. Solution design


Source: Infosys

The key components and their specifics are as follows:

Cloud data warehouse – Snowflake: Snowflake is a cloud-native unified data platform that offers data warehousing and data lake capabilities, along with various tools for different data processing and data science scenarios. For this paper, we used a trial version of Snowflake and the TPC-H dataset, which is publicly available from TPC and provided within Snowflake as a sample dataset.
ETL/ELT – dbt: dbt (data build tool) is an open-source command-line tool that enables data engineers and analysts to transform, test, and manage data within their data warehouses. It provides a development environment and a set of best practices for building data transformation pipelines.

The dbt schema provides relevant information, such as table names, columns, and relationships, to guide query generation within the prompts.
Web UI – Streamlit: Streamlit is an open-source Python library for creating custom web apps for machine learning and data science applications.
LLM – OpenAI GPT-3.5 (text-davinci-003, gpt-3.5-turbo): The GPT-3.5 models are descendants of OpenAI Codex, which was the original model for code completion tasks; gpt-3.5-turbo is now a powerful model for code completion.
Data application: This application serves as the principal component and performs the following steps:
  • Incorporates user questions from conversational UI into the prompt.
  • Calls LLM APIs to generate SQL.
  • Parses LLM response and connects to Snowflake to execute SQL.
  • Retrieves the result set and processes it for user presentation.
LLM library – LangChain: LangChain is a library offering various tools for LLM application development. The SQL database agent is one of the tools used in text-to-SQL scenarios; in this paper, we used prompt templates.

Code overview

Step 1: Import the required libraries:

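A minimal sketch of the imports this setup would need, based on the components described above; the exact imports may differ.

    import yaml                           # read the dbt schema.yml
    import streamlit as st                # conversational web UI
    import snowflake.connector            # execute the generated SQL in Snowflake
    import matplotlib.pyplot as plt       # chart the result set
    from langchain.llms import OpenAI     # LangChain wrapper for the OpenAI completion API
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain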

Step 2: Initialize the OpenAI model using LangChain LLMs:

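A sketch of this step, assuming the OpenAI API key is available in the OPENAI_API_KEY environment variable:

    # Instantiate the completion model through LangChain; temperature 0 keeps the SQL deterministic
    llm = OpenAI(model_name="text-davinci-003", temperature=0)

    # For gpt-3.5-turbo, the chat wrapper is used instead:
    # from langchain.chat_models import ChatOpenAI
    # llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)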

Step 3: Read the dbt schema and create a prompt by passing the user question:

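A sketch of this step; the file path and prompt wording are illustrative, and yaml and PromptTemplate come from the Step 1 imports.

    # Load table and column descriptions from the dbt schema.yml
    with open("models/schema.yml") as f:
        schema = yaml.safe_load(f)
    schema_text = yaml.dump(schema["models"])  # table names, columns, relationships

    # Build a prompt that combines the schema context with the user question
    template = (
        "You are an expert in writing Snowflake SQL.\n"
        "Use only the tables and columns in this dbt schema:\n{schema}\n"
        "Write a single SQL query that answers the question below. Return only the SQL.\n"
        "Question: {question}"
    )
    prompt = PromptTemplate(input_variables=["schema", "question"], template=template)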

Step 4: Get a response from the OpenAI model by executing through LangChain:

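A sketch of this step, building on the earlier steps; the Snowflake connection parameters are placeholders.

    user_question = st.text_input("Ask a question about your data")

    # Generate SQL through LangChain
    chain = LLMChain(llm=llm, prompt=prompt)
    sql = chain.run(schema=schema_text, question=user_question).strip()

    # Execute the generated SQL in Snowflake and fetch the result set
    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="<warehouse>", database="SNOWFLAKE_SAMPLE_DATA", schema="TPCH_SF1",
    )
    rows = conn.cursor().execute(sql).fetchall()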

For real-time use cases, a LangChain vector store can hold the dbt schema and enable look-ups that retrieve the relevant schema details based on the user question, facilitating prompt creation.
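A minimal sketch of this look-up, building on the steps above and assuming OpenAI embeddings and the FAISS vector store available in LangChain; indexing one document per table is an illustrative choice.

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    # Index one document per dbt model (table) so only relevant tables enter the prompt
    table_docs = [yaml.dump(model) for model in schema["models"]]
    store = FAISS.from_texts(table_docs, OpenAIEmbeddings())

    # At question time, retrieve the schema snippets most similar to the user question
    relevant = store.similarity_search(user_question, k=3)
    schema_text = "\n".join(doc.page_content for doc in relevant)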

The dbt schema.yml appears as follows:

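The excerpt below is an illustrative reconstruction based on the TPC-H tables used in our tests, not the exact file:

    version: 2
    models:
      - name: lineitem
        description: Order line items
        columns:
          - name: L_PARTKEY
            description: Foreign key to part.P_PARTKEY
          - name: L_SHIPMODE
            description: Ship mode for the line item
      - name: part
        description: Parts catalog
        columns:
          - name: P_PARTKEY
            description: Primary key of the part
          - name: P_TYPE
            description: Product type (e.g., PROMO ANODIZED NICKEL)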

To display data graphically, the SQL query output from the enterprise warehouse should include at least two columns. We use Python's Matplotlib for the visual representation.
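A sketch of the charting step, assuming the first column of the result set holds labels and the second holds numeric values; plt, st, and rows come from the earlier steps.

    # Render a simple bar chart from a two-column result set inside the Streamlit app
    labels = [row[0] for row in rows]   # e.g., market segment
    values = [row[1] for row in rows]   # e.g., number of open orders

    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_xlabel("Category")
    ax.set_ylabel("Value")
    st.pyplot(fig)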

NLQ examples:

Each example below lists the user question, the model(s) used, the LLM response, and our comments.

User question: "The maximum ship mode for product type PROMO ANODIZED NICKEL"

text-davinci-003 response:
SELECT P_TYPE, L_SHIPMODE, MAX(L_SHIPMODE) FROM lineitem INNER JOIN part ON lineitem.L_PARTKEY = part.P_PARTKEY WHERE P_TYPE = 'PROMO ANODIZED NICKEL' GROUP BY P_TYPE, L_SHIPMODE;

gpt-3.5-turbo response:
SELECT MAX(L_SHIPMODE) AS Maximum_Ship_Mode FROM lineitem JOIN part ON lineitem.L_PARTKEY = part.P_PARTKEY WHERE part.P_TYPE = 'PROMO ANODIZED NICKEL';

Comments: The Davinci model's response is not correct. However, when we rephrased the question (see the next example), it returned the correct SQL code.
User question: "What is the maximum ship mode for product type PROMO ANODIZED NICKEL"

Both models:
SELECT MAX(L_SHIPMODE) AS Maximum_Ship_Mode FROM lineitem JOIN part ON lineitem.L_PARTKEY = part.P_PARTKEY WHERE part.P_TYPE = 'PROMO ANODIZED NICKEL';
User question: "List the number of open orders by market segment"

text-davinci-003 response:
SELECT C_MKTSEGMENT, COUNT(*) AS NUM_OPEN_ORDERS FROM customer c INNER JOIN orders o ON c.C_CUSTKEY = o.O_CUSTKEY WHERE o.O_ORDERSTATUS = 'OPEN' GROUP BY C_MKTSEGMENT;

gpt-3.5-turbo response:
SELECT C_MKTSEGMENT, COUNT(*) AS num_open_orders FROM orders o JOIN customer c ON o.O_CUSTKEY = c.C_CUSTKEY WHERE O_ORDERSTATUS = 'O' GROUP BY C_MKTSEGMENT;

Comments: The Davinci model used the status code 'OPEN', whereas gpt-3.5-turbo used the correct code 'O'. This may be because gpt-3.5-turbo has knowledge of the dataset, as TPC documents its benchmark datasets on its website. In the next question, we provided the correct code as part of the context, and Davinci returned the correct response.
User question: "List the number of open orders by market segment. The code for open order status is 'O'"

text-davinci-003 response:
SELECT C_MKTSEGMENT, COUNT(*) AS NUM_OPEN_ORDERS FROM customer c INNER JOIN orders o ON c.C_CUSTKEY = o.O_CUSTKEY WHERE o.O_ORDERSTATUS = 'O' GROUP BY C_MKTSEGMENT;
User question: "How many vendors are there from India?"

text-davinci-003 response:
SELECT COUNT(DISTINCT S_SUPPKEY) FROM TPCH_SF1.SUPPLIER S INNER JOIN TPCH_SF1.NATION N ON S.S_NATIONKEY = N.N_NATIONKEY WHERE N.N_NAME = 'INDIA';

gpt-3.5-turbo response:
SELECT COUNT(*) AS total_vendors FROM supplier WHERE S_NATIONKEY IN (SELECT N_NATIONKEY FROM nation WHERE N_NAME = 'INDIA');

Comments: The schema we provided used the term "supplier" rather than "vendor" for table and column names. Both models used the correct table and column names. We also observed style differences between the generated queries.

User question: "What is country of the supplier for the SMALL ANODIZED TIN product type?"

text-davinci-003 response:
SELECT N.N_NAME FROM nation N INNER JOIN supplier S ON N.N_NATIONKEY = S.S_NATIONKEY INNER JOIN partsupp PS ON S.S_SUPPKEY = PS.PS_SUPPKEY INNER JOIN part P ON PS.PS_PARTKEY = P.P_PARTKEY WHERE P.P_TYPE = 'SMALL ANODIZED TIN'

gpt-3.5-turbo response:
SELECT n.N_NAME AS Country FROM supplier s JOIN nation n ON s.S_NATIONKEY = n.N_NATIONKEY JOIN partsupp ps ON s.S_SUPPKEY = ps.PS_SUPPKEY JOIN part p ON ps.PS_PARTKEY = p.P_PARTKEY WHERE p.P_TYPE = 'SMALL ANODIZED TIN'

Comments: Twenty-five countries should be returned for this question, but these queries return more than five thousand rows, repeating countries multiple times. We had to add "Ensure each country is listed only once" to the question to get the correct list.
User question: "List the supplier region, supplier country, name, ship mode, and the product cost for each supplier of the product 180002."

gpt-3.5-turbo response:
SELECT s.S_NAME AS supplier_name, n.R_NAME AS supplier_region, n.N_NAME AS supplier_country, l.L_SHIPMODE AS ship_mode, ps.PS_SUPPLYCOST AS product_cost FROM supplier s JOIN nation n ON s.S_NATIONKEY = n.N_NATIONKEY JOIN partsupp ps ON s.S_SUPPKEY = ps.PS_SUPPKEY JOIN part p ON ps.PS_PARTKEY = p.P_PARTKEY JOIN lineitem l ON p.P_PARTKEY = l.L_PARTKEY WHERE p.P_PARTKEY = 180002

Comments: This is an example of a hallucination. There is no R_NAME region column in the supplier or nation tables; the model made up the column reference because of the way we asked the question. When we rephrased the question as in the next example, we got the correct result.
User question: "List the region, country, name, ship mode, and product cost for each supplier of the product 180002"

gpt-3.5-turbo response:
SELECT r.R_NAME, n.N_NAME, s.S_NAME, l.L_SHIPMODE, ps.PS_SUPPLYCOST FROM region r JOIN nation n ON r.R_REGIONKEY = n.N_REGIONKEY JOIN supplier s ON n.N_NATIONKEY = s.S_NATIONKEY JOIN partsupp ps ON s.S_SUPPKEY = ps.PS_SUPPKEY JOIN part p ON ps.PS_PARTKEY = p.P_PARTKEY JOIN lineitem l ON p.P_PARTKEY = l.L_PARTKEY WHERE p.P_PARTKEY = 180002

Conclusion and future work

Our experimental application, which utilizes LLMs to dynamically generate SQL code from user questions, performs satisfactorily as a conversational analytical agent across various evaluated scenarios. These findings affirm the potential of integrating generative AI models with enterprise data platforms, enabling business users to conduct data analysis independently, without SQL expertise or dependence on technology teams. Nonetheless, we've also observed instances where a solution relying solely on prompting general-purpose LLMs may underperform, especially within complex enterprise data platforms.

Large enterprises typically have diverse data environments, comprising various analytical data stores such as data lakes and data warehouses. These stores run on cloud or on-premises infrastructure and use modern or legacy technologies. They might integrate with modern data catalogs or rely on technical data dictionaries specific to each data store. Given this complexity, users from different departments or lines of business (LOBs) often use specific business vocabulary and expect data analysis to adhere to specific business rules.

In this diverse environment, an NLQ solution must offer more than SQL statement generation based on a schema. It should identify the right data store, interpret questions using the relevant data catalogs and terminologies, and construct queries that implement the applicable business rules in the right SQL dialect for each data store. Relying solely on prompts for these tasks is neither scalable nor efficient. Instead, a more effective approach is to fine-tune a private model with knowledge of the specific data environment.

The effectiveness of fine-tuning versus in-context learning for adapting LLMs to specialized tasks is an active area of research. In a recent study, Mosbach et al. compared both approaches, outlining their advantages and disadvantages. Additionally, in the context of text-to-SQL tasks, Sun et al. discussed a specialized text-to-SQL model and compared fine-tuning with prompting for performance enhancement.

Future work for real-world NLQ scenarios should involve creating a specialized text-to-SQL model by fine-tuning general-purpose LLMs for a specific enterprise data environment. Fine-tuned models and specialized techniques for complex data environments can yield more efficient NLQ systems that enable seamless and effective data analysis.

References

Authors

Anil Kumar Nagabhiaru

Senior Technologist

Gokbora Uran

Senior Principal - Enterprise Applications