A new approach to explainable AI


  • Generative AI may generate hallucinations and meaningless outputs, curtailing business benefits. Explanation of outputs is crucial to establish trust in LLMs.
  • Most explainability methods predate generative AI and struggle with extensive summarization tasks and use cases that require speed and scalability.
  • However, LLMs can evaluate other LLMs effectively, provided the evaluator LLM is instruction-tuned.
  • LLM performance varies; the crucial factor is employing metrics aligned with the inherent design of LLM architecture.
  • Coupling LLMs with a responsibly designed framework enhances explainability, eases concerns, and ensures technology translates into business value beyond mere hype.

Generative AI is already generating business value for organizations across the world, according to Generative AI Radar research from the Infosys Knowledge Institute. Banks are using it to enhance wealth managers’ experience through better semantic search and summarization of internal documents, as are some of our pharmaceutical clients, where few-shot learning with GPT 3.5 makes summary reports available for public consumption. Some tax experts are using generative AI to monitor internal social media, identify tax-related posts, classify them as questions or feedback, and then route them to their respective owners. Businesses’ appetite for identifying use cases will not diminish in 2024, with $5.6 billion in projected business spend in the North American region alone.

However, this transformative new technology can also generate hallucinations and meaningless outputs, which will curtail business benefit if not managed effectively.

Thought must be given to explaining which outputs can be trusted, to what level of certainty, and when bias or other harmful content might be perpetuated in the answers delivered by the underlying large language models (LLMs).

One technique to improve the reliability of outputs is, perhaps counterintuitively, to use LLMs to evaluate the output of other LLMs.

As we will discuss, this emerging area of research combines the benefits of automation, scale, and speed into one solution for the problem of generative AI explainability.

Current approaches to explainability

Some approaches to tackling unreliable outputs include:

  • Measuring “perplexity”, or how well a model predicts a sample of text.
  • Human evaluation, where outputs are rated on different criteria including relevance, fluency, coherence, and overall quality.
  • BLEU, which measures similarity between a model’s output and one or more reference translations.
  • ROUGE, which is used mainly for summaries and calculates precision, recall, and F1 score from the overlap between the generated text and a reference text.
  • Diversity, which assesses the variety and uniqueness of the generated responses by measuring semantic similarity.
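To make two of these metrics concrete, here is a minimal Python sketch of perplexity and a simplified ROUGE-1. It assumes whitespace tokenization and per-token probabilities supplied by the caller; the function names `perplexity` and `rouge1` are illustrative, and production ROUGE implementations add stemming and longer n-grams that are omitted here.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp(average negative log-probability per token).

    Assumption: the caller supplies the probability the model assigned
    to each token of the sample text.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def rouge1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Simplified ROUGE-1: unigram-overlap precision, recall, and F1.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For the candidate "the cat sat" scored against the reference "the cat sat on the mat", this sketch gives a precision of 1.0 and a recall of 0.5: every candidate word appears in the reference, but only half of the reference is covered.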

Many of these methods for explaining how a model arrived at an answer date from the pre-LLM era and rely on natural language processing (NLP) techniques such as attention mechanisms and human-in-the-loop review.

But for LLMs, we need methods and metrics of explainability that are fast to implement and iterate on, while also being sensitive to the meaning contained in extracted information.

Though they can be helpful, simple similarity scores and other NLP methods such as BLEU don’t capture the spirit or meaning of generated content. Other popular metrics such as “language mismatch” or “publicity”, which look at grammar and word presence in source and output, might return a poor score when, in fact, the meaning of the two strings of words is very similar.
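A small Python sketch illustrates the limitation. It uses a plain word-overlap (Jaccard) score as a stand-in for surface-level similarity metrics; the function name `word_overlap` and the example sentences are our own illustrative assumptions, not taken from any of the metrics named above.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets: shared words / all words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Illustrative sentences (assumed for this sketch).
reference = "profits rose sharply"
paraphrase = "earnings increased steeply"  # same meaning, no shared words
contradiction = "profits fell sharply"     # opposite meaning, two shared words

paraphrase_score = word_overlap(reference, paraphrase)
contradiction_score = word_overlap(reference, contradiction)
```

Here the paraphrase scores 0.0 and the contradiction scores 0.5, so a purely word-based score ranks the sentence with the opposite meaning above the one with the same meaning.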

Further, one problem with many current LLM explainability techniques is that they add only a single source attribution, or provenance of information. Even when multiple sources are used (Figure 1), source attribution is never fully sufficient. This is especially true when we try to summarize information across several data sources and types.

Figure 1. Most explainability techniques just use source attribution

Source: Infosys

For instance, source attribution might be adequate for question/answer couplings where proof of provenance is straightforward (such as content found in a single page), but becomes inadequate when, say, a financial services advisor is asked to summarize a 100-page document, and it is almost impossible to determine what information (or depth) the LLM is using to create the summary. Of course, in some instances, a human subject matter expert can validate the summary (known as “ground truth evaluation”), but this approach just isn’t scalable (Figure 2).

Figure 2. LLM evaluation balances semantic meaning and scalability; ground truth evaluation doesn’t

Source: Large Language Model Evaluation in 2023: 5 Methods (aimultiple.com); LLM-Guided Evaluation: Using LLMs to Evaluate LLMs (arthur.ai); [Webinar] LLMs for Evaluating LLMs (YouTube)

LLMs to evaluate other LLMs

The best way to evaluate LLMs with speed, sensitivity, and a good balance between scalability and meaning is to use other LLMs (see again, Figure 2), with metrics that uncover how well the candidate LLM performs on novel, meaning-based tasks such as summarization or question answering. This provides a scalable, automated solution. Stanford research agrees that this is the best approach in the LLM era. One LLM does the problem solving, and another evaluates its output against specific metrics, producing a score and supporting evidence that show why it came to that conclusion (Figure 3).

Figure 3. LLMs evaluating other LLMs

Source: Infosys

Though it might seem counterintuitive that LLMs evaluate other LLMs (given the way they “hallucinate”, or make things up), the results are telling. As long as the evaluator is given a good prompt template, is instruction-tuned, and is provided with examples of good question-answer couplings, LLMs’ ability to parse the relative meaning of pieces of language makes them good evaluators, as tests conducted by Arthur.ai and Infosys show.
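The candidate/evaluator arrangement can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the setup used in the tests mentioned above: the prompt wording, the names `build_evaluator_prompt` and `evaluate_with_llm`, and the idea of passing each model in as a plain function from prompt to text are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Assumption: each model is wrapped as a function from prompt text to reply text.
LLM = Callable[[str], str]

# Hypothetical prompt template; real wording would need tuning per use case.
PROMPT_TEMPLATE = (
    "You are an evaluator. Rate how well the ANSWER addresses the QUESTION "
    "using only the CONTEXT. Reply with a score from 0 to 10 and your reasons.\n\n"
    "{examples}"
    "QUESTION: {question}\n"
    "CONTEXT: {context}\n"
    "ANSWER: {answer}\n"
)


def build_evaluator_prompt(
    question: str,
    context: str,
    answer: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Fill the template, prepending examples of good question-answer couplings."""
    shots = "".join(
        f"Example question: {q}\nExample good answer: {a}\n\n" for q, a in examples
    )
    return PROMPT_TEMPLATE.format(
        examples=shots, question=question, context=context, answer=answer
    )


def evaluate_with_llm(
    candidate: LLM,
    evaluator: LLM,
    question: str,
    context: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> Tuple[str, str]:
    """One LLM answers; a second LLM judges that answer."""
    answer = candidate(f"Context: {context}\nQuestion: {question}")
    verdict = evaluator(build_evaluator_prompt(question, context, answer, examples))
    return answer, verdict
```

Keeping the two models behind the same function signature means the same or different LLMs can be plugged in as candidate and evaluator without changing the evaluation code.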

The importance of the metric triad

It is important that we define metrics for explainability that work well with LLMs. There are three that are of particular importance, originally used within TruLens LLM quality and effectiveness measurement software:

  • Context relevance – semantic closeness of the query to the retrieved content.
  • Groundedness – how aligned the answers are to the content and context provided.
  • Answer relevance – is the answer relevant or not?

There is a deeper reason for using this triad.

Most LLM tasks that are relatively cheap and valuable in industry are built on retrieval augmented generation (RAG) architectures. RAG, which brings industry domain or context specificity into tasks such as summarization or Q&A, breaks such tasks up into three key touchpoints: query, context, and response (Figure 4).

Figure 4. The three key touchpoints in RAG systems

Source: Infosys

As Figure 4 shows, a query is passed to the LLM, context is retrieved from a vector database that has already indexed the supporting documentation for the query (say, a financial document for a wealth management query), and a response is generated.
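The flow of query, context, and response can be sketched as a toy pipeline in Python. To keep the example self-contained, a bag-of-words cosine similarity stands in for the vector database and the LLM is passed in as a plain function; both are simplifying assumptions, and the names `embed`, `retrieve`, and `rag_answer` are illustrative.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a learned vector embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def rag_answer(query: str, documents: List[str], llm: Callable[[str], str]) -> Dict:
    """Run the three touchpoints and keep all of them for later evaluation."""
    context = retrieve(query, documents, k=1)
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    response = llm(prompt)
    return {"query": query, "context": context, "response": response}
```

Returning the query, the retrieved context, and the response together matters for what follows: an evaluator needs all three touchpoints, not just the final answer.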

Therefore, context relevance, groundedness, and answer relevance provide a triad of explainability measures that when used together provide a way to measure how meaningful the answer is to any given LLM input in the RAG paradigm. This is aided by a feedback function or scoring system that examines the input, output, and intermediate results, as shown in Figure 5.
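One way to see how the triad maps onto the three RAG touchpoints is the Python sketch below. Each metric compares a different pair of touchpoints. The scorer is deliberately pluggable: in practice it would be an evaluator LLM, while the `toy_scorer` here is only a word-overlap placeholder so the example runs on its own. The names `rag_triad` and `toy_scorer` are assumptions for illustration.

```python
from typing import Callable, Dict

# Assumption: a scorer takes two texts and returns a score between 0 and 1.
Scorer = Callable[[str, str], float]


def toy_scorer(a: str, b: str) -> float:
    """Placeholder scorer: Jaccard word overlap. An evaluator LLM would go here."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rag_triad(query: str, context: str, response: str, scorer: Scorer) -> Dict[str, float]:
    """Score each pair of RAG touchpoints with the supplied scorer."""
    return {
        "context_relevance": scorer(query, context),    # query vs. retrieved content
        "groundedness": scorer(context, response),      # response vs. provided context
        "answer_relevance": scorer(query, response),    # response vs. original query
    }


scores = rag_triad(
    query="annual fee",
    context="the annual fee is 1 percent",
    response="the fee is 1 percent",
    scorer=toy_scorer,
)
```

A low score on any one leg points to a different failure: poor retrieval, an answer that strays from the context, or an answer that does not address the question.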

Figure 5. Explainability evaluation using the triad

Source: Infosys

The feedback function, also supported by libraries such as TruLens, should include the following attributes:

  • Score: LLM provides a score [0-10] based on the given criteria
  • Criteria: LLM provides the criteria for the specific evaluation
  • Supporting evidence: LLM provides reasons for scoring based on the listed criteria, step-by-step. This is tied back to the evaluation being completed

The scoring system can be derived using chain-of-thought (CoT) prompting of the evaluator LLM. CoT is a technique where LLMs explain their “reasoning” process, and it has been shown to significantly increase performance on tasks that require arithmetic, commonsense, and symbolic reasoning capabilities.
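A feedback function with these three attributes might be sketched as follows in Python. This is not the TruLens API: the `Feedback` class, the `COT_INSTRUCTIONS` text, and `parse_feedback` are hypothetical names, and the reply format is an assumption we impose on the evaluator through its prompt.

```python
import re
from dataclasses import dataclass


@dataclass
class Feedback:
    criteria: str
    supporting_evidence: str
    score: int  # 0 to 10


# Hypothetical instructions appended to the evaluator prompt so that the
# reasoning is written out before the score (chain-of-thought style).
COT_INSTRUCTIONS = (
    "Answer in exactly this format:\n"
    "Criteria: <the criteria you are applying>\n"
    "Supporting evidence: <step-by-step reasons>\n"
    "Score: <integer from 0 to 10>"
)


def parse_feedback(text: str) -> Feedback:
    """Extract the three attributes from the evaluator's reply.

    Assumption: each field sits on a single line, as requested above.
    """

    def field(name: str) -> str:
        match = re.search(
            rf"^{re.escape(name)}:\s*(.+)$", text, flags=re.MULTILINE | re.IGNORECASE
        )
        if match is None:
            raise ValueError(f"missing field: {name}")
        return match.group(1).strip()

    score = int(field("Score"))
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return Feedback(field("Criteria"), field("Supporting evidence"), score)
```

Rejecting replies that are malformed or out of range, rather than guessing, keeps unreliable evaluator output from silently entering the scores.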

Are some LLMs better than others?

But which LLM should be used as the evaluator? Should firms use the same LLM (say Falcon) as both candidate and evaluator (homogeneous) or different LLMs (heterogeneous)? Research from Arthur.ai has shown that the choice of evaluator does matter, and that (at the end of 2023) GPT 3.5 Turbo was the best candidate and evaluator across an array of summarization and Q&A tasks. This finding quashed the researchers’ original hypothesis that an LLM evaluator would be biased towards text that it had itself generated (over text that competing models had generated) (Figure 6).

Figure 6. GPT 3.5 Turbo is the best evaluator, as per Arthur.ai research, conducted in 2023

Source: Adapted from Arthur.ai

Instruction tuning is a key ingredient

However, the most important thing about the evaluator LLM is that it should be able to follow instructions. For this, the LLM doing the evaluating must be “instruction-tuned”. The Stanford research found that instruction-tuned models are better evaluators. To get these models, firms can fine-tune a model on pairs of instructions and desired outputs, enabling it to learn specific tasks such as summarization, Q&A, or even composing emails.

For example, providing the input, “Provide a list of the most spoken languages”, with the output, “English, French”, trains the model to parse and execute tasks from given instructions. Exposed to a wide range of instructions, the model gains robust generalization skills, enhancing its ability to generate accurate responses aligned with human-like instruction formats. While instruction-tuning demands a significant number of GPUs to train across a large number of model parameters, the pay-off is worth it: better explainability performance and fewer resource-intensive API calls during inference, which, as has been shown, can result in a significant number of hallucinations.
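As a rough illustration of what such training data can look like, the Python sketch below turns instruction-output pairs into JSON Lines records and renders one as a training prompt. The field names `instruction` and `output` and the "### Instruction / ### Response" layout are a common convention that we assume here; they are not a requirement of any particular fine-tuning toolkit.

```python
import json
from typing import Dict, Iterable, Tuple


def to_training_record(instruction: str, output: str) -> Dict[str, str]:
    """One instruction-tuning example (field names are an assumed convention)."""
    return {"instruction": instruction, "output": output}


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize pairs as JSON Lines, one training example per line."""
    return "\n".join(
        json.dumps(to_training_record(i, o), ensure_ascii=False) for i, o in pairs
    )


def render_prompt(record: Dict[str, str]) -> str:
    """Render a record as the text the model is trained on (assumed layout)."""
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )


# The example pair from the text above.
pairs = [("Provide a list of the most spoken languages", "English, French")]
dataset = to_jsonl(pairs)
```

A real dataset would contain many such pairs spanning different task types, which is where the generalization described above comes from.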

Automation, scale, and speed

At the Infosys Knowledge Institute, we have many documents in many internal folders. Say we prompted an LLM such as GPT 3.5 Turbo, instruction-tuned and trained on this material, to “tell me a joke”. What is the correct answer? It all depends on the context. If the LLM returns a joke, its groundedness score should be zero, as it has veered outside the specific content and has used API calls to public information to come up with an answer. On the other hand, if the output returns: “I am sorry, this is impossible. I am an expert Q&A system, whose purpose is to provide information and answers based on the provided context”, the triad of metrics would all show high scores - high context relevance, high groundedness, and a good amount of answer relevance to the question.

Of course, nothing in the LLM world is perfect - using LLMs to evaluate other LLMs can pose significant challenges. While LLM evaluation is faster and more sensitive to nuances in prompting behaviour, this very sensitivity can make LLM evaluators unpredictable; and so full automation should be considered with care.

Arthur.ai research also found that LLM evaluation is constrained by the difficulty of the task being evaluated: if the task requires too many reasoning steps or too many variables to be managed simultaneously, the LLM will struggle. The disclaimer to this is that with time and increased tool usage/API calls, the LLM will get better at its task.

What this all amounts to is that although LLMs evaluating other LLMs should be handled with care, this emerging method of explainability combines the benefits of automation, scale, and speed. Used as part of a responsible AI strategy, the triad, when coupled with LLM-LLM evaluation, is a way to increase trust as we move into a year where generative AI will have to prove its value and move beyond the hype of inflated expectations.
