Expect the Unexpected: FailSafe Long Context QA for Finance
We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA
Discussion
Host: Hello everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We're diving into the fascinating world of Large Language Models, or LLMs, specifically focusing on their reliability and safety when used in the finance sector. It's a hot topic, right? Everyone's talking about AI in finance, but how do we actually know these systems are trustworthy?
Guest: Exactly, Leo! It's not just about throwing LLMs at financial data and hoping for the best. We need rigorous ways to test their robustness, especially when users interact with them in different ways or when the data itself isn't perfect. Think typos in queries, incomplete searches, or even messy scanned documents. The stakes are high in finance, so we can't afford to have AI hallucinating or just plain getting things wrong.
Host: Absolutely. And that brings us to the core of today's discussion: a really interesting piece of research titled 'Expect the Unexpected: FailSafe Long Context QA for Finance.' It introduces a new benchmark called FailSafeQA. It's designed to evaluate how well LLMs perform in financial question-answering scenarios when things don't go according to plan. We're talking about real-world messiness, not just ideal conditions. The paper is by Kiran Kamble, Melisa Russak, and a bunch of other talented folks at Writer, Inc. Seems like they're really tackling a critical gap in how we assess these models. It's all about testing them in realistic situations where user inputs might be flawed or the data quality isn't pristine. Can you tell us more about this FailSafeQA and its purpose?
Guest: Sure, so FailSafeQA focuses on two main types of failures: 'Query Failure' and 'Context Failure'. Query Failure looks at what happens when users make mistakes in their questions. This includes things like spelling errors, using incomplete sentences (like you might type into a search engine), or phrasing questions in language from outside the specific financial domain. The idea is to see if the LLM can still understand the user's intent and provide the right answer, even if the question isn't perfectly worded. Context Failure, on the other hand, focuses on problems with the information the LLM is using to answer the question. This could mean the document is missing entirely, has been degraded by OCR (Optical Character Recognition) errors, or is completely irrelevant to the query. The goal here is to see if the LLM can recognize when it doesn't have the right information and avoid making up an answer, which is the well-known problem of 'hallucination'. This is particularly important when dealing with long financial documents, like 10-K annual reports, because research shows LLMs can struggle with detail and accuracy when processing very long texts.
Host: Okay, that makes a lot of sense. So, it's not just about whether the LLM can answer the question in a perfect scenario, but whether it should answer the question given realistic constraints. The Query Failure scenarios really highlight the human element – we all make typos, and not everyone is a finance expert. And the Context Failure aspect is crucial because, in the real world, data isn't always clean and perfect. I’ve been reading a lot about this recently, particularly how models are extremely sensitive to prompt formatting and small alterations. I’m glad to see they’re taking this into consideration. The paper mentions they use these long 10-K annual reports, which, as you said, can be challenging for LLMs. How did they handle the context length issues when designing this benchmark?
Guest: That's a great question. Handling the long context of these 10-K filings was a key challenge. The researchers truncated the filings to keep complete paragraphs within a 25,000 token limit. They also implemented a really smart judging process. To avoid overwhelming the 'judge' LLM with the full 10-K, they based the evaluation on ground truth answers and supporting citations. So, instead of making the judge read the entire long document to verify the answer, they provided a short, relevant citation from the document that contains the answer. This greatly reduces the context length required during the judging phase, which leads to quicker and more precise evaluations of accuracy and comprehensiveness. It's basically saying, 'Here's the claim, and here's where the source document supports that claim. Does the LLM's answer align with this?' It makes the judging process much more manageable and reliable.
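To make the truncation step concrete, here is a minimal sketch of keeping only complete paragraphs within a token budget. The use of tiktoken and the blank-line paragraph heuristic are illustrative assumptions; the paper states only the 25,000-token limit, not the tooling.

```python
# Minimal sketch: keep whole paragraphs of a filing within a token budget.
# tiktoken and the blank-line paragraph heuristic are assumptions for illustration.
import tiktoken

def truncate_to_paragraphs(text: str, max_tokens: int = 25_000) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    kept, used = [], 0
    for paragraph in text.split("\n\n"):      # treat blank lines as paragraph breaks
        n = len(enc.encode(paragraph))
        if used + n > max_tokens:             # stop before breaking a paragraph
            break
        kept.append(paragraph)
        used += n
    return "\n\n".join(kept)
```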
Host: That’s clever. So, it streamlines the judging process without sacrificing accuracy. Now, let's talk about the dataset itself. How did they actually create FailSafeQA? I mean, generating realistic but flawed queries and contexts sounds like a pretty complex undertaking.
Guest: Absolutely. The dataset generation pipeline was semi-automated and consisted of three phases: query generation, query perturbation, and context perturbation. For the initial query generation, they used the Meta Llama 3.1 405B model to generate multi-turn question and answer pairs based on the truncated 10-K filings. Then, they filtered these pairs to identify the best standalone query from each interaction. They also rewrote the queries to standardize them, removing things like polite expressions that have been shown to affect results. Finally, they extracted and sanitized supporting citations from the full context for each query-answer pair, and only retained those data points for which the citations adequately supported the query response. Think of it as creating a solid base of clean, reliable question-answer pairs to start with.
Host: Okay, so they started with a strong foundation of 'perfect' QA pairs. That makes sense. Then comes the fun part – adding the 'failure' scenarios. Tell me more about how they perturbed the queries to simulate real-world user errors.
Guest: This is where it gets really interesting. For query perturbation, they again used the Meta Llama 3.1 405B model to generate three types of perturbations: misspelled queries, incomplete queries, and out-of-domain queries. For misspellings, they used a rule-based approach to introduce controlled spelling errors into the financial queries. They generated four types of spelling errors: split errors (like 'news paper' instead of 'newspaper'), segment errors (incorrect splitting or merging of words), real-word errors (substituting words with similar-looking ones), and common typos sourced from Wikipedia's list of common misspellings. It's a systematic way of introducing realistic typos that we all make.
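As a rough illustration of what such rule-based perturbations can look like, here is a small sketch covering split errors, real-word swaps, and common typos; the word lists and the 15% perturbation rate are invented for demonstration and are not the authors' actual rules or the Wikipedia-derived typo list.

```python
# Illustrative rule-based spelling perturbation; lists and rates are invented
# for demonstration, not the rules or typo list used in the paper.
import random

COMMON_TYPOS = {"receive": "recieve", "liquidity": "liqudity", "management": "managment"}
REAL_WORD_SWAPS = {"principal": "principle", "accept": "except", "affect": "effect"}

def perturb_spelling(query: str, p: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for word in query.split():
        lower = word.lower()
        if rng.random() > p:
            out.append(word)                           # leave most words untouched
        elif lower in COMMON_TYPOS:
            out.append(COMMON_TYPOS[lower])            # common typo
        elif lower in REAL_WORD_SWAPS:
            out.append(REAL_WORD_SWAPS[lower])         # real-word error
        elif len(word) > 6:
            cut = rng.randint(2, len(word) - 2)
            out.append(word[:cut] + " " + word[cut:])  # split error, e.g. "news paper"
        else:
            out.append(word)
    return " ".join(out)
```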
Host: That's a pretty comprehensive approach to simulating spelling mistakes. What about incomplete queries? How did they tackle that?
Guest: For incomplete queries, the focus was on mimicking the key-term-based queries that are typical of search engines. They transformed the original queries by omitting or rearranging words, as if someone were just typing in keywords. For example, 'What are the details of the capital conservation buffer mentioned in the 10-K filings?' becomes 'Details on the capital conservation buffer mentioned?' They used the Llama model to generate these incomplete queries and then manually chose the most effective transformations to ensure they were realistic.
Host: So, emulating how people actually search for information, not necessarily how they'd phrase a formal question. That's a great touch. And finally, what about those out-of-domain queries? How did they simulate a lack of domain expertise?
Guest: The out-of-domain queries were designed to mimic the varying levels of expertise that users bring to a QA system. The idea is that whether a query is created by a finance expert or someone with no financial background, it should still lead to the same answer if the query is clear. The specific wording shouldn't impact the LLM's performance. For example, 'What is the primary reason for the revenue increase in 2017?' should be equivalent to 'Why did the company make more money in 2017?' So, they rephrased queries to exclude in-domain terminology, using more general language.
Host: Right, testing whether the LLM can understand the underlying intent regardless of the specific jargon used. That makes perfect sense. Now, let's switch gears and talk about the Context Perturbations. Missing contexts, OCR errors, irrelevant documents... sounds like a recipe for disaster! How did they create these scenarios?
Guest: Okay, so for context perturbations, they focused on transforming the 10-K filings themselves. The simplest one is the missing context scenario, where they simply omitted the context from the final prompt while maintaining the original prompt structure. The expected LLM response here is to refuse to answer and notify the user that the context is unavailable, as might happen if a file upload failed.
Host: Makes total sense. It's testing whether the LLM knows when it's flying blind, basically. What about simulating those pesky OCR errors? I imagine that's more complex.
Guest: Yes, simulating OCR errors was a clever move. They used a tool called 'ocr_errors_simulator' which manipulates characters through deletions, replacements, and insertions based on probabilities. They capped the character error probability at 10%, which they found to be a good balance between preserving readability and mimicking realistic error occurrences. This simulates the process where a clean digital document is printed, signed, scanned, and then OCRed back into digital form, which introduces various inaccuracies.
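A character-level noise model along those lines might look like the sketch below, with the total error probability capped at 10%; the substitution alphabet and function name are assumptions, not the ocr_errors_simulator tool itself.

```python
# Character-level OCR noise sketch: deletions, replacements, and insertions,
# capped at a 10% per-character error probability. The substitution alphabet is
# an assumption; this is not the ocr_errors_simulator tool used in the paper.
import random
import string

def simulate_ocr_errors(text: str, error_prob: float = 0.10, seed: int = 42) -> str:
    rng = random.Random(seed)
    noisy = []
    for ch in text:
        if rng.random() >= error_prob:        # most characters pass through unchanged
            noisy.append(ch)
            continue
        op = rng.choice(["delete", "replace", "insert"])
        if op == "delete":
            continue
        if op == "replace":
            noisy.append(rng.choice(string.ascii_letters + string.digits))
        else:                                  # insert: keep the character, add a stray one
            noisy.append(ch)
            noisy.append(rng.choice(string.ascii_letters))
    return "".join(noisy)
```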
Host: Ah, mirroring that real-world process of wet signatures and subsequent digitization. That's a very practical consideration. What about the irrelevant context? I’d assume that's just randomly pairing queries and documents?
Guest: That's essentially it. They randomly paired queries with irrelevant contexts but then manually verified that the pairs were indeed irrelevant to each other. The ideal LLM here should acknowledge when the context is insufficient to answer the query, avoid making up responses or using general knowledge, and inform the user of the mismatch while suggesting the need for relevant documentation.
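For completeness, a random pairing that guarantees no query keeps its own document could be sketched as below; as noted, the authors still verified irrelevance manually, which code alone cannot replace.

```python
# Sketch: assign each query a context from a different example (no fixed points).
# Assumes at least two examples; manual verification of irrelevance still applies.
import random

def pair_irrelevant_contexts(num_examples: int, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    while True:
        perm = list(range(num_examples))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(num_examples)):
            return perm    # query i is paired with the context of example perm[i]
```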
Host: So, again, the focus is on responsible AI behavior – knowing its limitations and communicating them clearly. This is all sounding incredibly thorough. I’m curious about the composition of the dataset itself. Can you give us a quick statistical breakdown of the FailSafeQA dataset?
Guest: Certainly! The final dataset consists of 220 examples, each originally containing between 4,100 and 27,000 tokens. A large proportion (93.64%) of examples feature a long context window exceeding 16,000 tokens. Each data point includes a context paired with five questions (the original query, three perturbed variants, and an irrelevant query), an OCRed context, the ground truth answer, and supporting citations from the full context. They also analyzed the root verb and direct object of the normalized query sentence for each data point as a proxy for the variety of instructions in the dataset. They aimed for an 80/20 split between question answering (QA) and text generation (TG) tasks, and after filtering and postprocessing the final distribution came out to 83.0% QA and 17.0% TG, which aligns well with their data generation prompt specifications.
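The root-verb and direct-object analysis can be approximated with an off-the-shelf dependency parser; spaCy here is an assumed choice of tooling, not something the paper specifies.

```python
# Sketch of the root-verb / direct-object proxy for instruction variety.
# spaCy and the small English model are assumptions, not the paper's tooling.
import spacy

nlp = spacy.load("en_core_web_sm")

def root_verb_and_dobj(query: str):
    doc = nlp(query)
    root = next((tok for tok in doc if tok.dep_ == "ROOT"), None)
    dobj = next((tok for tok in doc if tok.dep_ == "dobj"), None)
    return (root.lemma_ if root else None, dobj.lemma_ if dobj else None)

print(root_verb_and_dobj("Summarize the company's liquidity position for 2017."))
# e.g. ('summarize', 'position')
```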
Host: That’s a pretty comprehensive dataset, with a good mix of QA and text generation tasks. I imagine text generation might be particularly susceptible to hallucination issues in these scenarios. Now, let's move on to how they actually measured the performance of these LLMs. What metrics did they use to evaluate robustness and context grounding?
Guest: They used a few key metrics to evaluate the LLMs. First, they assigned each answer a relevance score from 1 to 6. Scores of 4, 5, and 6 denote answers that are relevant to the ground truth and free from hallucinations, varying only in their comprehensiveness. Scores of 1, 2, and 3 indicate answers that either fail in terms of information accuracy or contain irrelevant content. They then defined an 'Answer Compliance' metric, which is a binary mapping that indicates whether the answer is compliant. An answer is compliant if the relevance score is at least 4. This allows them to calculate the average Answer Compliance, which is the ratio of cases when the rating was at least 4. This then fed into the two key metrics: LLM Robustness and LLM Context Grounding.
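In code, the mapping from relevance ratings to Answer Compliance is straightforward; this is a minimal sketch of the rule as described.

```python
# Binary compliance mapping and average Answer Compliance, as described above.
def is_compliant(relevance_score: int) -> int:
    return 1 if relevance_score >= 4 else 0

def answer_compliance(scores: list[int]) -> float:
    return sum(is_compliant(s) for s in scores) / len(scores)

print(answer_compliance([6, 5, 3, 4, 1]))   # -> 0.6
```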
Host: Okay, so the Answer Compliance acts as a sort of gatekeeper, ensuring that only relevant and accurate answers are considered 'compliant'. How did they then use this to define Robustness and Context Grounding?
Guest: Following the HELM framework, they defined LLM Robustness (R) as the average compliance score across different input transformations. These transformations include the baseline query, query perturbations (misspelled, incomplete, and out-of-domain queries), and OCR context perturbation. So, a robust QA system is one that can provide a good answer despite these perturbations. LLM Context Grounding (G) is defined as the average compliance score across the missing context and irrelevant context scenarios. A QA system with a high G score is able to detect cases where the problem is unanswerable and refrain from producing potentially misleading hallucinations.
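Put in code, Robustness and Context Grounding are just averages of compliance over the two groups of cases; the dictionary keys and numbers below are illustrative, not results from the paper.

```python
# Robustness (R) and Context Grounding (G) as averages of Answer Compliance over
# the case groups named above. Keys and values are illustrative, not paper results.
ROBUSTNESS_CASES = ["baseline", "misspelled", "incomplete", "out_of_domain", "ocr_context"]
GROUNDING_CASES = ["missing_context", "irrelevant_context"]

def mean_compliance(per_case: dict[str, float], cases: list[str]) -> float:
    return sum(per_case[c] for c in cases) / len(cases)

per_case = {   # hypothetical per-perturbation Answer Compliance values
    "baseline": 0.95, "misspelled": 0.90, "incomplete": 0.88,
    "out_of_domain": 0.85, "ocr_context": 0.80,
    "missing_context": 0.60, "irrelevant_context": 0.55,
}
R = mean_compliance(per_case, ROBUSTNESS_CASES)   # 0.876
G = mean_compliance(per_case, GROUNDING_CASES)    # 0.575
```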
Host: So, Robustness measures how well the LLM handles noisy inputs while still providing a correct answer, and Context Grounding measures its ability to recognize when it shouldn't be answering at all. It's a really interesting way to frame the problem. I see they also introduced this 'LLM Compliance Score', what was the thinking behind that?
Guest: They introduced the LLM Compliance Score because their results showed a trade-off between Robustness and Context Grounding. Some models were very good at providing answers even with noisy inputs (high Robustness), but they were also more likely to hallucinate when they didn't have enough information (low Context Grounding). The LLM Compliance Score is designed to quantify this trade-off. Inspired by the classic precision-recall trade-off, it balances the ability of an LLM to refuse to answer (Context Grounding) against its ability to answer the query (Robustness). The score uses a parameter called beta (β) to prioritize either refusal (for β < 1) or answering (for β > 1), so setting β < 1 prioritizes refusal in order to reduce the hallucination ratio.
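One plausible way to realize that trade-off is an F-beta-style weighted harmonic mean with Context Grounding in the "precision" slot and Robustness in the "recall" slot, so that β < 1 weights refusal more heavily; treat the exact formula below as an assumption for illustration, since the paper's definition may differ in detail.

```python
# F-beta-style Compliance Score sketch: G plays the role of precision (refusal),
# R the role of recall (answering); beta < 1 prioritizes refusal. The exact
# formula in the paper may differ; this is an illustrative assumption.
def compliance_score(G: float, R: float, beta: float = 0.5) -> float:
    if G == 0.0 and R == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * G * R / (b2 * G + R)

print(compliance_score(G=0.575, R=0.876, beta=0.5))   # ~0.62
```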
Host: That's a really smart way to capture that balance. You want a model that's both resilient and cautious, and this metric helps quantify that. So, what models did they actually test using FailSafeQA?
Guest: They evaluated a wide range of both open-source LLMs and proprietary solutions that support a context length of at least 128k tokens. For open-source models, they used the DeepSeek-R1 family, the Llama 3 instruct models from Meta, the Qwen 2.5 models, Nvidia's Nemotron-70B-Instruct-HF, the Phi 3 series models, and Writer's Palmyra-Fin-128k-Instruct. For proprietary APIs, they selected GPT-4o, OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Exp, and Gemini 1.5 Pro 002.
Host: That's a pretty comprehensive lineup, covering a good spectrum of model architectures and sizes. And how did they actually judge the responses? I know you mentioned that they use a citation-based approach, but what was the overall process?
Guest: They used the LLM-as-a-Judge method, with a generalist LLM, Qwen2.5-72B-Instruct, acting as the judge. In the evaluation stage, they provided the judge LLM with the rating criteria, the reference solution, the relevant context citations, and the candidate answer. They used a temperature setting of 0 to ensure deterministic outputs and capped the judge's output at 256 new tokens. Because performance tends to degrade as task context length grows, the citation-based setup keeps the judge's input short, so the judging task is much simpler than the prediction task the evaluated models face. That justifies using a judge that may be weaker than some of the LLMs being tested, while keeping the evaluations accurate and cost-effective.
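A judging call in that setup might look roughly like the sketch below, assuming an OpenAI-compatible endpoint serving Qwen2.5-72B-Instruct; the prompt wording and endpoint are illustrative assumptions rather than the authors' actual judging prompt.

```python
# Hedged sketch of a citation-based judging call with temperature 0 and a
# 256-token cap. The endpoint and prompt wording are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a vLLM server

def judge_answer(criteria: str, reference: str, citations: str, candidate: str) -> str:
    prompt = (
        "Rate the candidate answer on a 1-6 relevance scale using the criteria below.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Supporting citations:\n{citations}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        "Reply with the rating and a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=256,
    )
    return response.choices[0].message.content
```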