Exploring research paper on Financial Knowledge Large Language Model

Artificial Intelligence has been used widely across most industries, and recently I have been exploring the enhancements made in the finance industry. There has been significant development there, such as automating complex tasks, improving customer service, and enabling in-depth financial analysis. But, as in other industries, there are challenges: LLM output is not always reliable and hallucinations occur, updating models to keep pace with evolving regulations and market conditions is costly, and integrating accurate, up-to-date knowledge is difficult.

According to the research paper, three contributions were introduced to address these challenges:

  • IDEA-FinBench — an evaluation benchmark built from questions in global financial professional exams, designed to rigorously test LLMs' financial knowledge
  • IDEA-FinKER — a financial knowledge enhancement framework that adapts general LLMs to the finance domain using retrieval-based few-shot learning for real-time knowledge injection and high-quality financial knowledge instructions for fine-tuning
  • IDEA-FinQA — a financial question-answering system structured around real-time factual enhancement. It includes a data collector to gather financial information, a data querying module for retrieval, and LLM-based agents that use external knowledge to produce accurate responses

Language Models for Financial NLU Tasks

Initially these models were applied to finance-related tasks such as Named Entity Recognition (NER) for recognizing financial terms, sentiment analysis of customer conversations, financial event extraction, finance question answering, and summarization of financial texts. Models like FinBERT took BERT and further pre-trained or fine-tuned it on financial corpora, improving performance on domain-specific tasks. Similarly, BBT-FinT5 adapts T5 using financial text corpora to further enhance generation and understanding in financial tasks.

General-purpose LLMs like GPT have been used in the financial domain for advisory content, reports, and question answering. There have been improvements in adapting them to domain-specific financial data by opening up pipelines and frameworks for financial modeling like FinGPT. In China, financial LLMs are fed Chinese financial corpora such as company reports, disclosures, and news to support Chinese financial applications. Still, we see significant limitations in knowledge and domain alignment.

There are also benchmarks and datasets developed for financial NLU tasks, such as FLUE, FinQA, ConvFinQA, BBT-CFLEB, and FinEval.

But is this trustworthy?

While PLMs/LLMs have excelled at knowledge-intensive tasks, their tendency to hallucinate or produce incorrect claims is a serious problem. In the financial domain, errors can be costly, so ensuring trustworthiness is critical. A strong technique to mitigate hallucinations is Retrieval-Augmented Generation (RAG), where the LLM is given external documents as context rather than relying on its internal parametric memory. Frameworks like LlamaIndex and LangChain support this. There are also evaluation suites that directly quantify model response quality using factual question-answer pairs constructed from authoritative knowledge encyclopedias.
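As a toy illustration of the RAG pattern: the sketch below uses a bag-of-words retriever standing in for a real embedding model, and prints the grounded prompt instead of making an actual LLM call. The function names and sample documents are my own, not from the paper or from any framework.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query, documents):
    # Ground the LLM in retrieved context instead of parametric memory.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

docs = [
    "The central bank raised the policy rate by 25 basis points in March.",
    "Quarterly revenue grew 12% year over year.",
    "The new disclosure rules take effect next fiscal year.",
]
print(build_rag_prompt("What did the central bank do to the policy rate?", docs))
```

The key property is that nothing about the model changes; only the prompt is augmented, so updating knowledge means updating the document store.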

IDEA-FinBench — Financial Knowledge Benchmark

We usually evaluate or benchmark general-purpose LLMs on math, coding, or logical reasoning, but finance LLMs lack comparable evaluation benchmarks. This paper introduces IDEA-FinBench, which uses CFA and CPA exam questions; these exams are globally authoritative, cover 16 subjects and 4 question types, and are bilingual (Chinese and English).

Data Collection

  • The CPA dataset covers accounting, tax, auditing, corporate strategy, and risk management, and is divided into computational (logic and reasoning) vs. memorization (knowledge retention) questions, including both single-answer and multi-answer formats
  • The CFA dataset covers fundamentals with single-choice questions in Level 1, and case studies, charts, and multi-question sets in Level 2. Topics include ethics, quantitative methods, economics, financial reporting, equity, fixed income, derivatives, portfolio management, etc.

Data Sources and Data Processing

These are sourced from the official Chinese CPA exam site, including past and simulated exams, as well as from third-party providers.

All exam items are stored in JSON for consistency, and the images, tables, and diagrams from CFA Level 2 are converted or referenced via URLs so models can access the structured data. They also report statistics such as questions per subject for CPA and CFA, as well as the total number of questions, so that all models are fairly evaluated on every subject.
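The paper does not print the JSON layout in full, so the field names below are my own guesses at what such an exam item could look like, together with a small validation helper of the kind a benchmark pipeline typically needs.

```python
import json

# Hypothetical schema for one exam item; the field names are my assumption,
# not the paper's actual format.
item = {
    "exam": "CFA",
    "level": "L2",
    "subject": "Fixed Income",
    "question_type": "single_answer",
    "language": "en",
    "question": "Which duration measure suits bonds with embedded options?",
    "options": {"A": "Macaulay duration", "B": "Modified duration", "C": "Effective duration"},
    "answer": "C",
    "image_urls": [],  # CFA Level 2 charts/tables referenced via URLs
}

REQUIRED = {"exam", "subject", "question_type", "question", "options", "answer"}

def validate(record):
    # Ensure every required field is present and the answer maps to an option.
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["answer"] not in record["options"]:
        raise ValueError("answer key not found in options")
    return True

validate(item)
print(json.dumps(item, ensure_ascii=False)[:80])
```

A per-subject count over records in this shape is then enough to produce the fairness statistics mentioned above.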

Experiments

In this part they run the actual evaluations of various LLMs and report the insights.

They used both open-source models and interface-only models accessed via APIs, and configured the prompts, languages, and reasoning modes. Performance is reported as accuracy by subject, question type, exam, language, and model type.

They analyzed how model size and architecture affect performance, the differences between base models and chat/instruction-tuned ones, and whether domain-specific financial LLMs outperform general LLMs. They also examined weaknesses in particular subjects or question types, exploring how changes in prompt design or format affect performance.

The table shows average accuracy (%) on the test set. "SA" in the "CPA-SA" column refers to CPA questions with a single answer, and "MA" in "CPA-MA" to CPA questions with multiple answers; "L1" in "CFA-L1" refers to questions from CFA Level 1, and "L2" in "CFA-L2" to questions from CFA Level 2.

IDEA-FinKER — Financial Knowledge Enhancement Framework

As the above results show, even LLMs pre-trained and fine-tuned on financial corpora and instruction datasets did not yield the expected results. So the paper designs a framework that cleans and constructs a comprehensive database of Chinese financial exam questions with embedding-similarity retrieval. They also developed a high-quality instruction set for fine-tuning any general LLM, which is the hard-injection paradigm of knowledge. Soft injection is retrieval-based few-shot knowledge injection at inference time, dynamically injecting content into the prompt.

Methodology

They collected a clean, large corpus of Chinese financial exam questions and associated metadata. The data is then cleaned, deduplicated, and formatted, and is used both for the retrieval database and for fine-tuning.

In soft injection, the LLM is provided at inference time with semantically similar prior financial Q&A items to ground its reasoning.

For each new input question, compute embedding similarity to items in FinCorpus and retrieve the top-k similar Q&A or explanation snippets. Construct a prompt that inserts these retrieved examples as few-shot demonstrations or context together with the target question. Thus, the LLM sees relevant financial context and is more likely to respond factually. This does not change model parameters; it only augments the prompt/context.
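The soft-injection loop can be sketched as follows. Here a string-similarity measure stands in for embedding similarity, and the tiny corpus is invented for illustration; neither is the paper's actual retriever or data.

```python
from difflib import SequenceMatcher

# Toy stand-in for FinCorpus: (question, answer) pairs.
fincorpus = [
    ("What does a higher debt-to-equity ratio indicate?", "Greater financial leverage and risk."),
    ("How is gross margin computed?", "Gross profit divided by revenue."),
    ("What does a current ratio below 1 suggest?", "Potential short-term liquidity problems."),
]

def similarity(a, b):
    # Stand-in for embedding similarity; a real system would compare vectors.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def soft_inject(question, corpus, k=2):
    # Retrieve the top-k most similar prior Q&A items and prepend them
    # as few-shot demonstrations; model weights are untouched.
    demos = sorted(corpus, key=lambda qa: similarity(question, qa[0]), reverse=True)[:k]
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{shots}\nQ: {question}\nA:"

prompt = soft_inject("What does a low current ratio suggest?", fincorpus)
print(prompt)
```

Because only the prompt changes, the corpus can be refreshed at any time without retraining.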

In hard-injection, the financial knowledge is encoded permanently inside the model by fine-tuning on instruction/question-answer pairs derived from FinCorpus.

They created instruction templates and Q&A instruction pairs based on exam questions and explanations, and used supervised fine-tuning to train the model to respond to financial instructions. This encourages the model to internalize reasoning patterns, domain constraints, and factual knowledge.

It is called "hard" because it modifies the model weights.
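Turning exam items into supervised fine-tuning pairs might look roughly like this; the template wording and the sample item are my own illustration, not the paper's actual templates.

```python
# Hypothetical exam item -> instruction pair conversion for supervised
# fine-tuning. The template text is my own, not the paper's.
def to_instruction_pair(item):
    options = "\n".join(f"{key}. {text}" for key, text in item["options"].items())
    instruction = (
        "Answer the following financial exam question and explain your reasoning.\n"
        f"{item['question']}\n{options}"
    )
    response = f"The answer is {item['answer']}. {item['explanation']}"
    return {"instruction": instruction, "output": response}

item = {
    "question": "A bond's price falls when market interest rates rise because:",
    "options": {"A": "coupons are fixed", "B": "credit risk rises", "C": "maturity shortens"},
    "answer": "A",
    "explanation": "Fixed coupons become less attractive relative to new, higher-yield issues.",
}
pair = to_instruction_pair(item)
print(pair["instruction"])
print(pair["output"])
```

A list of such pairs is what a standard SFT trainer consumes; including the explanation in the target is what pushes the model to internalize the reasoning, not just the answer letter.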

The paper recommends combining soft and hard injection: train the model with financial knowledge, then use soft injection to maintain freshness, coverage, and grounding, allowing the model to adapt dynamically to new or rare topics by retrieving relevant context. This also involves organizing prompt templates, balancing retrieved vs. fine-tuned knowledge, and avoiding overwhelming the model.

Experimental Setup

Base models such as Llama-2, ChatGLM, Qwen, and DISC-FinLLM are tested, and for each base model they experimented with a baseline, soft injection only, hard injection only, and a hybrid approach. The models are evaluated on how the injection paradigms impact performance. Prompts and chain-of-thought reasoning were configured for fairness.

Result & Analysis

The table gives average accuracy (%) on the test set, which is the CPA part of IDEA-FinBench. "SA" in the "CPA-SA" column refers to CPA questions with a single answer, and "MA" in "CPA-MA" to CPA questions with multiple answers. Percentages in green indicate an increase, and in red a decrease, relative to the vanilla model.

It was noted that the impact of IDEA-FinKER was more pronounced on models with weaker capabilities, such as Baichuan2-7B-Chat and Qwen-7B-Chat. Despite Yi-6B-Chat's commanding lead on the leaderboard, the enhancements from IDEA-FinKER were stable but less significant.

Although IDEA-FinKER brings notable improvements to the models, it still struggles to compensate for the inherent performance disadvantages of the base models.

IDEA-FinQA: Financial Question-Answering System

This section motivates the need for a full question-answering system in finance, beyond benchmarking and knowledge injection, and frames the design goals.

Even with the above knowledge injection, LLMs have limits due to static training data: they cannot inherently know events or facts that occur after their training cutoff. Updating model parameters via fine-tuning or pre-training is costly and risky, particularly for preserving model stability.

To address this, the system must retrieve external, up-to-date knowledge at inference time to support factual answers. They propose two artifacts here:

  • FinFact — a Chinese financial factual-knowledge verification dataset for evaluating factual correctness
  • IDEA-FinQA — a QA system built around real-time knowledge retrieval, with LLM-based agents orchestrating query rewriting, retrieval, extraction, and response generation

The system consists of three main modules:

  • A data collector for gathering financial data
  • A data querying module with text-index and embedding-index search capabilities
  • LLM-based agents: a query rewriter, an intention detector, an extractor and refiner, and a response generator

This not only enhances the LLM but also builds an end-to-end QA system that can reliably answer finance-domain queries with citations and factual grounding.

FinFact

It is a factual dataset created to test and benchmark factual correctness in the financial domain, especially for Chinese.

They gathered authoritative Chinese financial news from sources like Xinhua Net, China Youth Online, China Economic Net, etc.

They collected around 120 financial news articles to ensure variety and coverage. From these articles, they generated question-answer pairs in two styles: structural questions, which are objective queries about entities, dates, and numbers, and conversational questions, which are more natural dialogue-style queries referencing statements, opinions, or attitudes from the news content.

They used GPT-4 to assist in generating the questions and answers, ensuring they remain grounded in the original news text. This serves as a testbed for factuality, measuring whether a QA system outputs accurate, verifiable statements in financial contexts.

IDEA-FinQA — Financial QA System

This section describes the architecture and components of the proposed QA system. It is subdivided into three main parts: the Data Collector, the Data Search Engine, and the LLM-driven Agents.

Data Collector — gathers and stores domain-relevant financial data and is responsible for ensuring the system has a rich, up-to-date knowledge base. It collects four main types of text: stock market data, financial news, security research reports, and macroeconomic analyses. Long-term crawlers perform periodic updates, collecting new reports daily and storing metadata; real-time crawlers handle news and market data for queries that require up-to-date information, fetching on demand via search and then filtering and sorting the results. This keeps the system's knowledge base current and broad.

Data Search Engine — once data is collected, the system must retrieve relevant passages efficiently. This module provides a text-based index (indexing over titles and summaries of reports, with segmentation, term weighting, and recency prioritization, useful for initial filtering and candidate retrieval) and an embedding-based index (paragraph-level text-embedding similarity search that retrieves passages semantically close to the query; text is split into paragraphs before embedding, supporting flexible semantic matching beyond keyword overlap). By combining both, the system can do coarse filtering followed by fine semantic retrieval.
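The two-stage idea (coarse keyword filtering, then finer semantic reranking) can be sketched in a few lines. Token-set Jaccard overlap here is a toy stand-in for real embedding similarity, and the passages are invented examples.

```python
import re

def tokens(text):
    # Crude tokenizer: lowercase and strip punctuation.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(a, b):
    # Toy stand-in for embedding similarity: Jaccard overlap of token sets.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def hybrid_search(query, passages, k=2):
    # Coarse stage: keyword filter (stand-in for the text index over titles
    # and summaries). Fine stage: rerank the survivors by similarity.
    terms = tokens(query)
    candidates = [p for p in passages if terms & tokens(p)] or passages
    return sorted(candidates, key=lambda p: overlap_score(query, p), reverse=True)[:k]

passages = [
    "The company reported record quarterly earnings driven by cloud revenue",
    "Bond yields climbed as the central bank signalled further rate hikes",
    "Analysts expect earnings growth to slow despite possible rate cuts",
]
print(hybrid_search("central bank rate hikes", passages, k=1))
```

The coarse stage keeps retrieval cheap over a large corpus; the fine stage handles the semantic matching that pure keyword overlap misses.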

LLM-driven Agents — this is the controller layer, where the system uses specialized agents to orchestrate query rewriting, intent detection, extraction, and response generation. There are four agents:

  • Query Rewriter — takes the user query and rewrites or normalizes it for better retrieval, handling context disambiguation and removing redundant phrases, so that the search query is precise and relevant
  • Intention Detector — determines which type of information or data sources should be consulted, routing the query to the appropriate retrieval subprocess
  • Extractor and Refiner — takes the retrieved passages, extracts the key facts, data, or argument pieces needed to answer, and refines/cleans them; depending on the question's complexity, it may select just a few facts or assemble a more elaborate knowledge context
  • Response Generator — with the extracted/refined knowledge in context, this agent prompts the LLM to generate the final answer and ensures the answer is correct, concise, and grounded in the evidence
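The four-agent flow can be sketched as a simple pipeline. Each agent below is a plain function standing in for an LLM call; the routing rule, data sources, and all names are my own illustration of the orchestration pattern, not the paper's implementation.

```python
def query_rewriter(query):
    # Normalize the user query for retrieval (stand-in for an LLM rewrite).
    return query.strip().rstrip("?").lower()

def intention_detector(query):
    # Route to a data source; this keyword rule is a toy stand-in.
    return "market_data" if "price" in query or "stock" in query else "news"

def extractor_refiner(passages):
    # Keep the shortest, most focused passages as the knowledge context.
    return sorted(passages, key=len)[:2]

def response_generator(query, facts):
    # Final answer grounded in extracted facts (stand-in for an LLM call).
    return f"Based on: {'; '.join(facts)} -> answer to '{query}'"

def answer(query, sources):
    q = query_rewriter(query)
    passages = sources[intention_detector(q)]
    facts = extractor_refiner(passages)
    return response_generator(q, facts)

sources = {
    "news": ["Regulator announced new disclosure rules", "Commentary piece about markets and policy"],
    "market_data": ["ACME closed at 41.20", "Index futures were flat overnight"],
}
print(answer("What was ACME's stock price?", sources))
```

In a real system each function would be a prompted LLM call or retrieval query, but the control flow (rewrite, route, extract, generate) stays the same.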

Experimental Setup

They used Qwen1.5-14B-Chat as the base model driving all agents in the system; Yi-34B-Chat and GPT were also tested. The FinFact dataset is used to test factual QA capability. They save the full output from each LLM over the FinFact queries and use GPT-4 as a judge to assess the generated responses on metrics including factuality, relevance, and informational content. Structural questions are checked against the standard answers, while conversational ones refer back to the original news text for verification.
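The judging step could be set up roughly like this. The rubric wording, score scale, and JSON reply format are my own illustration of the GPT-4-as-judge pattern, not the paper's exact prompt.

```python
def build_judge_prompt(question, reference, response):
    # Ask the judge model to score along the paper's three dimensions.
    return (
        "You are grading a financial QA system's answer.\n"
        f"Question: {question}\n"
        f"Reference (ground truth): {reference}\n"
        f"Answer to grade: {response}\n"
        "Score the answer from 1-5 on each of: factuality, relevance, "
        "informational content. Reply as JSON, e.g. "
        '{"factuality": 5, "relevance": 4, "informational": 3}.'
    )

prompt = build_judge_prompt(
    "When did the new disclosure rules take effect?",
    "The rules took effect on 1 March.",
    "They took effect at the start of March.",
)
print(prompt)
```

For structural questions the reference is the standard answer; for conversational ones the original news passage would be passed in its place, matching the verification scheme described above.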

Results & Analysis

IDEA-FinQA demonstrates a distinct advantage in fact-based question answering compared to other models, leading across all three dimensions: factual, relevant, and informational. The system's advantage comes from external grounded knowledge retrieval and agent orchestration, which help avoid hallucinations and outdated facts, improving factual QA in the financial domain across several evaluation criteria.

Some Insights from me on this paper

  • I really liked that they benchmarked against a standard set of domain-specific knowledge, and that the benchmark is modular and extensible
  • The hybrid approach of soft and hard injection is good, since it combines the advantages of both methods, and it produced good results
  • In the end-to-end approach, using both text-based and embedding-index retrieval helps balance keyword matching and semantic relevance
  • They explicitly addressed the hallucination issue by integrating external knowledge retrieval, and the fact-verification dataset is really useful, since it judges outputs by factual correctness rather than fluency or style
  • Rather than retraining huge models from scratch or doing massive domain pre-training, their injection methods adapt a general LLM into a domain LLM at lower cost

Alongside these good ideas, I also see some gaps and limitations in the paper:

  • Soft injection relies heavily on the retrieved context, which must be of high quality; otherwise the output may be irrelevant or misleading
  • Hard injection might lead the model to overfit to exam patterns, reducing flexibility in answering outside the training distribution
  • They focused explicitly on the CPA and CFA exams, so it is unclear how well the approach performs on real-world finance queries that deviate from exam style or contain unusual phrasing or unexpected scenarios
  • The multi-agent pipeline adds overhead and latency. Also, the retrieval infrastructure must scale and stay fresh, which adds operational cost
  • Even with retrieval, the system is only as good as its underlying corpus; there are coverage gaps such as new topics, niche domains, or unindexed sources
  • The crawler must be extremely robust so that it can collect recent data or breaking news and incorporate it promptly
  • Using exam questions biases the evaluation toward their format (multiple choice, structured reasoning, known answer styles); real-world finance problems are often open-ended, ambiguous, or require judgement
  • FinFact is based on selected news sources, so the factual-verification scope is limited to those contexts and may not fully test all forms of hallucination or factual error
  • Orchestrating the crawler modules, indexing, multiple agents, prompt management, and model coordination is engineering-intensive and requires consistency, error handling, source caching, and system robustness
  • Financial use cases involve high stakes like legal compliance, auditing, and investment decisions, where mistakes can be costly, so the system must have safeguards beyond the model itself
  • Lastly, the dataset and system have a strong Chinese-finance orientation; porting them to other markets like the US or Europe would require building new benchmarks, retrieval pipelines, and corpora