
Mitigating LLM Hallucinations

BioStrand
01.09.2024

There is a compelling case underlying the tremendous interest in generative AI and LLMs as the next big technological inflection point in computational drug discovery and development. 

 

For starters, LLMs help expand the data universe of in-silico drug discovery, especially by opening up access to the huge volumes of valuable information locked away in unstructured textual sources such as scientific literature, public databases, clinical trial notes, and patient records. LLMs provide the much-needed capability to analyze this data, identify patterns and connections, and extract novel insights about disease mechanisms and potential therapeutic targets. 

 

Their ability to interpret complex scientific concepts and elucidate connections between diseases, genes, and biological processes can help accelerate disease hypothesis generation and the identification of potential drug targets and biomarkers. 

 

When integrated with biomedical knowledge graphs, LLMs help create a unique synergistic model that enables bidirectional data- and knowledge-based reasoning. The explicit, structured knowledge of knowledge graphs complements what LLMs learn implicitly, while the power of language models streamlines graph construction and conversational interaction with complex knowledge bases. 

 

However, several challenges still have to be addressed before LLMs can be reliably integrated into in-silico drug discovery pipelines and workflows. One of the most significant is hallucination.

 

Why do LLMs hallucinate?

At a time of some speculation about laziness and seasonal depression in LLMs, a hallucination leaderboard of 11 public LLMs revealed hallucination rates ranging from 3% at the top end to 27% at the bottom of the barrel. Another comparative study of two versions of a popular LLM tasked with generating ophthalmic scientific abstracts found high rates of fabricated references (33% and 29%). 

 

This tendency of LLMs to hallucinate, that is, to present incorrect or unverifiable information as accurate, can have serious consequences in critical drug discovery applications even at a rate of 3%. 

 

There are several reasons for LLM hallucinations. 

 

At the core of this behavior is the fact that generative AI models have no actual intelligence; they rely instead on a probability-based approach, predicting the output that is most likely to follow based on patterns and contexts 'learned' from their training data. Apart from this inherent lack of contextual understanding, other potential causes include noise, errors, biases, and inconsistencies in the training data, the training and generation methods themselves, and even the prompting techniques used. 

 

For some, hallucination is all LLMs do, while others see it as inevitable in any prompt-based large language model. In the context of life sciences research, however, mitigating LLM hallucinations remains one of the biggest obstacles to the large-scale, strategic integration of this potentially transformative technology.

 

How to mitigate LLM hallucinations?

There are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding with prompt augmentation.

 

Prompt Engineering

Prompt engineering is the process of strategically designing user inputs, or prompts, to guide model behavior and obtain optimal responses. There are three major approaches to prompt engineering: zero-shot, few-shot, and chain-of-thought prompting. In zero-shot prompting, the model is given only a description of the task, with no worked examples, and must generalize from what it learned during training. Few-shot prompting involves providing a handful of solved examples before presenting the actual query. Chain-of-thought (CoT) prompting is based on the finding that a series of intermediate reasoning steps, provided as examples during prompting, can significantly improve the reasoning capabilities of large language models. The chain-of-thought concept has since been expanded into new techniques such as Chain-of-Verification (CoVe), a self-verification process that enables LLMs to check the accuracy and reliability of their own output, and Chain of Density (CoD), which focuses on summarization rather than reasoning and controls the density of information in the generated text. 
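
As a rough illustration of these three prompting styles, the sketch below frames the same drug-target question as a zero-shot, a few-shot, and a chain-of-thought prompt. The question, the example pairs, and the wording are hypothetical placeholders for whatever model client and domain content a team actually uses.

```python
# Illustrative prompt templates only; the question and example pairs are
# hypothetical stand-ins for a real query and curated examples.

question = "Which protein family does imatinib primarily target?"

# Zero-shot: the task is described directly, with no worked examples.
zero_shot = f"Answer the question concisely.\nQ: {question}\nA:"

# Few-shot: a handful of solved examples precede the real query.
few_shot = (
    "Q: Which enzyme does aspirin inhibit?\nA: Cyclooxygenase (COX).\n"
    "Q: Which receptor does metoprolol block?\nA: The beta-1 adrenergic receptor.\n"
    f"Q: {question}\nA:"
)

# Chain-of-thought: the prompt asks for intermediate reasoning before the answer.
chain_of_thought = (
    f"Q: {question}\n"
    "Think step by step: recall the drug's class, then its mechanism of action, "
    "then name the target family. End with 'Answer: <target family>'."
)

# Each string would be sent to the LLM of choice; only the prompt changes.
for prompt in (zero_shot, few_shot, chain_of_thought):
    print(prompt, end="\n---\n")
```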

 

Prompt engineering, however, has its own set of limitations, including prompt constraints that can limit the ability to query complex domains and the lack of objective metrics for quantifying prompt effectiveness. 

 

Fine-Tuning

Where the focus of prompt engineering is on the skill required to elicit better LLM output, fine-tuning emphasizes task-specific training to enhance the performance of pre-trained models in specific topics or domains. The conventional approach is full fine-tuning, which involves additional training of a pre-trained model on labeled, domain- or task-specific data so that it generates more contextually relevant responses. This is a time-, resource-, and expertise-intensive process. An alternative approach is parameter-efficient fine-tuning (PEFT), which trains a small set of extra parameters without adjusting the entire model. The modular nature of PEFT means that training can prioritize select portions or components of the original parameters, so the same pre-trained model can be adapted for multiple tasks. LoRA (Low-Rank Adaptation of Large Language Models), a popular PEFT technique, can significantly reduce the resource intensity of fine-tuning while matching the performance of full fine-tuning.
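
As a rough sketch of the parameter-efficient route, the snippet below wraps a small pre-trained causal language model in a LoRA adapter using Hugging Face's transformers and peft libraries; the base model, target modules, and hyperparameter values are illustrative choices rather than recommendations.

```python
# Minimal LoRA setup with Hugging Face transformers and peft. Model name and
# hyperparameters are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "gpt2"  # stand-in for any pre-trained causal LM

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA injects small trainable low-rank matrices into selected weight
# matrices; the original pre-trained weights stay frozen.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2; varies by model
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, the wrapped model is trained on labeled, domain-specific data with
# a standard training loop, just as in full fine-tuning, but only the adapter
# weights are updated, so one base model can host multiple task adapters.
```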

 

There are, however, challenges to fine-tuning as well, including domain-shift issues, the potential for bias amplification and catastrophic forgetting, and the complexity of choosing the right hyperparameters to ensure optimal performance.

 

Grounding & Augmentation

LLM hallucinations are often the result of language models attempting to generate knowledge based on information that they have not explicitly memorized or seen. The logical solution, therefore, would be to provide LLMs with access to a curated knowledge base of high-quality contextual information that enables them to generate more accurate responses.  Advanced grounding and prompt augmentation techniques can help address many of the accuracy and reliability challenges associated with LLM performance. Both techniques rely on external knowledge sources to dynamically generate context. 

 

Grounding ensures that LLMs have access to up-to-date and use-case-specific information sources to provide the relevant context that may not be available solely from the training data. Similarly, prompt augmentation enhances a prompt with contextually relevant information that enables LLMs to generate a more accurate and pertinent output. 
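
As a minimal sketch of prompt augmentation, assuming the relevant passages have already been pulled from a curated knowledge base, the helper below folds that context into the prompt and instructs the model to answer only from it; the passages, template, and function name are hypothetical.

```python
# Hypothetical prompt augmentation: retrieved, curated context is inserted
# into the prompt and the model is told to answer only from that context.

def augment_prompt(question: str, context_passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in context_passages)
    return (
        "Use ONLY the context below to answer. If the context is insufficient, "
        "say so instead of guessing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Example with placeholder passages from a curated knowledge base.
passages = [
    "Imatinib is a tyrosine kinase inhibitor approved for chronic myeloid leukemia.",
    "Its primary target is the BCR-ABL fusion protein.",
]
print(augment_prompt("What is the primary target of imatinib?", passages))
```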

 

Factual grounding is a technique typically used in the pre-training phase to ensure that LLM output across a variety of tasks is consistent with a knowledge base of factual statements. Post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of LLMs on specific tasks.

 

Retrieval-Augmented Generation (RAG) is a distinct framework for the post-training grounding of LLMs in the most accurate, up-to-date information retrieved from external knowledge bases. The RAG framework enables the optimization of biomedical LLM output along three key dimensions. One, access to targeted external knowledge sources ensures that the LLM's internal representation of information is dynamically refreshed with the most current and contextually relevant data. Two, access to an LLM's information sources means that responses can be validated for relevance and accuracy. And three, there is the emerging potential to extend the RAG framework beyond text to multimodal knowledge retrieval, spanning images, audio, tables, etc., which can further boost the factuality, interpretability, and sophistication of LLMs. 
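
The toy pipeline below sketches the retrieve-then-generate loop that RAG places in front of the model, with a naive keyword-overlap retriever standing in for the embedding search and vector store a production system would use; the knowledge-base snippets and function names are illustrative only.

```python
# Toy RAG loop: retrieve the most relevant snippets for a query, then build a
# grounded prompt that asks the model to cite its sources.

KNOWLEDGE_BASE = [
    "Placeholder snippet about kinase inhibitors from an internal database.",
    "Placeholder snippet about a clinical trial protocol.",
    "Placeholder snippet about a target-disease association from the literature.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query and keep the top k."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, KNOWLEDGE_BASE))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above, and cite which snippet supports each claim."
    )

print(build_rag_prompt("Which kinase inhibitors are linked to this target?"))
```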

 

Also read: How retrieval-augmented generation (RAG) can transform drug discovery

 

One key challenge of retrieval-augmented generation is the higher initial implementation cost compared with standalone generative AI. In the long run, however, the RAG-LLM combination will be less expensive than frequently fine-tuning LLMs, and it provides the most comprehensive approach to mitigating LLM hallucinations. 

 

Integrated Intelligence with LENSai 

Holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. LENSai Integrated Intelligence, our next-generation data-centric AI platform, fluently blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development.   

 

LENSai integrates RAG-enhanced bioLLMs with an ontology-driven NLP framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequential and structural data) and semantics (biological functions). A comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a complete overview of the relationships between genes, proteins, structures, and biological pathways. Our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data empowers life sciences researchers with the high-tech capabilities needed to explore novel opportunities in drug discovery and development.

 

