15  Functional Module Annotation

This chapter describes how to interpret functional modules by integrating Large Language Models (LLMs) with a Retrieval-Augmented Generation (RAG) strategy. The Shiny app combines pathway information, scientific literature, and advanced text analysis to generate meaningful names and summaries for your functional modules.

Important

Prerequisites: Before running LLM interpretation, ensure you have completed functional module generation as described in previous chapters. The enriched_functional_modules object used in this chapter should be the output from the module clustering step.

15.1 Overview

The LLM interpretation process integrates multiple information sources to provide comprehensive module annotations:

  • Information Sources:

    • Pathway descriptions from GO, KEGG, Reactome, and metabolite databases
    • Gene/metabolite names from your input data
    • Scientific literature retrieved from PubMed based on pathway and molecule relevance
    • Local documents (optional) - your own research papers or relevant literature
  • Analysis Pipeline:

    1. Information Extraction: Extracts pathway descriptions and molecule information
    2. Literature Retrieval: Searches PubMed for relevant scientific papers
    3. RAG Strategy: Uses embeddings to find and rank the most relevant literature (see Section 15.2)
    4. LLM Generation: Generates biological module names and research summaries

This approach leverages both structured biological knowledge and cutting-edge research to provide contextually rich interpretations of your functional modules.

15.2 Understanding the RAG Strategy

The Retrieval-Augmented Generation (RAG) strategy works in several stages:

Stage 1: Literature Search

  • Searches PubMed using pathway names and gene/metabolite information
  • Retrieves abstracts and titles for relevant papers
  • Includes reference papers cited in pathway databases

Stage 2: Embedding-based Similarity

  • Generates text embeddings for module descriptions (pathways + molecules)
  • Calculates similarity between module embeddings and paper abstracts
  • Filters top papers based on cosine similarity scores
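
The filtering in Stage 2 can be illustrated with a minimal R sketch. The vectors and paper names below are made up for illustration; in the app, embeddings come from the configured text-embedding model:

```r
# Cosine similarity between a module embedding and paper-abstract embeddings.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical embedding vectors (real ones have hundreds of dimensions)
module_vec <- c(0.12, 0.85, 0.31)
paper_vecs <- list(
  paper_A = c(0.10, 0.80, 0.35),
  paper_B = c(0.90, 0.05, 0.10)
)

# Score every paper against the module, then keep the top-ranked ones
scores <- sapply(paper_vecs, cosine_sim, a = module_vec)
top_papers <- names(sort(scores, decreasing = TRUE))
```

Papers with the highest cosine similarity are carried forward to the GPT-powered reranking stage.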

Stage 3: GPT-powered Reranking

  • Uses an LLM to assess the relevance of each paper to the specific module, assigning relevance scores (0-1) where 1 indicates high relevance to the module’s pathways and molecules
  • Selects the most informative literature for final interpretation

Stage 4: Final Generation

  • Combines module information with top-ranked papers
  • Generates biologically meaningful module names
  • Produces research summaries highlighting current knowledge
  • Assigns confidence scores (0-1) reflecting the LLM’s assessment of interpretation quality based on the available evidence; high confidence means strong literature support and clear biological coherence

This multi-stage approach ensures that generated annotations are both scientifically accurate and contextually relevant.

15.3 Step 1: Load Your Data

Upload previously saved results:

  1. Click “Browse” under “Upload functional module (.rda)” at the top of the left panel to upload your analyzed functional module file (.rda format)
  2. Select your file
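
If you prefer to inspect the object outside the app, the same .rda file can be loaded in a plain R session (the file path below is illustrative; the object name follows the earlier chapters):

```r
# Load the saved functional module object into the current R session
load("enriched_functional_modules.rda")  # adjust the path to your file
ls()  # lists the object name(s) stored in the .rda file
```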

15.4 Step 2: Configure LLM and API Settings

15.4.1 LLM Model Settings

Configure your LLM and text-embedding model settings, referring to the table below:

| Parameter       | Description                  | Default                     | Notes                                          |
|-----------------|------------------------------|-----------------------------|------------------------------------------------|
| API provider    | GPT model for text generation | "gpt-4o-mini-2024-07-18"   | Currently supports OpenAI models only          |
| Embedding model | Model for text embeddings    | "text-embedding-3-small"    | Used for document similarity                   |
| API key         | OpenAI API key               | Required                    | Store securely; never commit to version control |

Important

Model Selection: Due to changes in how requests are processed in GPT-5 model series, we strongly recommend using the default option (gpt-4o-mini-2024-07-18) for the best performance and quality.
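
One common way to supply the API key without hard-coding it is an environment variable, for example in ~/.Renviron. This is a general R practice, not a feature specific to this app:

```r
# In ~/.Renviron (a file that should never be committed to version control):
# OPENAI_API_KEY=sk-...

# Read the key in R and fail early if it is missing
api_key <- Sys.getenv("OPENAI_API_KEY")
if (!nzchar(api_key)) stop("OPENAI_API_KEY is not set")
```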

15.4.2 Literature Search Settings

| Parameter          | Description                         | Default | Use Case                                                  |
|--------------------|-------------------------------------|---------|-----------------------------------------------------------|
| Module size cutoff | Minimum pathways per module         | 1       | Filter out small modules to focus on substantial ones     |
| Years to search    | How many years back to search PubMed | 5      | Balance recent findings against comprehensive coverage    |

Note

API Usage & Processing Time: Each module requires multiple API calls (embedding + LLM generation), with costs scaling with the number of modules and the volume of retrieved literature, so consider raising the Module size cutoff to focus on the most important modules. Typical processing takes 2-5 minutes per module, increasing with larger modules and more literature.
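
A rough back-of-the-envelope estimate of total runtime, using the 2-5 minutes-per-module figure above (the module count is illustrative):

```r
n_modules <- 12                   # modules passing the size cutoff (example)
minutes_per_module <- c(2, 5)     # typical per-module range
range(n_modules * minutes_per_module)  # expected total: 24-60 minutes
```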

15.4.3 Optional Enhancements

| Parameter              | Description                              | Default | Benefits                                                                  |
|------------------------|------------------------------------------|---------|---------------------------------------------------------------------------|
| Disease or phenotype   | Disease/condition context                | NULL    | Adds disease-specific interpretation                                      |
| Local corpus directory | Local directory with your own PDF files  | NULL    | Incorporates your own research or relevant literature into the interpretation |

15.4.4 Required Parameters

Two parameters are required before submitting the LLM interpretation:

  1. Embeddings output directory: The directory where the intermediate analysis results will be saved

  2. Organism gene annotation database: Select the organism of your omics data. Refer to the Data Input chapter for detailed instructions.
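
Organism annotation databases are Bioconductor OrgDb packages. For example, for human data (an assumption — choose the package matching your organism):

```r
# Install and load the human gene annotation database from Bioconductor
# BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
```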

Important

Double Check: Please DO NOT select a directory that contains any of your files, as this step will erase everything inside the chosen directory. Make sure the output directory you select is completely empty.

15.5 Step 3: Run the Analysis

  1. Verify your data is loaded - either from the current session or uploaded file
  2. Review all parameter selections to ensure they match your analysis goals
  3. Click the “Submit” button to start the LLM interpretation process
  4. Monitor progress - a progress indicator will show the analysis is running
  5. Wait for completion - processing usually takes more than 10 minutes, so be patient

15.6 Step 4: Review Results

After successful completion, the updated functional module results will appear in the right panel.

Key result elements:

  • llm_module_name: AI-generated biological module name
  • module_annotation: Original representative annotation (fallback if LLM interpretation unavailable)

For each interpreted module, the results include:

  • Module Name: AI-generated biological module name
  • Module Summary: AI-generated research summary highlighting current knowledge about the module
  • Association With Phenotype: How this module is associated with given phenotype. Empty if no disease or phenotype is entered
  • Confidence Score: Indicates the LLM’s assessment of interpretation quality based on the available evidence. High confidence means strong literature support and clear biological coherence

Each result also includes the exact prompt sent to the LLM, which provides transparency and is useful for understanding the AI’s reasoning.

The functional module results are automatically updated with a new column, llm_module_name, containing the LLM-generated names. Download the R object for follow-up analysis.

15.7 Step 5: Download the Results

After completing the LLM interpretation step, you have two options for accessing the results beyond viewing them in the Shiny app:

  1. Download the R object (.rda) containing the LLM interpretation results, which can be used for follow-up visualizations described in Chapter 16: Visualization.
  2. Check out the embeddings output directory, which already includes the following for each functional module:
    • chunk.txt: AI-generated summaries of the retrieved PubMed literature
    • embedding_vector.csv.gz: Text embeddings of the literature with cosine similarity values
    • paper_title.txt: A list of PubMed articles identified for this functional module, including their titles and PubMed IDs
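
These per-module files are plain text or compressed CSV and can be inspected directly in R. The directory path below is illustrative, and the exact layout inside the output directory may differ slightly:

```r
out_dir <- "embeddings_output/module_1"  # adjust to your output directory

# AI-generated summaries of the retrieved PubMed literature
readLines(file.path(out_dir, "chunk.txt"))

# Embeddings plus cosine similarity values (R reads gzip-compressed CSV
# transparently)
emb <- read.csv(file.path(out_dir, "embedding_vector.csv.gz"))

# Retrieved paper titles and PubMed IDs
readLines(file.path(out_dir, "paper_title.txt"))
```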

15.8 Step 6: View Analysis Code

Click the “Code” button to see the exact R code that replicates your analysis:

  • Code 1 and Code 2: Show the underlying R function calls that reproduce each step of the analysis

This code can be copied and used in your own R scripts for reproducible analysis.

15.9 Best Practices

  1. Start Small: Begin with a subset of modules to estimate costs and processing time
  2. Optimize Parameters: Adjust filtering parameters based on your module complexity
  3. Use Phenotype Context: Specify relevant diseases/conditions for more targeted interpretations
  4. Include Local Literature: Add your own papers for domain-specific insights
  5. Review Results: Examine confidence scores and retrieved literature quality
  6. Iterate: Refine parameters based on initial results

15.10 Next Steps

Continue to Data Visualization to create informative plots showcasing your LLM-interpreted modules.