13 Pathway Similarity Calculation

This chapter covers how to calculate pathway similarity using the MAPA Shiny application. The app supports two methods: traditional similarity and biotext embedding similarity. Both prepare your data for downstream clustering analysis, so you can choose the approach that best fits your research needs and computational resources.

Important

Prerequisites: Before calculating pathway similarity, ensure you have either:

Completed pathway enrichment analysis in the current session using the Pathway Enrichment section as described in the previous chapter, OR
Uploaded existing enrichment results using the file upload option at the top of the interface

The processed enrichment data is required for similarity calculation to proceed.

13.1 Step 1: Load Your Data

Option 1: Continue from Previous Step

If you have completed pathway enrichment analysis in the current session, your data will automatically be available for similarity calculation. The interface will detect your data type (gene or metabolite) and display appropriate parameter options.

Option 2: Upload Existing Results

If you have previously saved enrichment results, you can upload them:
1. Click “Browse” under “Upload Enriched Pathways Result (.rda)” at the top of the left panel
2. Select your enrichment results file (.rda format)
3. Wait for validation. The app automatically detects the data type (gene or metabolite), the databases available from your enrichment analysis, and the organism information (required for downstream analysis)

13.2 Step 2: Select Method

Once your data is loaded, choose your similarity calculation method using the radio buttons:

Traditional methods: Uses established similarity measures (semantic similarity for GO, overlap-based similarity for all supported databases, see Section 12.1)
Biotext embedding: Uses large language model (LLM) text embeddings to capture semantic relationships

13.2.1 Method 1: Traditional Methods

When you select “Traditional methods”, the interface displays database-specific parameter panels based on your enrichment analysis results. The app automatically detects which databases are available from your enrichment analysis. The traditional similarity method requires four main parameters for each database. While the parameter names vary slightly between data types, they serve the same fundamental purposes:

Parameter	Purpose	Default	Gene Databases	Metabolite Databases
P-adjust cutoff	Significance threshold for pathway inclusion	0.05	GO, KEGG, Reactome	SMPDB, KEGG
Count cutoff	Minimum genes/metabolites per pathway	5	GO, KEGG, Reactome	SMPDB, KEGG
Similarity method	Algorithm for calculating pathway similarity	See below	GO, KEGG, Reactome	SMPDB, KEGG
Similarity cutoff	Threshold for intra-database pathway clustering	0.5	GO, KEGG, Reactome	SMPDB, KEGG
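The two filtering parameters can be illustrated with a short Python sketch. It is not MAPA's implementation: the record layout, the pathway names, and the exact comparison operators (strictly below the p-adjust cutoff, at or above the count cutoff) are assumptions made for illustration.

```python
# Toy records; the field names are hypothetical, not MAPA's result columns.
pathways = [
    {"id": "pathway_A", "p_adjust": 0.001, "count": 12},
    {"id": "pathway_B", "p_adjust": 0.20, "count": 40},
    {"id": "pathway_C", "p_adjust": 0.03, "count": 3},
    {"id": "pathway_D", "p_adjust": 0.04, "count": 5},
]


def filter_pathways(records, p_adjust_cutoff=0.05, count_cutoff=5):
    """Keep pathways that pass both the significance and the size threshold."""
    return [
        r for r in records
        # Assumption: "<" for significance and ">=" for the minimum count.
        if r["p_adjust"] < p_adjust_cutoff and r["count"] >= count_cutoff
    ]


kept_ids = [r["id"] for r in filter_pathways(pathways)]
print(kept_ids)
```

With the defaults, pathway_B is dropped for significance and pathway_C for size; only pathways that pass both filters enter the similarity calculation.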

Similarity Method Options:

GO terms only: Semantic similarity using the simona package
GO/KEGG/Reactome/SMPDB pathways: Gene/metabolite overlap similarity (Jaccard, Dice, Overlap, and Kappa)
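For readers unfamiliar with the four overlap measures, the sketch below computes them for two toy gene sets using their standard textbook definitions. How the app computes them internally is not described in this chapter; in particular, the background size used for Kappa (`universe_size` here) is an assumed input, and all names are hypothetical.

```python
def overlap_similarities(set_a, set_b, universe_size):
    """Standard set-overlap measures between two pathway member sets."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        raise ValueError("both pathways need at least one member")
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = universe_size - both - only_a - only_b
    if neither < 0:
        raise ValueError("universe_size is smaller than the union of the sets")

    jaccard = both / len(a | b)
    dice = 2 * both / (len(a) + len(b))
    overlap = both / min(len(a), len(b))

    # Cohen's kappa: observed agreement on membership versus chance agreement.
    observed = (both + neither) / universe_size
    expected = (
        (both + only_a) * (both + only_b) + (only_b + neither) * (only_a + neither)
    ) / universe_size ** 2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {"jaccard": jaccard, "dice": dice, "overlap": overlap, "kappa": kappa}


scores = overlap_similarities(
    {"g1", "g2", "g3", "g4"}, {"g3", "g4", "g5", "g6"}, universe_size=10
)
print(scores)
```

The same pair of pathways scores differently under each measure, so a similarity cutoff chosen for one measure does not necessarily carry over to another.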

Tip

Parameter Selection Guide:

Lower p-adjust cutoff (e.g., 0.01) = More stringent, fewer pathways
Higher count cutoff (e.g., 10) = Focus on larger, more established pathways
Similarity cutoff = Balance between broader intra-database modules (lower values group more pathways together) and more specific modules (higher values group only closely related pathways)
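The effect of the similarity cutoff can be illustrated with a simplified model in which two pathways fall into the same module whenever a chain of pairwise similarities at or above the cutoff connects them. This is an assumption made for illustration; the app's actual clustering procedure may differ, and the pathway names and scores are made up.

```python
def modules_at_cutoff(names, similarity, cutoff):
    """Group pathways into connected components of the thresholded graph."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the component root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), score in similarity.items():
        if score >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


names = ["P1", "P2", "P3", "P4"]
similarity = {
    ("P1", "P2"): 0.8, ("P2", "P3"): 0.55, ("P1", "P3"): 0.4,
    ("P3", "P4"): 0.3, ("P2", "P4"): 0.2, ("P1", "P4"): 0.1,
}

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, modules_at_cutoff(names, similarity, cutoff))
```

Under this model, raising the cutoff from 0.3 to 0.7 splits one broad module into three smaller ones.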

13.2.2 Method 2: Biotext Embedding

When you select “Biotext embedding”, the interface displays parameters for LLM-based similarity calculation.
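To give a sense of how text embeddings become a pathway similarity matrix, the sketch below compares toy embedding vectors with cosine similarity. Cosine similarity is a common choice for embeddings but is an assumption here, since this chapter does not state which measure the app uses. The three-dimensional vectors are made up; real embeddings are returned by the provider's API and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of three pathway text descriptions.
embeddings = {
    "pathway_A": [1.0, 0.0, 0.0],
    "pathway_B": [0.8, 0.6, 0.0],
    "pathway_C": [0.0, 0.0, 1.0],
}

names = list(embeddings)
matrix = [
    [cosine_similarity(embeddings[row], embeddings[col]) for col in names]
    for row in names
]

for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

The matrix has one row and one column per pathway, so its size grows with the square of the number of pathways that pass the filters.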

13.2.2.1 API Configuration

Set up your API connection:

API provider: Choose between OpenAI or Gemini
Embedding model: Specify the text embedding model (e.g., “text-embedding-3-small”)
API key: Enter your API key for the selected provider

Tip

API Key Setup:

For OpenAI: Obtain your API key from OpenAI Platform
For Gemini: Get your API key from Google AI Studio

Store your API key securely and never commit it to version control.
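When you script an analysis instead of typing the key into the app, one common way to keep the key out of your files is to read it from an environment variable. The sketch below shows the pattern in Python; the helper name and the default variable name `OPENAI_API_KEY` are assumptions, so use whatever variable you have set on your system.

```python
import os


def read_api_key(variable_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.environ.get(variable_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {variable_name} environment variable before running."
        )
    return key


if __name__ == "__main__":
    try:
        key = read_api_key()
        # Report only that a key was found; never print the key itself.
        print(f"API key loaded ({len(key)} characters)")
    except RuntimeError as err:
        print(err)
```

The same pattern is available in R through `Sys.getenv()`. Either way, the key stays in your environment and never appears in a script that might be committed.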

13.2.2.2 Database Selection and Filtering

Configure which databases to include and set filtering parameters:

Parameter	Purpose	Default	Gene Analysis	Metabolite Analysis
Available Database	Select databases for embedding analysis	Auto-detected	GO, KEGG, Reactome	SMPDB, KEGG
P-adjust cutoff	Significance threshold for pathway inclusion	0.05	All selected databases	All selected databases
Count cutoff	Minimum genes/metabolites per pathway	5	All selected databases	All selected databases

Note

Database Availability: The databases available for selection depend on your organism and the results from your enrichment analysis. Only databases with significant results will be available for embedding analysis.

13.3 Step 3: Run the Analysis

1. Verify your data is loaded - either from the current session or uploaded file
2. Select your similarity method - Traditional or Biotext embedding
3. Review all parameter selections to ensure they match your analysis goals
4. Click the “Submit” button to start the similarity calculation
5. Monitor progress - a progress indicator will show the analysis is running
6. Wait for completion - processing time varies by method:
   - Traditional methods: Usually complete within minutes
   - Biotext embedding: May take longer due to API calls and text processing

13.4 Step 4: Review Results

After successful completion, results will appear in the right panel with different content based on your selected method:

Traditional Methods Results

Table Tab: View database-specific module results

Browse the intra-database modules of enriched pathways, grouped by pathway similarity
Examine module composition
Download individual database results as CSV files

Key result columns in the module tables:

module: Intra-database module identifier (e.g., “go_Module_2”, “kegg_Module_15”)
module_annotation: Representative pathway name for the module (pathway with lowest p-adjust for ORA or highest |NES| for GSEA)
Description: Names of all pathways in the module (separated by ;)
module_content: All pathway/term IDs grouped in this module
Count: Number of genes/metabolites from input list in the module
p_adjust: Best (lowest) adjusted p-value among pathways in the module
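The way these columns relate to the pathways in a module can be sketched as follows. The selection rules come from the column descriptions above (lowest p-adjust for ORA, highest |NES| for GSEA, lowest adjusted p-value for the module); the input record layout, the function name, and the toy values are assumptions, and `module_content` is kept as a plain list because its separator is not specified here.

```python
def summarize_module(pathways, analysis="ORA"):
    """Derive module-level columns from the pathways grouped in one module."""
    if analysis == "ORA":
        representative = min(pathways, key=lambda p: p["p_adjust"])
    elif analysis == "GSEA":
        representative = max(pathways, key=lambda p: abs(p["NES"]))
    else:
        raise ValueError("analysis must be 'ORA' or 'GSEA'")

    return {
        "module_annotation": representative["description"],
        "Description": ";".join(p["description"] for p in pathways),
        "module_content": [p["id"] for p in pathways],
        "p_adjust": min(p["p_adjust"] for p in pathways),
    }


# Hypothetical pathways that were grouped into one module.
module_pathways = [
    {"id": "ID_1", "description": "pathway one", "p_adjust": 0.02, "NES": -2.4},
    {"id": "ID_2", "description": "pathway two", "p_adjust": 0.001, "NES": 1.1},
    {"id": "ID_3", "description": "pathway three", "p_adjust": 0.04, "NES": 1.9},
]

ora_summary = summarize_module(module_pathways, analysis="ORA")
gsea_summary = summarize_module(module_pathways, analysis="GSEA")
print(ora_summary["module_annotation"], "|", gsea_summary["module_annotation"])
```

The same module can therefore carry a different annotation depending on whether the enrichment was run as ORA or GSEA.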

Data Visualization Tab: Generate similarity network plots

Click “Generate plot” to create network visualizations for modules from each database
Customize plot parameters (degree cutoff, text display options)

R Object Tab: Access the complete results object

View a summary of the analysis results
Download the complete functional module object (.rda format) for further analysis in R

Biotext Embedding Results

For biotext embedding results, the results panel is simplified: you can download the results (.rda format), which include the similarity matrix and your enrichment analysis result, for use in downstream clustering analysis.

Note

Embedding Results: The similarity matrix from biotext embedding is typically large and computationally intensive to display. Instead of showing the full matrix in the interface, you can download it for use in subsequent analysis steps or external tools.

13.5 Step 5: View Analysis Code

Click the “Code” button to see the exact R code that replicates your analysis. This feature provides:

Reproducibility: Copy and paste code into R scripts
Learning: Understand the underlying MAPA functions and parameters
Customization: Modify parameters for batch processing or additional analyses

13.6 Method Comparison

Aspect	Traditional Methods	Biotext Embedding
Speed	Fast, deterministic	Slower, requires API calls
Cost	Free	API usage costs
Reproducibility	Fully reproducible	Highly reproducible (minor API variations)
Novel discoveries	Based on known annotations	Can identify functional relationships beyond overlap
Internet dependency	Minimal (for database updates)	Required for API access
Cross-database integration	Limited by annotation overlap	Excellent semantic integration

13.7 Next Steps

Once your similarity calculation is complete:

Review and Download Results: Examine the similarity patterns, then download the results for further analysis and as a backup, so that you can restart from this point if needed.
Proceed to Clustering: Click the “Next” button to move to the Pathway Clustering section

Within the current session, the similarity results are automatically available to the next step of the MAPA workflow, where you’ll group related pathways into functional modules.