5  Determining Optimal Clustering Parameters

This chapter describes how to systematically evaluate different clustering parameters to find the optimal settings for grouping your pathway similarity results into functional modules.

Important

Prerequisites: Before determining optimal clustering parameters, ensure you have calculated pathway similarity using either merge_pathways() or get_bioembedsim() as described in Chapter 4. The similarity results will be used as input for this analysis.

5.1 Overview

The determine_optimal_clusters() function systematically evaluates various clustering strategies to identify the optimal parameters for grouping pathway enrichment results. It tests different clustering methods and similarity cutoffs, using modularity and silhouette scores to assess clustering quality.

This analysis helps you make informed decisions about clustering parameters before proceeding to functional module identification, potentially saving time and improving the biological interpretability of your results.

5.2 Supported Input Types

The function accepts two types of input objects:

  1. Traditional similarity objects: Results from merge_pathways() (functional_module class)
  2. Biotext embedding objects: Results from get_bioembedsim() (list with sim_matrix and enriched_pathway)

For traditional similarity objects, the function internally calculates Jaccard similarity between modules from different databases. For biotext embedding objects, it uses the pre-computed semantic similarity matrix.

5.3 Basic Usage

cluster_evaluation <- 
  determine_optimal_clusters(
    object = met_similarity_result,
    cutoff_increment = 0.05  # Test cutoffs from 0.2 to 0.9 in steps of 0.05
  )

# Starting optimal cluster determination...
# Extracting similarity matrix and pathway data...
# Calculating the similarity matrix...
# Starting clustering evaluation across methods and cutoffs...
# Processing method: hierarchical ...
# Processing method: binary_cut ...
# Processing method: girvan_newman ...
# Clustering evaluation completed! Generating plots and finding optimal parameters...
# Analysis complete!
cluster_evaluation <- 
  determine_optimal_clusters(
    object = met_biotext_similarity,
    cutoff_increment = 0.05
  )

5.4 Clustering Methods Evaluated

The function tests three clustering algorithms:

  1. Hierarchical Clustering: Traditional agglomerative clustering with specified linkage method (default linkage method is “complete”)
  2. Binary Cut: Uses the simplifyEnrichment algorithm
  3. Girvan-Newman: Community detection based on edge betweenness

Each method is tested across a range of similarity cutoffs to find the optimal balance between cluster cohesion and separation.

5.5 Evaluation Metrics

Modularity: Measures the strength of division of a network into modules. Higher values indicate better defined community structure.

Silhouette Score: Measures how similar objects are to their own cluster compared to other clusters. Values range from -1 to 1, with higher values indicating better clustering.

5.6 Similarity Cutoff Interpretation

The meaning of sim.cutoff depends on your clustering method:

Edge Filter: Only pathway pairs with similarity above this threshold are connected.

  • Lower values (0.3-0.4): More permissive, denser networks, larger modules
  • Higher values (0.6-0.8): More stringent, sparser networks, smaller modules

Dendrogram Cutting Height: Determines where to cut the clustering tree.

  • Lower values (0.3-0.4): Cut lower in tree, more fine-grained modules
  • Higher values (0.6-0.8): Cut higher in tree, fewer but larger modules
Note

Key Difference in Cutoff Interpretation:

  • In Girvan-Newman, higher cutoffs = smaller modules
  • In Binary Cut/Hierarchical, higher cutoffs = larger modules

Always examine your results and adjust based on the biological interpretability of resulting modules.

5.7 Parameter Customization

# Fine-grained parameter search
detailed_evaluation <- 
  determine_optimal_clusters(
    object = similarity_results,
    cutoff_increment = 0.01,  # Smaller steps for more detailed analysis
    hclust_method = "ward.D2"  # Different hierarchical clustering method
  )

# Available hierarchical clustering methods:
# "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid"

5.8 Results Interpretation

The function returns a comprehensive evaluation with three components:

# View the evaluation plot
cluster_evaluation$evaluation_plot

# Examine detailed results
head(cluster_evaluation$cluster_result)
# # A tibble: 6 × 4
#   method       cutoff metric     value
#   <chr>         <dbl> <chr>      <dbl>
# 1 hierarchical   0.2  modularity   0  
# 2 hierarchical   0.2  silhouette   0.4
# 3 hierarchical   0.25 modularity   0  
# 4 hierarchical   0.25 silhouette   0.4
# 5 hierarchical   0.3  modularity   0  

# Find the best parameter combinations
cluster_evaluation$best_combination
# # A tibble: 22 × 4
#    method        cutoff metric     value
#    <chr>          <dbl> <chr>      <dbl>
#  1 binary_cut      0.85 modularity 0.375
#  2 binary_cut      0.9  modularity 0.375
#  3 girvan_newman   0.2  modularity 0.375
# ...

Evaluation Plot: Shows how clustering quality metrics change across similarity cutoffs for each method. Look for:

  • Peaks in modularity: Indicate good community structure
  • High silhouette scores: Indicate well-separated clusters
  • Consistent performance: Methods that perform well across multiple cutoffs

Best Combination: Identifies the optimal method and cutoff for each metric.

5.9 Next Steps

Use the optimal parameters identified in this analysis for functional module identification in Functional Module Identification. The recommended method and cutoff values will help ensure high-quality, biologically meaningful pathway modules.