library(mapa)2 Data input and preprocessing
This chapter covers the input data requirements for mapa analysis, including the expected data formats and data preprocessing for converting your data to the required format.
2.1 Input Data Format Requirements
2.1.1 Gene Expression Data for ORA
mapa accepts gene expression data with flexible identifier requirements, depending on your analysis type. Your input data must contain at least one of the following identifier columns:
ensembl: Ensembl gene IDs (e.g., “ENSG00000141510”)entrezid: NCBI Entrez gene IDs (e.g., “7157”)uniprot: UniProtKB accession numbers (e.g., “P04637”)symbol: Gene symbols (e.g., “TP53”)
Column names must be in lowercase.
2.1.2 Gene Expression Data for GSEA
One identifier column
(column name must be one of:ensembl,entrezid,uniprot, orsymbol)Finally, only
entrezidwill be used for GSEA in MAPA.
If the you upload other types of IDs, theentrezidwill be matched by the first step of MAPA:convert_id()An
order_bycolumnNote: The column name should be either
fcorp_value_adjust.
Since multiple ranking metrics can be used in GSEA, it is important to provide guidance to users when choosing the appropriate metric. See:
Zyla, J., Marczyk, M., Weiner, J. et al. Ranking metrics in gene set enrichment analysis: do they matter? BMC Bioinformatics 18, 256 (2017). https://doi.org/10.1186/s12859-017-1674-0
In practice, MAPA does not enforce a fixed column name. Instead, the
order_byparameter should be set to match the name of the column containing the chosen ranking metric.
A column named
p_value_adjust, which represents the statistical significance for the analysis results.- Why does
p_value_adjustmatter?
For GSEA, no duplicated gene IDs are allowed.
Currently, MAPA removes duplicatedentrezidby keeping the one with the smallestp_value_adjust.
- NOTE: Only
entrezidis used for performing GSEA in MAPA.
- The keytype for GO and KEGG can only be
"ENTREZID"and"kegg"respectively.
- For Reactome, the internal function
ReactomePA::gsePathway()only supportsentrezid.
- Why does
2.1.3 Get Our Example Data
MAPA provides several built-in demo datasets that help you to go through and study MAPA workflow such as pathway enrichment analysis in the next step Chapter 3, and integrated heatmap visualization in Chapter 8.
To load any dataset for the following step Chapter 3, use:
data(demo_data_ora) # Gene ORA dataset
data(demo_data_gsea) # Gene GSEA dataset
data(demo_data_met) # Metabolite ORA datasetBelow are some information about these datasets:
| Dataset | Content | Rows | Columns | Use | Source |
|---|---|---|---|---|---|
demo_data_ora |
66 significantly downregulated proteins (log2FC, FDR) from muscle tissue | 66 | 3 | Gene Over-Representation Analysis (ORA) | Takasugi et al., Nat Commun 15, 8520 (2024), DOI: 10.1038/s41467-024-52845-x |
demo_data_gsea |
5,290 proteins with fold changes and adjusted p-values from liver tissue | 5,290 | 3 | Gene Set Enrichment Analysis (GSEA) | Takasugi et al., Nat Commun 15, 8520 (2024), DOI: 10.1038/s41467-024-52845-x |
demo_data_met |
106 altered metabolites with KEGG IDs, FDR, and fold-change scores | 106 | 4 | Metabolite pathway enrichment analysis | In-house data |
For ORA, you only need one identifier column plus any additional metadata:
# Example ORA input data
data(demo_data_ora)
head(demo_data_ora)
# # A tibble: 66 × 3
# symbol `log2FC (6 vs 30mo)` `FDR (6 vs 30mo)`
# <chr> <dbl> <dbl>
# 1 Rpl38 -0.590 1.77e- 6
# 2 Akap1 -0.640 1.81e- 2
# 3 Actc1 -2.65 3.24e-10
# 4 Amd1 -1.12 9.45e- 6
# 5 Serpina1e -0.650 1.38e- 3
# 6 Ensa -0.672 1.05e- 5
# 7 Mrps36 -0.775 1.86e- 5
# 8 Aimp2 -0.656 1.63e- 5
# 9 Ank1 -0.547 5.11e- 7
# 10 Aqp1 -0.564 1.50e- 2For GSEA, you need:
- One identifier column (ensembl, entrezid, uniprot, or symbol)
- An
order_bycolumn containing numeric values to rank genes (This is used to create the ranked gene list required for GSEA. Common choices like Log2 fold change values, pvalue)
# Example GSEA input data
data(demo_data_gsea)
head(demo_data_gsea)
# # A tibble: 6 × 3
# symbol fc p_value_adjust
# <chr> <dbl> <dbl>
# 1 Magohb 0.194 0.0129
# 2 Cask -0.0215 0.686
# 3 Igkv5-39 2.16 0.0000853
# 4 Ighg2b 2.09 0.0000557
# 5 Ighg1 1.68 0.482
# 6 Ighg3 -0.325 0.390 2.1.4 Metabolite Data
For metabolite analysis, mapa currently supports Over-Representation Analysis (ORA) only. You only need one identifier column plus any additional metadata (Column names must be in lowercase):
hmdbid: Human Metabolome Database IDs (e.g., “HMDB0000001”, “HMDB0000002”) - Human onlykeggid: KEGG Compound IDs (e.g., “C00001”, “C00002”) - All organisms
- For human metabolites: You can use either hmdbid or keggid as input
- For other organisms: Only keggid is supported
- GSEA for metabolites is not currently supported
# Example metabolite input data
data(demo_data_met)
head(demo_data_met)
# # A tibble: 6 × 4
# variable_id keggid fdr score
# <chr> <chr> <dbl> <dbl>
# 1 M441T680_2_NEG NA 3.59e-16 1.11
# 2 M448T566_NEG C05466 5.12e- 3 1.13
# 3 M229T462_NEG C02678 7.43e- 4 1.20
# 4 M367T590_NEG C04555 2.28e- 4 1.34
# 5 M221T50_POS C13008 7.99e- 3 1.36
# 6 M544T604_POS C04230 1.19e-18 1.532.2 Input Data Preprocessing
This step is required except for non-human metabolite analysis - mapa provides the convert_id() function to perform ID conversion for downstream analysis. Even if your data already contains one of the required identifier columns, it is highly recommended to run this function to ensure all necessary identifiers are present for pathway analysis.
2.2.1 Gene ID conversion
The convert_id() function converts between different ID types and always returns data with all four gene identifier columns (ensembl, entrezid, uniprot, symbol), which are required for mapa to work properly. Since organism-specific annotation database are required for ID conversion and the following enrichment analysis, choose the ID conversion method based on your organism.
For common model organisms, use standard organism database Bioconductor packages.
For most model organisms, Bioconductor already supplies curated organism annotation databases (the OrgDb packages). You can browse the complete list and install the one that matches your species of interest here.
# Mouse
variable_info <- convert_id(
data = demo_ora_data,
query_type = "gene",
from_id_type = "symbol", # This tells the function what your input column represents
organism = "org.Mm.eg.db" # install and load the package at first
)
head(variable_info)
# # A tibble: 6 × 6
# ensembl entrezid uniprot symbol `log2FC (6 vs 30mo)` `FDR (6 vs 30mo)`
# <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 ENSMUSG000000573… 67671 Q52KP0 Rpl38 -0.590 1.77e- 6
# 2 ENSMUSG000000184… 11640 B1AR25 Akap1 -0.640 1.81e- 2
# 3 ENSMUSG000000686… 11464 P68033 Actc1 -2.65 3.24e-10
# 4 ENSMUSG000000752… 11702 P0DMN7 Amd1 -1.12 9.45e- 6
# 5 ENSMUSG000000728… 20704 P81105 Serpi… -0.650 1.38e- 3
# 6 ENSMUSG000000386… 56205 P60840 Ensa -0.672 1.05e- 5 For organisms without standard packages, use AnnotationHub IDs:
non_model_org_dt <- readr::read_csv("examples/example_non_model_org_data.csv")
# Macaca fascicularis (taxid: 9541)
variable_info <- convert_id(
data = non_model_org_dt,
query_type = "gene",
from_id_type = "ensembl",
ah_id = "AH119902", # AnnotationHub ID for Macaca fascicularis
return_orgdb = TRUE # The organism database is needed for downstream analysis
)
# Successfully loaded organism database from AnnotationHub
# Database information:
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
#
# Please see: help('select') for usage information
# Note: The following ID types are not available in the organism database and will be filled with NA: uniprot
# Available ID types in database: ACCNUM, ALIAS, ENSEMBL, ENTREZID, EVIDENCE, EVIDENCEALL, GENENAME, GID, GO, GOALL, ONTOLOGY, ONTOLOGYALL, PMID, REFSEQ, SYMBOL, UNIGENE
# 'select()' returned 1:many mapping between keys and columns
# Warning in clusterProfiler::bitr(geneID = data[[from_id_type]], fromType = from_clusterprofiler_type, :
# 57.38% of input gene IDs are fail to map...
variable_info
# $data
# # A tibble: 183 × 8
# ensembl entrezid uniprot symbol Symbol `Gene type` Tissue Cluster
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 ENSMFAG00000000264 NA NA NA APH1B protein_coding Liver (LL) Cluster U
# 2 ENSMFAG00000000283 NA NA NA PLCG1 protein_coding Liver (LL) Cluster U
# 3 ENSMFAG00000000444 102141791 NA EPS15L1 EPS15L1 protein_coding Liver (LL) Cluster U
# 4 ENSMFAG00000000464 NA NA NA RCOR3 protein_coding Liver (LL) Cluster U
# 5 ENSMFAG00000000583 101865185 NA EXOC7 EXOC7 protein_coding Liver (LL) Cluster U
# 6 ENSMFAG00000000608 NA NA NA CD99 protein_coding Liver (LL) Cluster U
# 7 ENSMFAG00000000627 102121020 NA MTFMT MTFMT protein_coding Liver (LL) Cluster U
# 8 ENSMFAG00000000639 NA NA NA GJC2 protein_coding Liver (LL) Cluster U
# 9 ENSMFAG00000000737 101866375 NA MKRN1 MKRN1 protein_coding Liver (LL) Cluster U
# 10 ENSMFAG00000000852 102131151 NA LOC102131151 CYB5R1 protein_coding Liver (LL) Cluster U
# # ℹ 173 more rows
# # ℹ Use `print(n = ...)` to see more rows
#
# $orgdb
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
# Please see: help('select') for usage informationHow to find AnnotationHub IDs according to taxid?
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("AnnotationHub")
library(AnnotationHub)
ah <- AnnotationHub()
# Search for your organism database by taxid (here taxid is "9541")
query_result <- AnnotationHub::query(ah, c("OrgDb", "9541"))
# Use the ah_id from the results
query_result$ah_id
# [1] "AH119899" "AH119900" "AH119901" "AH119902"2.2.2 Metabolite ID Conversion
For human metabolite data, convert IDs bwteen KEGG and HMDB:
# Convert KEGG IDs to HMDB IDs (human only)
met_variable_info <- convert_id(
data = demo_data_met,
query_type = "metabolite",
from_id_type = "keggid",
organism = "hsa" # KEGG organism code
)
# Result will include both hmdbid and keggid columns
head(met_variable_info)
# # A tibble: 6 × 5
# variable_id keggid fdr score hmdbid
# <chr> <chr> <dbl> <dbl> <chr>
# 1 M448T566_NEG C05466 5.12e- 3 1.13 HMDB0000637
# 2 M229T462_NEG C02678 7.43e- 4 1.20 HMDB0000623
# 3 M367T590_NEG C04555 2.28e- 4 1.34 HMDB0001032
# 4 M221T50_POS C13008 7.99e- 3 1.36 HMDB0006240
# 5 M544T604_POS C04230 1.19e-18 1.53 HMDB0002815
# 6 M544T604_POS C04230 1.19e-18 1.53 HMDB00103802.3 Next Step
Once your data meets the format requirements, you can proceed to enrichment analysis to begin your mapa workflow.