2 Data input and preprocessing

This chapter covers the input data requirements for mapa analysis, including the expected data formats and data preprocessing for converting your data to the required format.

library(mapa)

2.1 Input Data Format Requirements

2.1.1 Gene Expression Data for ORA

mapa accepts gene expression data with flexible identifier requirements, depending on your analysis type. Your input data must contain at least one of the following identifier columns:

ensembl: Ensembl gene IDs (e.g., “ENSG00000141510”)
entrezid: NCBI Entrez gene IDs (e.g., “7157”)
uniprot: UniProtKB accession numbers (e.g., “P04637”)
symbol: Gene symbols (e.g., “TP53”)

Important

Column names must be in lowercase.

2.1.2 Gene Expression Data for GSEA

One identifier column
(column name must be one of: ensembl, entrezid, uniprot, or symbol)

Finally, only entrezid will be used for GSEA in MAPA.
If the you upload other types of IDs, the entrezid will be matched by the first step of MAPA:
```
convert_id()
```
An order_by column
- Note: The column name should be either fc or p_value_adjust.
- Since multiple ranking metrics can be used in GSEA, it is important to provide guidance to users when choosing the appropriate metric. See:
  
  Zyla, J., Marczyk, M., Weiner, J. et al. Ranking metrics in gene set enrichment analysis: do they matter? BMC Bioinformatics 18, 256 (2017). https://doi.org/10.1186/s12859-017-1674-0
- In practice, MAPA does not enforce a fixed column name. Instead, the order_by parameter should be set to match the name of the column containing the chosen ranking metric.
A column named p_value_adjust, which represents the statistical significance for the analysis results.
- Why does p_value_adjust matter?
  For GSEA, no duplicated gene IDs are allowed.
  Currently, MAPA removes duplicated entrezid by keeping the one with the smallest p_value_adjust.
- NOTE: Only entrezid is used for performing GSEA in MAPA.
- The keytype for GO and KEGG can only be "ENTREZID" and "kegg" respectively.
- For Reactome, the internal function ReactomePA::gsePathway() only supports entrezid.

2.1.3 Get Our Example Data

MAPA provides several built-in demo datasets that help you to go through and study MAPA workflow such as pathway enrichment analysis in the next step Chapter 3, and integrated heatmap visualization in Chapter 8.

To load any dataset for the following step Chapter 3, use:

data(demo_data_ora)       # Gene ORA dataset
data(demo_data_gsea)      # Gene GSEA dataset
data(demo_data_met)       # Metabolite ORA dataset

Below are some information about these datasets:

Dataset	Content	Rows	Columns	Use	Source
`demo_data_ora`	66 significantly downregulated proteins (log2FC, FDR) from muscle tissue	66	3	Gene Over-Representation Analysis (ORA)	Takasugi et al., Nat Commun 15, 8520 (2024), DOI: 10.1038/s41467-024-52845-x
`demo_data_gsea`	5,290 proteins with fold changes and adjusted p-values from liver tissue	5,290	3	Gene Set Enrichment Analysis (GSEA)	Takasugi et al., Nat Commun 15, 8520 (2024), DOI: 10.1038/s41467-024-52845-x
`demo_data_met`	106 altered metabolites with KEGG IDs, FDR, and fold-change scores	106	4	Metabolite pathway enrichment analysis	In-house data

Over-Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)

For ORA, you only need one identifier column plus any additional metadata:

# Example ORA input data
data(demo_data_ora)
head(demo_data_ora)
# # A tibble: 66 × 3
#    symbol    `log2FC (6 vs 30mo)` `FDR (6 vs 30mo)`
#    <chr>                    <dbl>             <dbl>
#  1 Rpl38                   -0.590          1.77e- 6
#  2 Akap1                   -0.640          1.81e- 2
#  3 Actc1                   -2.65           3.24e-10
#  4 Amd1                    -1.12           9.45e- 6
#  5 Serpina1e               -0.650          1.38e- 3
#  6 Ensa                    -0.672          1.05e- 5
#  7 Mrps36                  -0.775          1.86e- 5
#  8 Aimp2                   -0.656          1.63e- 5
#  9 Ank1                    -0.547          5.11e- 7
# 10 Aqp1                    -0.564          1.50e- 2

For GSEA, you need:

One identifier column (ensembl, entrezid, uniprot, or symbol)
An order_by column containing numeric values to rank genes (This is used to create the ranked gene list required for GSEA. Common choices like Log2 fold change values, pvalue)

# Example GSEA input data
data(demo_data_gsea)
head(demo_data_gsea)
# # A tibble: 6 × 3
#   symbol        fc p_value_adjust
#   <chr>      <dbl>          <dbl>
# 1 Magohb    0.194       0.0129   
# 2 Cask     -0.0215      0.686    
# 3 Igkv5-39  2.16        0.0000853
# 4 Ighg2b    2.09        0.0000557
# 5 Ighg1     1.68        0.482    
# 6 Ighg3    -0.325       0.390

2.1.4 Metabolite Data

For metabolite analysis, mapa currently supports Over-Representation Analysis (ORA) only. You only need one identifier column plus any additional metadata (Column names must be in lowercase):

hmdbid: Human Metabolome Database IDs (e.g., “HMDB0000001”, “HMDB0000002”) - Human only
keggid: KEGG Compound IDs (e.g., “C00001”, “C00002”) - All organisms

Note

For human metabolites: You can use either hmdbid or keggid as input
For other organisms: Only keggid is supported
GSEA for metabolites is not currently supported

# Example metabolite input data
data(demo_data_met)
head(demo_data_met)
# # A tibble: 6 × 4
#   variable_id    keggid      fdr score
#   <chr>          <chr>     <dbl> <dbl>
# 1 M441T680_2_NEG NA     3.59e-16  1.11
# 2 M448T566_NEG   C05466 5.12e- 3  1.13
# 3 M229T462_NEG   C02678 7.43e- 4  1.20
# 4 M367T590_NEG   C04555 2.28e- 4  1.34
# 5 M221T50_POS    C13008 7.99e- 3  1.36
# 6 M544T604_POS   C04230 1.19e-18  1.53

2.2 Input Data Preprocessing

This step is required except for non-human metabolite analysis - mapa provides the convert_id() function to perform ID conversion for downstream analysis. Even if your data already contains one of the required identifier columns, it is highly recommended to run this function to ensure all necessary identifiers are present for pathway analysis.

2.2.1 Gene ID conversion

The convert_id() function converts between different ID types and always returns data with all four gene identifier columns (ensembl, entrezid, uniprot, symbol), which are required for mapa to work properly. Since organism-specific annotation database are required for ID conversion and the following enrichment analysis, choose the ID conversion method based on your organism.

Model Organisms
Non-Model Organisms

For common model organisms, use standard organism database Bioconductor packages.

Note

For most model organisms, Bioconductor already supplies curated organism annotation databases (the OrgDb packages). You can browse the complete list and install the one that matches your species of interest here.

# Mouse
variable_info <- convert_id(
  data = demo_ora_data,
  query_type = "gene",
  from_id_type = "symbol",  # This tells the function what your input column represents
  organism = "org.Mm.eg.db"  # install and load the package at first
)
head(variable_info)
# # A tibble: 6 × 6
#   ensembl           entrezid uniprot symbol `log2FC (6 vs 30mo)` `FDR (6 vs 30mo)`
#   <chr>             <chr>    <chr>   <chr>                 <dbl>             <dbl>
# 1 ENSMUSG000000573… 67671    Q52KP0  Rpl38                -0.590          1.77e- 6
# 2 ENSMUSG000000184… 11640    B1AR25  Akap1                -0.640          1.81e- 2
# 3 ENSMUSG000000686… 11464    P68033  Actc1                -2.65           3.24e-10
# 4 ENSMUSG000000752… 11702    P0DMN7  Amd1                 -1.12           9.45e- 6
# 5 ENSMUSG000000728… 20704    P81105  Serpi…               -0.650          1.38e- 3
# 6 ENSMUSG000000386… 56205    P60840  Ensa                 -0.672          1.05e- 5

For organisms without standard packages, use AnnotationHub IDs:

non_model_org_dt <- readr::read_csv("examples/example_non_model_org_data.csv")
# Macaca fascicularis (taxid: 9541)
variable_info <- convert_id(
  data = non_model_org_dt,
  query_type = "gene",
  from_id_type = "ensembl",
  ah_id = "AH119902",  # AnnotationHub ID for Macaca fascicularis
  return_orgdb = TRUE  # The organism database is needed for downstream analysis
)

# Successfully loaded organism database from AnnotationHub
# Database information:
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
# 
# Please see: help('select') for usage information
# Note: The following ID types are not available in the organism database and will be filled with NA: uniprot
# Available ID types in database: ACCNUM, ALIAS, ENSEMBL, ENTREZID, EVIDENCE, EVIDENCEALL, GENENAME, GID, GO, GOALL, ONTOLOGY, ONTOLOGYALL, PMID, REFSEQ, SYMBOL, UNIGENE
# 'select()' returned 1:many mapping between keys and columns
# Warning in clusterProfiler::bitr(geneID = data[[from_id_type]], fromType = from_clusterprofiler_type,  :
#   57.38% of input gene IDs are fail to map...

variable_info
# $data
# # A tibble: 183 × 8
#    ensembl            entrezid  uniprot symbol       Symbol  `Gene type`    Tissue     Cluster  
#    <chr>              <chr>     <chr>   <chr>        <chr>   <chr>          <chr>      <chr>    
#  1 ENSMFAG00000000264 NA        NA      NA           APH1B   protein_coding Liver (LL) Cluster U
#  2 ENSMFAG00000000283 NA        NA      NA           PLCG1   protein_coding Liver (LL) Cluster U
#  3 ENSMFAG00000000444 102141791 NA      EPS15L1      EPS15L1 protein_coding Liver (LL) Cluster U
#  4 ENSMFAG00000000464 NA        NA      NA           RCOR3   protein_coding Liver (LL) Cluster U
#  5 ENSMFAG00000000583 101865185 NA      EXOC7        EXOC7   protein_coding Liver (LL) Cluster U
#  6 ENSMFAG00000000608 NA        NA      NA           CD99    protein_coding Liver (LL) Cluster U
#  7 ENSMFAG00000000627 102121020 NA      MTFMT        MTFMT   protein_coding Liver (LL) Cluster U
#  8 ENSMFAG00000000639 NA        NA      NA           GJC2    protein_coding Liver (LL) Cluster U
#  9 ENSMFAG00000000737 101866375 NA      MKRN1        MKRN1   protein_coding Liver (LL) Cluster U
# 10 ENSMFAG00000000852 102131151 NA      LOC102131151 CYB5R1  protein_coding Liver (LL) Cluster U
# # ℹ 173 more rows
# # ℹ Use `print(n = ...)` to see more rows
# 
# $orgdb
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
# Please see: help('select') for usage information

Tip

How to find AnnotationHub IDs according to taxid?

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("AnnotationHub")

library(AnnotationHub)
ah <- AnnotationHub()
# Search for your organism database by taxid (here taxid is "9541")
query_result <- AnnotationHub::query(ah, c("OrgDb", "9541"))
# Use the ah_id from the results
query_result$ah_id
# [1] "AH119899" "AH119900" "AH119901" "AH119902"

2.2.2 Metabolite ID Conversion

For human metabolite data, convert IDs bwteen KEGG and HMDB:

# Convert KEGG IDs to HMDB IDs (human only)
met_variable_info <- convert_id(
  data = demo_data_met,
  query_type = "metabolite", 
  from_id_type = "keggid",
  organism = "hsa" # KEGG organism code
)

# Result will include both hmdbid and keggid columns
head(met_variable_info)
# # A tibble: 6 × 5
#   variable_id  keggid      fdr score hmdbid     
#   <chr>        <chr>     <dbl> <dbl> <chr>      
# 1 M448T566_NEG C05466 5.12e- 3  1.13 HMDB0000637
# 2 M229T462_NEG C02678 7.43e- 4  1.20 HMDB0000623
# 3 M367T590_NEG C04555 2.28e- 4  1.34 HMDB0001032
# 4 M221T50_POS  C13008 7.99e- 3  1.36 HMDB0006240
# 5 M544T604_POS C04230 1.19e-18  1.53 HMDB0002815
# 6 M544T604_POS C04230 1.19e-18  1.53 HMDB0010380

2.3 Next Step

Once your data meets the format requirements, you can proceed to enrichment analysis to begin your mapa workflow.