2 Data input and preprocessing

This chapter covers the input data requirements for mapa analysis, including the expected data formats and data preprocessing for converting your data to the required format.

library(mapa)

2.1 Input Data Format Requirements

2.1.1 Gene Expression Data

mapa accepts gene expression data with flexible identifier requirements, depending on your analysis type. Your input data must contain at least one of the following identifier columns:

ensembl: Ensembl gene IDs (e.g., “ENSG00000141510”)
entrezid: NCBI Entrez gene IDs (e.g., “7157”)
uniprot: UniProtKB accession numbers (e.g., “P04637”)
symbol: Gene symbols (e.g., “TP53”)

Important

Column names must be in lowercase.

Over-Representation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)

For ORA, you only need one identifier column plus any additional metadata:

# Example ORA input data
ora_data <- readr::read_csv("examples/example_ora_gene_data.csv")
head(ora_data)
# # A tibble: 6 × 2
#   variable_id ensembl        
#   <chr>       <chr>          
# 1 gene_1      ENSG00000100097
# 2 gene_2      ENSG00000139193
# 3 gene_3      ENSG00000163513
# 4 gene_4      ENSG00000127863
# 5 gene_5      ENSG00000115604
# 6 gene_6      ENSG00000153002

For GSEA, you need:

One identifier column (ensembl, entrezid, uniprot, or symbol)
An order_by column containing numeric values to rank genes (This is used to create the ranked gene list required for GSEA. Common choices like Log2 fold change values, pvalue)

# Example GSEA input data
gsea_data <- readr::read_csv("examples/example_gsea_data.csv")
head(gsea_data)
# # A tibble: 6 × 8
#   variable_id ensembl         genename                                                 genetype       length     fc  p_value p_value_adjust
#   <chr>       <chr>           <chr>                                                    <chr>           <dbl>  <dbl>    <dbl>          <dbl>
# 1 gene_1      ENSG00000160072 ATPase family AAA domain containing 3B                   protein-coding  26084 0.358  0.00517         0.0202 
# 2 gene_11     ENSG00000157911 peroxisomal biogenesis factor 10                         protein-coding   9834 0.234  0.000261        0.00514
# 3 gene_12     ENSG00000269896 small nuclear ribonucleoprotein polypeptide N pseudogene pseudo            214 0.0665 0.00884         0.0277 
# 4 gene_16     ENSG00000157933 SKI proto-oncogene                                       protein-coding  82826 0.232  0.000220        0.00490
# 5 gene_29     ENSG00000149527 phospholipase C eta 2                                    protein-coding  79553 0.0373 0.00720         0.0244 
# 6 gene_31     ENSG00000171621 splA/ryanodine receptor domain and SOCS box containing 1 protein-coding  76639 2.68   0.00636         0.0226

2.1.2 Metabolite Data

For metabolite analysis, mapa currently supports Over-Representation Analysis (ORA) only. You only need one identifier column plus any additional metadata (Column names must be in lowercase):

hmdbid: Human Metabolome Database IDs (e.g., “HMDB0000001”, “HMDB0000002”) - Human only
keggid: KEGG Compound IDs (e.g., “C00001”, “C00002”) - All organisms

Note

For human metabolites: You can use either hmdbid or keggid as input
For other organisms: Only keggid is supported
GSEA for metabolites is not currently supported

# Example metabolite input data with HMDB IDs (human only)
metabolite_data <- readr::read_csv("examples/example_metabolite_data.csv")
head(metabolite_data)
# # A tibble: 6 × 4
#   variable_id    keggid      fdr score
#   <chr>          <chr>     <dbl> <dbl>
# 1 M441T680_2_NEG NA     3.59e-16  1.11
# 2 M448T566_NEG   C05466 5.12e- 3  1.13
# 3 M229T462_NEG   C02678 7.43e- 4  1.20
# 4 M367T590_NEG   C04555 2.28e- 4  1.34
# 5 M221T50_POS    C13008 7.99e- 3  1.36
# 6 M544T604_POS   C04230 1.19e-18  1.53

2.2 Input Data Preprocessing

This step is required except for non-human metabolite analysis - mapa provides the convert_id() function to perform ID conversion for downstream analysis. Even if your data already contains one of the required identifier columns, you must run this function to ensure all necessary identifiers are present for pathway analysis.

2.2.1 Gene ID conversion

The convert_id() function converts between different ID types and always returns data with all four gene identifier columns (ensembl, entrezid, uniprot, symbol), which are required for mapa to work properly. Since organism-specific annotation database are required for ID conversion and the following enrichment analysis, choose the ID conversion method based on your organism.

Model Organisms
Non-Model Organisms

For common model organisms, use standard organism database Bioconductor packages.

Note

For most model organisms, Bioconductor already supplies curated organism annotation databases (the OrgDb packages). You can browse the complete list and install the one that matches your species of interest here.

# Human
variable_info <- convert_id(
  data = ora_data,
  query_type = "gene",
  from_id_type = "ensembl",  # This tells the function what your input column represents
  organism = "org.Hs.eg.db"  # install the package at first
)
head(variable_info)
# # A tibble: 6 × 5
#   ensembl         entrezid uniprot    symbol   variable_id
#   <chr>           <chr>    <chr>      <chr>    <chr>      
# 1 ENSG00000100097 3956     A0A384MR27 LGALS1   gene_1     
# 2 ENSG00000139193 939      B2RDZ0     CD27     gene_2     
# 3 ENSG00000163513 7048     A0AAQ5BI03 TGFBR2   gene_3     
# 4 ENSG00000127863 55504    Q9NS68     TNFRSF19 gene_4     
# 5 ENSG00000115604 8809     Q13478     IL18R1   gene_5     
# 6 ENSG00000153002 1360     O60834     CPB1     gene_6

For organisms without standard packages, use AnnotationHub IDs:

non_model_org_dt <- readr::read_csv("examples/example_non_model_org_data.csv")
# Macaca fascicularis (taxid: 9541)
variable_info <- convert_id(
  data = non_model_org_dt,
  query_type = "gene",
  from_id_type = "ensembl",
  ah_id = "AH119902",  # AnnotationHub ID for Macaca fascicularis
  return_orgdb = TRUE  # The organism database is needed for downstream analysis
)

# Successfully loaded organism database from AnnotationHub
# Database information:
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
# 
# Please see: help('select') for usage information
# Note: The following ID types are not available in the organism database and will be filled with NA: uniprot
# Available ID types in database: ACCNUM, ALIAS, ENSEMBL, ENTREZID, EVIDENCE, EVIDENCEALL, GENENAME, GID, GO, GOALL, ONTOLOGY, ONTOLOGYALL, PMID, REFSEQ, SYMBOL, UNIGENE
# 'select()' returned 1:many mapping between keys and columns
# Warning in clusterProfiler::bitr(geneID = data[[from_id_type]], fromType = from_clusterprofiler_type,  :
#   57.38% of input gene IDs are fail to map...

variable_info
# $data
# # A tibble: 183 × 8
#    ensembl            entrezid  uniprot symbol       Symbol  `Gene type`    Tissue     Cluster  
#    <chr>              <chr>     <chr>   <chr>        <chr>   <chr>          <chr>      <chr>    
#  1 ENSMFAG00000000264 NA        NA      NA           APH1B   protein_coding Liver (LL) Cluster U
#  2 ENSMFAG00000000283 NA        NA      NA           PLCG1   protein_coding Liver (LL) Cluster U
#  3 ENSMFAG00000000444 102141791 NA      EPS15L1      EPS15L1 protein_coding Liver (LL) Cluster U
#  4 ENSMFAG00000000464 NA        NA      NA           RCOR3   protein_coding Liver (LL) Cluster U
#  5 ENSMFAG00000000583 101865185 NA      EXOC7        EXOC7   protein_coding Liver (LL) Cluster U
#  6 ENSMFAG00000000608 NA        NA      NA           CD99    protein_coding Liver (LL) Cluster U
#  7 ENSMFAG00000000627 102121020 NA      MTFMT        MTFMT   protein_coding Liver (LL) Cluster U
#  8 ENSMFAG00000000639 NA        NA      NA           GJC2    protein_coding Liver (LL) Cluster U
#  9 ENSMFAG00000000737 101866375 NA      MKRN1        MKRN1   protein_coding Liver (LL) Cluster U
# 10 ENSMFAG00000000852 102131151 NA      LOC102131151 CYB5R1  protein_coding Liver (LL) Cluster U
# # ℹ 173 more rows
# # ℹ Use `print(n = ...)` to see more rows
# 
# $orgdb
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
# Please see: help('select') for usage information

Tip

How to find AnnotationHub IDs according to taxid?

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("AnnotationHub")

library(AnnotationHub)
ah <- AnnotationHub()
# Search for your organism database by taxid (here taxid is "9541")
query_result <- AnnotationHub::query(ah, c("OrgDb", "9541"))
# Use the ah_id from the results
query_result$ah_id
# [1] "AH119899" "AH119900" "AH119901" "AH119902"

2.2.2 Metabolite ID Conversion

For human metabolite data, convert IDs bwteen KEGG and HMDB:

# Convert KEGG IDs to HMDB IDs (human only)
variable_info <- convert_id(
  data = metabolite_data,
  query_type = "metabolite", 
  from_id_type = "keggid",
  organism = "hsa" # KEGG organism code
)

# Result will include both hmdbid and keggid columns
head(variable_info)
# # A tibble: 6 × 5
#   variable_id  keggid      fdr score hmdbid     
#   <chr>        <chr>     <dbl> <dbl> <chr>      
# 1 M448T566_NEG C05466 5.12e- 3  1.13 HMDB0000637
# 2 M229T462_NEG C02678 7.43e- 4  1.20 HMDB0000623
# 3 M367T590_NEG C04555 2.28e- 4  1.34 HMDB0001032
# 4 M221T50_POS  C13008 7.99e- 3  1.36 HMDB0006240
# 5 M544T604_POS C04230 1.19e-18  1.53 HMDB0002815
# 6 M544T604_POS C04230 1.19e-18  1.53 HMDB0010380

2.3 Next Steps

Once your data meets the format requirements, you can proceed to enrichment analysis to begin your mapa workflow.