library(mapa)
2 Data input and preprocessing
This chapter covers the input data requirements for mapa analysis, including the expected data formats and data preprocessing for converting your data to the required format.
2.1 Input Data Format Requirements
2.1.1 Gene Expression Data
mapa accepts gene expression data with flexible identifier requirements, depending on your analysis type. Your input data must contain at least one of the following identifier columns:
ensembl
: Ensembl gene IDs (e.g., “ENSG00000141510”)entrezid
: NCBI Entrez gene IDs (e.g., “7157”)uniprot
: UniProtKB accession numbers (e.g., “P04637”)symbol
: Gene symbols (e.g., “TP53”)
Column names must be in lowercase.
For ORA, you only need one identifier column plus any additional metadata:
# Example ORA input data
<- readr::read_csv("examples/example_ora_gene_data.csv")
ora_data head(ora_data)
# # A tibble: 6 × 2
# variable_id ensembl
# <chr> <chr>
# 1 gene_1 ENSG00000100097
# 2 gene_2 ENSG00000139193
# 3 gene_3 ENSG00000163513
# 4 gene_4 ENSG00000127863
# 5 gene_5 ENSG00000115604
# 6 gene_6 ENSG00000153002
For GSEA, you need:
- One identifier column (ensembl, entrezid, uniprot, or symbol)
- An
order_by
column containing numeric values to rank genes (This is used to create the ranked gene list required for GSEA. Common choices like Log2 fold change values, pvalue)
# Example GSEA input data
<- readr::read_csv("examples/example_gsea_data.csv")
gsea_data head(gsea_data)
# # A tibble: 6 × 8
# variable_id ensembl genename genetype length fc p_value p_value_adjust
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 gene_1 ENSG00000160072 ATPase family AAA domain containing 3B protein-coding 26084 0.358 0.00517 0.0202
# 2 gene_11 ENSG00000157911 peroxisomal biogenesis factor 10 protein-coding 9834 0.234 0.000261 0.00514
# 3 gene_12 ENSG00000269896 small nuclear ribonucleoprotein polypeptide N pseudogene pseudo 214 0.0665 0.00884 0.0277
# 4 gene_16 ENSG00000157933 SKI proto-oncogene protein-coding 82826 0.232 0.000220 0.00490
# 5 gene_29 ENSG00000149527 phospholipase C eta 2 protein-coding 79553 0.0373 0.00720 0.0244
# 6 gene_31 ENSG00000171621 splA/ryanodine receptor domain and SOCS box containing 1 protein-coding 76639 2.68 0.00636 0.0226
2.1.2 Metabolite Data
For metabolite analysis, mapa currently supports Over-Representation Analysis (ORA) only. You only need one identifier column plus any additional metadata (Column names must be in lowercase):
hmdbid
: Human Metabolome Database IDs (e.g., “HMDB0000001”, “HMDB0000002”) - Human onlykeggid
: KEGG Compound IDs (e.g., “C00001”, “C00002”) - All organisms
- For human metabolites: You can use either hmdbid or keggid as input
- For other organisms: Only keggid is supported
- GSEA for metabolites is not currently supported
# Example metabolite input data with HMDB IDs (human only)
<- readr::read_csv("examples/example_metabolite_data.csv")
metabolite_data head(metabolite_data)
# # A tibble: 6 × 4
# variable_id keggid fdr score
# <chr> <chr> <dbl> <dbl>
# 1 M441T680_2_NEG NA 3.59e-16 1.11
# 2 M448T566_NEG C05466 5.12e- 3 1.13
# 3 M229T462_NEG C02678 7.43e- 4 1.20
# 4 M367T590_NEG C04555 2.28e- 4 1.34
# 5 M221T50_POS C13008 7.99e- 3 1.36
# 6 M544T604_POS C04230 1.19e-18 1.53
2.2 Input Data Preprocessing
This step is required except for non-human metabolite analysis - mapa provides the convert_id()
function to perform ID conversion for downstream analysis. Even if your data already contains one of the required identifier columns, you must run this function to ensure all necessary identifiers are present for pathway analysis.
2.2.1 Gene ID conversion
The convert_id()
function converts between different ID types and always returns data with all four gene identifier columns (ensembl
, entrezid
, uniprot
, symbol
), which are required for mapa to work properly. Since organism-specific annotation database are required for ID conversion and the following enrichment analysis, choose the ID conversion method based on your organism.
For common model organisms, use standard organism database Bioconductor packages.
For most model organisms, Bioconductor already supplies curated organism annotation databases (the OrgDb packages). You can browse the complete list and install the one that matches your species of interest here.
# Human
<- convert_id(
variable_info data = ora_data,
query_type = "gene",
from_id_type = "ensembl", # This tells the function what your input column represents
organism = "org.Hs.eg.db" # install the package at first
)head(variable_info)
# # A tibble: 6 × 5
# ensembl entrezid uniprot symbol variable_id
# <chr> <chr> <chr> <chr> <chr>
# 1 ENSG00000100097 3956 A0A384MR27 LGALS1 gene_1
# 2 ENSG00000139193 939 B2RDZ0 CD27 gene_2
# 3 ENSG00000163513 7048 A0AAQ5BI03 TGFBR2 gene_3
# 4 ENSG00000127863 55504 Q9NS68 TNFRSF19 gene_4
# 5 ENSG00000115604 8809 Q13478 IL18R1 gene_5
# 6 ENSG00000153002 1360 O60834 CPB1 gene_6
For organisms without standard packages, use AnnotationHub IDs:
<- readr::read_csv("examples/example_non_model_org_data.csv")
non_model_org_dt # Macaca fascicularis (taxid: 9541)
<- convert_id(
variable_info data = non_model_org_dt,
query_type = "gene",
from_id_type = "ensembl",
ah_id = "AH119902", # AnnotationHub ID for Macaca fascicularis
return_orgdb = TRUE # The organism database is needed for downstream analysis
)
# Successfully loaded organism database from AnnotationHub
# Database information:
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
#
# Please see: help('select') for usage information
# Note: The following ID types are not available in the organism database and will be filled with NA: uniprot
# Available ID types in database: ACCNUM, ALIAS, ENSEMBL, ENTREZID, EVIDENCE, EVIDENCEALL, GENENAME, GID, GO, GOALL, ONTOLOGY, ONTOLOGYALL, PMID, REFSEQ, SYMBOL, UNIGENE
# 'select()' returned 1:many mapping between keys and columns
# Warning in clusterProfiler::bitr(geneID = data[[from_id_type]], fromType = from_clusterprofiler_type, :
# 57.38% of input gene IDs are fail to map...
variable_info# $data
# # A tibble: 183 × 8
# ensembl entrezid uniprot symbol Symbol `Gene type` Tissue Cluster
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 ENSMFAG00000000264 NA NA NA APH1B protein_coding Liver (LL) Cluster U
# 2 ENSMFAG00000000283 NA NA NA PLCG1 protein_coding Liver (LL) Cluster U
# 3 ENSMFAG00000000444 102141791 NA EPS15L1 EPS15L1 protein_coding Liver (LL) Cluster U
# 4 ENSMFAG00000000464 NA NA NA RCOR3 protein_coding Liver (LL) Cluster U
# 5 ENSMFAG00000000583 101865185 NA EXOC7 EXOC7 protein_coding Liver (LL) Cluster U
# 6 ENSMFAG00000000608 NA NA NA CD99 protein_coding Liver (LL) Cluster U
# 7 ENSMFAG00000000627 102121020 NA MTFMT MTFMT protein_coding Liver (LL) Cluster U
# 8 ENSMFAG00000000639 NA NA NA GJC2 protein_coding Liver (LL) Cluster U
# 9 ENSMFAG00000000737 101866375 NA MKRN1 MKRN1 protein_coding Liver (LL) Cluster U
# 10 ENSMFAG00000000852 102131151 NA LOC102131151 CYB5R1 protein_coding Liver (LL) Cluster U
# # ℹ 173 more rows
# # ℹ Use `print(n = ...)` to see more rows
#
# $orgdb
# OrgDb object:
# | DBSCHEMAVERSION: 2.1
# | DBSCHEMA: NOSCHEMA_DB
# | ORGANISM: Simia fascicularis
# | SPECIES: Simia fascicularis
# | CENTRALID: GID
# | Taxonomy ID: 9541
# | Db type: OrgDb
# | Supporting package: AnnotationDbi
# Please see: help('select') for usage information
How to find AnnotationHub IDs according to taxid?
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("AnnotationHub")
BiocManager
library(AnnotationHub)
<- AnnotationHub()
ah # Search for your organism database by taxid (here taxid is "9541")
<- AnnotationHub::query(ah, c("OrgDb", "9541"))
query_result # Use the ah_id from the results
$ah_id
query_result# [1] "AH119899" "AH119900" "AH119901" "AH119902"
2.2.2 Metabolite ID Conversion
For human metabolite data, convert IDs bwteen KEGG and HMDB:
# Convert KEGG IDs to HMDB IDs (human only)
<- convert_id(
variable_info data = metabolite_data,
query_type = "metabolite",
from_id_type = "keggid",
organism = "hsa" # KEGG organism code
)
# Result will include both hmdbid and keggid columns
head(variable_info)
# # A tibble: 6 × 5
# variable_id keggid fdr score hmdbid
# <chr> <chr> <dbl> <dbl> <chr>
# 1 M448T566_NEG C05466 5.12e- 3 1.13 HMDB0000637
# 2 M229T462_NEG C02678 7.43e- 4 1.20 HMDB0000623
# 3 M367T590_NEG C04555 2.28e- 4 1.34 HMDB0001032
# 4 M221T50_POS C13008 7.99e- 3 1.36 HMDB0006240
# 5 M544T604_POS C04230 1.19e-18 1.53 HMDB0002815
# 6 M544T604_POS C04230 1.19e-18 1.53 HMDB0010380
2.3 Next Steps
Once your data meets the format requirements, you can proceed to enrichment analysis to begin your mapa workflow.