Best Practices for Interpreting Omics Data with Pathway Enrichment Analysis

Why This Topic Matters

Pathway enrichment is powerful — and deceptively simple.

✓ A go-to method
For interpreting high-dimensional omics data

✓ Widely used in cancer research
To extract biological meaning from gene lists

❗ But beware:
- Easy to misuse and misreport with unclear inputs and online tools
- Overwhelming: many options for databases, methods, and cutoffs

Outline

1. What is pathway enrichment analysis (PEA)?

2. Overview of methods: ORA, GSEA, Topology

3. Running a PEA: Test, Input, Background, Databases, Online Tools

4. Common pitfalls & tips for better PEA

5. Visualization and reproducibility

6. Open Questions

Main Sources:

Interpreting omics data with pathway enrichment analysis

Nine quick tips for pathway enrichment analysis

What Is Pathway Enrichment Analysis?

Statistical approach to identify biological pathways whose genes show non-random patterns in an omics dataset
Inputs: either a gene list or a full ranked list of features from an omics dataset
Outputs: enriched biological processes or pathways
Common in web tools: g:Profiler, Enrichr, DAVID, Reactome, ExpressAnalyst, etc.

Methods of Pathway Enrichment Analysis

1. Over-representation analysis (ORA):
- Uses a gene list and a background.
- Compares observed vs. expected overlap.

2. Gene Set Enrichment Analysis (GSEA):
- Uses a full ranked list of genes.
- Captures subtle, coordinated effects.

3. Topology-based analysis:
- Uses gene–gene relationships in pathway networks.

The Web Tool Black Box Problem

100 random unrelated genes

Visit Enrichr

Problems here:
- Background is offered as a “new option”, not as a requirement, which it is.
- The website doesn’t state which statistical test is being done. It refers to this set of tools as “gene set enrichment analysis”, which is confusing: GSEA is a specific statistical algorithm invented for pathway analysis, and this isn’t it.
- Databases are out of date, and there is no easy way to record what was used (not reproducible).
- No guide to interpretation: results are simply ranked by adjusted p-value.

Let’s talk about p-value adjustment. When you run multiple tests, you should adjust your p-values, because the more tests you perform, the more likely you are to get significant results by chance alone. Consider this: there are 625 pathways in this database. With an unadjusted p-value threshold of 0.05, we’d expect around 31 false positives (625 × 0.05) just by random chance. Adjustment reduces that number, but it does not guarantee that every remaining hit is biologically meaningful.

In this case, we happen to know that our list of 100 genes is, as best we can estimate, unrelated — so we’re not surprised that some results still come up significant. But what if you do believe your genes are biologically related? What if you’re excited by these results and decide to include them in a paper or presentation?

There’s no rule saying you can’t do that — but this example shows why you might want to pause and reflect. Tools like this are incredibly useful, but interpreting them wisely takes context and caution. We’ll walk through how to use these tools to generate insight, not just output.
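To make the multiple-testing arithmetic concrete, here is a minimal Python sketch. It assumes the 625 pathway tests behave like independent null tests with uniform p-values, which is an idealisation (real pathway p-values are discrete and correlated because pathways share genes). The Benjamini-Hochberg function is a plain re-implementation for illustration, not the code any particular web tool runs.

```python
import random

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

N_PATHWAYS = 625   # number of pathways in the database, from the example above
ALPHA = 0.05

# Assumption: under the null, each pathway's p-value is uniform on [0, 1).
rng = random.Random(0)
null_pvalues = [rng.random() for _ in range(N_PATHWAYS)]

expected_false_positives = N_PATHWAYS * ALPHA   # 625 * 0.05 = 31.25
raw_hits = sum(p < ALPHA for p in null_pvalues)
adjusted_hits = sum(q < ALPHA for q in benjamini_hochberg(null_pvalues))

print("expected unadjusted false positives:", expected_false_positives)
print("observed unadjusted hits:", raw_hits)
print("hits after BH adjustment:", adjusted_hits)
```

The adjusted count can never exceed the unadjusted count, which is the point of the correction; it says nothing about whether the surviving terms make biological sense.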

Over-representation analysis (ORA):

Examine whether any pathways are observed in a gene list of interest more than expected by chance compared with a background set

Uses DEGs and a background
Compares observed vs expected overlap
P-values are adjusted for multiple-testing!

Kalyanee. ORA methods use a hypergeometric test (or the very closely related Fisher’s exact test) to test whether the overlap between two lists is larger than expected by chance. Over-representation-based methods are conceptually straightforward but have several limitations, such as assuming that genes are independent of one another and requiring an arbitrary cutoff to define the set of differentially expressed genes.

Suppose you have:

50 upregulated genes (your list of interest).

10,000 total expressed genes (background set).

Let’s say Pathway X contains 100 genes out of the 10,000. Of your 50 upregulated genes, 10 are part of Pathway X.

Key question: Is having 10 out of your 50 genes in Pathway X significantly more than you’d expect just by random chance, given that Pathway X is only 1% (100/10,000) of all possible genes?

ORA doesn’t tell you how strongly a pathway is represented in your list of interesting genes; it tells you how disproportionately the pathway is represented in your list relative to the background.
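The worked example above can be computed directly. This is a minimal sketch of a one-sided hypergeometric test using only the Python standard library; the function name `hypergeom_upper_tail` is made up for illustration, and the numbers are the ones from the example (50 genes of interest, 10,000 background genes, 100 genes in Pathway X, 10 in the overlap).

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the pathway."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

N = 10_000   # background: total expressed genes
K = 100      # genes in Pathway X
n = 50       # upregulated genes (list of interest)
k = 10       # upregulated genes that are in Pathway X

expected_overlap = n * K / N              # overlap expected by chance
fold_enrichment = (k / n) / (K / N)       # observed ratio vs. background ratio
p_value = hypergeom_upper_tail(k, N, K, n)

print("expected overlap:", expected_overlap)
print("fold enrichment:", fold_enrichment)
print("one-sided p-value:", p_value)
```

With an expected overlap of half a gene, observing 10 is very unlikely by chance, so this pathway would be reported as over-represented. In a real analysis this p-value is one of many and still needs multiple-testing adjustment.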

ORA is not recommended in some cases:
1. What happens when you combine the up- and downregulated gene lists into a single input? Are your results still valid?
2. How do duplicate genes in the input list and background affect the result? Could they cause artificial over-representation?

Gene Set Enrichment Analysis (GSEA):

GSEA first ranks the full gene set on the basis of a detected signal, such as change in gene expression, and then tests whether genes annotated to the same pathway tend to cluster together at the top (or bottom) of the ranked list.

Uses a full ranked list of genes
Captures subtle, coordinated effects
P-values are adjusted for multiple-testing!
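The idea of "clustering at the top or bottom of the ranked list" can be sketched with a running-sum statistic. The code below is a deliberately simplified, unweighted version for intuition only: it is not the published GSEA algorithm, which weights steps by the ranking metric and estimates significance by permutation. Gene names are placeholders, and the ranked list is assumed to contain each gene once.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up at pathway genes, down otherwise.

    Returns the largest deviation from zero (positive = enriched near the top,
    negative = enriched near the bottom).
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        if gene in members:
            running += 1.0 / n_hits
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Placeholder ranked list, from most upregulated (G1) to most downregulated (G10).
ranked = [f"G{i}" for i in range(1, 11)]

top_score = enrichment_score(ranked, {"G1", "G2", "G3"})         # all at the top
bottom_score = enrichment_score(ranked, {"G8", "G9", "G10"})     # all at the bottom
scattered_score = enrichment_score(ranked, {"G1", "G5", "G10"})  # spread out

print(top_score, bottom_score, scattered_score)
```

A set concentrated at either end produces a large positive or negative score, while a scattered set stays close to zero. No cutoff on individual genes is needed, which is why this family of methods can pick up subtle, coordinated shifts.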

Topology-based analysis:

Account for additional information that impacts pathway activity by integrating scores measuring gene positions within a pathway and gene–gene interactions into the enrichment tests.

Aim to increase the sensitivity of pathway enrichment analysis by considering genes’ “co-expression”
Requires experimental evidence for pathway structures and gene–gene interactions

General Workflow

Start with omics data (e.g., RNA-seq, ATAC-seq, etc.)
Select statistical method (ORA, GSEA, topology-based)
Choose your input (e.g. gene list)
Choose annotation database (e.g., GO, KEGG, Reactome)
Perform PEA and visualize
Document all assumptions and parameters

The Importance of Input and Background Sets

Input set: filtered list of genes, proteins, or metabolites
Background set: all features detected in the experiment
❗ Using all genes in the genome as background = common mistake
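A small sketch of preparing the input set against the background: remove duplicates and flag any gene that was not actually detected in the experiment. The helper name `clean_gene_list` and the example symbols are illustrative; identifiers are only stripped of whitespace here, and any identifier mapping (for example, aliases to official symbols) is assumed to have been done beforehand.

```python
def clean_gene_list(genes, background):
    """Deduplicate an input list and split it by membership in the background.

    Returns (kept, dropped): genes found in the background, in first-seen
    order, and genes that were not detected and so cannot be tested fairly.
    """
    background_set = {g.strip() for g in background if g.strip()}
    seen, kept, dropped = set(), [], []
    for gene in genes:
        symbol = gene.strip()
        if not symbol or symbol in seen:
            continue  # skip blanks and duplicates
        seen.add(symbol)
        if symbol in background_set:
            kept.append(symbol)
        else:
            dropped.append(symbol)
    return kept, dropped

# Illustrative input with a duplicate, stray whitespace, and an undetected gene.
input_genes = ["TP53", "BRCA1", "TP53", " MYC ", "GENE_NOT_DETECTED"]
detected_genes = ["TP53", "BRCA1", "MYC", "EGFR"]

kept, dropped = clean_gene_list(input_genes, detected_genes)
print("kept:", kept)
print("dropped:", dropped)
```

Reporting how many genes were dropped at this step is a cheap way to document the input and background definitions.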

Example: Why Background Matters

RNA-seq study using genome-wide vs. expressed-gene background
Only 44% overlap in enriched pathways
Use actual detected features as background!
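To see why the background changes the answer, the sketch below runs the same hypergeometric test twice on one overlap. All numbers are hypothetical and chosen only for illustration: a 50-gene list with 3 genes in a pathway, tested once against 10,000 detected genes (100 of them in the pathway) and once against a 20,000-gene genome (120 in the pathway).

```python
from math import comb

def ora_pvalue(overlap, background_size, pathway_size, list_size):
    """Upper-tail hypergeometric p-value: P(X >= overlap)."""
    upper = min(pathway_size, list_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background_size, list_size)

# Hypothetical numbers: identical gene list and overlap, different backgrounds.
p_expressed = ora_pvalue(3, 10_000, 100, 50)   # background = detected genes
p_genome = ora_pvalue(3, 20_000, 120, 50)      # background = whole genome

print("p-value, expressed-gene background:", p_expressed)
print("p-value, genome-wide background:", p_genome)
```

With these numbers the genome-wide background makes the same overlap look more significant, because genes that could never have been detected dilute the expected overlap.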

Reference Annotation Databases

KEGG, Reactome, GO, MetaCyc, EcoCyc, etc.
Pathway size and definitions vary between databases
Use the most updated versions
Report: database, version, date

Web Tool Summary Table

Tool	ORA	GSEA	Topology	GO	KEGG	Reactome	MSigDB	Other Databases
g:Profiler	✅	🔶		✅	✅	✅	✅	TRANSFAC, miRTarBase, WikiPathways
Enrichr	✅			✅	✅	✅	✅	ChEA, DrugMatrix, TF/miRNA
DAVID	✅			✅	✅	✅		Panther, BioCarta
WebGestalt	✅	✅	✅	✅	✅	✅	✅	WikiPathways, user-defined sets
Reactome	✅	✅				✅
PantherDB	✅	✅		✅				Panther Pathways
Metascape	✅			✅	✅	✅	✅	CORUM, WikiPathways
ShinyGO	✅			✅	✅		🔸	Limited subset of MSigDB
PathDIP	✅		✅	✅	✅	✅		PID, BioCarta, PPI-aware pathways
GSEA-MSigDB		✅		✅	✅	✅	✅	Hallmark, C1–C7 collections
ExpressAnalyst	✅	✅		✅	✅	✅	✅	BioCarta, WikiPathways
Cytoscape EnrichMap		✅		Any	Any	Any	Any	Visualization

9 Quick Tips to Avoid PEA Pitfalls

Know what analysis type fits your data
Clean, validated input gene list only
Use >1 PEA tool, compare results
Document all tool versions & parameters
Always use adjusted p-values
Choose statistical tests & visualizations wisely
Consider analyzing gene subgroups/networks
Validate with recent literature
Review with a wet-lab biologist or clinician

Common Errors in Published PEA Studies

90% of surveyed studies used incorrect background sets
16 of 25 popular tools used outdated databases
Up to 40% more false positives without multiple testing correction

Visualizing Your Results

Visualization improves interpretation
Use clustering to group redundant terms
Visualization examples
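One simple way to group redundant terms is by gene overlap. The sketch below uses Jaccard similarity with a greedy grouping rule; the 0.5 threshold, the term names, and the gene labels are all illustrative assumptions, and dedicated tools use more refined clustering.

```python
def jaccard(a, b):
    """Jaccard similarity of two gene collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_redundant_terms(term_genes, threshold=0.5):
    """Greedy grouping: a term joins the first group whose representative
    (the group's first term) shares enough genes with it."""
    groups = []  # list of (representative term, member terms)
    for term, genes in term_genes.items():
        for representative, members in groups:
            if jaccard(genes, term_genes[representative]) >= threshold:
                members.append(term)
                break
        else:
            groups.append((term, [term]))
    return [members for _, members in groups]

# Illustrative enriched terms, ordered from most to least significant.
enriched = {
    "cell cycle": {"A", "B", "C", "D"},
    "mitotic cell cycle": {"A", "B", "C"},
    "DNA repair": {"E", "F", "G"},
    "double-strand break repair": {"E", "F"},
    "lipid metabolism": {"H", "I"},
}

for group in group_redundant_terms(enriched):
    print(group)
```

Because terms are visited in order, passing them sorted by adjusted p-value makes the most significant term of each group its representative.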

Key plot labels:

ORA (dotplot): Gene ratio (x-axis), count (point size), adjusted p-value (color)
GSEA (ridgeplot): logFC (x-axis), term (y-axis), adjusted p-value (color)

Reporting is Key

The result of pathway enrichment depends on the data, assumptions, and tools you use.
Therefore, these must be reported in your presentations and publications:
- Gene list origin and filtering criteria
- Background set definition
- Tool name, version, database, and date
- Statistical method, FDR threshold
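One way to make this reporting routine is to save a small machine-readable record next to every result. The sketch below is a suggestion only: the field names mirror the checklist above, and every value is a placeholder to be replaced with what was actually used.

```python
import json

REQUIRED_FIELDS = [
    "gene_list_origin", "filtering_criteria", "background_definition",
    "tool_name", "tool_version", "database", "database_date",
    "statistical_method", "fdr_threshold",
]

def missing_fields(record):
    """Return the checklist fields that are absent or left empty."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]

# Placeholder values only; fill in the details of the real analysis.
analysis_record = {
    "gene_list_origin": "<experiment and comparison>",
    "filtering_criteria": "<e.g. adjusted p-value and fold-change cutoffs>",
    "background_definition": "<e.g. all genes detected in the experiment>",
    "tool_name": "<tool>",
    "tool_version": "<version>",
    "database": "<annotation database>",
    "database_date": "<release or download date>",
    "statistical_method": "<ORA / GSEA / topology-based>",
    "fdr_threshold": 0.05,
}

report = json.dumps(analysis_record, indent=2)
print(report)
print("missing fields:", missing_fields(analysis_record))
```

Checking `missing_fields` before sharing results catches the most common omission, an unreported background set.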

Integrated Omics

Combines transcriptomics, proteomics, GWAS, etc.
Tool: ActivePathways (combines p-values from each omics layer)
Pros: improves power
Cons: complex, needs careful integration
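To illustrate what "combining p-values from each omics layer" means, here is a sketch of Fisher's combined probability test for a single gene. This is a swapped-in textbook method that assumes the layers are independent; it is not ActivePathways' own code, and the tool's documentation should be consulted for the merging method it actually applies. The p-values are made up.

```python
from math import exp, log

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each > 0) with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose upper tail has a closed form for even df.
    """
    k = len(pvalues)
    half = -sum(log(p) for p in pvalues)   # statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i                   # (half ** i) / i!
        total += term
    return exp(-half) * total

# Made-up p-values for one gene: transcriptomics, proteomics, GWAS.
gene_pvalues = [0.04, 0.03, 0.20]
combined = fisher_combined_pvalue(gene_pvalues)
print("combined p-value:", combined)
```

Modest evidence in several layers can combine into stronger evidence than any single layer provides, which is where the gain in power comes from. The independence assumption is also where the need for careful integration shows up, since omics layers measured on the same samples are often correlated.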

Final Thoughts

Pathway enrichment is powerful if done right
Many omics types require special considerations
Don’t rely on defaults or black-box tools
Collaboration between biologists and bioinformaticians is key

Open Questions

What are the best ways to integrate multiple data layers?
What makes a good PEA tool for non-programmers?
