Before testing for differential expression, we need to assess sample quality, remove genes that are too lowly expressed to test reliably, and normalize the data so that samples are comparable.
In this notebook you will:
Output: A normalized DGEList saved as
DGE_filtered_normalized.rds for use in the differential
expression script.
Load required packages and define output directories.
library(limma)
library(edgeR)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(ggrepel)
library(here)
set.seed(42)
figures_dir <- here("demo-analysis", "output", "differential-expression", "figures")
rds_dir <- here("demo-analysis", "output", "differential-expression")
dir.create(figures_dir, showWarnings = FALSE, recursive = TRUE)
dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
counts_file <- "/blue/bioinf_workshop/share/nfcore_rnaseq_files/star_rsem/rsem.merged.gene_counts.tsv"
metadata_file <- here("demo-analysis", "data", "metadata", "sample_metadata.csv")
if (!file.exists(counts_file)) stop("Count file not found: ", counts_file)
if (!file.exists(metadata_file)) stop("Metadata file not found: ", metadata_file)
cat("✓ Count file found\n")
## ✓ Count file found
cat("✓ Metadata file found\n")
## ✓ Metadata file found
The nf-core/rnaseq pipeline with RSEM produces a merged count matrix.
We need the raw expected counts
(rsem.merged.gene_counts.tsv), not the TPM or scaled
variants. We also drop the transcript_id(s) column since we
are working at the gene level.
counts_raw <- read.table(counts_file, header = TRUE, sep = "\t",
check.names = FALSE, stringsAsFactors = FALSE) %>%
dplyr::select(-any_of("transcript_id(s)"))
cat("Genes:", nrow(counts_raw), "| Samples:", ncol(counts_raw) - 1, "\n")
## Genes: 45706 | Samples: 7
head(counts_raw[, 1:4], n = 3)
## gene_id gene_name SRR12546980 SRR12546982
## 1 ENSMUSG00000000001 Gnai3 2660 3405.00
## 2 ENSMUSG00000000003 Pbsn 0 0.00
## 3 ENSMUSG00000000028 Cdc45 1528 988.99
The count matrix uses Ensembl IDs
(e.g. ENSMUSG00000000003) as gene identifiers. We convert
these to gene symbols using the mouse annotation package
org.Mm.eg.db so that results are easier to interpret
downstream. Genes with no matching symbol retain their Ensembl ID as a
fallback.
library(org.Mm.eg.db)
symbols <- AnnotationDbi::mapIds(org.Mm.eg.db, keys = counts_raw$gene_id,
column = "SYMBOL", keytype = "ENSEMBL",
multiVals = "first")
counts_raw$gene_name <- ifelse(is.na(symbols), counts_raw$gene_id, symbols)
cat("✓ Mapped", sum(!is.na(symbols)), "of", length(symbols), "genes to symbols\n")
## ✓ Mapped 29478 of 45706 genes to symbols
A small number of genes share the same symbol (e.g. pseudogenes). We make these unique by appending the Ensembl ID, then set gene symbols as row names to produce the final counts matrix used throughout the rest of the workshop.
dups <- duplicated(counts_raw$gene_name) | duplicated(counts_raw$gene_name, fromLast = TRUE)
counts_raw$gene_name[dups] <- paste0(counts_raw$gene_name[dups], "_", counts_raw$gene_id[dups])
expression_data <- counts_raw %>%
dplyr::select(-gene_id) %>%
column_to_rownames("gene_name")
cat("✓ Final matrix:", nrow(expression_data), "genes x", ncol(expression_data), "samples\n")
## ✓ Final matrix: 45706 genes x 6 samples
head(expression_data[, 1:4], n = 3)
## SRR12546980 SRR12546982 SRR12546984 SRR12546986
## Gnai3 2660 3405.00 2947 2999
## Pbsn 0 0.00 0 0
## Cdc45 1528 988.99 1527 832
sample_info <- read.csv(metadata_file,
header = TRUE,
stringsAsFactors = FALSE)
sample_info
## sample group genotype treatment Run
## 1 SRR12546980 PRMT7_WT PRMT7_WT Untreated SRR12546980
## 2 SRR12546982 PRMT7_KD PRMT7_KD Untreated SRR12546982
## 3 SRR12546984 PRMT7_WT PRMT7_WT Untreated SRR12546984
## 4 SRR12546986 PRMT7_KD PRMT7_KD Untreated SRR12546986
## 5 SRR12546988 PRMT7_WT PRMT7_WT Untreated SRR12546988
## 6 SRR12546990 PRMT7_KD PRMT7_KD Untreated SRR12546990
# Confirm sample names match count matrix columns
all(sample_info$sample %in% colnames(expression_data))
## [1] TRUE
The DGEList object bundles counts and sample metadata
into a single container used throughout the edgeR and limma-voom
workflow.
DGE <- DGEList(counts = expression_data,
samples = sample_info)
# Set the group factor explicitly — WT is the reference level (listed first)
DGE$samples$group <- factor(DGE$samples$genotype,
levels = c("PRMT7_WT", "PRMT7_KD"))
DGE
## An object of class "DGEList"
## $counts
## SRR12546980 SRR12546982 SRR12546984 SRR12546986 SRR12546988 SRR12546990
## Gnai3 2660 3405.00 2947 2999 3012 3593
## Pbsn 0 0.00 0 0 0 0
## Cdc45 1528 988.99 1527 832 1431 961
## H19 0 0.00 0 0 0 0
## Scml2 273 213.01 323 188 323 203
## 45701 more rows ...
##
## $samples
## group lib.size norm.factors sample genotype treatment
## SRR12546980 PRMT7_WT 24337058 1 SRR12546980 PRMT7_WT Untreated
## SRR12546982 PRMT7_KD 29234454 1 SRR12546982 PRMT7_KD Untreated
## SRR12546984 PRMT7_WT 25754725 1 SRR12546984 PRMT7_WT Untreated
## SRR12546986 PRMT7_KD 25530601 1 SRR12546986 PRMT7_KD Untreated
## SRR12546988 PRMT7_WT 25440551 1 SRR12546988 PRMT7_WT Untreated
## SRR12546990 PRMT7_KD 29756203 1 SRR12546990 PRMT7_KD Untreated
## Run
## SRR12546980 SRR12546980
## SRR12546982 SRR12546982
## SRR12546984 SRR12546984
## SRR12546986 SRR12546986
## SRR12546988 SRR12546988
## SRR12546990 SRR12546990
Genes with very low counts across all samples add noise without
contributing useful signal and inflate the multiple testing burden.
filterByExpr() applies a principled heuristic: a gene must
reach a minimum expression level in enough samples given the
experimental design.
keep.exprs <- filterByExpr(DGE, group = DGE$samples$group)
table(keep.exprs)
## keep.exprs
## FALSE TRUE
## 31923 13783
DGE <- DGE[keep.exprs, , keep.lib.sizes = FALSE]
cat("Genes remaining after filtering:", nrow(DGE), "\n")
## Genes remaining after filtering: 13783
Even after filtering, samples differ in library size (total reads) and composition bias (a few highly expressed genes can dominate the library). TMM normalization calculates a scaling factor for each sample to correct for both. Factors close to 1.0 indicate little correction was needed.
DGE <- calcNormFactors(DGE, method = "TMM")
DGE$samples
## group lib.size norm.factors sample genotype treatment
## SRR12546980 PRMT7_WT 24323440 0.9658872 SRR12546980 PRMT7_WT Untreated
## SRR12546982 PRMT7_KD 29211491 1.0310403 SRR12546982 PRMT7_KD Untreated
## SRR12546984 PRMT7_WT 25741290 0.9680670 SRR12546984 PRMT7_WT Untreated
## SRR12546986 PRMT7_KD 25511001 1.0323176 SRR12546986 PRMT7_KD Untreated
## SRR12546988 PRMT7_WT 25427173 0.9659094 SRR12546988 PRMT7_WT Untreated
## SRR12546990 PRMT7_KD 29730447 1.0402622 SRR12546990 PRMT7_KD Untreated
## Run
## SRR12546980 SRR12546980
## SRR12546982 SRR12546982
## SRR12546984 SRR12546984
## SRR12546986 SRR12546986
## SRR12546988 SRR12546988
## SRR12546990 SRR12546990
Library size is the total number of reads per sample after filtering. Large differences (>2–3×) between samples may indicate technical issues. Moderate variation is normal and is handled by normalization.
lib_size_df <- data.frame(
sample = rownames(DGE$samples),
lib_size = DGE$samples$lib.size,
group = DGE$samples$group
)
ggplot(lib_size_df, aes(x = sample, y = lib_size / 1e6, fill = group)) +
geom_col() +
labs(title = "Library Sizes After Filtering",
x = "Sample", y = "Library Size (Millions of Reads)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
Multidimensional Scaling (MDS) places samples in two-dimensional space such that similar samples are closer together. Samples from the same group should cluster together and separate clearly from the other group. Outlier samples or unexpected clustering patterns are red flags.
mds <- plotMDS(DGE, top = 500, plot = FALSE, gene.selection = "common")
mds_df <- data.frame(
sample = rownames(DGE$samples),
dim1 = mds$x,
dim2 = mds$y,
genotype = DGE$samples$genotype
)
ggplot(mds_df, aes(x = dim1, y = dim2, color = genotype, label = sample)) +
geom_point(size = 4, alpha = 0.8) +
geom_text_repel(size = 3, max.overlaps = 20) +
labs(title = "MDS Plot — Sample Similarity",
subtitle = "Top 500 most variable genes",
x = paste0("Dim 1 (", round(mds$var.explained[1] * 100, 1), "%)"),
y = paste0("Dim 2 (", round(mds$var.explained[2] * 100, 1), "%)")) +
theme_minimal() +
scale_color_brewer(palette = "Set1")
ggsave(here(figures_dir, "mds_plot.png"), width = 8, height = 6, dpi = 300)
Pairwise Spearman correlations between all samples. Samples in the same group should show higher correlations with each other than with samples in the other group. Typical within-group correlations are 0.95–0.99.
logcpm <- cpm(DGE, log = TRUE)
cor_matrix <- cor(logcpm, method = "spearman")
annotation_col <- data.frame(
Genotype = DGE$samples$genotype,
row.names = colnames(logcpm)
)
# Save to file
pheatmap(cor_matrix,
annotation_col = annotation_col,
annotation_row = annotation_col,
main = "Sample Correlation Heatmap",
color = colorRampPalette(c("blue", "white", "red"))(100),
breaks = seq(0.7, 1, length.out = 101),
fontsize = 8,
filename = here(figures_dir, "correlation_heatmap.png"),
width = 8, height = 7)
# Display in document
pheatmap(cor_matrix,
annotation_col = annotation_col,
annotation_row = annotation_col,
main = "Sample Correlation Heatmap",
color = colorRampPalette(c("blue", "white", "red"))(100),
breaks = seq(0.7, 1, length.out = 101),
fontsize = 8)
Save the filtered and normalized DGEList so the differential expression script can load it directly without re-running QC.
saveRDS(DGE, file = here(rds_dir, "DGE_filtered_normalized.rds"))
cat("✓ Saved:", here(rds_dir, "DGE_filtered_normalized.rds"), "\n")
## ✓ Saved: /blue/bioinf_workshop/hkates/rnaseq_workshop/demo-analysis/output/differential-expression/DGE_filtered_normalized.rds
cat(" Genes:", nrow(DGE), "| Samples:", ncol(DGE), "\n")
## Genes: 13783 | Samples: 6
cat(" Groups:", paste(levels(DGE$samples$group), collapse = ", "), "\n")
## Groups: PRMT7_WT, PRMT7_KD
sessionInfo()
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Mm.eg.db_3.21.0 AnnotationDbi_1.72.0 IRanges_2.44.0
## [4] S4Vectors_0.48.0 Biobase_2.70.0 BiocGenerics_0.56.0
## [7] generics_0.1.4 here_1.0.2 ggrepel_0.9.8
## [10] RColorBrewer_1.1-3 pheatmap_1.0.13 lubridate_1.9.5
## [13] forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
## [16] purrr_1.2.1 readr_2.2.0 tidyr_1.3.2
## [19] tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
## [22] edgeR_4.8.2 limma_3.66.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.50.0 gtable_0.3.6 xfun_0.57 bslib_0.10.0
## [5] lattice_0.22-9 tzdb_0.5.0 vctrs_0.7.2 tools_4.5.3
## [9] RSQLite_2.4.6 blob_1.3.0 pkgconfig_2.0.3 S7_0.2.1
## [13] lifecycle_1.0.5 compiler_4.5.3 farver_2.1.2 textshaping_1.0.5
## [17] Biostrings_2.78.0 statmod_1.5.1 Seqinfo_0.99.2 htmltools_0.5.9
## [21] sass_0.4.10 yaml_2.3.12 crayon_1.5.3 pillar_1.11.1
## [25] jquerylib_0.1.4 cachem_1.1.0 tidyselect_1.2.1 locfit_1.5-9.12
## [29] digest_0.6.39 stringi_1.8.7 labeling_0.4.3 rprojroot_2.1.1
## [33] fastmap_1.2.0 grid_4.5.3 cli_3.6.5 magrittr_2.0.4
## [37] dichromat_2.0-0.1 withr_3.0.2 scales_1.4.0 bit64_4.6.0-1
## [41] timechange_0.4.0 XVector_0.50.0 httr_1.4.8 rmarkdown_2.30
## [45] bit_4.6.0 otel_0.2.0 ragg_1.5.2 png_0.1-9
## [49] hms_1.1.4 memoise_2.0.1 evaluate_1.0.5 knitr_1.51
## [53] rlang_1.1.7 Rcpp_1.1.1 glue_1.8.0 DBI_1.3.0
## [57] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.2