Overview

Before testing for differential expression, we need to assess sample quality, remove genes that are too lowly expressed to test reliably, and normalize the data so that samples are comparable.

In this notebook you will:

  • Load count data and sample metadata
  • Create a DGEList object
  • Filter lowly expressed genes
  • Apply TMM normalization
  • Visualize sample quality

Output: A normalized DGEList saved as DGE_filtered_normalized.rds for use in the differential expression script.


Setup

Load required packages and define output directories.

library(limma)
library(edgeR)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(ggrepel)
library(here)

set.seed(42)

figures_dir <- here("demo-analysis", "output", "differential-expression", "figures")
rds_dir     <- here("demo-analysis", "output", "differential-expression")

dir.create(figures_dir, showWarnings = FALSE, recursive = TRUE)
dir.create(rds_dir,     showWarnings = FALSE, recursive = TRUE)
counts_file <- "/blue/bioinf_workshop/share/nfcore_rnaseq_files/star_rsem/rsem.merged.gene_counts.tsv"
metadata_file <- here("demo-analysis", "data", "metadata", "sample_metadata.csv")

if (!file.exists(counts_file))   stop("Count file not found: ",    counts_file)
if (!file.exists(metadata_file)) stop("Metadata file not found: ", metadata_file)

cat("✓ Count file found\n")
## ✓ Count file found
cat("✓ Metadata file found\n")
## ✓ Metadata file found

Load Raw Counts

The nf-core/rnaseq pipeline with RSEM produces a merged count matrix. We need the raw expected counts (rsem.merged.gene_counts.tsv), not the TPM or scaled variants. We also drop the transcript_id(s) column since we are working at the gene level.

counts_raw <- read.table(counts_file, header = TRUE, sep = "\t",
                         check.names = FALSE, stringsAsFactors = FALSE) %>%
  dplyr::select(-any_of("transcript_id(s)"))

cat("Genes:", nrow(counts_raw), "| Samples:", ncol(counts_raw) - 1, "\n")
## Genes: 45706 | Samples: 7
head(counts_raw[, 1:4], n = 3)
##              gene_id gene_name SRR12546980 SRR12546982
## 1 ENSMUSG00000000001     Gnai3        2660     3405.00
## 2 ENSMUSG00000000003      Pbsn           0        0.00
## 3 ENSMUSG00000000028     Cdc45        1528      988.99

Convert Ensembl IDs to Gene Symbols

The count matrix uses Ensembl IDs (e.g. ENSMUSG00000000003) as gene identifiers. We convert these to gene symbols using the mouse annotation package org.Mm.eg.db so that results are easier to interpret downstream. Genes with no matching symbol retain their Ensembl ID as a fallback.

library(org.Mm.eg.db)

symbols <- AnnotationDbi::mapIds(org.Mm.eg.db, keys = counts_raw$gene_id,
                                 column = "SYMBOL", keytype = "ENSEMBL",
                                 multiVals = "first")

counts_raw$gene_name <- ifelse(is.na(symbols), counts_raw$gene_id, symbols)

cat("✓ Mapped", sum(!is.na(symbols)), "of", length(symbols), "genes to symbols\n")
## ✓ Mapped 29478 of 45706 genes to symbols

Build Final Count Matrix

A small number of genes share the same symbol (e.g. pseudogenes). We make these unique by appending the Ensembl ID, then set gene symbols as row names to produce the final counts matrix used throughout the rest of the workshop.

dups <- duplicated(counts_raw$gene_name) | duplicated(counts_raw$gene_name, fromLast = TRUE)
counts_raw$gene_name[dups] <- paste0(counts_raw$gene_name[dups], "_", counts_raw$gene_id[dups])

expression_data <- counts_raw %>%
  dplyr::select(-gene_id) %>%
  column_to_rownames("gene_name")

cat("✓ Final matrix:", nrow(expression_data), "genes x", ncol(expression_data), "samples\n")
## ✓ Final matrix: 45706 genes x 6 samples
head(expression_data[, 1:4], n = 3)
##       SRR12546980 SRR12546982 SRR12546984 SRR12546986
## Gnai3        2660     3405.00        2947        2999
## Pbsn            0        0.00           0           0
## Cdc45        1528      988.99        1527         832

Sample Metadata

sample_info <- read.csv(metadata_file,
                         header           = TRUE,
                         stringsAsFactors = FALSE)

sample_info
##        sample    group genotype treatment         Run
## 1 SRR12546980 PRMT7_WT PRMT7_WT Untreated SRR12546980
## 2 SRR12546982 PRMT7_KD PRMT7_KD Untreated SRR12546982
## 3 SRR12546984 PRMT7_WT PRMT7_WT Untreated SRR12546984
## 4 SRR12546986 PRMT7_KD PRMT7_KD Untreated SRR12546986
## 5 SRR12546988 PRMT7_WT PRMT7_WT Untreated SRR12546988
## 6 SRR12546990 PRMT7_KD PRMT7_KD Untreated SRR12546990
# Confirm sample names match count matrix columns
all(sample_info$sample %in% colnames(expression_data))
## [1] TRUE

Create DGEList Object

The DGEList object bundles counts and sample metadata into a single container used throughout the edgeR and limma-voom workflow.

DGE <- DGEList(counts  = expression_data,
               samples = sample_info)

# Set the group factor explicitly — WT is the reference level (listed first)
DGE$samples$group <- factor(DGE$samples$genotype,
                              levels = c("PRMT7_WT", "PRMT7_KD"))
DGE
## An object of class "DGEList"
## $counts
##       SRR12546980 SRR12546982 SRR12546984 SRR12546986 SRR12546988 SRR12546990
## Gnai3        2660     3405.00        2947        2999        3012        3593
## Pbsn            0        0.00           0           0           0           0
## Cdc45        1528      988.99        1527         832        1431         961
## H19             0        0.00           0           0           0           0
## Scml2         273      213.01         323         188         323         203
## 45701 more rows ...
## 
## $samples
##                group lib.size norm.factors      sample genotype treatment
## SRR12546980 PRMT7_WT 24337058            1 SRR12546980 PRMT7_WT Untreated
## SRR12546982 PRMT7_KD 29234454            1 SRR12546982 PRMT7_KD Untreated
## SRR12546984 PRMT7_WT 25754725            1 SRR12546984 PRMT7_WT Untreated
## SRR12546986 PRMT7_KD 25530601            1 SRR12546986 PRMT7_KD Untreated
## SRR12546988 PRMT7_WT 25440551            1 SRR12546988 PRMT7_WT Untreated
## SRR12546990 PRMT7_KD 29756203            1 SRR12546990 PRMT7_KD Untreated
##                     Run
## SRR12546980 SRR12546980
## SRR12546982 SRR12546982
## SRR12546984 SRR12546984
## SRR12546986 SRR12546986
## SRR12546988 SRR12546988
## SRR12546990 SRR12546990

Filter Lowly Expressed Genes

Genes with very low counts across all samples add noise without contributing useful signal and inflate the multiple testing burden. filterByExpr() applies a principled heuristic: a gene must reach a minimum expression level in enough samples given the experimental design.

keep.exprs <- filterByExpr(DGE, group = DGE$samples$group)
table(keep.exprs)
## keep.exprs
## FALSE  TRUE 
## 31923 13783
DGE <- DGE[keep.exprs, , keep.lib.sizes = FALSE]
cat("Genes remaining after filtering:", nrow(DGE), "\n")
## Genes remaining after filtering: 13783

TMM Normalization

Even after filtering, samples differ in library size (total reads) and composition bias (a few highly expressed genes can dominate the library). TMM normalization calculates a scaling factor for each sample to correct for both. Factors close to 1.0 indicate little correction was needed.

DGE <- calcNormFactors(DGE, method = "TMM")
DGE$samples
##                group lib.size norm.factors      sample genotype treatment
## SRR12546980 PRMT7_WT 24323440    0.9658872 SRR12546980 PRMT7_WT Untreated
## SRR12546982 PRMT7_KD 29211491    1.0310403 SRR12546982 PRMT7_KD Untreated
## SRR12546984 PRMT7_WT 25741290    0.9680670 SRR12546984 PRMT7_WT Untreated
## SRR12546986 PRMT7_KD 25511001    1.0323176 SRR12546986 PRMT7_KD Untreated
## SRR12546988 PRMT7_WT 25427173    0.9659094 SRR12546988 PRMT7_WT Untreated
## SRR12546990 PRMT7_KD 29730447    1.0402622 SRR12546990 PRMT7_KD Untreated
##                     Run
## SRR12546980 SRR12546980
## SRR12546982 SRR12546982
## SRR12546984 SRR12546984
## SRR12546986 SRR12546986
## SRR12546988 SRR12546988
## SRR12546990 SRR12546990

Quality Control

Library Sizes

Library size is the total number of reads per sample after filtering. Large differences (>2–3×) between samples may indicate technical issues. Moderate variation is normal and is handled by normalization.

lib_size_df <- data.frame(
  sample   = rownames(DGE$samples),
  lib_size = DGE$samples$lib.size,
  group    = DGE$samples$group
)

ggplot(lib_size_df, aes(x = sample, y = lib_size / 1e6, fill = group)) +
  geom_col() +
  labs(title = "Library Sizes After Filtering",
       x = "Sample", y = "Library Size (Millions of Reads)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")

MDS Plot

Multidimensional Scaling (MDS) places samples in two-dimensional space such that similar samples are closer together. Samples from the same group should cluster together and separate clearly from the other group. Outlier samples or unexpected clustering patterns are red flags.

mds <- plotMDS(DGE, top = 500, plot = FALSE, gene.selection = "common")

mds_df <- data.frame(
  sample   = rownames(DGE$samples),
  dim1     = mds$x,
  dim2     = mds$y,
  genotype = DGE$samples$genotype
)

ggplot(mds_df, aes(x = dim1, y = dim2, color = genotype, label = sample)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_text_repel(size = 3, max.overlaps = 20) +
  labs(title    = "MDS Plot — Sample Similarity",
       subtitle = "Top 500 most variable genes",
       x = paste0("Dim 1 (", round(mds$var.explained[1] * 100, 1), "%)"),
       y = paste0("Dim 2 (", round(mds$var.explained[2] * 100, 1), "%)")) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

ggsave(here(figures_dir, "mds_plot.png"), width = 8, height = 6, dpi = 300)

Sample Correlation Heatmap

Pairwise Spearman correlations between all samples. Samples in the same group should show higher correlations with each other than with samples in the other group. Typical within-group correlations are 0.95–0.99.

logcpm     <- cpm(DGE, log = TRUE)
cor_matrix <- cor(logcpm, method = "spearman")

annotation_col <- data.frame(
  Genotype  = DGE$samples$genotype,
  row.names = colnames(logcpm)
)

# Save to file
pheatmap(cor_matrix,
         annotation_col = annotation_col,
         annotation_row = annotation_col,
         main     = "Sample Correlation Heatmap",
         color    = colorRampPalette(c("blue", "white", "red"))(100),
         breaks   = seq(0.7, 1, length.out = 101),
         fontsize = 8,
         filename = here(figures_dir, "correlation_heatmap.png"),
         width = 8, height = 7)

# Display in document
pheatmap(cor_matrix,
         annotation_col = annotation_col,
         annotation_row = annotation_col,
         main     = "Sample Correlation Heatmap",
         color    = colorRampPalette(c("blue", "white", "red"))(100),
         breaks   = seq(0.7, 1, length.out = 101),
         fontsize = 8)

Save Processed Data

Save the filtered and normalized DGEList so the differential expression script can load it directly without re-running QC.

saveRDS(DGE, file = here(rds_dir, "DGE_filtered_normalized.rds"))

cat("✓ Saved:", here(rds_dir, "DGE_filtered_normalized.rds"), "\n")
## ✓ Saved: /blue/bioinf_workshop/hkates/rnaseq_workshop/demo-analysis/output/differential-expression/DGE_filtered_normalized.rds
cat("  Genes:", nrow(DGE), "| Samples:", ncol(DGE), "\n")
##   Genes: 13783 | Samples: 6
cat("  Groups:", paste(levels(DGE$samples$group), collapse = ", "), "\n")
##   Groups: PRMT7_WT, PRMT7_KD

Session Information

sessionInfo()
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Mm.eg.db_3.21.0  AnnotationDbi_1.72.0 IRanges_2.44.0      
##  [4] S4Vectors_0.48.0     Biobase_2.70.0       BiocGenerics_0.56.0 
##  [7] generics_0.1.4       here_1.0.2           ggrepel_0.9.8       
## [10] RColorBrewer_1.1-3   pheatmap_1.0.13      lubridate_1.9.5     
## [13] forcats_1.0.1        stringr_1.6.0        dplyr_1.2.0         
## [16] purrr_1.2.1          readr_2.2.0          tidyr_1.3.2         
## [19] tibble_3.3.1         ggplot2_4.0.2        tidyverse_2.0.0     
## [22] edgeR_4.8.2          limma_3.66.0        
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.50.0   gtable_0.3.6      xfun_0.57         bslib_0.10.0     
##  [5] lattice_0.22-9    tzdb_0.5.0        vctrs_0.7.2       tools_4.5.3      
##  [9] RSQLite_2.4.6     blob_1.3.0        pkgconfig_2.0.3   S7_0.2.1         
## [13] lifecycle_1.0.5   compiler_4.5.3    farver_2.1.2      textshaping_1.0.5
## [17] Biostrings_2.78.0 statmod_1.5.1     Seqinfo_0.99.2    htmltools_0.5.9  
## [21] sass_0.4.10       yaml_2.3.12       crayon_1.5.3      pillar_1.11.1    
## [25] jquerylib_0.1.4   cachem_1.1.0      tidyselect_1.2.1  locfit_1.5-9.12  
## [29] digest_0.6.39     stringi_1.8.7     labeling_0.4.3    rprojroot_2.1.1  
## [33] fastmap_1.2.0     grid_4.5.3        cli_3.6.5         magrittr_2.0.4   
## [37] dichromat_2.0-0.1 withr_3.0.2       scales_1.4.0      bit64_4.6.0-1    
## [41] timechange_0.4.0  XVector_0.50.0    httr_1.4.8        rmarkdown_2.30   
## [45] bit_4.6.0         otel_0.2.0        ragg_1.5.2        png_0.1-9        
## [49] hms_1.1.4         memoise_2.0.1     evaluate_1.0.5    knitr_1.51       
## [53] rlang_1.1.7       Rcpp_1.1.1        glue_1.8.0        DBI_1.3.0        
## [57] rstudioapi_0.18.0 jsonlite_2.0.0    R6_2.6.1          systemfonts_1.3.2