1 Project Summary

PI: Dr. Weizhou Zhang

Institution: University of Florida

Department: UF Health Cancer Center

Study Contact: Tanzia Islam

Project Title: The distinct roles of MSH2 and MLH1 in basal-like breast cancer

Study Summary:

Sample type(s): ? cells

Organism: Homo sapiens

Analysis goal(s):

Report-prepared-by:
- Dr. Heather Kates, Bioinformatics Analyst

Report-reviewed-by:
- Dr. Jason Brant, Assistant Professor

2 Data Downloads

2.1 Download Raw Sequencing Data

Below is a link to download the raw sequencing files. These files are very large (>150 GB); download them only when needed. Note that you must be logged into your UF Dropbox account for this link to work.

This file was not included in this report.

2.2 Download Sequencing Data Quality Control Summary

MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. The report collates pipeline QC results from FastQC, TrimGalore, samtools flagstat, samtools idxstats, samtools stats, picard CollectMultipleMetrics, picard MarkDuplicates, Preseq, deepTools plotProfile, deepTools plotFingerprint and featureCounts.

Download MultiQC Report

3 Data Pre-processing and Normalization

3.1 Peak Calling and Consensus Peakset Generation

Raw ATAC-seq data were processed using the nf-core/atacseq pipeline, version 2.1.2 [1] (https://nf-co.re/atacseq/2.1.2/), for the primary human genome. Peak calling was performed with MACS2 in broad peak mode for each individual sample. A consensus peakset was then generated across all samples using the pipeline’s consensus workflow, which merges peaks detected in at least two replicates in any one group. Per-contrast consensus peaks were then identified for differential accessibility testing.
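The "at least two replicates in any one group" rule can be illustrated with a minimal Python sketch. This is not the nf-core/atacseq code: the function name consensus_peaks, the peak identifiers, and the sample names are hypothetical, and the sketch assumes overlapping peaks have already been merged into shared intervals.

```python
from collections import defaultdict

def consensus_peaks(peak_calls, sample_groups, min_replicates=2):
    """Keep merged peaks called in >= min_replicates samples of any one group.

    peak_calls: dict mapping merged-peak id -> set of samples where it was called
    sample_groups: dict mapping sample name -> group name
    """
    kept = []
    for peak_id, samples in peak_calls.items():
        per_group = defaultdict(int)
        for sample in samples:
            per_group[sample_groups[sample]] += 1
        if any(n >= min_replicates for n in per_group.values()):
            kept.append(peak_id)
    return kept

# Hypothetical sample names and peak calls
groups = {"KO_1": "MLH1KO", "KO_2": "MLH1KO", "R4_1": "MLH1R4", "R4_2": "MLH1R4"}
calls = {
    "peak_a": {"KO_1", "KO_2"},          # two replicates of one group: kept
    "peak_b": {"KO_1", "R4_1"},          # one replicate per group: dropped
    "peak_c": {"R4_1", "R4_2", "KO_1"},  # two replicates of MLH1R4: kept
}
print(consensus_peaks(calls, groups))
```

Note that peak_b is dropped even though it appears in two samples, because the two samples belong to different groups.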

3.2 Normalization

Spike-in normalization was not available for this dataset. Counts were normalized using the TMM (Trimmed Mean of M-values) method to account for library size differences.
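To show what TMM does, here is a simplified Python sketch of the scaling factor for one sample relative to a reference sample. It is an illustration only, not the edgeR code used for this report: the function name tmm_factor is hypothetical, the trim fractions (30% on log ratios, 5% on abundance) are assumed defaults, and steps such as choosing the reference sample and rescaling the factors across samples are omitted.

```python
import math

def tmm_factor(obs, ref, m_trim=0.30, a_trim=0.05):
    """Simplified TMM scaling factor of sample `obs` relative to `ref`.

    obs, ref: raw counts for the same peaks in two samples.
    """
    n_obs, n_ref = sum(obs), sum(ref)
    rows = []  # (M, A, weight) for peaks with nonzero counts in both samples
    for y, r in zip(obs, ref):
        if y == 0 or r == 0:
            continue
        p_obs, p_ref = y / n_obs, r / n_ref
        m = math.log2(p_obs / p_ref)        # log ratio of proportions
        a = 0.5 * math.log2(p_obs * p_ref)  # average log abundance
        # Inverse of the approximate variance of M
        w = 1.0 / ((n_obs - y) / (n_obs * y) + (n_ref - r) / (n_ref * r))
        rows.append((m, a, w))

    # Trim the extremes of M and A, then take the weighted mean of M
    n = len(rows)
    lo_m, lo_a = int(n * m_trim), int(n * a_trim)
    by_m = sorted(range(n), key=lambda i: rows[i][0])
    by_a = sorted(range(n), key=lambda i: rows[i][1])
    keep = set(by_m[lo_m:n - lo_m]) & set(by_a[lo_a:n - lo_a])

    num = sum(rows[i][2] * rows[i][0] for i in keep)
    den = sum(rows[i][2] for i in keep)
    return 2 ** (num / den)

# Toy data: nine unchanged peaks plus one peak with a tenfold increase
ref = [100] * 10
obs = [100] * 9 + [1000]
print(round(tmm_factor(obs, ref), 3))
```

In the toy data the single high-count peak inflates the library size of obs, so the nine unchanged peaks look depleted when scaled by library size alone. The trimmed mean ignores the outlier and returns a factor below 1 that compensates for this.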

## Spike-in data not available. Performing TMM normalization instead.

3.3 Sample Summary

3.4 Quality Control Plots

3.4.1 PCA Plots

Principal Component Analysis (PCA) plots are used here to visualize the overall structure of the data. Two PCA plots are shown:

  • Raw Counts: Variance between samples based on raw counts without any normalization applied.
  • TMM Normalized Counts: Variance after TMM normalization, which adjusts for library size differences without external spike-in controls.

The side-by-side comparison highlights the effect of TMM normalization on sample clustering.

3.4.2 Correlation Heatmaps

The correlation heatmaps shown here represent the pairwise correlation between samples based on their log-transformed counts per million (CPM). Two sets of heatmaps are presented:

  • Raw Counts: The first heatmap shows the sample correlation using raw counts, providing an overview of how similarly the samples behave without any normalization.
  • TMM Normalized Counts: The second heatmap illustrates the correlation between samples after TMM normalization, which adjusts for library size differences without the use of spike-in controls.

These heatmaps help assess the overall quality of the samples and identify potential batch effects or outliers in the dataset.
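The quantity shown in each heatmap cell can be sketched in Python as the Pearson correlation between two samples' log-transformed CPM values. The helper names, the pseudocount of 1, and the use of Pearson rather than another correlation measure are assumptions for illustration; the report does not state which settings were used.

```python
import math

def log_cpm(counts, pseudocount=1.0):
    """log2 counts per million for one sample (pseudocount is an assumption)."""
    lib_size = sum(counts)
    return [math.log2(c / lib_size * 1e6 + pseudocount) for c in counts]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical counts for four peaks in three samples
samples = {
    "KO_1": [10, 200, 35, 0],
    "KO_2": [20, 400, 70, 0],   # same profile at twice the depth
    "R4_1": [150, 30, 5, 60],
}
names = list(samples)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(log_cpm(samples[a]), log_cpm(samples[b]))
        print(f"{a} vs {b}: r = {r:.3f}")
```

Because CPM divides by library size, KO_1 and KO_2 correlate perfectly here despite the twofold difference in sequencing depth.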

4 Differential Accessibility Analysis

To determine differential accessibility, a consensus set of peaks was first defined across all samples in the dataset included in this report. Peaks were included in the consensus set if they were present in at least two replicates within any group, ensuring robust detection of shared accessible regions. If more than one contrast was tested, this consensus peakset was filtered to retain only peaks present in at least two replicates in either of the contrast’s groups.

For each per-contrast consensus peak, the number of aligned reads overlapping that region was counted for every sample, resulting in a matrix of read counts per peak per sample (“peak count matrix”).

Differential accessibility analysis of the peak count matrix was performed using the R package edgeR v. 3.4 (McCarthy et al. 2012) [2] to identify peaks with statistically significant differences in accessibility between the two conditions in each contrast. Because spike-in factors were not available for this dataset, the edgeR analysis normalizes count data using TMM, filters peaks with a minimum count of 10 and a minimum proportion of 0.5 across samples for a given contrast, fits a model using the formula ~0 + Condition, and performs differential accessibility testing to identify significant changes between conditions. Specifically, estimateDisp(dge_contrast, design_contrast) was used to estimate the biological variation (dispersion) in the data, and glmQLFit(dge_contrast, design_contrast) was used to fit a quasi-likelihood generalized linear model, which is more robust to overdispersion than glmFit for this type of data.
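Two of the steps above, the count filter and the ~0 + Condition design, can be sketched in Python. This is an illustration, not the R code behind the report. The filter follows one plausible reading of the stated thresholds (a peak is kept if at least half of the contrast's samples have 10 or more reads), and the function names are hypothetical.

```python
def keep_peak(counts, min_count=10, min_prop=0.5):
    """Assumed reading of the filter: enough samples reach min_count reads."""
    passing = sum(1 for c in counts if c >= min_count)
    return passing / len(counts) >= min_prop

def design_no_intercept(conditions):
    """Design matrix for ~0 + Condition: one indicator column per group."""
    levels = sorted(set(conditions))  # R orders factor levels alphabetically by default
    rows = [[1 if c == lev else 0 for lev in levels] for c in conditions]
    return levels, rows

conditions = ["MLH1KO", "MLH1KO", "MLH1R4", "MLH1R4"]
levels, design = design_no_intercept(conditions)
print(levels)
for row in design:
    print(row)

print(keep_peak([12, 15, 3, 0]))  # 2 of 4 samples pass
print(keep_peak([12, 2, 3, 0]))   # 1 of 4 samples pass
```

With no intercept, each coefficient is a group mean on the model scale, so the MLH1KO vs MLH1R4 comparison corresponds to the contrast vector [1, -1] over these two columns.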

4.1 Overview of Contrasts

## The following 1 contrasts were tested for differential accessibility:
## 1. MLH1KO vs MLH1R4

4.2 Summary of Differential Accessibility per-contrast

In the per-contrast summary tables below, the “significance” column indicates the “direction” of the significant results (in what group the peak is more accessible), and the number of peaks indicates the number of peaks that are significantly more accessible (FDR < 0.05) in that group. Consensus peaks that were not significantly different between the two groups in the contrast are reported as “Not significant”.

4.3 Downloadable Results Tables

Results of the differential accessibility analysis are provided below. Peak annotation was performed using HOMER as implemented in the nf-core/atacseq pipeline. Promoter regions were defined as peaks within +/- 1,000 bp of the transcription start site (TSS). This designation may differ from the HOMER-derived annotation in the annotation column(s), which considers promoter regions to be those from -1,000 bp to +100 bp relative to the TSS.
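The difference between the two promoter definitions can be made concrete with a short Python sketch. It assumes a signed distance to the TSS (negative upstream, positive downstream) and inclusive boundaries; both are assumptions for illustration, as are the function names.

```python
def is_promoter_report(distance_to_tss):
    """Report definition: within 1,000 bp of the TSS in either direction."""
    return abs(distance_to_tss) <= 1000

def is_promoter_homer(distance_to_tss):
    """HOMER-style definition: from -1,000 bp to +100 bp relative to the TSS."""
    return -1000 <= distance_to_tss <= 100

for d in (-800, 50, 500, 1500):
    print(d, is_promoter_report(d), is_promoter_homer(d))
```

A peak 500 bp downstream of the TSS is a promoter peak under the report's definition but not under the HOMER definition, which is why the two columns can disagree.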

Download Differential Accessibility Results

Contrast DA Promoter Regions DA All Annotated Regions
MLH1KO_vs_MLH1R4

5 Visualization of Differential Accessibility

5.1 UCSC Track Hubs

UCSC track hubs were generated for each contrast to support interactive visualization of differential accessibility results. These hubs allow genomic regions of interest to be viewed alongside signal tracks from individual samples in the UCSC Genome Browser.

For each contrast, a set of filtered differentially accessible (DA) peaks was converted to BigBed format for browser visualization. Peaks were filtered to retain those with a false discovery rate (FDR) less than 0.05, an absolute log2 fold change greater than 1, an average accessibility (logCPM) greater than 2, and an annotated gene name. These filtered peaks were included as a track within the hub.
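The four filtering criteria can be sketched in Python as follows. The dictionary keys (FDR, logFC, logCPM, gene) and the function name are hypothetical stand-ins for the columns of the results table, and the example rows are invented purely for illustration.

```python
def passes_hub_filter(peak, alpha=0.05, lfc_cut=1.0, min_log_cpm=2.0):
    """Apply the four track-hub criteria to one peak record."""
    return (
        peak["FDR"] < alpha
        and abs(peak["logFC"]) > lfc_cut
        and peak["logCPM"] > min_log_cpm
        and bool(peak["gene"])  # must carry an annotated gene name
    )

# Invented example rows
peaks = [
    {"id": "p1", "FDR": 0.001, "logFC": 2.3, "logCPM": 4.1, "gene": "GENE_A"},
    {"id": "p2", "FDR": 0.001, "logFC": 0.4, "logCPM": 5.0, "gene": "GENE_B"},  # small change
    {"id": "p3", "FDR": 0.020, "logFC": -1.8, "logCPM": 1.2, "gene": "GENE_C"}, # low abundance
    {"id": "p4", "FDR": 0.010, "logFC": -2.0, "logCPM": 3.3, "gene": ""},       # no gene name
    {"id": "p5", "FDR": 0.030, "logFC": -1.5, "logCPM": 2.8, "gene": "GENE_D"},
]
print([p["id"] for p in peaks if passes_hub_filter(p)])
```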

Track hubs also include BigWig signal tracks for each individual sample to compare accessibility profiles across replicates and conditions in the genome browser. All hubs are hosted on a web-accessible directory linked to the UCSC Genome Browser.

Clicking the “UCSC Track” links in the table below will open the tracks in the UCSC Genome Browser. To keep this report open, right-click the link and choose to open it in a new tab.

If you prefer viewing the data in IGV, click the Download IGV Tracks link to save a zip folder to your computer.

Once the folder is downloaded, unzip it, go to www.igv.org in your browser, then select “Tracks” > “Local File” and select all files in the downloaded folder.

Contrast            UCSC_Track                     IGV_Zip
MLH1KO_vs_MLH1R4    View Tracks in UCSC Browser    Download IGV Tracks

5.2 Barplots of Differential Peak Accessibility by Peak Annotation

Barplots show the number of differentially accessible (DA) peaks within each annotation category (e.g., promoter, intron, intergenic) for each contrast. Peaks were grouped based on their significance and direction of accessibility (i.e., more accessible in one group vs. the other, or not significant).

5.3 MA Plots

MA plots visualize the relationship between average accessibility (log2 CPM) and differential accessibility (log2 fold change) for each peak within a given contrast. Each point represents a consensus peak, with its position determined by its average accessibility across all samples (x-axis) and the estimated log fold change between groups (y-axis).

Peaks are colored based on statistical significance and direction of change: red indicates peaks more accessible in one group, blue indicates peaks more accessible in the other, and gray represents non-significant differences. Only peaks with an absolute log2 fold change greater than 1 and an FDR below 0.05 are considered significant.

These plots are useful for assessing the global characteristics of each contrast: for example, whether one group has more widespread accessibility changes, whether peaks with increased and decreased accessibility are evenly distributed, and whether extreme fold changes occur mostly in low-abundance regions (which may reflect noise). They also help confirm that log fold changes are centered around zero in non-significant peaks, as expected when normalization and modeling are appropriately specified.

5.4 Volcano Plots

Volcano plots summarize the results of differential accessibility analysis by displaying the magnitude of change (log2 fold change) on the x-axis and statistical significance (–log10 FDR) on the y-axis. Each point represents a consensus peak tested within a given contrast.

Peaks are colored based on statistical significance and direction of change: red indicates peaks more accessible in one group, blue indicates peaks more accessible in the other, and gray represents non-significant differences. Only peaks with an absolute log2 fold change greater than 1 and an FDR below 0.05 are considered significant.
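The plotting coordinates and the coloring rule shared by the MA and volcano plots can be sketched in Python. This is an illustration rather than the plotting code used for the report. Which direction is drawn in red is an assumption, the function name is hypothetical, and the sketch assumes FDR values greater than zero.

```python
import math

def plot_point(log_fc, log_cpm, fdr, lfc_cut=1.0, alpha=0.05):
    """Return MA and volcano coordinates plus a color for one peak."""
    if fdr < alpha and log_fc > lfc_cut:
        color = "red"    # assumed: more accessible in the first-named group
    elif fdr < alpha and log_fc < -lfc_cut:
        color = "blue"   # assumed: more accessible in the second-named group
    else:
        color = "gray"   # fails the FDR or the fold-change threshold
    return {
        "ma": (log_cpm, log_fc),                 # x: average accessibility, y: logFC
        "volcano": (log_fc, -math.log10(fdr)),   # x: logFC, y: -log10 FDR
        "color": color,
    }

print(plot_point(2.0, 5.0, 0.01))
print(plot_point(0.5, 5.0, 0.01))   # significant FDR, but |logFC| <= 1
print(plot_point(-3.0, 1.0, 0.20))  # large change, but FDR >= 0.05
```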

Volcano plots are used to assess the overall strength and distribution of differential accessibility signals in each contrast. They help reveal whether a contrast has many significant peaks or only a few, whether the effect sizes are modest or extreme, and whether the signal is symmetric or biased toward one condition. Like MA plots, volcano plots serve as a quality check to confirm that differential patterns are consistent with expectations and that significance is not driven solely by low-count or high-variance regions.

6 Pathway Enrichment Analysis

Pathway enrichment analysis was performed to identify functional enrichment of gene lists and to compare these significant results across contrasts.

6.1 Differentially Accessible Genes

To create gene lists for input into the enrichment analyses, differentially accessible peaks were filtered by FDR < 0.05 and split into peaks more accessible in the left group relative to the right group (logFC > 1) and peaks less accessible (logFC < -1). From these filtered peaks, those annotated with Entrez IDs were selected, and these Entrez IDs were used as input to the enrichment analysis.
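A minimal Python sketch of this gene-list construction follows. It assumes the less-accessible list uses logFC < -1, the mirror of the logFC > 1 cutoff, and that each Entrez ID is counted once per list. The field names, the function name, and the example IDs are hypothetical.

```python
def build_gene_lists(peaks, alpha=0.05, lfc_cut=1.0):
    """Split significant peaks into 'more' and 'less' accessible Entrez ID lists."""
    more, less = set(), set()
    for p in peaks:
        if p["entrez"] is None or p["FDR"] >= alpha:
            continue  # skip unannotated or non-significant peaks
        if p["logFC"] > lfc_cut:
            more.add(p["entrez"])
        elif p["logFC"] < -lfc_cut:
            less.add(p["entrez"])
    return sorted(more), sorted(less)

# Invented example rows; the IDs are placeholders, not real annotations
peaks = [
    {"entrez": "1001", "logFC": 2.1, "FDR": 0.001},
    {"entrez": "1001", "logFC": 1.4, "FDR": 0.010},  # second peak, same gene
    {"entrez": "1002", "logFC": -1.7, "FDR": 0.002},
    {"entrez": None, "logFC": 3.0, "FDR": 0.001},    # no Entrez ID
    {"entrez": "1003", "logFC": 0.6, "FDR": 0.001},  # below fold-change cutoff
    {"entrez": "1004", "logFC": -2.5, "FDR": 0.300}, # not significant
]
more, less = build_gene_lists(peaks)
print(more, less)
```

A gene can in principle appear in both lists if it has one peak gaining and another losing accessibility; the sketch keeps such genes in both.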

6.2 Gene Ontology (GO) Enrichment Analysis

Gene Ontology enrichment of the genes annotated to differentially accessible regions was performed using the enrichGO function in clusterProfiler v4.8 (Yu et al. 2012) in each of the three GO categories (BP, MF, CC). Interactive GO enrichment plots are based on the top 10 significantly enriched GO terms per GO category per gene list. Hover over the plot to view the p-value, gene ratio, and up to the top 20 DA genes (sorted by DA adjusted p-value) in that term.
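For readers interpreting the p-value and gene ratio shown in the plots, the following Python sketch computes an over-representation p-value with the hypergeometric test, the kind of test on which this style of enrichment analysis is based. It is an illustration, not the clusterProfiler implementation; the function name, argument names, and numbers are hypothetical.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N background genes, K of them in the term.

    k: genes in the input list annotated to the term
    n: genes in the input list
    K: background genes annotated to the term
    N: background genes in total
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

k, n, K, N = 8, 100, 200, 20000   # invented numbers
print(f"gene ratio = {k}/{n}, background ratio = {K}/{N}")
print(f"p = {enrichment_pvalue(k, n, K, N):.3g}")
```

Here the term covers 1% of the background but 8% of the gene list, so the p-value is small. Such p-values are normally adjusted for multiple testing across all terms before being called significant.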

To assess similarities between the gene lists’ enrichment results, if a gene list had a significant result for a term in another gene list’s top 10, that result is also displayed, whether or not it was in that gene list’s own top 10. The downloadable Excel results file includes all significant results for all gene lists.
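The display rule can be sketched in Python: collect the top terms of every gene list, then show each gene list's significant results for any term in that pooled set. The function name and data layout are hypothetical, and ranking by adjusted p-value is an assumption.

```python
def terms_to_display(results, top_n=10):
    """results: gene list name -> list of (term, adjusted p) for significant terms."""
    ranked = {name: sorted(rows, key=lambda r: r[1]) for name, rows in results.items()}
    # Pool the top-N terms of every gene list
    pooled = {term for rows in ranked.values() for term, _ in rows[:top_n]}
    # Each list shows all of its significant terms that fall in the pooled set
    return {name: [t for t, _ in rows if t in pooled] for name, rows in ranked.items()}

# Invented example with top_n=1 to keep it short
results = {
    "more_accessible": [("term_A", 0.001), ("term_B", 0.020)],
    "less_accessible": [("term_B", 0.003), ("term_C", 0.040)],
}
print(terms_to_display(results, top_n=1))
```

In this example term_B is not the top term of the more-accessible list, but it is shown there because it is the top term of the less-accessible list. term_C is in no list's top set and is omitted from the plot.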

📥 Download GO_BP_enrich_plot
📥 Download GO_MF_enrich_plot
📥 Download GO_CC_enrich_plot

6.3 KEGG Pathway Enrichment Analysis

KEGG enrichment was performed using the enrichKEGG function in clusterProfiler v4.8 (Yu et al. 2012). Interactive KEGG enrichment plots are based on the top 10 significantly enriched KEGG pathways per gene list. Hover over the plot to view the p-value, gene ratio, and up to the top 20 DA genes (sorted by DA adjusted p-value) in that pathway.

To assess similarities between the gene lists’ enrichment results, if a gene list had a significant result for a pathway in another gene list’s top 10, that result is also displayed, whether or not it was in that gene list’s own top 10. The downloadable Excel results file includes all significant results for all gene lists.

📥 Download kegg_enrichment_plot

7 References


  1. Patel, H., Espinosa-Carrasco, J., Langer, B., Ewels, P., nf-core bot, Garcia, M. U., Syme, R., Peltzer, A., Talbot, A., Behrens, D., Gabernet, G., Jin, M., Hörtenhuber, M., Gonzalez Rodriguez, J., Menden, K., & An, Ö. (2022). nf-core/atacseq: 2.1.2. [GitHub repository]. https://github.com/nf-core/atacseq (accessed August 7, 2022).

  2. McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297.