PI: Dr. Weizhou Zhang
Institution: University of Florida
Department: UF Health Cancer Center
Study Contact: Tanzia Islam
Project Title: The distinct roles of MSH2 and MLH1 in basal-like breast cancer
Study Summary:
Sample type(s): ? cells
Organism: Homo sapiens
Analysis goal(s):
Report-prepared-by:
- Dr. Heather Kates, Bioinformatics Analyst
Report-reviewed-by:
- Dr. Jason Brant, Assistant Professor
Below is a link to download the raw sequencing files. These files are very large (>150 GB), so download them only when needed. Note that you must be logged into your UF Dropbox account for this link to work.
This file was not included in this report.
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. The report collates pipeline QC results from FastQC, TrimGalore, samtools flagstat, samtools idxstats, samtools stats, picard CollectMultipleMetrics, picard MarkDuplicates, Preseq, deepTools plotProfile, deepTools plotFingerprint, and featureCounts.
Raw ATAC-seq data were processed using the nf-core/atacseq 2.1.2 pipeline (https://nf-co.re/atacseq/2.1.2/) for the primary human genome. Peak calling was performed with MACS2 in broad peak mode for each individual sample. A consensus peakset was then generated across all samples using the pipeline’s consensus workflow, which merges peaks detected in at least two replicates in any one group. Per-contrast consensus peaks were identified for differential accessibility testing.
Spike-in normalization was not available for this dataset. Counts were normalized using the TMM (Trimmed Mean of M-values) method to account for library size differences.
## Spike-in data not available. Performing TMM normalization instead.
Principal Component Analysis (PCA) plots are used here to visualize the overall structure of the data. Two PCA plots are shown:
The side-by-side comparison highlights the effect of TMM normalization on sample clustering.
The correlation heatmaps shown here represent the pairwise correlation between samples based on their log-transformed counts per million (CPM). Two sets of heatmaps are presented:
These heatmaps help assess the overall quality of the samples and identify potential batch effects or outliers in the dataset.
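The quantity behind such a heatmap can be illustrated with a minimal sketch. It assumes a simple log2(CPM + prior) transform and a Pearson correlation; the report does not state the exact transform or correlation method, and the sample count vectors are hypothetical.

```python
import math

def log_cpm(counts, prior_count=1.0):
    """Log2 counts per million for one sample (simplified; prior_count is an assumption)."""
    library_size = sum(counts)
    return [math.log2(c / library_size * 1e6 + prior_count) for c in counts]

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-peak counts for three samples
sample_a = [10, 200, 35, 0, 80]
sample_b = [20, 400, 70, 0, 160]   # same profile as sample_a at twice the depth
sample_c = [200, 10, 0, 35, 80]    # a different accessibility profile

r_ab = pearson(log_cpm(sample_a), log_cpm(sample_b))
r_ac = pearson(log_cpm(sample_a), log_cpm(sample_c))
```

Because CPM divides by library size, `sample_a` and `sample_b` correlate perfectly despite the depth difference, whereas `sample_c` does not. This is why the heatmaps are useful for spotting outliers rather than depth effects.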
To determine differential accessibility, a consensus set of peaks was first defined across all samples in the dataset included in this report. Peaks were included in the consensus set if they were present in at least two replicates within any group, ensuring robust detection of shared accessible regions. If more than one contrast was tested, this consensus peakset was filtered to retain only peaks present in at least two replicates in either of the contrast’s groups.
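The two filtering rules above can be sketched as follows. This is an illustrative sketch only, not the pipeline's implementation: the sample names, the third group `OTHER`, and the peak IDs are hypothetical, and each peak is assumed to come with the set of samples in which it was called.

```python
def consensus_peaks(presence, groups, min_replicates=2):
    """Keep peaks called in at least `min_replicates` samples of any one group.

    presence: dict of peak ID -> set of sample names in which the peak was called
    groups:   dict of group name -> list of replicate sample names
    """
    kept = []
    for peak, samples in presence.items():
        if any(len(samples & set(members)) >= min_replicates
               for members in groups.values()):
            kept.append(peak)
    return kept

def contrast_peaks(presence, groups, contrast, min_replicates=2):
    """Apply the same rule, but only to the two groups named in the contrast."""
    contrast_groups = {name: groups[name] for name in contrast}
    return consensus_peaks(presence, contrast_groups, min_replicates)

# Hypothetical design: two groups from the contrast plus one unrelated group
groups = {
    "MLH1KO": ["KO_1", "KO_2", "KO_3"],
    "MLH1R4": ["R4_1", "R4_2", "R4_3"],
    "OTHER": ["O_1", "O_2"],
}
presence = {
    "peak_a": {"KO_1", "KO_2"},           # two MLH1KO replicates: kept
    "peak_b": {"KO_1", "R4_1"},           # one replicate per group: dropped
    "peak_c": {"O_1", "O_2"},             # kept overall, dropped for the contrast
    "peak_d": {"R4_1", "R4_2", "KO_1"},   # two MLH1R4 replicates: kept
}

all_consensus = consensus_peaks(presence, groups)
per_contrast = contrast_peaks(presence, groups, ("MLH1KO", "MLH1R4"))
```

Here `peak_c` survives the dataset-wide rule but is removed for the MLH1KO vs MLH1R4 contrast because neither contrast group supports it.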
For each per-contrast consensus peak, the number of aligned reads overlapping that region was counted for every sample, resulting in a matrix of read counts per peak per sample (“peak count matrix”).
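As a rough illustration of how one column of such a matrix arises, the sketch below counts reads overlapping each peak for a single sample. It assumes half-open intervals and counts any overlap of at least one base; the actual counting tool has additional options (for example, handling of multi-overlapping reads) that are not modelled here, and all coordinates are hypothetical.

```python
def count_overlaps(peaks, reads):
    """Count reads overlapping each peak.

    peaks, reads: lists of (chrom, start, end) tuples with half-open coordinates.
    Returns one count per peak, in the order the peaks were given.
    """
    counts = []
    for chrom, p_start, p_end in peaks:
        n = sum(
            1
            for r_chrom, r_start, r_end in reads
            if r_chrom == chrom and r_start < p_end and r_end > p_start
        )
        counts.append(n)
    return counts

# Hypothetical peaks and aligned reads for a single sample
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
reads = [
    ("chr1", 90, 140),    # overlaps the first peak
    ("chr1", 190, 240),   # overlaps the first peak
    ("chr1", 200, 250),   # starts at the peak end: no overlap (half-open)
    ("chr1", 550, 560),   # overlaps the second peak
    ("chr2", 150, 180),   # overlaps the third peak
    ("chr2", 300, 350),   # overlaps nothing
]

sample_counts = count_overlaps(peaks, reads)
```

Repeating this for every sample and stacking the results column by column gives the peak count matrix.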
Differential accessibility analysis of the peak count matrix was performed using the R package edgeR v. 3.4 (McCarthy et al. 2012) to identify peaks with statistically significant differences in accessibility between the two conditions in each contrast. This edgeR analysis normalizes count data (with TMM for this dataset, because spike-in factors were not available), filters peaks with a minimum count of 10 and a minimum proportion of 0.5 across samples for a given contrast, fits a model using the formula ~0 + Condition, and performs differential accessibility testing to identify significant changes between conditions. Specifically, estimateDisp(dge_contrast, design_contrast) was used to estimate the biological variation (dispersion) in the data, and glmQLFit(dge_contrast, design_contrast) was used to fit a quasi-likelihood generalized linear model, which is more robust to overdispersion than glmFit for this type of data.
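The peak filter described above can be read, in simplified form, as "keep a peak if at least half of the contrast's samples have 10 or more reads". The sketch below implements that reading only. It is an assumption about the rule's intent, not the analysis code: the report does not say which function applied the filter, and if edgeR's `filterByExpr` was used, its thresholds are applied on a CPM scale, so exact results may differ. Peak IDs and counts are hypothetical.

```python
def keep_peak(counts, min_count=10, min_prop=0.5):
    """Return True if the fraction of samples with count >= min_count is >= min_prop."""
    n_pass = sum(1 for c in counts if c >= min_count)
    return n_pass / len(counts) >= min_prop

# Hypothetical counts for six samples in one contrast
peak_counts = {
    "peak_1": [12, 15, 9, 30, 2, 11],   # 4 of 6 samples pass: kept
    "peak_2": [0, 3, 12, 10, 1, 4],     # 2 of 6 samples pass: dropped
    "peak_3": [10, 10, 10, 0, 0, 0],    # exactly half pass: kept
}

filtered = [peak for peak, counts in peak_counts.items() if keep_peak(counts)]
```

Filtering before dispersion estimation removes peaks with too few reads to support a reliable test, which also reduces the multiple-testing burden.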
## The following 1 contrasts were tested for differential accessibility:
## 1. MLH1KO vs MLH1R4
In the per-contrast summary tables below, the “significance” column indicates the “direction” of the significant results (in what group the peak is more accessible), and the number of peaks indicates the number of peaks that are significantly more accessible (FDR < 0.05) in that group. Consensus peaks that were not significantly different between the two groups in the contrast are reported as “Not significant”.
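The way the “significance” column and its counts are derived can be sketched as follows. The sketch assumes that a positive log fold change means greater accessibility in the first-named group of the contrast (MLH1KO); the report does not state the sign convention, and the label wording, peak IDs, and statistics are hypothetical.

```python
from collections import Counter

def classify(log_fc, fdr, group_a="MLH1KO", group_b="MLH1R4", alpha=0.05):
    """Label a peak by direction if FDR < alpha, otherwise 'Not significant'.

    Assumption: log_fc > 0 means the peak is more accessible in group_a.
    """
    if fdr >= alpha:
        return "Not significant"
    if log_fc > 0:
        return f"More accessible in {group_a}"
    return f"More accessible in {group_b}"

# Hypothetical (peak ID, log fold change, FDR) results
results = [
    ("peak_1", 1.8, 0.001),
    ("peak_2", -2.1, 0.03),
    ("peak_3", 0.4, 0.40),
    ("peak_4", -0.9, 0.049),
]

summary = Counter(classify(log_fc, fdr) for _, log_fc, fdr in results)
```

The resulting tally has one row per direction plus a “Not significant” row, which matches the layout of the per-contrast summary tables.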
Results of the differential accessibility analysis are provided below. Peak annotation was performed using HOMER as implemented in the nf-core/atacseq pipeline. Promoter regions were defined as peaks within ±1,000 bp of a transcription start site (TSS). This designation may differ from the HOMER-derived annotation in the annotation column(s), which considers promoter regions to span –1,000 bp to +100 bp relative to the TSS.
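The difference between the two promoter definitions can be made concrete with a small sketch. It assumes the distance is measured from the peak centre to the TSS, that it is signed relative to gene strand (negative means upstream), and that the window boundaries are inclusive; none of these details are stated in the report, and the coordinates are hypothetical.

```python
def signed_distance(peak_center, tss, strand):
    """Distance from TSS to peak centre; negative values are upstream of the gene."""
    distance = peak_center - tss
    return distance if strand == "+" else -distance

def is_promoter_report(distance):
    """Promoter as used for the summary counts: within 1,000 bp either side of the TSS."""
    return -1000 <= distance <= 1000

def is_promoter_homer(distance):
    """Promoter as described for the HOMER annotation: -1,000 bp to +100 bp."""
    return -1000 <= distance <= 100

# Hypothetical peak centred 500 bp downstream of a plus-strand TSS
downstream = signed_distance(peak_center=10500, tss=10000, strand="+")

# The same coordinates on the minus strand place the peak 500 bp upstream
upstream = signed_distance(peak_center=10500, tss=10000, strand="-")
```

A peak 500 bp downstream of the TSS counts as a promoter peak under the ±1,000 bp rule but not under the –1,000 to +100 bp window, which is the kind of disagreement the paragraph above warns about.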
Contrast | DA Promoter Regions | DA All Annotated Regions |
---|---|---|
MLH1KO_vs_MLH1R4 |