Overview

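This document explains how the demo analysis is laid out, how to set up your working directory and RStudio Server on HiPerGator, and how to run the analysis scripts.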

Demo Analysis File Layout

What is GitHub and Why Are We Using It?

GitHub is a platform for hosting and sharing code. The workshop analysis scripts, configuration files, and directory structure are all stored in a repository (repo) — essentially a project folder that lives on GitHub. Rather than downloading files one by one, we use git clone to copy the entire repo to HiPerGator in one command. You won't need to know much more about Git for this workshop, but if you want to learn more, GitHub's own guides are a good starting point.

The files for the demo analysis will be created by cloning the workshop's GitHub repository:

What You'll Create (Your Working Space)

/blue/bioinf_workshop/$USER/
└── rnaseq_workshop/
    ├── .gitignore
    ├── mkdocs.yml
    ├── README.md
    ├── docs/
    └── demo-analysis/
        ├── data/
        │   ├── metadata/
        │   │   ├── SraRunTable.csv          # Raw metadata from NCBI SRA
        │   │   └── sample_metadata.csv      # Cleaned metadata (created for you)
        │   ├── raw/                         # Symlinks to raw .fastq files
        │   └── two-factor-design/           # Drosophila dataset (for optional script)
        │       ├── salmon.merged.gene_counts.tsv
        │       └── dme_elev_samples.tsv
        └── output/
            ├── differential-expression/     # Created by scripts 01-03
            │   ├── DGE_filtered_normalized.rds
            │   ├── figures/
            │   └── results/
            └── optional/                    # Created by optional/opt_01_prepare_nfcore_data.R
                ├── rsem.merged.gene_counts.tsv
                ├── sample_info.tsv
                ├── gene_annotation.tsv
                ├── data_summary.txt
                ├── library_sizes.png
                └── README.txt

Shared Workshop Data (Read-Only)

/blue/bioinf_workshop/share/nfcore_rnaseq_files/
├── fastqc/
├── fq_lint/
├── genome/
├── multiqc/
├── pipeline_info/
├── star_rsem/
└── trimgalore/

Note: This directory is read-only. Your scripts will read from here but write output to your own working directory.

Setup Instructions

Create Your Working Directory and Clone the Repo

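If you are not already logged in to HiPerGator, open a terminal and SSH in. On your local machine, $USER expands to your local username, so replace it with your GatorLink username if the two differ: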

ssh $USER@hpg.rc.ufl.edu

Then navigate to your workshop directory and clone the repo:

cd /blue/bioinf_workshop/$USER
git clone https://github.com/UFHCC-BCBSR/res-bioinfo-rnaseq-workshop.git rnaseq_workshop
cd rnaseq_workshop
ls -la

You should see: demo-analysis/, docs/, .gitignore, .Renviron, mkdocs.yml, README.md
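
If you would rather not compare the listing by eye, the check can be sketched as a small shell function. This is only an illustration: check_repo_layout is a hypothetical helper, not something shipped in the workshop repo, and the list of entries is copied from the line above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the workshop repo): report which of the
# expected top-level entries are missing from a cloned directory.
check_repo_layout() {
    local repo_dir="${1:-.}"   # directory to inspect; defaults to the current one
    local missing=0
    local entry
    for entry in demo-analysis docs .gitignore .Renviron mkdocs.yml README.md; do
        if [ ! -e "$repo_dir/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

From inside rnaseq_workshop/, running check_repo_layout prints nothing when every entry is present and one "missing: ..." line per absent entry otherwise.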

Launch RStudio Server

Make sure you are in the cloned repo directory before submitting

RStudio Server reads .Renviron from the directory it is launched from. You must be inside rnaseq_workshop/ before running sbatch so that this file is picked up.

pwd

You should see /blue/bioinf_workshop/$USER/rnaseq_workshop — if not, run cd /blue/bioinf_workshop/$USER/rnaseq_workshop first.
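
The same check can be written as a guard that you run before submitting anything. The sketch below assumes that "being in the right place" means the current directory is named rnaseq_workshop and contains .Renviron; in_repo_root is a hypothetical name chosen for illustration.

```shell
#!/bin/bash
# Hypothetical guard: succeed only when the current directory looks like the
# cloned repo root (named rnaseq_workshop and containing .Renviron).
in_repo_root() {
    [ "$(basename "$PWD")" = "rnaseq_workshop" ] && [ -f .Renviron ]
}
```

For example, `in_repo_root || echo "Not in the repo root: cd there first"` prints the reminder only when you are somewhere else.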

Create and submit a SLURM job to launch RStudio Server:

cat > rstudio.sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=rserver
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=28gb
#SBATCH --time=05:00:00
#SBATCH --output=rserver_%j.log
#SBATCH --error=rserver_%j.err
#SBATCH --account=bioinf_workshop
#SBATCH --qos=bioinf_workshop

module purge; module load R/4.5
rserver
EOF
sbatch rstudio.sbatch

Check that the job is running:

squeue -u $USER

Once running, check the log for your connection details (replace 12345678 with your actual job ID):

cat rserver_12345678.log

You will see output like:

Starting rserver on port 45261 in the /blue/bioinf_workshop/username/rnaseq_workshop directory.
Create an SSH tunnel with:
ssh -N -L 8080:c0710a-s29.ufhpc:45261 username@hpg.rc.ufl.edu
Then, open in the local browser:
http://localhost:8080
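
If you do not want to look up the job ID, the tunnel command can be pulled from the newest log file instead. The sketch below assumes the log follows the format shown above, with the command on a line containing "ssh -N -L"; tunnel_command is a hypothetical helper, not part of the workshop materials.

```shell
#!/bin/bash
# Hypothetical helper: print the SSH tunnel command from the most recently
# modified rserver_*.log in a directory (default: current directory).
tunnel_command() {
    local log_dir="${1:-.}"
    local latest
    # ls -t sorts newest first; log names contain no spaces (rserver_<jobid>.log).
    latest=$(ls -t "$log_dir"/rserver_*.log 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "no rserver_*.log found in $log_dir" >&2
        return 1
    fi
    # Print only the first line that contains the tunnel command.
    grep -m 1 'ssh -N -L' "$latest"
}
```

Run tunnel_command on HiPerGator from the directory where you submitted the job, then copy the printed line into a terminal on your local machine.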

What is all that other output in the log?

You will see RStudio Server startup messages after the SSH tunnel instructions, including what looks like an ERROR about a missing config file. This is normal and can be ignored. If RStudio loaded in your browser, everything is working correctly.

To connect to RStudio:

Open a new terminal window on your local machine (not on HiPerGator) and run the ssh -N -L ... command shown in your log. The terminal will appear to hang — this is normal, leave it running.
Open any browser and go to http://localhost:8080.

Verify Your Setup in RStudio

Once RStudio opens, verify your working directory and package library path in the Console.

First check your working directory — it should already be set to the repo root if you submitted the job from the right place:

getwd()

/blue/bioinf_workshop/username/rnaseq_workshop

If it is not correct, set it manually (replace username with your GatorLink username):

setwd("/blue/bioinf_workshop/username/rnaseq_workshop")

Then use the gear icon ⚙ in the Files panel → Go To Working Directory to confirm.

Next, verify your package library path is correct:

.libPaths()

[1] "/blue/bioinf_workshop/share/R_libs"
[2] "/usr/local/lib/R/site-library"
[3] "/usr/lib/R/site-library"
[4] "/usr/lib/R/library"

If your .libPaths() looks different

The shared library at /blue/bioinf_workshop/share/R_libs must be first in the list, because this is where all the workshop packages are pre-installed. If it is missing or not first, the scripts will fail with "package not found" errors. In the R Console, run the following (replace username with your GatorLink username):

source("/blue/bioinf_workshop/username/rnaseq_workshop/.Rprofile")

All scripts use the here package to build file paths relative to the repository root, so they work for everyone without needing to change any paths in the code.

Running the Differential Expression Analysis

With your working directory set, open and run the three numbered scripts in order — each picks up where the previous one left off.

01 — Quality Control

File: demo-analysis/scripts/01_quality_control.Rmd
Run as: Chunk by chunk in RStudio

Loads the nf-core/rnaseq count matrix directly from the shared directory, converts Ensembl IDs to gene symbols, assesses sample quality, filters lowly expressed genes, and applies TMM normalization.

Before running, confirm your metadata looks correct by opening demo-analysis/data/metadata/sample_metadata.csv in the Files panel.

Outputs → output/differential-expression/:

File	Description
`DGE_filtered_normalized.rds`	Normalized DGEList for script 02
`figures/library_sizes_filtered.png`	Library size QC plot
`figures/mds_plot.png`	Sample similarity plot
`figures/correlation_heatmap.png`	Sample correlation heatmap

02 — Differential Expression

File: demo-analysis/scripts/02_differential_expression.Rmd
Run as: Chunk by chunk in RStudio
Requires: Script 01 to have been run

Identifies differentially expressed genes between PRMT7 knockdown and wildtype using limma-voom.

Outputs → output/differential-expression/:

File	Description
`results/de_results_all.csv`	Full DE results table
`results/de_results_significant.csv`	FDR < 0.05 only
`results/sessionInfo.txt`	Session info
`figures/volcano_plot.png`	Volcano plot
`figures/ma_plot.png`	MA plot
`figures/heatmap_top50.png`	Top 50 DE genes heatmap

03 — Pathway Analysis

File: demo-analysis/scripts/03_pathway_analysis.Rmd
Run as: Chunk by chunk in RStudio
Requires: Script 02 to have been run

Identifies enriched biological processes and pathways among differentially expressed genes using GO and KEGG over-representation analysis.

Outputs → output/differential-expression/:

File	Description
`results/GO_BP_enrichment.csv`	Full GO biological process results
`results/GO_BP_enrichment_simplified.csv`	Simplified GO results
`results/GO_BP_upregulated.csv`	GO results for upregulated genes
`results/GO_BP_downregulated.csv`	GO results for downregulated genes
`results/KEGG_enrichment.csv`	KEGG pathway results
`figures/GO_BP_dotplot.png`	GO dotplot
`figures/GO_BP_emap.png`	GO enrichment map
`figures/GO_BP_cnetplot.png`	GO concept network
`figures/GO_BP_simplified_dotplot.png`	Simplified GO dotplot
`figures/GO_up_vs_down.png`	Up vs. down GO comparison
`figures/KEGG_dotplot.png`	KEGG dotplot
`figures/GO_vs_KEGG_comparison.png`	GO vs. KEGG comparison

Optional Scripts

opt_01 — Prepare nf-core Data

File: scripts/optional/opt_01_prepare_nfcore_data.R
Run as: Source in RStudio

Reads the raw nf-core/rnaseq pipeline output, converts Ensembl IDs to gene symbols, and generates QC and summary files. The core workshop scripts read the count matrix directly, so this script is not required, but it is useful if you want to explore the data preparation steps in more detail.

Outputs → output/optional/:

File	Description
`rsem.merged.gene_counts.tsv`	Gene count matrix with gene symbols
`sample_info.tsv`	Sample metadata
`gene_annotation.tsv`	Ensembl ID to gene symbol mapping
`data_summary.txt`	Summary statistics
`library_sizes.png`	Library size QC plot
`README.txt`	File descriptions

opt_02 — edgeR + GREIN Comparison

File: scripts/optional/opt_02_edgeR_GREIN_comparison.Rmd
Run as: Chunk by chunk in RStudio
Requires: Script 02 to have been run

Reproduces a GREIN-style edgeR exact test analysis and compares results to limma-voom. Demonstrates reproducibility challenges when methods documentation is incomplete.

Outputs → output/differential-expression/:

File	Description
`results/edger_grein_matched_results.csv`	edgeR vs. limma-voom comparison
`results/sessionInfo_edger.txt`	Session info
`figures/volcano_plot_edger_grein.png`	edgeR volcano plot
`figures/ma_plot_edger_grein.png`	edgeR MA plot

opt_03 — Advanced Designs: Two-Factor Analysis

File: scripts/optional/opt_03_advanced_designs.Rmd
Run as: Chunk by chunk in RStudio
Requires: Nothing (standalone script)

Demonstrates limma-voom with a two-factor experimental design using a published Drosophila temperature adaptation dataset. Covers interaction models, contrast matrices, and parallel vs. divergent response patterns.

Outputs → output/differential-expression/:

File	Description
`results/drosophila_contrast_results.csv`	Full contrast results
`results/drosophila_maine_temperature.csv`	Maine temperature effect
`results/drosophila_panama_temperature.csv`	Panama temperature effect
`results/drosophila_interaction.csv`	Interaction results
`results/drosophila_parallel_response_genes.csv`	Parallel response genes
`results/drosophila_divergent_response_genes.csv`	Divergent response genes
`figures/volcano_drosophila_temperature.png`	Drosophila volcano plot

Troubleshooting

Working directory is wrong

getwd()

If it doesn't return /blue/bioinf_workshop/username/rnaseq_workshop, run:

setwd("/blue/bioinf_workshop/username/rnaseq_workshop")

Then use the gear icon ⚙ in the Files panel → Go To Working Directory.

"Cannot find file" errors

list.files("demo-analysis/output/differential-expression")

If empty, you need to run the preceding script first.
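
To see which script to run next, a terminal on HiPerGator can give a quick summary. The sketch below takes one output file per core script, using the names from the output tables above, and assumes that finding the file means the script has been run. That is a rough stand-in for a real completeness check, and report_progress is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: for each core script, check whether one of its listed
# output files exists. Presence of the file is used as a stand-in for
# "this script has been run"; it does not verify that the run completed.
report_progress() {
    local out_dir="${1:-demo-analysis/output/differential-expression}"
    local pair label file
    for pair in \
        "01:DGE_filtered_normalized.rds" \
        "02:results/de_results_all.csv" \
        "03:results/GO_BP_enrichment.csv"; do
        label="${pair%%:*}"   # text before the first colon (script number)
        file="${pair#*:}"     # text after the first colon (marker file)
        if [ -f "$out_dir/$file" ]; then
            echo "script $label: output found ($file)"
        else
            echo "script $label: no output yet ($file missing)"
        fi
    done
}
```

Run report_progress from the repo root; the first script reported as having no output is the one to run next.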

Package not installed

If .libPaths() is correct and you still get package errors, let an instructor know — all required packages should be pre-installed in the shared library. If you need to install a package yourself:

install.packages(c("tidyverse", "ggrepel", "pheatmap", "RColorBrewer", "here"))
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("limma", "edgeR"))