HiPerGator for Bioinformatics | UFHCC BCB-SR Bioinformatics Hub

UF Health Cancer Institute Bioinformatics Hub

HiPerGator for Bioinformatics

Getting Started with Bioinformatics on HiPerGator

This guide introduces bioinformatics researchers to high-performance computing (HPC) using UF’s HiPerGator system. Whether you’re new to bioinformatics or ready to scale up your analyses, HiPerGator provides the computational power and memory needed for modern genomic research.

Why Use HPC for Bioinformatics?

Bioinformatics analyses are naturally suited for HPC because they typically involve:

Large datasets: Genomic files (FASTQ, BAM, VCF) can be hundreds of GB to TB in size
Memory-intensive processes: Genome assembly, variant calling, and large matrix operations often require 100GB+ of RAM
Parallel processing: Many bioinformatics tools can utilize multiple cores simultaneously
Long-running analyses: Genome assemblies, phylogenetic analyses, and population genomics can take days to weeks

HiPerGator servers have up to 1TB of memory and 128 cores per node - far beyond what’s available on personal computers.

What is HiPerGator?

HiPerGator is UF’s supercomputer consisting of hundreds of powerful servers managed by a job scheduler called Slurm. Instead of running analyses on your laptop, you submit jobs to a queue where they run on dedicated compute nodes with the resources you specify.

Getting Started: Essential Steps

1. Get Access

Request a HiPerGator account here. You’ll need UF faculty sponsorship.

2. Learn Basic Command Line

HiPerGator uses the Linux command line. If you’re new to this:

Try the Data Carpentry Unix tutorial
Complete HiPerGator training materials

3. Transfer Your Data

Use these methods based on data size:

Small files (<1GB): scp or rsync commands
Large genomic datasets: Globus transfer

4. Set Up Your Software Environment

Most bioinformatics tools are pre-installed:

module load samtools
module load blast
module load bwa

For Python/R packages, use conda environments:

module load conda
conda create -n myanalysis -c bioconda fastqc multiqc
conda activate myanalysis

5. Write Your Analysis Script

Your analysis should run from start to finish without interaction. For example:

#!/bin/bash
# Quality control
fastqc *.fastq.gz
multiqc .

# Alignment
bwa mem reference.fa sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o sample.bam
samtools index sample.bam

# Variant calling
bcftools mpileup -f reference.fa sample.bam | bcftools call -mv -o sample.vcf

6. Create a Job Script

Tell Slurm what resources you need:

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your.email@ufl.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --time=24:00:00
#SBATCH --partition=hpg-default

module load bwa samtools bcftools

# Run your analysis
bash my_analysis.sh

7. Submit and Monitor Your Job

sbatch my_job.sbatch
squeue -u $USER  # Check job status

Optimizing for Bioinformatics

Use Parallel Processing

Most bioinformatics tools support threading:

BWA: -t 8 for 8 threads
BLAST: -num_threads 8
SAMtools: -@ 8

Always match the thread count to your --cpus-per-task request.

Process Multiple Samples

Use job arrays to process many samples simultaneously:

#SBATCH --array=1-96  # For 96 samples
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" sample_list.txt)
fastqc ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz

Request Appropriate Resources

Genome assembly: 64+ cores, 500GB+ memory
RNA-seq alignment: 8-16 cores, 32GB memory
Variant calling: 4-8 cores, 16GB memory
Quality control: 2-4 cores, 8GB memory

Storage Considerations

Home directory: Small files, scripts (40GB limit)
Blue storage: Large datasets, long-term storage
Scratch space: Temporary job outputs (auto-deleted after 31 days)

Getting Help

UFRC Documentation: https://docs.rc.ufl.edu/
Support tickets: https://www.rc.ufl.edu/get-support/
Office hours: Check UFRC website for current schedule

Next Steps

Once comfortable with basic job submission:

Explore workflow managers like Nextflow or Snakemake
Learn about GPU computing for machine learning applications
Set up automated pipelines for routine analyses

HiPerGator makes complex bioinformatics analyses accessible and efficient. Start with simple jobs and gradually increase complexity as you become more comfortable with the system.

← HiPerGator Demo 0001-01-01