Precision Medicine Bioinformatics

Introduction to bioinformatics for DNA and RNA sequence analysis

DNA alignment lab

This lab will introduce students to a basic, and very common commandline bioinformatics task. We will obtain raw DNA sequence data, a reference sequence, perform any necessary indexing, align the raw data to reference, mark duplicate reads, and then prepare the alignment results for visualization in a genome viewer.

Prepare directories for analysis

First, let’s create a directory in your workspace for all your results (e.g. dna_alignment_lab). Within that directory we will also make three subdirectories to organize our analysis:

cd /workspace
mkdir dna_alignment_lab
cd dna_alignment_lab
mkdir alignment_results
mkdir fastq_files
mkdir reference_sequences

Obtain sequence data for alignment (fastq files)

We have provided a sample dataset for this exercise. This data represents paired (2 x 100bp) DNA exome sequence data in fastq format. DNA was isolated from a breast cancer cell line (HCC1395). We have provided sequence reads that have been limited to chromosome 21.

How could raw sequence data be limited to chr21?

We previously aligned the complete dataset, extracted only chr21 alignments, then reverted back to the raw (unaligned) reads.

#download seq data 
cd /workspace/dna_alignment_lab/fastq_files 

wget http://genomedata.org/seq-tec-workshop/read_data/dna_alignment_exercise/dataset_lab/HCC1395_Exome_chr21_R1.fastq.gz
wget http://genomedata.org/seq-tec-workshop/read_data/dna_alignment_exercise/dataset_lab/HCC1395_Exome_chr21_R2.fastq.gz

Obtain a reference sequence

For convenience, we also provide a reference genome file, limited to chromosome 21.

#download ref data 
cd /workspace/dna_alignment_lab/reference_sequences 

wget http://genomedata.org/seq-tec-workshop/references/human/chr21/chr21_references.fa

Index reference file with bwa

Now that we have our data, we need to create the files necessary to run an alignment. The first thing we need to do is index our reference sequence. Indexing your reference sequence allows the aligner to rapidly narrow down the potential origin of the query sequence within the genome, saving both time and memory.

cd /workspace/dna_alignment_lab/reference_sequences 

bwa index chr21_references.fa

Run bwa mem to create an alignment

Okay, now that we have an indexed reference sequence, we are ready to create an alignment.

cd /workspace/dna_alignment_lab/alignment_results

bwa mem -t 8 -o /workspace/dna_alignment_lab/alignment_results/HCC1395_Exome_chr21.sam /workspace/dna_alignment_lab/reference_sequences/chr21_references.fa /workspace/dna_alignment_lab/fastq_files/HCC1395_Exome_chr21_R1.fastq.gz /workspace/dna_alignment_lab/fastq_files/HCC1395_Exome_chr21_R2.fastq.gz

Convert sam to bam format

We’ve got an alignment! But we have several post-processing steps to complete. The first step is to convert your large .sam file to compressed .bam file.

cd /workspace/dna_alignment_lab/alignment_results

samtools view -@ 8 -h -b -o HCC1395_Exome_chr21.bam HCC1395_Exome_chr21.sam

Query name sort bam files

Next we need to sort the reads by their read names. This is expected for the next step (duplicate marking).

cd /workspace/dna_alignment_lab/alignment_results

java -Xmx60g -jar /home/ubuntu/bin/picard.jar SortSam I=HCC1395_Exome_chr21.bam O=HCC1395_Exome_chr21_namesorted_picard.bam SO=queryname

Perform duplicate marking

Next we need to mark the duplicate reads within our data. Duplicate reads are typically defined as reads with identical start and stop alignment positions. These reads are likely to be exact copies of a single DNA molecule. These reads introduce bias in your variant calling. If you did not mark duplicates, you risk over-representing areas that are preferentially amplified during PCR.

cd /workspace/dna_alignment_lab/alignment_results

java -Xmx60g -jar /home/ubuntu/bin/picard.jar MarkDuplicates I=HCC1395_Exome_chr21_namesorted_picard.bam  O=HCC1395_Exome_chr21_namesorted_picard_mrkdup.bam ASSUME_SORT_ORDER=queryname METRICS_FILE=HCC1395_Exome_chr21_mrk_dup_metrics.txt QUIET=true COMPRESSION_LEVEL=0 VALIDATION_STRINGENCY=LENIENT

# This command will also print out a txt file that gives you some metrics about the number of duplicates identified 

Position sort bam files

Next we need to position sort the bam file. Indexing requires a position-sorted bam.

cd /workspace/dna_alignment_lab/alignment_results

java -Xmx60g -jar /home/ubuntu/bin/picard.jar SortSam I=HCC1395_Exome_chr21_namesorted_picard_mrkdup.bam O=HCC1395_Exome_chr21_pos_sorted_mrkdup_picard.bam SO=coordinate

Index bam file

In order to efficiently load and search a bam file, downstream applications typically require an index. This is very similar to the index you created of your reference file.

cd /workspace/dna_alignment_lab/alignment_results

java -Xmx60g -jar /home/ubuntu/bin/picard.jar BuildBamIndex I=HCC1395_Exome_chr21_pos_sorted_mrkdup_picard.bam

Generate a flagstat report for bam file

There are many different ways to assess the quality of an alignment. We are going to use one method to get a fast summary of some common alignment statistics.

cd /workspace/dna_alignment_lab/alignment_results

samtools flagstat HCC1395_Exome_chr21_pos_sorted_mrkdup_picard.bam > HCC1395_Exome_chr21_pos_sorted_mrkdup_picard_flagstat.flagstat

Clean up un-needed sam/bam files

Keep final sorted, duplicated marked, bam/bai files and mrkdup.txt files. Delete everything else.

cd /workspace/dna_alignment_lab/alignment_results

mkdir final
mv HCC1395_Exome_chr21_pos_sorted_mrkdup_picard.* final/
mv *.txt final/
mv *.flagstat final/

rm *.sam
rm *.bam

View your alignment

Finally, we will visualize the alignments in IGV, a popular genome viewer. We will provide instructions for this. This is the URL you will use to view your bam. Note, you will need to replace student.i.p.address with the IP address of your personal AWS instance.

http://student.i.p.address/workspace/dna_alignment_lab/alignment_results/final/HCC1395_Exome_chr21_pos_sorted_mrkdup_picard.bam