Precision Medicine Bioinformatics

Introduction to bioinformatics for DNA and RNA sequence analysis

RNAseq Fusion Calling


In addition to providing information about gene expression, RNA-seq data can be used to discover transcripts which result from chromosomal translocations. Translocations and their resultant chimeric (AKA fusion) transcripts are important driver mutations in many cancers. A variety of specialized alignment and filtering strategies have been developed to identify fusion transcripts from RNA, but these programs suffer from low specificity (i.e. many false-positives) and poor correlation across methods.

This tutorial uses the kallisto and pizzly tools for fusion detection from RNA-seq data. Kallisto quantifies transcript abundance through pseudoalignment. Pizzly aligns reads which kallisto has flagged as potentially spanning fusion junctions. Running the tutorial requires RNA fastq files, a reference transcriptome, and a gene annotation file- see below.


Prerequisites- This module assumes you have completed the initial setup process for the course including:

Additional setup:

Important: pizzly will not work with the most recent Ensembl human GTF annotation file. Download the version 87 GTF as shown in the below code block. We will subset the reference transcriptome fasta file. Fasta splitting programs which do not preserve the full Ensembl header such as gffread or gtf_to_fasta will not work with pizzly.

# Get files from source
mkdir -p /workspace/inputs/reference/fusion
cd /workspace/inputs/reference/fusion/
gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz       
gunzip -k Homo_sapiens.GRCh38.87.gtf.gz

# Get annotations for only chromosomes 6 and 17
cat Homo_sapiens.GRCh38.87.gtf | grep --color=never -w "^6" >> chr617.gtf
cat Homo_sapiens.GRCh38.87.gtf | grep --color=never -w "^17">> chr617.gtf 

# Parse transcriptome fasta, preserving full ensembl headers
#  Setup directory
mkdir -p /workspace/inputs/reference/fusion/per-feature
cd /workspace/inputs/reference/fusion/per-feature
#  Split fasta at each instance of a sequence header and write to new file
csplit -s -z ../Homo_sapiens.GRCh38.cdna.all.fa '/>/' '{*}'
#  If from chromosomes 6 or 17, rename files using the columns of the original ensemble header
#  (This step takes about 12 minutes. You can proceed with the next section in /workspace/inputs/data/fastq/chr6_and_chr17)
for f in xx*; do awk -F ":" 'NR==1 && $3=="6" || NR==1 && $3=="17"{print $2 "." $3 "." $4 "." $5}' $f | xargs -I{} mv $f {}.fa; done
#  Concatenate features from chromsomes 6 and 17 to a new reference fasta  
cd /workspace/inputs/reference/fusion
cat ./per-feature/GRCh38.6.*.fa ./per-feature/GRCh38.17.*.fa > chr617.fa
rm -rf per-feature

Run Fusion Alignment

mkdir -p /workspace/rnaseq/fusion
cd /workspace/rnaseq/fusion
kallisto index -i index.617.idx -k 31 --make-unique /workspace/inputs/reference/fusion/chr617.fa
kallisto quant -i index.617.idx --fusion -o kquant-norm617 /workspace/inputs/data/fastq/chr6_and_chr17/fusion/subRNAseq_Norm_R1.fastq.gz /workspace/inputs/data/fastq/chr6_and_chr17/fusion/subRNAseq_Norm_R2.fastq.gz
kallisto quant -i index.617.idx --fusion -o kquant-tumor617 /workspace/inputs/data/fastq/chr6_and_chr17/fusion/subRNAseq_Tumor_R1.fastq.gz /workspace/inputs/data/fastq/chr6_and_chr17/fusion/subRNAseq_Tumor_R2.fastq.gz
# (Output is created in the directory where command is run)
cd /workspace/rnaseq/fusion
pizzly -k 31 --gtf /workspace/inputs/reference/fusion/chr617.gtf --cache index-norm617.cache.txt --align-score 2 --insert-size 400 --fasta /workspace/inputs/reference/fusion/chr617.fa --output norm-fuse617 kquant-norm617/fusion.txt
pizzly -k 31 --gtf /workspace/inputs/reference/fusion/chr617.gtf --cache index-tumor617.cache.txt --align-score 2 --insert-size 400 --fasta /workspace/inputs/reference/fusion/chr617.fa --output tumor-fuse617 kquant-tumor617/fusion.txt

Possible Additional Analysis

# Create full index
kallisto index -i index.full.idx /workspace/inputs/reference/fusion/Homo_sapiens.GRCh38.cdna.all.fa

# Quantify potential fusions (**40 min**)
cd /workspace/rnaseq/fusion
kallisto quant -i index.full.idx --fusion -o kquant-norm-full /workspace/inputs/data/fastq/chr6_and_chr17/fusion/RNAseq_Norm_R1.fastq.gz /workspace/inputs/data/fastq/chr6_and_chr17/fusion/RNAseq_Norm_R2.fastq.gz
kallisto quant -i index.full.idx --fusion -o kquant-tumor-full /workspace/inputs/data/fastq/chr6_and_chr17/fusion/RNAseq_Tumor_R1.fastq.gz /workspace/inputs/data/fastq/chr6_and_chr17/fusion/RNAseq_Tumor_R2.fastq.gz

# Call fusions
pizzly -k 31 --gtf /workspace/inputs/reference/fusion/chr617.gtf --cache index-full-norm.cache.txt --align-score 2 --insert-size 400 --fasta /workspace/inputs/reference/fusion/chr617.fa --output norm-full kquant-norm-full/fusion.txt
pizzly -k 31 --gtf /workspace/inputs/reference/fusion/chr617.gtf --cache index-full-tumor.cache.txt --align-score 2 --insert-size 400 --fasta /workspace/inputs/reference/fusion/chr617.fa --output tumor-full kquant-tumor-full/fusion.txt