Indexing
Indexing is widely used in bioinformatics workflows to improve performance. Typically it is applied to large data files with many records to improve the ability of tools to rapidly access random locations of the file. For example, if we have alignments against the entire genome and we want to visualize those alignments for a single gene on chr17 in a genome viewer such as IGV, we don’t want the viewer to have to scan through the entire file. Indexing allows us to jump right to the correct place in the file and pull out just the information we need without reading much of the file.
There are many different types of indexes and tools that produce them. You have already seen the fasta index .fai
for our reference genome. Other things that get indexed for this course include: reference transcript fasta, reference transcriptome GTF, VCF files containing called variants, VCF files containing variant annotations, BAM files containing alignments, tab-delimited files with various pipeline results, etc.
There is no one standard way of producing the best index file for an particular purpose. This is because it depends on how information is organized in the file and how one intends to access it. This means that even the same data file is sometimes indexed more than once.
The most common example of this is the reference genome. Each alignment algorithm we use (and sometimes even different versions of the same algorithm) requires its own distinctive index. Having the right index for each tool is important, and trying to use an incorrect one is a common error encountered in this kind of analysis.
In the section below you will create several of the indexes we will use in later sections. Other indexing commands may appear throughout the course as they are needed.
Create a reference genome index for use with BWA for DNA alignments
Use bwa index to create an index for alignment. This step can take a long time (e.g. > 1 hour) on large genomes such as the human genome reference. However, you only have to do this once for each reference genome and version of BWA. Furthermore, it is sometimes possible to download a pre-computed version of the index for your genome. However, one should be careful with this approach given all of the caveats above about algorithm versions and all of the different versions of the reference genome. Some algorithms may offer a multi-threaded approach to indexing as well.
# run time for this index is ~4 minutes
cd /workspace/inputs/references/genome
bwa index /workspace/inputs/references/genome/ref_genome.fa
For many alignment algorithms, some of the algorithmic innovation is actually in how the index is structured and used. To learn more about the widely used BWA aligner, refer to the papers for BWA and BWA-mem.
Create a reference genome index for use with HISAT for splice RNA alignments to the genome
Use HISAT’s indexing tool to create an index of the reference genome. HISAT
cd /workspace/inputs/references/transcriptome
# create a database of observed splice sites represented in our reference transcriptome GTF
/usr/local/bin/hisat2-2.0.4/hisat2_extract_splice_sites.py ref_transcriptome.gtf > splicesites.tsv
head splicesites.tsv
# create a database of exon regions in our reference transcriptome GTF
/usr/local/bin/hisat2-2.0.4/hisat2_extract_exons.py ref_transcriptome.gtf > exons.tsv
head exons.tsv
# build the reference genome index for HISAT and supply the exon and splice site information extracted in the previous steps
# specify to use 8 threads with the `-p 8` option
# run time for this index is ~5 minutes
/usr/local/bin/hisat2-2.0.4/hisat2-build -p 8 --ss splicesites.tsv --exon exons.tsv /workspace/inputs/references/genome/ref_genome.fa ref_genome
To learn more about the HISAT2 algorithm, refer to the paper for HISAT.
Create a reference transcriptome index for use with Kallisto
In contrast to HISAT, Kallisto performs RNA-seq analysis using transcript sequences only (i.e. it is “reference free”). This can be particularly useful for those studying organisms without a reference genome build. Kallisto still needs an index, but this time it will be of the transcript sequences instead of the DNA (chromosome) sequences. Kallisto uses an index but remember that Kallisto does not perform alignment or use a reference genome sequence. Instead it performs “pseudoalignment” to determine the compatibility of reads with targets (transcript sequences in this case).
cd /workspace/inputs/references/transcriptome
mkdir kallisto
cd kallisto
# tidy up the headers to just include the ensembl transcript ids
cat ../ref_transcriptome.fa | perl -ne 'if ($_ =~ /\d+\s+(ENST\d+)/){print ">$1\n"}else{print $_}' > ref_transcriptome_clean.fa
# run time for this index is ~30 seconds
kallisto index --index=ref_transcriptome_kallisto_index ref_transcriptome_clean.fa
- Take note of the output when
kallisto index
is being run. What information is provided? - What k-mer size is being used? What are the trade-offs between long vs. short k-mers?
- How many k-mers were identified in our reference transcriptome?
- How many transcripts do these correspond to?
To learn more about the Kallisto algorithm, refer to the Kallisto paper.
Index VCF annotation files for use with GATK
Remember that we downloaded various annotation files to use with GATK. To help GATK work with them efficiently we will now create indexes for each using GATK’s indexing tool. As with the indexing steps above, these commands can take several minutes.
cd /workspace/inputs/references/gatk/
#SNP calibration call sets - dbsnp, hapmap, omni, and 1000G
# Runtime: ~ 4min
gatk --java-options '-Xmx24g' IndexFeatureFile --feature-file /workspace/inputs/references/gatk/Homo_sapiens_assembly38.dbsnp138.vcf.gz
gatk --java-options '-Xmx24g' IndexFeatureFile --feature-file /workspace/inputs/references/gatk/hapmap_3.3.hg38.vcf.gz
gatk --java-options '-Xmx24g' IndexFeatureFile --feature-file /workspace/inputs/references/gatk/1000G_omni2.5.hg38.vcf.gz
# Runtime: ~ 3min
gatk --java-options '-Xmx24g' IndexFeatureFile --feature-file /workspace/inputs/references/gatk/1000G_phase1.snps.high_confidence.hg38.vcf.gz
#Indel calibration call sets - dbsnp, Mills
gatk --java-options '-Xmx24g' IndexFeatureFile --feature-file /workspace/inputs/references/gatk/Homo_sapiens_assembly38.known_indels.vcf.gz
gatk --java-options '-Xmx24g' IndexFeatureFile --feature-file /workspace/inputs/references/gatk/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz