Annotation
Obtain Additional GATK resource files needed
Use Google’s gsutil to download various annotation files that will be used by GATK and other resources. gsutil will be used to download these file from Google cloud storage.
cd /workspace/inputs/references/
mkdir -p gatk
cd gatk
# SNP calibration call sets - dbsnp, hapmap, omni, and 1000G
# Runtime: < 2min
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
# Runtime: ~ 2min
bgzip --threads 8 Homo_sapiens_assembly38.dbsnp138.vcf
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz .
# Indel calibration call sets - dbsnp, Mills
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz .
# Interval lists that can be used to parallelize certain GATK tasks
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/wgs_calling_regions.hg38.interval_list .
gsutil cp -r gs://genomics-public-data/resources/broad/hg38/v0/scattered_calling_intervals/ .
# list the files we just downloaded
ls -lh
Download coordinates describing the Exome Reagent used to generate our exome data (SeqCapEZ_Exome_v3.0)
The reagent used to produce the exome sequenced data for this course was the SeqCapEZ_Exome_v3.0 from Roche Nimblegen. Nimblegen provides a set of files describing this reagent which can be downloaded from their site
here. We are most interested in the file named SeqCap_EZ_Exome_v3_hg19_capture_targets.bed
which contains the chromosome, start, stop, and gene annotation for each probe used in the reagent. Let’s go ahead and downlod these files for later use.
# change directories
mkdir -p /workspace/inputs/references/exome
cd /workspace/inputs/references/exome
# download the files
wget -c https://sequencing.roche.com/content/dam/rochesequence/worldwide/resources/SeqCapEZ_Exome_v3.0_Design_Annotation_files.zip
unzip SeqCapEZ_Exome_v3.0_Design_Annotation_files.zip
# remove the zip
rm -f SeqCapEZ_Exome_v3.0_Design_Annotation_files.zip
Convert SeqCapEZ_Exome_v3.0
You might have noticed that these annotation files from Nimblegen are all from the hg19 genome assembly. Obviously this presents a problem as the analysis we’re performing is using the newer hg38 assembly. This issue is actually not an uncommon situation and a variety of tools exist that are designed to make converting from one assembly to another easier. In the next section we will be using UCSC liftover to perform this task.
To start we first need to download a chain file specific to the assembly conversion we want to perform (in our case hg19 -> hg38). These files provide a mapping between the two assemblies and can be downloaded from the UCSC site. Once we have our chain file we can run liftOver
which will take following positional arguments.
- Path to bed file to convert
- Path to chain file for the desired conversion (downloaded from UCSC)
- Path to output file that will contain the new coordinates
- Path to output file containing any coordinates that do not convert successfully
It will also be usefull to have a version of this file without the gene annotations lets go ahead and create that here as well.
Here we are processing two of the downloaded files, SeqCap_EZ_Exome_v3_hg19_primary_targets.bed
and SeqCap_EZ_Exome_v3_hg19_capture_targets.bed
. The former corresponding to the exome regions that we are interested in and the latter corresponding to the regions that probes were able to be designed on.
# change to the appropriate directory
cd /workspace/inputs/references/exome
# download the chain file
wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
# run liftover
liftOver SeqCapEZ_Exome_v3.0_Design_Annotation_files/SeqCap_EZ_Exome_v3_hg19_primary_targets.bed hg19ToHg38.over.chain.gz SeqCap_EZ_Exome_v3_hg38_primary_targets.bed unMapped.bed
liftOver SeqCapEZ_Exome_v3.0_Design_Annotation_files/SeqCap_EZ_Exome_v3_hg19_capture_targets.bed hg19ToHg38.over.chain.gz SeqCap_EZ_Exome_v3_hg38_capture_targets.bed unMapped1.bed
# create a version in standard bed format (chr, start, stop)
cut -f 1-3 SeqCap_EZ_Exome_v3_hg38_primary_targets.bed > SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed
cut -f 1-3 SeqCap_EZ_Exome_v3_hg38_capture_targets.bed > SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.bed
# take a quick look at the format of these files
head SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed
head SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.bed
# a very common question in exome (or any capture based approach) is: how big is my capture space?
# use bedtools to determine the size of the capture space represented by this bed file
# first sort the bed files and store the sorted versions
bedtools sort -i SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed > SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.bed
bedtools sort -i SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.bed > SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.bed
# now merge the bed files to collapse any overlapping regions so they are not double counted.
bedtools merge -i SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.bed > SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed
bedtools merge -i SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.bed > SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed
# finally use a Perl one liner to determine the size of each non-overlapping region and determine the cumulative sum
perl -ne 'chomp; @l=split("\t",$_); $size += $l[2]-$l[1]; print "$size\n"' SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed
perl -ne 'chomp; @l=split("\t",$_); $size += $l[2]-$l[1]; print "$size\n"' SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed
# note that the size of the space targeted by the exome reagent is ~63 Mbp. Does that sound reasonable?
# now create a subset of these bed files
grep -w -P "^chr6|^chr17" SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed > exome_regions.bed
grep -w -P "^chr6|^chr17" SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed > probe_regions.bed
# clean up intermediate files
#rm -fr SeqCapEZ_Exome_v3.0_Design_Annotation_files/ SeqCap_EZ_Exome_v3_hg38_primary_targets.bed SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.bed unMapped.bed
Obtaining an interval list for the exome bed files
NOTE: The following interval list is being created for an annotation file the corresponds to the whole genome. The ref_genome.dict
we downloaded in the previous section only contains chr6 and chr17. We need to either trim the exome ROI file down to just these chrs as well. Or also download/create a full version of the dict file.
# first for the complete exome and probe bed file
cd /workspace/inputs/references/
mkdir temp
cd temp
wget http://genomedata.org/pmbio-workshop/references/genome/all/ref_genome.dict
cd /workspace/inputs/references/exome
java -jar $PICARD BedToIntervalList I=SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed O=SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.interval_list SD=/workspace/inputs/references/temp/ref_genome.dict
java -jar $PICARD BedToIntervalList I=SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed O=SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.interval_list SD=/workspace/inputs/references/temp/ref_genome.dict
rm -fr /workspace/inputs/references/temp/
# next for our subset exome and probe regions file
cd /workspace/inputs/references/exome
java -jar /usr/local/bin/picard.jar BedToIntervalList I=exome_regions.bed O=exome_regions.bed.interval_list SD=/workspace/inputs/references/genome/ref_genome.dict
java -jar /usr/local/bin/picard.jar BedToIntervalList I=probe_regions.bed O=probe_regions.bed.interval_list SD=/workspace/inputs/references/genome/ref_genome.dict
Obtaining transcriptome reference files
In this section we will download some reference transcriptome annotations in the mostly widely used formats (.fa and .gtf). As with the other annotation files, we have created a subsetted version of them to make downstream analysis more practical for a live tutorial demonstration. To create these annotation files we followed these basic steps:
- Download complete GTF files from Ensembl represent all gene/transcript annotations (e.g.
Homo_sapiens.GRCh38.94.gtf.gz
) from Ensembl’s FTP site. - Fix the chromosome names in this GTF. Remember that Ensembl uses names like
1
,2
, etc. but our reference genome uses names likechr1
,chr2
, etc. We perform this conversion using chromosome synonym mappings from Ensembl and a simple scriptconvertEnsemblGTF.pl
. - Next we use a simple grep to pull out the subset of chromosomes we care about
- Finally we use the resulting subsetted GTF to create FASTA sequences for each transcript using the
gtf_to_fasta
tool. This tools takes exon coordinates in a GTF file and a reference genome sequence and creates the reference transcript sequences.
Complete details of how these files were created can be found in the Developer Notes.
Now lets actually download these files and examine them…
# make sure CHRS environment variable is set. If this command doesn't give a value, please return to the Environment section of the course
echo $CHRS
# create a directory for transcriptome input files
mkdir -p /workspace/inputs/references/transcriptome
cd /workspace/inputs/references/transcriptome
# download the files
wget http://genomedata.org/pmbio-workshop/references/transcriptome/$CHRS/ref_transcriptome.gtf
wget http://genomedata.org/pmbio-workshop/references/transcriptome/$CHRS/ref_transcriptome.fa
# take a look at the contents of the gtf file. Press 'q' to exit the 'less' display.
less -p start_codon -S ref_transcriptome.gtf
# How many unique gene IDs are in the .gtf file?
# We can use a perl command-line command to find out:
perl -ne 'if ($_ =~ /(gene_id\s\"ENSG\w+\")/){print "$1\n"}' ref_transcriptome.gtf | sort | uniq | wc -l
- Using perl -ne ‘’ will execute the code between single quotes, on the .gtf file, line-by-line.
- The $_ variable holds the contents of each line.
- The ‘if ($_ =~//)’ is a pattern-matching command which will look for the pattern “gene_id” followed by a space followed by “ENSG” and one or more word characters (indicated by \w+) surrounded by double quotes.
- The pattern to be matched is enclosed in parentheses. This allows us to print it out from the special variable $1.
- The output of this perl command will be a long list of ENSG Ids.
- By piping to sort, then uniq, then word count we can count the unique number of genes in the file.
# what are all the feature types listed in the third column of the GTF?
# how does the following command (3 commands piped together) answer that question?
cut -f 3 ref_transcriptome.gtf | sort | uniq -c
Review key definitions for this section
- Reference genome - The nucleotide sequence of the chromosomes of a species. Genes are the functional units of a reference genome and gene annotations describe the structure of transcripts expressed from those gene loci.
- Gene annotations - Descriptions of gene/transcript models for a genome. A transcript model consists of the coordinates of the exons of a transcript on a reference genome. Additional information such as the strand the transcript is generated from, gene name, coding portion of the transcript, alternate transcript start sites, and other information may be provided.
- GTF (.gtf) file - A common file format referred to as Gene Transfer Format used to store gene and transcript annotation information. You can learn more about this format here: http://genome.ucsc.edu/FAQ/FAQformat#format4