Precision Medicine Bioinformatics

Introduction to bioinformatics for DNA and RNA sequence analysis



First and foremost, we are extremely grateful to the patients who donate tumor samples for research and educational purposes such as this. Progress in disease research would not be possible without such contributions. These donations have helped to realize significant improvements in our understanding of cancer biology and improvements in cancer outcomes. For one of many great summaries of these advances, please refer to Cancer - our world in data.

About the samples

Throughout the course we will use data from several sources, but many of the examples and exercises relate specifically to a single cell line pair, chosen to represent a hypothetical patient. More precisely, we will use a well-described breast cancer cell line (HCC1395) and its matched lymphoblastoid cell line (HCC1395BL). The data are hosted on our server. The cell lines themselves can be obtained (for a nominal fee) from the American Type Culture Collection (ATCC), a nonprofit organization that collects, stores, and distributes standard reference microorganisms, cell lines, and other materials for research and development (citation). Product pages for the HCC1395 cell lines can be found here: HCC1395/CRL-2324, HCC1395BL/CRL-2325.

NOTE from ATCC: This cell line was deposited at the ATCC by Dr. Adi F. Gazdar and is provided for research purposes only. Neither the cell line nor products derived from it may be sold or used for commercial purposes. Nor can the cells be distributed to third parties for purposes of sale, or producing for sale, cells or their products. The cells are provided as service to the research community.

The HCC1395 cell line was obtained from a 43-year-old Caucasian female patient. Its tissue of origin is described as mammary gland; breast/duct. The HCC1395BL cell line was created from a B lymphoblast that was transformed by Epstein-Barr virus (EBV). The patient’s cancer was described as TNM stage I, grade 3, primary ductal carcinoma. This cell line was initiated in the 1990s from a patient with a family history of cancer (the patient’s mother had breast cancer). The cell line took 14 months to establish. The patient received chemotherapy prior to isolation of the tumor (PMID: 9833771). This tumor is considered “Triple-Negative” by classic typing: ERBB2 -ve (aka HER2/neu), PR -ve, and ER -ve. Otherwise it is one of those difficult to classify by expression-based molecular typing, but it is likely of the “Basal” sub-type (PMID: 22003129). The tumor cell line is known to be polyploid. The tumor is also described as TP53 mutation positive.

About the data

The table below provides sample details for all data files. Note that the ‘Readgroup ID’ can be used to match downloaded files to these details. The data set for this course corresponds to the matched tumor/normal pair described above. For each of these samples, whole genome, exome, and RNA sequencing were performed. Whole genome sequencing was performed to a target median coverage depth of ~30x for the normal sample and ~50x for the tumor sample. Exome sequencing was performed to a target median coverage depth of ~100x. RNA-seq was performed for both tumor and normal. Note that because the two cell lines differ in tissue of origin, comparing the two RNA-seq samples to each other does not make sense biologically. Note that some of these data types have multiple lanes of data (required to reach the target total depth).

| Data (Filename Prefix / Readgroup ID) | MGI ID | Platform | FC[-BC].Lane | Library | Sample Name |
|---|---|---|---|---|---|
| Exome_Norm | 2891351068 | Illumina | C1TD1ACXX-CGATGT.7 | exome_norm_lib1 | HCC1395BL_DNA |
| Exome_Tumor | 2891351066 | Illumina | C1TD1ACXX-ATCACG.7 | exome_tumor_lib1 | HCC1395_DNA |
| RNAseq_Norm_Lane1 | 2895625992 | Illumina | H3MYFBBXX-CTTGTA.4 | rna_norm_lib1 | HCC1395BL_RNA |
| RNAseq_Norm_Lane2 | 2895626097 | Illumina | H3MYFBBXX-CTTGTA.5 | rna_norm_lib1 | HCC1395BL_RNA |
| RNAseq_Tumor_Lane1 | 2895626107 | Illumina | H3MYFBBXX-GCCAAT.4 | rna_tumor_lib1 | HCC1395_RNA |
| RNAseq_Tumor_Lane2 | 2895626112 | Illumina | H3MYFBBXX-GCCAAT.5 | rna_tumor_lib1 | HCC1395_RNA |
| WGS_Norm_Lane1 | 2891323123 | Illumina | D1VCPACXX.6 | wgs_norm_lib1 | HCC1395BL_DNA |
| WGS_Norm_Lane2 | 2891323124 | Illumina | D1VCPACXX.7 | wgs_norm_lib2 | HCC1395BL_DNA |
| WGS_Norm_Lane3 | 2891323125 | Illumina | D1VCPACXX.8 | wgs_norm_lib3 | HCC1395BL_DNA |
| WGS_Tumor_Lane1 | 2891322951 | Illumina | D1VCPACXX.1 | wgs_tumor_lib1 | HCC1395_DNA |
| WGS_Tumor_Lane2 | 2891323174 | Illumina | D1VCPACXX.2 | wgs_tumor_lib1 | HCC1395_DNA |
| WGS_Tumor_Lane3 | 2891323175 | Illumina | D1VCPACXX.3 | wgs_tumor_lib2 | HCC1395_DNA |
| WGS_Tumor_Lane4 | 2891323150 | Illumina | D1VCPACXX.4 | wgs_tumor_lib2 | HCC1395_DNA |
| WGS_Tumor_Lane5 | 2891323147 | Illumina | D1VCPACXX.5 | wgs_tumor_lib3 | HCC1395_DNA |

Download the data

Use wget to download all data to your instance:

mkdir -p /workspace/inputs/data/fastq 
cd /workspace/inputs/data/fastq
wget -c$CHRS/Exome_Norm.tar
wget -c$CHRS/Exome_Tumor.tar
wget -c$CHRS/RNAseq_Norm.tar
wget -c$CHRS/RNAseq_Tumor.tar
wget -c$CHRS/WGS_Norm.tar
wget -c$CHRS/WGS_Tumor.tar

Unpack the individual fastq files

cd /workspace/inputs/data/fastq
tar -xvf Exome_Norm.tar
tar -xvf Exome_Tumor.tar
tar -xvf RNAseq_Norm.tar
tar -xvf RNAseq_Tumor.tar
tar -xvf WGS_Norm.tar
tar -xvf WGS_Tumor.tar
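
Large downloads are sometimes interrupted, so it can help to confirm that each archive is readable before unpacking it. The following is a minimal sketch of that idea, not part of the course material: the function name `unpack_all` and the toy archive names are hypothetical, and the check only confirms that `tar` can list the archive, not that its contents are complete or correct.

```shell
# Hypothetical helper: unpack every .tar file in a directory,
# skipping any archive that tar cannot list.
unpack_all() {
  local dir="$1" archive
  for archive in "$dir"/*.tar; do
    [ -e "$archive" ] || continue                  # glob matched nothing
    if tar -tf "$archive" > /dev/null 2>&1; then   # listing fails on unreadable archives
      tar -xf "$archive" -C "$dir"
    else
      echo "skipping unreadable archive: $archive" >&2
    fi
  done
}

# Toy inputs so the sketch runs anywhere (not real course data).
work=$(mktemp -d)
mkdir "$work/Toy_Sample"
echo "placeholder" > "$work/Toy_Sample/Toy_Sample_R1.fastq"
tar -cf "$work/Toy_Sample.tar" -C "$work" Toy_Sample
rm -r "$work/Toy_Sample"
echo "not a tar archive" > "$work/Broken.tar"      # simulates a bad download

unpack_all "$work"
ls "$work/Toy_Sample"
```

For the course data, the equivalent call would be `unpack_all /workspace/inputs/data/fastq`, assuming the six archives were downloaded to that directory.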

Review the data files downloaded

cd /workspace/inputs/data/fastq

# list all files downloaded
tree

# view the exome normal sample data files
tree Exome_Norm

# view the WGS normal sample data files
tree WGS_Norm

# view the WGS tumor sample data files. why does it look so different from the WGS normal sample?
tree WGS_Tumor

# view the RNA-seq tumor sample data files.
tree RNAseq_Tumor
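
If `tree` is not installed, or a compact summary is preferred, the number of FASTQ files per sample directory can be counted with `find`. This is a rough sketch under stated assumptions: the function name `count_fastqs` and the toy directory names are hypothetical, and it assumes the files end in `.fastq.gz` as in the commands used elsewhere on this page.

```shell
# Hypothetical helper: print "<directory><TAB><number of .fastq.gz files>" per argument.
count_fastqs() {
  local dir n
  for dir in "$@"; do
    # find recurses, so files in nested sub-directories are counted too
    n=$(find "$dir" -type f -name '*.fastq.gz' | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$dir" "$n"
  done
}

# Toy layout so the sketch runs anywhere (empty placeholder files, not real data).
demo=$(mktemp -d)
mkdir -p "$demo/Sample_A" "$demo/Sample_B"
touch "$demo/Sample_A/Sample_A_R1.fastq.gz" "$demo/Sample_A/Sample_A_R2.fastq.gz"
touch "$demo/Sample_B/Sample_B_Lane1_R1.fastq.gz" "$demo/Sample_B/Sample_B_Lane1_R2.fastq.gz"
touch "$demo/Sample_B/Sample_B_Lane2_R1.fastq.gz" "$demo/Sample_B/Sample_B_Lane2_R2.fastq.gz"

summary=$(cd "$demo" && count_fastqs Sample_A Sample_B)
echo "$summary"
```

On the course data, running `count_fastqs WGS_Norm WGS_Tumor` from `/workspace/inputs/data/fastq` would make any difference in the number of files between the two samples easy to spot.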

Review the structure/contents of FASTQ files

cd /workspace/inputs/data/fastq

# show the first ten lines of the Exome Tumor fastq files
zcat Exome_Tumor/Exome_Tumor_R1.fastq.gz | head
zcat Exome_Tumor/Exome_Tumor_R2.fastq.gz | head

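# what do R1 and R2 refer to? What is the length of each read?
# (wc reports lines, words, and characters; the character count includes the trailing newline, so subtract 1)
zcat Exome_Tumor/Exome_Tumor_R1.fastq.gz | head -n 2 | tail -n 1 | wc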

# how many lines are there in the Exome_Tumor file
zcat Exome_Tumor/Exome_Tumor_R1.fastq.gz | wc -l # There are: 33,326,620

# how many paired reads or fragments are there then? (each FASTQ record spans 4 lines)
expr 33326620 / 4 # There are: 8,331,655 read pairs (fragments)

# how many total bases of data are in the Exome Tumor data set?
echo "8331655 * (101 * 2)" | bc # There are: 1,682,994,310 bases of data

# how many total bases when expressed as "gigabases" (specify 2 decimal points using `scale`)
echo "scale=2; (8331655 * (101 * 2))/1000000000" | bc # There are: 1.68 Gbp of data

# what is the average coverage we expect to achieve with this much data for the exome region targeted?

# first determine the size of our exome regions (answer = 6683920). 
cat /workspace/inputs/references/exome/exome_regions.bed | perl -ne 'chomp; @l=split("\t", $_); $size += $l[2]-$l[1]; if (eof){print "size = $size\n"}' 

# now determine the average coverage of these positions by our bases of data
echo "scale=2; (8331655 * (101 * 2))/6683920" | bc # Average coverage expected = 251.79x

# what is the fundamental assumption of this calculation that is at least partially not true? What effect will this have on the observed coverage?
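
The arithmetic above can be wrapped into a small reusable function. The following is only a sketch, not part of the course material: the function name `expected_coverage` and the toy input files are hypothetical. It assumes well-formed four-line FASTQ records, an R2 file holding as many bases as R1, and non-overlapping BED intervals, and it inherits the same simplifying assumption as the calculation above.

```shell
# Hypothetical helper: expected average coverage from an R1 FASTQ file and a BED file.
expected_coverage() {
  local fastq_gz="$1" bed="$2"
  local bases target
  # Sum the lengths of the sequence lines (line 2 of every 4-line record).
  bases=$(gzip -dc "$fastq_gz" | awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }')
  # Sum (end - start) over all intervals; BED coordinates are 0-based, half-open.
  target=$(awk -F'\t' '{ n += $3 - $2 } END { print n + 0 }' "$bed")
  # Two reads per fragment, so the R1 bases are doubled (assumes R2 matches R1).
  awk -v b="$bases" -v t="$target" 'BEGIN { printf "%.2f\n", (b * 2) / t }'
}

# Toy inputs so the sketch runs anywhere (not real course data):
# two 10 bp reads and two 4 bp target intervals.
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' \
  | gzip -c > "$tmpdir/toy_R1.fastq.gz"
printf 'chr1\t0\t4\nchr1\t10\t14\n' > "$tmpdir/toy.bed"

coverage=$(expected_coverage "$tmpdir/toy_R1.fastq.gz" "$tmpdir/toy.bed")
echo "expected coverage: ${coverage}x"   # toy data: (20 * 2) / 8 = 5.00
```

Unlike the hard-coded numbers above, this version measures read lengths directly, so it also works when reads are trimmed to different lengths. Note that `printf "%.2f"` rounds, whereas `bc` with `scale=2` truncates, so the last digit can differ between the two approaches.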