Data
Acknowledgement
First and foremost, we are extremely grateful to the patients who donate tumor samples for research and educational purposes such as this. Progress in disease research would not be possible without such contributions. These donations have helped to realize significant advances in our understanding of cancer biology and improvements in cancer outcomes. For one of many great summaries of these advances, please refer to Cancer - Our World in Data.
About the samples
Throughout the course we will use data from several sources, but many of the examples and exercises relate specifically to a single cell line, chosen to represent a hypothetical patient. More precisely, we will use a well-described breast cancer cell line (HCC1395) and its matched lymphoblastoid cell line (HCC1395BL). The data are hosted on our server at genomedata.org. The cell lines themselves can be obtained (for a nominal fee) from the American Type Culture Collection (ATCC), a nonprofit organization which collects, stores, and distributes standard reference microorganisms, cell lines and other materials for research and development (citation). Product pages for the HCC1395 cell lines can be found here: HCC1395/CRL-2324, HCC1395BL/CRL-2325.
NOTE from ATCC: This cell line was deposited at the ATCC by Dr. Adi F. Gazdar and is provided for research purposes only. Neither the cell line nor products derived from it may be sold or used for commercial purposes. Nor can the cells be distributed to third parties for purposes of sale, or producing for sale, cells or their products. The cells are provided as service to the research community.
The HCC1395 cell line was obtained from a 43-year-old Caucasian female patient. Its tissue of origin is described as: mammary gland; breast/duct. The HCC1395BL cell line was created from a B lymphoblast that was transformed by the Epstein-Barr virus (EBV). The patient’s cancer was described as: TNM stage I, grade 3, primary ductal carcinoma. This cell line was initiated in the 1990s from a patient with a family history of cancer (patient’s mother had breast cancer). The cell line took 14 months to establish. The patient received chemotherapy prior to isolation of the tumor (PMID: 9833771). This tumor is considered “Triple-Negative” by classic typing: ERBB2 -ve (aka HER2/neu), PR -ve, and ER -ve. It is otherwise difficult to classify by expression-based molecular typing but is likely of the “Basal” sub-type (PMID: 22003129). The tumor cell line is known to be polyploid. The tumor is also described as TP53 mutation positive.
About the data
The table below provides sample details for all data files. Note that the ‘Readgroup ID’ can be used to match downloaded files to these details. The data set for this course corresponds to the matched tumor/normal pair described above. For each of these samples, whole genome, exome, and RNA sequencing were performed. Whole genome sequencing was performed to a target median coverage depth of ~30x for the normal sample and ~50x for the tumor sample. Exome sequencing was performed to a target median coverage depth of ~100x. RNA-seq was performed for both tumor and normal. Note that because the two samples differ in tissue of origin, comparing the two RNA-seq samples to each other does not make sense biologically. Note also that some of these data types have multiple lanes of data (required to reach the target total depth).
Data (Filename Prefix / Readgroup ID) | MGI ID | Platform | Flowcell[-Barcode].Lane | Library | Sample Name |
---|---|---|---|---|---|
Exome_Norm | 2891351068 | Illumina | C1TD1ACXX-CGATGT.7 | exome_norm_lib1 | HCC1395BL_DNA |
Exome_Tumor | 2891351066 | Illumina | C1TD1ACXX-ATCACG.7 | exome_tumor_lib1 | HCC1395_DNA |
RNAseq_Norm_Lane1 | 2895625992 | Illumina | H3MYFBBXX-CTTGTA.4 | rna_norm_lib1 | HCC1395BL_RNA |
RNAseq_Norm_Lane2 | 2895626097 | Illumina | H3MYFBBXX-CTTGTA.5 | rna_norm_lib1 | HCC1395BL_RNA |
RNAseq_Tumor_Lane1 | 2895626107 | Illumina | H3MYFBBXX-GCCAAT.4 | rna_tumor_lib1 | HCC1395_RNA |
RNAseq_Tumor_Lane2 | 2895626112 | Illumina | H3MYFBBXX-GCCAAT.5 | rna_tumor_lib1 | HCC1395_RNA |
WGS_Norm_Lane1 | 2891323123 | Illumina | D1VCPACXX.6 | wgs_norm_lib1 | HCC1395BL_DNA |
WGS_Norm_Lane2 | 2891323124 | Illumina | D1VCPACXX.7 | wgs_norm_lib2 | HCC1395BL_DNA |
WGS_Norm_Lane3 | 2891323125 | Illumina | D1VCPACXX.8 | wgs_norm_lib3 | HCC1395BL_DNA |
WGS_Tumor_Lane1 | 2891322951 | Illumina | D1VCPACXX.1 | wgs_tumor_lib1 | HCC1395_DNA |
WGS_Tumor_Lane2 | 2891323174 | Illumina | D1VCPACXX.2 | wgs_tumor_lib1 | HCC1395_DNA |
WGS_Tumor_Lane3 | 2891323175 | Illumina | D1VCPACXX.3 | wgs_tumor_lib2 | HCC1395_DNA |
WGS_Tumor_Lane4 | 2891323150 | Illumina | D1VCPACXX.4 | wgs_tumor_lib2 | HCC1395_DNA |
WGS_Tumor_Lane5 | 2891323147 | Illumina | D1VCPACXX.5 | wgs_tumor_lib3 | HCC1395_DNA |
Download the data
Use wget to download all data to your instance:
mkdir -p /workspace/inputs/data/fastq
cd /workspace/inputs/data/fastq
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/Exome_Norm.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/Exome_Tumor.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/RNAseq_Norm.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/RNAseq_Tumor.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/WGS_Norm.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/WGS_Tumor.tar
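The six downloads above share one URL pattern, so they can also be expressed as a loop. The following is a minimal sketch, not part of the course scripts: the names `BASE_URL`, `SAMPLES`, `fastq_url` and `download_all` are illustrative, and it assumes `CHRS` was already set earlier during environment setup.

```shell
# Illustrative helpers (hypothetical names, not part of the course scripts).
# Assumes $CHRS was set earlier during environment setup.
BASE_URL="http://genomedata.org/pmbio-workshop/fastqs"
SAMPLES="Exome_Norm Exome_Tumor RNAseq_Norm RNAseq_Tumor WGS_Norm WGS_Tumor"

# Build the download URL for one sample tarball.
fastq_url() {
    echo "${BASE_URL}/${CHRS}/$1.tar"
}

# Download every tarball into the current directory, resuming partial files.
download_all() {
    for sample in $SAMPLES; do
        wget -c "$(fastq_url "$sample")"
    done
}
```

Running `fastq_url Exome_Norm` first is a cheap way to confirm the URL looks right before calling `download_all` from within /workspace/inputs/data/fastq.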
Unpack the individual fastq files
cd /workspace/inputs/data/fastq
tar -xvf Exome_Norm.tar
tar -xvf Exome_Tumor.tar
tar -xvf RNAseq_Norm.tar
tar -xvf RNAseq_Tumor.tar
tar -xvf WGS_Norm.tar
tar -xvf WGS_Tumor.tar
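The same unpacking step can be written as a loop over whatever archives are present. This is a small sketch under the assumption that all tarballs sit in one directory; the function name `unpack_all` is illustrative only.

```shell
# Illustrative helper (hypothetical name): unpack every .tar archive in a directory.
# Usage: unpack_all [directory]   (defaults to the current directory)
unpack_all() {
    dir="${1:-.}"
    for archive in "$dir"/*.tar; do
        # Skip the unexpanded pattern when no archives are present.
        [ -e "$archive" ] || continue
        # Extract next to the archive rather than into the caller's directory.
        tar -xvf "$archive" -C "$dir"
    done
}
```

For example, `unpack_all /workspace/inputs/data/fastq` would extract all six archives in place.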
Review the data files downloaded
cd /workspace/inputs/data/fastq
# list all files downloaded
tree
# view the exome normal sample data files
tree Exome_Norm
# view the WGS normal sample data files
tree WGS_Norm
# view the WGS tumor sample data files. why does it look so different from the normal sample?
tree WGS_Tumor
# view the RNA-seq tumor sample data files.
tree RNAseq_Tumor
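As a complement to `tree`, the number of FASTQ files per sample can be tallied and compared against the lanes listed in the table above. This is a rough sketch; the function names are illustrative, and it assumes the files end in `.fastq.gz` as in the examples in this section.

```shell
# Illustrative helpers (hypothetical names, not part of the course scripts).

# Count the gzipped FASTQ files anywhere under one sample directory.
count_fastqs() {
    find "$1" -type f -name '*.fastq.gz' | wc -l | tr -d ' '
}

# Print "<sample directory>: <count>" for each subdirectory of the given directory.
summarize_fastqs() {
    for d in "${1:-.}"/*/; do
        [ -d "$d" ] || continue
        echo "$(basename "$d"): $(count_fastqs "$d")"
    done
}
```

Calling `summarize_fastqs /workspace/inputs/data/fastq` prints one line per sample directory.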
Review the structure/contents of FASTQ files
cd /workspace/inputs/data/fastq
# show the first ten lines of the Exome Tumor fastq files
zcat Exome_Tumor/Exome_Tumor_R1.fastq.gz | head
zcat Exome_Tumor/Exome_Tumor_R2.fastq.gz | head
# what do R1 and R2 refer to? What is the length of each read?
# (the character count reported by wc includes the trailing newline, so subtract 1)
zcat Exome_Tumor/Exome_Tumor_R1.fastq.gz | head -n 2 | tail -n 1 | wc
# how many lines are there in the Exome_Tumor R1 file?
zcat Exome_Tumor/Exome_Tumor_R1.fastq.gz | wc -l # There are: 33,326,620
# how many read pairs (fragments) are there, given four lines per FASTQ record?
expr 33326620 / 4 # There are: 8,331,655 read pairs
# how many total bases of data are in the Exome Tumor data set?
echo "8331655 * (101 * 2)" | bc # There are: 1,682,994,310 bases of data
# how many total bases when expressed as "gigabases" (specify 2 decimal places using `scale`)
echo "scale=2; (8331655 * (101 * 2))/1000000000" | bc # There are: 1.68 Gbp of data
# what is the average coverage we expect to achieve with this much data for the exome region targeted?
# first determine the size of our exome regions (answer = 6683920).
cat /workspace/inputs/references/exome/exome_regions.bed | perl -ne 'chomp; @l=split("\t", $_); $size += $l[2]-$l[1]; if (eof){print "size = $size\n"}'
# now determine the average coverage of these positions by our bases of data
echo "scale=2; (8331655 * (101 * 2))/6683920" | bc # Average coverage expected = 251.79x
# what is the fundamental assumption of this calculation that is at least partially not true? What effect will this have on the observed coverage?
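The arithmetic above can be collected into a single reusable calculation. The sketch below is illustrative only (the function name `expected_coverage` is not part of the course material) and makes the same simplifying assumptions as the commands above: paired-end data, a fixed read length, and every base landing on the target region.

```shell
# Illustrative helper (hypothetical name): expected average coverage.
# Arguments: lines in the R1 FASTQ file, read length (bp), target region size (bp).
expected_coverage() {
    awk -v lines="$1" -v len="$2" -v target="$3" 'BEGIN {
        pairs = lines / 4          # four FASTQ lines per read, one R1 read per pair
        bases = pairs * len * 2    # two reads (R1 and R2) per pair
        printf "%.2f\n", bases / target
    }'
}

# Values taken from the Exome_Tumor example above.
expected_coverage 33326620 101 6683920
```

This prints 251.80 rather than 251.79 because `printf` rounds to two decimal places, whereas `bc` with `scale=2` truncates.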