Precision Medicine Bioinformatics

Introduction to bioinformatics for DNA and RNA sequence analysis

ChIP-seq analysis

ChIP-seq analysis

This exercise uses H3K4me3 ChIPseq data from the brain tissue of a Alzheimer’s disease patient that was produced by the ENCODE consortium (accession ENCBS748EGF) First, click through to that ENCODE page and compare it to another H3K4me3 experiment on the portal: ENCSR539TKY

Notice that ENCODE enforces rigorous QC standards and displays that information prominently on their page. When analyzing your own ChIP-seq data, their data quality standards are a great benchmark to use. This lesson doesn’t cover doing ChIP-seq QC, but it’s incredibly important, given that ChIP-seq experiments are more finicky than many other sequencing approaches. There are many resources available for more information, including the excellent Intro to ChIPseq using HPC online course.

To get started with analyzing this data, the pre-aligned sequences (e.g. alz_H3K4me3_rep1.bam) in bam format have been pre-downloaded to your instance:

cd ~/workspace/chipseq_data

ls 

These are small bam files that have been subset to just the first 10Mb of chr17 to speed up this analysis.

In order for tools to access the data in these bams, you’ll need to create an index file for each. Instead of running that command 4 different times, use xargs:

ls -1 *.bam | xargs -n 1 samtools index

Calling peaks

MACS is a package for ChIP-seq analysis that has many utilities. You can see these for yourself by running

macs2 --help

Today, we’re interested in using the callpeak utility to find peaks in our ChIP-seq data that correspond to H3K4me3 marks in this sample. Let’s see what inputs are needed:

macs2 callpeak --help

Woah, that’s a lot of options. MACS is highly configurable, and different types of ChIP-seq experiments might require different tweaks. Luckily for you, it has sensible defaults that will work well for the data type that you’re exploring today.

One step that’s frequently done in ChIP-seq is the removal of duplicate reads. In the above MACS options, there are some for handling dups appropriately, but we need to know what we’re dealing with first. Let’s check this bam file for duplicate reads:

samtools view -f 1024 alz_H3K4me3_rep1.bam | wc -l

Looks like they removed the duplicate reads already as part of their filtering process. Creating a new bam sans duplicates is typically not necessary, but since they have done it, we won’t need to worry about dealing with them in MACS.

As your next step, go ahead and try running MACS on this data. Note that you’re providing the ChIP data along with “input” data that serves as background. Having such input data is essential to distinguishing true signal from noise.

docker run -v /home/ubuntu/workspace/chipseq_data:/home/ubuntu/workspace/chipseq_data fooliu/macs2 callpeak -t /home/ubuntu/workspace/chipseq_data/alz_H3K4me3_rep1.bam /home/ubuntu/workspace/chipseq_data/alz_H3K4me3_rep2.bam -c /home/ubuntu/workspace/chipseq_data/alz_input_rep1.bam /home/ubuntu/workspace/chipseq_data/alz_input_rep2.bam -f BAM --call-summits -p 0.01 -n /home/ubuntu/workspace/chipseq_data/macs_callpeak

MACS outputs

MACS creates outputs in several different formats that can be useful in different contexts:

narrowPeak.png

(image from https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html CC-N)

Go ahead and generate that plot now:

Rscript macs_callpeak_model.r

In an ideal case, the pdf file that it creates will look something like this, representing the bimodal pattern of the shift size.

In this case, since we’re using a small subset of the genome, the data is sparse and a little more choppy, but you can see the same essential shape.

Manual review

ChIP-seq library preparation is notoriously finicky and it’s important to dig in to the raw data and really get a feel for how it looks before believing that your results are solid. IGV is great for this, but it’s rather hard to load up 4 bam files to review on a laptop screen (and it certainly doesn’t scale to experiments with a dozen or more samples!). Since the essential feature that we care about with ChIP-seq is the depth, we can use a format more suited to that: bigwig.

docker run -v /home/ubuntu/workspace:/docker_workspace quay.io/wtsicgp/cgpbigwig:1.6.0 bam2bw -a -r /docker_workspace/ensembl-vep/homo_sapiens/108_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -i /docker_workspace/chipseq_data/alz_H3K4me3_rep1.bam -o /docker_workspace/chipseq_data/alz_H3K4me3_rep1.bw

A bigwig file is a compressed format that contains genomic coordinates and values associated with each. In this case, it will be depth over the entire genome. It’s important to remember that when making peak calls, MACS isn’t using the raw depth, but is applying normalization to the data. Nonetheless, this is often a reasonable way to examine the data, especially when the experiments are similar in terms of depth and quality.

Exercises:

awk '{if($1>5){print $0}}' infile >outfile'