Precision Medicine Bioinformatics

Introduction to bioinformatics for DNA and RNA sequence analysis

Intro to the Common Workflow Language

Background and Rationale for a Common Workflow Language

Data analysis often requires a collection of programmatic tools and resources to perform a specific task. These tools can be linked together with bash scripts; however, bash scripts offer no standardization and often need to be tweaked from one project to the next. The Common Workflow Language (CWL) is a specification for designing portable and scalable workflows. It is open source and available on GitHub under an Apache 2.0 license. Using CWL instead of bash offers a number of advantages, including:

  1. Automated parallelization of workflow subtasks
  2. Modularity: tools can easily be added to or removed from a workflow
  3. Portability: workflows will run in any environment with a CWL runner installed
  4. No need to monitor workflows: one subtask failure won’t cause the entire workflow to fail
  5. A standardized format that makes workflows easier to read

In this module we will use CWL with Docker to build an analysis pipeline to perform a simple DNA alignment.

Installing CWL, Docker, and data prerequisites

To begin, we need both a CWL runner (this tutorial uses cwltool) and Docker installed. Instructions for installing both are available here:

  1. How to install Docker
  2. How to install CWL

We will also need reads to align and a reference file to align them to. For this tutorial we will use downsampled reads from the HCC1395 data set. For the reference we will align only to chromosome 22 so that things go a bit faster; the reference can be downloaded from Ensembl.

  1. HCC1395 data from a single lane
  2. Chromosome 22 Fasta

CWL Workflow Pieces

A typical CWL workflow consists of three main pieces. The first piece is a YAML file specifying the inputs to the workflow. In this tutorial the inputs are simply the BAM file containing the reads to align and a FASTA file to align the reads to. The second piece is a CWL file containing the workflow, in other words how things are run, what the outputs should be, etc. The last piece is a set of CWL files specifying how the individual tools will be run. Don’t worry if this doesn’t all make sense yet; things should clear up as we go along. As mentioned, our example constructs a workflow to perform DNA alignment. Go ahead and download the data, yml, and cwl files and put them all in the same directory. You can do so by clicking on the links below:

  1. data inputs
    1. gerald_C1TD1ACXX_7_ATCACG.bam
    2. Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
  2. yaml file specifying the inputs
    1. inputs.yml
  3. Workflow file
    1. workflow.cwl
  4. Individual files to run tools
    1. gunzip.cwl
    2. index_fa.cwl
    3. sam2fastq.cwl
    4. bwa_index.cwl
    5. bwa_mem.cwl
    6. sam2bam.cwl
    7. sort_bam.cwl
    8. bam_index.cwl

The inputs.yml file

Let’s start by going over the inputs.yml file. As the name suggests, it specifies the inputs given to the workflow. Our workflow has only two inputs, a BAM file and a reference file. For each input, inputs.yml specifies what the input is (here, a File), where it exists (the file path), and the identifier the CWL workflow will use to refer to it (here, bam and reference). The paths shown below come from the author’s machine; edit them to point to wherever you saved the files.

bam:
  class: File
  path: /Users/zskidmor/Desktop/lab_meeting/gerald_C1TD1ACXX_7_ATCACG.bam
reference:
  class: File
  path: /Users/zskidmor/Desktop/lab_meeting/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
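
As a minimal sketch, assuming the data files sit in the same directory as inputs.yml (as suggested earlier), the same inputs could be written with relative paths. CWL runners resolve a relative path in an inputs file against the location of that file, so this variant should keep working if the whole directory is moved. This is an illustration only, not the tutorial's original file:

```yaml
# inputs.yml (sketch): relative paths, assuming the data files are in
# the same directory as this file
bam:
  class: File
  path: gerald_C1TD1ACXX_7_ATCACG.bam
reference:
  class: File
  path: Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
```

The identifiers bam and reference are unchanged, so the workflow would refer to the inputs exactly as before.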

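The workflow.cwl file

The workflow file specifies how things should be run: the inputs, outputs, and steps corresponding to the specific workflow. Each step names the tool file to run, maps that tool's inputs (under in) to either a workflow input or the output of another step, written as step_name/output_name, and lists the outputs it makes available (under out).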

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow
label: "alignment workflow"

inputs:
  bam:
    type: File
    doc: bam file to align
  reference:
    type: File
    doc: gzipped reference fasta to align to

outputs:
  index_ref_out:
    type: File
    outputSource: index_ref/fasta_index
  bam_index_out:
    type: File
    outputSource: index_bam/bam_index

steps:
  gnu_unzip:
    run: gunzip.cwl
    in:
      reference_file: reference
    out: [ unzipped_fasta ]
  index_ref:
    run: index_fa.cwl
    in:
      reference_file: gnu_unzip/unzipped_fasta
    out: [ fasta_index ]
  sam2fastq:
    run: sam2fastq.cwl
    in:
      bam_file: bam
    out: [ fastq1, fastq2 ]
  bwa_index:
    run: bwa_index.cwl
    in:
      reference_file: gnu_unzip/unzipped_fasta
    out: [ bwa_ref_index ]
  align_fastq:
    run: bwa_mem.cwl
    in:
      reference_index: bwa_index/bwa_ref_index
      fastq1_file: sam2fastq/fastq1
      fastq2_file: sam2fastq/fastq2
    out: [ aligned_sam ]
  sam2bam:
    run: sam2bam.cwl
    in:
      sam_file: align_fastq/aligned_sam
    out: [ aligned_bam ]
  sort_bam:
    run: sort_bam.cwl
    in:
      bam_file: sam2bam/aligned_bam
    out: [ sorted_bam ]
  index_bam:
    run: bam_index.cwl
    in:
      bam_file: sort_bam/sorted_bam
    out: [ bam_index ]
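
Note that the workflow above returns only the two index files; the sorted BAM produced by the sort_bam step remains an intermediate file. As a hedged sketch, one extra entry under outputs would expose it too. The name sorted_bam_out is hypothetical, and the sketch assumes sort_bam.cwl declares sorted_bam as a File; the source sort_bam/sorted_bam is taken from the steps above:

```yaml
# Fragment of workflow.cwl (sketch): the outputs block with one extra entry
outputs:
  index_ref_out:
    type: File
    outputSource: index_ref/fasta_index
  bam_index_out:
    type: File
    outputSource: index_bam/bam_index
  # Hypothetical addition: also collect the sorted BAM from the sort_bam step
  sorted_bam_out:
    type: File
    outputSource: sort_bam/sorted_bam
```

With this entry, the sorted BAM would be copied to the output directory alongside the index files when the workflow finishes.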

Command.cwl files

The command.cwl files specify how to run a given command for a step in the workflow. In the example below we go over how the file is structured for the gnu_unzip step specified in the workflow.cwl.

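#!/usr/bin/env cwl-runner

# Describes a single command-line tool rather than a workflow
class: CommandLineTool

cwlVersion: v1.0

# Program to execute
baseCommand: [ "gunzip" ]

# Fixed arguments: -c writes the decompressed data to standard output
arguments: [ "-c" ]

# Run the command inside the ubuntu:xenial Docker image
requirements:
  - class: DockerRequirement
    dockerPull: ubuntu:xenial

# reference_file is placed on the command line at position 1
inputs:
    reference_file:
        type: File
        inputBinding:
            position: 1

# The captured standard output becomes the unzipped_fasta output
outputs:
    unzipped_fasta:
        type: stdout

# File name used for the captured standard output
stdout: reference.fa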
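
The other tool files follow the same pattern. The sketch below is a guess at what a tool description for the index_ref step could look like; it is not the tutorial's actual index_fa.cwl. It assumes the step runs samtools faidx, and the Docker image name is an assumption that can be swapped for any image providing samtools. Only the names reference_file and fasta_index come from workflow.cwl. Because samtools faidx writes its .fai index next to the input file, the sketch stages the FASTA into the working directory first:

```yaml
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

baseCommand: [ "samtools", "faidx" ]

requirements:
  - class: DockerRequirement
    # Assumed image; any image that provides samtools should do
    dockerPull: biocontainers/samtools:v1.9-4-deb_cv1
  - class: InitialWorkDirRequirement
    # Stage the FASTA in the working directory so the .fai can be written beside it
    listing:
      - $(inputs.reference_file)

inputs:
  reference_file:
    type: File
    inputBinding:
      position: 1
      # Refer to the staged copy by its file name
      valueFrom: $(self.basename)

outputs:
  fasta_index:
    type: File
    outputBinding:
      # samtools faidx names the index <input file name>.fai
      glob: $(inputs.reference_file.basename).fai
```

Unlike the gunzip example, the output here is collected with a glob pattern rather than from standard output, since the tool writes its result directly to a file.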

Putting it all together

Now that we’ve gone over the basics, let’s go ahead and run this workflow. On a typical computer the workflow should run in approximately 7-10 minutes, depending on whether Docker images need to be pulled down from the web.

cwltool --outdir ~/Desktop/cwl_test workflow.cwl inputs.yml