Intro to the Common Workflow Language
Key concepts
- …
Learning objectives
- …
Lecture
Background and Rationale for a Common Workflow Language
Data analysis often requires a collection of programmatic tools and resources to perform a specific task. These tools can be linked together with bash scripts; however, bash scripts offer no standardization and often need to be tweaked from one project to the next. The Common Workflow Language (CWL) is a specification for designing portable and scalable workflows. It is open source and available on GitHub under an Apache 2.0 license. Using CWL instead of bash offers a number of advantages:
- Automated parallelization of workflow subtasks
- Modularity: tools can easily be added to or removed from a workflow
- Portability: workflows will run in any environment with CWL installed
- No need to monitor workflows: one sub-task failure won’t cause the entire workflow to fail
- Standardized format makes reading workflows easier
In this module we will use CWL with Docker to build an analysis pipeline to perform a simple DNA alignment.
Installing CWL, Docker, and data prerequisites
To begin, we need both CWL and Docker installed. Instructions for installing both are available here:
We will also need reads to align and a reference file to align them to. For this tutorial we will use downsampled reads from the HCC1395 data set. For the reference we’ll align to chromosome 22 only so things go a bit faster; the reference can be downloaded from Ensembl.
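Before going further, it can help to confirm that both tools are reachable from your shell. The following is a minimal Python sketch, assuming the CWL runner is installed as `cwltool` (the command used at the end of this module) and Docker as `docker`; the helper name `missing_tools` is hypothetical and not part of either tool.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust if your installation differs.
    missing = missing_tools(["cwltool", "docker"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("cwltool and docker are both available")
```

This only checks that the executables exist; it does not verify that the Docker daemon is running.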
CWL Workflow Pieces
A typical CWL workflow consists of three main pieces. The first is a YAML file specifying the inputs to the workflow; in this tutorial the inputs are simply the BAM file containing the reads to align and a FASTA file to align the reads to. The second is a CWL file containing the workflow itself: which steps are run, how they connect, and what the outputs should be. The last is a set of CWL files specifying how the individual tools are run. Don’t worry if this doesn’t all make sense yet; things should clear up as we go along. As mentioned, our example workflow performs DNA alignment. Go ahead and download the yml and cwl files and put them all in the same directory. You can do so by clicking on the links below:
- data inputs
- yaml file specifying the inputs
- Workflow file
- Individual files to run tools
The inputs.yml file
Let’s start with the inputs.yml file. As the name suggests, it specifies the inputs given to the workflow. Our workflow has only two inputs: a BAM file and a reference file. For each input, inputs.yml specifies what the input is (here, a file), where it lives (the file path), and the identifier the CWL workflow will use to refer to it.
bam:
  class: File
  path: /Users/zskidmor/Desktop/lab_meeting/gerald_C1TD1ACXX_7_ATCACG.bam
reference:
  class: File
  path: /Users/zskidmor/Desktop/lab_meeting/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
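The paths above are specific to one machine, so you will need to point them at your own copies of the data. As a rough sketch, the same inputs object can be generated in Python and saved as JSON, which CWL runners accept for the inputs file as well as YAML. The function names `build_inputs` and `write_inputs` and the output name `inputs.json` are illustrative assumptions, not part of CWL.

```python
import json
import os


def build_inputs(bam_path, reference_path):
    """Build the inputs object for the alignment workflow.

    The keys "bam" and "reference" must match the input identifiers
    declared in the workflow file.
    """
    def as_file(path):
        # Expand "~" and make the path absolute so the inputs file
        # works regardless of the directory the runner is started from.
        return {"class": "File", "path": os.path.abspath(os.path.expanduser(path))}

    return {"bam": as_file(bam_path), "reference": as_file(reference_path)}


def write_inputs(inputs, filename="inputs.json"):
    """Write the inputs object to disk as JSON (hypothetical file name)."""
    with open(filename, "w") as handle:
        json.dump(inputs, handle, indent=2)
```

For example, `write_inputs(build_inputs("reads.bam", "chr22.fa.gz"))` would write an inputs file for two files in the current directory; both file names here are placeholders.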
The workflow.cwl file
The workflow file specifies how things should be run: the inputs, outputs, and steps corresponding to the specific workflow.
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow
label: "alignment workflow"
inputs:
  bam:
    type: File
    doc: bam file to align
  reference:
    type: File
    doc: gzipped reference fasta to align to
outputs:
  index_ref_out:
    type: File
    outputSource: index_ref/fasta_index
  bam_index_out:
    type: File
    outputSource: index_bam/bam_index
steps:
  gnu_unzip:
    run: gunzip.cwl
    in:
      reference_file: reference
    out: [ unzipped_fasta ]
  index_ref:
    run: index_fa.cwl
    in:
      reference_file: gnu_unzip/unzipped_fasta
    out: [ fasta_index ]
  sam2fastq:
    run: sam2fastq.cwl
    in:
      bam_file: bam
    out: [ fastq1, fastq2 ]
  bwa_index:
    run: bwa_index.cwl
    in:
      reference_file: gnu_unzip/unzipped_fasta
    out: [ bwa_ref_index ]
  align_fastq:
    run: bwa_mem.cwl
    in:
      reference_index: bwa_index/bwa_ref_index
      fastq1_file: sam2fastq/fastq1
      fastq2_file: sam2fastq/fastq2
    out: [ aligned_sam ]
  sam2bam:
    run: sam2bam.cwl
    in:
      sam_file: align_fastq/aligned_sam
    out: [ aligned_bam ]
  sort_bam:
    run: sort_bam.cwl
    in:
      bam_file: sam2bam/aligned_bam
    out: [ sorted_bam ]
  index_bam:
    run: bam_index.cwl
    in:
      bam_file: sort_bam/sorted_bam
    out: [ bam_index ]
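Note that the order in which steps are listed does not determine the order in which they run; that follows from how each step's `in` entries refer to other steps' outputs (`step/output`) or to workflow inputs (a bare name). The Python sketch below illustrates the idea by copying the sources from the workflow above into a dictionary and grouping steps into batches whose dependencies are already satisfied. It is only an illustration of the dependency logic, not how any particular CWL runner is implemented, and the names `STEP_SOURCES`, `step_dependencies`, and `parallel_batches` are made up for this example. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Sources of each step's inputs, transcribed from the workflow above.
# "name" refers to a workflow input; "step/output" to another step's output.
STEP_SOURCES = {
    "gnu_unzip": ["reference"],
    "index_ref": ["gnu_unzip/unzipped_fasta"],
    "sam2fastq": ["bam"],
    "bwa_index": ["gnu_unzip/unzipped_fasta"],
    "align_fastq": ["bwa_index/bwa_ref_index", "sam2fastq/fastq1", "sam2fastq/fastq2"],
    "sam2bam": ["align_fastq/aligned_sam"],
    "sort_bam": ["sam2bam/aligned_bam"],
    "index_bam": ["sort_bam/sorted_bam"],
}


def step_dependencies(step_sources):
    """Map each step to the set of steps whose outputs it consumes."""
    return {
        step: {source.split("/")[0] for source in sources if "/" in source}
        for step, sources in step_sources.items()
    }


def parallel_batches(step_sources):
    """Group steps into batches; steps within a batch do not depend on each other."""
    sorter = TopologicalSorter(step_dependencies(step_sources))
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorter.get_ready()      # steps whose dependencies are all done
        batches.append(sorted(ready))   # sorted only to make the output stable
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(STEP_SOURCES), start=1):
        print(number, batch)
```

In this workflow `gnu_unzip` and `sam2fastq` have no dependencies on other steps, so they form the first batch, followed by `bwa_index` and `index_ref`. Whether steps in the same batch actually run concurrently depends on the runner and its settings.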
Command.cwl files
The command.cwl files specify how to run a given command for a step in the workflow. The example below shows how the file is structured for the gnu_unzip step of workflow.cwl, which runs gunzip.cwl.
#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.0
baseCommand: [ "gunzip" ]
arguments: [ "-c" ]
requirements:
  - class: DockerRequirement
    dockerPull: ubuntu:xenial
inputs:
  reference_file:
    type: File
    inputBinding:
      position: 1
outputs:
  unzipped_fasta:
    type: stdout
stdout: reference.fa
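To see how these fields turn into an actual command, the sketch below assembles a command line the way the file above describes: the `baseCommand` comes first, then `arguments` and bound inputs ordered by position, with standard output captured to the file named by `stdout`. This is a deliberately simplified Python illustration; the real binding rules in the CWL specification also cover prefixes, separators, and tie-breaking, and a runner captures stdout itself rather than building a shell redirect. The function `build_command` is hypothetical.

```python
import shlex


def build_command(base_command, arguments, bound_inputs, stdout=None):
    """Assemble a shell-style command line from CommandLineTool fields.

    `bound_inputs` is a list of (position, value) pairs taken from each
    input's inputBinding. Plain string `arguments` are given position 0
    here, so they sort ahead of an input bound at position 1.
    """
    parts = [(0, argument) for argument in arguments] + list(bound_inputs)
    parts.sort(key=lambda part: part[0])  # stable sort keeps ties in listed order
    argv = list(base_command) + [value for _, value in parts]
    line = " ".join(shlex.quote(token) for token in argv)
    if stdout is not None:
        # Shown as a redirect for readability only.
        line += " > " + shlex.quote(stdout)
    return line


if __name__ == "__main__":
    print(build_command(
        ["gunzip"],
        ["-c"],
        [(1, "Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz")],
        stdout="reference.fa",
    ))
```

For the gunzip tool this produces `gunzip -c Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz > reference.fa`, which matches what you would type by hand to decompress the reference.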
Putting it all together
Now that we’ve gone over the basics, let’s go ahead and run this workflow. On a typical computer the workflow should run in approximately 7-10 minutes, depending on whether Docker images need to be pulled down from the web.
cwltool --outdir ~/Desktop/cwl_test workflow.cwl inputs.yml
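If you would rather launch the run from a script, the following Python sketch wraps the same command. It assumes `cwltool` is on your PATH and uses its `--validate` option to check the workflow document before running it; the function names `cwltool_command` and `run_workflow` are invented for this example.

```python
import os
import subprocess


def cwltool_command(outdir, workflow, inputs, validate_only=False):
    """Return the argument list for validating or running the workflow."""
    if validate_only:
        return ["cwltool", "--validate", workflow]
    # subprocess does not go through a shell, so "~" must be expanded here.
    return ["cwltool", "--outdir", os.path.expanduser(outdir), workflow, inputs]


def run_workflow(outdir, workflow="workflow.cwl", inputs="inputs.yml"):
    """Validate the workflow, then run it; raises CalledProcessError on failure."""
    subprocess.run(cwltool_command(outdir, workflow, inputs, validate_only=True), check=True)
    subprocess.run(cwltool_command(outdir, workflow, inputs), check=True)
```

Calling `run_workflow("~/Desktop/cwl_test")` from the directory containing the cwl and yml files would be equivalent to the command above, preceded by a validation pass.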