Intro to the Common Workflow Language
Background and Rationale for a Common Workflow Language
Data analysis often requires a collection of programmatic tools and resources to perform a specific task. These tools can be linked together with bash scripts; however, bash scripts offer no standardization and often need to be tweaked from one project to the next. The Common Workflow Language (CWL) is a specification for designing portable and scalable workflows. It is open source and available on GitHub under an Apache 2.0 license. Using CWL instead of bash offers a number of advantages, including:
- Automated parallelization of workflow subtasks
- Modularity: tools can easily be added to or removed from a workflow
- Portability: workflows will run in any environment with a CWL runner installed
- Less need to monitor workflows: the runner tracks each sub-task and, depending on the runner and its settings, a single sub-task failure need not stop steps that do not depend on it
- Standardized format makes reading workflows easier
In this module we will use CWL with Docker to build an analysis pipeline to perform a simple DNA alignment.
Installing CWL, Docker, and data prerequisites
To begin, we need both CWL (this tutorial uses the cwltool runner) and Docker installed. Instructions for installing both are available here:
- How to install Docker
- How to install CWL
We will also need reads to align and a reference file to align them to. For this tutorial we will use downsampled reads from the HCC1395 data set. For the reference we will align only to chromosome 22 so that things go a bit faster; the reference can be downloaded from Ensembl.
- HCC1395 data from a single lane
- Chromosome 22 FASTA
CWL Workflow Pieces
A typical CWL workflow consists of three main pieces. The first is a YAML file specifying the inputs to the workflow. In this tutorial the inputs are simply the BAM file containing the reads to align and a FASTA file to align the reads to. The second is a CWL file containing the workflow itself: how things are run, what the outputs should be, and so on. The last is a set of CWL files specifying how each tool will be run. Don’t worry if this doesn’t all make sense yet; things should clear up as we go along. As mentioned, our example constructs a workflow to perform DNA alignment. Go ahead and download the yml and cwl files and put them all in the same directory. You can do so by clicking on the links below:
- data inputs
- YAML file specifying the inputs
- Workflow file
- Individual files to run tools
The inputs.yml file
Let’s start by going over the inputs.yml file. As the name suggests, it specifies the inputs given to the workflow. Our workflow has only two inputs: a BAM file and a reference file. For each input, inputs.yml specifies what it is (class: File), where it lives (path), and the identifier the CWL workflow uses to refer to it (bam and reference).
bam:
  class: File
  path: /Users/zskidmor/Desktop/lab_meeting/gerald_C1TD1ACXX_7_ATCACG.bam
reference:
  class: File
  path: /Users/zskidmor/Desktop/lab_meeting/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
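The paths above are specific to the author's machine, so it can help to confirm that your own paths resolve before launching anything. The following is a minimal Python sketch and not part of the original tutorial: the dictionary mirrors the structure of inputs.yml by hand (no YAML parser is used), and the helper name `missing_inputs` and the `/path/to/...` placeholders are hypothetical.

```python
import os

# Hand-written mirror of inputs.yml; replace the placeholder paths with your own.
inputs = {
    "bam": {
        "class": "File",
        "path": "/path/to/gerald_C1TD1ACXX_7_ATCACG.bam",
    },
    "reference": {
        "class": "File",
        "path": "/path/to/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz",
    },
}


def missing_inputs(input_object):
    """Return the identifiers of File inputs whose path does not exist on disk."""
    return [
        name
        for name, value in input_object.items()
        if value.get("class") == "File" and not os.path.isfile(value["path"])
    ]


missing = missing_inputs(inputs)
if missing:
    print("Fix these entries in inputs.yml:", ", ".join(missing))
else:
    print("All input files found.")
```

With the placeholder paths left unchanged, both identifiers are reported as missing; once the paths point at the downloaded files, the list should be empty.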
The workflow.cwl files
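The workflow file specifies how things should be run: the inputs, outputs, and steps that make up the workflow. Each step names the tool file to run (run), maps each of its inputs either to a workflow input or to an output of an earlier step written as step/output (in), and lists the outputs it makes available to other steps (out).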
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow
label: "alignment workflow"
inputs:
  bam:
    type: File
    doc: bam file to align
  reference:
    type: File
    doc: gzipped reference fasta to align to
outputs:
  index_ref_out:
    type: File
    outputSource: index_ref/fasta_index
  bam_index_out:
    type: File
    outputSource: index_bam/bam_index
steps:
  gnu_unzip:
    run: gunzip.cwl
    in:
      reference_file: reference
    out: [ unzipped_fasta ]
  index_ref:
    run: index_fa.cwl
    in:
      reference_file: gnu_unzip/unzipped_fasta
    out: [ fasta_index ]
  sam2fastq:
    run: sam2fastq.cwl
    in:
      bam_file: bam
    out: [ fastq1, fastq2 ]
  bwa_index:
    run: bwa_index.cwl
    in:
      reference_file: gnu_unzip/unzipped_fasta
    out: [ bwa_ref_index ]
  align_fastq:
    run: bwa_mem.cwl
    in:
      reference_index: bwa_index/bwa_ref_index
      fastq1_file: sam2fastq/fastq1
      fastq2_file: sam2fastq/fastq2
    out: [ aligned_sam ]
  sam2bam:
    run: sam2bam.cwl
    in:
      sam_file: align_fastq/aligned_sam
    out: [ aligned_bam ]
  sort_bam:
    run: sort_bam.cwl
    in:
      bam_file: sam2bam/aligned_bam
    out: [ sorted_bam ]
  index_bam:
    run: bam_index.cwl
    in:
      bam_file: sort_bam/sorted_bam
    out: [ bam_index ]
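The order in which steps are written in the workflow does not determine the order in which they run; that follows from which step outputs feed which step inputs. The Python sketch below illustrates the idea. It is not how any particular CWL runner is implemented: the `depends_on` table is transcribed by hand from the `in:` entries above, the function name `execution_waves` is hypothetical, and grouping steps into "waves" is just one valid schedule.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step mapped to the steps whose outputs it consumes, transcribed by hand
# from workflow.cwl. Steps that read only workflow-level inputs (bam, reference)
# have no upstream steps.
depends_on = {
    "gnu_unzip": set(),
    "index_ref": {"gnu_unzip"},
    "sam2fastq": set(),
    "bwa_index": {"gnu_unzip"},
    "align_fastq": {"bwa_index", "sam2fastq"},
    "sam2bam": {"align_fastq"},
    "sort_bam": {"sam2bam"},
    "index_bam": {"sort_bam"},
}


def execution_waves(graph):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose inputs are available
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(depends_on), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this sketch gnu_unzip and sam2fastq land in the first wave and bwa_index and index_ref in the second, which is where a runner has the opportunity to run sub-tasks in parallel; the remaining steps form a strict chain.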
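The command .cwl files (gunzip.cwl, index_fa.cwl, and so on) each specify how to run the command for one step in the workflow. The example below is the file for the gnu_unzip step specified in workflow.cwl. Here baseCommand and arguments together form the command gunzip -c, the DockerRequirement names the image the command runs in, the reference_file input is placed on the command line at position 1, and whatever the command writes to standard output is captured in reference.fa and exposed as the unzipped_fasta output.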
#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.0
baseCommand: [ "gunzip" ]
arguments: [ "-c" ]
requirements:
  - class: DockerRequirement
    dockerPull: ubuntu:xenial
inputs:
  reference_file:
    type: File
    inputBinding:
      position: 1
outputs:
  unzipped_fasta:
    type: stdout
stdout: reference.fa
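To make the link between this file and an ordinary shell command concrete, here is a small Python sketch that assembles the command line from the same fields. It is a deliberate simplification, not the CWL binding algorithm: it assumes plain `arguments` come first and inputs follow in `position` order, which happens to match this tool. The `tool` dictionary is a hand-written mirror of gunzip.cwl, `build_command` is a hypothetical helper, and a real runner would pass the path of the file as staged inside the container rather than the bare file name used here.

```python
import shlex

# Hand-written mirror of the fields of gunzip.cwl that shape the command line.
tool = {
    "baseCommand": ["gunzip"],
    "arguments": ["-c"],
    "inputs": {"reference_file": {"position": 1}},
    "stdout": "reference.fa",
}


def build_command(tool_description, input_values):
    """Return (argv, stdout_name) for a simplified CommandLineTool description."""
    argv = list(tool_description["baseCommand"]) + list(tool_description["arguments"])
    # Append input values in order of their inputBinding position.
    bound = sorted(
        tool_description["inputs"].items(),
        key=lambda item: item[1]["position"],
    )
    argv += [input_values[name] for name, _ in bound]
    return argv, tool_description.get("stdout")


argv, stdout_name = build_command(
    tool,
    {"reference_file": "Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz"},
)
print(shlex.join(argv), ">", stdout_name)
```

The printed line is the shell equivalent of the step: gunzip -c on the gzipped reference, with standard output redirected to reference.fa.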
Putting it all together
Now that we’ve gone over the basics, let’s go ahead and run this workflow. On a typical computer the workflow should run in approximately 7-10 minutes, depending on whether Docker images need to be pulled down from the web. The command below writes the workflow outputs to ~/Desktop/cwl_test.
cwltool --outdir ~/Desktop/cwl_test workflow.cwl inputs.yml