Note
Make sure that you have successfully installed the blacktie module before trying the activities below.
To test whether your installation was successful, open a new terminal session and type the following command.
$ blacktie
You should see the help text for blacktie and it should look something like this:
$ blacktie
usage: blacktie [-h] [--version]
[--prog {tophat,cufflinks,cuffmerge,cuffdiff,all}]
[--hide-logs] [--no-email]
[--mode {analyze,dry_run,qsub_script}]
config_file
This script reads options from a yaml formatted file and organizes the
execution of tophat/cufflinks runs for multiple condition sets.
positional arguments:
config_file Path to a yaml formatted config file containing setup
options for the runs.
optional arguments:
-h, --help show this help message and exit
--version Print version number.
--prog {tophat,cufflinks,cuffmerge,cuffdiff,all}
Which program do you want to run? (default: tophat)
--hide-logs Make your log directories hidden to keep a tidy
'looking' base directory. (default: False)
--no-email Don't send email notifications. (default: False)
--mode {analyze,dry_run,qsub_script}
1) 'analyze': run the analysis pipeline. 2) 'dry_run':
walk through all steps that would be run and print out
the command lines; however, do not send the commands
to the system to be run. 3) 'qsub_script': generate
bash scripts suitable to be sent to a compute
cluster's SGE through the qsub command. (default:
analyze)
If this worked, great! Let’s move on to what all that means.
--prog tells blacktie which part of the pipeline you would like to run. Any part can be run individually as long as the correct files exist. You can also run the whole thing from tophat to cuffdiff in one fell swoop if you like!
--hide-logs names your log directories so that they are hidden on *nix systems.
--mode tells blacktie which of its three modes to use. The first, analyze, actually runs the pipeline and does the analyses. However, it can be useful to simply view what WOULD be done, to make sure that blacktie is producing command-line calls that match what you expected. For this, use the dry_run mode.
Further, if you are working on a compute cluster running something like a Sun Grid Engine (SGE) to which you must submit jobs using qsub, it may not be a good idea to submit all of blacktie as a single qsub job. For this, the qsub_script mode can have blacktie write all of your qsub scripts for you based on a template. Each resulting bash script represents a single program call to the tophat/cufflinks suite.
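For example, assuming a config file named my_config.yaml (a hypothetical name; the config format is described below), the three modes look like this on the command line:
$ blacktie my_config.yaml --prog all --mode analyze
$ blacktie my_config.yaml --prog all --mode dry_run
$ blacktie my_config.yaml --prog tophat --mode qsub_script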
Note
A starter template for SGE submission can be found here: blacktie/examples/qsub.template. You will want to become familiar with how Mako processes templates if you plan to customize it much.
Here is what the starter template looks like:
#!/bin/bash
#$ -S /bin/bash # Use a real BASH shell on the worker node
#$ -q ${queues} # What queues do you want to submit to
#$ -M ${email_addy} # Send email updates to this address
#$ -m beas # When to send an email update
#$ -e /data/users/dunnw/logs/${call_id}.e # Write standard error to this file
#$ -o /data/users/dunnw/logs/${call_id}.o # Write standard out to this file
#$ -N ${job_name} # Name my job this
#$ -R y # Reserve cores for me until there are the number I asked for
#$ -pe openmp ${core_range} # Use openmp for multiprocessor use and give me core_range cores
LD_LIBRARY_PATH="${ld_library_path}$${}{LD_LIBRARY_PATH}" # Make sure worker's LD_LIBRARY_PATH contains ld_library_path
# HPC clusters frequently use a module system to provide system wide access to
# certain programs. The following makes sure that the tools needed are loaded
# for **MY** cluster. You will need to alter this to match how your own
# cluster is set up.
module load bowtie2/2.0.2
module load tophat/2.0.6
module load cufflinks/2.0.2
module load samtools/0.1.18
# basic staging stuff
DATAHOME="${datahome}"
MYSCRATCH="/scratch/$${}{USER}"
mkdir -p $MYSCRATCH
cd $MYSCRATCH
# Remind me what will be done
echo ''
echo "${cmd_str}"
echo ''
# Run my job
${cmd_str}
# Pack up results and send it home to log-in node
tar -zcvf ${call_id}.tar.gz ${out_dir}
cp ${call_id}.tar.gz $${}{DATAHOME}/
# Back into the shadows
cd $HOME
rm -rf $MYSCRATCH
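To make the substitution concrete, here is a sketch of one line from the template before and after rendering, assuming the example queues value from the config file shown further below. Mako replaces each ${...} placeholder with the matching value from your config's qsub_options:
# the template line
#$ -q ${queues} # What queues do you want to submit to
# rendered with queues: 'queue1,queue3,queue5' becomes
#$ -q queue1,queue3,queue5 # What queues do you want to submit to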
The configuration file is a YAML-based document where we store all of the complexity of the options, input files, and output files of the typical tophat/cufflinks workflow. This way, we have thought about what we want to do with our RNA-seq data from start to finish before we actually start the analysis. The config file also acts as a check on our imperfect memory: if you get strange results, you don't have to wonder whether you entered the samples backwards, because you can go back to the config file and see exactly which files and settings were used.
Note
If you are running blacktie in analyze mode, many more files documenting every step of the process will be created in the directories where the output files are actually placed, as well as central log files.
Here is a dummy example of a config file:
Note
A copy of this file can be found here: blacktie/examples/blacktie_config_example.yaml
# The document starts after the '---'
# By the way: everything after a '#' on a line
# will be ignored by the program and acts as a
# comment or note to explain things.
---
# run_options is a dictionary that contains variables that will be needed for
# many or all stages of the run
run_options:
  base_dir: /path/to/project/base_dir
  run_id: False # name your run; if False, the current date/time is used for a unique run_id every time
  bowtie_indexes_dir: /path/to/bowtie2_indexes
  email_info:
    sender: from_me@gmail.com
    to: to_you@email.com
    li: /path/to/file/containing/base64_encoded/login_info # base64-encoded password for the sender address
    custom_smtp:
      host: smtp.gmail.com # or whatever your email smtp server is
      port: 587 # or whichever port your smtp server uses
# `tophat_options`:
# -----------------
# This is a dictionary that contains variables needed for all the tophat runs.
# The names of the key:value combinations are taken directly from the tophat
# option names but have the leading '-' removed.
# -o becomes o; --library-type becomes library-type
# **This is true for the cufflinks, cuffmerge, cuffdiff option dictionaries.**
# `from_conditions`:
# ------------------
# This is a special value that tells blacktie that you don't want to name a single
# value for this option but would rather set the value individually for each of
# your samples/conditions. If you set the `o` value here:
# **all of your different sample results would
# be written to the same output directory and
# each would overwrite the next!**
# Hence: from_conditions
# However if you made all of your libraries the same way, things like `r` and
# `mate-std-dev` can be set here to avoid writing the same values over and over
# and perhaps making a mistake or two.
# `positional_args`:
# ------------------
# This is a dictionary inside of the `tophat_options` dictionary.
# It is where you put the arguments to tophat that do not have 'flags' to make
# their identity explicit like `-o path/to/output_dir` or `--library-type fr-unstranded`
# For tophat, these values are
# [1] the bowtie index name
# [2] the fastq files containing the left_reads
# [3] the fastq files containing the right_reads
# They will be different for cufflinks, cuffmerge, cuffdiff so consult the
# respective help text or manuals, but you should be fine if you just use what
# I have set up in this file already.
tophat_options:
  o: from_conditions
  library-type: fr-unstranded
  p: 6
  r: 125
  mate-std-dev: 25
  G: from_conditions
  no-coverage-search: True
  positional_args:
    bowtie2_index: from_conditions
    left_reads: from_conditions
    right_reads: from_conditions
cufflinks_options:
  o: from_conditions
  p: 7
  GTF-guide: from_conditions # if you want to use the annotation as *TRUTH*, set this to False and set 'GTF' to 'from_conditions'
  GTF: False # if an option is set to False, it will be omitted from the command string
  3-overhang-tolerance: 5000
  frag-bias-correct: from_conditions
  multi-read-correct: True
  upper-quartile-norm: True
  positional_args:
    accepted_hits: from_conditions
cuffmerge_options:
  o: from_conditions # output directory
  ref-gtf: from_conditions
  p: 6
  ref-sequence: from_conditions
  positional_args:
    assembly_list: from_conditions # file with path to cufflinks gtf files to be merged
cuffdiff_options:
  o: from_conditions
  labels: from_conditions
  p: 6
  time-series: True
  upper-quartile-norm: True
  frag-bias-correct: from_conditions
  multi-read-correct: True
  positional_args:
    transcripts_gtf: from_conditions
    sample_bams: from_conditions
cummerbund_options:
  cuffdiff-dir: from_conditions
  gtf-path: from_conditions
  out: from_conditions
  file-type: pdf
# options for --mode qsub_script
# If you are not using --mode qsub_script, then set all to 'None'
qsub_options:
  queues: 'queue1,queue3,queue5'
  datahome: '/path/to/baseDirectory/on/cluster/'
  core_range: 40-64 # how many cpus do you want
  ld_library_path: '' # leave this blank unless you know what it is and need it
  template: /path/to/your/altered/version/of/qsub.template
# `condition_queue`:
# ------------------
# This is a list of info related to each sample/condition contained in your RNA-seq
# experiment(s).
# `name`: the name of this condition. Usually something like a time-point
# ID or treatment type. Should be as short as possible while still being a useful label.
# `experiment_id`: this is how you group different experiments to be included in a
# single cuffmerge/cuffdiff program call. All conditions in a time
# series should share the same `experiment_id` and be placed in
# `condition_queue` in the order that you want them to be sent to
# cuffdiff.
# `replicate_id`: this is how you group data from biological replicates of a single
# experimental condition to be included in a cuffdiff program
# call. Each replicate of a condition should have a unique `replicate_id`.
# `left_reads`: a list of the paths to fastq files containing left reads for
# each condition.
# `right_reads`: list of fastqs containing the right mates for the fastqs in
# `left_reads`.
# **NOTE:** right mate files must be listed in the same order as their mates in `left_reads`
condition_queue:
  -
    name: exp1_control
    experiment_id: 0
    replicate_id: 0
    left_reads:
      - /path/to/exp1_control/techRep1.left_reads.fastq
      - /path/to/exp1_control/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp1_control/techRep1.right_reads.fastq
      - /path/to/exp1_control/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp1_control
    experiment_id: 0
    replicate_id: 1
    left_reads:
      - /path/to/exp1_control/techRep1.left_reads.fastq
      - /path/to/exp1_control/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp1_control/techRep1.right_reads.fastq
      - /path/to/exp1_control/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp1_treatment
    experiment_id: 0
    replicate_id: 0
    left_reads:
      - /path/to/exp1_treatment/techRep1.left_reads.fastq
      - /path/to/exp1_treatment/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp1_treatment/techRep1.right_reads.fastq
      - /path/to/exp1_treatment/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp2_control
    experiment_id: 1
    replicate_id: 0
    left_reads:
      - /path/to/exp2_control/techRep1.left_reads.fastq
      - /path/to/exp2_control/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp2_control/techRep1.right_reads.fastq
      - /path/to/exp2_control/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp2_treatment
    experiment_id: 1
    replicate_id: 0
    left_reads:
      - /path/to/exp2_treatment/techRep1.left_reads.fastq
      - /path/to/exp2_treatment/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp2_treatment/techRep1.right_reads.fastq
      - /path/to/exp2_treatment/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
...
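Once you have adapted a config file like this to your own data, a full run of the pipeline is a single command (a sketch, assuming you saved the file under the example name above):
$ blacktie blacktie_config_example.yaml --prog all --mode analyze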
Todo
Add the slots for custom email server options.
Changed in version v0.2.0rc1: any smtp server should now be usable if you code the host and port into the yaml config file. Any email can be used as the recipient.
New in version v0.2.0rc1: added --no-email option.
Warning
Gmail's 2-step authentication will NOT work. Sorry; I will look into how to deal with that eventually.
You will need to provide your password in order to use the email notifications, but it is not a good idea to leave human-readable passwords lying around your system. So the file used to store your password must contain a version of your password that has been encoded in base64. This scrambles your password beyond most people's ability to recognize it as a password, as long as you don't name the file something silly like password_file.txt.
The help text for blacktie-encode is:
$ blacktie-encode -h
usage: blacktie-encode [-h] input_file
This script takes a path to a file where you have placed your password for the
email you want blacktie to use as the "sender" in its notification emails. It
will replace the file with one containing your password once it has encoded it
out of human readable plain-text into seemingly meaningless text. **THIS IS
NOT FOOLPROOF:** If someone knows exactly what to look for they might figure
it out. ALWAYS use good password practices and never use the same password for
multiple important accounts!
positional arguments:
input_file Path to a file where you have placed your password for the email
you want blacktie to use as the "sender" in its notification
emails.
optional arguments:
-h, --help show this help message and exit
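For example (a sketch; the file name here is made up, and you should pick something inconspicuous):
$ echo 'my_sender_password' > ~/.mail_info
$ blacktie-encode ~/.mail_info
Afterwards, set the li field under email_info in your config file to the path of this file.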
A more detailed tutorial is under development, so watch this space!