Note
Make sure that you have successfully installed the blacktie module before trying the activities below.
To test whether your installation was successful, open a new terminal session and type the following command.
$ blacktie
You should see the help text for blacktie and it should look something like this:
$ blacktie
usage: blacktie [-h] [--version]
[--prog {tophat,cufflinks,cuffmerge,cuffdiff,all}]
[--hide-logs] [--no-email]
[--mode {analyze,dry_run,qsub_script}]
config_file
This script reads options from a yaml formatted file and organizes the
execution of tophat/cufflinks runs for multiple condition sets.
positional arguments:
config_file Path to a yaml formatted config file containing setup
options for the runs.
optional arguments:
-h, --help show this help message and exit
--version Print version number.
--prog {tophat,cufflinks,cuffmerge,cuffdiff,all}
Which program do you want to run? (default: tophat)
--hide-logs Make your log directories hidden to keep a tidy
'looking' base directory. (default: False)
--no-email Don't send email notifications. (default: False)
--mode {analyze,dry_run,qsub_script}
1) 'analyze': run the analysis pipeline. 2) 'dry_run':
walk through all steps that would be run and print out
the command lines; however, do not send the commands
to the system to be run. 3) 'qsub_script': generate
bash scripts suitable to be sent to a compute
cluster's SGE through the qsub command. (default:
analyze)
If this worked, great! Let’s move on to what all that means.
--prog tells blacktie which part of the pipeline you would like to run. Any part can be run individually as long as the correct files exist. You can also run the whole thing from tophat to cuffdiff in one fell swoop if you like!
--hide-logs names your log directories so that they are hidden on *nix systems.
--mode tells blacktie which of its three modes to use. The first, analyze, actually runs the pipeline and does the analyses. However, it can be useful to simply view what WOULD be done, to make sure that blacktie is producing command-line calls that match what you expected. For this, use the dry_run mode.
Further, if you are working on a compute cluster running something like a Sun Grid Engine (SGE) to which you must submit jobs using qsub, it may not be a good idea to submit all of blacktie as a single qsub job. For this, the qsub_script mode can have blacktie write all of your qsub scripts for you based on a template. Each resulting bash script represents a single program call to the tophat/cufflinks suite.
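For example, assuming a config file named my_config.yaml (a hypothetical name; the config format is described below), the three modes look like this on the command line:
$ blacktie my_config.yaml --prog all --mode analyze
$ blacktie my_config.yaml --prog all --mode dry_run
$ blacktie my_config.yaml --prog tophat --mode qsub_script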
Note
A starter template for SGE submission can be found here: blacktie/examples/qsub.template. You will want to become familiar with how Mako processes templates if you plan to customize it much.
Here is what the starter template looks like:
#!/bin/bash
#$ -S /bin/bash # Use a real BASH shell on the worker node
#$ -q ${queues} # What queues do you want to submit to
#$ -M ${email_addy} # Send email updates to this address
#$ -m beas # When to send an email update
#$ -e /data/users/dunnw/logs/${call_id}.e # Write standard error to this file
#$ -o /data/users/dunnw/logs/${call_id}.o # Write standard out to this file
#$ -N ${job_name} # Name my job this
#$ -R y # Reserve cores for me until there are the number I asked for
#$ -pe openmp ${core_range} # Use openmp for multiprocessor use and give me core_range cores
LD_LIBRARY_PATH="${ld_library_path}$${}{LD_LIBRARY_PATH}" # Make sure worker's LD_LIBRARY_PATH contains ld_library_path
# HPC clusters frequently use a module system to provide system wide access to
# certain programs. The following makes sure that the tools needed are loaded
# for **MY** cluster. You will need to alter this to match how your own
# cluster is set up.
module load bowtie2/2.0.2
module load tophat/2.0.6
module load cufflinks/2.0.2
module load samtools/0.1.18
# basic staging stuff
DATAHOME="${datahome}"
MYSCRATCH="/scratch/$${}{USER}"
mkdir -p $MYSCRATCH
cd $MYSCRATCH
# Remind me what will be done
echo ''
echo "${cmd_str}"
echo ''
# Run my job
${cmd_str}
# Pack up results and send it home to log-in node
tar -zcvf ${call_id}.tar.gz ${out_dir}
cp ${call_id}.tar.gz $${}{DATAHOME}/
# Back into the shadows
cd $HOME
rm -rf $MYSCRATCH
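To make the substitution concrete, here is a sketch of one line from the template before and after rendering, assuming the example queues value from the config file shown further below. Mako replaces each ${...} placeholder with the matching value from your config's qsub_options:
# the template line
#$ -q ${queues} # What queues do you want to submit to
# rendered with queues: 'queue1,queue3,queue5' becomes
#$ -q queue1,queue3,queue5 # What queues do you want to submit to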
The configuration file is a YAML-based document where we store all of the complexity of the options, input files, and output files of the typical tophat/cufflinks workflow. This way, we have thought about what we want to do with our RNA-seq data from start to finish before we actually start the analysis. The config file also acts as a check on our imperfect memory: if you get strange results, you don't have to wonder whether you entered the samples backwards, because you can go back to the config file and see exactly which files and settings were used.
Note
If you are running blacktie in analyze mode, many more files documenting every step of the process will be created in the directories where the output files are actually placed, as well as central log files.
Here is a dummy example of a config file:
Note
A copy of this file can be found here: blacktie/examples/blacktie_config_example.yaml
# The document starts after the '---'
# By the way: everything after a '#' on a line
# will be ignored by the program and acts as a
# comment or note to explain things.
---
# run_options is a dictionary that contains variables that will be needed for
# many or all stages of the run
run_options:
  base_dir: /path/to/project/base_dir
  run_id: False # name your run; if False, the current date/time is used for a unique run_id every time
  bowtie_indexes_dir: /path/to/bowtie2_indexes
  email_info:
    sender: from_me@gmail.com
    to: to_you@email.com
    li: /path/to/file/containing/base64_encoded/login_info # base64-encoded password for the sender address
    custom_smtp:
      host: smtp.gmail.com # or whatever your email smtp server is
      port: 587 # or whichever port your smtp server uses
# `tophat_options`:
# -----------------
# This is a dictionary that contains variables needed for all the tophat runs.
# The names of the key:value combinations are taken directly from the tophat
# option names but have the leading '-' removed.
# -o becomes o; --library-type becomes library-type
# **This is true for the cufflinks, cuffmerge, cuffdiff option dictionaries.**
# `from_conditions`:
# ------------------
# This is a special value that tells blacktie that you don't want to name a single
# value for this option but would rather set the value individually for each of
# your samples/conditions. If you set the `o` value here:
# **all of your different sample results would
# be written to the same output directory and
# each would overwrite the next!**
# Hence: from_conditions
# However if you made all of your libraries the same way, things like `r` and
# `mate-std-dev` can be set here to avoid writing the same values over and over
# and perhaps making a mistake or two.
# `positional_args`:
# ------------------
# This is a dictionary inside of the `tophat_options` dictionary.
# It is where you put the arguments to tophat that do not have 'flags' to make
# their identity explicit like `-o path/to/output_dir` or `--library-type fr-unstranded`
# For tophat, these values are
# [1] the bowtie index name
# [2] the fastq files containing the left_reads
# [3] the fastq files containing the right_reads
# They will be different for cufflinks, cuffmerge, cuffdiff so consult the
# respective help text or manuals, but you should be fine if you just use what
# I have set up in this file already.
tophat_options:
  o: from_conditions
  library-type: fr-unstranded
  p: 6
  r: 125
  mate-std-dev: 25
  G: from_conditions
  no-coverage-search: True
  positional_args:
    bowtie2_index: from_conditions
    left_reads: from_conditions
    right_reads: from_conditions
cufflinks_options:
  o: from_conditions
  p: 7
  GTF-guide: from_conditions # if you want to use the annotation as *TRUTH*, set this to False and set 'GTF' to 'from_conditions'
  GTF: False # if an option is set to False, it will be omitted from the command string
  3-overhang-tolerance: 5000
  frag-bias-correct: from_conditions
  multi-read-correct: True
  upper-quartile-norm: True
  positional_args:
    accepted_hits: from_conditions
cuffmerge_options:
  o: from_conditions # output directory
  ref-gtf: from_conditions
  p: 6
  ref-sequence: from_conditions
  positional_args:
    assembly_list: from_conditions # file with path to cufflinks gtf files to be merged
cuffdiff_options:
  o: from_conditions
  labels: from_conditions
  p: 6
  time-series: True
  upper-quartile-norm: True
  frag-bias-correct: from_conditions
  multi-read-correct: True
  positional_args:
    transcripts_gtf: from_conditions
    sample_bams: from_conditions
cummerbund_options:
  cuffdiff-dir: from_conditions
  gtf-path: from_conditions
  out: from_conditions
  file-type: pdf
# options for --mode qsub_script
# If you are not using --mode qsub_script, then set all to 'None'
qsub_options:
  queues: 'queue1,queue3,queue5'
  datahome: '/path/to/baseDirectory/on/cluster/'
  core_range: 40-64 # how many cpus do you want
  ld_library_path: '' # leave this blank unless you know what it is and need it
  template: /path/to/your/altered/version/of/qsub.template
# `condition_queue`:
# ------------------
# This is a list of info related to each sample/condition contained in your RNA-seq
# experiment(s).
# `name`: the name of this condition. Usually something like a time-point
# ID or treatment type. Should be as short as possible while still being a useful label.
# `experiment_id`: this is how you group different experiments to be included in a
# single cuffmerge/cuffdiff program call. All conditions in a time
# series should share the same `experiment_id` and be placed in
# `condition_queue` in the order that you want them to be sent to
# cuffdiff.
# `replicate_id`: this is how you group data from biological replicates of a single
# experimental condition to be included in a cuffdiff program
# call. Each replicate of a condition should have a unique `replicate_id`.
# `left_reads`: a list of the paths to fastq files containing left reads for
# each condition.
# `right_reads`: list of fastqs containing the right mates for the fastqs in
# `left_reads`.
# **NOTE:** right mate files must be listed in the same order as their mates in `left_reads`
condition_queue:
  -
    name: exp1_control
    experiment_id: 0
    replicate_id: 0
    left_reads:
      - /path/to/exp1_control/techRep1.left_reads.fastq
      - /path/to/exp1_control/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp1_control/techRep1.right_reads.fastq
      - /path/to/exp1_control/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp1_control
    experiment_id: 0
    replicate_id: 1
    left_reads:
      - /path/to/exp1_control/techRep1.left_reads.fastq
      - /path/to/exp1_control/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp1_control/techRep1.right_reads.fastq
      - /path/to/exp1_control/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp1_treatment
    experiment_id: 0
    replicate_id: 0
    left_reads:
      - /path/to/exp1_treatment/techRep1.left_reads.fastq
      - /path/to/exp1_treatment/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp1_treatment/techRep1.right_reads.fastq
      - /path/to/exp1_treatment/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp2_control
    experiment_id: 1
    replicate_id: 0
    left_reads:
      - /path/to/exp2_control/techRep1.left_reads.fastq
      - /path/to/exp2_control/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp2_control/techRep1.right_reads.fastq
      - /path/to/exp2_control/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
  -
    name: exp2_treatment
    experiment_id: 1
    replicate_id: 0
    left_reads:
      - /path/to/exp2_treatment/techRep1.left_reads.fastq
      - /path/to/exp2_treatment/techRep2.left_reads.fastq
    right_reads:
      - /path/to/exp2_treatment/techRep1.right_reads.fastq
      - /path/to/exp2_treatment/techRep2.right_reads.fastq
    genome_seq: /path/to/species/genome.fa
    gtf_annotation: /path/to/species/annotation.gtf
    bowtie2_index: species.bowtie2_index.basename
...
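Once you have adapted a config file like this to your own data, a full run of the pipeline is a single command (a sketch, assuming you saved the file under the example name above):
$ blacktie blacktie_config_example.yaml --prog all --mode analyze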
Todo
Add the slots for custom email server options.
Changed in version v0.2.0rc1: any smtp server should now be usable if you code the host and port into the yaml config file. Any email can be used as the recipient.
New in version v0.2.0rc1: added --no-email option.
Warning
Gmail's 2-step authentication will NOT work. Sorry; I will look into how to deal with that eventually.
You will need to provide your password in order to use the email notifications, but it is not a good idea to leave human-readable passwords lying around your system. So the file used to store your password must contain a version of your password that has been encoded in base64. This scrambles your password beyond most people's ability to recognize it as a password, as long as you don't name the file something silly like password_file.txt.
The help text for blacktie-encode is:
$ blacktie-encode -h
usage: blacktie-encode [-h] input_file
This script takes a path to a file where you have placed your password for the
email you want blacktie to use as the "sender" in its notification emails. It
will replace the file with one containing your password once it has encoded it
out of human readable plain-text into seemingly meaningless text. **THIS IS
NOT FOOLPROOF:** If someone knows exactly what to look for they might figure
it out. ALWAYS use good password practices and never use the same password for
multiple important accounts!
positional arguments:
input_file Path to a file where you have placed your password for the email
you want blacktie to use as the "sender" in its notification
emails.
optional arguments:
-h, --help show this help message and exit
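For example (a sketch; the file name here is made up, and you should pick something inconspicuous):
$ echo 'my_sender_password' > ~/.mail_info
$ blacktie-encode ~/.mail_info
Afterwards, set the li field under email_info in your config file to the path of this file.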
A more detailed tutorial is under development, so watch this space!