CZ Biohub Configuration

All nf-core pipelines have been successfully configured for use on AWS Batch at the Chan Zuckerberg Biohub.

To use, run the pipeline with -profile czbiohub_aws. This will download and launch the czbiohub_aws.config, which has been pre-configured with a setup suitable for AWS Batch. Using this profile, a Docker image containing all of the required software will be downloaded before execution of the pipeline.

Ask Olga (olga.botvinnik@czbiohub.org) if you have any questions!

Run the pipeline from a small AWS EC2 Instance

The pipeline will submit jobs to AWS Batch and monitor them on your behalf. For the pipeline to succeed, it needs to be run from a computer that has a constant internet connection. Unfortunately, the Biohub has spotty WiFi, so even for short pipelines it is highly recommended to run them from an AWS EC2 instance.

1. Start tmux

tmux is a “Terminal Multiplexer” that allows commands to continue running even when you have closed your laptop. Start a new tmux session with tmux new; the -s flag names the session, and we’ll name this one nextflow.

tmux new -s nextflow

Now you can run pipelines with abandon!
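If you are not sure whether a session already exists, tmux can attach to it or create it in one step. The helper below is a minimal sketch, not part of the Biohub setup: the function name nf_session is hypothetical, and it assumes a tmux version that supports the -A flag of new-session.

```shell
#!/usr/bin/env bash
# Hypothetical helper: attach to the named tmux session, creating it if needed.
nf_session() {
    local name="${1:-nextflow}"   # session name; defaults to "nextflow"
    if ! command -v tmux >/dev/null 2>&1; then
        echo "tmux is not installed" >&2
        return 127
    fi
    # -A: attach if a session with this name exists, otherwise create it.
    # -s: the session name.
    tmux new-session -A -s "$name"
}
```

After sourcing the function, running nf_session (or nf_session my-other-run) from the EC2 instance drops you into the session either way.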

2. Make a GitHub repo for your workflows (optional :)

To make it easy to share pipelines and commands with your teammates, it’s best to keep the code in a GitHub repository. One way is to store the commands in a Makefile, which can contain multiple nextflow run commands so that you don’t need to remember the S3 bucket or output directory for every single one. Makefiles are broadly used in the software community for running many complex commands. They can have a lot of dependencies and be confusing, so we’re only going to write simple ones.

# Each rule name ends with a colon; the command under it must start with a tab.
rnaseq:
	nextflow run -profile czbiohub_aws nf-core/rnaseq \
	    --reads 's3://czb-maca/Plate_seq/24_month/180626_A00111_0166_BH5LNVDSXX/fastqs/*{R1,R2}*.fastq.gz' \
	    --genome GRCm38 \
	    --outdir s3://olgabot-maca/nextflow-test/

human_mouse_zebrafish:
	nextflow run czbiohub/nf-kmer-similarity -latest -profile aws \
	    --samples s3://kmer-hashing/hematopoeisis/smartseq2/human_mouse_zebrafish/samples.csv

Merkin2012_AWS:
	nextflow run czbiohub/nf-kmer-similarity -latest --sra "SRP016501" \
	    -r olgabot/support-csv-directory-or-sra -profile aws

In this example, one would run the rnaseq rule and the nextflow command beneath it with:

make rnaseq

If one wanted to run a different command, e.g. human_mouse_zebrafish, they would specify that command instead. For example:

make human_mouse_zebrafish

Makefiles are a very useful way of storing longer commands with short mnemonic words.

Once you create a new repository (best initialized with a .gitignore, an MIT license, and a README), clone that repository to your EC2 instance. For example, if the repository is called kh-workflows, this is what the command would look like:

git clone https://github.com/czbiohub/kh-workflows

Now move into the cloned repository and create and edit a Makefile:

cd kh-workflows
nano Makefile

Write the rule name followed by a colon; the command on the next line must start with a tab, not spaces. Once you’re done, write the file and exit nano (the ^ shown in nano’s shortcuts means “Control”), then add the file to git, commit it, and push it to GitHub.

git add Makefile
git commit -m "Added makefile"
git push origin master
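The tab requirement is easy to get wrong in an editor that inserts spaces, so it can help to check the Makefile before committing. The function below is a rough sketch under stated assumptions: check_makefile_tabs is a hypothetical name, and the heuristic only suits simple Makefiles like the one above, since it flags any space-indented line that is not a continuation of a previous line ending in a backslash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flag lines indented with spaces instead of a tab.
check_makefile_tabs() {
    local makefile="${1:-Makefile}"
    awk '
        # Lines that continue a previous "\" line may be indented with spaces.
        prev_continues { prev_continues = ($0 ~ /\\$/); next }
        # Any other line that starts with spaces is reported.
        /^ +[^ #]/ {
            printf "%s:%d: indented with spaces: %s\n", FILENAME, FNR, $0
            bad = 1
        }
        { prev_continues = ($0 ~ /\\$/) }
        END { exit (bad ? 1 : 0) }
    ' "$makefile"
}
```

Run check_makefile_tabs Makefile in the repository; it prints nothing and returns 0 when no space-indented lines are found. As a second check, make -n rnaseq prints the commands for the rnaseq rule without running them.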

3. Run your workflow

Remember to specify -profile czbiohub_aws to grab the CZ Biohub-specific AWS configurations, and an --outdir in an AWS S3 bucket so you don’t run out of space on your small EC2 instance:

nextflow run -profile czbiohub_aws nf-core/rnaseq \
    --reads 's3://czb-maca/Plate_seq/24_month/180626_A00111_0166_BH5LNVDSXX/fastqs/*{R1,R2}*.fastq.gz' \
    --genome GRCm38 \
    --outdir s3://olgabot-maca/nextflow-test/

4. If you lose connection, how do you restart the jobs?

If you close your laptop, get onto the train, or lose WiFi connection, you may lose connection to AWS and may need to restart the jobs. To reattach, use the command tmux attach and you should see your Nextflow output! To get the named session, use:

tmux attach -t nextflow

To restart the jobs from where you left off, add the -resume flag to your nextflow command:

nextflow run -profile czbiohub_aws nf-core/rnaseq \
    --reads 's3://czb-maca/Plate_seq/24_month/180626_A00111_0166_BH5LNVDSXX/fastqs/*{R1,R2}*.fastq.gz' \
    --genome GRCm38 \
    --outdir s3://olgabot-maca/nextflow-test/ \
    -resume

It’s important that this command be re-run from the same directory as there is a “hidden” .nextflow folder that contains all the metadata and information about previous runs.
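Because -resume depends on that folder, a quick check before re-running can save a wasted launch. This is a minimal sketch: the function name can_resume is hypothetical, and it only tests that the .nextflow folder exists, not that its contents are intact.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a directory holds metadata from a previous run.
can_resume() {
    local launch_dir="${1:-$PWD}"   # directory the original run was launched from
    if [ -d "$launch_dir/.nextflow" ]; then
        echo "Found $launch_dir/.nextflow: safe to re-run with -resume from here"
    else
        echo "No .nextflow folder in $launch_dir: change to the original launch directory first" >&2
        return 1
    fi
}
```

One way to use it is can_resume && make rnaseq, after adding -resume to the command in the rule. Running nextflow log in the same directory lists the previous runs recorded there.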

iGenomes specific configuration

A local copy of the iGenomes resource has been made available on s3://czbiohub-reference/igenomes (in us-west-2 region) so you should be able to run the pipeline against any reference available in the igenomes.config specific to the nf-core pipeline.

You can do this by simply using the --genome <GENOME_ID> parameter.

For Human and Mouse, we use GENCODE gene annotations. This doesn’t change how you would specify the genome name, only that the pipelines run with the czbiohub_aws profile would be with GENCODE rather than iGenomes.

NB: You will need an account to use the HPC cluster on PROFILE CLUSTER in order to run the pipeline. If in doubt contact IT.

NB: Nextflow will need to submit the jobs via the job scheduler to the HPC cluster and as such the commands above will have to be executed on one of the login nodes. If in doubt contact IT.

High Priority Queue

If you would like to run with the High Priority queue, specify the highpriority config profile after czbiohub_aws. When applied after the main czbiohub_aws config, it overrides the process queue identifier.

To use it, submit your run with -profile czbiohub_aws,highpriority.

Note that the order of config profiles here is important. For example, -profile highpriority,czbiohub_aws will not work.
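If the profile string is assembled in a script or Makefile, a small guard can catch the wrong order before anything is submitted. This is a sketch that simply encodes the ordering rule stated above; check_profile_order is a hypothetical helper and not part of Nextflow or nf-core.

```shell
#!/usr/bin/env bash
# Hypothetical helper: require czbiohub_aws to be the first profile in the list.
check_profile_order() {
    local profiles="$1"   # comma-separated value intended for -profile
    case "$profiles" in
        czbiohub_aws|czbiohub_aws,*)
            echo "OK: -profile $profiles"
            ;;
        *)
            echo "czbiohub_aws must come first, e.g. -profile czbiohub_aws,highpriority" >&2
            return 1
            ;;
    esac
}
```

For example, check_profile_order "czbiohub_aws,highpriority" succeeds, while check_profile_order "highpriority,czbiohub_aws" returns a non-zero status.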

Config file

See config file on GitHub

czbiohub_aws.config
/*
 * -------------------------------------------------
 *  Nextflow config file for Chan Zuckerberg Biohub
 * -------------------------------------------------
 * Defines reference genomes, using iGenome paths
 * Imported under the default 'standard' Nextflow
 * profile in nextflow.config
 */
 
//Profile config names for nf-core/configs
params {
    config_profile_description = 'Chan Zuckerberg Biohub AWS Batch profile provided by nf-core/configs.'
    config_profile_contact     = 'Olga Botvinnik (@olgabot)'
    config_profile_url         = 'https://www.czbiohub.org/'
}
 
docker {
    enabled = true
}
 
process {
    resourceLimits = [
        memory: 1952.GB,
        cpus: 96,
        time: 240.h
    ]
    executor       = 'awsbatch'
    queue          = 'default-971039e0-830c-11e9-9e0b-02c5b84a8036'
    errorStrategy  = 'ignore'
}
 
workDir = "s3://czb-nextflow/intermediates/"
 
aws.region = 'us-west-2'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
params.tracedir = './'
 
params {
    saveReference          = true
 
    // Largest SPOT instances available on AWS: https://ec2instances.info/
    max_memory             = 1952.GB
    max_cpus               = 96
    max_time               = 240.h
 
    // Compatible with multiple versions of rnaseq pipeline
    seq_center             = "czbiohub"
    seqCenter              = "czbiohub"
 
    // illumina iGenomes reference file paths on CZ Biohub reference s3 bucket
    // No final slash because it's added later
    igenomes_base          = "s3://czbiohub-reference/igenomes"
 
    // GENCODE (human + mouse) reference file paths on CZ Biohub reference s3 bucket
    // No final slash because it's added later
    gencode_base           = "s3://czbiohub-reference/gencode"
    transgenes_base        = "s3://czbiohub-reference/transgenes"
    refseq_base            = "s3://czbiohub-reference/ncbi/genomes/refseq/"
 
    // AWS configurations
    awsregion              = "us-west-2"
    awsqueue               = 'default-971039e0-830c-11e9-9e0b-02c5b84a8036'
 
    igenomes_ignore        = true
    igenomesIgnore         = true
    //deprecated
 
    fc_extra_attributes    = 'gene_name'
    fc_group_features      = 'gene_id'
    fc_group_features_type = 'gene_type'
 
    trim_pattern           = "_+S+"
 
    // GENCODE GTF and fasta files
    genomes {
        GRCh38 {
            fasta            = "${params.gencode_base}/human/v30/GRCh38.p12.genome.ERCC92.fa"
            gtf              = "${params.gencode_base}/human/v30/gencode.v30.annotation.ERCC92.gtf"
            transcript_fasta = "${params.gencode_base}/human/v30/gencode.v30.transcripts.ERCC92.fa"
            star             = "${params.gencode_base}/human/v30/STARIndex/"
            salmon_index     = "${params.gencode_base}/human/v30/salmon_index/"
        }
        GRCm38 {
            fasta            = "${params.gencode_base}/mouse/vM21/GRCm38.p6.genome.ERCC92.fa"
            gtf              = "${params.gencode_base}/mouse/vM21/gencode.vM21.annotation.ERCC92.gtf"
            transcript_fasta = "${params.gencode_base}/mouse/vM21/gencode.vM21.transcripts.ERCC92.fa"
            star             = "${params.gencode_base}/mouse/vM21/STARIndex/"
        }
        'AaegL5.0' {
            fasta = "${params.refseq_base}/invertebrate/Aedes_aegypti/GCF_002204515.2_AaegL5.0/nf-core--rnaseq/reference_genome/GCF_002204515.2_AaegL5.0_genomic.fna"
            gtf   = "${params.refseq_base}/invertebrate/Aedes_aegypti/GCF_002204515.2_AaegL5.0/nf-core--rnaseq/reference_genome/GCF_002204515.2_AaegL5.0_genomic.gtf"
            bed   = "${params.refseq_base}/invertebrate/Aedes_aegypti/GCF_002204515.2_AaegL5.0/nf-core--rnaseq/reference_genome/GCF_002204515.2_AaegL5.0_genomic.bed"
            star  = "${params.refseq_base}/invertebrate/Aedes_aegypti/GCF_002204515.2_AaegL5.0/nf-core--rnaseq/reference_genome/star/"
        }
    }
 
    transgenes {
        ChR2 {
            fasta = "${params.transgenes_base}/ChR2/ChR2.fa"
            gtf   = "${params.transgenes_base}/ChR2/ChR2.gtf"
        }
        Cre {
            fasta = "${params.transgenes_base}/Cre/Cre.fa"
            gtf   = "${params.transgenes_base}/Cre/Cre.gtf"
        }
        ERCC {
            fasta = "${params.transgenes_base}/ERCC92/ERCC92.fa"
            gtf   = "${params.transgenes_base}/ERCC92/ERCC92.gtf"
        }
        GCaMP6m {
            fasta = "${params.transgenes_base}/GCaMP6m/GCaMP6m.fa"
            gtf   = "${params.transgenes_base}/GCaMP6m/GCaMP6m.gtf"
        }
        GFP {
            fasta = "${params.transgenes_base}/Gfp/Gfp.fa"
            gtf   = "${params.transgenes_base}/Gfp/Gfp.gtf"
        }
        NpHR {
            fasta = "${params.transgenes_base}/NpHR/NpHR.fa"
            gtf   = "${params.transgenes_base}/NpHR/NpHR.gtf"
        }
        RCaMP {
            fasta = "${params.transgenes_base}/RCaMP/RCaMP.fa"
            gtf   = "${params.transgenes_base}/RCaMP/RCaMP.gtf"
        }
        RGECO {
            fasta = "${params.transgenes_base}/RGECO/RGECO.fa"
            gtf   = "${params.transgenes_base}/RGECO/RGECO.gtf"
        }
        Tdtom {
            fasta = "${params.transgenes_base}/Tdtom/Tdtom.fa"
            gtf   = "${params.transgenes_base}/Tdtom/Tdtom.gtf"
        }
        'Car-T' {
            fasta = "${params.transgenes_base}/car-t/car-t.fa"
            gtf   = "${params.transgenes_base}/car-t/car-t.gtf"
        }
        zsGreen {
            fasta = "${params.transgenes_base}/zsGreen/zsGreen.fa"
            gtf   = "${params.transgenes_base}/zsGreen/zsGreen.gtf"
        }
    }
}
 
 
profiles {
    highpriority {
        process {
            queue = 'highpriority-971039e0-830c-11e9-9e0b-02c5b84a8036'
        }
    }
}