We are extremely happy to see this paper out. It describes the changes and updates made to nf-core/sarek, the DNA variant calling pipeline, over the last several years.
In 2020, we embarked on the journey of rewriting the whole pipeline in DSL2. One of the major motivations was to bring down cloud computing costs and generally reduce storage space and computational resources.
Overview
No nf-core pipeline without a metro map 🚇:
New tools
We added new tools: BwaMem2 and DragMap for alignment, more variant callers (DeepVariant, GATK HaplotypeCaller Joint Calling & Single Sample variant recalibration, CNVKit, Tiddit), and more annotation possibilities.
Some tools were replaced: Trimming is now done with FastP, CRAM quality control with Mosdepth. For convenience we added more quality control options: When starting from variant-calling directly, all input files can now run through the alignment QC steps.
Resource optimization
Use CRAM files
We ditched the BAM format where possible and switched to CRAM, saving us 4x work storage space.
Split files (but not too much)
The splitFastq() operator was replaced by a FastP process that splits the FASTQ files into chunks before alignment (default: 12). FastP also replaces Trim Galore! for read trimming.
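To illustrate the idea of splitting reads into a fixed number of chunks, here is a minimal Python sketch. It is not how FastP or sarek implements splitting: the round-robin strategy and the function names `read_fastq_records` and `split_round_robin` are assumptions made for illustration. Only the default of 12 chunks comes from the text above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (header, sequence, '+', qualities)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_round_robin(lines: Iterable[str], n_chunks: int = 12) -> List[List[List[str]]]:
    """Distribute FASTQ records over n_chunks lists, one record at a time.

    Hypothetical strategy for illustration; real tools may split differently.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[List[str]]] = [[] for _ in range(n_chunks)]
    for i, record in enumerate(read_fastq_records(lines)):
        chunks[i % n_chunks].append(record)
    return chunks


# Build a small synthetic FASTQ with 25 reads and split it into 12 chunks.
fastq_lines = []
for n in range(25):
    fastq_lines += [f"@read{n}", "ACGT", "+", "IIII"]
chunks = split_round_robin(fastq_lines, n_chunks=12)
```

Each chunk can then be aligned independently and the results merged afterwards. For paired-end data, applying the same deterministic split to both read files keeps mates in the same chunk.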
We also changed the default grouping of the intervals for BQSR to 21 groups (instead of 124), reducing storage space by another 4x and speeding up processing.
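Grouping many intervals into fewer, similarly sized batches can be sketched as follows. This is an illustration under stated assumptions, not sarek's actual grouping logic: it uses a greedy "largest first, into the currently smallest group" heuristic, and the name `group_intervals` as well as the synthetic interval lengths are hypothetical. Only the counts 124 and 21 come from the text.

```python
import heapq
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def interval_length(interval: Interval) -> int:
    return interval[2] - interval[1]


def group_intervals(intervals: List[Interval], n_groups: int = 21) -> List[List[Interval]]:
    """Greedily assign intervals (longest first) to the group with the smallest total length."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    groups: List[List[Interval]] = [[] for _ in range(n_groups)]
    # Min-heap of (total length assigned so far, group index).
    heap = [(0, idx) for idx in range(n_groups)]
    heapq.heapify(heap)
    for interval in sorted(intervals, key=interval_length, reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(interval)
        heapq.heappush(heap, (total + interval_length(interval), idx))
    # Drop empty groups, which occur when there are fewer intervals than groups.
    return [g for g in groups if g]


# 124 synthetic intervals of varying length, packed into 21 groups.
intervals = [("chr1", 0, 1000 + 37 * i) for i in range(124)]
groups = group_intervals(intervals, n_groups=21)
group_sizes = [sum(interval_length(iv) for iv in g) for g in groups]
```

Fewer groups mean fewer intermediate files and less per-job overhead, while balanced group sizes keep the parallel jobs finishing at roughly the same time.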
Cost savings
Overall, we reduced computational costs on AWS (measured last summer, using spot instances) by 70%, to about $20 for a run from FASTQs to annotated VCFs using Strelka, Manta, and VEP.
Benchmarking: a.k.a. is it any good?
We benchmarked the germline track with Illumina, MGI, and BGI Genome in a Bottle (GIAB) samples and the somatic track with SEQC2 samples. We recently joined the NCBench effort to continuously validate the pipeline on each release.
Team work makes dream work
This was a gigantic team effort with Lasse Folkersen, Anders Sune Pedersen, Francesco Lescai, Susanne Jodoin, Edmund Miller, Matthias Seybold, Oskar Wacker, Nick Smith, Gisela Gabernet, Sven Nahnsen, and many many others from the nf-core community:
Also, a shout-out to all the amazing people who started sarek way back in 2016: Szilveszter Juhos, Malin Larsson, Pall I. Olason, Marcel Martin, Jesper Eisfeldt, Sebastian DiLorenzo, Johanna Sandgren, Teresita Díaz De Ståhl, Phil Ewels, Valtteri Wirta, Monica Nistér, Max Käller, and Björn Nystedt.
Join the fun
If you want to join us, visit https://nf-co.re/join/. We’re on the #sarek channel on Slack, and you’re welcome to join #sarek_dev if you really want to get involved.
There is more
If you want to know more, here are some recent talks detailing the changes and development journey: