GATK VARIANT CALLING BEST PRACTICE WORKFLOW
Call short variants (SNP+INDEL) with GATK’s Mutect2:
Step
Tool
Reason
Indexing
Picard
Create genome sequence dictionnary
Indexing
Samtools
Index fasta genome sequence
Grouping
Picard
Add or replace possible missing read groups
Indexing
Sambamba
Index re-grouped BAM-formatted alignments
Calling
Mutect2
Call short variants with Mutect2
Contaminations
GetPileupSummaries
Tabulates pileup metrics for inferring contamination
Contaminations
CalculateContamination
Estimate cross sample contamination
Orientation
LearnReadOrientationModel
Search for sequencing artifacts based on read orientation
Filtering
FilterMutectCalls
Use previously estimated biases and filter variants
Usage
Via module
This usage is recommended with Snakemake >=7.9.
You can include this meta-wrapper in your workflow via the Snakemake module system:
module gatk variant calling best practice workflow:
meta_wrapper: "v9.2.0/meta/bio/gatk_mutect2_calling"
pathvars:
results="...", # Path to results directory
resources="...", # Path to resources directory
logs="...", # Path to logs directory
per="...", # Pattern for sample identifiers, e.g. ``"{sample}"``
genome_sequence="...", # Path to FASTA file with genome sequence
genome_sequence_index="...", # Path to FAI file with genome sequence index
genome_dict="...", # Path to DICT file with genome sequence dictionary
bed_intervals="...", # Path to BED file with genome intervals
known_variants="...", # Path to gzipped VCF file containing known variants
known_variants_tbi="...", # Path to TBI index corresponding to known variants
bam_file="...", # Path/Pattern to aligned reads (SAM/BAM/CRAM)
use rule * from gatk variant calling best practice workflow as gatk variant calling best practice workflow_*
Upon using the rules, you can additionally modify input, output, log, and params as needed (see the definition of each rule below and the modules documentation). For additional parameters in each individual wrapper, please refer to their corresponding documentation (see links below).
Via copy-paste
Alternatively, you can directly copy-paste and modify the full meta-wrapper code below into your workflow.
Execution
When running with
snakemake --sdm conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Used wrappers
The following individual wrappers are used in this meta-wrapper:
Please refer to each wrapper in above list for additional configuration parameters and information about the executed code.
Code
rule create_dict:
input:
"<genome_sequence>",
output:
"<genome_dict>",
threads: 1
resources:
mem_mb=1024,
log:
"<logs>/picard/create_dict.log",
params:
extra="",
wrapper:
"v7.6.0/bio/picard/createsequencedictionary"
rule samtools_index:
input:
"<genome_sequence>",
output:
"<genome_sequence_index>",
log:
"<logs>/genome_index.log",
params:
extra="", # optional params string
wrapper:
"v7.6.0/bio/samtools/faidx"
rule picard_replace_read_groups:
input:
"<bam_file>",
output:
"<results>/picard/<per>.bam",
threads: 1
resources:
mem_mb=1024,
log:
"<logs>/picard/replace_rg/<per>.log",
params:
# Required for GATK
extra="--RGLB lib1 --RGPL illumina --RGPU {sample} --RGSM {sample}",
wrapper:
"v7.6.0/bio/picard/addorreplacereadgroups"
rule sambamba_index_picard_bam:
input:
"<results>/picard/<per>.bam",
output:
"<results>/picard/<per>.bam.bai",
threads: 1
log:
"<logs>/sambamba/index/<per>.log",
params:
extra="",
wrapper:
"v6.1.0/bio/sambamba/index"
rule mutect2_call:
input:
fasta="<genome_sequence>",
fasta_dict="<genome_dict>",
fasta_fai="<genome_sequence_index>",
map="<results>/picard/<per>.bam",
map_idx="<results>/picard/<per>.bam.bai",
intervals="<bed_intervals>",
output:
vcf="<results>/variant/<per>.vcf",
bam="<results>/variant/<per>.bam",
f1r2="<results>/counts/<per>.f1r2.tar.gz",
threads: 1
resources:
mem_mb=1024,
params:
extra=" --tumor-sample {sample} ",
log:
"<logs>/mutect/<per>.log",
wrapper:
"v7.6.0/bio/gatk/mutect"
rule gatk_get_pileup_summaries:
input:
bam="<results>/picard/<per>.bam",
bai="<results>/picard/<per>.bam.bai",
variants="<known_variants>",
variants_tbi="<known_variants_tbi>",
intervals="<bed_intervals>",
output:
temp("<results>/summaries/<per>.table"),
threads: 1
resources:
mem_mb=1024,
params:
extra="",
log:
"<logs>/summary/<per>.log",
wrapper:
"v7.6.0/bio/gatk/getpileupsummaries"
rule gatk_calculate_contamination:
input:
tumor="<results>/summaries/<per>.table",
output:
temp("<results>/contamination/<per>.pileups.table"),
threads: 1
resources:
mem_mb=1024,
log:
"<logs>/contamination/<per>.log",
params:
extra="",
wrapper:
"v7.6.0/bio/gatk/calculatecontamination"
rule gatk_learn_read_orientation_model:
input:
f1r2="<results>/counts/<per>.f1r2.tar.gz",
output:
temp("<results>/artifacts_prior/<per>.tar.gz"),
threads: 1
resources:
mem_mb=1024,
params:
extra="",
log:
"<logs>/learnreadorientationbias/<per>.log",
wrapper:
"v7.6.0/bio/gatk/learnreadorientationmodel"
rule filter_mutect_calls:
input:
vcf="<results>/variant/<per>.vcf",
ref="<genome_sequence>",
ref_dict="<genome_dict>",
ref_fai="<genome_sequence_index>",
bam="<results>/picard/<per>.bam",
bam_bai="<results>/picard/<per>.bam.bai",
contamination="<results>/contamination/<per>.pileups.table",
f1r2="<results>/artifacts_prior/<per>.tar.gz",
output:
vcf="<results>/variant/<per>.filtered.vcf.gz",
vcf_idx="<results>/variant/<per>.filtered.vcf.gz.tbi",
threads: 1
resources:
mem_mb=1024,
log:
"<logs>/gatk/filter/<per>.log",
params:
extra="--create-output-variant-index --min-median-mapping-quality 35 --max-alt-allele-count 3",
java_opts="",
wrapper:
"v7.6.0/bio/gatk/filtermutectcalls"