The Snakemake Wrappers repository¶
The Snakemake Wrapper Repository is a collection of reusable wrappers that allow you to quickly use popular tools from Snakemake rules and workflows.
Usage¶
The general strategy is to include a wrapper into your workflow via the wrapper directive, e.g.
rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.2.0/bio/samtools/sort"
Here, Snakemake will automatically download the corresponding wrapper from https://github.com/snakemake/snakemake-wrappers/tree/0.2.0/bio/samtools/sort. The prefix 0.2.0 can be replaced with any version tag you want to use, or with a commit id. This ensures reproducibility, since changes in the wrapper implementation won't be propagated automatically to your workflow. Alternatively, e.g. for development, the wrapper directive can also point to full URLs, including local file:// URLs.
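For development, this means a rule can point at a local clone of this repository instead of a released tag. A minimal sketch (the file:// path below is a placeholder for your own checkout):

rule samtools_sort_dev:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    threads: 8
    wrapper:
        # hypothetical local checkout; adjust the path to your clone
        "file:///path/to/your/clone/snakemake-wrappers/bio/samtools/sort"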
Each wrapper defines required software packages and versions. In combination with the --use-conda flag of Snakemake, these will be deployed automatically.
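For instance, assuming the rule above is part of your workflow, the dependencies defined by the wrapper are deployed on the fly with an invocation like:

snakemake --cores 8 --use-conda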
Contribute¶
We invite anybody to contribute to the Snakemake Wrapper Repository. If you want to contribute, we suggest the following procedure:
- Fork the repository: https://github.com/snakemake/snakemake-wrappers
- Clone your fork locally.
- Locally, create a new branch:
git checkout -b my-new-snakemake-wrapper
- Commit your contributions to that branch and push them to your fork:
git push -u origin my-new-snakemake-wrapper
- Create a pull request.
The pull request will be reviewed and included as fast as possible. Contributions should follow the coding style of the already present examples, i.e.:
- provide a meta.yaml with name, description and author(s) of the wrapper (see the sketch below)
- provide an environment.yaml which lists all required software packages (the packages should be available for installation via the default Anaconda channels or via the conda channels bioconda or conda-forge; other sustainable, community-maintained channels are possible as well)
- provide a minimal test case in a subfolder called test, with an example Snakefile that shows how to use the wrapper, some minimal testing data (also check existing wrappers for suitable data), and add an invocation of the test in test.py
- follow the Python style guide, using 4 spaces for indentation.
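As a rough sketch of what these two metadata files usually look like (the wrapper name, author and version pin below are purely illustrative; check existing wrappers for the exact conventions):

# meta.yaml
name: samtools sort
description: Sort BAM files with samtools.
authors:
  - Jane Doe

# environment.yaml
channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools ==1.9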
Testing locally¶
If you want to debug your contribution locally before creating a pull request, we recommend adding your test case to the start of the list in test.py, so that it runs first. Then, install miniconda with the channels as described for bioconda, set up an environment with the necessary dependencies, and activate it:
conda create -n test-snakemake-wrappers snakemake pytest conda
conda activate test-snakemake-wrappers
Afterwards, from the main directory of the repo, you can run the tests with:
pytest test.py -v
If you use a keyboard interrupt after your test has failed, you will get all the relevant stdout and stderr messages printed.
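For orientation, an entry in test.py typically looks roughly like the following (the test name, wrapper path and target file are illustrative and depend on your test Snakefile):

def test_samtools_sort(run):
    run(
        "bio/samtools/sort",
        ["snakemake", "--cores", "1", "mapped/a.sorted.bam", "--use-conda", "-F"],
    )

While debugging, you can also select just your test by name, e.g. with pytest test.py -v -k test_samtools_sort.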
If you also want to test the docs generation locally, create another environment and activate it:
conda create -n test-snakemake-wrapper-docs sphinx sphinx_rtd_theme pyyaml sphinx-copybutton
conda activate test-snakemake-wrapper-docs
Then, enter the respective directory and build the docs:
cd docs
make html
If it runs through, you can open the main page at docs/_build/html/index.html in a web browser. If you want to start fresh, you can clean up the build with make clean.
Wrappers¶
Wrappers allow you to quickly use popular tools and libraries in Snakemake workflows.
The menu on the left (expand by clicking (+) if necessary) lists all available wrappers.
ADAPTERREMOVAL¶
Rapid adapter trimming, identification, and read merging. For more information, see the AdapterRemoval documentation.
Example¶
This wrapper can be used in the following way:
rule adapterremoval_se:
    input:
        sample=["reads/se/{sample}.fastq"]
    output:
        fq="trimmed/se/{sample}.fastq.gz",
        discarded="trimmed/se/{sample}.discarded.fastq.gz",
        settings="stats/se/{sample}.settings"
    log:
        "logs/adapterremoval/se/{sample}.log"
    params:
        adapters="--adapter1 ACGGCTAGCTA",
        extra="",
        merge_singletons=True,  # Irrelevant for SE; just for testing purposes
    threads: 1
    wrapper:
        "0.73.0/bio/adapterremoval"

rule adapterremoval_pe:
    input:
        sample=["reads/pe/{sample}.1.fastq", "reads/pe/{sample}.2.fastq"]
    output:
        fq1="trimmed/pe/{sample}_R1.fastq.gz",
        fq2="trimmed/pe/{sample}_R2.fastq.gz",
        collapsed="trimmed/pe/{sample}.collapsed.fastq.gz",
        collapsed_trunc="trimmed/pe/{sample}.collapsed_trunc.fastq.gz",
        singleton="trimmed/pe/{sample}.singleton.fastq.gz",
        discarded="trimmed/pe/{sample}.discarded.fastq.gz",
        settings="stats/pe/{sample}.settings"
    log:
        "logs/adapterremoval/pe/{sample}.log"
    params:
        adapters="--adapter1 ACGGCTAGCTA --adapter2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
        extra="--collapse --collapse-deterministic",
    threads: 2
    wrapper:
        "0.73.0/bio/adapterremoval"

rule adapterremoval_pe_collapse_single:
    input:
        sample=["reads/pe/{sample}.1.fastq", "reads/pe/{sample}.2.fastq"]
    output:
        fq1="trimmed/pe_collapse/{sample}_R1.fastq.gz",
        fq2="trimmed/pe_collapse/{sample}_R2.fastq.gz",
        singleton="trimmed/pe_collapse/{sample}.fastq.gz",
        discarded="trimmed/pe_collapse/{sample}.discarded.fastq.gz",
        settings="stats/pe_collapse/{sample}.settings"
    log:
        "logs/adapterremoval/pe_collapse/{sample}.log"
    params:
        adapters="--adapter1 ACGGCTAGCTA --adapter2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
        extra="--collapse --collapse-deterministic",
        merge_singletons=True,
    threads: 2
    wrapper:
        "0.73.0/bio/adapterremoval"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
adapterremoval==2.3.1
Input/Output¶
Input:
- raw fastq file with R1 reads
- raw fastq file with R2 reads
Output:
- trimmed fastq file with R1 reads
- trimmed fastq file with R2 reads
- fastq file with singleton reads (those where mate was filtered out)
- fastq file with collapsed reads (only for PE and if collapsing of reads is enabled)
- fastq file with collapsed truncated reads, i.e. reads that were trimmed due to the presence of low-quality or ambiguous nucleotides (only for PE and if collapsing of reads is enabled)
- fastq file with discarded reads
- settings and stats file
Notes¶
- If merge_singletons is set (only for PE and if collapsing of reads is enabled), then collapsed and collapsed truncated files are not created and reads are appended to the singleton file.
Authors¶
- Filipe G. Vieira
Code¶
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"
from snakemake.shell import shell
from pathlib import Path
import tempfile
extra = snakemake.params.get("extra", "") + " "
adapters = snakemake.params.get("adapters", "")
collapse_pe = (
True if "--collapse " in extra or "--collapse-deterministic " in extra else False
)
merge_singletons = snakemake.params.get("merge_singletons", False)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Check input files
n = len(snakemake.input.sample)
assert (
n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."
# Input files
if n == 1 or "--interleaved " in extra or "--interleaved-input " in extra:
reads = "--file1 {}".format(snakemake.input.sample)
else:
reads = "--file1 {} --file2 {}".format(*snakemake.input.sample)
# Gzip or Bzip compressed output?
compress_out = ""
if all(
[
Path(value).suffix == ".gz"
for key, value in snakemake.output.items()
if key != "settings"
]
):
compress_out = "--gzip"
elif all(
[
Path(value).suffix == ".bz2"
for key, value in snakemake.output.items()
if key != "settings"
]
):
compress_out = "--bzip2"
else:
raise ValueError(
"all output files (except for 'settings') must be compressed the same way"
)
# Output files
if n == 1 or "--interleaved " in extra or "--interleaved-output " in extra:
trimmed = f"--output1 {snakemake.output.fq}"
else:
trimmed = f"--output1 {snakemake.output.fq1} --output2 {snakemake.output.fq2}"
# Collapsed reads output
if n == 2:
trimmed += f" --singleton {snakemake.output.singleton}"
if collapse_pe:
if merge_singletons:
out_collapsed = tempfile.NamedTemporaryFile()
out_collapsed_trunc = tempfile.NamedTemporaryFile()
trimmed += f" --outputcollapsed {out_collapsed.name} --outputcollapsedtruncated {out_collapsed_trunc.name}"
else:
trimmed += f" --outputcollapsed {snakemake.output.collapsed} --outputcollapsedtruncated {snakemake.output.collapsed_trunc}"
shell(
"(AdapterRemoval --threads {snakemake.threads} "
"{reads} "
"{adapters} "
"{extra} "
"{compress_out} "
"{trimmed} "
"--discarded {snakemake.output.discarded} "
"--settings {snakemake.output.settings}"
") {log}"
)
if collapse_pe and merge_singletons:
shell("cat {out_collapsed.name} >> {snakemake.output.singleton}")
out_collapsed.close()
shell("cat {out_collapsed_trunc.name} >> {snakemake.output.singleton}")
out_collapsed_trunc.close()
ARRIBA¶
Detect gene fusions from chimeric STAR output.
Example¶
This wrapper can be used in the following way:
rule arriba:
    input:
        # STAR bam containing chimeric alignments
        bam="{sample}.bam",
        # path to reference genome
        genome="genome.fasta",
        # path to annotation gtf
        annotation="annotation.gtf",
    output:
        # approved gene fusions
        fusions="fusions/{sample}.tsv",
        # discarded gene fusions
        discarded="fusions/{sample}.discarded.tsv"  # optional
    log:
        "logs/arriba/{sample}.log"
    params:
        # arriba blacklist file
        blacklist="blacklist.tsv",  # strongly recommended, see https://arriba.readthedocs.io/en/latest/input-files/#blacklist
        # file containing known fusions
        known_fusions="",  # optional
        # file containing information from structural variant analysis
        sv_file="",  # optional
        # optional parameters
        extra="-T -P -i 1,2"
    threads: 1
    wrapper:
        "0.73.0/bio/arriba"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
arriba==1.1.0
Authors¶
- Jan Forster
Code¶
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
discarded_fusions = snakemake.output.get("discarded", "")
if discarded_fusions:
discarded_cmd = "-O " + discarded_fusions
else:
discarded_cmd = ""
blacklist = snakemake.params.get("blacklist")
if blacklist:
blacklist_cmd = "-b " + blacklist
else:
blacklist_cmd = ""
known_fusions = snakemake.params.get("known_fusions")
if known_fusions:
known_cmd = "-k" + known_fusions
else:
known_cmd = ""
sv_file = snakemake.params.get("sv_file")
if sv_file:
sv_cmd = "-d" + sv_file
else:
sv_cmd = ""
shell(
"arriba "
"-x {snakemake.input.bam} "
"-a {snakemake.input.genome} "
"-g {snakemake.input.annotation} "
"{blacklist_cmd} "
"{known_cmd} "
"{sv_cmd} "
"-o {snakemake.output.fusions} "
"{discarded_cmd} "
"{extra} "
"{log}"
)
ART¶
For art, the following wrappers are available:
ART_PROFILER_ILLUMINA¶
Use the art profiler to create a base quality score profile for Illumina read data from a fastq file.
This wrapper can be used in the following way:
rule art_profiler_illumina:
    input:
        "data/{sample}.fq",
    output:
        "profiles/{sample}.txt"
    log:
        "logs/art_profiler_illumina/{sample}.log"
    params: ""
    threads: 2
    wrapper:
        "0.73.0/bio/art/profiler_illumina"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
art==2016.06.05
- David Laehnemann
- Victoria Sack
__author__ = "David Laehnemann, Victoria Sack"
__copyright__ = "Copyright 2018, David Laehnemann, Victoria Sack"
__email__ = "david.laehnemann@hhu.de"
__license__ = "MIT"
from snakemake.shell import shell
import os
import tempfile
import re
# Create temporary directory that will only contain the symbolic link to the
# input file, in order to sanely work with the art_profiler_illumina cli
with tempfile.TemporaryDirectory() as temp_input:
# ensure that .fastq and .fastq.gz input files work, as well
filename = os.path.basename(snakemake.input[0]).replace(".fastq", ".fq")
# figure out the exact file extension after the above substitution
ext = re.search("fq(\.gz)?$", filename)
if ext:
fq_extension = ext.group(0)
else:
raise IOError(
"Incompatible extension: This art_profiler_illumina "
"wrapper requires input files with one of the following "
"extensions: fastq, fastq.gz, fq or fq.gz. Please adjust "
"your input and the invocation of the wrapper accordingly."
)
os.symlink(
# snakemake paths are relative, but the symlink needs to be absolute
os.path.abspath(snakemake.input[0]),
# the following awkward file name generation has reasons:
# * the file name needs to be unique to the execution of the
# rule, as art will create and mv temporary files with its basename
# in the output directory, which causes utter confusion when
# executing instances of the rule in parallel
# * temp file name cannot have any read infixes before the file
# extension, because otherwise art does read enumeration magic
# that messes up output file naming
os.path.join(
temp_input,
filename.replace(
"." + fq_extension, "_preventing_art_magic_spacer." + fq_extension
),
),
)
# include output folder name in the profile_name command line argument and
# strip off the file extension, as art will add its own ".txt"
profile_name = os.path.join(
os.path.dirname(snakemake.output[0]), filename.replace("." + fq_extension, "")
)
shell(
"( art_profiler_illumina {snakemake.params} {profile_name}"
" {temp_input} {fq_extension} {snakemake.threads} ) 2> {snakemake.log}"
)
BAMTOOLS¶
For bamtools, the following wrappers are available:
BAMTOOLS FILTER¶
Filters BAM files. For more information about bamtools see bamtools documentation and bamtools source code.
This wrapper can be used in the following way:
rule bamtools_filter:
    input:
        "{sample}.bam"
    output:
        "filtered/{sample}.bam"
    params:
        # optional parameters
        tags = [ "NM:<4", "MQ:>=10" ],  # list of key:value pair strings
        min_size = "-2000",
        max_size = "2000",
        min_length = "10",
        max_length = "20",
        # to add more optional parameters (see bamtools filter --help):
        additional_params = "-mapQuality \">=0\" -isMapped \"true\""
    log:
        "logs/bamtools/filtered/{sample}.log"
    wrapper:
        "0.73.0/bio/bamtools/filter"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bamtools==2.5.1
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# extract arguments
params = ""
extra_limits = ""
tags = snakemake.params.get("tags")
min_size = snakemake.params.get("min_size")
max_size = snakemake.params.get("max_size")
min_length = snakemake.params.get("min_length")
max_length = snakemake.params.get("max_length")
additional_params = snakemake.params.get("additional_params")
if tags and tags is not None:
params = params + " " + " ".join(map('-tag "{}"'.format, tags))
if min_size and min_size is not None:
params = params + ' -insertSize ">=' + min_size + '"'
if max_size and max_size is not None:
extra_limits = extra_limits + ' -insertSize "<=' + max_size + '"'
else:
if max_size and max_size is not None:
params = params + ' -insertSize "<=' + max_size + '"'
if min_length and min_length is not None:
params = params + ' -length ">=' + min_length + '"'
if max_length and max_length is not None:
extra_limits = extra_limits + ' -length "<=' + max_length + '"'
else:
if max_length and max_length is not None:
params = params + ' -length "<=' + max_length + '"'
if additional_params and additional_params is not None:
params = params + " " + additional_params
if extra_limits:
params = params + " | bamtools filter" + extra_limits
shell(
"(bamtools filter"
" -in {snakemake.input[0]}" + params + " -out {snakemake.output[0]}) {log}"
)
BAMTOOLS FILTER WITH JSON¶
Filters BAM files, with filtering parameters and rules given as a JSON script. For more information about bamtools see the bamtools documentation and bamtools source code.
This wrapper can be used in the following way:
rule bamtools_filter_json:
    input:
        "{sample}.bam"
    output:
        "filtered/{sample}.bam"
    params:
        json="filtering-rules.json",  # JSON file with filtering rules (see the sketch below)
        region=""  # optional parameter for defining a specific region, e.g. "chr1:500..chr3:750"
    log:
        "logs/bamtools/filtered/{sample}.log"
    wrapper:
        "0.73.0/bio/bamtools/filter_json"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bamtools==2.5.1
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
region = snakemake.params.get("region")
region_param = ""
if region and region is not None:
region_param = ' -region "' + region + '"'
shell(
"(bamtools filter"
" -in {snakemake.input[0]}"
" -out {snakemake.output[0]}"
+ region_param
+ " -script {snakemake.params.json}) {log}"
)
BAMTOOLS SPLIT¶
Split a BAM file into sub-files, by default by reference.
This wrapper can be used in the following way:
rule bamtools_split:
    input:
        "mapped/{sample}.bam",
    output:
        "mapped/{sample}.REF_xx.bam",
    params:
        extra="-reference",
    log:
        "logs/bamtools_split/{sample}.log",
    wrapper:
        "0.73.0/bio/bamtools/split"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bamtools==2.5.1
- Patrik Smeds
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2021, Patrik Smeds"
__email__ = "patrik.smeds@scilifelab.uu.se"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
if len(snakemake.input) != 1:
raise ValueError("One bam input file expected, got: " + str(len(snakemake.input)))
shell("bamtools split -in {snakemake.input} {extra} {log}")
BAMTOOLS STATS¶
Use bamtools to collect statistics from a BAM file. For more information about bamtools see bamtools documentation and bamtools source code.
This wrapper can be used in the following way:
rule bamtools_stats:
    input:
        "{sample}.bam"
    output:
        "{sample}.bamstats"
    params:
        "-insert"  # optional: summarize insert size data
    log:
        "logs/bamtools/stats/{sample}.log"
    wrapper:
        "0.73.0/bio/bamtools/stats"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bamtools==2.5.1
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"(bamtools stats {snakemake.params} -in {snakemake.input[0]} > {snakemake.output[0]}) {log}"
)
BCFTOOLS¶
For bcftools, the following wrappers are available:
BCFTOOLS CALL¶
Call variants with bcftools call.
This wrapper can be used in the following way:
rule bcftools_call:
    input:
        pileup="{sample}.pileup.bcf",
    output:
        calls="{sample}.calls.bcf",
    params:
        caller="-m",  # valid options include -c/--consensus-caller or -m/--multiallelic-caller
        options="--ploidy 1 --prior 0.001",
    log:
        "logs/bcftools_call/{sample}.log",
    wrapper:
        "0.73.0/bio/bcftools/call"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
- Johannes Köster
- Michael Hall
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
class CallerOptionError(Exception):
pass
valid_caller_opts = {"-c", "--consensus-caller", "-m", "--multiallelic-caller"}
caller_opt = snakemake.params.get("caller", "")
if caller_opt.strip() not in valid_caller_opts:
raise CallerOptionError(
"bcftools call expects either -m/--multiallelic-caller or "
"-c/--consensus-caller as caller option."
)
options = snakemake.params.get("options", "")
shell(
"bcftools call {options} {caller_opt} --threads {snakemake.threads} "
"-o {snakemake.output.calls} {snakemake.input.pileup} 2> {snakemake.log}"
)
BCFTOOLS CONCAT¶
Concatenate vcf/bcf files with bcftools. For more information see BCFtools documentation.
This wrapper can be used in the following way:
rule bcftools_concat:
    input:
        calls=["a.bcf", "b.bcf"]
    output:
        "all.bcf"
    params:
        ""  # optional parameters for bcftools concat (except -o)
    wrapper:
        "0.73.0/bio/bcftools/concat"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"bcftools concat {snakemake.params} -o {snakemake.output[0]} "
"{snakemake.input.calls} "
"{log}"
)
BCFTOOLS FILTER¶
Filter vcf/bcf files.
This wrapper can be used in the following way:
rule bcf_filter_o_vcf:
    input:
        "{prefix}.bcf",
    output:
        "{prefix}.filter.vcf",
    log:
        "log/{prefix}.filter.vcf.log",
    params:
        filter="-i 'QUAL > 5'",
        extra="",
    wrapper:
        "0.73.0/bio/bcftools/filter"

rule bcf_filter_o_vcf_gz:
    input:
        "{prefix}.bcf",
    output:
        "{prefix}.filter.vcf.gz",
    log:
        "log/{prefix}.filter.vcf.gz.log",
    params:
        filter="-i 'QUAL > 5'",
        extra="",
    wrapper:
        "0.73.0/bio/bcftools/filter"

rule bcf_filter_o_bcf:
    input:
        "{prefix}.bcf",
    output:
        "{prefix}.filter.bcf",
    log:
        "log/{prefix}.filter.bcf.log",
    params:
        filter="-i 'QUAL > 5'",
        extra="",
    wrapper:
        "0.73.0/bio/bcftools/filter"

rule bcf_filter_o_bcf_gz:
    input:
        "{prefix}.bcf",
    output:
        "{prefix}.filter.bcf.gz",
    log:
        "log/{prefix}.filter.bcf.gz.log",
    params:
        filter="-i 'QUAL > 5'",
        extra="",
    wrapper:
        "0.73.0/bio/bcftools/filter"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools==1.9
- Patrik Smeds
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2021, Patrik Smeds"
__email__ = "patrik.smeds@scilifelab.uu.se"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
if snakemake.output[0].endswith("bcf"):
output_format = "-Ou"
elif snakemake.output[0].endswith("bcf.gz"):
output_format = "-Ob"
elif snakemake.output[0].endswith("vcf"):
output_format = "-Ov"
elif snakemake.output[0].endswith("vcf.gz"):
output_format = "-Oz"
if len(snakemake.input) > 1:
raise Exception("Only one input file expected, got: " + str(len(snakemake.input)))
if len(snakemake.output) > 1:
raise Exception("Only one output file expected, got: " + str(len(snakemake.output)))
filter = snakemake.params.get("filter", "")
extra = snakemake.params.get("extra", "")
shell(
"bcftools filter {filter} {extra} {snakemake.input[0]} "
"{output_format} "
"-o {snakemake.output[0]} "
"{log}"
)
BCFTOOLS INDEX¶
Index vcf/bcf file. For more information see BCFtools documentation.
This wrapper can be used in the following way:
rule bcftools_index:
    input:
        "a.bcf"
    output:
        "a.bcf.csi"
    params:
        extra=""  # optional parameters for bcftools index
    wrapper:
        "0.73.0/bio/bcftools/index"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
- Jan Forster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
## Extract arguments
extra = snakemake.params.get("extra", "")
shell("bcftools index" " {extra}" " {snakemake.input[0]}")
BCFTOOLS MERGE¶
Merge vcf/bcf files with bcftools. For more information see BCFtools documentation.
This wrapper can be used in the following way:
rule bcftools_merge:
    input:
        calls=["a.bcf", "b.bcf"]
    output:
        "all.bcf"
    params:
        ""  # optional parameters for bcftools merge (except -o)
    wrapper:
        "0.73.0/bio/bcftools/merge"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
- Patrik Smeds
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell(
"bcftools merge {snakemake.params} -o {snakemake.output[0]} "
"{snakemake.input.calls}"
)
BCFTOOLS MPILEUP¶
Generate VCF or BCF containing genotype likelihoods for one or multiple alignment (BAM or CRAM) files with bcftools mpileup.
This wrapper can be used in the following way:
rule bcftools_mpileup:
    input:
        index="genome.fasta.fai",
        ref="genome.fasta",  # this can be left out if --no-reference is in options
        alignments="mapped/{sample}.bam",
    output:
        pileup="pileups/{sample}.pileup.bcf",
    params:
        options="--max-depth 100 --min-BQ 15",
    log:
        "logs/bcftools_mpileup/{sample}.log",
    wrapper:
        "0.73.0/bio/bcftools/mpileup"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
- Michael Hall
__author__ = "Michael Hall"
__copyright__ = "Copyright 2020, Michael Hall"
__email__ = "michael@mbh.sh"
__license__ = "MIT"
from snakemake.shell import shell
class MissingReferenceError(Exception):
pass
options = snakemake.params.get("options", "")
# determine if a fasta reference is provided or not and add to options
if "--no-reference" not in options:
ref = snakemake.input.get("ref", "")
if not ref:
raise MissingReferenceError(
"The --no-reference option was not given, but no fasta reference was "
"provided."
)
options += " --fasta-ref {}".format(ref)
shell(
"bcftools mpileup {options} --threads {snakemake.threads} "
"--output {snakemake.output.pileup} "
"{snakemake.input.alignments} 2> {snakemake.log}"
)
BCFTOOLS NORM¶
Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows. For more information see BCFtools documentation.
This wrapper can be used in the following way:
rule norm_vcf:
    input:
        "{prefix}.vcf"
    output:
        "{prefix}.norm.vcf"
    params:
        ""  # optional parameters for bcftools norm (except -o)
    wrapper:
        "0.73.0/bio/bcftools/norm"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
- Dayne Filer
__author__ = "Dayne Filer"
__copyright__ = "Copyright 2019, Dayne Filer"
__email__ = "dayne.filer@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell(
"bcftools norm {snakemake.params} {snakemake.input[0]} " "-o {snakemake.output[0]}"
)
BCFTOOLS REHEADER¶
Change header or sample names of vcf/bcf file. For more information see BCFtools documentation.
This wrapper can be used in the following way:
rule bcftools_reheader:
    input:
        vcf="a.bcf",
        ## new header, can be omitted if "samples" is set
        header="header.txt",
        ## file containing new sample names, can be omitted if "header" is set
        samples="samples.tsv"
    output:
        "a.reheader.bcf"
    params:
        extra="",  # optional parameters for bcftools reheader
        view_extra="-O b"  # add output format for internal bcftools view call
    wrapper:
        "0.73.0/bio/bcftools/reheader"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
## Extract arguments
header = snakemake.input.get("header", "")
if header:
header_cmd = "-h " + header
else:
header_cmd = ""
samples = snakemake.input.get("samples", "")
if samples:
samples_cmd = "-s " + samples
else:
samples_cmd = ""
extra = snakemake.params.get("extra", "")
view_extra = snakemake.params.get("view_extra", "")
shell(
"bcftools reheader "
"{extra} "
"{header_cmd} "
"{samples_cmd} "
"{snakemake.input.vcf} "
"| bcftools view "
"{view_extra} "
"> {snakemake.output}"
)
BCFTOOLS SORT¶
Sort vcf/bcf file. For more information see BCFtools documentation.
This wrapper can be used in the following way:
rule bcftools_sort:
    input:
        "{sample}.bcf"
    output:
        "{sample}.sorted.bcf"
    log:
        "logs/bcftools/sort/{sample}.log"
    params:
        tmp_dir = "`mktemp -d`",
        # Set to True, in case you want uncompressed BCF output
        uncompressed_bcf = False,
        # Extra arguments
        extra = ""
    resources:
        mem_mb = 8000
    wrapper:
        "0.73.0/bio/bcftools/sort"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools==1.11
- Filipe G. Vieira
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
max_mem = snakemake.resources.get("mem_mb", "")
if max_mem:
max_mem = "--max-mem {}M".format(max_mem)
else:
max_mem = snakemake.resources.get("mem_gb", "")
if max_mem:
max_mem = "--max-mem {}G".format(max_mem)
else:
max_mem = ""
tmp_dir = snakemake.params.get("tmp_dir", "")
if tmp_dir:
tmp_dir = "--temp-dir {}".format(tmp_dir)
else:
tmp_dir = ""
uncompressed_bcf = snakemake.params.get("uncompressed_bcf", False)
out_name, out_ext = path.splitext(snakemake.output[0])
if out_ext == ".vcf":
out_format = "v"
elif out_ext == ".bcf":
if uncompressed_bcf:
out_format = "u"
else:
out_format = "b"
elif out_ext == ".gz":
out_name, out_ext = path.splitext(out_name)
if out_ext == ".vcf":
out_format = "z"
else:
raise ValueError("output file with invalid extension (.vcf, .vcf.gz, .bcf).")
else:
raise ValueError("output file with invalid extension (.vcf, .vcf.gz, .bcf).")
shell(
"bcftools sort {max_mem} {tmp_dir} {extra} --output-type {out_format} --output-file {snakemake.output[0]} {snakemake.input[0]} {log}"
)
BCFTOOLS VIEW¶
View vcf/bcf file in a different format. For more information see BCFtools documentation.
This wrapper can be used in the following way:
rule bcf_to_vcf:
    input:
        bcf="{prefix}.bcf"
    output:
        vcf="{prefix}.vcf"
    params:
        extra=""  # optional parameters for bcftools view (except -o)
    log:
        "logs/{prefix}.log"
    wrapper:
        "0.73.0/bio/bcftools/view"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools==1.11
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"bcftools view {extra} --threads {snakemake.threads} {snakemake.input} "
"-o {snakemake.output} {log}"
)
BEDTOOLS¶
For bedtools, the following wrappers are available:
BEDTOOLS COMPLEMENT¶
Bedtools complement maps all regions of the genome which are not covered by the input.
This wrapper can be used in the following way:
rule bedtools_complement_bed:
    input:
        in_file="a.bed",
        genome="dummy.genome"
    output:
        "results/bed-complement/a.complement.bed"
    params:
        ## Add optional parameters
        extra="-L"
    log:
        "logs/a.complement.bed.log"
    wrapper:
        "0.73.0/bio/bedtools/complement"

rule bedtools_complement_vcf:
    input:
        in_file="a.vcf",
        genome="dummy.genome"
    output:
        "results/vcf-complement/a.complement.vcf"
    params:
        ## Add optional parameters
        extra="-L"
    log:
        "logs/a.complement.vcf.log"
    wrapper:
        "0.73.0/bio/bedtools/complement"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bedtools=2.29
Input:
- BED/GFF/VCF files
- genome file (genome file format)
Output:
- complemented BED/GFF/VCF file
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(bedtools complement"
" {extra}"
" -i {snakemake.input.in_file}"
" -g {snakemake.input.genome}"
" > {snakemake.output[0]})"
" {log}"
)
COVERAGEBED¶
Returns the depth and breadth of coverage of features from B on the intervals in A.
This wrapper can be used in the following way:
rule coverageBed:
    input:
        a="bed/{sample}.bed",
        b="mapped/{sample}.bam"
    output:
        "stats/{sample}.cov"
    log:
        "logs/coveragebed/{sample}.log"
    params:
        extra=""  # optional parameters
    threads: 8
    wrapper:
        "0.73.0/bio/bedtools/coveragebed"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bedtools==2.29.0
- Patrik Smeds
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
input_a = snakemake.input.a
input_b = snakemake.input.b
output_file = snakemake.output[0]
if not isinstance(output_file, str) and len(snakemake.output) != 1:
raise ValueError("Output should be one file: " + str(output_file) + "!")
shell(
"coverageBed"
" -a {input_a}"
" -b {input_b}"
" {extra_params}"
" > {output_file}"
" {log}"
)
BEDTOOLS GENOMECOVERAGEBED¶
bedtools' genomeCoverageBed computes the coverage of a feature file as histograms, per-base reports or BEDGRAPH summaries across a given genome. For usage information about genomeCoverageBed, please see the bedtools documentation. For more information about bedtools, also see the source code.
This wrapper can be used in the following way:
rule genomecov_bam:
    input:
        "bam_input/{sample}.sorted.bam"
    output:
        "genomecov_bam/{sample}.genomecov"
    log:
        "logs/genomecov_bam/{sample}.log"
    params:
        "-bg"  # optional parameters
    wrapper:
        "0.73.0/bio/bedtools/genomecov"

rule genomecov_bed:
    input:
        # for genome file format please see:
        # https://bedtools.readthedocs.io/en/latest/content/general-usage.html#genome-file-format
        bed="bed_input/{sample}.sorted.bed",
        ref="bed_input/genome_file"
    output:
        "genomecov_bed/{sample}.genomecov"
    log:
        "logs/genomecov_bed/{sample}.log"
    params:
        "-bg"  # optional parameters
    wrapper:
        "0.73.0/bio/bedtools/genomecov"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bedtools==2.29.2
Input:
- BED/GFF/VCF files grouped by chromosome and genome file (genome file format) OR
- BAM files sorted by position.
Output:
- genomecov (.genomecov)
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
genome = ""
input_file = ""
if (os.path.splitext(snakemake.input[0])[-1]) == ".bam":
input_file = "-ibam " + snakemake.input[0]
if len(snakemake.input) > 1:
if (os.path.splitext(snakemake.input[0])[-1]) == ".bed":
input_file = "-i " + snakemake.input.get("bed")
genome = "-g " + snakemake.input.get("ref")
shell(
"(genomeCoverageBed"
" {snakemake.params}"
" {input_file}"
" {genome}"
" > {snakemake.output[0]}) {log}"
)
BEDTOOLS INTERSECT¶
Intersect BED/BAM/VCF files with bedtools.
This wrapper can be used in the following way:
rule bedtools_intersect:
    input:
        left="A.bed",
        right="B.bed"
    output:
        "A_B.intersected.bed"
    params:
        ## Add optional parameters
        extra="-wa -wb"  ## In this example, we want to write the original entries in A and B for each overlap.
    log:
        "logs/intersect/A_B.log"
    wrapper:
        "0.73.0/bio/bedtools/intersect"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bedtools=2.29.0
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
## Extract arguments
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(bedtools intersect"
" {extra}"
" -a {snakemake.input.left}"
" -b {snakemake.input.right}"
" > {snakemake.output})"
" {log}"
)
BEDTOOLS MERGE¶
Merge entries in one or multiple BED/BAM/VCF/GFF files with bedtools.
This wrapper can be used in the following way:
rule bedtools_merge:
    input:
        # Multiple bed-files can be added as list
        "A.bed"
    output:
        "A.merged.bed"
    params:
        ## Add optional parameters
        extra="-c 1 -o count"  ## In this example, we want to count how many input lines we merged per output line
    log:
        "logs/merge/A.log"
    wrapper:
        "0.73.0/bio/bedtools/merge"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bedtools=2.29.0
- Jan Forster
__author__ = "Jan Forster, Felix Mölder"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "j.forster@dkfz.de, felix.moelder@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
## Extract arguments
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
if len(snakemake.input) > 1:
if all(f.endswith(".gz") for f in snakemake.input):
cat = "zcat"
elif all(not f.endswith(".gz") for f in snakemake.input):
cat = "cat"
else:
raise ValueError("Input files must be all compressed or uncompressed.")
shell(
"({cat} {snakemake.input} | "
"sort -k1,1 -k2,2n | "
"bedtools merge {extra} "
"-i stdin > {snakemake.output}) "
" {log}"
)
else:
shell(
"( bedtools merge"
" {extra}"
" -i {snakemake.input}"
" > {snakemake.output})"
" {log}"
)
BEDTOOLS SLOP¶
Increase the size of each feature in a BED/BAM/VCF file by a specified amount.
This wrapper can be used in the following way:
rule bedtools_slop:
    input:
        "A.bed"
    output:
        "A.slop.bed"
    params:
        ## Genome file, a tab-separated file defining the length of every contig (see the sketch below)
        genome="genome.txt",
        ## Add optional parameters
        extra = "-b 10"  ## in this example, we want to increase the feature by 10 bases on both sides
    log:
        "logs/slop/A.log"
    wrapper:
        "0.73.0/bio/bedtools/slop"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bedtools=2.29.0
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
## Extract arguments
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(bedtools slop"
" {extra}"
" -i {snakemake.input[0]}"
" -g {snakemake.params.genome}"
" > {snakemake.output})"
" {log}"
)
BEDTOOLS SORT¶
Sorts BED, VCF or GFF files by chromosome and other criteria. For more information, please see the bedtools sort documentation.
This wrapper can be used in the following way:
rule bedtools_sort:
    input:
        in_file="a.bed"
    output:
        "results/bed-sorted/a.sorted.bed"
    params:
        ## Add optional parameters for sorting order
        extra="-sizeA"
    log:
        "logs/a.sorted.bed.log"
    wrapper:
        "0.73.0/bio/bedtools/sort"

rule bedtools_sort_bed:
    input:
        in_file="a.bed",
        # an optional sort file can be set as genome file by the variable genome or
        # as fasta index file by the variable faidx
        genome="dummy.genome"
    output:
        "results/bed-sorted/a.sorted_by_file.bed"
    params:
        ## Add optional parameters
        extra=""
    log:
        "logs/a.sorted.bed.log"
    wrapper:
        "0.73.0/bio/bedtools/sort"

rule bedtools_sort_vcf:
    input:
        in_file="a.vcf",
        # an optional sort file can be set either as genome file by the variable genome or
        # as fasta index file by the variable faidx
        faidx="genome.fasta.fai"
    output:
        "results/vcf-sorted/a.sorted_by_file.vcf"
    params:
        ## Add optional parameters
        extra=""
    log:
        "logs/a.sorted.vcf.log"
    wrapper:
        "0.73.0/bio/bedtools/sort"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bedtools=2.29
Input:
- BED/GFF/VCF files
- optionally, a tab-separated file that determines the sorting order and contains the chromosome names in the first column
- optionally, a fasta index file
Output:
- sorted BED/GFF/VCF file
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
genome = snakemake.input.get("genome", "")
faidx = snakemake.input.get("faidx", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
if genome:
extra += " -g {}".format(genome)
elif faidx:
extra += " -faidx {}".format(faidx)
shell(
"(bedtools sort"
" {extra}"
" -i {snakemake.input.in_file}"
" > {snakemake.output[0]})"
" {log}"
)
BENCHMARK¶
For benchmark, the following wrappers are available:
CHM-EVAL¶
Evaluate given VCF file with chm-eval (https://github.com/lh3/CHM-eval) for benchmarking variant calling.
This wrapper can be used in the following way:
rule chm_eval:
    input:
        kit="resources/chm-eval-kit",
        vcf="{sample}.vcf"
    output:
        summary="chm-eval/{sample}.summary",  # summary statistics
        bed="chm-eval/{sample}.err.bed.gz"  # bed file with errors
    params:
        extra="",
        build="38"
    log:
        "logs/chm-eval/{sample}.log"
    wrapper:
        "0.73.0/bio/benchmark/chm-eval"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
perl=5.26
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
kit = snakemake.input.kit
vcf = snakemake.input.vcf
build = snakemake.params.build
extra = snakemake.params.get("extra", "")
if not snakemake.output[0].endswith(".summary"):
raise ValueError("Output file must end with .summary")
out = snakemake.output[0][:-8]
shell("({kit}/run-eval -g {build} -o {out} {extra} {vcf} | sh) {log}")
CHM-EVAL-KIT¶
Download CHM-eval kit (https://github.com/lh3/CHM-eval) for benchmarking variant calling.
This wrapper can be used in the following way:
rule chm_eval_kit:
    output:
        directory("resources/chm-eval-kit")
    params:
        # Tag and version must match, see https://github.com/lh3/CHM-eval/releases.
        tag="v0.5",
        version="20180222"
    log:
        "logs/chm-eval-kit.log"
    cache: True
    wrapper:
        "0.73.0/bio/benchmark/chm-eval-kit"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
curl
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
url = (
"https://github.com/lh3/CHM-eval/releases/"
"download/{tag}/CHM-evalkit-{version}.tar"
).format(version=snakemake.params.version, tag=snakemake.params.tag)
os.makedirs(snakemake.output[0])
shell(
"(curl -L {url} | tar --strip-components 1 -C {snakemake.output[0]} -xf - &&"
"(cd {snakemake.output[0]}; chmod +x htsbox run-eval k8)) {log}"
)
CHM-EVAL-SAMPLE¶
Download CHM-eval sample (https://github.com/lh3/CHM-eval) for benchmarking variant calling.
This wrapper can be used in the following way:
rule chm_eval_sample:
    output:
        bam="resources/chm-eval-sample.bam",
        bai="resources/chm-eval-sample.bam.bai"
    params:
        # Optionally only grab the first 100 records.
        # This is for testing, remove next line to grab all records.
        first_n=100
    log:
        "logs/chm-eval-sample.log"
    wrapper:
        "0.73.0/bio/benchmark/chm-eval-sample"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
samtools=1.10
curl
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
url = "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR134/ERR1341796/CHM1_CHM13_2.bam"
pipefail = ""
fmt = "-b"
prefix = snakemake.params.get("first_n", "")
if prefix:
prefix = "| head -n {} | samtools view -h -b".format(prefix)
fmt = "-h"
pipefail = "set +o pipefail"
shell(
"""
{pipefail}
{{
samtools view {fmt} {url} {prefix} > {snakemake.output.bam}
samtools index {snakemake.output.bam}
}} {log}
"""
)
else:
shell(
"""
{{
curl -L {url} > {snakemake.output.bam}
samtools index {snakemake.output.bam}
}} {log}
"""
)
BISMARK¶
For bismark, the following wrappers are available:
BAM2NUC¶
Calculates mono- and di-nucleotide coverage of the reads and compares it with the average genomic sequence composition (see https://github.com/FelixKrueger/Bismark/blob/master/bam2nuc).
This wrapper can be used in the following way:
# Nucleotide stats for genome is required for further stats for BAM file
rule bam2nuc_for_genome:
    input:
        genome_fa="indexes/{genome}/{genome}.fa.gz"
    output:
        "indexes/{genome}/genomic_nucleotide_frequencies.txt"
    log:
        "logs/indexes/{genome}/genomic_nucleotide_frequencies.txt.log"
    wrapper:
        "0.73.0/bio/bismark/bam2nuc"

# Nucleotide stats for BAM file
rule bam2nuc_for_bam:
    input:
        genome_fa="indexes/{genome}/{genome}.fa.gz",
        bam="bams/{sample}_{genome}.bam"
    output:
        report="bams/{sample}_{genome}.nucleotide_stats.txt"
    log:
        "logs/{sample}_{genome}.nucleotide_stats.txt.log"
    wrapper:
        "0.73.0/bio/bismark/bam2nuc"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
Input:
genome_fa: Path to the genome in FastA format (e.g. *.fa, *.fasta, *.fa.gz, *.fasta.gz). All genome FastA files from its parent folder will be taken.
bam: Optional BAM or CRAM file (or multiple space-separated files). If the bam argument isn't provided, the option --genomic_composition_only will be used to generate the genomic composition table genomic_nucleotide_frequencies.txt.
Output:
- Genome nucleotide frequencies file genomic_nucleotide_frequencies.txt will be generated in the 'genome_fa' directory (optional output).
report: Report file (or space-separated files), with pattern '{bam_file_name}.nucleotide_stats.txt'.
- Roman Cherniatchik
"""Snakemake wrapper for bam2nuc tool that calculates mono- and di-nucleotide coverage of the reads and compares them with average genomic sequence
composition."""
# https://github.com/FelixKrueger/Bismark/blob/master/bam2nuc
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
cmdline_args = ["bam2nuc {extra}"]
genome_fa = snakemake.input.get("genome_fa", None)
if not genome_fa:
raise ValueError("bismark/bam2nuc: Error 'genome_fa' input not specified.")
genome_folder = os.path.dirname(genome_fa)
cmdline_args.append("--genome_folder {genome_folder:q}")
bam = snakemake.input.get("bam", None)
if bam:
cmdline_args.append("{bam}")
bams = bam if isinstance(bam, list) else [bam]
report = snakemake.output.get("report", None)
if not report:
raise ValueError("bismark/bam2nuc: Error 'report' output isn't specified.")
reports = report if isinstance(report, list) else [report]
if len(reports) != len(bams):
raise ValueError(
"bismark/bam2nuc: Error number of paths in output:report ({} files)"
" should be same as in input:bam ({} files).".format(
len(reports), len(bams)
)
)
output_dir = os.path.dirname(reports[0])
if any(output_dir != os.path.dirname(p) for p in reports):
raise ValueError(
"bismark/bam2nuc: Error all reports should be in same directory:"
" {}".format(output_dir)
)
if output_dir:
cmdline_args.append("--dir {output_dir:q}")
else:
cmdline_args.append("--genomic_composition_only")
# log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
cmdline_args.append("{log}")
# run
shell(" ".join(cmdline_args))
# Move outputs into proper position.
if bam:
log_append = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
expected_2_actual_paths = []
for bam_path, report_path in zip(bams, reports):
bam_name = os.path.basename(bam_path)
bam_basename = os.path.splitext(bam_name)[0]
expected_2_actual_paths.append(
(
report_path,
os.path.join(
output_dir, "{}.nucleotide_stats.txt".format(bam_basename)
),
)
)
for (exp_path, actual_path) in expected_2_actual_paths:
if exp_path and (exp_path != actual_path):
shell("mv {actual_path:q} {exp_path:q} {log_append}")
BISMARK¶
Align BS-Seq reads using Bismark (see https://github.com/FelixKrueger/Bismark/blob/master/bismark).
This wrapper can be used in the following way:
# Example: Paired-end reads
rule bismark_pe:
    input:
        fq_1="reads/{sample}.1.fastq",
        fq_2="reads/{sample}.2.fastq",
        genome="indexes/{genome}/{genome}.fa",
        bismark_indexes_dir="indexes/{genome}/Bisulfite_Genome",
        genomic_freq="indexes/{genome}/genomic_nucleotide_frequencies.txt"
    output:
        bam="bams/{sample}_{genome}_pe.bam",
        report="bams/{sample}_{genome}_PE_report.txt",
        nucleotide_stats="bams/{sample}_{genome}_pe.nucleotide_stats.txt",
        bam_unmapped_1="bams/{sample}_{genome}_unmapped_reads_1.fq.gz",
        bam_unmapped_2="bams/{sample}_{genome}_unmapped_reads_2.fq.gz",
        ambiguous_1="bams/{sample}_{genome}_ambiguous_reads_1.fq.gz",
        ambiguous_2="bams/{sample}_{genome}_ambiguous_reads_2.fq.gz"
    log:
        "logs/bams/{sample}_{genome}.log"
    params:
        # optional params string, e.g: -L32 -N0 -X400 --gzip
        # Useful options to tune:
        # (for bowtie2)
        # -N: The maximum number of mismatches permitted in the "seed", i.e. the first L base pairs
        #     of the read (default: 1)
        # -L: The "seed length" (default: 28)
        # -I: The minimum insert size for valid paired-end alignments. ~ min fragment size filter (for
        #     PE reads)
        # -X: The maximum insert size for valid paired-end alignments. ~ max fragment size filter (for
        #     PE reads)
        # --gzip: Gzip intermediate fastq files
        # --ambiguous --unmapped
        # -p: bowtie2 parallel execution
        # --multicore: bismark parallel execution
        # --temp_dir: tmp dir for intermediate files instead of output directory
        extra=' --ambiguous --unmapped --nucleotide_coverage',
        basename='{sample}_{genome}'
    wrapper:
        "0.73.0/bio/bismark/bismark"

# Example: Single-end reads
rule bismark_se:
    input:
        fq="reads/{sample}.fq.gz",
        genome="indexes/{genome}/{genome}.fa",
        bismark_indexes_dir="indexes/{genome}/Bisulfite_Genome",
        genomic_freq="indexes/{genome}/genomic_nucleotide_frequencies.txt"
    output:
        bam="bams/{sample}_{genome}.bam",
        report="bams/{sample}_{genome}_SE_report.txt",
        nucleotide_stats="bams/{sample}_{genome}.nucleotide_stats.txt",
        bam_unmapped="bams/{sample}_{genome}_unmapped_reads.fq.gz",
        ambiguous="bams/{sample}_{genome}_ambiguous_reads.fq.gz"
    log:
        "logs/bams/{sample}_{genome}.log",
    params:
        # optional params string
        extra=' --ambiguous --unmapped --nucleotide_coverage',
        basename='{sample}_{genome}'
    wrapper:
        "0.73.0/bio/bismark/bismark"
Note that input, output and log file paths can be chosen freely. When running with snakemake --use-conda, the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
Input:
- In SE mode, one reads file with key 'fq=…'
- In PE mode, two reads files with keys 'fq_1=…', 'fq_2=…'
bismark_indexes_dir: The path to the folder Bisulfite_Genome created by the Bismark_Genome_Preparation script, e.g. 'indexes/hg19/Bisulfite_Genome'
Output:
bam: Bam file. The output file will be renamed if it differs from the default NAME_pe.bam or NAME_se.bam
report: Alignment report file. The output file will be renamed if it differs from the default NAME_PE_report.txt or NAME_SE_report.txt
nucleotide_stats: Optional nucleotide report file. The output file will be renamed if it differs from the default NAME_pe.nucleotide_stats.txt or NAME_se.nucleotide_stats.txt
- Roman Cherniatchik
"""Snakemake wrapper for aligning methylation BS-Seq data using Bismark."""
# https://github.com/FelixKrueger/Bismark/blob/master/bismark
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from tempfile import TemporaryDirectory
def basename_without_ext(file_path):
"""Returns basename of file path, without the file extension."""
base = os.path.basename(file_path)
split_ind = 2 if base.endswith(".gz") else 1
base = ".".join(base.split(".")[:-split_ind])
return base
extra = snakemake.params.get("extra", "")
cmdline_args = ["bismark {extra} --bowtie2"]
outdir = os.path.dirname(snakemake.output.bam)
if outdir:
cmdline_args.append("--output_dir {outdir}")
genome_indexes_dir = os.path.dirname(snakemake.input.bismark_indexes_dir)
cmdline_args.append("{genome_indexes_dir}")
if not snakemake.output.get("bam", None):
raise ValueError("bismark/bismark: Error 'bam' output file isn't specified.")
if not snakemake.output.get("report", None):
raise ValueError("bismark/bismark: Error 'report' output file isn't specified.")
# basename
if snakemake.params.get("basename", None):
cmdline_args.append("--basename {snakemake.params.basename:q}")
basename = snakemake.params.basename
else:
basename = None
# reads input
single_end_mode = snakemake.input.get("fq", None)
if single_end_mode:
# for SE data, you only have to specify read1 input by -i or --in1, and
# specify read1 output by -o or --out1.
cmdline_args.append("--se {snakemake.input.fq:q}")
mode_prefix = "se"
if basename is None:
basename = basename_without_ext(snakemake.input.fq)
else:
# for PE data, you should also specify read2 input by -I or --in2, and
# specify read2 output by -O or --out2.
cmdline_args.append("-1 {snakemake.input.fq_1:q} -2 {snakemake.input.fq_2:q}")
mode_prefix = "pe"
if basename is None:
# default basename
basename = basename_without_ext(snakemake.input.fq_1) + "_bismark_bt2"
# log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
cmdline_args.append("{log}")
# run
shell(" ".join(cmdline_args))
# Move outputs into proper position.
expected_2_actual_paths = [
(
snakemake.output.bam,
os.path.join(
outdir, "{}{}.bam".format(basename, "" if single_end_mode else "_pe")
),
),
(
snakemake.output.report,
os.path.join(
outdir,
"{}_{}_report.txt".format(basename, "SE" if single_end_mode else "PE"),
),
),
(
snakemake.output.get("nucleotide_stats", None),
os.path.join(
outdir,
"{}{}.nucleotide_stats.txt".format(
basename, "" if single_end_mode else "_pe"
),
),
),
]
log_append = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
for (exp_path, actual_path) in expected_2_actual_paths:
if exp_path and (exp_path != actual_path):
shell("mv {actual_path:q} {exp_path:q} {log_append}")
BISMARK2BEDGRAPH¶
Generate bedGraph and coverage files from positional methylation files created by bismark_methylation_extractor (see https://github.com/FelixKrueger/Bismark/blob/master/bismark2bedGraph).
This wrapper can be used in the following way:
# Example for CHG+CHH summary coverage:
rule bismark2bedGraph_noncpg:
input:
"meth/CHG_context_{sample}.txt.gz",
"meth/CHH_context_{sample}.txt.gz"
output:
bedGraph="meth_non_cpg/{sample}_non_cpg.bedGraph.gz",
cov="meth_non_cpg/{sample}_non_cpg.bismark.cov.gz"
log:
"logs/meth_non_cpg/{sample}_non_cpg.log"
params:
extra="--CX"
wrapper:
"0.73.0/bio/bismark/bismark2bedGraph"
# Example for CpG only coverage
rule bismark2bedGraph_cpg:
input:
"meth/CpG_context_{sample}.txt.gz"
output:
bedGraph="meth_cpg/{sample}_CpG.bedGraph.gz",
cov="meth_cpg/{sample}_CpG.bismark.cov.gz"
log:
"logs/meth_cpg/{sample}_CpG.log"
wrapper:
"0.73.0/bio/bismark/bismark2bedGraph"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
Input:
- Files generated by bismark_methylation_extractor, e.g. CpG_context*.txt.gz, CHG_context*.txt.gz, CHH_context*.txt.gz. By default only the CpG file is required; if the ‘--CX’ option is used, the output is built by merging the input files.
Output:
bedGraph
: Bismark methylation level track, *.bedGraph.gz (0-based start, 1-based end coordinates, i.e. end offset exclusive)
cov
: Optional bismark coverage file *.bismark.cov.gz; its file name is derived from the bedGraph name (1-based start and end, i.e. end offset inclusive). See the illustration below.
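For illustration (the values are hypothetical), the same cytosine at 1-based position 100, covered by 6 methylated and 1 unmethylated calls, would appear in the two formats as:

# bedGraph line (0-based start, end exclusive): chrom, start, end, % methylation
chr1    99    100    85.7
# coverage line (1-based start and end): chrom, start, end, % methylation, count methylated, count unmethylated
chr1    100    100    85.7    6    1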
- Roman Chernyatchik
"""Snakemake wrapper for Bismark bismark2bedGraph tool."""
# https://github.com/FelixKrueger/Bismark/blob/master/bismark2bedGraph
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
bedGraph = snakemake.output.get("bedGraph", "")
if not bedGraph:
raise ValueError("bismark/bismark2bedGraph: Please specify bedGraph output path")
params_extra = snakemake.params.get("extra", "")
cmdline_args = ["bismark2bedGraph {params_extra}"]
dir_name = os.path.dirname(bedGraph)
if dir_name:
cmdline_args.append("--dir {dir_name}")
fname = os.path.basename(bedGraph)
cmdline_args.append("--output {fname}")
cmdline_args.append("{snakemake.input}")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
cmdline_args.append("{log}")
# run
shell(" ".join(cmdline_args))
BISMARK2REPORT¶
Generate graphical HTML report from Bismark reports (see https://github.com/FelixKrueger/Bismark/blob/master/bismark2report).
This wrapper can be used in the following way:
# Example: Paired-end reads
rule bismark2report_pe:
input:
alignment_report="bams/{sample}_{genome}_PE_report.txt",
nucleotide_report="bams/{sample}_{genome}_pe.nucleotide_stats.txt",
dedup_report="bams/{sample}_{genome}_pe.deduplication_report.txt",
mbias_report="meth/{sample}_{genome}_pe.deduplicated.M-bias.txt",
splitting_report="meth/{sample}_{genome}_pe.deduplicated_splitting_report.txt"
output:
html="qc/meth/{sample}_{genome}.bismark2report.html",
log:
"logs/qc/meth/{sample}_{genome}.bismark2report.html.log",
params:
skip_optional_reports=True
wrapper:
"0.73.0/bio/bismark/bismark2report"
# Example: Single-end reads
rule bismark2report_se:
input:
alignment_report="bams/{sample}_{genome}_SE_report.txt",
nucleotide_report="bams/{sample}_{genome}.nucleotide_stats.txt",
dedup_report="bams/{sample}_{genome}.deduplication_report.txt",
mbias_report="meth/{sample}_{genome}.deduplicated.M-bias.txt",
splitting_report="meth/{sample}_{genome}.deduplicated_splitting_report.txt"
output:
html="qc/meth/{sample}_{genome}.bismark2report.html",
log:
"logs/qc/meth/{sample}_{genome}.bismark2report.html.log",
params:
skip_optional_reports=True
wrapper:
"0.73.0/bio/bismark/bismark2report"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
Input:
alignment_report
: Alignment report (if not specified, bismark will try to find it in the current directory)
nucleotide_report
: Optional Bismark nucleotide coverage report (if not specified, bismark will try to find it in the current directory)
dedup_report
: Optional deduplication report (if not specified, bismark will try to find it in the current directory)
splitting_report
: Optional Bismark methylation extractor splitting report (if not specified, bismark will try to find it in the current directory)
mbias_report
: Optional Bismark methylation extractor M-bias report (if not specified, bismark will try to find it in the current directory)
Output:
html
: Output HTML file path, if batch mode isn’t used
html_dir
: Output dir path for the HTML reports, if batch mode is used
- Roman Chernyatchik
"""Snakemake wrapper to generate graphical HTML report from Bismark reports."""
# https://github.com/FelixKrueger/Bismark/blob/master/bismark2report
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
def answer2bool(v):
return str(v).lower() in ("yes", "true", "t", "1")
extra = snakemake.params.get("extra", "")
cmds = ["bismark2report {extra}"]
# output
html_file = snakemake.output.get("html", "")
output_dir = snakemake.output.get("html_dir", None)
if output_dir is None:
if html_file:
output_dir = os.path.dirname(html_file)
else:
if html_file:
raise ValueError(
"bismark/bismark2report: Choose one: 'html=...' for a single dir or 'html_dir=...' for batch processing."
)
if output_dir is None:
raise ValueError(
"bismark/bismark2report: Output file or directory not specified. "
"Use 'html=...' for a single dir or 'html_dir=...' for batch "
"processing."
)
if output_dir:
cmds.append("--dir {output_dir:q}")
if html_file:
html_file_name = os.path.basename(html_file)
cmds.append("--output {html_file_name:q}")
# reports
reports = [
"alignment_report",
"dedup_report",
"splitting_report",
"mbias_report",
"nucleotide_report",
]
skip_optional_reports = answer2bool(
snakemake.params.get("skip_optional_reports", False)
)
for report_name in reports:
path = snakemake.input.get(report_name, "")
if path:
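        # At the top level of a wrapper script, locals() is the module
        # namespace, so this defines a variable that shell() can later
        # resolve via the "{<report_name>:q}" placeholder built below.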
locals()[report_name] = path
cmds.append("--{0} {{{1}:q}}".format(report_name, report_name))
elif skip_optional_reports:
cmds.append("--{0} 'none'".format(report_name))
# log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
cmds.append("{log}")
# run shell command:
shell(" ".join(cmds))
BISMARK2SUMMARY¶
Generate a summary graphical HTML report from several Bismark text report files (see https://github.com/FelixKrueger/Bismark/blob/master/bismark2summary).
This wrapper can be used in the following way:
import os
rule bismark2summary:
input:
bam=["bams/a_genome_pe.bam", "bams/b_genome.bam"],
        # Bismark `bismark2summary` discovers reports automatically based
        # on the files available in the folder containing the BAM files.
        #
        # If your per-BAM-file reports aren't in the same folder,
        # you will need an additional rule which symlinks all reports
        # (e.g. your splitting report generated by the `bismark_methylation_extractor`
        # tool is in the `meth` folder, while the alignment-related reports are in the `bams` folder).
        # The dependencies below are here just to ensure that the corresponding rules
        # have already finished at rule execution time, otherwise some reports
        # would be missing.
dependencies=[
"bams/a_genome_PE_report.txt",
"bams/a_genome_pe.deduplication_report.txt",
# for example splitting report is missing for 'a' sample
"bams/b_genome_SE_report.txt",
"bams/b_genome.deduplication_report.txt",
"bams/b_genome.deduplicated_splitting_report.txt"
]
output:
html="qc/{experiment}.bismark2summary.html",
txt="qc/{experiment}.bismark2summary.txt"
log:
"logs/qc/{experiment}.bismark2summary.log"
wrapper:
"0.73.0/bio/bismark/bismark2summary"
rule bismark2summary_prepare_symlinks:
input:
"meth/b_genome.deduplicated_splitting_report.txt",
output:
temp("bams/b_genome.deduplicated_splitting_report.txt"),
log:
"qc/bismark2summary_prepare_symlinks.symlinks.log"
run:
wd = os.getcwd()
shell("echo 'Making symlinks' > {log}")
for source, target in zip(input, output):
target_dir = os.path.dirname(target)
target_name = os.path.basename(target)
log_path = os.path.join(wd, log[0])
abs_src_path = os.path.abspath(source)
shell("cd {target_dir} && ln -f -s {abs_src_path} {target_name} >> {log_path} 2>&1")
shell("echo 'Done' >> {log}")
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
Input:
bam
: One or several (space-separated) BAM file paths (aligned BAM files with the Bismark reports in the same folder). It is also recommended to add dependencies on all required reports, either via rule order or by specifying them in the input section under any other keys. E.g. the deduplication report could be missing if the rule only depends on the aligned BAM file. If you instead add a dependency on the deduplicated BAM file, bismark2summary will fail, because it expects its input files to be the initially aligned files with the alignment report in the same directory.
Output:
html
: Output HTML report path (e.g. ‘bismark_summary_report.html’)
txt
: Output txt table path (e.g. ‘bismark_summary_report.txt’). It should be the same as the ‘html’ report path, but with the suffix ‘.txt’.
- Roman Chernyatchik
"""Snakemake wrapper to generate summary graphical HTML report from several Bismark text report files."""
# https://github.com/FelixKrueger/Bismark/blob/master/bismark2summary
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
cmds = ["bismark2summary {extra}"]
# basename
bam = snakemake.input.get("bam", None)
if not bam:
raise ValueError(
"bismark/bismark2summary: Please specify aligned BAM file path"
" (one or several) using 'bam=..'"
)
html = snakemake.output.get("html", None)
txt = snakemake.output.get("txt", None)
if not html or not txt:
raise ValueError(
"bismark/bismark2summary: Please specify both 'html=..' and"
" 'txt=..' paths in output section"
)
basename, ext = os.path.splitext(html)
if ext.lower() != ".html":
raise ValueError(
"bismark/bismark2summary: HTML report file should end"
" with suffix '.html' but was {} ({})".format(ext, html)
)
suggested_txt = basename + ".txt"
if suggested_txt != txt:
raise ValueError(
"bismark/bismark2summary: Expected '{}' TXT report, "
"but was: '{}'".format(suggested_txt, txt)
)
cmds.append("--basename {basename:q}")
# title
title = snakemake.params.get("title", None)
if title:
cmds.append("--title {title:q}")
cmds.append("{bam}")
# log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
cmds.append("{log}")
# run shell command:
shell(" ".join(cmds))
BISMARK_GENOME_PREPARATION¶
Generate indexes for Bismark (see https://github.com/FelixKrueger/Bismark/blob/master/bismark_genome_preparation).
This wrapper can be used in the following way:
# For *.fa file
rule bismark_genome_preparation_fa:
input:
"indexes/{genome}/{genome}.fa"
output:
directory("indexes/{genome}/Bisulfite_Genome")
log:
"logs/indexes/{genome}/Bisulfite_Genome.log"
params:
"" # optional params string
wrapper:
"0.73.0/bio/bismark/bismark_genome_preparation"
# For *.fa.gz file:
rule bismark_genome_preparation_fa_gz:
input:
"indexes/{genome}/{genome}.fa.gz"
output:
directory("indexes/{genome}/Bisulfite_Genome")
log:
"logs/indexes/{genome}/Bisulfite_Genome.log"
params:
"" # optional params string
wrapper:
"0.73.0/bio/bismark/bismark_genome_preparation"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
Input:
- path to genome *.fa (or *.fasta, *.fa.gz, *.fasta.gz) file
Output:
- No dedicated output keys; the Bismark indexes are generated in the parent directory of the input file
- Roman Chernyatchik
"""Snakemake wrapper for Bismark indexes preparing using bismark_genome_preparation."""
# https://github.com/FelixKrueger/Bismark/blob/master/bismark_genome_preparation
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
input_dir = path.dirname(snakemake.input[0])
params_extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell("bismark_genome_preparation --verbose --bowtie2 {params_extra} {input_dir} {log}")
BISMARK_METHYLATION_EXTRACTOR¶
Call methylation counts from Bismark alignment results (see https://github.com/FelixKrueger/Bismark/blob/master/bismark_methylation_extractor).
This wrapper can be used in the following way:
rule bismark_methylation_extractor:
input: "bams/{sample}.bam"
output:
mbias_r1="qc/meth/{sample}.M-bias_R1.png",
# Only for PE BAMS:
# mbias_r2="qc/meth/{sample}.M-bias_R2.png",
mbias_report="meth/{sample}.M-bias.txt",
splitting_report="meth/{sample}_splitting_report.txt",
# 1-based start, 1-based end ('inclusive') methylation info: % and counts
methylome_CpG_cov="meth_cpg/{sample}.bismark.cov.gz",
# BedGraph with methylation percentage: 0-based start, end exclusive
methylome_CpG_mlevel_bedGraph="meth_cpg/{sample}.bedGraph.gz",
# Primary output files: methylation status at each read cytosine position: (extremely large)
read_base_meth_state_cpg="meth/CpG_context_{sample}.txt.gz",
# * You could merge CHG, CHH using: --merge_non_CpG
read_base_meth_state_chg="meth/CHG_context_{sample}.txt.gz",
read_base_meth_state_chh="meth/CHH_context_{sample}.txt.gz"
log:
"logs/meth/{sample}.log"
params:
output_dir="meth", # optional output dir
extra="--gzip --comprehensive --bedGraph" # optional params string
wrapper:
"0.73.0/bio/bismark/bismark_methylation_extractor"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
perl-gdgraph==1.54
Input:
- Input BAM file aligned by Bismark
Output:
- Which files are produced depends on the bismark options passed via params.extra; all keys are optional for this wrapper
mbias_report
: M-bias report, *.M-bias.txt (if the key is provided, the output file will be renamed to this name)
mbias_r1
: M-bias plot for R1, *.M-bias_R1.png (if the key is provided, the output file will be renamed to this name)
mbias_r2
: M-bias plot for R2, *.M-bias_R2.png (if the key is provided, the output file will be renamed to this name)
splitting_report
: Splitting report, *_splitting_report.txt (if the key is provided, the output file will be renamed to this name)
methylome_CpG_cov
: Bismark coverage file for CpG context, *.bismark.cov.gz (if the key is provided, the output file will be renamed to this name)
methylome_CpG_mlevel_bedGraph
: Bismark methylation level track, *.bedGraph.gz
read_base_meth_state_cpg
: Per-read CpG base methylation info, CpG_context_*.txt.gz (if the key is provided, the output file will be renamed to this name)
read_base_meth_state_chg
: Per-read CHG base methylation info, CHG_context_*.txt.gz (if the key is provided, the output file will be renamed to this name)
read_base_meth_state_chh
: Per-read CHH base methylation info, CHH_context_*.txt.gz (if the key is provided, the output file will be renamed to this name)
- Roman Chernyatchik
"""Snakemake wrapper for Bismark methylation extractor tool: bismark_methylation_extractor."""
# https://github.com/FelixKrueger/Bismark/blob/master/bismark_methylation_extractor
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
params_extra = snakemake.params.get("extra", "")
cmdline_args = ["bismark_methylation_extractor {params_extra}"]
# output dir
output_dir = snakemake.params.get("output_dir", "")
if output_dir:
cmdline_args.append("-o {output_dir:q}")
# trimming options
trimming_options = [
"ignore", # meth_bias_r1_5end
"ignore_3prime", # meth_bias_r1_3end
"ignore_r2", # meth_bias_r2_5end
"ignore_3prime_r2", # meth_bias_r2_3end
]
for key in trimming_options:
value = snakemake.params.get(key, None)
if value:
cmdline_args.append("--{} {}".format(key, value))
# Input
cmdline_args.append("{snakemake.input}")
# log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
cmdline_args.append("{log}")
# run
shell(" ".join(cmdline_args))
key2prefix_suffix = [
("mbias_report", ("", ".M-bias.txt")),
("mbias_r1", ("", ".M-bias_R1.png")),
("mbias_r2", ("", ".M-bias_R2.png")),
("splitting_report", ("", "_splitting_report.txt")),
("methylome_CpG_cov", ("", ".bismark.cov.gz")),
("methylome_CpG_mlevel_bedGraph", ("", ".bedGraph.gz")),
("read_base_meth_state_cpg", ("CpG_context_", ".txt.gz")),
("read_base_meth_state_chg", ("CHG_context_", ".txt.gz")),
("read_base_meth_state_chh", ("CHH_context_", ".txt.gz")),
]
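# For a (hypothetical) input "bams/s1.bam" with output_dir "meth", the default
# path for e.g. the "methylome_CpG_cov" key is "meth/s1.bismark.cov.gz"; each
# requested output that differs from its default path is moved below.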
log_append = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
for (key, (prefix, suffix)) in key2prefix_suffix:
exp_path = snakemake.output.get(key, None)
if exp_path:
if len(snakemake.input) != 1:
raise ValueError(
"bismark/bismark_methylation_extractor: Error: only one BAM file is"
" expected in input, but was <{}>".format(snakemake.input)
)
bam_file = snakemake.input[0]
bam_name = os.path.basename(bam_file)
bam_wo_ext = os.path.splitext(bam_name)[0]
actual_path = os.path.join(output_dir, prefix + bam_wo_ext + suffix)
if exp_path != actual_path:
shell("mv {actual_path:q} {exp_path:q} {log_append}")
DEDUPLICATE_BISMARK¶
Deduplicates Bismark BAM files and saves the result as a *.bam file (see https://github.com/FelixKrueger/Bismark/blob/master/deduplicate_bismark).
This wrapper can be used in the following way:
rule deduplicate_bismark:
input: "bams/a_genome_pe.bam"
output:
bam="bams/{sample}.deduplicated.bam",
report="bams/{sample}.deduplication_report.txt",
log:
"logs/bams/{sample}.deduplicated.log",
params:
extra="" # optional params string
wrapper:
"0.73.0/bio/bismark/deduplicate_bismark"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.3.4.3
bismark==0.22.1
samtools==1.9
Input:
- path to one or multiple *.bam files aligned by Bismark; if multiple files are passed, the ‘--multiple’ argument will be added automatically.
Output:
bam
: Resulting BAM file path. The file will be renamed if it differs from the default NAME.deduplicated.bam for a given ‘NAME.bam’ input
report
: Resulting report path. The file will be renamed if it differs from the default NAME.deduplication_report.txt for a given ‘NAME.bam’ input (see the illustration below)
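For illustration, with a single (hypothetical) input ‘bams/s1.bam’, deduplicate_bismark writes:

bams/s1.deduplicated.bam            # moved to the ‘bam’ output path if it differs
bams/s1.deduplication_report.txt    # moved to the ‘report’ output path if it differs

With several input BAMs (i.e. with ‘--multiple’), the default names gain a ‘multiple’ infix instead, e.g. ‘s1.multiple.deduplicated.bam’ and ‘s1.multiple.deduplication_report.txt’, as handled at the end of the wrapper code below.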
- Roman Chernyatchik
"""Snakemake wrapper for Bismark aligned reads deduplication using deduplicate_bismark."""
# https://github.com/FelixKrueger/Bismark/blob/master/deduplicate_bismark
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
bam_path = snakemake.output.get("bam", None)
report_path = snakemake.output.get("report", None)
if not bam_path or not report_path:
raise ValueError(
"bismark/deduplicate_bismark: Please specify both 'bam=..' and 'report=..' paths in output section"
)
output_dir = os.path.dirname(bam_path)
if output_dir != os.path.dirname(report_path):
raise ValueError(
"bismark/deduplicate_bismark: BAM and Report files expected to have the same parent directory"
" but was {} and {}".format(bam_path, report_path)
)
arg_output_dir = "--output_dir '{}'".format(output_dir) if output_dir else ""
arg_multiple = "--multiple" if len(snakemake.input) > 1 else ""
params_extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
log_append = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
shell(
"deduplicate_bismark {params_extra} --bam {arg_multiple}"
" {arg_output_dir} {snakemake.input} {log}"
)
# Move outputs into proper position.
fst_input_filename = os.path.basename(snakemake.input[0])
fst_input_basename = os.path.splitext(fst_input_filename)[0]
prefix = os.path.join(output_dir, fst_input_basename)
deduplicated_bam_actual_name = prefix + ".deduplicated.bam"
if arg_multiple:
# bismark does it exactly like this:
deduplicated_bam_actual_name = deduplicated_bam_actual_name.replace(
"deduplicated", "multiple.deduplicated", 1
)
expected_2_actual_paths = [
(bam_path, deduplicated_bam_actual_name),
(
report_path,
prefix + (".multiple" if arg_multiple else "") + ".deduplication_report.txt",
),
]
for (exp_path, actual_path) in expected_2_actual_paths:
if exp_path and (exp_path != actual_path):
shell("mv {actual_path:q} {exp_path:q} {log_append}")
BOWTIE2¶
For bowtie2, the following wrappers are available:
BOWTIE2¶
Map reads with bowtie2.
This wrapper can be used in the following way:
rule bowtie2:
input:
sample=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
"mapped/{sample}.bam"
log:
"logs/bowtie2/{sample}.log"
params:
index="index/genome", # prefix of reference genome index (built with bowtie2-build)
extra="" # optional parameters
threads: 8 # Use at least two threads
wrapper:
"0.73.0/bio/bowtie2/align"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.4.1
samtools==1.10
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
n = len(snakemake.input.sample)
assert (
n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."
if n == 1:
reads = "-U {}".format(*snakemake.input.sample)
else:
reads = "-1 {} -2 {}".format(*snakemake.input.sample)
shell(
"(bowtie2 --threads {snakemake.threads} {extra} "
"-x {snakemake.params.index} {reads} "
"| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)
BOWTIE2_BUILD¶
Build a bowtie2 index.
This wrapper can be used in the following way:
rule bowtie2_build:
input:
reference="genome.fasta"
output:
multiext(
"genome",
".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2",
),
log:
"logs/bowtie2_build/build.log"
params:
extra="" # optional parameters
threads: 8
wrapper:
"0.73.0/bio/bowtie2/build"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bowtie2==2.4.1
samtools==1.10
- Daniel Standage
__author__ = "Daniel Standage"
__copyright__ = "Copyright 2020, Daniel Standage"
__email__ = "daniel.standage@nbacc.dhs.gov"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
indexbase = snakemake.output[0].replace(".1.bt2", "")
shell(
    "bowtie2-build --threads {snakemake.threads} {extra} "
    "{snakemake.input.reference} {indexbase} {log}"
)
BUSCO¶
Assess assembly and annotation completeness with BUSCO
Example¶
This wrapper can be used in the following way:
rule run_busco:
input:
"sample_data/target.fa"
output:
"txome_busco/full_table_txome_busco.tsv",
log:
"logs/quality/transcriptome_busco.log"
threads: 8
params:
mode="transcriptome",
lineage_path="sample_data/example",
# optional parameters
extra=""
wrapper:
"0.73.0/bio/busco"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
python==3.6
busco==5.0.0
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for BUSCO assessment"""
__author__ = "Tessa Pierce"
__copyright__ = "Copyright 2018, Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
mode = snakemake.params.get("mode")
assert mode is not None, "please input a run mode: genome, transcriptome or proteins"
lineage = snakemake.params.get("lineage_path")
assert lineage is not None, "please input the path to a lineage for busco assessment"
# busco does not allow you to direct output location: handle this by moving output
outdir = path.dirname(snakemake.output[0])
if "/" in outdir:
out_name = path.basename(outdir)
else:
out_name = outdir
# note: --force allows snakemake to handle rewriting files as necessary
# without needing to specify *all* busco outputs as snakemake outputs
shell(
"busco --in {snakemake.input} --out {out_name} --force "
" --cpu {snakemake.threads} --mode {mode} --lineage {lineage} "
" {extra} {log}"
)
busco_outname = "run_" + out_name
# move to intended location
shell("cp -r {busco_outname}/* {outdir}")
shell("rm -rf {busco_outname}")
BWA¶
For bwa, the following wrappers are available:
BWA ALN¶
Map reads with bwa aln. For more information about BWA see BWA documentation.
This wrapper can be used in the following way:
rule bwa_aln:
input:
"reads/{sample}.{pair}.fastq"
output:
"sai/{sample}.{pair}.sai"
params:
index="genome",
extra=""
log:
"logs/bwa_aln/{sample}.{pair}.log"
threads: 8
wrapper:
"0.73.0/bio/bwa/aln"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa==0.7.17
- Julian de Ruiter
"""Snakemake wrapper for bwa aln."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"bwa aln"
" {extra}"
" -t {snakemake.threads}"
" {snakemake.params.index}"
" {snakemake.input[0]}"
" > {snakemake.output[0]} {log}"
)
BWA INDEX¶
Creates a BWA index. For more information about BWA see BWA documentation.
This wrapper can be used in the following way:
rule bwa_index:
input:
"{genome}.fasta"
output:
"{genome}.amb",
"{genome}.ann",
"{genome}.bwt",
"{genome}.pac",
"{genome}.sa"
log:
"logs/bwa_index/{genome}.log"
params:
prefix="{genome}",
algorithm="bwtsw"
wrapper:
"0.73.0/bio/bwa/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa==0.7.17
- Patrik Smeds
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Check inputs/arguments.
if len(snakemake.input) == 0:
raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
raise ValueError("Only one reference genome can be inputed!")
# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")
if len(prefix) > 0:
prefix = "-p " + prefix
# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")
if len(construction_algorithm) != 0:
construction_algorithm = "-a " + construction_algorithm
shell(
"bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
BWA MEM¶
Map reads using bwa mem, with optional sorting using samtools or picard. For more information about BWA see BWA documentation.
This wrapper can be used in the following way:
rule bwa_mem:
input:
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
"mapped/{sample}.bam"
log:
"logs/bwa_mem/{sample}.log"
params:
index="genome",
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sort="none", # Can be 'none', 'samtools' or 'picard'.
sort_order="queryname", # Can be 'queryname' or 'coordinate'.
sort_extra="" # Extra args for samtools/picard.
threads: 8
wrapper:
"0.73.0/bio/bwa/mem"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa==0.7.17
samtools==1.9
picard==2.20.1
- Johannes Köster
- Julian de Ruiter
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Extract arguments.
extra = snakemake.params.get("extra", "")
sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
1,
2,
}:
raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements")
if sort_order not in {"coordinate", "queryname"}:
raise ValueError("Unexpected value for sort_order ({})".format(sort_order))
# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
# Simply convert to bam using samtools view.
pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"
elif sort == "samtools":
# Sort alignments using samtools sort.
pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"
# Add name flag if needed.
if sort_order == "queryname":
sort_extra += " -n"
prefix = path.splitext(snakemake.output[0])[0]
sort_extra += " -T " + prefix + ".tmp"
elif sort == "picard":
# Sort alignments using picard SortSam.
pipe_cmd = (
"picard SortSam {sort_extra} INPUT=/dev/stdin"
" OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
)
else:
raise ValueError("Unexpected value for params.sort ({})".format(sort))
shell(
"(bwa mem"
" -t {snakemake.threads}"
" {extra}"
" {snakemake.params.index}"
" {snakemake.input.reads}"
" | " + pipe_cmd + ") {log}"
)
BWA MEM SAMBLASTER¶
Map reads using bwa mem, mark duplicates with samblaster, and sort and index with sambamba.
This wrapper can be used in the following way:
rule bwa_mem:
input:
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
bam="mapped/{sample}.bam",
index="mapped/{sample}.bam.bai"
log:
"logs/bwa_mem_sambamba/{sample}.log"
params:
index="genome",
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sort_extra="" # Extra args for sambamba.
threads: 8
wrapper:
"0.73.0/bio/bwa/mem-samblaster"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa==0.7.17
sambamba==0.7.1
samblaster==0.1.24
- Christopher Schröder
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroeder@tu-dortmund.de"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Extract arguments.
extra = snakemake.params.get("extra", "")
sort_extra = snakemake.params.get("sort_extra", "")
samblaster_extra = snakemake.params.get("samblaster_extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
1,
2,
}:
raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements")
shell(
"(bwa mem"
" -t {snakemake.threads}"
" {extra}"
" {snakemake.params.index}"
" {snakemake.input.reads}"
" | samblaster"
" {samblaster_extra}"
" | sambamba view -S -f bam /dev/stdin"
" -t {snakemake.threads}"
" | sambamba sort /dev/stdin"
" -t {snakemake.threads}"
" -o {snakemake.output.bam}"
" {sort_extra}"
") {log}"
)
BWA SAMPE¶
Map paired-end reads with bwa sampe. For more information about BWA see BWA documentation.
This wrapper can be used in the following way:
rule bwa_sampe:
input:
fastq=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"],
sai=["sai/{sample}.1.sai", "sai/{sample}.2.sai"]
output:
"mapped/{sample}.bam"
params:
index="genome",
extra=r"-r '@RG\tID:{sample}\tSM:{sample}'", # optional: Extra parameters for bwa.
sort="none", # optional: Enable sorting. Possible values: 'none', 'samtools' or 'picard'`
sort_order="queryname", # optional: Sort by 'queryname' or 'coordinate'
sort_extra="" # optional: extra arguments for samtools/picard
log:
"logs/bwa_sampe/{sample}.log"
wrapper:
"0.73.0/bio/bwa/sampe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa==0.7.17
samtools==1.9
picard==2.20.1
- Julian de Ruiter
"""Snakemake wrapper for bwa sampe."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Check inputs.
if not len(snakemake.input.sai) == 2:
raise ValueError("input.sai must have 2 elements")
if not len(snakemake.input.fastq) == 2:
raise ValueError("input.fastq must have 2 elements")
# Extract arguments.
extra = snakemake.params.get("extra", "")
sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
# Simply convert to bam using samtools view.
pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"
elif sort == "samtools":
# Sort alignments using samtools sort.
pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"
# Add name flag if needed.
if sort_order == "queryname":
sort_extra += " -n"
# Use prefix for temp.
prefix = path.splitext(snakemake.output[0])[0]
sort_extra += " -T " + prefix + ".tmp"
elif sort == "picard":
# Sort alignments using picard SortSam.
pipe_cmd = (
"picard SortSam {sort_extra} INPUT=/dev/stdin"
" OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
)
else:
raise ValueError("Unexpected value for params.sort ({})".format(sort))
# Run command.
shell(
"(bwa sampe"
" {extra}"
" {snakemake.params.index}"
" {snakemake.input.sai}"
" {snakemake.input.fastq}"
" | " + pipe_cmd + ") {log}"
)
BWA SAMSE¶
Map single-end reads with bwa samse. For more information about BWA see BWA documentation.
This wrapper can be used in the following way:
rule bwa_samse:
input:
fastq="reads/{sample}.1.fastq",
sai="sai/{sample}.1.sai"
output:
"mapped/{sample}.bam"
params:
index="genome",
extra=r"-r '@RG\tID:{sample}\tSM:{sample}'", # optional: Extra parameters for bwa.
sort="none", # optional: Enable sorting. Possible values: 'none', 'samtools' or 'picard'`
sort_order="queryname", # optional: Sort by 'queryname' or 'coordinate'
sort_extra="" # optional: extra arguments for samtools/picard
log:
"logs/bwa_samse/{sample}.log"
wrapper:
"0.73.0/bio/bwa/samse"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa==0.7.17
samtools==1.9
picard==2.20.1
- Julian de Ruiter
"""Snakemake wrapper for bwa sampe."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Extract arguments.
extra = snakemake.params.get("extra", "")
sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
# Simply convert to bam using samtools view.
pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"
elif sort == "samtools":
# Sort alignments using samtools sort.
pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"
# Add name flag if needed.
if sort_order == "queryname":
sort_extra += " -n"
# Use prefix for temp.
prefix = path.splitext(snakemake.output[0])[0]
sort_extra += " -T " + prefix + ".tmp"
elif sort == "picard":
# Sort alignments using picard SortSam.
pipe_cmd = (
"picard SortSam {sort_extra} INPUT=/dev/stdin"
" OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
)
else:
raise ValueError("Unexpected value for params.sort ({})".format(sort))
# Run command.
shell(
"(bwa samse"
" {extra}"
" {snakemake.params.index}"
" {snakemake.input.sai}"
" {snakemake.input.fastq}"
" | " + pipe_cmd + ") {log}"
)
BWA SAM(SE/PE)¶
Map single-end or paired-end reads with either bwa samse or bwa sampe. For more information about BWA see BWA documentation.
This wrapper can be used in the following way:
rule bwa_sam_pe:
input:
fastq=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"],
sai=["sai/{sample}.1.sai", "sai/{sample}.2.sai"]
output:
"mapped/{sample}.pe.sam"
params:
index="genome",
extra=r"-r '@RG\tID:{sample}\tSM:{sample}'", # optional: Extra parameters for bwa.
sort="none",
log:
"logs/bwa_sam_pe/{sample}.log"
wrapper:
"0.73.0/bio/bwa/samxe"
rule bwa_sam_se:
input:
fastq="reads/{sample}.1.fastq",
sai="sai/{sample}.1.sai"
output:
"mapped/{sample}.se.sam"
params:
index="genome",
extra=r"-r '@RG\tID:{sample}\tSM:{sample}'", # optional: Extra parameters for bwa.
sort="none",
log:
"logs/bwa_sam_se/{sample}.log"
wrapper:
"0.73.0/bio/bwa/samxe"
rule bwa_bam_pe:
input:
fastq=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"],
sai=["sai/{sample}.1.sai", "sai/{sample}.2.sai"]
output:
"mapped/{sample}.pe.bam"
params:
index="genome",
extra=r"-r '@RG\tID:{sample}\tSM:{sample}'", # optional: Extra parameters for bwa.
sort="none",
log:
"logs/bwa_bam_pe/{sample}.log"
wrapper:
"0.73.0/bio/bwa/samxe"
rule bwa_bam_se:
input:
fastq="reads/{sample}.1.fastq",
sai="sai/{sample}.1.sai"
output:
"mapped/{sample}.se.bam"
params:
index="genome",
extra=r"-r '@RG\tID:{sample}\tSM:{sample}'", # optional: Extra parameters for bwa.
sort="none",
log:
"logs/bwa_bam_se/{sample}.log"
wrapper:
"0.73.0/bio/bwa/samxe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa==0.7.17
samtools==1.9
picard==2.20.1
- Filipe G. Vieira
"""Snakemake wrapper for both bwa samse and sampe."""
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Check inputs.
fastq = (
snakemake.input.fastq
if isinstance(snakemake.input.fastq, list)
else [snakemake.input.fastq]
)
sai = (
snakemake.input.sai
if isinstance(snakemake.input.sai, list)
else [snakemake.input.sai]
)
if len(fastq) == 1 and len(sai) == 1:
alg = "samse"
elif len(fastq) == 2 and len(sai) == 2:
alg = "sampe"
else:
raise ValueError("input.fastq and input.sai must have 1 or 2 elements each")
# Extract output format
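# e.g. a (hypothetical) output "mapped/s1.pe.bam" yields out_ext "BAM", which is
# later passed to samtools --output-fmt so the wrapper can emit SAM or BAM.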
out_name, out_ext = path.splitext(snakemake.output[0])
out_ext = out_ext[1:].upper()
# Extract arguments.
extra = snakemake.params.get("extra", "")
sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
# Simply convert to output format using samtools view.
pipe_cmd = (
"samtools view -h --output-fmt " + out_ext + " -o {snakemake.output[0]} -"
)
elif sort == "samtools":
# Sort alignments using samtools sort.
pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"
# Add name flag if needed.
if sort_order == "queryname":
sort_extra += " -n"
# Use prefix for temp.
prefix = path.splitext(snakemake.output[0])[0]
sort_extra += " -T " + prefix + ".tmp"
# Define output format
sort_extra += " --output-fmt {}".format(out_ext)
elif sort == "picard":
# Sort alignments using picard SortSam.
pipe_cmd = (
"picard SortSam {sort_extra} INPUT=/dev/stdin"
" OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
)
else:
raise ValueError("Unexpected value for params.sort ({})".format(sort))
# Run command.
shell(
"(bwa {alg}"
" {extra}"
" {snakemake.params.index}"
" {snakemake.input.sai}"
" {snakemake.input.fastq}"
" | " + pipe_cmd + ") {log}"
)
BWA-MEM2¶
For bwa-mem2, the following wrappers are available:
BWA-MEM2 INDEX¶
Creates a bwa-mem2 index.
This wrapper can be used in the following way:
rule bwa_mem2_index:
input:
"{genome}"
output:
"{genome}.0123",
"{genome}.amb",
"{genome}.ann",
"{genome}.bwt.2bit.64",
"{genome}.bwt.8bit.32",
"{genome}.pac",
log:
"logs/bwa-mem2_index/{genome}.log"
params:
prefix="{genome}"
wrapper:
"0.73.0/bio/bwa-mem2/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa-mem2==2.0
- Christopher Schröder
- Patrik Smeds
__author__ = "Christopher Schröder, Patrik Smeds"
__copyright__ = "Copyright 2020, Christopher Schröder, Patrik Smeds"
__email__ = "christopher.schroeder@tu-dortmund.de, patrik.smeds@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Check inputs/arguments.
if len(snakemake.input) == 0:
raise ValueError("A reference genome has to be provided.")
elif len(snakemake.input) > 1:
raise ValueError("Please provide exactly one reference genome as input.")
# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")
if len(prefix) > 0:
prefix = "-p " + prefix
shell("bwa-mem2 index" " {prefix}" " {snakemake.input[0]}" " {log}")
BWA-MEM2¶
Bwa-mem2 is the next version of the bwa-mem algorithm in bwa. It produces alignments identical to bwa and is ~1.3-3.1x faster, depending on the use case, dataset and machine. Optional sorting using samtools or picard.
This wrapper can be used in the following way:
rule bwa_mem2_mem:
input:
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
"mapped/{sample}.bam"
log:
"logs/bwa_mem2/{sample}.log"
params:
index="genome.fasta",
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sort="none", # Can be 'none', 'samtools' or 'picard'.
sort_order="coordinate", # Can be 'coordinate' (default) or 'queryname'.
sort_extra="" # Extra args for samtools/picard.
threads: 8
wrapper:
"0.73.0/bio/bwa-mem2/mem"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa-mem2==2.0
samtools==1.10
picard==2.23
- Christopher Schröder
- Johannes Köster
- Julian de Ruiter
__author__ = "Christopher Schröder, Johannes Köster, Julian de Ruiter"
__copyright__ = (
"Copyright 2020, Christopher Schröder, Johannes Köster and Julian de Ruiter"
)
__email__ = "christopher.schroeder@tu-dortmund.de koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Extract arguments.
extra = snakemake.params.get("extra", "")
sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
1,
2,
}:
raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")
if sort_order not in {"coordinate", "queryname"}:
raise ValueError("Unexpected value for sort_order ({})".format(sort_order))
# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
# Simply convert to bam using samtools view.
pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"
elif sort == "samtools":
# Sort alignments using samtools sort.
pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"
# Add name flag if needed.
if sort_order == "queryname":
sort_extra += " -n"
prefix = path.splitext(snakemake.output[0])[0]
sort_extra += " -T " + prefix + ".tmp"
elif sort == "picard":
# Sort alignments using picard SortSam.
pipe_cmd = (
"picard SortSam {sort_extra} INPUT=/dev/stdin"
" OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
)
else:
raise ValueError("Unexpected value for params.sort ({})".format(sort))
shell(
"(bwa-mem2 mem"
" -t {snakemake.threads}"
" {extra}"
" {snakemake.params.index}"
" {snakemake.input.reads}"
" | " + pipe_cmd + ") {log}"
)
BWA-MEM2 MEM SAMBLASTER¶
Map reads using bwa-mem2, mark duplicates by samblaster and sort and index by sambamba.
This wrapper can be used in the following way:
rule bwa_mem:
input:
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
bam="mapped/{sample}.bam",
index="mapped/{sample}.bam.bai"
log:
"logs/bwa_mem2_sambamba/{sample}.log"
params:
index="genome.fasta",
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sort_extra="-q" # Extra args for sambamba.
threads: 8
wrapper:
"0.73.0/bio/bwa-mem2/mem-samblaster"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bwa-mem2==2.0
sambamba==0.7.1
samblaster==0.1.24
- Christopher Schröder
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroeder@tu-dortmund.de"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Extract arguments.
extra = snakemake.params.get("extra", "")
sort_extra = snakemake.params.get("sort_extra", "")
samblaster_extra = snakemake.params.get("samblaster_extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
1,
2,
}:
raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")
shell(
"(bwa-mem2 mem"
" -t {snakemake.threads}"
" {extra}"
" {snakemake.params.index}"
" {snakemake.input.reads}"
" | samblaster"
" {samblaster_extra}"
" | sambamba view -S -f bam /dev/stdin"
" -t {snakemake.threads}"
" | sambamba sort /dev/stdin"
" -t {snakemake.threads}"
" -o {snakemake.output.bam}"
" {sort_extra}"
") {log}"
)
CAIROSVG¶
Convert SVG files with cairosvg.
Example¶
This wrapper can be used in the following way:
rule:
input:
"{prefix}.svg"
output:
"{prefix}.{fmt,(pdf|png)}"
wrapper:
"0.73.0/utils/cairosvg"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
cairosvg=2.4.2
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2017, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
_, ext = os.path.splitext(snakemake.output[0])
if ext not in (".png", ".pdf", ".ps", ".svg"):
raise ValueError("invalid file extension: '{}'".format(ext))
fmt = ext[1:]
shell("cairosvg -f {fmt} {snakemake.input[0]} -o {snakemake.output[0]}")
CLUSTALO¶
Multiple alignment of nucleic acid and protein sequences.
Example¶
This wrapper can be used in the following way:
rule clustalo:
input:
"{sample}.fa"
output:
"{sample}.msa.fa"
params:
extra=""
log:
"logs/clustalo/test/{sample}.log"
threads: 8
wrapper:
"0.73.0/bio/clustalo"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
clustalo==1.2.4
Authors¶
- Michael Hall
Code¶
"""Snakemake wrapper for clustal omega."""
__author__ = "Michael Hall"
__copyright__ = "Copyright 2019, Michael Hall"
__email__ = "mbhall88@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Formats the log redirection string
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Executed shell command
shell(
"clustalo {extra}"
" --threads={snakemake.threads}"
" --in {snakemake.input[0]}"
" --out {snakemake.output[0]} "
" {log}"
)
CUTADAPT¶
For cutadapt, the following wrappers are available:
CUTADAPT-PE¶
Trim paired-end reads using cutadapt.
This wrapper can be used in the following way:
rule cutadapt:
input:
["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
fastq1="trimmed/{sample}.1.fastq",
fastq2="trimmed/{sample}.2.fastq",
qc="trimmed/{sample}.qc.txt"
params:
# https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
adapters="-a AGAGCACACGTCTGAACTCCAGTCAC -g AGATCGGAAGAGCACACGT -A AGAGCACACGTCTGAACTCCAGTCAC -G AGATCGGAAGAGCACACGT",
# https://cutadapt.readthedocs.io/en/stable/guide.html#
extra="--minimum-length 1 -q 20"
log:
"logs/cutadapt/{sample}.log"
threads: 4 # set desired number of threads here
wrapper:
"0.73.0/bio/cutadapt/pe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
cutadapt==2.10
Input:
- two (paired-end) fastq files
Output:
- two trimmed (paired-end) fastq files
- text file containing trimming statistics
- Julian de Ruiter
- David Laehnemann
"""Snakemake wrapper for trimming paired-end reads using cutadapt."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
assert (
extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."
shell(
"cutadapt"
" {snakemake.params.adapters}"
" {snakemake.params.extra}"
" -o {snakemake.output.fastq1}"
" -p {snakemake.output.fastq2}"
" -j {snakemake.threads}"
" {snakemake.input}"
" > {snakemake.output.qc} {log}"
)
CUTADAPT-SE¶
Trim single-end reads using cutadapt.
This wrapper can be used in the following way:
rule cutadapt:
input:
"reads/{sample}.fastq"
output:
fastq="trimmed/{sample}.fastq",
qc="trimmed/{sample}.qc.txt"
params:
adapters="-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
extra="-q 20"
log:
"logs/cutadapt/{sample}.log"
threads: 4 # set desired number of threads here
wrapper:
"0.73.0/bio/cutadapt/se"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
cutadapt==2.10
- Julian de Ruiter
"""Snakemake wrapper for trimming single-end reads using cutadapt."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
n = len(snakemake.input)
assert n == 1, "Input must contain 1 (single-end) element."
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
assert (
extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."
shell(
"cutadapt"
" {snakemake.params.adapters}"
" {snakemake.params.extra}"
" -j {snakemake.threads}"
" -o {snakemake.output.fastq}"
" {snakemake.input[0]}"
" > {snakemake.output.qc} {log}"
)
DADA2¶
For dada2, the following wrappers are available:
DADA2_ADD_SPECIES¶
Adding species-level annotation using dada2 addSpecies
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_add_species:
input:
taxtab="results/dada2/taxa.RDS", # Taxonomic assignments
refFasta="resources/example_species_assignment.fa.gz" # Reference FASTA
output:
"results/dada2/taxa-sp.RDS", # Taxonomic + Species assignments
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
    # For a named list as an extra named argument, use a Python dict
    # (`named_list={"name1": arg1}`); see the params sketch after this rule.
#params:
# verbose=True
log:
"logs/dada2/add-species/add-species.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/add-species"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
taxtab
: RDS file containing the taxonomic assignments
refFasta
: A string with the path to the FASTA reference database
Output:
- The input RDS file augmented by the species-level annotation
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for adding species-level
# annotation using dada2 addSpecies function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
taxtab = readRDS(snakemake@input[["taxtab"]]),
refFasta = snakemake@input[["refFasta"]]
)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::addSpecies()")
}
# Add species-level annotation to the taxonomy table
taxa.sp<-do.call(addSpecies, args)
# Store the taxonomic assignments as a RDS file
saveRDS(taxa.sp, snakemake@output[[1]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_ASSIGN_SPECIES¶
Classifying sequences against a reference database using dada2 assignSpecies
function. Optional parameters are documented in the manual and an example of the function can be found in the dedicated section of the DADA2 website.
This wrapper can be used in the following way:
rule dada2_assign_species:
input:
seqs="results/dada2/seqTab.nochim.RDS", # Chimera-free sequence table
refFasta="resources/species.fasta" # Reference FASTA for Genus-Species taxonomy
output:
"results/dada2/genus-species-taxa.RDS" # Genus-Species taxonomic assignments
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
    # For a named list as an extra named argument, use a Python dict
    # (`named_list={"name1": arg1}`).
#params:
# allowMultiple=True
log:
"logs/dada2/assign-species/assign-species.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/assign-species"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
seqs
: RDS file with the chimera-free sequence table
refFasta
: A string with the path to the genus-species FASTA reference database
Output:
- RDS file containing the genus and species taxonomic assignments
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for exact matching of sequences against
# a genus-species reference database using dada2 assignSpecies function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
seqs = readRDS(snakemake@input[["seqs"]]),
refFasta = snakemake@input[["refFasta"]]
)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::assignSpecies()")
}
# Perform Genus-Species taxonomic assignments
taxa<-do.call(assignSpecies, args)
# Store the taxonomic assignments as a RDS file
saveRDS(taxa, snakemake@output[[1]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_ASSIGN_TAXONOMY¶
DADA2
Classifying sequences against a reference database using dada2 assignTaxonomy
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_assign_taxonomy:
input:
seqs="results/dada2/seqTab.nochim.RDS", # Chimera-free sequence table
refFasta="resources/example_train_set.fa.gz" # Reference FASTA for taxonomy
output:
"results/dada2/taxa.RDS" # Taxonomic assignments
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
#params:
# verbose=True
log:
"logs/dada2/assign-taxonomy/assign-taxonomy.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/assign-taxonomy"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
seqs
: RDS file with the chimera-free sequence table
refFasta
: A string with the path to the FASTA reference database
Output:
- RDS file containing the taxonomic assignments
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for classifying sequences against
# a reference database using dada2 assignTaxonomy function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
seqs = readRDS(snakemake@input[["seqs"]]),
refFasta = snakemake@input[["refFasta"]],
multithread=snakemake@threads
)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::assignTaxonomy()")
}
# Assign taxonomy
taxa<-do.call(assignTaxonomy, args)
# Store the taxonomic assignments as a RDS file
saveRDS(taxa, snakemake@output[[1]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_COLLAPSE_NOMISMATCH¶
DADA2
Combine sequences that are identical up to shifts and/or indels using the dada2 collapseNoMismatch
function. Optional parameters are documented in the manual. While the function is not included in the tutorial, feel free to browse the dada2 issues for showcases.
This wrapper can be used in the following way:
rule dada2_collapse_nomismatch:
input:
"results/dada2/seqTab.nochimeras.RDS" # Chimera-free sequence table
output:
"results/dada2/seqTab.collapsed.RDS"
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
#params:
# verbose=True
log:
"logs/dada2/collapse-nomismatch/collapse-nomismatch.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/collapse-nomismatch"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
- RDS file with the chimera-free sequence table
Output:
- RDS file with the sequence table where the needed sequences were collapsed
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for combining together sequences that are identical
# up to shifts and/or indels using dada2 collapseNoMismatch function
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
seqtab = readRDS(snakemake@input[[1]])
)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::collapseNoMismatch()")
}
# Collapse sequences
seqtab.collapsed<-do.call(collapseNoMismatch, args)
# Store the resulting table as a RDS file
saveRDS(seqtab.collapsed, snakemake@output[[1]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_DEREPLICATE_FASTQ¶
DADA2
Dereplication of FASTQ files using dada2 derepFastq
function. Optional parameters are documented in the manual and, though the function is not introduced explicitly in the tutorial, it is used under the hood in the learnErrors section.
This wrapper can be used in the following way:
rule dada2_dereplicate_fastq:
input:
# Quality filtered FASTQ file
"filtered/{fastq}.fastq"
output:
# Dereplicated sequences stored as `derep-class` object in a RDS file
"uniques/{fastq}.RDS"
log:
"logs/dada2/dereplicate-fastq/{fastq}.log"
wrapper:
"0.73.0/bio/dada2/dereplicate-fastq"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for dereplicating FASTQ files using dada2 derepFastq function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list( fls = unlist(snakemake@input))
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::derepFastq()")
}
# Dereplicate
uniques<-do.call(derepFastq, args)
# Store as RDS file
saveRDS(uniques,snakemake@output[[1]])
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_FILTER_TRIM¶
DADA2
Quality filtering of single or paired-end reads using dada2 filterAndTrim
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_filter_trim_se:
input:
# Single-end files without primer sequences
fwd="trimmed/{sample}.1.fastq.gz"
output:
filt="filtered-se/{sample}.1.fastq.gz",
stats="reports/dada2/filter-trim-se/{sample}.tsv"
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
params:
# Set the maximum expected errors tolerated in filtered reads
maxEE=1,
# Set the number of kept bases to 7 for the toy example
truncLen=7,
# Set minLen to 1 for the toy example but default is 20
minLen=1
log:
"logs/dada2/filter-trim-se/{sample}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/filter-trim"
rule dada2_filter_trim_pe:
input:
# Paired-end files without primer sequences
fwd="trimmed/{sample}.1.fastq",
rev="trimmed/{sample}.2.fastq"
output:
filt="filtered-pe/{sample}.1.fastq.gz",
filt_rev="filtered-pe/{sample}.2.fastq.gz",
stats="reports/dada2/filter-trim-pe/{sample}.tsv"
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
params:
# Set the maximum expected errors tolerated in filtered reads
maxEE=1,
# Set the number of kept bases in forward and reverse reads
# respectively to 7 for the toy example
truncLen=[7,6],
# Set minLen to 1 for the toy example but default is 20
minLen=1
log:
"logs/dada2/filter-trim-pe/{sample}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/filter-trim"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
fwd
: a forward FASTQ file (potentially compressed) without primer sequences
rev
: an (optional) reverse FASTQ file (potentially compressed) without primer sequences
Output:
filt
: a compressed filtered forward FASTQ file
filt_rev
: an (optional) compressed filtered reverse FASTQ file
stats
: a .tsv file with the number of processed and filtered reads per sample
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for filtering single or paired-end reads using dada2 filterAndTrim function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
fwd = snakemake@input[["fwd"]],
filt = snakemake@output[["filt"]],
multithread=snakemake@threads
)
# Test if paired end input is passed
if(!is.null(snakemake@input[["rev"]]) & !is.null(snakemake@output[["filt_rev"]])){
args<-c(args,
rev = snakemake@input[["rev"]],
filt.rev = snakemake@output[["filt_rev"]]
)
}
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Check if 'compress=' option is passed
if(!is.null(extra[["compress"]])){
stop("Remove the `compress=` option from `params`.\n",
"The `compress` option is implicitly set here from the file extension.")
} else {
# Check if output files are given as compressed files
# ex: in se version, all(TRUE, NULL) gives TRUE
compressed <- c(
endsWith(args[["filt"]], '.gz'),
if(is.null(args[["filt.rev"]])) NULL else {endsWith(args[["filt.rev"]], 'gz')}
)
if ( all(compressed) ) {
extra[["compress"]] <- TRUE
} else if ( any(compressed) ) {
stop("Either all or no fastq output should be compressed. Please check `output.filt` and `output.filt_rev` for consistency.")
} else {
extra[["compress"]] <- FALSE
}
}
# Add them to the list of arguments
args<-c(args, extra)
} else {
message("No optional parameters. Using default parameters from dada2::filterAndTrim()")
}
# Call the function with arguments
filt.stats<-do.call(filterAndTrim, args)
# Write processed reads report
write.table(filt.stats, snakemake@output[["stats"]], sep="\t", quote=F)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_LEARN_ERRORS¶
DADA2
Learning error rates separately on paired-end data using dada2 learnErrors
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule learn_pe:
# Run dada2_learn_errors twice: once on forward and once on reverse reads
input: expand("results/dada2/model_{orientation}.RDS", orientation=[1,2])
rule dada2_learn_errors:
input:
# Quality filtered and trimmed forward FASTQ files (potentially compressed)
expand("filtered/{sample}.{{orientation}}.fastq.gz", sample=["a","b"])
output:
err="results/dada2/model_{orientation}.RDS",# save the error model
plot="reports/dada2/errors_{orientation}.png",# plot observed and estimated rates
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
#params:
# randomize=True
log:
"logs/dada2/learn-errors/learn-errors_{orientation}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/learn-errors"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
- A list of quality filtered and trimmed forward FASTQ files (potentially compressed)
Output:
err
: RDS file with the stored error model
plot
: plot of observed vs estimated error rates
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for learning error rates on sequence data using dada2 learnErrors function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
fls = snakemake@input,
multithread=snakemake@threads
)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using defaults parameters from dada2::learnErrors()")
}
# Learn error rates
err<-do.call(learnErrors, args)
# Plot estimated versus observed error rates to validate models
perr<-plotErrors(err, nominalQ = TRUE)
# Save the plots
library(ggplot2)
ggsave(snakemake@output[["plot"]], perr, width = 8, height = 8, dpi = 300)
# Store the estimated errors as RDS files
saveRDS(err, snakemake@output[["err"]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_MAKE_TABLE¶
DADA2
Build a sample-by-sequence table from denoised samples using dada2 makeSequenceTable
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_make_table_se:
input:
# Inferred composition
expand("denoised/{sample}.1.RDS", sample=['a','b'])
output:
"results/dada2/seqTab-se.RDS"
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
params:
names=['a','b'] # Sample names instead of paths
log:
"logs/dada2/make-table/make-table-se.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/make-table"
rule dada2_make_table_pe:
input:
# Merged composition
expand("merged/{sample}.RDS", sample=['a','b'])
output:
"results/dada2/seqTab-pe.RDS"
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
params:
names=['a','b'], # Sample names instead of paths
orderBy="nsamples" # Change the ordering of samples
log:
"logs/dada2/make-table/make-table-pe.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/make-table"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
- A list of RDS files with denoised samples (se), or denoised and merged samples (pe)
Output:
- RDS file with the table
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for building a sample-by-sequence table from denoised samples using dada2 makeSequenceTable function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# If names are provided use them
nm<-if(is.null(snakemake@params[["names"]])) NULL else snakemake@params[["names"]]
# From a list of n lists to one named list of n elements
smps<-setNames(
object=unlist(snakemake@input),
nm=nm
)
# Read the RDS into the list
smps<-lapply(smps, readRDS)
# Prepare arguments (no matter the order)
args<-list( samples = smps)
# Check if extra params are passed (apart from [["names"]])
if(length(snakemake@params) > 1 ){
# Keeping only the named elements of the list for do.call() (apart from [["names"]])
extra<-snakemake@params[ names(snakemake@params) != "" & names(snakemake@params) != "names" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::makeSequenceTable()")
}
# Make table
seqTab<-do.call(makeSequenceTable, args)
# Store the table as a RDS file
saveRDS(seqTab, snakemake@output[[1]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_MERGE_PAIRS¶
DADA2
Merging denoised forward and reverse reads using dada2 mergePairs
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_merge_pairs:
input:
dadaF="denoised/{sample}.1.RDS",# Inferred composition
dadaR="denoised/{sample}.2.RDS",
derepF="uniques/{sample}.1.RDS",# Dereplicated sequences
derepR="uniques/{sample}.2.RDS"
output:
"merged/{sample}.RDS"
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
#params:
# verbose=True
log:
"logs/dada2/merge-pairs/{sample}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/merge-pairs"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
dadaF
: RDS file with the inferred sample composition from forward reads
dadaR
: the same for reverse reads
derepF
: RDS file with the dereplicated forward reads
derepR
: the same for reverse reads
Output:
- RDS file with the merged pairs
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for merging denoised forward and reverse reads using dada2 mergePairs function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
dadaF = snakemake@input[["dadaF"]],
derepF = snakemake@input[["derepF"]],
dadaR = snakemake@input[["dadaR"]],
derepR = snakemake@input[["derepR"]]
)
# Read RDS from the list
args<-sapply(args,readRDS)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::mergePairs()")
}
# Merge pairs
merger<-do.call(mergePairs, args)
# Store the merged pairs as an RDS file
saveRDS(merger, snakemake@output[[1]],compress = T)
# Close the connection for the log file
sink(type="message")
sink()
DADA2_QUALITY_PROFILES¶
DADA2
Plotting the quality profile of reads using dada2 plotQualityProfile
function. The function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_quality_profile_se:
input:
# FASTQ file without primer sequences
"trimmed/{sample}.{orientation}.fastq"
output:
"reports/dada2/quality-profile/{sample}.{orientation}-quality-profile.png"
log:
"logs/dada2/quality-profile/{sample}.{orientation}-quality-profile-se.log"
wrapper:
"0.73.0/bio/dada2/quality-profile"
rule dada2_quality_profile_pe:
input:
# FASTQ file without primer sequences
expand("trimmed/{{sample}}.{orientation}.fastq",orientation=[1,2])
output:
"reports/dada2/quality-profile/{sample}-quality-profile.png"
log:
"logs/dada2/quality-profile/{sample}-quality-profile-pe.log"
wrapper:
"0.73.0/bio/dada2/quality-profile"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
- a FASTQ file (potentially compressed) without primer sequences
Output:
- A PNG file of the quality plot
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for plotting the quality profile of reads using dada2 plotQualityProfile function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Plot the quality profile for a given FASTQ file or a list of files
pquality<-plotQualityProfile(unlist(snakemake@input))
# Write the plots to files
library(ggplot2)
ggsave(snakemake@output[[1]], pquality, width = 4, height = 3, dpi = 300)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_REMOVE_CHIMERAS¶
DADA2
Remove chimera sequences from the sequence table data using dada2 removeBimeraDenovo
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_remove_chimeras:
input:
"results/dada2/seqTab.RDS" # Sequence table
output:
"results/dada2/seqTab.nochim.RDS" # Chimera-free sequence table
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
#params:
# verbose=True
log:
"logs/dada2/remove-chimeras/remove-chimeras.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/remove-chimeras"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
- RDS file with the sequence table
Output:
- RDS file with the chimera-free sequence table
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for removing chimeras sequences from
# the sequence table data using dada2 removeBimeraDenovo function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
unqs = readRDS(snakemake@input[[1]]),
multithread=snakemake@threads
)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::removeBimeraDenovo()")
}
# Remove chimeras
seqTab_nochimeras<-do.call(removeBimeraDenovo, args)
# Store the chimera-free sequence table as an RDS file
saveRDS(seqTab_nochimeras, snakemake@output[[1]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
DADA2_SAMPLE_INFERENCE¶
DADA2
Inferring sample composition using dada2 dada
function. Optional parameters are documented in the manual and the function is introduced in the dedicated tutorial section.
This wrapper can be used in the following way:
rule dada2_sample_inference:
input:
# Dereplicated (aka unique) sequences of the sample
derep="uniques/{fastq}.RDS",
err="results/dada2/model_1.RDS" # Error model
output:
"denoised/{fastq}.RDS" # Inferred sample composition
# Even though this is an R wrapper, use named arguments in Python syntax
# here, to specify extra parameters. Python booleans (`arg1=True`, `arg2=False`)
# and lists (`list_arg=[]`) are automatically converted to R.
# For a named list as an extra named argument, use a python dict
# (`named_list={"name1": arg1}`).
#params:
# verbose=True
log:
"logs/dada2/sample-inference/{fastq}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/sample-inference"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bioconductor-dada2==1.16
Input:
derep
: RDS file with the dereplicated sequences
err
: RDS file with the error model
Output:
- RDS file with the stored inferred sample composition
- Charlie Pauvert
# __author__ = "Charlie Pauvert"
# __copyright__ = "Copyright 2020, Charlie Pauvert"
# __email__ = "cpauvert@protonmail.com"
# __license__ = "MIT"
# Snakemake wrapper for inferring sample composition using dada2 dada function.
# Sink the stderr and stdout to the snakemake log file
# https://stackoverflow.com/a/48173272
log.file<-file(snakemake@log[[1]],open="wt")
sink(log.file)
sink(log.file,type="message")
library(dada2)
# Prepare arguments (no matter the order)
args<-list(
derep = readRDS(snakemake@input[["derep"]]),
err = readRDS(snakemake@input[["err"]]),
multithread = snakemake@threads
)
# Check if extra params are passed
if(length(snakemake@params) > 0 ){
# Keeping only the named elements of the list for do.call()
extra<-snakemake@params[ names(snakemake@params) != "" ]
# Add them to the list of arguments
args<-c(args, extra)
} else{
message("No optional parameters. Using default parameters from dada2::dada()")
}
# Infer sample composition
inferred_composition<-do.call(dada, args)
# Store the inferred sample composition as RDS files
saveRDS(inferred_composition, snakemake@output[[1]],compress = T)
# Proper syntax to close the connection for the log file
# but could be optional for Snakemake wrapper
sink(type="message")
sink()
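Taken together, the wrappers above cover the steps of the standard DADA2 workflow, and chaining them is only a matter of matching file paths between rules. A minimal sketch of a target rule, assuming the example paths above and a single-end layout:
rule all:
    input:
        # end of the chain filter-trim -> dereplicate-fastq -> learn-errors ->
        # sample-inference -> make-table -> remove-chimeras ->
        # assign-taxonomy -> add-species, using the paths from the examples above
        "results/dada2/taxa-sp.RDS",
        # error-model diagnostic plots from dada2/learn-errors
        expand("reports/dada2/errors_{orientation}.png", orientation=[1, 2]),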
DEEPTOOLS¶
For deeptools, the following wrappers are available:
DEEPTOOLS COMPUTEMATRIX¶
deepTools computeMatrix
calculates scores per genomic region. The matrix file can be used as input for other tools or for the generation of a deepTools plotHeatmap
or deepTools plotProfile
. For usage information about deepTools computeMatrix
, please see the documentation. For more information about deepTools
, also see the source code.
computeMatrix option      Output format                                Name of output variable   Recommended extension
--outFileName, -out, -o   gzipped matrix file                          matrix_gz (required)      ".gz"
--outFileNameMatrix       tab-separated table of matrix file           matrix_tab                ".tab"
--outFileSortedRegions    BED matrix file with sorted regions after    matrix_bed                ".bed"
                          skipping zeros or min/max threshold values
This wrapper can be used in the following way:
rule compute_matrix:
input:
# Please note that the -R and -S options are defined via input files
bed=expand("{sample}.bed", sample=["a", "b"]),
bigwig=expand("{sample}.bw", sample=["a", "b"])
output:
# Please note that --outFileName, --outFileNameMatrix and --outFileSortedRegions are exclusively defined via output files.
# Usable output variables, their extensions and which option they implicitly call are listed here:
# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/deeptools/computematrix.html.
matrix_gz="matrix_files/matrix.gz", # required
# optional output files
matrix_tab="matrix_files/matrix.tab",
matrix_bed="matrix_files/matrix.bed"
log:
"logs/deeptools/compute_matrix.log"
params:
# required argument, choose "scale-regions" or "reference-point"
command="scale-regions",
# optional parameters
extra="--regionBodyLength 200 --verbose"
wrapper:
"0.73.0/bio/deeptools/computematrix"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
deeptools==3.4.3
Input:
- BED or GTF files (.bed or .gtf) AND
- bigWig files (.bw)
Output:
- gzipped matrix file (.gz) AND/OR
- tab-separated table of matrix file (.tab) AND/OR
- BED matrix file with sorted regions after skiping zeros or min/max threshold values (.bed)
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
out_tab = snakemake.output.get("matrix_tab")
out_bed = snakemake.output.get("matrix_bed")
optional_output = ""
if out_tab:
optional_output += " --outFileNameMatrix {out_tab} ".format(out_tab=out_tab)
if out_bed:
optional_output += " --outFileSortedRegions {out_bed} ".format(out_bed=out_bed)
shell(
"(computeMatrix "
"{snakemake.params.command} "
"{snakemake.params.extra} "
"-R {snakemake.input.bed} "
"-S {snakemake.input.bigwig} "
"-o {snakemake.output.matrix_gz} "
"{optional_output}) {log}"
)
DEEPTOOLS PLOTFINGERPRINT¶
deepTools plotFingerprint
plots a profile of cumulative read coverages from a list of indexed BAM files. For usage information about deepTools plotFingerprint
, please see the documentation. For more information about deepTools
, also see the source code.
In addition to the required output, an optional output file of read counts can be generated by setting the output variable "counts" (see the example Snakemake rule below). An optional output file of quality control metrics can likewise be generated by setting the variable "qc_metrics". If jsd_sample is specified in the input, the results of the Jensen-Shannon distance calculation are also written to this file.
plotFingerprint option   Output                                        Name of output variable   Recommended extension(s)
--plotFile, -plot, -o    coverage plot                                 fingerprint (required)    ".png", ".eps", ".pdf" or ".svg"
--outRawCounts           tab-separated table of read counts per bin    counts                    ".tab"
--outQualityMetrics      tab-separated table of metrics for quality    qc_metrics                ".txt"
                         control and for results of Jensen-Shannon
                         distance calculation (optional)
This wrapper can be used in the following way:
rule plot_fingerprint:
input:
bam_files=expand("samples/{sample}.bam", sample=["a", "b"]),
bam_idx=expand("samples/{sample}.bam.bai", sample=["a", "b"]),
jsd_sample="samples/b.bam" # optional, requires qc_metrics output
output:
# Please note that --plotFile and --outRawCounts are exclusively defined via output files.
# Usable output variables, their extensions and which option they implicitly call are listed here:
# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/deeptools/plotfingerprint.html.
fingerprint="plot_fingerprint/plot_fingerprint.png", # required
# optional output
counts="plot_fingerprint/raw_counts.tab",
qc_metrics="plot_fingerprint/qc_metrics.txt"
log:
"logs/deeptools/plot_fingerprint.log"
params:
# optional parameters
"--numberOfSamples 200 "
threads:
8
wrapper:
"0.73.0/bio/deeptools/plotfingerprint"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
deeptools==3.4.3
Input:
- list of BAM files (.bam) AND
- list of their index files (.bam.bai)
Output:
- plot file in image format (.png, .eps, .pdf or .svg)
- tab-separated table of read counts per bin (.tab) (optional)
- tab-separated table of metrics and JSD calculation (.txt) (optional)
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
import re
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
jsd_sample = snakemake.input.get("jsd_sample")
out_counts = snakemake.output.get("counts")
out_metrics = snakemake.output.get("qc_metrics")
optional_output = ""
jsd = ""
if jsd_sample:
jsd += " --JSDsample {jsd} ".format(jsd=jsd_sample)
if out_counts:
optional_output += " --outRawCounts {out_counts} ".format(out_counts=out_counts)
if out_metrics:
optional_output += " --outQualityMetrics {metrics} ".format(metrics=out_metrics)
shell(
"(plotFingerprint "
"-b {snakemake.input.bam_files} "
"-o {snakemake.output.fingerprint} "
"{optional_output} "
"--numberOfProcessors {snakemake.threads} "
"{jsd} "
"{snakemake.params}) {log}"
)
# ToDo: remove the 'NA' string replacement when fixed in deepTools, see:
# https://github.com/deeptools/deepTools/pull/999
# Only post-process the metrics file if that optional output was requested
if out_metrics:
regex_passes = 2
with open(out_metrics, "rt") as f:
metrics = f.read()
for i in range(regex_passes):
metrics = re.sub("\tNA(\t|\n)", "\tnan\\1", metrics)
with open(out_metrics, "wt") as f:
f.write(metrics)
DEEPTOOLS PLOTHEATMAP¶
deepTools plotHeatmap
creates a heatmap for scores associated with genomic regions. As input, it requires a matrix file generated by deepTools computeMatrix
. For usage information about deepTools plotHeatmap
, please see the documentation. For more information about deepTools
, also see the source code.
You can select which optional output files are generated by adding the respective output variable with its recommended extension(s) (see the example Snakemake rule below).
plotHeatmap option        Output                                Name of output variable   Recommended extension(s)
--outFileName, -out, -o   plot image                            heatmap_img (required)    ".png", ".eps", ".pdf" or ".svg"
--outFileSortedRegions    BED file with sorted regions          regions                   ".bed"
--outFileNameMatrix       tab-separated matrix of values        heatmap_matrix            ".tab"
                          underlying the heatmap
This wrapper can be used in the following way:
rule plot_heatmap:
input:
# matrix file from deepTools computeMatrix tool
"matrix.gz"
output:
# Please note that --outFileSortedRegions and --outFileNameMatrix are exclusively defined via output files.
# Usable output variables, their extensions and which option they implicitly call are listed here:
# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/deeptools/plotheatmap.html.
heatmap_img="plot_heatmap/heatmap.png", # required
# optional output files
regions="plot_heatmap/heatmap_regions.bed",
heatmap_matrix="plot_heatmap/heatmap_matrix.tab"
log:
"logs/deeptools/heatmap.log"
params:
# optional parameters
"--plotType=fill "
wrapper:
"0.73.0/bio/deeptools/plotheatmap"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
deeptools==3.4.3
Input:
- gzipped matrix file from
deepTools computeMatrix
(.gz)
Output:
- plot file in image format (.png, .eps, .pdf or .svg) AND/OR
- file with sorted regions after skipping zeros or min/max threshold values (.bed) AND/OR
- tab-separated table for average profile (.tab)
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
out_region = snakemake.output.get("regions")
out_matrix = snakemake.output.get("heatmap_matrix")
optional_output = ""
if out_region:
optional_output += " --outFileSortedRegions {out_region} ".format(
out_region=out_region
)
if out_matrix:
optional_output += " --outFileNameMatrix {out_matrix} ".format(
out_matrix=out_matrix
)
shell(
"(plotHeatmap "
"-m {snakemake.input[0]} "
"-o {snakemake.output.heatmap_img} "
"{optional_output} "
"{snakemake.params}) {log}"
)
DEEPTOOLS PLOTPROFILE¶
deepTools plotProfile
plots scores over sets of genomic regions. As input, it requires a matrix file generated by deepTools computeMatrix
. For usage information about deepTools plotProfile
, please see the documentation. For more information about deepTools
, also see the source code.
You can select which optional output files are generated by adding the respective output variable with its recommended extension (see the example Snakemake rule below).
plotProfile option        Output                                Name of output variable   Recommended extension(s)
--outFileName, -out, -o   profile plot                          plot_img (required)       ".png", ".eps", ".pdf" or ".svg"
--outFileSortedRegions    BED file with sorted regions          regions                   ".bed"
--outFileNameData         tab-separated table for average       data                      ".tab"
                          profile
This wrapper can be used in the following way:
rule plot_profile:
input:
# matrix file from deepTools computeMatrix tool
"matrix.gz"
output:
# Please note that --outFileSortedRegions and --outFileNameData are exclusively defined via output files.
# Usable output variables, their extensions and which option they implicitly call are listed here:
# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/deeptools/plotprofile.html.
# The image file and further plotProfile output options are selected via these output variables.
plot_img="plot_profile/plot.png", # required
# optional output files
regions="plot_profile/regions.bed",
data="plot_profile/data.tab"
log:
"logs/deeptools/plot_profile.log"
params:
# optional parameters
"--plotType=fill "
"--perGroup "
"--colors red yellow blue "
"--dpi 150 "
wrapper:
"0.73.0/bio/deeptools/plotprofile"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
deeptools==3.4.3
Input:
- gzipped matrix file from
deepTools computeMatrix
(.gz)
Output:
- plot file in image format (.png, .eps, .pdf or .svg) AND/OR
- file with sorted regions after skipping zeros or min/max threshold values (.bed) AND/OR
- tab-separated table for average profile (.tab)
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
out_region = snakemake.output.get("regions")
out_data = snakemake.output.get("data")
optional_output = ""
if out_region:
optional_output += " --outFileSortedRegions {out_region} ".format(
out_region=out_region
)
if out_data:
optional_output += " --outFileNameData {out_data} ".format(out_data=out_data)
shell(
"(plotProfile "
"-m {snakemake.input[0]} "
"-o {snakemake.output.plot_img} "
"{optional_output} "
"{snakemake.params}) {log}"
)
DEEPVARIANT¶
Call genetic variants using a deep neural network. Copyright 2017 Google LLC. BSD 3-Clause “New” or “Revised” https://github.com/google/deepvariant
Example¶
This wrapper can be used in the following way:
rule deepvariant:
input:
bam="mapped/{sample}.bam",
ref="genome/genome.fasta"
output:
vcf="calls/{sample}.vcf.gz"
params:
model="wgs", # {wgs, wes}
extra=""
threads: 2
log:
"logs/deepvariant/{sample}/stdout.log"
wrapper:
"0.73.0/bio/deepvariant"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
deepvariant=0.10.0
tensorflow-estimator=2.0.0
unzip=6.0
Notes¶
- The extra param allows for additional program arguments.
- This Snakemake wrapper uses the bioconda deepvariant package. Copyright 2018 Brad Chapman.
Authors¶
- Tetsuro Hisayoshi
Code¶
__author__ = "Tetsuro Hisayoshi"
__copyright__ = "Copyright 2020, Tetsuro Hisayoshi"
__email__ = "hisayoshi0530@gmail.com"
__license__ = "MIT"
import os
import tempfile
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
log_dir = os.path.dirname(snakemake.log[0])
output_dir = os.path.dirname(snakemake.output[0])
# sample basename
basename = os.path.splitext(os.path.basename(snakemake.input.bam))[0]
with tempfile.TemporaryDirectory() as tmp_dir:
shell(
"(dv_make_examples.py "
"--cores {snakemake.threads} "
"--ref {snakemake.input.ref} "
"--reads {snakemake.input.bam} "
"--sample {basename} "
"--examples {tmp_dir} "
"--logdir {log_dir} "
"{extra} \n"
"dv_call_variants.py "
"--cores {snakemake.threads} "
"--outfile {tmp_dir}/{basename}.tmp "
"--sample {basename} "
"--examples {tmp_dir} "
"--model {snakemake.params.model} \n"
"dv_postprocess_variants.py "
"--ref {snakemake.input.ref} "
"--infile {tmp_dir}/{basename}.tmp "
"--outfile {snakemake.output.vcf} ) {log}"
)
DELLY¶
Call variants with delly.
Example¶
This wrapper can be used in the following way:
rule delly:
input:
ref="genome.fasta",
samples=["mapped/a.bam"],
# optional exclude template (see https://github.com/dellytools/delly)
exclude="human.hg19.excl.tsv"
output:
"sv/calls.bcf"
params:
extra="" # optional parameters for delly (except -g, -x)
log:
"logs/delly.log"
threads: 2 # It is best to use as many threads as samples
wrapper:
"0.73.0/bio/delly"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
delly==0.8.1
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
exclude = (
"-x {}".format(snakemake.input.exclude)
if snakemake.input.get("exclude", "")
else ""
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"OMP_NUM_THREADS={snakemake.threads} delly call {extra} "
"{exclude} -g {snakemake.input.ref} "
"-o {snakemake.output[0]} {snakemake.input.samples} {log}"
)
DIAMOND¶
For diamond, the following wrappers are available:
DIAMOND BLASTX¶
DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data.
This wrapper can be used in the following way:
rule diamond_blastx:
input:
fname_fastq = "{sample}.fastq",
fname_db = "db.dmnd"
output:
fname = "{sample}.tsv.gz"
log:
"logs/diamond_blastx/{sample}.log"
params:
extra="--header --compress 1"
threads: 8
wrapper:
"0.73.0/bio/diamond/blastx"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
diamond==2.0.6
- Kim Philipp Jablonski
__author__ = "Kim Philipp Jablonski"
__copyright__ = "Copyright 2020, Kim Philipp Jablonski"
__email__ = "kim.philipp.jablonski@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"diamond blastx"
" --threads {snakemake.threads}"
" --db {snakemake.input.fname_db}"
" --query {snakemake.input.fname_fastq}"
" --out {snakemake.output.fname}"
" {extra}"
" {log}"
)
DIAMOND MAKEDB¶
DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data.
This wrapper can be used in the following way:
rule diamond_makedb:
input:
fname = "{reference}.fasta",
output:
fname = "{reference}.dmnd"
log:
"logs/diamond_makedb/{reference}.log"
params:
extra=""
threads: 8
wrapper:
"0.73.0/bio/diamond/makedb"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
diamond==2.0.6
- Kim Philipp Jablonski
__author__ = "Kim Philipp Jablonski"
__copyright__ = "Copyright 2020, Kim Philipp Jablonski"
__email__ = "kim.philipp.jablonski@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"diamond makedb"
" --threads {snakemake.threads}"
" --in {snakemake.input.fname}"
" --db {snakemake.output.fname}"
" {extra}"
" {log}"
)
EPIC¶
For epic, the following wrappers are available:
EPIC¶
Find broad enriched domains in ChIP-Seq data with epic
This wrapper can be used in the following way:
rule epic:
input:
treatment = "bed/test.bed",
background = "bed/control.bed"
output:
enriched_regions = "epic/enriched_regions.csv", # required
bed = "epic/enriched_regions.bed", # optional
matrix = "epic/matrix.gz" # optional
log:
"logs/epic/epic.log"
params:
genome = "hg19", # optional, default hg19
extra="-g 3 -w 200" # "--bigwig epic/bigwigs"
threads: 1 # optional, defaults to 1
wrapper:
"0.73.0/bio/epic/peaks"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
epic=0.2.7
pandas=0.22.0
Input:
treatment
: ChIP .bed(.gz/.bz) files
background
: input .bed(.gz/.bz) files
Output:
enriched_regions
: main output file with enriched peaks
bed
: (optional) contains much of the same info as enriched_regions but in a bed format, suitable for viewing in the UCSC genome browser or downstream use with bedtools
matrix
: (optional) a gzipped matrix of read counts
- Any of the bigwig output options must be given via the extra parameter
- Endre Bakken Stovner
__author__ = "Endre Bakken Stovner"
__copyright__ = "Copyright 2017, Endre Bakken Stovner"
__email__ = "endrebak85@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
threads = snakemake.threads or 1
treatment = snakemake.input.get("treatment")
background = snakemake.input.get("background")
# Executed shell command
enriched_regions = snakemake.output.get("enriched_regions")
bed = snakemake.output.get("bed")
matrix = snakemake.output.get("matrix")
log = snakemake.log[0] if snakemake.log else None
genome = snakemake.params.get("genome", "hg19")
cmd = "epic -cpu {threads} -t {treatment} -c {background} -o {enriched_regions} -gn {genome}"
if bed:
cmd += " -b {bed}"
if matrix:
cmd += " -sm {matrix}"
if log:
cmd += " -l {log}"
cmd += " {extra}"
shell(cmd)
FASTP¶
trim and QC fastq reads with fastp
Example¶
This wrapper can be used in the following way:
rule fastp_se:
input:
sample=["reads/se/{sample}.fastq"]
output:
trimmed="trimmed/se/{sample}.fastq",
html="report/se/{sample}.html",
json="report/se/{sample}.json"
log:
"logs/fastp/se/{sample}.log"
params:
adapters="--adapter_sequence ACGGCTAGCTA",
extra=""
threads: 1
wrapper:
"0.73.0/bio/fastp"
rule fastp_pe:
input:
sample=["reads/pe/{sample}.1.fastq", "reads/pe/{sample}.2.fastq"]
output:
trimmed=["trimmed/pe/{sample}.1.fastq", "trimmed/pe/{sample}.2.fastq"],
html="report/pe/{sample}.html",
json="report/pe/{sample}.json"
log:
"logs/fastp/pe/{sample}.log"
params:
adapters="--adapter_sequence ACGGCTAGCTA --adapter_sequence_r2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
extra=""
threads: 2
wrapper:
"0.73.0/bio/fastp"
rule fastp_pe_wo_trimming:
input:
sample=["reads/pe/{sample}.1.fastq", "reads/pe/{sample}.2.fastq"]
output:
html="report/pe_wo_trimming/{sample}.html",
json="report/pe_wo_trimming/{sample}.json"
log:
"logs/fastp/pe_wo_trimming/{sample}.log"
params:
extra=""
threads: 2
wrapper:
"0.73.0/bio/fastp"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fastp=0.20
Authors¶
- Sebastian Kurscheid (sebastian.kurscheid@unibas.ch)
Code¶
__author__ = "Sebastian Kurscheid"
__copyright__ = "Copyright 2019, Sebastian Kurscheid"
__email__ = "sebastian.kurscheid@anu.edu.au"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
n = len(snakemake.input.sample)
assert (
n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."
if n == 1:
reads = "--in1 {}".format(snakemake.input.sample)
else:
reads = "--in1 {} --in2 {}".format(*snakemake.input.sample)
trimmed_paths = snakemake.output.get("trimmed", None)
if trimmed_paths is not None:
if n == 1:
trimmed = "--out1 {}".format(snakemake.output.trimmed)
else:
trimmed = "--out1 {} --out2 {}".format(*snakemake.output.trimmed)
else:
trimmed = ""
html = "--html {}".format(snakemake.output.html)
json = "--json {}".format(snakemake.output.json)
shell(
"(fastp --thread {snakemake.threads} "
"{extra} "
"{adapters} "
"{reads} "
"{trimmed} "
"{json} "
"{html} ) {log}"
)
FASTQ_SCREEN¶
fastq_screen screens a library of sequences in FASTQ format against a set of sequence databases so you can see if the composition of the library matches what you expect.
This wrapper allows the configuration to be passed as a filename or as a dictionary in the rule’s params.fastq_screen_config. So the following configuration file:
DATABASE ecoli /data/Escherichia_coli/Bowtie2Index/genome BOWTIE2
DATABASE ecoli /data/Escherichia_coli/BowtieIndex/genome BOWTIE
DATABASE hg19 /data/hg19/Bowtie2Index/genome BOWTIE2
DATABASE mm10 /data/mm10/Bowtie2Index/genome BOWTIE2
BOWTIE /path/to/bowtie
BOWTIE2 /path/to/bowtie2
becomes:
fastq_screen_config = {
'database': {
'ecoli': {
'bowtie2': '/data/Escherichia_coli/Bowtie2Index/genome',
'bowtie': '/data/Escherichia_coli/BowtieIndex/genome'},
'hg19': {
'bowtie2': '/data/hg19/Bowtie2Index/genome'},
'mm10': {
'bowtie2': '/data/mm10/Bowtie2Index/genome'}
},
'aligner_paths': {'bowtie': 'bowtie', 'bowtie2': 'bowtie2'}
}
By default, the wrapper will use bowtie2 as the aligner and a subset of 100000
reads. These can be overridden using params.aligner and params.subset
respectively. Furthermore, params.extra can be used to pass additional
arguments verbatim to fastq_screen, for example extra="--illumina1_3" or
extra="--bowtie2 '--trim5=8'".
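A sketch of such overrides (the values are illustrative; per the fastq_screen documentation, subset=0 screens all reads instead of a sample):
rule fastq_screen_bowtie:
    input:
        "samples/{sample}.fastq"
    output:
        txt="qc/{sample}.fastq_screen.txt",
        png="qc/{sample}.fastq_screen.png"
    params:
        fastq_screen_config="fastq_screen.conf",
        subset=0, # screen all reads rather than a 100000-read subset
        aligner="bowtie", # override the bowtie2 default
        extra="--illumina1_3" # passed verbatim to fastq_screen
    threads: 8
    wrapper:
        "0.73.0/bio/fastq_screen"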
Example¶
This wrapper can be used in the following way:
rule fastq_screen:
input:
"samples/{sample}.fastq"
output:
txt="qc/{sample}.fastq_screen.txt",
png="qc/{sample}.fastq_screen.png"
params:
fastq_screen_config="fastq_screen.conf",
subset=100000,
aligner='bowtie2'
threads: 8
wrapper:
"0.73.0/bio/fastq_screen"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fastq-screen==0.5.2
bowtie2==2.2.6
bowtie==1.1.2
Input/Output¶
Input:
- A FASTQ file, gzipped or not.
Output:
txt
: a text file containing the fraction of reads mapping to each provided index
png
: a bar plot of the contents of txt, saved as a PNG file
Notes¶
- fastq_screen hard-codes the output filenames. This wrapper moves the hard-coded output files to those specified by the rule.
- While the dictionary form of fastq_screen_config is convenient, the unordered nature of dictionaries in older Python versions may cause snakemake --list-params-changed to incorrectly report changed parameters even though the contents remain the same. If you plan on using --list-params-changed, it is better to write a config file and pass that as fastq_screen_config. This problem disappears from Python 3.7 on, where dictionaries preserve insertion order.
- When providing the dictionary form of fastq_screen_config, the wrapper will write a temp file using Python’s tempfile module. To control the temp file directory, make sure the $TMPDIR environment variable is set (see the tempfile docs for details). One way of doing this is by adding something like shell.prefix("export TMPDIR=/scratch; ") to the snakefile calling this wrapper.
Authors¶
- Ryan Dale
Code¶
import os
import re
from snakemake.shell import shell
import tempfile
__author__ = "Ryan Dale"
__copyright__ = "Copyright 2016, Ryan Dale"
__email__ = "dalerr@niddk.nih.gov"
__license__ = "MIT"
_config = snakemake.params["fastq_screen_config"]
subset = snakemake.params.get("subset", 100000)
aligner = snakemake.params.get("aligner", "bowtie2")
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell()
# snakemake.params.fastq_screen_config can be either a dict or a string. If
# string, interpret as a filename pointing to the fastq_screen config file.
# Otherwise, create a new tempfile out of the contents of the dict:
if isinstance(_config, dict):
tmp = tempfile.NamedTemporaryFile(delete=False).name
with open(tmp, "w") as fout:
for label, indexes in _config["database"].items():
# use a distinct loop variable so the params aligner used in the shell
# command below is not overwritten
for aligner_name, index in indexes.items():
fout.write(
"\t".join(["DATABASE", label, index, aligner_name.upper()]) + "\n"
)
for aligner_name, path in _config["aligner_paths"].items():
fout.write("\t".join([aligner_name.upper(), path]) + "\n")
config_file = tmp
else:
config_file = _config
# fastq_screen hard-codes filenames according to this prefix. We will send
# hard-coded output to a temp dir, and then move them later.
prefix = re.split(r"\.fastq|\.fq|\.txt|\.seq", os.path.basename(snakemake.input[0]))[0]
tempdir = tempfile.mkdtemp()
shell(
"fastq_screen --outdir {tempdir} "
"--force "
"--aligner {aligner} "
"--conf {config_file} "
"--subset {subset} "
"--threads {snakemake.threads} "
"{extra} "
"{snakemake.input[0]} "
"{log}"
)
# Move output to the filenames specified by the rule
shell("mv {tempdir}/{prefix}_screen.txt {snakemake.output.txt}")
shell("mv {tempdir}/{prefix}_screen.png {snakemake.output.png}")
# Clean up temp
shell("rm -r {tempdir}")
if isinstance(_config, dict):
shell("rm {tmp}")
FASTQC¶
Generate FASTQ QC statistics using fastqc.
Example¶
This wrapper can be used in the following way:
rule fastqc:
input:
"reads/{sample}.fastq"
output:
html="qc/fastqc/{sample}.html",
zip="qc/fastqc/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: "--quiet"
log:
"logs/fastqc/{sample}.log"
threads: 1
wrapper:
"0.73.0/bio/fastqc"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fastqc==0.11.9
Input/Output¶
Input:
- fastq file
Output:
- html file containing statistics
- zip file containing statistics
Authors¶
- Julian de Ruiter
Code¶
"""Snakemake wrapper for fastqc."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from os import path
import re
from tempfile import TemporaryDirectory
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
def basename_without_ext(file_path):
"""Returns basename of file path, without the file extension."""
base = path.basename(file_path)
# Remove file extension(s) (similar to the internal fastqc approach)
base = re.sub("\\.gz$", "", base)
base = re.sub("\\.bz2$", "", base)
base = re.sub("\\.txt$", "", base)
base = re.sub("\\.fastq$", "", base)
base = re.sub("\\.fq$", "", base)
base = re.sub("\\.sam$", "", base)
base = re.sub("\\.bam$", "", base)
return base
# Run fastqc. Since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
shell(
"fastqc {snakemake.params} -t {snakemake.threads} "
"--outdir {tempdir:q} {snakemake.input[0]:q}"
" {log}"
)
# Move outputs into proper position.
output_base = basename_without_ext(snakemake.input[0])
html_path = path.join(tempdir, output_base + "_fastqc.html")
zip_path = path.join(tempdir, output_base + "_fastqc.zip")
if snakemake.output.html != html_path:
shell("mv {html_path:q} {snakemake.output.html:q}")
if snakemake.output.zip != zip_path:
shell("mv {zip_path:q} {snakemake.output.zip:q}")
FGBIO¶
For fgbio, the following wrappers are available:
FGBIO ANNOTATEBAMWITHUMIS¶
Annotates existing BAM files with UMIs (Unique Molecular Indices, aka Molecular IDs, Molecular barcodes) from a separate FASTQ file.
This wrapper can be used in the following way:
rule AnnotateBam:
input:
bam="mapped/{sample}.bam",
umi="umi/{sample}.fastq"
output:
"mapped/{sample}.annotated.bam"
params: ""
log:
"logs/fgbio/annotate_bam/{sample}.log"
wrapper:
"0.73.0/bio/fgbio/annotatebamwithumis"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fgbio==0.6.1
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
bam_input = snakemake.input.bam
if bam_input is None:
raise ValueError("Missing bam input file!")
elif not isinstance(bam_input, str):
raise ValueError("Input bam should be a string: " + str(bam_input) + "!")
umi_input = snakemake.input.umi
if umi_input is None:
raise ValueError("Missing input file with UMIs")
elif not isinstance(umi_input, str):
raise ValueError("Input UMIs-file should be a string: " + str(umi_input) + "!")
if not len(snakemake.output) == 1:
raise ValueError("Only one output value expected: " + str(snakemake.output) + "!")
output_file = snakemake.output[0]
if output_file is None:
raise ValueError("Missing output file!")
elif not isinstance(output_file, str):
raise ValueError("Output bam-file should be a string: " + str(output_file) + "!")
shell(
"fgbio AnnotateBamWithUmis"
" -i {bam_input}"
" -f {umi_input}"
" -o {output_file}"
" {extra_params}"
" {log}"
)
FGBIO CALLMOLECULARCONSENSUSREADS¶
Calls consensus sequences from reads with the same unique molecular tag.
This wrapper can be used in the following way:
rule ConsensusReads:
input:
"mapped/a.bam"
output:
"mapped/{sample}.m3.bam"
params:
extra="-M 3"
log:
"logs/fgbio/consensus_reads/{sample}.log"
wrapper:
"0.73.0/bio/fgbio/callmolecularconsensusreads"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fgbio==0.6.1
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
bam_input = snakemake.input[0]
if not isinstance(bam_input, str) or len(snakemake.input) != 1:
raise ValueError("Input bam should be one bam file: " + str(bam_input) + "!")
output_file = snakemake.output[0]
if not isinstance(output_file, str) or len(snakemake.output) != 1:
raise ValueError("Output should be one bam file: " + str(output_file) + "!")
shell(
"fgbio CallMolecularConsensusReads"
" -i {bam_input}"
" -o {output_file}"
" {extra_params}"
" {log}"
)
FGBIO COLLECTDUPLEXSEQMETRICS¶
Collects a suite of metrics to QC duplex sequencing data.
This wrapper can be used in the following way:
rule CollectDuplexSeqMetrics:
input:
"mapped/{sample}.gu.bam"
output:
family_sizes="stats/{sample}.family_sizes.txt",
duplex_family_sizes="stats/{sample}.duplex_family_sizes.txt",
duplex_yield_metrics="stats/{sample}.duplex_yield_metrics.txt",
umi_counts="stats/{sample}.umi_counts.txt",
duplex_qc="stats/{sample}.duplex_qc.pdf",
duplex_umi_counts="stats/{sample}.duplex_umi_counts.txt",
params:
extra=lambda wildcards: "-d " + wildcards.sample
log:
"logs/fgbio/collectduplexseqmetrics/{sample}.log"
wrapper:
"0.73.0/bio/fgbio/collectduplexseqmetrics"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fgbio==0.6.1
r-ggplot2
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
bam_input = snakemake.input[0]
family_sizes = snakemake.output.family_sizes
duplex_family_sizes = snakemake.output.duplex_family_sizes
duplex_yield_metrics = snakemake.output.duplex_yield_metrics
umi_counts = snakemake.output.umi_counts
duplex_qc = snakemake.output.duplex_qc
duplex_umi_counts = snakemake.output.get("duplex_umi_counts", None)
file_path = str(path.dirname(family_sizes))
name = str(path.basename(family_sizes)).split(".")[0]
path_name_prefix = str(path.join(file_path, name))
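# fgbio writes its metrics as <prefix>.<suffix>; the prefix (directory plus the
# first dot-separated token of the file name) is derived from output.family_sizes,
# and the checks below verify that every declared output follows that convention.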
if not family_sizes == path_name_prefix + ".family_sizes.txt":
raise Exception(
"Unexpected family_sizes path/name format, expected {}, got {}.".format(
path_name_prefix + ".family_sizes.txt", family_sizes
)
)
if not duplex_family_sizes == path_name_prefix + ".duplex_family_sizes.txt":
raise Exception(
"Unexpected duplex_family_sizes path/name format, expected {}, got {}. Note that dirname will be extracted from family_sizes variable.".format(
path_name_prefix + ".duplex_family_sizes.txt", duplex_family_sizes
)
)
if not duplex_yield_metrics == path_name_prefix + ".duplex_yield_metrics.txt":
raise Exception(
"Unexpected duplex_yield_metrics path/name format, expected {}, got {}. Note that dirname will be extracted from family_sizes variable.".format(
path_name_prefix + ".duplex_yield_metrics.txt", duplex_yield_metrics
)
)
if not umi_counts == path_name_prefix + ".umi_counts.txt":
raise Exception(
"Unexpected umi_counts path/name format, expected {}, got {}. Note that dirname will be extracted from family_sizes variable.".format(
path_name_prefix + ".umi_counts.txt", umi_counts
)
)
if not duplex_qc == path_name_prefix + ".duplex_qc.pdf":
raise Exception(
"Unexpected duplex_qc path/name format, expected {}, got {}. Note that dirname will be extracted from family_sizes variable.".format(
path_name_prefix + ".duplex_qc.pdf", duplex_qc
)
)
if (
duplex_umi_counts is not None
and not duplex_umi_counts == path_name_prefix + ".duplex_umi_counts.txt"
):
raise Exception(
"Unexpected duplex_umi_counts path/name format, expected {}, got {}. Note that dirname will be extracted from family_sizes variable.".format(
path_name_prefix + ".duplex_umi_counts.txt", duplex_umi_counts
)
)
duplex_umi_counts_flag = ""
if duplex_umi_counts is not None:
duplex_umi_counts_flag = "-u "
if not isinstance(bam_input, str) or len(snakemake.input) != 1:
raise ValueError("Input bam should be one bam file: " + str(bam_input) + "!")
shell(
"fgbio CollectDuplexSeqMetrics"
" -i {bam_input}"
" -o {path_name_prefix}"
" {duplex_umi_counts_flag}"
" {extra_params}"
" {log}"
)
FGBIO FILTERCONSENSUSREADS¶
Filters consensus reads generated by CallMolecularConsensusReads or CallDuplexConsensusReads.
This wrapper can be used in the following way:
rule FilterConsensusReads:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.filtered.bam"
params:
extra="",
min_base_quality=2,
min_reads=[2, 2, 2],
ref="genome.fasta"
log:
"logs/fgbio/filterconsensusreads/{sample}.log"
threads: 1
wrapper:
"0.73.0/bio/fgbio/filterconsensusreads"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fgbio==0.6.1
Notes¶
- min_base_quality: a single value (Int). Mask (make N) consensus bases with quality less than this threshold. (default: 5)
- min_reads: an array of Ints, max length 3, min length 1. Number of reads that need to support a UMI. For filtering bam files processed with CallMolecularConsensusReads, one value is required. Up to 3 values can be provided for bam files processed with CallDuplexConsensusReads; if fewer than 3 are provided, the last value is repeated. The first value applies to the final consensus sequence and the last two to each strand's consensus. See the sketch after this list.
- For more information, see http://fulcrumgenomics.github.io/fgbio/tools/latest/FilterConsensusReads.html
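As an illustrative sketch (the thresholds are arbitrary), filtering duplex consensus reads could use three values, with the final consensus requiring more support than each single-strand consensus:
params:
    min_reads=[3, 1, 1],   # final consensus, then each single-strand consensus
    min_base_quality=20,
    ref="genome.fasta"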
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
min_base_quality = snakemake.params.get("min_base_quality", None)
if not isinstance(min_base_quality, int):
raise ValueError("min_base_quality needs to be provided as an Int!")
min_reads = snakemake.params.get("min_reads", None)
if not isinstance(min_reads, list) or not (1 <= len(min_reads) <= 3):
raise ValueError(
"min_reads needs to be provided as list of Ints, min length 1, max length 3!"
)
ref = snakemake.params.get("ref", None)
if ref is None:
raise ValueError("A reference needs to be provided!")
bam_input = snakemake.input[0]
if not isinstance(bam_input, str) or len(snakemake.input) != 1:
raise ValueError("Input bam should be one bam file: " + str(bam_input) + "!")
bam_output = snakemake.output[0]
if not isinstance(bam_output, str) or len(snakemake.output) != 1:
raise ValueError("Output should be one bam file: " + str(bam_output) + "!")
shell(
"fgbio FilterConsensusReads"
" -i {bam_input}"
" -o {bam_output}"
" -r {ref}"
" --min-reads {min_reads}"
" --min-base-quality {min_base_quality}"
" {extra_params}"
" {log}"
)
FGBIO GROUPREADSBYUMI¶
Groups reads together that appear to have come from the same original molecule.
This wrapper can be used in the following way:
rule GroupReads:
input:
"mapped/a.bam"
output:
bam="mapped/{sample}.gu.bam",
hist="mapped/{sample}.gu.histo.tsv",
params:
extra="-s adjacency --edits 1"
log:
"logs/fgbio/group_reads/{sample}.log"
wrapper:
"0.73.0/bio/fgbio/groupreadsbyumi"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fgbio==0.6.1
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
bam_input = snakemake.input[0]
if not isinstance(bam_input, str) or len(snakemake.input) != 1:
raise ValueError("Input bam should be one bam file: " + str(bam_input) + "!")
output_bam_file = snakemake.output.bam
if not isinstance(output_bam_file, str):
    raise ValueError("Bam output should be one bam file: " + str(output_bam_file) + "!")
output_histo_file = snakemake.output.hist
if not isinstance(output_histo_file, str):
    raise ValueError(
        "Histo output should be one histogram file path: "
        + str(output_histo_file)
        + "!"
    )
shell(
"fgbio GroupReadsByUmi"
" -i {bam_input}"
" -o {output_bam_file}"
" -f {output_histo_file}"
" {extra_params}"
" {log}"
)
FGBIO SETMATEINFORMATION¶
Adds and/or fixes mate information on paired-end reads. Sets the MQ (mate mapping quality), MC (mate cigar string), ensures all mate-related flag fields are set correctly, and that the mate reference and mate start position are correct.
This wrapper can be used in the following way:
rule SetMateInfo:
input:
"mapped/a.bam"
output:
"mapped/{sample}.mi.bam"
params: ""
log:
"logs/fgbio/set_mate_info/{sample}.log"
wrapper:
"0.73.0/bio/fgbio/setmateinformation"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
fgbio==0.6.1
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
bam_input = snakemake.input[0]
if not isinstance(bam_input, str) or len(snakemake.input) != 1:
raise ValueError("Input bam should be one bam file: " + str(bam_input) + "!")
output_file = snakemake.output[0]
if not isinstance(output_file, str) or len(snakemake.output) != 1:
raise ValueError("Output should be one bam file: " + str(output_file) + "!")
shell(
"fgbio SetMateInformation"
" -i {bam_input}"
" -o {output_file}"
" {extra_params}"
" {log}"
)
FILTLONG¶
Quality filtering tool for long reads.
Example¶
This wrapper can be used in the following way:
rule filtlong:
input:
reads = "{sample}.fastq"
output:
"{sample}.filtered.fastq"
params:
extra=" --mean_q_weight 5.0",
target_bases = 10
log:
"logs/filtlong/test/{sample}.log"
wrapper:
"0.73.0/bio/filtlong"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
filtlong=0.2.0=he941832_2
Authors¶
- Michael Hall
Code¶
"""Snakemake wrapper for filtlong."""
__author__ = "Michael Hall"
__copyright__ = "Copyright 2019, Michael Hall"
__email__ = "michael@mbh.sh"
__license__ = "MIT"
from snakemake.shell import shell
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
target_bases = int(snakemake.params.get("target_bases", 0))
if target_bases > 0:
extra += " --target_bases {}".format(target_bases)
# Format the log redirection string
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Execute the shell command
shell("filtlong {extra}" " {snakemake.input.reads} > {snakemake.output} {log}")
FREEBAYES¶
Call small genomic variants with freebayes.
Example¶
This wrapper can be used in the following way:
rule freebayes:
input:
ref="genome.fasta",
# you can have a list of samples here
samples="mapped/{sample}.bam",
        # the matching BAI indexes have to be present for freebayes
indexes="mapped/{sample}.bam.bai"
# optional BED file specifying chromosomal regions on which freebayes
# should run, e.g. all regions that show coverage
#regions="/path/to/region-file.bed"
output:
"calls/{sample}.vcf" # either .vcf or .bcf
log:
"logs/freebayes/{sample}.log"
params:
extra="", # optional parameters
chunksize=100000, # reference genome chunk size for parallelization (default: 100000)
normalize=False, # flag to use bcftools norm to normalize indels
threads: 2
wrapper:
"0.73.0/bio/freebayes"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
freebayes=1.3.1
bcftools=1.11
parallel=20190522
bedtools>=2.29
sed=4.7
Authors¶
- Johannes Köster
- Felix Mölder
Code¶
__author__ = "Johannes Köster, Felix Mölder, Christopher Schröder"
__copyright__ = "Copyright 2017, Johannes Köster"
__email__ = "johannes.koester@protonmail.com, felix.moelder@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
params = snakemake.params.get("extra", "")
norm = snakemake.params.get("normalize", False)
assert norm in [True, False]
pipe = ""
if snakemake.output[0].endswith(".bcf"):
if norm:
pipe = "| bcftools norm -Ob -"
else:
pipe = "| bcftools view -Ob -"
elif norm:
pipe = "| bcftools norm -"
if snakemake.threads == 1:
freebayes = "freebayes"
else:
chunksize = snakemake.params.get("chunksize", 100000)
regions = (
"<(fasta_generate_regions.py {snakemake.input.ref}.fai {chunksize})".format(
snakemake=snakemake, chunksize=chunksize
)
)
if snakemake.input.get("regions", ""):
regions = (
"<(bedtools intersect -a "
r"<(sed 's/:\([0-9]*\)-\([0-9]*\)$/\t\1\t\2/' "
"{regions}) -b {snakemake.input.regions} | "
r"sed 's/\t\([0-9]*\)\t\([0-9]*\)$/:\1-\2/')"
).format(regions=regions, snakemake=snakemake)
freebayes = ("freebayes-parallel {regions} {snakemake.threads}").format(
snakemake=snakemake, regions=regions
)
shell(
"({freebayes} {params} -f {snakemake.input.ref}"
" {snakemake.input.samples} {pipe} > {snakemake.output[0]}) {log}"
)
GATK¶
For gatk, the following wrappers are available:
GATK APPLYBQSR¶
Run gatk ApplyBQSR.
This wrapper can be used in the following way:
rule gatk_applybqsr:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
dict="genome.dict",
recal_table="recal/{sample}.grp"
output:
bam="recal/{sample}.bam"
log:
"logs/gatk/gatk_applybqsr/{sample}.log"
params:
extra="", # optional
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/applybqsr"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
openjdk=8
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options; see the sketch after this list.
- The extra param allows for additional program arguments for ApplyBQSR.
- For more information, see https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR
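As a hedged illustration (the heap size and GC thread count are arbitrary), the two params could be set like this:
params:
    extra="",  # no additional ApplyBQSR arguments in this sketch
    java_opts="-Xmx4G -XX:ParallelGCThreads=10"  # 4 GB heap, 10 GC threads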
Authors¶
- Christopher Schröder
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroeder@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
shell(
"gatk --java-options '{java_opts}' ApplyBQSR {extra} "
"-R {snakemake.input.ref} -I {snakemake.input.bam} "
"--bqsr-recal-file {snakemake.input.recal_table} "
"-O {snakemake.output.bam} {log}"
)
GATK BASERECALIBRATOR¶
Run gatk BaseRecalibrator.
This wrapper can be used in the following way:
rule gatk_baserecalibrator:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
dict="genome.dict",
known="dbsnp.vcf.gz" # optional known sites - single or a list
output:
recal_table="recal/{sample}.grp"
log:
"logs/gatk/baserecalibrator/{sample}.log"
params:
extra="", # optional
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/baserecalibrator"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
openjdk=8
snakemake-wrapper-utils==0.1.3
Input/Output¶
Input:
- bam file
- fasta reference
- vcf.gz of known variants
Output:
- recalibration table for the bam
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
Authors¶
- Christopher Schröder
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroeder@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
known = snakemake.input.get("known", "")
if known:
if isinstance(known, str):
known = [known]
known = list(map("--known-sites {}".format, known))
shell(
"gatk --java-options '{java_opts}' BaseRecalibrator {extra} "
"-R {snakemake.input.ref} -I {snakemake.input.bam} "
"-O {snakemake.output.recal_table} {known} {log}"
)
GATK BASERECALIBRATORSPARK¶
Run gatk BaseRecalibratorSpark.
This wrapper can be used in the following way:
rule gatk_baserecalibratorspark:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
dict="genome.dict",
known="dbsnp.vcf.gz" # optional known sites
output:
recal_table="recal/{sample}.grp"
log:
"logs/gatk/baserecalibrator/{sample}.log"
params:
extra="", # optional
java_opts="", # optional
#spark_runner="", # optional, local by default
        #spark_master="", # optional
#spark_extra="", # optional
resources:
mem_mb=1024
threads: 8
wrapper:
"0.73.0/bio/gatk/baserecalibratorspark"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
openjdk=8
snakemake-wrapper-utils==0.1.3
Input/Output¶
Input:
- bam file
- fasta reference
- vcf.gz of known variants
Output:
- recalibration table for the bam
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments for BaseRecalibratorSpark.
- The spark_runner param (“LOCAL”, “SPARK”, or “GCS”) selects the Spark runner. Set it to “LOCAL”, or leave it unset, to run on the local machine.
- The spark_master param sets the URL of the Spark master to which the job is submitted. Set it to “local[number_of_cores]” for local execution, or leave it unset to run locally with the number of cores determined by Snakemake.
- The spark_extra param allows for additional Spark arguments.
- For more information, see https://gatk.broadinstitute.org/hc/en-us/articles/360036897372-BaseRecalibratorSpark-BETA- (the Spark params are sketched after this list).
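A hedged sketch of the Spark-related params (the 4-core master URL is illustrative; if spark_master is unset, the wrapper defaults to local[threads]):
params:
    spark_runner="LOCAL",      # run Spark on the local machine
    spark_master="local[4]",   # submit to a local master with 4 cores
    spark_extra=""             # no additional Spark arguments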
Authors¶
- Christopher Schröder
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroeder@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
spark_runner = snakemake.params.get("spark_runner", "LOCAL")
spark_master = snakemake.params.get(
"spark_master", "local[{}]".format(snakemake.threads)
)
spark_extra = snakemake.params.get("spark_extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
known = snakemake.input.get("known", "")
if known:
known = "--known-sites {}".format(known)
shell(
"gatk --java-options '{java_opts}' BaseRecalibratorSpark {extra} "
"-R {snakemake.input.ref} -I {snakemake.input.bam} "
"-O {snakemake.output.recal_table} {known} "
"-- --spark-runner {spark_runner} --spark-master {spark_master} {spark_extra} "
"{log}"
)
GATK COMBINEGVCFS¶
Run gatk CombineGVCFs.
This wrapper can be used in the following way:
rule combine_gvcfs:
input:
gvcfs=["calls/a.g.vcf", "calls/b.g.vcf"],
ref="genome.fasta"
output:
gvcf="calls/all.g.vcf",
log:
"logs/gatk/combinegvcfs.log"
params:
extra="", # optional
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/combinegvcfs"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
Authors¶
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
gvcfs = list(map("-V {}".format, snakemake.input.gvcfs))
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' CombineGVCFs {extra} "
"{gvcfs} "
"-R {snakemake.input.ref} "
"-O {snakemake.output.gvcf} {log}"
)
GATK FILTERMUTECTCALLS¶
Run gatk FilterMutectCalls.
This wrapper can be used in the following way:
rule gatk_filtermutectcalls:
input:
vcf="calls/snvs.vcf",
ref="genome.fasta",
output:
vcf="calls/snvs.mutect.filtered.vcf",
log:
"logs/gatk/filter/snvs.log",
params:
extra="--max-alt-allele-count 3", # optional arguments, see GATK docs
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024,
wrapper:
"0.73.0/bio/gatk/filtermutectcalls"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2021, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' FilterMutectCalls "
"-R {snakemake.input.ref} -V {snakemake.input.vcf} "
"{extra} "
"-O {snakemake.output.vcf} "
"{log}"
)
GATK GENOMICSDBIMPORT¶
Run gatk GenomicsDBImport.
This wrapper can be used in the following way:
rule genomics_db_import:
input:
gvcfs=["calls/a.g.vcf.gz", "calls/b.g.vcf.gz"],
output:
db=directory("db"),
log:
"logs/gatk/genomicsdbimport.log"
params:
intervals="ref",
db_action="create", # optional
extra="", # optional
java_opts="", # optional
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/genomicsdbimport"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.2.0.0
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-XX:ParallelGCThreads=10” (not for -Xmx or -Djava.io.tmpdir, since they are handled automatically).
- The intervals param is mandatory.
- By default, the wrapper will create a new database (the output directory must be empty or non-existent). If you want to update an existing DB, set the db_action param to update, as sketched after this list.
- The extra param allows for additional program arguments.
- For more information, see https://gatk.broadinstitute.org/hc/en-us/articles/360051305591-GenomicsDBImport
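A hedged sketch of the update case (paths and intervals are illustrative):
params:
    intervals="ref",       # mandatory
    db_action="update",    # add new gvcfs to an existing workspace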
Authors¶
- Filipe G. Vieira
Code¶
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
gvcfs = list(map("--variant {}".format, snakemake.input.gvcfs))
db_action = snakemake.params.get("db_action", "create")
if db_action == "create":
db_action = "--genomicsdb-workspace-path"
elif db_action == "update":
db_action = "--genomicsdb-update-workspace-path"
else:
raise ValueError(
"invalid option provided to 'params.db_action'; please choose either 'create' or 'update'."
)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' GenomicsDBImport {extra} "
"{gvcfs} "
"--intervals {snakemake.params.intervals} "
"{db_action} {snakemake.output.db} {log}"
)
GATK GENOTYPEGVCFS¶
Run gatk GenotypeGVCFs.
This wrapper can be used in the following way:
rule genotype_gvcfs:
input:
gvcf="calls/all.g.vcf", # combined gvcf over multiple samples
# N.B. gvcf or genomicsdb must be specified
# in the latter case, this is a GenomicsDB data store
ref="genome.fasta"
output:
vcf="calls/all.vcf",
log:
"logs/gatk/genotypegvcfs.log"
params:
extra="", # optional
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/genotypegvcfs"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.2.0.0
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- Either input.gvcf or input.genomicsdb must be provided; the GenomicsDB variant is sketched after this list.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
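A hedged sketch of the GenomicsDB alternative (the workspace path is illustrative; the wrapper prefixes it with gendb:// when invoking GATK):
input:
    genomicsdb="db",     # workspace created by GenomicsDBImport
    ref="genome.fasta"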
Authors¶
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
interval_file = snakemake.input.get("interval_file", "")
if interval_file:
interval_file = "-L {}".format(interval_file)
dbsnp = snakemake.input.get("known", "")
if dbsnp:
dbsnp = "-D {}".format(dbsnp)
# Allow for either an input gvcf or GenomicsDB
gvcf = snakemake.input.get("gvcf", "")
genomicsdb = snakemake.input.get("genomicsdb", "")
if gvcf:
if genomicsdb:
raise Exception("Only input.gvcf or input.genomicsdb expected, got both.")
input_string = gvcf
else:
if genomicsdb:
input_string = "gendb://{}".format(genomicsdb)
else:
raise Exception("Expected input.gvcf or input.genomicsdb.")
tmp_dir = snakemake.params.get("tmp_dir", "")
if tmp_dir:
tmp_dir = "--tmp-dir={}".format(tmp_dir)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' GenotypeGVCFs {extra} "
"-V {input_string} "
"-R {snakemake.input.ref} "
"{dbsnp} "
"{interval_file} "
"{tmp_dir} "
"-O {snakemake.output.vcf} {log}"
)
GATK HAPLOTYPECALLER¶
Run gatk HaplotypeCaller.
This wrapper can be used in the following way:
rule haplotype_caller:
input:
# single or list of bam files
bam="mapped/{sample}.bam",
ref="genome.fasta"
# known="dbsnp.vcf" # optional
output:
gvcf="calls/{sample}.g.vcf",
# bam="{sample}.assemb_haplo.bam",
log:
"logs/gatk/haplotypecaller/{sample}.log"
params:
extra="", # optional
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/haplotypecaller"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- input.bam may be a single bam file or a list; see the sketch after this list.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
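Since each bam is passed to HaplotypeCaller via its own -I flag, a hedged sketch with two illustrative bams:
input:
    bam=["mapped/{sample}.rep1.bam", "mapped/{sample}.rep2.bam"],
    ref="genome.fasta"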
Authors¶
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
known = snakemake.input.get("known", "")
if known:
known = "--dbsnp " + str(known)
bam_output = snakemake.output.get("bam", "")
if bam_output:
bam_output = "--bam-output " + str(bam_output)
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
bams = snakemake.input.bam
if isinstance(bams, str):
bams = [bams]
bams = list(map("-I {}".format, bams))
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' HaplotypeCaller {extra} "
"-R {snakemake.input.ref} {bams} "
"-ERC GVCF {bam_output} "
"-O {snakemake.output.gvcf} {known} {log}"
)
GATK MARKDUPLICATESSPARK¶
Spark implementation of Picard MarkDuplicates that allows the tool to be run in parallel on multiple cores of a local machine, or on multiple machines of a Spark cluster, while still matching the output of the non-Spark Picard version of the tool. Since the tool requires holding all of the read names in memory while it groups read information, machine configuration and starting sort order impact tool performance.
This wrapper can be used in the following way:
rule mark_duplicates_spark:
input:
"mapped/{sample}.bam"
output:
bam="dedup/{sample}.bam",
metrics="dedup/{sample}.metrics.txt"
log:
"logs/dedup/{sample}.log"
params:
extra="--remove-sequencing-duplicates", # optional
java_opts="", # optional
#spark_runner="", # optional, local by default
        #spark_master="", # optional
#spark_extra="", # optional
resources:
mem_mb=1024
threads: 8
wrapper:
"0.73.0/bio/gatk/markduplicatesspark"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.2.0.0
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments for MarkDuplicatesSpark.
- The spark_runner param (“LOCAL”, “SPARK”, or “GCS”) selects the Spark runner. Set it to “LOCAL”, or leave it unset, to run on the local machine.
- The spark_master param sets the URL of the Spark master to which the job is submitted. Set it to “local[number_of_cores]” for local execution, or leave it unset to run locally with the number of cores determined by Snakemake.
- The spark_extra param allows for additional Spark arguments.
- For more information, see https://gatk.broadinstitute.org/hc/en-us/articles/360050814112-MarkDuplicatesSpark
Authors¶
- Filipe G. Vieira
Code¶
__author__ = "Fillipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
spark_runner = snakemake.params.get("spark_runner", "LOCAL")
spark_master = snakemake.params.get(
"spark_master", "local[{}]".format(snakemake.threads)
)
spark_extra = snakemake.params.get("spark_extra", "")
java_opts = get_java_opts(snakemake)
metrics = snakemake.output.get("metrics", "")
if metrics:
metrics = f"--metrics-file {metrics}"
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' MarkDuplicatesSpark "
"{extra} "
"--input {snakemake.input} "
"--output {snakemake.output.bam} "
"{metrics} "
"-- --spark-runner {spark_runner} --spark-master {spark_master} {spark_extra} "
"{log}"
)
GATK MUTECT2¶
Call somatic SNVs and indels via local assembly of haplotypes.
This wrapper can be used in the following way:
rule mutect2:
input:
fasta = "genome/genome.fasta",
map = "mapped/{sample}.bam"
output:
vcf = "variant/{sample}.vcf"
message:
"Testing Mutect2 with {wildcards.sample}"
threads:
1
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
log:
"logs/mutect_{sample}.log"
wrapper:
"0.73.0/bio/gatk/mutect"
rule mutect2_bam:
input:
fasta = "genome/genome.fasta",
map = "mapped/{sample}.bam"
output:
vcf = "variant_bam/{sample}.vcf",
bam = "variant_bam/{sample}.bam"
message:
"Testing Mutect2 with {wildcards.sample}"
threads:
1
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
log:
"logs/mutect_{sample}.log"
wrapper:
"0.73.0/bio/gatk/mutect"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Authors¶
- Thibault Dayris
Code¶
"""Snakemake wrapper for GATK4 Mutect2"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake.utils import makedirs
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
bam_output = "--bam-output"
if snakemake.output.get("bam", None) is not None:
bam_output = bam_output + " " + snakemake.output.bam
else:
bam_output = ""
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
shell(
"gatk --java-options '{java_opts}' Mutect2 " # Tool and its subprocess
"--input {snakemake.input.map} " # Path to input mapping file
"{bam_output} " # Path to output bam file, optional
"--output {snakemake.output.vcf} " # Path to output vcf file
"--reference {snakemake.input.fasta} " # Path to reference fasta file
"{extra} " # Extra parameters
"{log}" # Logging behaviour
)
GATK SELECTVARIANTS¶
Run gatk SelectVariants.
This wrapper can be used in the following way:
rule gatk_select:
input:
vcf="calls/all.vcf",
ref="genome.fasta",
output:
vcf="calls/snvs.vcf"
log:
"logs/gatk/select/snvs.log"
params:
extra="--select-type-to-include SNP", # optional filter arguments, see GATK docs
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/selectvariants"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
Authors¶
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' SelectVariants -R {snakemake.input.ref} -V {snakemake.input.vcf} "
"{extra} -O {snakemake.output.vcf} {log}"
)
GATK SPLITNCIGARREADS¶
Run gatk SplitNCigarReads.
This wrapper can be used in the following way:
rule splitncigarreads:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta"
output:
"split/{sample}.bam"
log:
"logs/gatk/splitNCIGARreads/{sample}.log"
params:
extra="", # optional
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/splitncigarreads"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
Authors¶
- Jan Forster
Code¶
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "jan.forster@uk-essen.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' SplitNCigarReads {extra} "
" -R {snakemake.input.ref} -I {snakemake.input.bam} "
"-O {snakemake.output} {log}"
)
GATK VARIANTEVAL¶
Run gatk VariantEval.
This wrapper can be used in the following way:
rule gatk_varianteval:
input:
vcf="calls/snvs.vcf",
ref="genome.fasta",
dict="genome.dict",
# comp="calls/comp.vcf", # optional comparison VCF
output:
vcf="snvs.varianteval.grp"
log:
"logs/gatk/varianteval/snvs.log"
params:
extra="", # optional arguments, see GATK docs
java_opts="", # optional
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/varianteval"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.2.0.0
snakemake-wrapper-utils==0.1.3
Input/Output¶
Input:
- vcf files
- BAM/CRAM files (optional)
- reference genome (optional)
- reference dictionary (optional)
- vcf.gz of known variants (optional)
- PED (pedigree) file (optional)
Output:
- Evaluation tables detailing the results of the eval modules on the VCF file
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-XX:ParallelGCThreads=10” (not for -Xmx or -Djava.io.tmpdir, since they are handled automatically).
- The extra param allows for additional program arguments.
- For more information, see https://gatk.broadinstitute.org/hc/en-us/articles/360056967892-VariantEval-BETA-
Authors¶
- Filipe G. Vieira
Code¶
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
vcf = snakemake.input.vcf
if isinstance(vcf, str):
vcf = "--eval {}".format(vcf)
else:
vcf = list(map("--eval {}".format, vcf))
bam = snakemake.input.get("bam", "")
if bam:
if isinstance(bam, str):
bam = "--input {}".format(bam)
else:
bam = list(map("--input {}".format, bam))
ref = snakemake.input.get("ref", "")
if ref:
ref = "--reference " + ref
ref_dict = snakemake.input.get("dict", "")
if ref_dict:
ref_dict = "--sequence-dictionary " + ref_dict
known = snakemake.input.get("known", "")
if known:
known = "--dbsnp " + known
comp = snakemake.input.get("comp", "")
if comp:
if isinstance(comp, str):
comp = "--comparison {}".format(comp)
else:
comp = list(map("--comparison {}".format, comp))
ped = snakemake.input.get("ped", "")
if ped:
ped = "--pedigree " + ped
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' VariantEval "
"{vcf} "
"{bam} "
"{ref} "
"{ref_dict} "
"{known} "
"{ped} "
"{comp} "
"{extra} --output {snakemake.output[0]} {log}"
)
GATK VARIANTFILTRATION¶
Run gatk VariantFiltration.
This wrapper can be used in the following way:
rule gatk_filter:
input:
vcf="calls/snvs.vcf",
ref="genome.fasta",
output:
vcf="calls/snvs.filtered.vcf"
log:
"logs/gatk/filter/snvs.log"
params:
filters={"myfilter": "AB < 0.2 || MQ0 > 50"},
extra="", # optional arguments, see GATK docs
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/variantfiltration"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- Each entry of params.filters becomes a --filter-name/--filter-expression pair; see the sketch after this list.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
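A hedged sketch with two named filters (the expressions are illustrative):
params:
    filters={
        "lowAB": "AB < 0.2",
        "highMQ0": "MQ0 > 50",
    }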
Authors¶
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
filters = [
"--filter-name {} --filter-expression '{}'".format(name, expr.replace("'", "\\'"))
for name, expr in snakemake.params.filters.items()
]
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' VariantFiltration -R {snakemake.input.ref} -V {snakemake.input.vcf} "
"{extra} {filters} -O {snakemake.output.vcf} {log}"
)
GATK VARIANTRECALIBRATOR¶
Run gatk VariantRecalibrator.
This wrapper can be used in the following way:
from snakemake.remote import GS
# GATK resource bundle files can be either directly obtained from google storage (like here), or
# from FTP. You can also use local files.
GS = GS.RemoteProvider()
def gatk_bundle(f):
return GS.remote("genomics-public-data/resources/broad/hg38/v0/{}".format(f))
rule variantrecalibrator:
input:
vcf="calls/all.vcf",
ref="genome.fasta",
# resources have to be given as named input files
hapmap=gatk_bundle("hapmap_3.3.hg38.sites.vcf.gz"),
omni=gatk_bundle("1000G_omni2.5.hg38.sites.vcf.gz"),
g1k=gatk_bundle("1000G_phase1.snps.high_confidence.hg38.vcf.gz"),
dbsnp=gatk_bundle("Homo_sapiens_assembly38.dbsnp138.vcf.gz"),
        # use aux to e.g. download other necessary files
aux=[gatk_bundle("hapmap_3.3.hg38.sites.vcf.gz.tbi"),
gatk_bundle("1000G_omni2.5.hg38.sites.vcf.gz.tbi"),
gatk_bundle("1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi"),
gatk_bundle("Homo_sapiens_assembly38.dbsnp138.vcf.gz.tbi")]
output:
vcf="calls/all.recal.vcf",
tranches="calls/all.tranches"
log:
"logs/gatk/variantrecalibrator.log"
params:
mode="SNP", # set mode, must be either SNP, INDEL or BOTH
# resource parameter definition. Key must match named input files from above.
resources={"hapmap": {"known": False, "training": True, "truth": True, "prior": 15.0},
"omni": {"known": False, "training": True, "truth": False, "prior": 12.0},
"g1k": {"known": False, "training": True, "truth": False, "prior": 10.0},
"dbsnp": {"known": True, "training": False, "truth": False, "prior": 2.0}},
annotation=["QD", "FisherStrand"], # which fields to use with -an (see VariantRecalibrator docs)
extra="", # optional
java_opts="", # optional
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/gatk/haplotypecaller"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk4==4.1.4.1
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- Each entry of params.resources is rendered into a --resource flag; see the illustration after this list.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
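For reference, a hedged illustration of how the hapmap entry above would be rendered on the command line (file path abbreviated):
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.sites.vcf.gz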
Authors¶
- Johannes Köster
- Jake VanCampen
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
def fmt_res(resname, resparams):
fmt_bool = lambda b: str(b).lower()
    f = snakemake.input.get(resname)
    if f is None:
        # input.get() returns None (rather than raising) for a missing name
        raise RuntimeError(
            "There must be a named input file for every resource (missing: {})".format(
                resname
            )
        )
return "{},known={},training={},truth={},prior={} {}".format(
resname,
fmt_bool(resparams["known"]),
fmt_bool(resparams["training"]),
fmt_bool(resparams["truth"]),
resparams["prior"],
f,
)
resources = [
"--resource:{}".format(fmt_res(resname, resparams))
for resname, resparams in snakemake.params["resources"].items()
]
annotation = list(map("-an {}".format, snakemake.params.annotation))
tranches = ""
if snakemake.output.tranches:
tranches = "--tranches-file " + snakemake.output.tranches
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk --java-options '{java_opts}' VariantRecalibrator {extra} {resources} "
"-R {snakemake.input.ref} -V {snakemake.input.vcf} "
"-mode {snakemake.params.mode} "
"--output {snakemake.output.vcf} "
"{tranches} {annotation} {log}"
)
GATK3¶
For gatk3, the following wrappers are available:
GATK3 BASERECALIBRATOR¶
Run gatk3 BaseRecalibrator.
This wrapper can be used in the following way:
rule baserecalibrator:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
known="dbsnp.vcf.gz"
output:
"{sample}.recal_data_table"
log:
"logs/gatk3/bqsr/{sample}.log"
params:
extra="" # optional
resources:
mem_mb = 1024
threads: 16
wrapper:
"bio/gatk/baserecalibrator"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
gatk==3.8
snakemake-wrapper-utils==0.1.3
Notes¶
- The java_opts param allows for additional arguments to be passed to the Java VM, e.g. “-Xmx4G” for one, and “-Xmx4G -XX:ParallelGCThreads=10” for two options.
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
- Gatk3.jar is not included in the bioconda package, i.e. it needs to be added to the conda environment manually; see the sketch after this list.
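One common way to do this, assuming you have downloaded GenomeAnalysisTK.jar from the Broad and that the gatk3-register helper shipped with the bioconda gatk package is on your PATH:
gatk3-register /path/to/GenomeAnalysisTK.jar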
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
input_bam = snakemake.input.bam
input_known = snakemake.input.known
input_ref = snakemake.input.ref
bed = snakemake.params.get("bed", None)
if bed is not None:
bed = "-L " + bed
else:
bed = ""
input_known_string = ""
for known in input_known:
input_known_string = input_known_string + " --knownSites {}".format(known)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk3 {java_opts} -T BaseRecalibrator"
" -nct {snakemake.threads}"
" {extra}"
" -I {input_bam}"
" -R {input_ref}"
" {input_known_string}"
" {bed}"
" -o {snakemake.output}"
" {log}"
)
GATK3 INDELREALIGNER¶
Run gatk3 IndelRealigner.
This wrapper can be used in the following way:
rule indelrealigner:
input:
bam="mapped/{sample}.bam",
bai="mapped/{sample}.bai",
ref="genome.fasta",
known="dbsnp.vcf.gz",
known_idx="dbsnp.vcf.gz.tbi",
target_intervals="{sample}.intervals"
output:
bam="realigned/{sample}.bam",
bai="realigned/{sample}.bai",
java_temp=temp(directory("/tmp/gatk3_indelrealigner/{sample}")),
log:
"logs/gatk3/indelrealigner/{sample}.log"
params:
extra="" # optional
threads: 16
resources:
mem_mb = 1024
wrapper:
"bio/gatk/indelrealigner"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gatk==3.8
snakemake-wrapper-utils==0.1.3
Input:
- bam file
- vcf files
- reference genome
- target intervals to realign
- bed file (optional)
Output:
- indel realigned bam file
- indel realigned bai file (optional)
- temp dir (optional)
- The java_opts param allows for additional arguments to be passed to the Java Virtual Machine, e.g. “-XX:ParallelGCThreads=10” (the memory limit is automatically inferred from resources, and the temp dir from output.java_temp).
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
- The GATK3 jar file is not included in the bioconda package, i.e. it needs to be added to the conda environment manually.
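Omitting the bai entry from the output switches off on-the-fly indexing; the wrapper then appends --disable_bam_indexing. A sketch of the output block only:
output:
    bam="realigned/{sample}.bam"  # no bai entry, so --disable_bam_indexing is added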
- Patrik Smeds
- Filipe G. Vieira
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
input_known = snakemake.input.known
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
bed = snakemake.input.get("bed", None)
if bed is not None:
bed = "-L " + bed
else:
bed = ""
input_known_string = ""
for known in input_known:
input_known_string = input_known_string + " -known {}".format(known)
output_bai = snakemake.output.get("bai", None)
if output_bai is None:
extra += " --disable_bam_indexing"
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk3 {java_opts} -T IndelRealigner"
" {extra}"
" -I {snakemake.input.bam}"
" -R {snakemake.input.ref}"
" {input_known_string}"
" {bed}"
" --targetIntervals {snakemake.input.target_intervals}"
" -o {snakemake.output.bam}"
" {log}"
)
GATK3 PRINTREADS¶
Run gatk3 PrintReads.
This wrapper can be used in the following way:
rule printreads:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
recal_data="{sample}.recal_data_table"
output:
"alignment/{sample}.bqsr.bam"
log:
"logs/gatk/bqsr/{sample}..log"
params:
extra="" # optional
resources:
mem_mb = 1024
threads: 16
wrapper:
"bio/gatk3/printreads"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gatk==3.8
snakemake-wrapper-utils==0.1.3
- The java_opts param allows for additional arguments to be passed to the Java Virtual Machine, e.g. “-Xmx4G” for one option, or “-Xmx4G -XX:ParallelGCThreads=10” for two.
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
- The GATK3 jar file is not included in the bioconda package, i.e. it needs to be added to the conda environment manually.
- Patrik Smeds
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
input_bam = snakemake.input.bam
input_recal_data = snakemake.input.recal_data
input_ref = snakemake.input.ref
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk3 {java_opts} -T PrintReads"
" {extra}"
" -I {input_bam}"
" -R {input_ref}"
" -BQSR {input_recal_data}"
" -o {snakemake.output}"
" {log}"
)
GATK3 REALIGNERTARGETCREATOR¶
Run gatk3 RealignerTargetCreator.
This wrapper can be used in the following way:
rule realignertargetcreator:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
known="dbsnp.vcf.gz",
output:
intervals="{sample}.intervals",
java_temp=temp(directory("gatk3_indelrealigner/{sample}")),
log:
"logs/gatk/realignertargetcreator/{sample}.log",
params:
extra="", # optional
resources:
mem_mb=1024,
threads: 16
wrapper:
"bio/gatk3/realignertargetcreator"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gatk==3.8
snakemake-wrapper-utils==0.1.3
Input:
- bam file
- vcf files
- reference genome
- bed file (optional)
Output:
- target intervals
- temp dir (optional)
- The java_opts param allows for additional arguments to be passed to the Java Virtual Machine, e.g. “-XX:ParallelGCThreads=10” (the memory limit is automatically inferred from resources, and the temp dir from output.java_temp).
- The extra param allows for additional program arguments.
- For more information, see https://software.broadinstitute.org/gatk/documentation/article?id=11050
- The GATK3 jar file is not included in the bioconda package, i.e. it needs to be added to the conda environment manually.
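The optional BED file is given as a named input and restricts target creation via -L. A sketch of the input block (the interval file name is hypothetical):
input:
    bam="mapped/{sample}.bam",
    ref="genome.fasta",
    known="dbsnp.vcf.gz",
    bed="capture_regions.bed"  # hypothetical interval file, passed to gatk3 as -L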
- Patrik Smeds
- Filipe G. Vieira
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
input_known = snakemake.input.known
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
bed = snakemake.input.get("bed", None)
if bed is not None:
bed = "-L " + bed
else:
bed = ""
input_known_string = ""
for known in input_known:
input_known_string = input_known_string + " -known {}".format(known)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"gatk3 {java_opts} -T RealignerTargetCreator"
" -nt {snakemake.threads}"
" {extra}"
" -I {snakemake.input.bam}"
" -R {snakemake.input.ref}"
" {input_known_string}"
" {bed}"
" -o {snakemake.output.intervals}"
" {log}"
)
GDC-API¶
For gdc-api, the following wrappers are available:
GDC API-BASED DATA DOWNLOAD OF BAM SLICES¶
Download slices of GDC BAM files using curl and the GDC API for BAM Slicing.
This wrapper can be used in the following way:
rule gdc_api_bam_slice_download:
output:
bam="raw/{sample}.bam",
log:
"logs/gdc-api/bam-slicing/{sample}.log"
params:
# to use this rule flexibly, make uuid a function that maps your
# sample names of choice to the UUIDs they correspond to (they are
# the column `id` in the GDC manifest files, which can be used to
# systematically construct sample sheets)
uuid="092c8a6d-aad5-41bf-b186-e68e613c0e89",
# a gdc_token is required for controlled access and all BAM files
# on GDC seem to be controlled access (adjust if this changes)
gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
# provide wanted `region=` or `gencode=` slices joined with `&`
slices="region=chr22®ion=chr5:1000-2000®ion=unmapped&gencode=BRCA2",
# extra command line arguments passed to curl
extra=""
wrapper:
"0.73.0/bio/gdc-api/bam-slicing"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
curl==7.69.1
- BAM file UUIDs can be found via the GDC repository query, either by clicking on individual files or systematically by creating a cart and downloading a manifest file.
- Slicing can be performed using region syntax like ‘region=chr20:3000-4000’, gene name syntax like ‘gencode=BRCA2’ (this uses gene symbols of GENCODE v22) or ‘region=unmapped’ to get unmapped reads. Multiple such entries can be joined with ampersands, e.g. region=chr5:200-300&region=unmapped&gencode=BRCA1.
- All BAM data files in GDC are controlled access according to this GDC repository query, thus a GDC access token file is always required and must be provided via params: gdc_token="path/to/access_token.txt". Should this change in the future, feel free to adjust this wrapper or contact the original author.
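As the params comments above suggest, uuid can be a function of the wildcards. A minimal sketch, assuming a hypothetical tab-separated sample sheet with columns sample and id:
import pandas as pd

# hypothetical sample sheet mapping sample names to GDC UUIDs
samples = pd.read_csv("samples.tsv", sep="\t").set_index("sample")

rule gdc_api_bam_slice_download:
    output:
        bam="raw/{sample}.bam",
    log:
        "logs/gdc-api/bam-slicing/{sample}.log"
    params:
        uuid=lambda wildcards: samples.loc[wildcards.sample, "id"],
        gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
        slices="region=chr22&gencode=BRCA2",
    wrapper:
        "0.73.0/bio/gdc-api/bam-slicing"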
- David Lähnemann
__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
import os
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
uuid = snakemake.params.get("uuid", "")
if uuid == "":
raise ValueError("You need to provide a GDC UUID via the 'uuid' in 'params'.")
token_file = snakemake.params.get("gdc_token", "")
if token_file == "":
raise ValueError(
"You need to provide a GDC data access token file via the 'token' in 'params'."
)
token = ""
with open(token_file) as tf:
token = tf.read()
os.environ["CURL_HEADER_TOKEN"] = "'X-Auth-Token: {}'".format(token)
slices = snakemake.params.get("slices", "")
if slices == "":
raise ValueError(
"You need to provide 'region=chr1:1000-2000' or 'gencode=BRCA2' slice(s) via the 'slices' in 'params'."
)
extra = snakemake.params.get("extra", "")
shell(
"curl --silent"
" --header $CURL_HEADER_TOKEN"
" 'https://api.gdc.cancer.gov/slicing/view/{uuid}?{slices}'"
" {extra}"
" --output {snakemake.output.bam} {log}"
)
if os.path.getsize(snakemake.output.bam) < 100000:
with open(snakemake.output.bam) as f:
if "error" in f.read():
shell("cat {snakemake.output.bam} {log}")
raise RuntimeError(
"Your GDC API request returned an error, check your log file for the error message."
)
GDC-CLIENT¶
For gdc-client, the following wrappers are available:
GDC DATA TRANSFER TOOL DATA DOWNLOAD¶
Download GDC data files with the gdc-client.
This wrapper can be used in the following way:
rule gdc_download:
output:
# the file extension (up to two components, here .maf.gz), has
# to uniquely map to one of the files downloaded for that UUID
"raw/{sample}.maf.gz"
log:
"logs/gdc-client/download/{sample}.log"
params:
# to use this rule flexibly, make uuid a function that maps your
# sample names of choice to the UUIDs they correspond to (they are
# the column `id` in the GDC manifest files, which can be used to
# systematically construct sample sheets)
uuid="34b80c89-c41e-47be-84fb-0c0ea493b5bb",
# a gdc_token is only required for controlled access samples,
# leave blank otherwise (`gdc_token=""`) or skip this param entirely
gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
# for valid extra command line arguments, check command line help or:
# https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/
extra = ""
threads: 4
wrapper:
"0.73.0/bio/gdc-client/download"
rule gdc_download_bam:
output:
# specify all the downloaded files you want to keep, as all other
# downloaded files will be removed automatically e.g. for
# BAM data this could be
"raw/{sample}.bam",
"raw/{sample}.bam.bai",
"raw/{sample}.annotations.txt",
directory("raw/{sample}/logs")
log:
"logs/gdc-client/download/{sample}.log"
params:
# to use this rule flexibly, make uuid a function that maps your
# sample names of choice to the UUIDs they correspond to (they are
# the column `id` in the GDC manifest files, which can be used to
# systematically construct sample sheets)
uuid="34b80c89-c41e-47be-84fb-0c0ea493b5bb",
# a gdc_token is only required for controlled access samples,
# leave blank otherwise (`gdc_token=""`) or skip this param entirely
gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
# for valid extra command line arguments, check command line help or:
# https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/
extra = ""
threads: 4
wrapper:
"0.73.0/bio/gdc-client/download"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gdc-client==1.5.0
- David Lähnemann
__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
import os.path as path
from tempfile import TemporaryDirectory
import glob
uuid = snakemake.params.get("uuid", "")
if uuid == "":
raise ValueError("You need to provide a GDC UUID via the 'uuid' in 'params'.")
extra = snakemake.params.get("extra", "")
token = snakemake.params.get("gdc_token", "")
if token != "":
token = "--token-file {}".format(token)
with TemporaryDirectory() as tempdir:
shell(
"gdc-client download"
" {token}"
" {extra}"
" -n {snakemake.threads} "
" --log-file {snakemake.log} "
" --dir {tempdir}"
" {uuid}"
)
for out_path in snakemake.output:
tmp_path = path.join(tempdir, uuid, path.basename(out_path))
if not path.exists(tmp_path):
(root, ext1) = path.splitext(out_path)
paths = glob.glob(path.join(tempdir, uuid, "*" + ext1))
if len(paths) > 1:
(root, ext2) = path.splitext(root)
paths = glob.glob(path.join(tempdir, uuid, "*" + ext2 + ext1))
if len(paths) == 0:
raise ValueError(
"{} file extension {} does not match any downloaded file.\n"
"Are you sure that UUID {} provides a file of such format?\n".format(
out_path, ext1, uuid
)
)
if len(paths) > 1:
raise ValueError(
"Found more than one downloaded file with extension '{}':\n"
"{}\n"
"Cannot match requested output file {} unambiguously.\n".format(
ext2 + ext1, paths, out_path
)
)
tmp_path = paths[0]
shell("mv {tmp_path} {out_path}")
GENOMEPY¶
Download genomes the easy way: https://github.com/vanheeringen-lab/genomepy
Example¶
This wrapper can be used in the following way:
rule genomepy:
output:
multiext("{assembly}/{assembly}", ".fa", ".fa.fai", ".fa.sizes", ".gaps.bed",
".annotation.gtf.gz", ".blacklist.bed")
log:
"logs/genomepy_{assembly}.log"
params:
provider="UCSC" # optional, defaults to ucsc. Choose from ucsc, ensembl, and ncbi
cache: True # mark as eligible for between-workflow caching
wrapper:
"0.73.0/bio/genomepy"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
bioconda::genomepy==0.8.3
Authors¶
- Maarten van der Sande
Code¶
__author__ = "Maarten van der Sande"
__copyright__ = "Copyright 2020, Maarten van der Sande"
__email__ = "M.vanderSande@science.ru.nl"
__license__ = "MIT"
from snakemake.shell import shell
# Optional parameters
provider = snakemake.params.get("provider", "UCSC")
# set options for plugins
all_plugins = "blacklist,bowtie2,bwa,gmap,hisat2,minimap2,star"
req_plugins = ","
if any(["blacklist" in out for out in snakemake.output]):
req_plugins = "blacklist,"
annotation = ""
if any(["annotation" in out for out in snakemake.output]):
annotation = "--annotation"
# parse the genome dir
genome_dir = "./"
if snakemake.output[0].count("/") > 1:
genome_dir = "/".join(snakemake.output[0].split("/")[:-1])
log = snakemake.log
# Finally execute genomepy
shell(
"""
# set a trap so we can reset to original user's settings
active_plugins=$(genomepy config show | grep -Po '(?<=- ).*' | paste -s -d, -) || echo ""
trap "genomepy plugin disable {{{all_plugins}}} >> {log} 2>&1;\
genomepy plugin enable {{$active_plugins,}} >> {log} 2>&1" EXIT
# disable all, then enable the ones we need
genomepy plugin disable {{{all_plugins}}} > {log} 2>&1
genomepy plugin enable {{{req_plugins}}} >> {log} 2>&1
# install the genome
genomepy install {snakemake.wildcards.assembly} \
{provider} {annotation} -g {genome_dir} >> {log} 2>&1
"""
)
GRIDSS¶
For gridss, the following wrappers are available:
GRIDSS ASSEMBLE¶
GRIDSS is a modular software suite containing tools useful for the detection of genomic rearrangements. It includes a genome-wide break-end assembler, as well as a structural variation caller for Illumina sequencing data. assemble
performs GRIDSS breakend assembly. Documentation at: https://github.com/PapenfussLab/gridss
This wrapper can be used in the following way:
WORKING_DIR = "working_dir"
samples = ["A", "B"]
preprocess_endings = (
".cigar_metrics",
".coverage.blacklist.bed",
".idsv_metrics",
".insert_size_histogram.pdf",
".insert_size_metrics",
".mapq_metrics",
".sv.bam",
".sv.bam.bai",
".sv_metrics",
".tag_metrics",
)
assembly_endings = (
".cigar_metrics",
".coverage.blacklist.bed",
".downsampled_0.bed",
".excluded_0.bed",
".idsv_metrics",
".mapq_metrics",
".quality_distribution.pdf",
".quality_distribution_metrics",
".subsetCalled_0.bed",
".sv.bam",
".sv.bam.bai",
".tag_metrics",
)
reference_index_endings = (".amb",".ann", ".bwt", ".pac", ".sa", ".gridsscache", ".img")
rule gridss_assemble:
input:
bams=expand("mapped/{sample}.bam", sample=samples),
bais=expand("mapped/{sample}.bam.bai", sample=samples),
reference="reference/genome.fasta",
dictionary="reference/genome.dict",
indices=multiext("reference/genome.fasta", *reference_index_endings),
preprocess=expand("{working_dir}/{sample}.bam.gridss.working/{sample}.bam{ending}", working_dir=[WORKING_DIR], sample=samples, ending=preprocess_endings)
output:
assembly="assembly/group.bam",
assembly_others=expand("{working_dir}/group.bam.gridss.working/group.bam{ending}", working_dir=[WORKING_DIR], ending=assembly_endings)
params:
extra="--jvmheap 1g",
workingdir=WORKING_DIR
log:
"log/gridss/assemble/group.log"
threads:
100
wrapper:
"0.73.0/bio/gridss/assemble"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gridss==2.9.4
- Christopher Schröder
"""Snakemake wrapper for gridss assemble"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
reference = snakemake.input.get("reference")
if not snakemake.params.workingdir:
raise ValueError("Please set params.workingdir to provide a working directory.")
if not reference:
    raise ValueError("Please set input.reference to provide the reference genome.")
for ending in (".amb", ".ann", ".bwt", ".pac", ".sa"):
if not path.exists("{}{}".format(reference, ending)):
raise ValueError(
"{reference}{ending} missing. Please make sure the reference was properly indexed by bwa.".format(
reference=reference, ending=ending
)
)
dictionary = path.splitext(reference)[0] + ".dict"
if not path.exists(dictionary):
raise ValueError(
"{dictionary}.dict missing. Please make sure the reference dictionary was properly created. This can be accomplished for example by CreateSequenceDictionary.jar from Picard".format(
dictionary=dictionary
)
)
shell(
"(gridss -s assemble " # Tool
"--reference {reference} " # Reference
"--threads {snakemake.threads} " # Threads
"--workingdir {snakemake.params.workingdir} " # Working directory
"--assembly {snakemake.output.assembly} " # Assembly output
"{snakemake.input.bams} "
"{extra}) {log}"
)
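The genome.dict file checked above can be produced with Picard's CreateSequenceDictionary, for example via its own wrapper; a sketch, assuming bio/picard/createsequencedictionary is available at the version tag used in these examples:
rule create_dict:
    input:
        "reference/genome.fasta"
    output:
        "reference/genome.dict"
    log:
        "logs/picard/create_dict.log"
    params:
        extra=""
    wrapper:
        "0.73.0/bio/picard/createsequencedictionary"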
GRIDSS CALL¶
GRIDSS is a modular software suite containing tools useful for the detection of genomic rearrangements. It includes a genome-wide break-end assembler, as well as a structural variation caller for Illumina sequencing data. call
performs variant calling. Documentation at: https://github.com/PapenfussLab/gridss
This wrapper can be used in the following way:
WORKING_DIR = "working_dir"
samples = ["A", "B"]
preprocess_endings = (
".cigar_metrics",
".coverage.blacklist.bed",
".idsv_metrics",
".insert_size_histogram.pdf",
".insert_size_metrics",
".mapq_metrics",
".sv.bam",
".sv.bam.bai",
".sv_metrics",
".tag_metrics",
)
assembly_endings = (
".cigar_metrics",
".coverage.blacklist.bed",
".downsampled_0.bed",
".excluded_0.bed",
".idsv_metrics",
".mapq_metrics",
".quality_distribution.pdf",
".quality_distribution_metrics",
".subsetCalled_0.bed",
".sv.bam",
".sv.bam.bai",
".tag_metrics",
)
reference_index_endings = (".amb",".ann", ".bwt", ".pac", ".sa", ".gridsscache", ".img")
rule gridss_call:
input:
bams=expand("mapped/{sample}.bam", sample=samples),
bais=expand("mapped/{sample}.bam.bai", sample=samples),
reference="reference/genome.fasta",
dictionary="reference/genome.dict",
indices=multiext("reference/genome.fasta", *reference_index_endings),
preprocess=expand("{working_dir}/{sample}.bam.gridss.working/{sample}.bam{ending}", working_dir=[WORKING_DIR], sample=samples, ending=preprocess_endings),
assembly="assembly/group.bam",
assembly_others=expand("{working_dir}/group.bam.gridss.working/group.bam{ending}", working_dir=[WORKING_DIR], ending=assembly_endings)
output:
vcf="vcf/group.vcf",
idx="vcf/group.vcf.idx",
tmpidx=temp(WORKING_DIR + "/group.vcf.gridss.working/group.vcf.allocated.vcf.idx") # be aware the group occurs two times here
params:
extra="--jvmheap 1g",
workingdir=WORKING_DIR
log:
"log/gridss/call/group.log"
threads:
100
wrapper:
"0.73.0/bio/gridss/call"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gridss==2.9.4
cpulimit=0.2
- Christopher Schröder
"""Snakemake wrapper for gridss call"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
reference = snakemake.input.get("reference")
dictionary = snakemake.input.get("dictionary")
if not snakemake.params.workingdir:
raise ValueError("Please set params.workingdir to provide a working directory.")
if not reference:
    raise ValueError("Please set input.reference to provide the reference genome.")
for ending in (".amb", ".ann", ".bwt", ".pac", ".sa"):
if not path.exists("{}{}".format(reference, ending)):
raise ValueError(
"{reference}{ending} missing. Please make sure the reference was properly indexed by bwa.".format(
reference=reference, ending=ending
)
)
dictionary = path.splitext(reference)[0] + ".dict"
if not path.exists(dictionary):
raise ValueError(
"{dictionary}.dict missing. Please make sure the reference dictionary was properly created. This can be accomplished for example by CreateSequenceDictionary.jar from Picard".format(
dictionary=dictionary
)
)
shell(
"(export JAVA_OPTS='-XX:ActiveProcessorCount={snakemake.threads}' & "
"gridss -s call " # Tool
"--reference {reference} " # Reference
"--threads {snakemake.threads} " # Threads
"--workingdir {snakemake.params.workingdir} " # Working directory
"--assembly {snakemake.input.assembly} " # Assembly input from gridss assemble
"--output {snakemake.output.vcf} " # Assembly vcf
"{snakemake.input.bams} "
"{extra}) {log}"
)
GRIDSS PREPROCESS¶
GRIDSS is a modular software suite containing tools useful for the detection of genomic rearrangements. It includes a genome-wide break-end assembler, as well as a structural variation caller for Illumina sequencing data. preprocess
pre-processes input BAM files. Can be run per input file. Documentation at: https://github.com/PapenfussLab/gridss
This wrapper can be used in the following way:
WORKING_DIR="working_dir"
rule gridss_preprocess:
input:
bam="mapped/{sample}.bam",
bai="mapped/{sample}.bam.bai",
reference="reference/genome.fasta",
dictionary="reference/genome.dict",
refindex=multiext("reference/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa", ".gridsscache", ".img")
output:
multiext("{WORKING_DIR}/{sample}.bam.gridss.working/{sample}.bam", ".cigar_metrics", ".coverage.blacklist.bed", ".idsv_metrics", ".insert_size_histogram.pdf", ".insert_size_metrics", ".mapq_metrics", ".sv.bam", ".sv.bam.bai", ".sv_metrics", ".tag_metrics")
params:
extra="--jvmheap 1g",
workingdir=WORKING_DIR
log:
"log/gridss/preprocess/{WORKING_DIR}/{sample}.preprocess.log"
threads:
8
wrapper:
"0.73.0/bio/gridss/preprocess"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gridss==2.9.4
- Christopher Schröder
"""Snakemake wrapper for gridss preprocess"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
reference = snakemake.input.get("reference")
dictionary = snakemake.input.get("dictionary")
if not snakemake.params.workingdir:
raise ValueError("Please set params.workingdir to provide a working directory.")
if not reference:
    raise ValueError("Please set input.reference to provide the reference genome.")
for ending in (".amb", ".ann", ".bwt", ".pac", ".sa"):
if not path.exists("{}{}".format(reference, ending)):
raise ValueError(
"{reference}{ending} missing. Please make sure the reference was properly indexed by bwa.".format(
reference=reference, ending=ending
)
)
dictionary = path.splitext(reference)[0] + ".dict"
if not path.exists(dictionary):
raise ValueError(
"{dictionary}.dict missing. Please make sure the reference dictionary was properly created. This can be accomplished for example by CreateSequenceDictionary.jar from Picard".format(
dictionary=dictionary
)
)
shell(
"(gridss -s preprocess " # Tool
"--reference {reference} " # Reference
"--threads {snakemake.threads} "
"--workingdir {snakemake.params.workingdir} "
"{snakemake.input.bam} "
"{extra}) {log}"
)
GRIDSS SETUPREFERENCE¶
GRIDSS is a modular software suite containing tools useful for the detection of genomic rearrangements. It includes a genome-wide break-end assembler, as well as a structural variation caller for Illumina sequencing data. setupreference
is a once-off setup step that generates additional files in the same directory as the reference. WARNING: multiple instances of GRIDSS attempting to perform setupreference at the same time will result in file corruption. Make sure these files are generated before running parallel GRIDSS jobs. Documentation at: https://github.com/PapenfussLab/gridss
This wrapper can be used in the following way:
rule gridss_setupreference:
input:
reference="reference/genome.fasta",
dictionary="reference/genome.dict",
indices=multiext("reference/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa")
output:
multiext("reference/genome.fasta", ".gridsscache", ".img")
params:
extra="--jvmheap 1g"
log:
"log/gridss/setupreference.log"
wrapper:
"0.73.0/bio/gridss/setupreference"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
gridss==2.9.4
- Christopher Schröder
"""Snakemake wrapper for gridss setupreference"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
reference = snakemake.input.get("reference", None)
if not reference:
    raise ValueError("A reference genome has to be provided!")
for ending in (".amb", ".ann", ".bwt", ".pac", ".sa"):
if not path.exists("{}{}".format(reference, ending)):
raise ValueError(
"{reference}{ending} missing. Please make sure the reference was properly indexed by bwa.".format(
reference=reference, ending=ending
)
)
dictionary = path.splitext(reference)[0] + ".dict"
if not path.exists(dictionary):
raise ValueError(
"{dictionary}.dict missing. Please make sure the reference dictionary was properly created. This can be accomplished for example by CreateSequenceDictionary.jar from Picard".format(
dictionary=dictionary
)
)
shell(
"(gridss -s setupreference " # Tool
"--reference {reference} " # Reference
"{extra}) {log}"
)
HAP.PY¶
For hap.py, the following wrappers are available:
PRE.PY¶
Preprocessing/normalisation of vcf/bcf files. Part of the hap.py suite by Illumina (see https://github.com/Illumina/hap.py/blob/master/doc/normalisation.md).
This wrapper can be used in the following way:
rule preprocess_variants:
input:
##vcf/bcf
variants="variants.vcf"
output:
"normalized/variants.vcf"
params:
## path to reference genome
genome="genome.fasta",
## parameters such as -L to left-align variants
extra="-L"
threads: 2
wrapper:
"0.73.0/bio/hap.py/pre.py"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
hap.py=0.3.14
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
## Extract arguments
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"(pre.py"
" --threads {snakemake.threads}"
" -r {snakemake.params.genome}"
" {extra}"
" {snakemake.input.variants}"
" {snakemake.output})"
" {log}"
)
HISAT2¶
For hisat2, the following wrappers are available:
HISAT2 ALIGN¶
Map reads with hisat2.
This wrapper can be used in the following way:
rule hisat2_align:
input:
reads=["reads/{sample}_R1.fastq", "reads/{sample}_R2.fastq"]
output:
"mapped/{sample}.bam"
log:
"logs/hisat2_align_{sample}.log"
params:
extra="",
idx="index/",
threads: 2
wrapper:
"0.73.0/bio/hisat2/align"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
hisat2==2.1.0
samtools==1.9
- The -S flag must not be used since the output is piped directly to samtools for compression.
- The --threads/-p flag must not be used since the thread count is set separately via the snakemake threads directive.
- The wrapper does not yet handle SRA input accessions.
- No checking of reference index files is done, since the actual number of files may differ depending on the reference sequence size. This is also why the index is supplied in the params directive instead of the input directive.
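Since the wrapper selects -U for a single reads file and -1/-2 for a pair, a single-end variant only changes the reads entry; a minimal sketch:
rule hisat2_align_single:
    input:
        reads=["reads/{sample}.fastq"]  # one file, so the wrapper uses -U
    output:
        "mapped/{sample}.bam"
    log:
        "logs/hisat2_align_se_{sample}.log"
    params:
        extra="",
        idx="index/",
    threads: 2
    wrapper:
        "0.73.0/bio/hisat2/align"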
- Wibowo Arindrarto
__author__ = "Wibowo Arindrarto"
__copyright__ = "Copyright 2016, Wibowo Arindrarto"
__email__ = "bow@bow.web.id"
__license__ = "BSD"
from snakemake.shell import shell
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Run log
log = snakemake.log_fmt_shell()
# Input file wrangling
reads = snakemake.input.get("reads")
if isinstance(reads, str):
input_flags = "-U {0}".format(reads)
elif len(reads) == 1:
input_flags = "-U {0}".format(reads[0])
elif len(reads) == 2:
input_flags = "-1 {0} -2 {1}".format(*reads)
else:
raise RuntimeError(
"Reads parameter must contain at least 1 and at most 2" " input files."
)
# Executed shell command
shell(
"(hisat2 {extra} "
"--threads {snakemake.threads} "
" -x {snakemake.params.idx} {input_flags} "
" | samtools view -Sbh -o {snakemake.output[0]} -) "
" {log}"
)
HISAT2 INDEX¶
Create index with hisat2.
This wrapper can be used in the following way:
rule hisat2_index:
input:
fasta = "{genome}.fasta"
output:
directory("index_{genome}")
params:
prefix = "index_{genome}/"
log:
"logs/hisat2_index_{genome}.log"
threads: 2
wrapper:
"0.73.0/bio/hisat2/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
hisat2==2.1.0
samtools==1.9
Input:
- sequence: list of FASTA files or a list of sequences
Output:
- Directory of the hisat2 custom index.
- Joël Simoneau
"""Snakemake wrapper for HISAT2 index"""
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "simoneaujoel@gmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Allowing for multiple FASTA files
fasta = snakemake.input.get("fasta")
assert fasta is not None, "input-> a FASTA-file or a sequence is required"
input_seq = ""
if not "." in fasta:
input_seq += "-c "
input_seq += ",".join(fasta) if isinstance(fasta, list) else fasta
hisat_dir = snakemake.params.get("prefix", "")
if hisat_dir:
    os.makedirs(hisat_dir, exist_ok=True)
shell(
"hisat2-build {extra} "
"-p {snakemake.threads} "
"{input_seq} "
"{snakemake.params.prefix} "
"{log}"
)
HMMER¶
For hmmer, the following wrappers are available:
HMMBUILD¶
hmmbuild: construct profile HMM(s) from multiple sequence alignment(s)
This wrapper can be used in the following way:
rule hmmbuild_profile:
input:
"test-profile.sto"
output:
"test-profile.hmm"
log:
"logs/test-profile-hmmbuild.log"
params:
extra="",
threads: 4
wrapper:
"0.73.0/bio/hmmer/hmmbuild"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
hmmer=3.2.1
- N Tessa Pierce
"""Snakemake wrapper for hmmbuild"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
" hmmbuild {extra} --cpu {snakemake.threads} "
" {snakemake.output} {snakemake.input} {log} "
)
HMMPRESS¶
Format an HMM database into a binary format for hmmscan.
This wrapper can be used in the following way:
rule hmmpress_profile:
input:
"test-profile.hmm"
output:
"test-profile.hmm.h3f",
"test-profile.hmm.h3i",
"test-profile.hmm.h3m",
"test-profile.hmm.h3p"
log:
"logs/hmmpress.log"
params:
extra="",
threads: 4
wrapper:
"0.73.0/bio/hmmer/hmmpress"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
hmmer=3.2.1
- N Tessa Pierce
"""Snakemake wrapper for hmmpress"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# -f forces an overwrite of any previous hmmpress-ed datafiles; the default is to complain about existing files and ask you to delete them first.
shell("hmmpress -f {snakemake.input} {log}")
HMMSCAN¶
search protein sequence(s) against a protein profile database
This wrapper can be used in the following way:
rule hmmscan_profile:
input:
fasta="test-protein.fa",
profile="test-profile.hmm.h3f",
output:
# only one of these is required
tblout="test-prot-tbl.txt", # save parseable table of per-sequence hits to file <f>
domtblout="test-prot-domtbl.txt", # save parseable table of per-domain hits to file <f>
pfamtblout="test-prot-pfamtbl.txt", # save table of hits and domains to file, in Pfam format <f>
outfile="test-prot-out.txt", # Direct the main human-readable output to a file <f> instead of the default stdout.
log:
"logs/hmmscan.log"
params:
evalue_threshold=0.00001,
# if bitscore threshold provided, hmmscan will use that instead
#score_threshold=50,
extra="",
threads: 4
wrapper:
"0.73.0/bio/hmmer/hmmscan"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
hmmer=3.2.1
- N Tessa Pierce
"""Snakemake wrapper for hmmscan"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
profile = snakemake.input.get("profile")
profile = profile.rsplit(".h3", 1)[0]
assert profile.endswith(".hmm"), 'your profile file should end with ".hmm" '
# Direct the main human-readable output to a file <f> instead of the default stdout.
out_cmd = ""
outfile = snakemake.output.get("outfile", "")
if outfile:
out_cmd += " -o {} ".format(outfile)
# save parseable table of per-sequence hits to file <f>
tblout = snakemake.output.get("tblout", "")
if tblout:
out_cmd += " --tblout {} ".format(tblout)
# save parseable table of per-domain hits to file <f>
domtblout = snakemake.output.get("domtblout", "")
if domtblout:
out_cmd += " --domtblout {} ".format(domtblout)
# save table of hits and domains to file, in Pfam format <f>
pfamtblout = snakemake.output.get("pfamtblout", "")
if pfamtblout:
out_cmd += " --pfamtblout {} ".format(pfamtblout)
## default params: enable evalue threshold. If bitscore thresh is provided, use that instead (both not allowed)
# report models >= this score threshold in output
evalue_threshold = snakemake.params.get("evalue_threshold", 0.00001)
score_threshold = snakemake.params.get("score_threshold", "")
if score_threshold:
thresh_cmd = " -T {} ".format(float(score_threshold))
else:
thresh_cmd = " -E {} ".format(float(evalue_threshold))
# all other params should be entered in "extra" param
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"hmmscan {out_cmd} {thresh_cmd} --cpu {snakemake.threads}"
" {extra} {profile} {snakemake.input.fasta} {log}"
)
HMMSEARCH¶
search profile(s) against a sequence database
This wrapper can be used in the following way:
rule hmmsearch_profile:
input:
fasta="test-protein.fa",
profile="test-profile.hmm.h3f",
output:
# only one of these is required
tblout="test-prot-tbl.txt", # save parseable table of per-sequence hits to file <f>
domtblout="test-prot-domtbl.txt", # save parseable table of per-domain hits to file <f>
alignment_hits="test-prot-alignment-hits.txt", # Save a multiple alignment of all significant hits (those satisfying inclusion thresholds) to the file <f>
outfile="test-prot-out.txt", # Direct the main human-readable output to a file <f> instead of the default stdout.
log:
"logs/hmmsearch.log"
params:
evalue_threshold=0.00001,
# if bitscore threshold provided, hmmsearch will use that instead
#score_threshold=50,
extra="",
threads: 4
wrapper:
"0.73.0/bio/hmmer/hmmsearch"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
hmmer=3.2.1
Input:
- hmm profile(s)
- sequence database
Output:
- matches between sequences and hmm profiles
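To threshold on bit score instead of E-value, set score_threshold; the wrapper then emits -T and ignores evalue_threshold. A sketch of the params block, with an illustrative cutoff:
params:
    score_threshold=50,  # hypothetical bit-score cutoff, passed to hmmsearch as -T
    extra="",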
- N Tessa Pierce
"""Snakemake wrapper for hmmsearch"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
profile = snakemake.input.get("profile")
profile = profile.rsplit(".h3", 1)[0]
assert profile.endswith(".hmm"), 'your profile file should end with ".hmm" '
# Direct the main human-readable output to a file <f> instead of the default stdout.
out_cmd = ""
outfile = snakemake.output.get("outfile", "")
if outfile:
out_cmd += " -o {} ".format(outfile)
# save parseable table of per-sequence hits to file <f>
tblout = snakemake.output.get("tblout", "")
if tblout:
out_cmd += " --tblout {} ".format(tblout)
# save parseable table of per-domain hits to file <f>
domtblout = snakemake.output.get("domtblout", "")
if domtblout:
out_cmd += " --domtblout {} ".format(domtblout)
# Save a multiple alignment of all significant hits (those satisfying inclusion thresholds) to the file <f>
alignment_hits = snakemake.output.get("alignment_hits", "")
if alignment_hits:
out_cmd += " -A {} ".format(alignment_hits)
## default params: enable evalue threshold. If bitscore thresh is provided, use that instead (both not allowed)
# report models >= this score threshold in output
evalue_threshold = snakemake.params.get("evalue_threshold", 0.00001)
score_threshold = snakemake.params.get("score_threshold", "")
if score_threshold:
thresh_cmd = " -T {} ".format(float(score_threshold))
else:
thresh_cmd = " -E {} ".format(float(evalue_threshold))
# all other params should be entered in "extra" param
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
" hmmsearch --cpu {snakemake.threads} "
" {out_cmd} {thresh_cmd} {extra} {profile} "
" {snakemake.input.fasta} {log}"
)
HOMER¶
For homer, the following wrappers are available:
HOMER ANNOTATEPEAKS¶
Performs peak annotation to associate peaks with nearby genes. For more information, please see the documentation.
This wrapper can be used in the following way:
rule homer_annotatepeaks:
input:
peaks="peaks_refs/{sample}.peaks",
genome="peaks_refs/gene.fasta",
# optional input files
# gtf="", # implicitly sets the -gtf flag
# gene="", # implicitly sets the -gene flag for gene data file to add gene expression or other data types
motif_files="peaks_refs/motives.txt", # implicitly sets the -m flag
# filter_motiv="", # implicitly sets the -fm flag
# center="", # implicitly sets the -center flag
nearest_peak="peaks_refs/b.peaks", # implicitly sets the -p flag
# tag="", # implicitly sets the -d flag for tagDirectories
# vcf="", # implicitly sets the -vcf flag
# bed_graph="", # implicitly sets the -bedGraph flag
# wig="", # implicitly sets the -wig flag
# map="", # implicitly sets the -map flag
# cmp_genome="", # implicitly sets the -cmpGenome flag
# cmp_Liftover="", # implicitly sets the -cmpLiftover flag
# advanced_annotation="" # optional, implicitly sets the -ann flag, see http://homer.ucsd.edu/homer/ngs/advancedAnnotation.html
output:
annotations="{sample}_annot.txt",
# optional output, implicitly sets the -matrix flag, requires motif_files as input
matrix=multiext("{sample}",
".count.matrix.txt",
".ratio.matrix.txt",
".logPvalue.matrix.txt",
".stats.txt"
),
# optional output, implicitly sets the -mfasta flag, requires motif_files as input
mfasta="{sample}_motif.fasta",
# # optional output, implicitly sets the -mbed flag, requires motif_files as input
mbed="{sample}_motif.bed",
# # optional output, implicitly sets the -mlogic flag, requires motif_files as input
mlogic="{sample}_motif.logic"
threads:
2
params:
mode="", # add tss, tts or rna mode and options here, i.e. "tss mm8"
extra="-gid" # optional params, see http://homer.ucsd.edu/homer/ngs/annotation.html
log:
"logs/annotatePeaks/{sample}.log"
wrapper:
"0.73.0/bio/homer/annotatePeaks"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
homer==4.11
Input:
- peak or BED file
- various optional input files, i.e. gtf, bedGraph, wiggle
Output:
- annotation file (.txt)
- various optional output files
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
import os
import sys
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
genome = snakemake.input.get("genome", "")
extra = snakemake.params.get("extra", "")
motif_files = snakemake.input.get("motif_files", "")
matrix = snakemake.output.get("matrix", "")
if genome == "":
genome = "none"
# optional files
opt_files = {
"gtf": "-gtf",
"gene": "-gene",
"motif_files": "-m",
"filter_motiv": "-fm",
"center": "-center",
"nearest_peak": "-p",
"tag": "-d",
"vcf": "-vcf",
"bed_graph": "-bedGraph",
"wig": "-wig",
"map": "-map",
"cmp_genome": "-cmpGenome",
"cmp_Liftover": "-cmpLiftover",
"advanced_annotation": "-ann",
"mfasta": "-mfasta",
"mbed": "-mbed",
"mlogic": "-mlogic",
}
requires_motives = False
for i in opt_files:
file = None
if i == "mfasta" or i == "mbed" or i == "mlogic":
file = snakemake.output.get(i, "")
if file:
requires_motives = True
else:
file = snakemake.input.get(i, "")
if file:
extra += " {flag} {file}".format(flag=opt_files[i], file=file)
if requires_motives and motif_files == "":
sys.exit(
"The optional output files require motif_file(s) as input. For more information please see http://homer.ucsd.edu/homer/ngs/annotation.html."
)
# optional matrix output files:
if matrix:
if motif_files == "":
sys.exit(
"The matrix output files require motif_file(s) as input. For more information please see http://homer.ucsd.edu/homer/ngs/annotation.html."
)
ext = ".count.matrix.txt"
matrix_out = [i for i in snakemake.output if i.endswith(ext)][0]
matrix_name = os.path.basename(matrix_out[: -len(ext)])
extra += " -matrix {}".format(matrix_name)
shell(
"(annotatePeaks.pl"
" {snakemake.params.mode}"
" {snakemake.input.peaks}"
" {genome}"
" {extra}"
" -cpu {snakemake.threads}"
" > {snakemake.output.annotations})"
" {log}"
)
HOMER FINDPEAKS¶
Find ChIP- or ATAC-Seq peaks with the HOMER suite. For more information, please see the documentation.
This wrapper can be used in the following way:
rule homer_findPeaks:
input:
# tagDirectory of sample
tag="tagDir/{sample}",
# tagDirectory of control background sample - optional
control="tagDir/control"
output:
"{sample}_peaks.txt"
params:
# one of 7 basic modes of operation, see homer manual
style="histone",
extra="" # optional params, see homer manual
log:
"logs/findPeaks/{sample}.log"
wrapper:
"0.73.0/bio/homer/findPeaks"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
homer==4.11
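The control tag directory is optional; when it is omitted, the wrapper simply leaves out the -i argument. A minimal sketch without a control:
rule homer_findPeaks_nocontrol:
    input:
        tag="tagDir/{sample}"
    output:
        "{sample}_peaks.txt"
    params:
        style="factor",  # another of HOMER's basic styles
        extra=""
    log:
        "logs/findPeaks/{sample}.log"
    wrapper:
        "0.73.0/bio/homer/findPeaks"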
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
import os.path as path
import sys
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
control = snakemake.input.get("control", "")
if control == "":
control_command = ""
else:
control_command = "-i " + control
shell(
"(findPeaks"
" {snakemake.input.tag}"
" -style {snakemake.params.style}"
" {extra}"
" {control_command}"
" -o {snakemake.output})"
" {log}"
)
HOMER GETDIFFERENTIALPEAKS¶
Detect differentially bound ChIP peaks between samples. For more information, please see the documentation.
This wrapper can be used in the following way:
rule homer_getDifferentialPeaks:
input:
# peak/bed file to be tested
peaks="{sample}.peaks.bed",
# tagDirectory of first sample
first="tagDir/{sample}",
# tagDirectory of sample to compare
second="tagDir/second"
output:
"{sample}_diffPeaks.txt"
params:
extra="" # optional params, see homer manual
log:
"logs/diffPeaks/{sample}.log"
wrapper:
"0.73.0/bio/homer/getDifferentialPeaks"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
homer==4.11
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
import os.path as path
import sys
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(getDifferentialPeaks"
" {snakemake.input.peaks}"
" {snakemake.input.first}"
" {snakemake.input.second}"
" {extra}"
" > {snakemake.output})"
" {log}"
)
HOMER MAKETAGDIRECTORY¶
Create a tag directory with the HOMER suite. For more information, please see the documentation.
This wrapper can be used in the following way:
rule homer_makeTagDir:
input:
# input bam, can be one or a list of files
bam="{sample}.bam",
output:
directory("tagDir/{sample}")
params:
extra="" # optional params, see homer manual
log:
"logs/makeTagDir/{sample}.log"
wrapper:
"0.73.0/bio/homer/makeTagDirectory"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
homer==4.11
samtools==1.10
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
import os.path as path
import sys
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(makeTagDirectory" " {snakemake.output}" " {extra}" " {snakemake.input})" " {log}"
)
HOMER MERGEPEAKS¶
Merge ChIP-Seq peaks from multiple peak files. For more information, please see the documentation. Please be aware that this wrapper does not yet support use of the -prefix
parameter.
This wrapper can be used in the following way:
rule homer_mergePeaks:
input:
# input peak files
"peaks/{sample1}.peaks",
"peaks/{sample2}.peaks"
output:
"merged/{sample1}_{sample2}.peaks"
params:
extra="-d given" # optional params, see homer manual
log:
"logs/mergePeaks/{sample1}_{sample2}.log"
wrapper:
"0.73.0/bio/homer/mergePeaks"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
homer==4.11
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
import os.path as path
import sys
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
class PrefixNotSupportedError(Exception):
pass
if "-prefix" in extra:
raise PrefixNotSupportedError(
"The use of the -prefix parameter is not yet supported in this wrapper"
)
shell("(mergePeaks" " {snakemake.input}" " {extra}" " > {snakemake.output})" " {log}")
IGV-REPORTS¶
Create self-contained igv.js HTML pages.
Example¶
This wrapper can be used in the following way:
rule igv_report:
input:
fasta="minigenome.fa",
vcf="variants.vcf",
# any number of additional optional tracks, see igv-reports manual
tracks=["alignments.bam"]
output:
"igv-report.html"
params:
extra="" # optional params, see igv-reports manual
log:
"logs/igv-report.log"
wrapper:
"0.73.0/bio/igv-reports"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
igv-reports=1.0
Authors¶
- Johannes Köster
Code¶
"""Snakemake wrapper for igv-reports."""
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
tracks = snakemake.input.get("tracks", [])
if tracks:
if isinstance(tracks, str):
tracks = [tracks]
tracks = "--tracks {}".format(" ".join(tracks))
shell(
"create_report {extra} --standalone --output {snakemake.output[0]} {snakemake.input.vcf} {snakemake.input.fasta} {tracks} {log}"
)
INFERNAL¶
For infernal, the following wrappers are available:
INFERNAL CMPRESS¶
Starting from a CM database <cmfile> in standard Infernal-1.1 format, construct binary compressed datafiles for cmscan. Infernal (‘INFERence of RNA ALignment’) is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs). A CM is like a sequence profile, but it scores a combination of sequence consensus and RNA secondary structure consensus, so in many cases, it is more capable of identifying RNA homologs that conserve their secondary structure more than their primary sequence.
This wrapper can be used in the following way:
rule infernal_cmpress:
input:
"test-covariance-model.cm"
output:
"test-covariance-model.cm.i1i",
"test-covariance-model.cm.i1f",
"test-covariance-model.cm.i1m",
"test-covariance-model.cm.i1p"
log:
"logs/cmpress.log"
params:
extra="",
wrapper:
"0.73.0/bio/infernal/cmpress"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
infernal=1.1.2
- Tessa Pierce
"""Snakemake wrapper for Infernal CMpress"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# -F enables overwrite of old files (otherwise cmpress will fail if old versions exist)
shell("cmpress -F {extra} {snakemake.input} {log}")
INFERNAL CMSCAN¶
cmscan is used to search sequences against collections of covariance models that have been prepared with cmpress. The output format is designed to be human-readable, but is often so voluminous that reading it is impractical, and parsing it is a pain. The --tblout option saves output in a simple tabular format that is concise and easier to parse. The -o option allows redirecting the main output, including throwing it away in /dev/null. Infernal (‘INFERence of RNA ALignment’) is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs). A CM is like a sequence profile, but it scores a combination of sequence consensus and RNA secondary structure consensus, so in many cases, it is more capable of identifying RNA homologs that conserve their secondary structure more than their primary sequence.
This wrapper can be used in the following way:
rule cmscan_profile:
input:
fasta="test-transcript.fa",
profile="test-covariance-model.cm.i1i"
output:
tblout="tr-infernal-tblout.txt",
log:
"logs/cmscan.log"
params:
evalue_threshold=10, # In the per-target output, report target sequences with an E-value of <= <x>. default=10.0 (on average, ~10 false positives reported per query)
extra= "",
#score_threshold=50, # Instead of thresholding per-CM output on E-value, report target sequences with a bit score of >= <x>.
threads: 4
wrapper:
"0.73.0/bio/infernal/cmscan"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
infernal=1.1.2
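As described above, the voluminous human-readable output can be discarded while keeping the table: since the wrapper only sets -o itself when an outfile output is declared, /dev/null can be passed via extra. A sketch of the params block:
params:
    extra="-o /dev/null",  # discard the main output; --tblout is still written
    evalue_threshold=10,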
- Tessa Pierce
"""Snakemake wrapper for Infernal CMscan"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
profile = snakemake.input.get("profile")
profile = profile.rsplit(".i", 1)[0]
assert profile.endswith(".cm"), 'your profile file should end with ".cm"'
# direct output to file <f>, not stdout
out_cmd = ""
outfile = snakemake.output.get("outfile", "")
if outfile:
out_cmd += " -o {} ".format(outfile)
# save parseable table of hits to file <s>
tblout = snakemake.output.get("tblout", "")
if tblout:
out_cmd += " --tblout {} ".format(tblout)
## default params: enable evalue threshold. If bitscore thresh is provided, use that instead (both not allowed)
# report <= this evalue threshold in output
evalue_threshold = snakemake.params.get("evalue_threshold", 10) # use cmscan default
# report >= this score threshold in output
score_threshold = snakemake.params.get("score_threshold", "")
if score_threshold:
thresh_cmd = f" -T {float(score_threshold)} "
else:
thresh_cmd = f" -E {float(evalue_threshold)} "
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"cmscan {out_cmd} {thresh_cmd} {extra} --cpu {snakemake.threads} {profile} {snakemake.input.fasta} {log}"
)
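In addition to tblout, the code above accepts an optional outfile output (passed to -o) and a score_threshold parameter that switches from the E-value cutoff (-E) to a bit-score cutoff (-T). A hedged sketch of a rule using both (file names are illustrative only):
rule cmscan_score_threshold:
    input:
        fasta="test-transcript.fa",
        profile="test-covariance-model.cm.i1i"
    output:
        tblout="tr-infernal-tblout.txt",
        outfile="tr-infernal-main.txt"  # main human-readable output, captured via -o
    params:
        score_threshold=50  # report hits with bit score >= 50 instead of an E-value cutoff
    log:
        "logs/cmscan-score.log"
    threads: 4
    wrapper:
        "0.73.0/bio/infernal/cmscan"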
JANNOVAR¶
Annotate the predicted effect of nucleotide changes with Jannovar (https://doc-openbio.readthedocs.io/projects/jannovar/en/master/).
Example¶
This wrapper can be used in the following way:
rule jannovar:
input:
vcf="{sample}.vcf",
pedigree="pedigree_ar.ped" # optional, contains familial relationships
output:
"jannovar/{sample}.vcf.gz"
log:
"logs/jannovar/{sample}.log"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_gb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_gb = 1
params:
database="hg19_small.ser", # path to jannovar reference dataset
extra="--show-all" # optional parameters
wrapper:
"0.73.0/bio/jannovar"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
jannovar-cli==0.31
snakemake-wrapper-utils==0.1.3
Authors¶
- Bradford Powell
Code¶
__author__ = "Bradford Powell"
__copyright__ = "Copyright 2018, Bradford Powell"
__email__ = "bpow@unc.edu"
__license__ = "BSD"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
pedigree = snakemake.input.get("pedigree", "")
if pedigree:
pedigree = '--pedigree-file "%s"' % pedigree
shell(
"jannovar annotate-vcf --database {snakemake.params.database}"
" --input-vcf {snakemake.input.vcf} --output-vcf {snakemake.output}"
" {pedigree} {extra} {java_opts} {log}"
)
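The pedigree handling above is a common wrapper idiom: an optional input is looked up with .get() and expanded into a command-line fragment only when it is present. A minimal standalone sketch of the idiom (the flag name is hypothetical):
# Sketch of the optional-input idiom used above; "--some-flag" names are hypothetical.
def optional_flag(flag, value):
    """Return '<flag> "<value>"' if value is set, else an empty string."""
    return '{} "{}"'.format(flag, value) if value else ""

# optional_flag("--pedigree-file", "pedigree_ar.ped")
# -> '--pedigree-file "pedigree_ar.ped"'
# optional_flag("--pedigree-file", "") -> ''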
KALLISTO¶
For kallisto, the following wrappers are available:
KALLISTO INDEX¶
Index a transcriptome using kallisto.
This wrapper can be used in the following way:
rule kallisto_index:
input:
fasta = "{transcriptome}.fasta"
output:
index = "{transcriptome}.idx"
params:
extra = "--kmer-size=5"
log:
"logs/kallisto_index_{transcriptome}.log"
threads: 1
wrapper:
"0.73.0/bio/kallisto/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
kallisto==0.45.0
Authors¶
- Joël Simoneau
Code¶
"""Snakemake wrapper for Kallisto index"""
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "simoneaujoel@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Allowing for multiple FASTA files
fasta = snakemake.input.get("fasta")
assert fasta is not None, "input-> a FASTA-file is required"
fasta = " ".join(fasta) if isinstance(fasta, list) else fasta
shell(
"kallisto index " # Tool
"{extra} " # Optional parameters
"--index={snakemake.output.index} " # Output file
"{fasta} " # Input FASTA files
"{log}" # Logging
)
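Because the wrapper joins list inputs into a single space-separated argument, several FASTA files can be indexed together; a sketch (paths are illustrative only):
rule kallisto_index_multi:
    input:
        fasta = ["transcripts_a.fasta", "transcripts_b.fasta"]
    output:
        index = "combined.idx"
    params:
        extra = ""
    log:
        "logs/kallisto_index_combined.log"
    threads: 1
    wrapper:
        "0.73.0/bio/kallisto/index"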
KALLISTO QUANT¶
Pseudoalign reads and quantify transcripts using kallisto.
This wrapper can be used in the following way:
rule kallisto_quant:
input:
fastq = ["reads/{exp}_R1.fastq", "reads/{exp}_R2.fastq"],
index = "index/transcriptome.idx"
output:
directory('quant_results_{exp}')
params:
extra = ""
log:
"logs/kallisto_quant_{exp}.log"
threads: 1
wrapper:
"0.73.0/bio/kallisto/quant"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
kallisto==0.45.0
Authors¶
- Joël Simoneau
Code¶
"""Snakemake wrapper for Kallisto quant"""
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "simoneaujoel@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Allowing for multiple FASTQ files
fastq = snakemake.input.get("fastq")
assert fastq is not None, "input-> a FASTQ-file is required"
fastq = " ".join(fastq) if isinstance(fastq, list) else fastq
shell(
"kallisto quant " # Tool
"{extra} " # Optional parameters
"--threads={snakemake.threads} " # Number of threads
"--index={snakemake.input.index} " # Input file
"--output-dir={snakemake.output} " # Output directory
"{fastq} " # Input FASTQ files
"{log}" # Logging
)
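The example above passes paired-end reads as a two-element list. For single-end data, kallisto itself requires the --single flag together with an estimated fragment length mean (-l) and standard deviation (-s), which can be supplied through extra; a hedged sketch (the values are illustrative only):
rule kallisto_quant_single:
    input:
        fastq = "reads/{exp}.fastq",
        index = "index/transcriptome.idx"
    output:
        directory('quant_results_{exp}')
    params:
        extra = "--single -l 200 -s 20"  # illustrative fragment-length estimates
    log:
        "logs/kallisto_quant_{exp}.log"
    threads: 1
    wrapper:
        "0.73.0/bio/kallisto/quant"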
LAST¶
For last, the following wrappers are available:
LASTAL¶
LAST finds similar regions between sequences, and aligns them. It is designed for comparing large datasets to each other (e.g. vertebrate genomes and/or large numbers of DNA reads).
This wrapper can be used in the following way:
rule lastal_nucl_x_nucl:
input:
data="test-transcript.fa",
lastdb="test-transcript.fa.prj"
output:
# only one of these outputs is allowed
maf="test-transcript.maf",
#tab="test-transcript.tab",
#blasttab="test-transcript.blasttab",
#blasttabplus="test-transcript.blasttabplus",
params:
#Report alignments that are expected by chance at most once per LENGTH query letters. By default, LAST reports alignments that are expected by chance at most once per million query letters (for a given database). http://last.cbrc.jp/doc/last-evalues.html
D_length=1000000,
extra=""
log:
"logs/lastal/test.log"
threads: 8
wrapper:
"0.73.0/bio/last/lastal"
rule lastal_nucl_x_prot:
input:
data="test-transcript.fa",
lastdb="test-protein.fa.prj"
output:
# only one of these outputs is allowed
maf="test-tr-x-prot.maf"
#tab="test-tr-x-prot.tab",
#blasttab="test-tr-x-prot.blasttab",
#blasttabplus="test-tr-x-prot.blasttabplus",
params:
frameshift_cost=15, #Align DNA queries to protein reference sequences using specified frameshift cost. 15 is reasonable. Special case, -F0 means DNA-versus-protein alignment without frameshifts, which is faster.)
extra="",
log:
"logs/lastal/test.log"
threads: 8
wrapper:
"0.73.0/bio/last/lastal"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
last=874
Authors¶
- Tessa Pierce
Code¶
""" Snakemake wrapper for lastal """
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# http://last.cbrc.jp/doc/last-evalues.html
d_len = float(snakemake.params.get("D_length", 1000000)) # last default
# set output file formats
maf_out = snakemake.output.get("maf", "")
tab_out = snakemake.output.get("tab", "")
btab_out = snakemake.output.get("blasttab", "")
btabplus_out = snakemake.output.get("blasttabplus", "")
outfiles = [maf_out, tab_out, btab_out, btabplus_out]
# TAB, MAF, BlastTab, BlastTab+ (default=MAF)
assert (
    list(map(bool, outfiles)).count(True) == 1
), "please specify exactly ONE output file using one of the 'maf', 'tab', 'blasttab', or 'blasttabplus' keywords in the output field"
out_cmd = ""
if maf_out:
out_cmd = "-f {}".format("MAF")
outF = maf_out
elif tab_out:
out_cmd = "-f {}".format("TAB")
outF = tab_out
if btab_out:
out_cmd = "-f {}".format("BlastTab")
outF = btab_out
if btabplus_out:
out_cmd = "-f {}".format("BlastTab+")
outF = btabplus_out
frameshift_cost = snakemake.params.get("frameshift_cost", "")
if frameshift_cost:
f_cmd = f"-F {frameshift_cost}"
lastdb_name = str(snakemake.input["lastdb"]).rsplit(".", 1)[0]
shell(
"lastal -D {d_len} -P {snakemake.threads} {extra} {lastdb_name} {snakemake.input.data} > {outF} {log}"
)
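The if/elif chain above picks exactly one output format; equivalently, the mapping could be table-driven. A behavior-equivalent sketch (not part of the wrapper):
# Table-driven equivalent of the format selection above (sketch only).
formats = {"maf": "MAF", "tab": "TAB", "blasttab": "BlastTab", "blasttabplus": "BlastTab+"}
chosen = [(fmt, snakemake.output.get(key)) for key, fmt in formats.items() if snakemake.output.get(key)]
assert len(chosen) == 1, "please specify exactly ONE output file"
out_fmt, outF = chosen[0]
out_cmd = "-f {}".format(out_fmt)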
LASTDB¶
LAST finds similar regions between sequences, and aligns them. It is designed for comparing large datasets to each other (e.g. vertebrate genomes and/or large numbers of DNA reads).
This wrapper can be used in the following way:
rule lastdb_transcript:
input:
"test-transcript.fa"
output:
"test-transcript.fa.prj",
params:
protein_input=False,
extra=""
log:
"logs/lastdb/test-transcript.log"
wrapper:
"0.73.0/bio/last/lastdb"
rule lastdb_protein:
input:
"test-protein.fa"
output:
"test-protein.fa.prj",
params:
protein_input=True,
extra=""
log:
"logs/lastdb/test-protein.log"
wrapper:
"0.73.0/bio/last/lastdb"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
last=874
Authors¶
- Tessa Pierce
Code¶
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
protein_cmd = ""
protein = snakemake.params.get("protein_input", False)
if protein:
protein_cmd = " -p "
shell("lastdb {extra} {protein_cmd} -P {snakemake.threads} {snakemake.input} {log}")
LOFREQ¶
For lofreq, the following wrappers are available:
LOFREQ CALL¶
Call variants with LoFreq.
This wrapper can be used in the following way:
rule lofreq:
input:
bam="data/{sample}.bam",
bai="data/{sample}.bai"
output:
"calls/{sample}.vcf"
log:
"logs/lofreq_call/{sample}.log"
params:
ref="data/genome.fasta",
extra=""
threads: 8
wrapper:
"0.73.0/bio/lofreq/call"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
samtools==1.6
lofreq==2.1.3.1
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
ref = snakemake.params.get("ref", None)
if ref is None:
raise ValueError("A reference must be provided")
bam_input = snakemake.input.bam
bai_input = snakemake.input.bai
if bam_input is None:
raise ValueError("Missing bam input file!")
if bai_input is None:
raise ValueError("Missing bai input file!")
output_file = snakemake.output[0]
if output_file is None:
raise ValueError("Missing output file")
elif not len(snakemake.output) == 1:
raise ValueError("Only expecting one output file: " + str(output_file) + "!")
shell(
"lofreq call-parallel "
" --pp-threads {snakemake.threads}"
" -f {ref}"
" {bam_input}"
" -o {output_file}"
" {extra}"
" {log}"
)
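To our knowledge, lofreq reads the reference through its FASTA index, so the reference given in params should be indexed (e.g. with samtools faidx, which is available in this wrapper's environment) before the rule runs. A sketch of a suitable upstream rule:
rule genome_faidx:
    input:
        "data/genome.fasta"
    output:
        "data/genome.fasta.fai"
    shell:
        "samtools faidx {input}"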
MACS2¶
For macs2, the following wrappers are available:
MACS2 CALLPEAK¶
MACS2 callpeak is a model-based analysis tool for ChIP-sequencing that calls peaks from alignment results. For usage information about MACS2 callpeak, please see the documentation and the command line help. For more information about MACS2, also see the source code and published article. Depending on the selected output extension(s), the corresponding option(s) will be set automatically (see the list below). Please note that some extensions are incompatible with each other, because they require the --broad option either to be enabled or disabled.
- NAME_peaks.xls: a table with information about called peaks (excel format)
- NAME_control_lambda.bdg: local biases estimated for each genomic location from the control sample (bedGraph format; sets --bdg or -B)
- NAME_treat_pileup.bdg: pileup signals from treatment sample (bedGraph format; sets --bdg or -B)
- NAME_peaks.broadPeak: similar to the _peaks.narrowPeak file, except for missing the annotated peak summits (BED 6+3 format; sets --broad)
- NAME_peaks.gappedPeak: contains the broad region and narrow peaks (BED 12+3 format; sets --broad)
- NAME_peaks.narrowPeak: contains the peak locations, peak summit, p-value and q-value (BED 6+4 format; created if --broad is not set)
- NAME_summits.bed: peak summit locations for every peak (BED format; created if --broad is not set)
This wrapper can be used in the following way:
rule callpeak:
input:
treatment="samples/a.bam", # required: treatment sample(s)
control="samples/b.bam" # optional: control sample(s)
output:
# all output files must share the same basename and only differ by their extension
# Usable extensions (and which tools they implicitly call) are listed here:
# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.
multiext("callpeak/basename",
"_peaks.xls", ### required
### optional output files
"_peaks.narrowPeak",
"_summits.bed"
)
log:
"logs/macs2/callpeak.log"
params:
"-f BAM -g hs --nomodel"
wrapper:
"0.73.0/bio/macs2/callpeak"
rule callpeak_options:
input:
treatment="samples/a.bam", # required: treatment sample(s)
control="samples/b.bam" # optional: control sample(s)
output:
# all output files must share the same basename and only differ by their extension
# Usable extensions (and which tools they implicitly call) are listed here:
# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.
multiext("callpeak_options/basename",
"_peaks.xls", ### required
### optional output files
# these output extensions internally set the --bdg or -B option:
"_treat_pileup.bdg",
"_control_lambda.bdg",
# these output extensions internally set the --broad option:
"_peaks.broadPeak",
"_peaks.gappedPeak"
)
log:
"logs/macs2/callpeak.log"
params:
"-f BAM -g hs --broad-cutoff 0.1 --nomodel"
wrapper:
"0.73.0/bio/macs2/callpeak"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
macs2>=2.2
Input:
- SAM, BAM, BED, ELAND, ELANDMULTI, ELANDEXPORT, BOWTIE, BAMPE or BEDPE files
Output:
- tabular file in excel format (.xls) AND
- different optional metrics in bedGraph or BED formats
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
import os
import sys
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
in_contr = snakemake.input.get("control")
params = "{}".format(snakemake.params)
opt_input = ""
out_dir = ""
ext = "_peaks.xls"
out_file = [o for o in snakemake.output if o.endswith(ext)][0]
out_name = os.path.basename(out_file[: -len(ext)])
out_dir = os.path.dirname(out_file)
if in_contr:
opt_input = "-c {contr}".format(contr=in_contr)
if out_dir:
out_dir = "--outdir {dir}".format(dir=out_dir)
if any(out.endswith(("_peaks.narrowPeak", "_summits.bed")) for out in snakemake.output):
if any(
out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak"))
for out in snakemake.output
):
sys.exit(
"Output files with _peaks.narrowPeak and/or _summits.bed extensions cannot be created together with _peaks.broadPeak and/or _peaks.gappedPeak extended output files.\n"
"For usable extensions please see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.\n"
)
else:
if " --broad" in params:
sys.exit(
"If --broad option in params is given, the _peaks.narrowPeak and _summits.bed files will not be created. \n"
"Remove --broad option from params if these files are needed.\n"
)
if any(
out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output
):
if "--broad " not in params and not params.endswith("--broad"):
params += " --broad "
if any(
out.endswith(("_treat_pileup.bdg", "_control_lambda.bdg"))
for out in snakemake.output
):
if all(p not in params for p in ["--bdg", "-B"]):
params += " --bdg "
else:
if any(p in params for p in ["--bdg", "-B"]):
sys.exit(
"If --bdg or -B option in params is given, the _control_lambda.bdg and _treat_pileup.bdg extended files must be specified in output. \n"
)
shell(
"(macs2 callpeak "
"-t {snakemake.input.treatment} "
"{opt_input} "
"{out_dir} "
"-n {out_name} "
"{params}) {log}"
)
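Both example rules rely on Snakemake's multiext() helper, which appends each given extension to a shared prefix; conceptually:
# multiext("callpeak/basename", "_peaks.xls", "_summits.bed") expands to:
# ["callpeak/basename_peaks.xls", "callpeak/basename_summits.bed"]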
MAPDAMAGE2¶
MapDamage2 tracks and quantifies damage patterns in ancient DNA sequences. For more information, see the MapDamage2 documentation.
Example¶
This wrapper can be used in the following way:
rule mapdamage2:
input:
ref="genome.fasta",
bam="mapped/{sample}.bam",
output:
log="results/{sample}/Runtime_log.txt", # output folder is infered from this file, so it needs to be the same folder for all output files
GtoA3p="results/{sample}/3pGtoA_freq.txt",
CtoT5p="results/{sample}/5pCtoT_freq.txt",
dnacomp="results/{sample}/dnacomp.txt",
frag_misincorp="results/{sample}/Fragmisincorporation_plot.pdf",
len="results/{sample}/Length_plot.pdf",
lg_dist="results/{sample}/lgdistribution.txt",
misincorp="results/{sample}/misincorporation.txt",
# rescaled_bam="results/{sample}.rescaled.bam", # uncomment if you want the rescaled BAM file
params:
extra="--no-stats" # optional parameters for mapdamage2 (except -i, -r, -d, --rescale)
log:
"logs/{sample}/mapdamage2.log"
threads: 1 # MapDamage2 is not threaded
wrapper:
"0.73.0/bio/mapdamage2"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
mapdamage2=2.2
Authors¶
- Filipe G. Vieira
Code¶
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"
import os.path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
in_bam = snakemake.input.get("bam", "")
if in_bam:
in_bam = "--input " + in_bam
output_folder = os.path.dirname(snakemake.output.get("log", ""))
if not output_folder:
raise ValueError("mapDamage2 rule needs output 'log'.")
rescaled_bam = snakemake.output.get("rescaled_bam", "")
if rescaled_bam:
rescaled_bam = "--rescale-out " + rescaled_bam
shell(
"mapDamage "
"{in_bam} "
"--reference {snakemake.input.ref} "
"--folder {output_folder} "
"{rescaled_bam} "
"{extra} "
"{log}"
)
MINIMAP2¶
For minimap2, the following wrappers are available:
MINIMAP2¶
A versatile pairwise aligner for genomic and spliced nucleotide sequences https://lh3.github.io/minimap2
This wrapper can be used in the following way:
rule minimap2:
input:
target="target/{input1}.mmi", # can be either genome index or genome fasta
query=["query/reads1.fasta", "query/reads2.fasta"]
output:
"aligned/{input1}_aln.paf"
log:
"logs/minimap2/{input1}.log"
params:
extra="-x map-pb" # optional
threads: 3
wrapper:
"0.73.0/bio/minimap2/aligner"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
minimap2==2.17
Authors¶
- Tom Poorten
- Michael Hall
Code¶
__author__ = "Tom Poorten"
__copyright__ = "Copyright 2017, Tom Poorten"
__email__ = "tom.poorten@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
inputQuery = " ".join(snakemake.input.query)
shell(
"(minimap2 -t {snakemake.threads} {extra} -o {snakemake.output[0]} "
"{snakemake.input.target} {inputQuery}) {log}"
)
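With the settings above the rule writes PAF, the minimap2 default. Since extra is passed straight through to minimap2, adding its -a flag switches the output to SAM; a sketch (the output name is illustrative only):
rule minimap2_sam:
    input:
        target="target/{input1}.mmi",
        query=["query/reads1.fasta", "query/reads2.fasta"]
    output:
        "aligned/{input1}_aln.sam"
    log:
        "logs/minimap2/{input1}.log"
    params:
        extra="-x map-pb -a"  # -a: emit SAM instead of PAF
    threads: 3
    wrapper:
        "0.73.0/bio/minimap2/aligner"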
MINIMAP2 INDEX¶
creates a minimap2 index
This wrapper can be used in the following way:
rule minimap2_index:
input:
target="target/{input1}.fasta"
output:
"{input1}.mmi"
log:
"logs/minimap2_index/{input1}.log"
params:
extra="" # optional additional args
threads: 3
wrapper:
"0.73.0/bio/minimap2/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
minimap2==2.17
Authors¶
- Tom Poorten
Code¶
__author__ = "Tom Poorten"
__copyright__ = "Copyright 2017, Tom Poorten"
__email__ = "tom.poorten@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(minimap2 -t {snakemake.threads} {extra} "
"-d {snakemake.output[0]} {snakemake.input.target}) {log}"
)
MSISENSOR¶
For msisensor, the following wrappers are available:
MSISENSOR MSI¶
Score your MSI with MSIsensor
This wrapper can be used in the following way:
rule test_msisensor_msi:
input:
normal = "example.normal.bam",
tumor = "example.tumor.bam",
microsat = "example.microsate.sites"
output:
"example.msi",
"example.msi_dis",
"example.msi_germline",
"example.msi_somatic"
message:
"Testing MSIsensor msi"
threads:
1
log:
"example.log"
params:
out_prefix = "example.msi"
wrapper:
"0.73.0/bio/msisensor/msi"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
msisensor==0.5
Input/Output¶
Input:
- A microsatellite and homopolymer list from MSIsensor Scan
- A pair of normal/tumor BAM files
Output:
- A text file containing MSI scores
- A TSV formatted file containing read count distribution
- A TSV formatted file containing somatic sites
- A TSV formatted file containing germline sites
Authors¶
- Thibault Dayris
Code¶
"""Snakemake script for MSISensor msi"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from os.path import commonprefix
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Extra parameters default value is an empty string
extra = snakemake.params.get("extra", "")
# Determining common prefix in output files
# to fill the requested parameter '-o'
prefix = commonprefix(snakemake.output)
shell(
"msisensor msi" # Tool and its sub-command
" -d {snakemake.input.microsat}" # Path to homopolymer/microsat file
" -n {snakemake.input.normal}" # Path to normal bam
" -t {snakemake.input.tumor}" # Path to tumor bam
" -o {prefix}" # Path to output distribution file
" -b {snakemake.threads}" # Maximum number of threads used
" {extra}" # Optional extra parameters
" {log}" # Logging behavior
)
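For the example outputs above, os.path.commonprefix resolves exactly to the out_prefix given in params:
from os.path import commonprefix

commonprefix(["example.msi", "example.msi_dis", "example.msi_germline", "example.msi_somatic"])
# -> "example.msi"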
MSISENSOR SCAN¶
Scan homopolymers and microsatellites with MSIsensor
This wrapper can be used in the following way:
rule test_msisensor_scan:
input:
"genome.fasta"
output:
"microsat.list"
message:
"Testing MSISensor scan"
threads:
1
params:
extra = ""
log:
"logs/msisensor_scan.log"
wrapper:
"0.73.0/bio/msisensor/scan"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
msisensor==0.5
Input/Output¶
Input:
- A (multi)fasta formatted file
Output:
- A text file containing homopolymers and microsatellites
Authors¶
- Thibault Dayris
Code¶
"""Snakemake script for MSISensor Scan"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Extra parameters default value is an empty string
extra = snakemake.params.get("extra", "")
shell(
"msisensor scan " # Tool and its sub-command
"-d {snakemake.input} " # Path to fasta file
"-o {snakemake.output} " # Path to output file
"{extra} " # Optional extra parameters
"{log}" # Logging behavior
)
MULTIQC¶
Generate qc report using multiqc.
Example¶
This wrapper can be used in the following way:
rule multiqc:
input:
expand("samtools_stats/{sample}.txt", sample=["a", "b"])
output:
"qc/multiqc.html"
params:
"" # Optional: extra parameters for multiqc.
log:
"logs/multiqc.log"
wrapper:
"0.73.0/bio/multiqc"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
multiqc==1.9
Authors¶
- Julian de Ruiter
Code¶
"""Snakemake wrapper for trimming paired-end reads using cutadapt."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"multiqc"
" {snakemake.params}"
" --force"
" -o {output_dir}"
" -n {output_name}"
" {input_dirs}"
" {log}"
)
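MultiQC scans directories rather than individual files, which is why the wrapper reduces its inputs to their unique parent directories; for the example rule:
from os import path

inputs = ["samtools_stats/a.txt", "samtools_stats/b.txt"]
set(path.dirname(fp) for fp in inputs)
# -> {"samtools_stats"}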
NANOSIM-H¶
NanoSim-H is a simulator of Oxford Nanopore reads that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of Nanopore sequencing technology.
Example¶
This wrapper can be used in the following way:
rule nanosimh:
input:
"{sample}.fa"
output:
reads = "{sample}.simulated.fa",
log = "{sample}.simulated.log",
errors = "{sample}.simulated.errors.txt"
params:
extra = "",
num_reads = 10,
perfect_reads = True,
min_read_len = 10,
log:
"logs/nanosim-h/test/{sample}.log"
wrapper:
"0.73.0/bio/nanosim-h"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
nanosim-h==1.1.0.4
Authors¶
- Michael Hall
Code¶
"""Snakemake wrapper for NanoSim-H."""
__author__ = "Michael Hall"
__copyright__ = "Copyright 2019, Michael Hall"
__email__ = "mbhall88@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
def is_header(query):
return query.startswith(">")
def get_length_of_longest_sequence(fh):
current_length = 0
all_lengths = []
for line in fh:
if not is_header(line):
current_length += len(line.rstrip())
else:
all_lengths.append(current_length)
current_length = 0
all_lengths.append(current_length)
return max(all_lengths)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
prefix = snakemake.params.get("prefix", snakemake.output.reads.rpartition(".")[0])
num_reads = snakemake.params.get("num_reads", 10000)
profile = snakemake.params.get("profile", "ecoli_R9_2D")
perfect_reads = snakemake.params.get("perfect_reads", False)
min_read_len = snakemake.params.get("min_read_len", 50)
max_read_len = snakemake.params.get("max_read_len", 0)
# need to do this as the default read length of infinity can cause nanosim-h to
# hang if the reference is short
if max_read_len == 0:
with open(snakemake.input[0]) as fh:
max_read_len = get_length_of_longest_sequence(fh)
perfect_reads_flag = "--perfect " if perfect_reads else ""
# Format the log redirection string
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Executed shell command
shell(
"nanosim-h {extra} "
"{perfect_reads_flag} "
"--max-len {max_read_len} "
"--min-len {min_read_len} "
"--profile {profile} "
"--number {num_reads} "
"--out-pref {prefix} "
"{snakemake.input} {log}"
)
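To make the max-read-length fallback concrete, here is how the helper above behaves on a small in-memory FASTA (a sketch; the wrapper reads the actual input file instead):
import io

fh = io.StringIO(">a\nACGT\nACG\n>b\nAC\n")
# get_length_of_longest_sequence(fh) walks the records and returns 7:
# record "a" has two sequence lines (4 + 3 characters), record "b" only 2.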
NGS-DISAMBIGUATE¶
Disambiguation algorithm for reads aligned to two species (e.g. human and mouse genomes) from Tophat, Hisat2, STAR or BWA mem.
Example¶
This wrapper can be used in the following way:
rule disambiguate:
input:
a="mapped/{sample}.a.bam",
b="mapped/{sample}.b.bam"
output:
a_ambiguous='disambiguate/{sample}.graft.ambiguous.bam',
b_ambiguous='disambiguate/{sample}.host.ambiguous.bam',
a_disambiguated='disambiguate/{sample}.graft.bam',
b_disambiguated='disambiguate/{sample}.host.bam',
summary='qc/disambiguate/{sample}.txt'
params:
algorithm="bwa",
# optional: Prefix to use for output. If omitted, a
# suitable value is guessed from the output paths. Prefix
# is used for the intermediate output paths, as well as
# sample name in summary file.
prefix="{sample}",
extra=""
wrapper:
"0.73.0/bio/ngs-disambiguate"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
ngs-disambiguate==2016.11.10
bamtools==2.4.0
Input/Output¶
Input:
- species a bam file (name sorted)
- species b bam file (name sorted)
Output:
- bam file with ambiguous alignments for species a
- bam file with ambiguous alignments for species b
- bam file with unambiguous alignments for species a
- bam file with unambiguous alignments for species b
Authors¶
- Julian de Ruiter
Code¶
"""Snakemake wrapper for ngs-disambiguate (from Astrazeneca)."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
# Extract arguments.
prefix = snakemake.params.get("prefix", None)
extra = snakemake.params.get("extra", "")
output_dir = path.dirname(snakemake.output.a_ambiguous)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# If prefix is not given, we use the summary path to derive the most
# probable sample name (as the summary path is least likely to contain
# additional suffixes). This is better than using a random id as prefix, as
# the prefix is also used as the sample name in the summary file.
if prefix is None:
prefix = path.splitext(path.basename(snakemake.output.summary))[0]
# Run command.
shell(
"ngs_disambiguate"
" {extra}"
" -o {output_dir}"
" -s {prefix}"
" -a {snakemake.params.algorithm}"
" {snakemake.input.a}"
" {snakemake.input.b}"
)
# Move outputs into expected positions.
output_base = path.join(output_dir, prefix)
output_map = {
output_base + ".ambiguousSpeciesA.bam": snakemake.output.a_ambiguous,
output_base + ".ambiguousSpeciesB.bam": snakemake.output.b_ambiguous,
output_base + ".disambiguatedSpeciesA.bam": snakemake.output.a_disambiguated,
output_base + ".disambiguatedSpeciesB.bam": snakemake.output.b_disambiguated,
output_base + "_summary.txt": snakemake.output.summary,
}
for src, dest in output_map.items():
if src != dest:
shell("mv {src} {dest}")
OPEN-CRAVAT¶
For open-cravat, the following wrappers are available:
OPENCRAVAT MODULE¶
Install OpenCRAVAT modules. Annotate variant calls with OpenCRAVAT. For more details, see https://github.com/KarchinLab/open-cravat/wiki.
This wrapper can be used in the following way:
rule opencravat_module:
output:
# add any other desired modules as separate directory outputs
directory("modules/annotators/biogrid"),
log:
"logs/open-cravat/module.log"
wrapper:
"0.73.0/bio/open-cravat/module"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
open-cravat=2.1
Authors¶
- Rick Kim
Code¶
__author__ = "Rick Kim"
__copyright__ = "Copyright 2020, Rick Kim"
__license__ = "GPLv3"
from snakemake.shell import shell
import cravat
import re
import pathlib
import os
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Collect requested module directories; snakemake.output is always iterable,
# so this is simply a list of output paths.
onames = [str(o) for o in snakemake.output]
for oname in onames:
if os.path.exists(oname):
continue
[o2, module_name] = os.path.split(oname)
[modules_dir, module_type] = os.path.split(o2)
module_type = module_type[:-1]
modules_dir_cur = cravat.admin_util.get_modules_dir()
if modules_dir_cur != modules_dir:
cravat.admin_util.set_modules_dir(modules_dir)
cmd = ["oc", "module", "install", module_name, "-y"]
cmd = " ".join(cmd)
shell("{cmd} {log}")
OPENCRAVAT RUN¶
Runs OpenCRAVAT. Annotate variant calls with OpenCRAVAT. For more details, see https://github.com/KarchinLab/open-cravat/wiki.
This wrapper can be used in the following way:
rule opencravat:
input:
'example_input.tsv',
'modules/commons/hg38wgs',
'modules/converters/cravat-converter',
'modules/mappers/hg38',
'modules/annotators/biogrid',
'modules/annotators/clinvar',
'modules/postaggregators/tagsampler',
'modules/postaggregators/varmeta',
'modules/postaggregators/vcfinfo',
'modules/reporters/excelreporter',
'modules/reporters/tsvreporter',
'modules/reporters/csvreporter',
output:
'example_input.tsv.xlsx',
'example_input.tsv.variant.tsv',
'example_input.tsv.variant.csv'
log:
"logs/open-cravat/run.log"
threads: 1 # set number of threads for parallel processing
wrapper:
"0.73.0/bio/open-cravat/run"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
open-cravat=2.1
Authors¶
- Rick Kim
Code¶
__author__ = "Rick Kim"
__copyright__ = "Copyright 2020, Rick Kim"
__license__ = "GPLv3"
from snakemake.shell import shell
import os
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
inputfiles = []
annotators = []
reporters = []
modules_dir = set()
for v in snakemake.input:
if os.path.isfile(v):
inputfiles.append(v)
elif os.path.isdir(v):
(module_group_dir, module_name) = os.path.split(v)
(in_modules_dir, module_group) = os.path.split(module_group_dir)
modules_dir.add(in_modules_dir)
if module_group == "annotators":
annotators.append(module_name)
elif module_group == "reporters" and module_name.endswith("reporter"):
reporters.append(module_name[:-8])
if len(modules_dir) > 1:
print(f'Multiple modules directory detected: {",".join(list(modules_dir))}')
exit()
cmd = ["oc", "run"]
cmd.extend(inputfiles)
genome = snakemake.params.get("genome", "hg38")
mp = snakemake.threads
cmd.extend(["-l", genome])
cmd.extend(["--mp", str(mp)])
if len(annotators) > 0:
cmd.append("-a")
cmd.extend(annotators)
if len(reporters) > 0:
cmd.append("-t")
cmd.extend(reporters)
extra = snakemake.params.get("extra", "")
if len(extra) > 0 and type(extra) == str:
cmd.extend(extra.split(" "))
shell("{cmd} {log}")
OPTITYPE¶
Precision 4-digit HLA-I-typing from NGS data based on integer linear programming. Use razers3 beforehand to generate input fastq files only mapping to HLA-regions. Please see https://github.com/FRED-2/OptiType
Example¶
This wrapper can be used in the following way:
rule optitype:
input:
# list of input reads
reads=["reads/{sample}_1.fished.fastq", "reads/{sample}_2.fished.fastq"]
output:
multiext("optitype/{sample}", "_coverage_plot.pdf", "_result.tsv")
log:
"logs/optitype/{sample}.log"
params:
# Type of sequencing data. Can be 'dna' or 'rna'. Default is 'dna'.
sequencing_type="dna",
# optitype config file, optional
config="",
# additional parameters
extra=""
wrapper:
"0.73.0/bio/optitype"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
optitype==1.3.5
Authors¶
- Jan Forster
Code¶
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
outdir = os.path.dirname(snakemake.output[0])
# get sequencing type
seq_type = snakemake.params.get("sequencing_type", "dna")
seq_type = "--{}".format(seq_type)
# check if non-default config.ini is used
config = snakemake.params.get("config", "")
if config:
    config = "--config {}".format(config)
shell(
"(OptiTypePipeline.py"
" --input {snakemake.input.reads}"
" --outdir {outdir}"
" --prefix {snakemake.wildcards.sample}"
" {seq_type}"
" {config}"
" {extra})"
" {log}"
)
PALADIN¶
For paladin, the following wrappers are available:
PALADIN ALIGN¶
Align nucleotide reads to a protein fasta file (that has been indexed with paladin index). PALADIN is a protein sequence alignment tool designed for the accurate functional characterization of metagenomes.
This wrapper can be used in the following way:
rule paladin_align:
input:
reads=["reads/reads.left.fq.gz"],
index="index/prot.fasta.bwt",
output:
"paladin_mapped/{sample}.bam" # will output BAM format if output file ends with ".bam", otherwise SAM format
log:
"logs/paladin/{sample}.log"
threads: 4
wrapper:
"0.73.0/bio/paladin/align"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
paladin=1.4.4
samtools=1.5
Input/Output¶
Input:
- nucleotide reads (fastq)
- indexed protein fasta file (output of paladin index or prepare)
Output:
- mapped reads (SAM or BAM format)
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for PALADIN alignment"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
r = snakemake.input.get("reads")
assert (
r is not None
), "reads are required as input. If you have paired end reads, please merge them first (e.g. with PEAR)"
index = snakemake.input.get("index")
assert (
index is not None
), "please index your assembly and provide the basename (with'.bwt' extension) via the 'index' input param"
index_base = str(index).rsplit(".bwt")[0]
outfile = snakemake.output
# if bam output, pipe to bam!
output_cmd = " | samtools view -Sb - > " if str(outfile).endswith(".bam") else " -o "
min_orf_len = snakemake.params.get("f", "250")
shell(
"paladin align -f {min_orf_len} -t {snakemake.threads} {extra} {index_base} {r} {output_cmd} {outfile}"
)
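For a .bam output, the command assembled above therefore pipes SAM through samtools; conceptually, for the example rule:
# paladin align -f 250 -t 4 {extra} index/prot.fasta reads/reads.left.fq.gz \
#     | samtools view -Sb - > paladin_mapped/{sample}.bam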
PALADIN INDEX¶
Index a protein fasta file for mapping with paladin. PALADIN is a protein sequence alignment tool designed for the accurate functional characterization of metagenomes.
This wrapper can be used in the following way:
rule paladin_index:
input:
"prot.fasta",
output:
"index/prot.fasta.bwt"
log:
"logs/paladin/prot_index.log"
params:
reference_type=3
wrapper:
"0.73.0/bio/paladin/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
paladin=1.4.4
samtools=1.5
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for Paladin Index."""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
# this wrapper temporarily copies your assembly into the output dir
# so that all the paladin output files end up in the desired spot
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
input_assembly = snakemake.input
annotation = snakemake.input.get("gff", "")
paladin_index = str(snakemake.output)
reference_type = snakemake.params.get("reference_type", "3")
assert int(reference_type) in [1, 2, 3, 4]
ref_type_cmd = "-r" + str(reference_type)
output_base = paladin_index.rsplit(".bwt")[0]
shell("cp {input_assembly} {output_base}")
shell("paladin index {ref_type_cmd} {output_base} {annotation} {extra} {log}")
shell("rm -f {output_base}")
PALADIN PREPARE¶
Download and prepare uniprot refs for paladin mapping. PALADIN is a protein sequence alignment tool designed for the accurate functional characterization of metagenomes.
This wrapper can be used in the following way:
rule paladin_prepare:
output:
"uniprot_sprot.fasta.gz",
"uniprot_sprot.fasta.gz.pro"
log:
"logs/paladin/prepare_sprot.log"
params:
reference_type=1, # 1=swiss-prot, 2=uniref90
wrapper:
"0.73.0/bio/paladin/prepare"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
paladin=1.4.4
samtools=1.5
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for Paladin Prepare"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
reference_type = snakemake.params.get(
"reference_type", "1"
) # download swissprot as default
assert int(reference_type) in [1, 2]
ref_type_cmd = "-r" + str(reference_type)
shell("paladin prepare {ref_type_cmd} {extra} {log}")
PBMM2¶
For pbmm2, the following wrappers are available:
PBMM2 ALIGN¶
Align reads using pbmm2, a minimap2 SMRT wrapper for PacBio data https://github.com/PacificBiosciences/pbmm2/
This wrapper can be used in the following way:
rule pbmm2_align:
input:
reference="target/{reference}.fasta", # can be either genome index or genome fasta
query="{query}.bam", # can be either unaligned bam, fastq, or fasta
output:
bam="aligned/{query}.{reference}.bam",
index="aligned/{query}.{reference}.bam.bai",
log:
"logs/pbmm2_align/{query}.{reference}.log",
params:
preset="CCS", # SUBREAD, CCS, HIFI, ISOSEQ, UNROLLED
sample="", # sample name for @RG header
extra="--sort", # optional additional args
loglevel="INFO",
threads: 12
wrapper:
"0.73.0/bio/pbmm2/align"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
pbmm2==1.4.0
Authors¶
- William Rowell
Code¶
__author__ = "William Rowell"
__copyright__ = "Copyright 2020, William Rowell"
__email__ = "wrowell@pacb.com"
__license__ = "MIT"
import tempfile
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
tmp_root = snakemake.params.get("tmp_root", None)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
with tempfile.TemporaryDirectory(dir=tmp_root) as tmp_dir:
shell(
"""
(TMPDIR={tmp_dir}; \
pbmm2 align --num-threads {snakemake.threads} \
--preset {snakemake.params.preset} \
--sample {snakemake.params.sample} \
--log-level {snakemake.params.loglevel} \
{extra} \
{snakemake.input.reference} \
{snakemake.input.query} \
{snakemake.output.bam}) {log}
"""
)
PBMM2 INDEX¶
Indexes a reference using pbmm2, a minimap2 SMRT wrapper for PacBio data https://github.com/PacificBiosciences/pbmm2/
This wrapper can be used in the following way:
rule pbmm2_index:
input:
reference="target/{reference}.fasta",
output:
"target/{reference}.mmi",
log:
"logs/pbmm2_index/{reference}.log",
params:
preset="CCS", # SUBREAD, CCS, HIFI, ISOSEQ, UNROLLED
extra="", # optional additional args
threads: 8
wrapper:
"0.73.0/bio/pbmm2/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
pbmm2==1.3.0
Authors¶
- William Rowell
Code¶
__author__ = "William Rowell"
__copyright__ = "Copyright 2020, William Rowell"
__email__ = "wrowell@pacb.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"""
(pbmm2 index \
--num-threads {snakemake.threads} \
--preset {snakemake.params.preset} \
--log-level DEBUG \
{extra} \
{snakemake.input.reference} {snakemake.output}) {log}
"""
)
PEAR¶
PEAR is an ultrafast, memory-efficient and highly accurate paired-end read merger.
Example¶
This wrapper can be used in the following way:
rule pear_merge:
input:
read1="reads/reads.left.fq.gz",
read2="reads/reads.right.fq.gz"
output:
assembled="pear/reads_pear_assembled.fq.gz",
discarded="pear/reads_pear_discarded.fq.gz",
unassembled_read1="pear/reads_pear_unassembled_r1.fq.gz",
unassembled_read2="pear/reads_pear_unassembled_r2.fq.gz",
log:
'logs/pear.log'
params:
pval=".01",
extra=""
threads: 4
resources:
mem_mb=4000 # define amount of memory to be used by pear
wrapper:
"0.73.0/bio/pear"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
pear=0.9.6
Authors¶
- Tessa Pierce
Code¶
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
r1 = snakemake.input.get("read1")
r2 = snakemake.input.get("read2")
assert r1 is not None and r2 is not None, "r1 and r2 files are required as input"
assembled = snakemake.output.get("assembled")
assert assembled is not None, "require 'assembled' outfile"
gzip = True if assembled.endswith(".gz") else False
out_base, out_end = assembled.rsplit(".f", 1)
out_end = ".f" + out_end
df_assembled = out_base + ".assembled.fastq"
df_discarded = out_base + ".discarded.fastq"
df_unassembled_r1 = out_base + ".unassembled.forward.fastq"
df_unassembled_r2 = out_base + ".unassembled.reverse.fastq"
df_outputs = [df_assembled, df_discarded, df_unassembled_r1, df_unassembled_r2]
discarded = snakemake.output.get("discarded", out_base + ".discarded" + out_end)
unassembled_r1 = snakemake.output.get(
"unassembled_read1", out_base + ".unassembled_r1" + out_end
)
unassembled_r2 = snakemake.output.get(
"unassembled_read2", out_base + ".unassembled_r2" + out_end
)
final_outputs = [assembled, discarded, unassembled_r1, unassembled_r2]
def move_files(in_list, out_list, gzip):
for f, o in zip(in_list, out_list):
if f != o:
if gzip:
shell("gzip -9 -c {f} > {o}")
shell("rm -f {f}")
else:
shell("cp {f} {o}")
shell("rm -f {f}")
elif gzip:
shell("gzip -9 {f}")
pval = float(snakemake.params.get("pval", ".01"))
max_mem = snakemake.resources.get("mem_mb", "4000")
extra = snakemake.params.get("extra", "")
shell(
"pear -f {r1} -r {r2} -p {pval} -j {snakemake.threads} -y {max_mem} {extra} -o {out_base} {log}"
)
move_files(df_outputs, final_outputs, gzip)
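The default output names are derived from the requested assembled path; concretely, for the example rule:
assembled = "pear/reads_pear_assembled.fq.gz"
out_base, out_end = assembled.rsplit(".f", 1)
out_end = ".f" + out_end
# out_base == "pear/reads_pear_assembled", out_end == ".fq.gz";
# pear then writes e.g. out_base + ".assembled.fastq", which move_files()
# renames (and gzips, since the target ends in ".gz") to the requested outputs.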
PICARD¶
For picard, the following wrappers are available:
PICARD ADDORREPLACEREADGROUPS¶
Add or replace read groups with picard tools.
This wrapper can be used in the following way:
rule replace_rg:
input:
"mapped/{sample}.bam"
output:
"fixed-rg/{sample}.bam"
log:
"logs/picard/replace_rg/{sample}.log"
params:
"RGLB=lib1 RGPL=illumina RGPU={sample} RGSM={sample}"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/addorreplacereadgroups"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
picard==2.22.1
snakemake-wrapper-utils==0.1.3
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params
java_opts = get_java_opts(snakemake)
shell(
"picard AddOrReplaceReadGroups {java_opts} {extra} "
"I={snakemake.input} O={snakemake.output} &> {snakemake.log}"
)
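get_java_opts() is what connects the rule's resources block to the JVM: to our understanding it derives the Java memory option from the declared resources (e.g. mem_mb above) and any additional java_opts. A hedged illustration (the exact flag construction belongs to snakemake-wrapper-utils, not this wrapper):
# With resources mem_mb=1024, get_java_opts(snakemake) is expected to yield a
# JVM option along the lines of "-Xmx1024m".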
PICARD BEDTOINTERVALLIST¶
picard BedToIntervalList converts a BED file to Picard Interval List format.
This wrapper can be used in the following way:
rule bed_to_interval_list:
input:
bed="resources/a.bed",
dict="resources/genome.dict"
output:
"a.interval_list"
log:
"logs/picard/bedtointervallist/a.log"
params:
# optional parameters
"SORT=true " # sort output interval list before writing
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/bedtointervallist"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
picard==2.22.1
snakemake-wrapper-utils==0.1.3
Input/Output¶
Input:
- bed: region file
- dict: genome dictionary file (from samtools dict or picard CreateSequenceDictionary)
Output:
- interval_list (Picard format)
Authors¶
- Fabian Kilpert
Code¶
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "fkilpert@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell()
extra = snakemake.params
java_opts = get_java_opts(snakemake)
shell(
"picard BedToIntervalList "
"{java_opts} {extra} "
"INPUT={snakemake.input.bed} "
"SEQUENCE_DICTIONARY={snakemake.input.dict} "
"OUTPUT={snakemake.output} "
"{log} "
)
PICARD COLLECTALIGNMENTSUMMARYMETRICS¶
Collect metrics on aligned reads with picard tools.
This wrapper can be used in the following way:
rule alignment_summary:
input:
ref="genome.fasta",
bam="mapped/{sample}.bam"
output:
"stats/{sample}.summary.txt"
log:
"logs/picard/alignment-summary/{sample}.log"
params:
# optional parameters (e.g. relax checks as below)
"VALIDATION_STRINGENCY=LENIENT "
"METRIC_ACCUMULATION_LEVEL=null "
"METRIC_ACCUMULATION_LEVEL=SAMPLE"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/collectalignmentsummarymetrics"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
picard==2.22.1
snakemake-wrapper-utils==0.1.3
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell()
extra = snakemake.params
java_opts = get_java_opts(snakemake)
shell(
"picard CollectAlignmentSummaryMetrics {java_opts} {extra} "
"INPUT={snakemake.input.bam} OUTPUT={snakemake.output[0]} "
"REFERENCE_SEQUENCE={snakemake.input.ref} {log}"
)
PICARD COLLECTHSMETRICS¶
Collects hybrid-selection (HS) metrics for a SAM or BAM file using picard.
This wrapper can be used in the following way:
rule picard_collect_hs_metrics:
input:
bam="mapped/{sample}.bam",
reference="genome.fasta",
# Baits and targets should be given as interval lists. These can
# be generated from bed files using picard BedToIntervalList.
bait_intervals="regions.intervals",
target_intervals="regions.intervals"
output:
"stats/hs_metrics/{sample}.txt"
params:
# Optional extra arguments. Here we reduce sample size
# to reduce the runtime in our unit test.
extra="SAMPLE_SIZE=1000"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
log:
"logs/picard_collect_hs_metrics/{sample}.log"
wrapper:
"0.73.0/bio/picard/collecthsmetrics"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
picard==2.22.1
snakemake-wrapper-utils==0.1.3
Authors¶
- Julian de Ruiter
Code¶
"""Snakemake wrapper for picard CollectHSMetrics."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
inputs = " ".join("INPUT={}".format(in_) for in_ in snakemake.input)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
java_opts = get_java_opts(snakemake)
shell(
"picard CollectHsMetrics"
" {java_opts} {extra}"
" INPUT={snakemake.input.bam}"
" OUTPUT={snakemake.output[0]}"
" REFERENCE_SEQUENCE={snakemake.input.reference}"
" BAIT_INTERVALS={snakemake.input.bait_intervals}"
" TARGET_INTERVALS={snakemake.input.target_intervals}"
" {log}"
)
PICARD COLLECTINSERTSIZEMETRICS¶
Collect metrics on insert size of paired end reads with picard tools.
This wrapper can be used in the following way:
rule insert_size:
input:
"mapped/{sample}.bam"
output:
txt="stats/{sample}.isize.txt",
pdf="stats/{sample}.isize.pdf"
log:
"logs/picard/insert_size/{sample}.log"
params:
# optional parameters (e.g. relax checks as below)
"VALIDATION_STRINGENCY=LENIENT "
"METRIC_ACCUMULATION_LEVEL=null "
"METRIC_ACCUMULATION_LEVEL=SAMPLE"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/collectinsertsizemetrics"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
picard==2.22.1
r-base==3.6.2
snakemake-wrapper-utils==0.1.3
Input/Output¶
Input:
- bam file
Output:
- txt: textual representation of metrics
- pdf: insert size histogram
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell()
extra = snakemake.params
java_opts = get_java_opts(snakemake)
shell(
"picard CollectInsertSizeMetrics {java_opts} {extra} "
"INPUT={snakemake.input} OUTPUT={snakemake.output.txt} "
"HISTOGRAM_FILE={snakemake.output.pdf} {log}"
)
PICARD COLLECTMULTIPLEMETRICS¶
A picard meta-metrics tool that collects multiple classes of metrics. For usage information about CollectMultipleMetrics, please see picard’s documentation. For more information about picard, also see the source code.
You can select which tool(s) to run by adding the respective extension(s) (see table below) to the requested output of the wrapper invocation (see example Snakemake rule below).
Tool and extension(s) for the output files:
- CollectAlignmentSummaryMetrics: ".alignment_summary_metrics"
- CollectInsertSizeMetrics: ".insert_size_metrics", ".insert_size_histogram.pdf"
- QualityScoreDistribution: ".quality_distribution_metrics", ".quality_distribution.pdf"
- MeanQualityByCycle: ".quality_by_cycle_metrics", ".quality_by_cycle.pdf"
- CollectBaseDistributionByCycle: ".base_distribution_by_cycle_metrics", ".base_distribution_by_cycle.pdf"
- CollectGcBiasMetrics: ".gc_bias.detail_metrics", ".gc_bias.summary_metrics", ".gc_bias.pdf"
- RnaSeqMetrics: ".rna_metrics"
- CollectSequencingArtifactMetrics: ".bait_bias_detail_metrics", ".bait_bias_summary_metrics", ".error_summary_metrics", ".pre_adapter_detail_metrics", ".pre_adapter_summary_metrics"
- CollectQualityYieldMetrics: ".quality_yield_metrics"
This wrapper can be used in the following way:
rule collect_multiple_metrics:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta"
output:
# Through the output file extensions the different tools for the metrics can be selected
# so that it is not necessary to specify them under params with the "PROGRAM" option.
# Usable extensions (and which tools they implicitly call) are listed here:
# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/picard/collectmultiplemetrics.html.
multiext("stats/{sample}",
".alignment_summary_metrics",
".insert_size_metrics",
".insert_size_histogram.pdf",
".quality_distribution_metrics",
".quality_distribution.pdf",
".quality_by_cycle_metrics",
".quality_by_cycle.pdf",
".base_distribution_by_cycle_metrics",
".base_distribution_by_cycle.pdf",
".gc_bias.detail_metrics",
".gc_bias.summary_metrics",
".gc_bias.pdf",
".rna_metrics",
".bait_bias_detail_metrics",
".bait_bias_summary_metrics",
".error_summary_metrics",
".pre_adapter_detail_metrics",
".pre_adapter_summary_metrics",
".quality_yield_metrics"
)
resources:
# This parameter (default 3 GB) can be used to limit the total resources a pipeline is allowed to use, see:
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources
mem_gb=3
log:
"logs/picard/multiple_metrics/{sample}.log"
params:
# optional parameters
"VALIDATION_STRINGENCY=LENIENT "
"METRIC_ACCUMULATION_LEVEL=null "
"METRIC_ACCUMULATION_LEVEL=SAMPLE "
"REF_FLAT=ref_flat.txt " # is required if RnaSeqMetrics are used
wrapper:
"0.73.0/bio/picard/collectmultiplemetrics"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.23.0
snakemake-wrapper-utils==0.1.3
Input:
- BAM file (.bam)
- FASTA reference sequence file (.fasta or .fa)
Output:
- multiple metrics text files (_metrics) AND
- multiple metrics pdf files (.pdf)
- the appropriate extensions for the output files must be used depending on the desired tools
- David Laehnemann
- Antonie Vietor
__author__ = "David Laehnemann, Antonie Vietor"
__copyright__ = "Copyright 2020, David Laehnemann, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
import sys
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params
java_opts = get_java_opts(snakemake)
exts_to_prog = {
".alignment_summary_metrics": "CollectAlignmentSummaryMetrics",
".insert_size_metrics": "CollectInsertSizeMetrics",
".insert_size_histogram.pdf": "CollectInsertSizeMetrics",
".quality_distribution_metrics": "QualityScoreDistribution",
".quality_distribution.pdf": "QualityScoreDistribution",
".quality_by_cycle_metrics": "MeanQualityByCycle",
".quality_by_cycle.pdf": "MeanQualityByCycle",
".base_distribution_by_cycle_metrics": "CollectBaseDistributionByCycle",
".base_distribution_by_cycle.pdf": "CollectBaseDistributionByCycle",
".gc_bias.detail_metrics": "CollectGcBiasMetrics",
".gc_bias.summary_metrics": "CollectGcBiasMetrics",
".gc_bias.pdf": "CollectGcBiasMetrics",
".rna_metrics": "RnaSeqMetrics",
".bait_bias_detail_metrics": "CollectSequencingArtifactMetrics",
".bait_bias_summary_metrics": "CollectSequencingArtifactMetrics",
".error_summary_metrics": "CollectSequencingArtifactMetrics",
".pre_adapter_detail_metrics": "CollectSequencingArtifactMetrics",
".pre_adapter_summary_metrics": "CollectSequencingArtifactMetrics",
".quality_yield_metrics": "CollectQualityYieldMetrics",
}
progs = set()
for file in snakemake.output:
matched = False
for ext in exts_to_prog:
if file.endswith(ext):
progs.add(exts_to_prog[ext])
matched = True
if not matched:
sys.exit(
"Unknown type of metrics file requested, for possible metrics files, see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/picard/collectmultiplemetrics.html"
)
programs = " PROGRAM=" + " PROGRAM=".join(progs)
out = str(snakemake.wildcards.sample) # as default
output_file = str(snakemake.output[0])
for ext in exts_to_prog:
if output_file.endswith(ext):
out = output_file[: -len(ext)]
break
shell(
"(picard CollectMultipleMetrics "
"I={snakemake.input.bam} "
"O={out} "
"R={snakemake.input.ref} "
"{extra} {programs} {java_opts}) {log}"
)
PICARD COLLECTTARGETEDPCRMETRICS¶
Collect metrics for targeted PCR runs with picard tools.
This wrapper can be used in the following way:
rule CollectTargetedPcrMetrics:
input:
bam="mapped/{sample}.bam",
amplicon_intervals="amplicon.list",
target_intervals="target.list"
output:
"stats/{sample}.pcr.txt"
log:
"logs/picard/collecttargetedpcrmetrics/{sample}.log"
params:
extra=""
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/collecttargetedpcrmetrics"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Patrik Smeds
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@mail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell()
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
shell(
"picard CollectTargetedPcrMetrics "
"{java_opts} {extra} "
"INPUT={snakemake.input.bam} "
"OUTPUT={snakemake.output[0]} "
"AMPLICON_INTERVALS={snakemake.input.amplicon_intervals} "
"TARGET_INTERVALS={snakemake.input.target_intervals} "
"{log}"
)
PICARD CREATESEQUENCEDICTIONARY¶
Create a .dict file for a given FASTA file.
This wrapper can be used in the following way:
rule create_dict:
input:
"genome.fasta"
output:
"genome.dict"
log:
"logs/picard/create_dict.log"
params:
extra="" # optional: extra arguments for picard.
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/createsequencedictionary"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"picard "
"CreateSequenceDictionary "
"{java_opts} {extra} "
"R={snakemake.input[0]} "
"O={snakemake.output[0]} "
"{log}"
)
PICARD MARKDUPLICATES¶
Mark PCR and optical duplicates with picard tools. For more information about MarkDuplicates see picard documentation.
This wrapper can be used in the following way:
rule mark_duplicates:
input:
"mapped/{sample}.bam"
output:
bam="dedup/{sample}.bam",
metrics="dedup/{sample}.metrics.txt"
log:
"logs/picard/dedup/{sample}.log"
params:
"REMOVE_DUPLICATES=true"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/markduplicates"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params
java_opts = get_java_opts(snakemake)
shell(
"picard MarkDuplicates " # Tool and its subcommand
"{java_opts} " # Automatic java option
"{extra} " # User defined parmeters
"INPUT={snakemake.input} " # Input file
"OUTPUT={snakemake.output.bam} " # Output bam
"METRICS_FILE={snakemake.output.metrics} " # Output metrics
"{log}" # Logging
)
PICARD MARKDUPLICATESWITHMATECIGAR¶
Mark PCR and optical duplicates with picard tools, taking into account the CIGAR of the mate.
This wrapper can be used in the following way:
rule mark_duplicates:
input:
"mapped/{sample}.bam"
output:
bam="dedup/{sample}.bam",
metrics="dedup/{sample}.metrics.txt"
log:
"logs/picard/dedup/{sample}.log"
params:
"REMOVE_DUPLICATES=true"
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/markduplicateswithmatecigar"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- The java_opts param allows for additional arguments to be passed to the Java virtual machine, e.g. "-XX:ParallelGCThreads=10" (not for -Xmx or -Djava.io.tmpdir, since they are handled automatically).
- The extra param allows for additional program arguments.
- For more information, see https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar
- Johannes Köster
- Filipe G. Vieira
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
shell(
"picard MarkDuplicatesWithMateCigar {java_opts} {extra} INPUT={snakemake.input} "
"OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
"{log}"
)
PICARD MERGESAMFILES¶
Merge sam/bam files using picard tools.
This wrapper can be used in the following way:
rule merge_bams:
input:
expand("mapped/{sample}.bam", sample=["a", "b"])
output:
"merged.bam"
log:
"logs/picard_mergesamfiles.log"
params:
"VALIDATION_STRINGENCY=LENIENT"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/mergesamfiles"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Julian de Ruiter
"""Snakemake wrapper for picard MergeSamFiles."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params
java_opts = get_java_opts(snakemake)
inputs = " ".join("INPUT={}".format(in_) for in_ in snakemake.input)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"picard"
" MergeSamFiles"
" {java_opts} {extra}"
" {inputs}"
" OUTPUT={snakemake.output[0]}"
" {log}"
)
PICARD MERGEVCFS¶
Merge vcf files using picard tools.
This wrapper can be used in the following way:
rule merge_vcfs:
input:
vcfs=["snvs.chr1.vcf", "snvs.chr2.vcf"]
output:
"snvs.vcf"
log:
"logs/picard/mergevcfs.log"
params:
extra=""
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/mergevcfs"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Johannes Köster
"""Snakemake wrapper for picard MergeSamFiles."""
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "johannes.koester@protonmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
inputs = " ".join("INPUT={}".format(f) for f in snakemake.input.vcfs)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
shell(
"picard"
" MergeVcfs"
" {java_opts}"
" {extra}"
" {inputs}"
" OUTPUT={snakemake.output[0]}"
" {log}"
)
PICARD REVERTSAM¶
Reverts SAM or BAM files to a previous state.
This wrapper can be used in the following way:
rule revert_bam:
input:
"mapped/{sample}.bam"
output:
"revert/{sample}.bam"
log:
"logs/picard/revert_sam/{sample}.log"
params:
extra="SANITIZE=true" # optional: Extra arguments for picard.
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/revertsam"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Patrik Smeds
"""Snakemake wrapper for picard RevertSam."""
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"picard"
" RevertSam"
" {java_opts}"
" {extra}"
" INPUT={snakemake.input[0]}"
" OUTPUT={snakemake.output[0]}"
" {log}"
)
PICARD SAMTOFASTQ¶
Converts a SAM or BAM file to FASTQ.
This wrapper can be used in the following way:
rule bam_to_fastq:
input:
"mapped/{sample}.bam"
output:
fastq1="reads/{sample}.R1.fastq",
fastq2="reads/{sample}.R2.fastq"
log:
"logs/picard/sam_to_fastq/{sample}.log"
params:
extra="" # optional: Extra arguments for picard.
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/samtofastq"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Patrik Smeds
"""Snakemake wrapper for picard SortSam."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
fastq1 = snakemake.output.fastq1
fastq2 = snakemake.output.get("fastq2", None)
fastq_unpaired = snakemake.output.get("unpaired_fastq", None)
if not isinstance(fastq1, str):
raise ValueError("f1 needs to be provided")
output = " FASTQ=" + fastq1
if isinstance(fastq2, str):
output += " SECOND_END_FASTQ=" + fastq2
if isinstance(fastq_unpaired, str):
if not isinstance(fastq2, str):
raise ValueError("f2 is required if fastq_unpaired is set")
output += " UNPAIRED_FASTQ=" + fastq_unpaired
shell(
"picard"
" SamToFastq"
" {java_opts}"
" {extra}"
" INPUT={snakemake.input[0]}"
" {output}"
" {log}"
)
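Besides fastq1 and fastq2, the code above also supports an optional unpaired_fastq output key for reads whose mates are discarded (fastq2 is then required). A minimal sketch of such a rule, with illustrative paths:

rule bam_to_fastq_with_unpaired:
    input:
        "mapped/{sample}.bam"
    output:
        fastq1="reads/{sample}.R1.fastq",
        fastq2="reads/{sample}.R2.fastq",
        unpaired_fastq="reads/{sample}.unpaired.fastq"
    log:
        "logs/picard/sam_to_fastq/{sample}.log"
    wrapper:
        "0.73.0/bio/picard/samtofastq"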
PICARD SORTSAM¶
Sort sam/bam files using picard tools.
This wrapper can be used in the following way:
rule sort_bam:
input:
"mapped/{sample}.bam"
output:
"sorted/{sample}.bam"
log:
"logs/picard/sort_sam/{sample}.log"
params:
sort_order="coordinate",
extra="VALIDATION_STRINGENCY=LENIENT" # optional: Extra arguments for picard.
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/picard/sortsam"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
picard==2.22.1
snakemake-wrapper-utils==0.1.3
- Julian de Ruiter
"""Snakemake wrapper for picard SortSam."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"picard"
" SortSam"
" {java_opts}"
" {extra}"
" INPUT={snakemake.input[0]}"
" OUTPUT={snakemake.output[0]}"
" SORT_ORDER={snakemake.params.sort_order}"
" {log}"
)
PINDEL¶
For pindel, the following wrappers are available:
PINDEL¶
Call variants with pindel.
This wrapper can be used in the following way:
pindel_types = ["D", "BP", "INV", "TD", "LI", "SI", "RP"]
rule pindel:
input:
ref="genome.fasta",
# samples to call
samples=["mapped/a.bam"],
# bam configuration file, see http://gmt.genome.wustl.edu/packages/pindel/quick-start.html
config="pindel_config.txt"
output:
expand("pindel/all_{type}", type=pindel_types)
params:
# prefix must be consistent with the output files: pindel writes one file per variant type as <prefix>_<type>
prefix="pindel/all",
extra="" # optional parameters (except -i, -f, -o)
log:
"logs/pindel.log"
threads: 4
wrapper:
"0.73.0/bio/pindel/call"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
pindel==0.2.5b8
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"pindel -T {snakemake.threads} {snakemake.params.extra} -i {snakemake.input.config} "
"-f {snakemake.input.ref} -o {snakemake.params.prefix} {log}"
)
PINDEL2VCF¶
Convert pindel output to vcf.
This wrapper can be used in the following way:
rule pindel2vcf:
input:
ref="genome.fasta",
pindel="pindel/all_{type}"
output:
"pindel/all_{type}.vcf"
params:
refname="hg38", # mandatory, see pindel manual
refdate="20170110", # mandatory, see pindel manual
extra="" # extra params (except -r, -p, -R, -d, -v)
log:
"logs/pindel/pindel2vcf.{type}.log"
wrapper:
"0.73.0/bio/pindel/pindel2vcf"
rule pindel2vcf_multi_input:
input:
ref="genome.fasta",
pindel=["pindel/all_D", "pindel/all_INV"]
output:
"pindel/all.vcf"
params:
refname="hg38", # mandatory, see pindel manual
refdate="20170110", # mandatory, see pindel manual
extra="" # extra params (except -r, -p, -R, -d, -v)
log:
"logs/pindel/pindel2vcf.log"
wrapper:
"0.73.0/bio/pindel/pindel2vcf"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
pindel==0.2.5b8
- Johannes Köster
__author__ = "Johannes Köster, Patrik Smeds"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
import os
import tempfile
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
expected_endings = [
"INT",
"D",
"SI",
"INV",
"INV_final" "TD",
"LI",
"BP",
"CloseEndMapped",
"RP",
]
def split_file_name(file_parts, file_ending_index):
return (
"_".join(file_parts[:file_ending_index]),
"_".join(file_parts[file_ending_index]),
)
def process_input_path(input_file):
"""
:param input_file: Input file from rule, e.g. /path/to/file/all_D or /path/to/file/all_INV_final
:return: "/path/to/file", "all"
"""
file_path, file_name = os.path.split(input_file)
file_parts = file_name.split("_")
# separate name and ending, e.g. name: all, ending: D, or name: all, ending: INV_final
file_name, file_ending = split_file_name(
file_parts, -2 if file_name.endswith("_final") else -1
)
if file_ending not in expected_endings:
raise Exception("Unexpected variant type: " + file_ending)
return file_path, file_name
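# With multiple pindel files (-P), symlink all inputs into one temporary
# directory so that pindel2vcf can read them via a single common prefix.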
with tempfile.TemporaryDirectory() as tmpdirname:
input_flag = "-p"
input_file = snakemake.input.get("pindel")
if isinstance(input_file, list) and len(input_file) > 1:
input_flag = "-P"
input_path, input_name = process_input_path(input_file[0])
input_file = os.path.join(input_path, input_name)
for variant_input in snakemake.input.pindel:
if not variant_input.startswith(input_file):
raise Exception(
"Unable to extract common path from multi file input, expect path is: "
+ input_file
)
if not os.path.isfile(variant_input):
raise Exception('Input "' + input_file + '" is not a file!')
os.symlink(
os.path.abspath(variant_input),
os.path.join(tmpdirname, os.path.basename(variant_input)),
)
input_file = os.path.join(tmpdirname, input_name)
shell(
"pindel2vcf {snakemake.params.extra} {input_flag} {input_file} -r {snakemake.input.ref} -R {snakemake.params.refname} -d {snakemake.params.refdate} -v {snakemake.output[0]} {log}"
)
PLASS¶
Plass (Protein-Level ASSembler) is software to assemble short read sequencing data on a protein level. The main purpose of Plass is the assembly of complex metagenomic datasets.
Example¶
This wrapper can be used in the following way:
rule plass_paired:
input:
left=["reads/reads.left.fq.gz", "reads/reads2.left.fq.gz"],
right=["reads/reads.right.fq.gz", "reads/reads2.right.fq.gz"]
output:
"plass/prot.fasta"
log:
"logs/plass.log"
params:
extra=""
threads: 4
wrapper:
"0.73.0/bio/plass"
rule plass_single:
input:
single=["reads/reads.left.fq.gz", "reads/reads2.left.fq.gz"],
output:
"plass/prot_single.fasta"
log:
"logs/plass_single.log"
params:
extra=""
threads: 4
wrapper:
"0.73.0/bio/plass"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
plass=2.c7e35
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for PLASS Protein-Level Assembler."""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2018, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
# allow multiple input files for single assembly
left = snakemake.input.get("left")
single = snakemake.input.get("single")
assert (
left is not None or single is not None
), "please check read inputs: either left/right or single read file inputs are required"
if left:
left = (
[snakemake.input.left]
if isinstance(snakemake.input.left, str)
else snakemake.input.left
)
right = snakemake.input.get("right")
assert (
right is not None
), "please input 'right' reads or specify that the reads are 'single'"
right = (
[snakemake.input.right]
if isinstance(snakemake.input.right, str)
else snakemake.input.right
)
assert len(left) == len(
right
), "left input needs to contain the same number of files as the right input"
input_str_left = " " + " ".join(left)
input_str_right = " " + " ".join(right)
input_cmd = input_str_left + " " + input_str_right
else:
single = (
[snakemake.input.single]
if isinstance(snakemake.input.single, str)
else snakemake.input.single
)
input_cmd = " " + " ".join(single)
outdir = path.dirname(snakemake.output[0])
tmpdir = path.join(outdir, "tmp")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"plass assemble {input_cmd} {snakemake.output} {tmpdir} --threads {snakemake.threads} {snakemake.params.extra} {log}"
)
PRESEQ¶
For preseq, the following wrappers are available:
PRESEQ LC_EXTRAP¶
preseq estimates the library complexity of existing sequencing data to then estimate the yield of future experiments based on their design. For usage information, please see preseq's command line help (this seems more up to date than the available documentation from 2014). For more information about preseq, also see the source code.
This wrapper can be used in the following way:
rule preseq_lc_extrap_bam:
input:
"samples/{sample}.sorted.bam"
output:
"test_bam/{sample}.lc_extrap"
params:
"-v" #optional parameters
log:
"logs/test_bam/{sample}.log"
wrapper:
"0.73.0/bio/preseq/lc_extrap"
rule preseq_lc_extrap_bed:
input:
"samples/{sample}.sorted.bed"
output:
"test_bed/{sample}.lc_extrap"
params:
"-v" #optional parameters
log:
"logs/test_bed/{sample}.log"
wrapper:
"0.73.0/bio/preseq/lc_extrap"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
preseq==2.0.3
Input:
- bed files containing duplicates, sorted by chromosome, start position, end position, and finally strand OR
- bam files containing duplicates, sorted with bamtools or samtools sort.
Output:
- lc_extrap (.lc_extrap)
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
params = ""
if (os.path.splitext(snakemake.input[0])[-1]) == ".bam":
if "-bam" not in (snakemake.input[0]):
params = "-bam "
shell(
"(preseq lc_extrap {params} {snakemake.params} {snakemake.input[0]} -output {snakemake.output[0]}) {log}"
)
PRIMERCLIP¶
Primer trimming on SAM files; see https://github.com/swiftbiosciences/primerclip
Example¶
This wrapper can be used in the following way:
rule primerclip:
input:
master_file="master_file",
alignment_file="mapped/{sample}.bam"
output:
alignment_file="mapped/{sample}.trimmed.bam"
log:
"logs/primerclip/{sample}.log"
params:
extra=""
wrapper:
"0.73.0/bio/primerclip"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
samtools==1.9
primerclip==0.3.8
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
master_file = snakemake.input.master_file
in_alignment_file = snakemake.input.alignment_file
out_alignment_file = snakemake.output.alignment_file
# Check inputs/arguments.
if not isinstance(master_file, str):
raise ValueError("input master_file must be a single file path")
if not isinstance(in_alignment_file, str):
raise ValueError("input alignment_file must be a single file path")
if not isinstance(out_alignment_file, str):
raise ValueError("output alignment_file must be a single file path")
samtools_input_command = "samtools view -h " + in_alignment_file
samtools_output_command = " | head -n -3 | samtools view -Sh"
if out_alignment_file.endswith(".cram"):
samtools_output_command += "C -o " + out_alignment_file
elif out_alignment_file.endswith(".sam"):
samtools_output_command += " -o " + out_alignment_file
else:
samtools_output_command += "b -o " + out_alignment_file
shell(
"{samtools_input_command} |"
" primerclip"
" {master_file}"
" /dev/stdin"
" /dev/stdout"
" {samtools_output_command}"
" {log}"
)
PROSOLO¶
For prosolo, the following wrappers are available:
PROSOLO FDR CONTROL¶
ProSolo can control the false discovery rate of any combination of its defined single cell events (like the presence of an alternative allele or the dropout of an allele).
This wrapper can be used in the following way:
rule prosolo_fdr_control:
input:
"variant_calling/{sc}.{bulk}.prosolo.bcf"
output:
"fdr_control/{sc}.{bulk}.prosolo.fdr.bcf"
threads:
1
params:
# comma-separated set of events for whose (joint)
# false discovery rate you want to control
events = "ADO_TO_REF,HET",
# false discovery rate to control for
fdr = 0.05
log:
"logs/prosolo_{sc}_{bulk}.fdr.log"
wrapper:
"0.73.0/bio/prosolo/control-fdr"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
prosolo==0.6.1
Input:
- Variants called with prosolo in vcf or bcf format, including the fine-grained posterior probabilities for single cell events.
Output:
- bcf file with all variants that satisfy the chosen false discovery rate threshold with regard to the specified events.
- David Lähnemann
"""Snakemake wrapper for ProSolo FDR control"""
__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"( prosolo control-fdr"
" {snakemake.input}"
" --events {snakemake.params.events}"
" --var SNV"
" --fdr {snakemake.params.fdr}"
" --output {snakemake.output} )"
"{log} "
)
PROSOLO¶
ProSolo calls variants or other events (like allele dropout) in a single cell sample against a bulk background sample. The single cell should stem from the same population of cells as the bulk background sample. The single cell sample should be amplified using multiple displacement amplification to match ProSolo’s statistical model.
This wrapper can be used in the following way:
rule prosolo_calling:
input:
single_cell = "data/mapped/{sc}.sorted.bam",
single_cell_index = "data/mapped/{sc}.sorted.bam.bai",
bulk = "data/mapped/{bulk}.sorted.bam",
bulk_index = "data/mapped/{bulk}.sorted.bam.bai",
ref = "data/genome.fa",
ref_idx = "data/genome.fa.fai",
candidates = "data/{sc}.{bulk}.prosolo_candidates.bcf",
output:
"variant_calling/{sc}.{bulk}.prosolo.bcf"
params:
extra = ""
threads:
1
log:
"logs/prosolo_{sc}_{bulk}.log"
wrapper:
"0.73.0/bio/prosolo/single-cell-bulk"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
prosolo==0.6.1
Input:
- A position-sorted single cell bam file, with its index.
- A position-sorted bulk bam file, with its index.
- A reference genome sequence in fasta format, with its index.
- A vcf or bcf file specifying candidate sites to perform calling on.
Output:
- Variants called in bcf format, with fine-grained posterior probabilities for single cell events.
- David Lähnemann
"""Snakemake wrapper for ProSolo single-cell-bulk calling"""
__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"( prosolo single-cell-bulk "
"--omit-indels "
" {snakemake.params.extra} "
"--candidates {snakemake.input.candidates} "
"--output {snakemake.output} "
"{snakemake.input.single_cell} "
"{snakemake.input.bulk} "
"{snakemake.input.ref} ) "
"{log} "
)
PTRIMMER¶
Tool to trim off primer sequences from multiplex amplicon sequencing data.
Example¶
This wrapper can be used in the following way:
rule ptrimmer_pe:
input:
r1="resources/a.lane1_R1.fastq.gz",
r2="resources/a.lane1_R2.fastq.gz",
primers="resources/primers.txt"
output:
r1="results/a.lane1_R1.fq.gz",
r2="results/a.lane1_R2.fq.gz"
log:
"logs/ptrimmer/a.lane.log"
wrapper:
"0.73.0/bio/ptrimmer"
rule ptrimmer_se:
input:
r1="resources/a.lane1_R1.fastq.gz",
primers="resources/primers.txt"
output:
r1="results/a.lane1_R1.fq",
log:
"logs/ptrimmer/a.lane1.log"
wrapper:
"0.73.0/bio/ptrimmer"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
ptrimmer==1.3.3
Authors¶
- Felix Mölder
Code¶
__author__ = "Felix Mölder"
__copyright__ = "Copyright 2020, Felix Mölder"
__email__ = "felix.moelder@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
import ntpath
input_reads = "-f {r1}".format(r1=snakemake.input.r1)
out_r1 = ntpath.basename(snakemake.output.r1)
output_reads = "-d {o1}".format(o1=out_r1)
if snakemake.input.get("r2", ""):
seqmode = "pair"
input_reads = "{reads} -r {r2}".format(reads=input_reads, r2=snakemake.input.r2)
out_r2 = ntpath.basename(snakemake.output.r2)
output_reads = "{reads} -e {o2}".format(reads=output_reads, o2=out_r2)
else:
seqmode = "single"
primers = snakemake.input.primers
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
ptrimmer_params = "-t {mode} {in_reads} -a {primers} {out_reads}".format(
mode=seqmode, in_reads=input_reads, primers=primers, out_reads=output_reads
)
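# ptrimmer writes its output files into the current working directory,
# so move the trimmed reads to the requested output paths afterwards.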
process_r1 = "mv {out_read} {final_output_path}".format(
out_read=out_r1, final_output_path=snakemake.output.r1
)
process_r2 = ""
if snakemake.input.get("r2", ""):
process_r2 = "&& mv {out_read} {final_output_path}".format(
out_read=out_r2, final_output_path=snakemake.output.r2
)
shell("(ptrimmer {ptrimmer_params} && {process_r1} {process_r2}) {log}")
PYFASTAQ¶
For pyfastaq, the following wrappers are available:
PYFASTAQ REPLACE_BASES¶
Replaces all occurrences of one letter with another.
This wrapper can be used in the following way:
rule replace_bases:
input:
"{sample}.rna.fa"
output:
"{sample}.dna.fa",
params:
old_base = "U",
new_base = "T",
log:
"logs/fastaq/replace_bases/test/{sample}.log"
wrapper:
"0.73.0/bio/pyfastaq/replace_bases"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
pyfastaq==3.17.0
- Michael Hall
__author__ = "Michael Hall"
__copyright__ = "Copyright 2019, Michael Hall"
__email__ = "michael@mbh.sh"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"fastaq replace_bases"
" {snakemake.input[0]}"
" {snakemake.output[0]}"
" {snakemake.params.old_base}"
" {snakemake.params.new_base}"
" {log}"
)
RASUSA¶
Randomly subsample sequencing reads to a specified coverage using rasusa.
Example¶
This wrapper can be used in the following way:
rule subsample:
input:
r1="{sample}.r1.fq",
r2="{sample}.r2.fq",
output:
r1="{sample}.subsampled.r1.fq",
r2="{sample}.subsampled.r2.fq",
params:
options="--seed 15",
genome_size="3mb", # required
coverage=20, # required
log:
"logs/subsample/{sample}.log",
wrapper:
"0.73.0/bio/rasusa"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
rasusa==0.3.0
Authors¶
- Michael Hall
Code¶
__author__ = "Michael Hall"
__copyright__ = "Copyright 2020, Michael Hall"
__email__ = "michael@mbh.sh"
__license__ = "MIT"
from snakemake.shell import shell
options = snakemake.params.get("options", "")
shell(
"rasusa {options} -i {snakemake.input} -o {snakemake.output} "
"-c {snakemake.params.coverage} -g {snakemake.params.genome_size} "
"2> {snakemake.log}"
)
RAZERS3¶
Mapping (short) reads against a reference sequence. Multiple output formats are supported; please see https://github.com/seqan/seqan/tree/master/apps/razers3
Example¶
This wrapper can be used in the following way:
rule razers3:
input:
# list of input reads
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
# output format is automatically inferred from file extension. Can be bam/sam or other formats.
"mapped/{sample}.bam"
log:
"logs/razers3/{sample}.log"
params:
# the reference genome
genome="genome.fasta",
# additional parameters
extra=""
threads: 8
wrapper:
"0.73.0/bio/razers3"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
razers3==3.5.8
Authors¶
- Jan Forster
Code¶
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(razers3"
" -tc {snakemake.threads}"
" {extra}"
" -o {snakemake.output[0]}"
" {snakemake.params.genome}"
" {snakemake.input.reads})"
" {log}"
)
REBALER¶
Reference-based long read assemblies of bacterial genomes
Example¶
This wrapper can be used in the following way:
rule rebaler:
input:
reference="ref.fa",
reads="{sample}.fq",
output:
assembly="{sample}.asm.fa",
log:
"logs/rebaler/{sample}.log",
params:
extra="",
wrapper:
"0.73.0/bio/rebaler"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
rebaler==0.2.0
Authors¶
- Michael Hall
Code¶
"""Snakemake wrapper for Rebaler - https://github.com/rrwick/Rebaler"""
__author__ = "Michael Hall"
__copyright__ = "Copyright 2020, Michael Hall"
__email__ = "michael@mbh.sh"
__license__ = "MIT"
from snakemake.shell import shell
def get_named_input(name):
value = snakemake.input.get(name)
if value is None:
raise NameError("Missing input named '{}'".format(name))
return value
def get_named_output(name):
return snakemake.output.get(name, snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
reference = get_named_input("reference")
reads = get_named_input("reads")
output = get_named_output("assembly")
shell("rebaler {extra} -t {snakemake.threads} {reference} {reads} > {output} {log}")
REFERENCE¶
For reference, the following wrappers are available:
ENSEMBL-ANNOTATION¶
Download annotation of genomic sites (e.g. transcripts) from ENSEMBL FTP servers, and store them in a single .gtf or .gff3 file.
This wrapper can be used in the following way:
rule get_annotation:
output:
"refs/annotation.gtf"
params:
species="homo_sapiens",
release="87",
build="GRCh37",
fmt="gtf",
flavor="" # optional, e.g. chr_patch_hapl_scaff, see Ensembl FTP.
log:
"logs/get_annotation.log"
cache: True # save space and time with between-workflow caching (see docs)
wrapper:
"0.73.0/bio/reference/ensembl-annotation"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
curl
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import subprocess
import sys
from snakemake.shell import shell
species = snakemake.params.species.lower()
release = int(snakemake.params.release)
fmt = snakemake.params.fmt
build = snakemake.params.build
flavor = snakemake.params.get("flavor", "")
branch = ""
if release >= 81 and build == "GRCh37":
# use the special grch37 branch for new releases
branch = "grch37/"
if flavor:
flavor += "."
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
suffix = ""
if fmt == "gtf":
suffix = "gtf.gz"
elif fmt == "gff3":
suffix = "gff3.gz"
url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/{fmt}/{species}/{species_cap}.{build}.{release}.{flavor}{suffix}".format(
release=release,
build=build,
species=species,
fmt=fmt,
species_cap=species.capitalize(),
suffix=suffix,
flavor=flavor,
branch=branch,
)
try:
shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
except subprocess.CalledProcessError as e:
if snakemake.log:
sys.stderr = open(snakemake.log[0], "a")
print(
"Unable to download annotation data from Ensembl. "
"Did you check that this combination of species, build, and release is actually provided?",
file=sys.stderr,
)
exit(1)
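For the example parameters above (species homo_sapiens, release 87, build GRCh37, fmt gtf, empty flavor), the code assembles the following URL; note the special grch37/ branch that is inserted for releases >= 81 on build GRCh37:

ftp://ftp.ensembl.org/pub/grch37/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz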
ENSEMBL-SEQUENCE¶
Download sequences (e.g. genome) from ENSEMBL FTP servers, and store them in a single .fasta file.
This wrapper can be used in the following way:
rule get_genome:
output:
"refs/genome.fasta"
params:
species="saccharomyces_cerevisiae",
datatype="dna",
build="R64-1-1",
release="98"
log:
"logs/get_genome.log"
cache: True # save space and time with between-workflow caching (see docs)
wrapper:
"0.73.0/bio/reference/ensembl-sequence"
rule get_chromosome:
output:
"refs/chr1.fasta"
params:
species="saccharomyces_cerevisiae",
datatype="dna",
build="R64-1-1",
release="101",
chromosome="I"
log:
"logs/get_genome.log"
cache: True # save space and time with between-workflow caching (see docs)
wrapper:
"0.73.0/bio/reference/ensembl-sequence"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
curl
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import subprocess as sp
import sys
from itertools import product
from snakemake.shell import shell
species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build
branch = ""
if release >= 81 and build == "GRCh37":
# use the special grch37 branch for new releases
branch = "grch37/"
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
build=build, release=release
)
suffixes = ""
datatype = snakemake.params.get("datatype", "")
chromosome = snakemake.params.get("chromosome", "")
if datatype == "dna":
if chromosome:
suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)]
else:
suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"]
elif datatype == "cdna":
suffixes = ["cdna.all.fa.gz"]
elif datatype == "cds":
suffixes = ["cds.all.fa.gz"]
elif datatype == "ncrna":
suffixes = ["ncrna.fa.gz"]
elif datatype == "pep":
suffixes = ["pep.all.fa.gz"]
else:
raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep")
if chromosome:
if not datatype == "dna":
raise ValueError(
"invalid datatype, to select a single chromosome the datatype must be dna"
)
success = False
for suffix in suffixes:
url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species_cap}.{spec}.{suffix}".format(
release=release,
species=species,
datatype=datatype,
spec=spec.format(build=build, release=release),
suffix=suffix,
species_cap=species.capitalize(),
branch=branch,
)
try:
shell("curl -sSf {url} > /dev/null 2> /dev/null")
except sp.CalledProcessError:
continue
shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
success = True
break
if not success:
print(
"Unable to download requested sequence data from Ensembl. "
"Did you check that this combination of species, build, and release is actually provided?",
file=sys.stderr,
)
exit(1)
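For the get_chromosome example above (species saccharomyces_cerevisiae, release 101, build R64-1-1, chromosome I), the single suffix dna.chromosome.I.fa.gz yields exactly one candidate URL:

ftp://ftp.ensembl.org/pub/release-101/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.chromosome.I.fa.gz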
ENSEMBL-VARIATION¶
Download known genomic variants from ENSEMBL FTP servers, and store them in a single .vcf.gz file.
This wrapper can be used in the following way:
rule get_variation:
output:
vcf="refs/variation.vcf.gz"
# Optional: add a fai file as an input (read via snakemake.input.fai) to get a VCF
# with annotated contig lengths (as required by GATK) and properly sorted VCFs:
# input:
#     fai="refs/genome.fasta.fai"
params:
species="saccharomyces_cerevisiae",
release="98", # releases <98 are unsupported
build="R64-1-1",
type="all" # one of "all", "somatic", "structural_variation"
log:
"logs/get_variation.log"
cache: True # save space and time with between-workflow caching (see docs)
wrapper:
"0.73.0/bio/reference/ensembl-variation"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
bcftools=1.11
curl
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import tempfile
import subprocess
import sys
import os
from snakemake.shell import shell
from snakemake.exceptions import WorkflowError
species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build
type = snakemake.params.type
if release < 98:
print("Ensembl releases <98 are unsupported.", file=open(snakemake.log[0], "w"))
exit(1)
branch = ""
if release >= 81 and build == "GRCh37":
# use the special grch37 branch for new releases
branch = "grch37/"
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
if type == "all":
if species == "homo_sapiens" and release >= 93:
suffixes = [
"-chr{}".format(chrom) for chrom in list(range(1, 23)) + ["X", "Y", "MT"]
]
else:
suffixes = [""]
elif type == "somatic":
suffixes = ["_somatic"]
elif type == "structural_variations":
suffixes = ["_structural_variations"]
else:
raise ValueError(
"Unsupported type {} (only all, somatic, structural_variations are allowed)".format(
type
)
)
species_filename = species if release >= 91 else species.capitalize()
urls = [
"ftp://ftp.ensembl.org/pub/{branch}release-{release}/variation/vcf/{species}/{species_filename}{suffix}.{ext}".format(
release=release,
species=species,
suffix=suffix,
species_filename=species_filename,
branch=branch,
ext=ext,
)
for suffix in suffixes
for ext in ["vcf.gz", "vcf.gz.csi"]
]
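# both the per-chromosome VCFs and their .csi indices are fetched; only the
# .vcf.gz files themselves (see the endswith filter) are passed to bcftools concat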
names = [os.path.basename(url) for url in urls if url.endswith(".gz")]
try:
gather = "curl {urls}".format(urls=" ".join(map("-O {}".format, urls)))
workdir = os.getcwd()
with tempfile.TemporaryDirectory() as tmpdir:
if snakemake.input.get("fai"):
shell(
"(cd {tmpdir}; {gather} && "
"bcftools concat -Oz --naive {names} > concat.vcf.gz && "
"bcftools reheader --fai {workdir}/{snakemake.input.fai} concat.vcf.gz "
"> {workdir}/{snakemake.output}) {log}"
)
else:
shell(
"(cd {tmpdir}; {gather} && "
"bcftools concat -Oz --naive {names} "
"> {workdir}/{snakemake.output}) {log}"
)
except subprocess.CalledProcessError as e:
if snakemake.log:
sys.stderr = open(snakemake.log[0], "a")
print(
"Unable to download variation data from Ensembl. "
"Did you check that this combination of species, build, and release is actually provided? ",
file=sys.stderr,
)
exit(1)
REFGENIE¶
Deploy biomedical reference datasets via refgenie. The wrapper reads the path to the refgenie configuration file from the REFGENIE environment variable (see the code below).
Example¶
This wrapper can be used in the following way:
rule obtain_asset:
output:
# the name refers to the refgenie seek key (see attributes on http://refgenomes.databio.org)
fai="refs/genome.fasta"
# Multiple outputs/seek keys are possible here.
params:
genome="human_alu",
asset="fasta",
tag="default"
wrapper:
"0.73.0/bio/refgenie"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
refgenie=0.9.2
refgenconf=0.9.0
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import os
import refgenconf
genome = snakemake.params.genome
asset = snakemake.params.asset
tag = snakemake.params.tag
conf_path = os.environ["REFGENIE"]
rgc = refgenconf.RefGenConf(conf_path, writable=True)
# pull asset if necessary
gat, archive_data, server_url = rgc.pull(genome, asset, tag, force=False)
for seek_key, out in snakemake.output.items():
path = rgc.seek(genome, asset, tag_name=tag, seek_key=seek_key, strict_exists=True)
os.symlink(path, out)
RUBIC¶
RUBIC detects recurrent copy number alterations using copy number breaks.
Example¶
This wrapper can be used in the following way:
rule rubic:
input:
seg="{samples}/segments.txt",
markers="{samples}/markers.txt"
output:
out_gains="{samples}/gains.txt",
out_losses="{samples}/losses.txt",
out_plots=directory("{samples}/plots") #only possible to provide output directory for plots
params:
fdr="",
genefile=""
wrapper:
"0.73.0/bio/rubic"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
r-base=3.4.1
r-rubic=1.0.3
r-data.table=1.10.4
r-pracma=2.0.4
r-ggplot2=2.2.1
r-gtable=0.2.0
r-codetools=0.2_15
r-digest=0.6.12
Authors¶
- Beatrice F. Tan
Code¶
# __author__ = "Beatrice F. Tan"
# __copyright__ = "Copyright 2018, Beatrice F. Tan"
# __email__ = "beatrice.ftan@gmail.com"
# __license__ = "LUMC"
library(RUBIC)
all_genes <- if (snakemake@params[["genefile"]] == "") system.file("extdata", "genes.tsv", package="RUBIC") else snakemake@params[["genefile"]]
fdr <- if (snakemake@params[["fdr"]] == "") 0.25 else snakemake@params[["fdr"]]
rbc <- rubic(fdr, snakemake@input[["seg"]], snakemake@input[["markers"]], genes=all_genes)
rbc$save.focal.gains(snakemake@output[["out_gains"]])
rbc$save.focal.losses(snakemake@output[["out_losses"]])
rbc$save.plots(snakemake@output[["out_plots"]])
SALMON¶
For salmon, the following wrappers are available:
SALMON_INDEX¶
Index a transcriptome assembly with salmon
This wrapper can be used in the following way:
rule salmon_index:
input:
"assembly/transcriptome.fasta"
output:
directory("salmon/transcriptome_index")
log:
"logs/salmon/transcriptome_index.log"
threads: 2
params:
# optional parameters
extra=""
wrapper:
"0.73.0/bio/salmon/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
salmon==0.14.1
- Tessa Pierce
"""Snakemake wrapper for Salmon Index."""
__author__ = "Tessa Pierce"
__copyright__ = "Copyright 2018, Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
shell(
"salmon index -t {snakemake.input} -i {snakemake.output} "
" --threads {snakemake.threads} {extra} {log}"
)
SALMON_QUANT¶
Quantify transcripts with salmon
This wrapper can be used in the following way:
rule salmon_quant_reads:
input:
# If you have multiple fastq files for a single sample (e.g. technical replicates)
# use a list for r1 and r2.
r1 = "reads/{sample}_1.fq.gz",
r2 = "reads/{sample}_2.fq.gz",
index = "salmon/transcriptome_index"
output:
quant = 'salmon/{sample}/quant.sf',
lib = 'salmon/{sample}/lib_format_counts.json'
log:
'logs/salmon/{sample}.log'
params:
# optional parameters
libtype ="A",
# zip_extension = "bz2", # required for bz2 files ('bz2'); optional for gz files ('gz')
extra=""
threads: 2
wrapper:
"0.73.0/bio/salmon/quant"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
salmon==0.14.1
- Tessa Pierce
"""Snakemake wrapper for Salmon Quant"""
__author__ = "Tessa Pierce"
__copyright__ = "Copyright 2018, Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
def manual_decompression(reads, zip_ext):
"""Allow *.bz2 input into salmon. Also provide same
decompression for *gz files, as salmon devs mention
it may be faster in some cases."""
if zip_ext and reads:
if zip_ext == "bz2":
reads = " < (bunzip2 -c " + reads + ")"
elif zip_ext == "gz":
reads = " < (gunzip -c " + reads + ")"
return reads
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
zip_extension = snakemake.params.get("zip_extension", "")
libtype = snakemake.params.get("libtype", "A")
r1 = snakemake.input.get("r1")
r2 = snakemake.input.get("r2")
r = snakemake.input.get("r")
assert (
r1 is not None and r2 is not None
) or r is not None, "either r1 and r2 (paired), or r (unpaired) are required as input"
if r1:
r1 = (
[snakemake.input.r1]
if isinstance(snakemake.input.r1, str)
else snakemake.input.r1
)
r2 = (
[snakemake.input.r2]
if isinstance(snakemake.input.r2, str)
else snakemake.input.r2
)
assert len(r1) == len(r2), "input-> equal number of files required for r1 and r2"
r1_cmd = " -1 " + manual_decompression(" ".join(r1), zip_extension)
r2_cmd = " -2 " + manual_decompression(" ".join(r2), zip_extension)
read_cmd = " ".join([r1_cmd, r2_cmd])
if r:
assert (
r1 is None and r2 is None
), "Salmon cannot quantify mixed paired/unpaired input files. Please input either r1,r2 (paired) or r (unpaired)"
r = [snakemake.input.r] if isinstance(snakemake.input.r, str) else snakemake.input.r
read_cmd = " -r " + manual_decompression(" ".join(r), zip_extension)
outdir = path.dirname(snakemake.output.get("quant"))
shell(
"salmon quant -i {snakemake.input.index} "
" -l {libtype} {read_cmd} -o {outdir} "
" -p {snakemake.threads} {extra} {log} "
)
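Besides paired input via r1/r2, the code above also accepts unpaired reads through a single r input key. A minimal sketch of such a rule, with illustrative paths:

rule salmon_quant_reads_single:
    input:
        r="reads/{sample}.fq.gz",
        index="salmon/transcriptome_index"
    output:
        quant="salmon/{sample}/quant.sf",
        lib="salmon/{sample}/lib_format_counts.json"
    log:
        "logs/salmon/{sample}.log"
    params:
        libtype="A",
        extra=""
    threads: 2
    wrapper:
        "0.73.0/bio/salmon/quant"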
SAMBAMBA¶
For sambamba, the following wrappers are available:
SAMBAMBA FLAGSTAT¶
Outputs some statistics drawn from read flags. See details at https://lomereiter.github.io/sambamba/docs/sambamba-flagstat.html
This wrapper can be used in the following way:
rule sambamba_flagstat:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.stats.txt"
params:
extra="" # optional parameters
log:
"logs/sambamba-flagstat/{sample}.log"
threads: 1
wrapper:
"0.73.0/bio/sambamba/flagstat"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sambamba==0.8.0
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"sambamba flagstat {snakemake.params.extra} -t {snakemake.threads} "
"{snakemake.input[0]} > {snakemake.output[0]} "
"{log}"
)
SAMBAMBA INDEX¶
Index a BAM file with sambamba. See details at https://lomereiter.github.io/sambamba/docs/sambamba-index.html
This wrapper can be used in the following way:
rule sambamba_index:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.bam.bai"
params:
extra="" # optional parameters
log:
"logs/sambamba-index/{sample}.log"
threads: 8
wrapper:
"0.73.0/bio/sambamba/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sambamba==0.8.0
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"sambamba index {snakemake.params.extra} -t {snakemake.threads} "
"{snakemake.input[0]} {snakemake.output[0]} "
"{log}"
)
SAMBAMBA MARKDUP¶
Marks (default) or removes duplicate reads in a BAM file. See details here: https://lomereiter.github.io/sambamba/docs/sambamba-markdup.html
This wrapper can be used in the following way:
rule sambamba_markdup:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.rmdup.bam"
params:
extra="-r" # optional parameters
log:
"logs/sambamba-markdup/{sample}.log"
threads: 8
wrapper:
"0.73.0/bio/sambamba/markdup"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sambamba==0.8.0
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"sambamba markdup {snakemake.params.extra} -t {snakemake.threads} "
"{snakemake.input[0]} {snakemake.output[0]} "
"{log}"
)
SAMBAMBA MERGE¶
Merge multiple BAM files into one with sambamba. See details here: https://lomereiter.github.io/sambamba/docs/sambamba-merge.html
This wrapper can be used in the following way:
rule sambamba_merge:
input:
["mapped/{sample}_1.sorted.bam", "mapped/{sample}_2.sorted.bam"]
output:
"mapped/{sample}.merged.bam"
params:
extra="" # optional parameters
log:
"logs/sambamba-merge/{sample}.log"
threads: 1
wrapper:
"0.73.0/bio/sambamba/merge"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sambamba==0.8.0
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"sambamba merge {snakemake.params.extra} -t {snakemake.threads} "
"{snakemake.output[0]} {snakemake.input} "
"{log}"
)
SAMBAMBA SLICE¶
Fast tool for copying a slice of a BAM file. See details here: https://lomereiter.github.io/sambamba/docs/sambamba-slice.html
This wrapper can be used in the following way:
rule sambamba_slice:
input:
bam="mapped/{sample}.bam",
bai="mapped/{sample}.bam.bai"
output:
"mapped/{sample}.region.bam"
params:
region="xx:1-10" # region to catch (contig:start-end)
log:
"logs/sambamba-slice/{sample}.log"
wrapper:
"0.73.0/bio/sambamba/slice"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sambamba==0.8.0
Input:
- coordinate-sorted and indexed bam file
Output:
- new bam file with specific region
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"sambamba slice "
"{snakemake.input[0]} {snakemake.params.region} > {snakemake.output[0]} "
"{log}"
)
SAMBAMBA SORT¶
Sort bam file with sambamba
This wrapper can be used in the following way:
rule sambamba_sort:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.sorted.bam"
params:
"" # optional parameters
log:
"logs/sambamba-sort/{sample}.log"
threads: 8
wrapper:
"0.73.0/bio/sambamba/sort"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sambamba==0.8.0
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"sambamba sort {snakemake.params} -t {snakemake.threads} "
"-o {snakemake.output[0]} {snakemake.input[0]} "
"{log}"
)
SAMBAMBA VIEW¶
Filter and/or view BAM files. See details here: https://lomereiter.github.io/sambamba/docs/sambamba-view.html
This wrapper can be used in the following way:
rule sambamba_view:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.filtered.bam"
params:
extra="-f bam -F 'mapping_quality >= 50'" # optional parameters
log:
"logs/sambamba-view/{sample}.log"
threads: 8
wrapper:
"0.73.0/bio/sambamba/view"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sambamba==0.8.0
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
in_file = snakemake.input[0]
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
if in_file.endswith(".sam") and "-S" not in extra and "--sam-input" not in extra:
extra += " --sam-input"
shell(
"sambamba view {extra} -t {snakemake.threads} "
"{snakemake.input[0]} > {snakemake.output[0]} "
"{log}"
)
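Because the wrapper appends --sam-input automatically for *.sam inputs, converting SAM to BAM only requires requesting BAM output via extra. A minimal sketch with placeholder paths:
rule sam_to_bam:
    input:
        "mapped/{sample}.sam"
    output:
        "mapped/{sample}.bam"
    params:
        extra="-f bam"
    log:
        "logs/sambamba-view/{sample}.log"
    threads: 4
    wrapper:
        "0.73.0/bio/sambamba/view"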
SAMTOOLS¶
For samtools, the following wrappers are available:
SAMTOOLS BAM2FQ INTERLEAVED¶
Convert a bam file back to unaligned reads in a single fastq file with samtools. For paired end reads, this results in an unsorted interleaved file.
This wrapper can be used in the following way:
rule samtools_bam2fq_interleaved:
input:
"mapped/{sample}.bam"
output:
"reads/{sample}.fq"
params:
" "
threads: 3
wrapper:
"0.73.0/bio/samtools/bam2fq/interleaved"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- David Laehnemann
- Victoria Sack
__author__ = "David Laehnemann, Victoria Sack"
__copyright__ = "Copyright 2018, David Laehnemann, Victoria Sack"
__email__ = "david.laehnemann@hhu.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
prefix = os.path.splitext(snakemake.output[0])[0]
shell(
"samtools bam2fq {snakemake.params} "
" -@ {snakemake.threads} "
" {snakemake.input[0]}"
" > {snakemake.output[0]} "
"{log}"
)
SAMTOOLS BAM2FQ SEPARATE¶
Convert a bam file with paired end reads back to unaligned reads in two separate fastq files with samtools. Reads that are not properly paired are discarded (READ_OTHER and singleton reads in the samtools bam2fq documentation), as are secondary (0x100) and supplementary reads (0x800).
This wrapper can be used in the following way:
rule samtools_bam2fq_separate:
input:
"mapped/{sample}.bam"
output:
"reads/{sample}.1.fq",
"reads/{sample}.2.fq"
params:
sort = "-m 4G",
bam2fq = "-n"
threads: # Remember, this is the number of samtools' additional threads
3 # At least 2 threads have to be requested on cluster submission.
# Thus, this value - 2 will be sent to samtools sort -@ argument.
wrapper:
"0.73.0/bio/samtools/bam2fq/separate"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Samtools' -@/--threads option takes one integer: the number of additional threads, not the total thread count.
- David Laehnemann
- Victoria Sack
__author__ = "David Laehnemann, Victoria Sack"
__copyright__ = "Copyright 2018, David Laehnemann, Victoria Sack"
__email__ = "david.laehnemann@hhu.de"
__license__ = "MIT"
import os
from snakemake.shell import shell
params_sort = snakemake.params.get("sort", "")
params_bam2fq = snakemake.params.get("bam2fq", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
prefix = os.path.splitext(snakemake.output[0])[0]
# Samtools takes additional threads through its option -@
# One thread is used by Samtools sort
# One thread is used by Samtools bam2fq
# So snakemake.threads has to take them into account
# before allowing additional threads through samtools sort -@
threads = "" if snakemake.threads <= 2 else " -@ {} ".format(snakemake.threads - 2)
shell(
"(samtools sort -n "
" {threads} "
" -T {prefix} "
" {params_sort} "
" {snakemake.input[0]} | "
"samtools bam2fq "
" {params_bam2fq} "
" -1 {snakemake.output[0]} "
" -2 {snakemake.output[1]} "
" -0 /dev/null "
" -s /dev/null "
" -F 0x900 "
" - "
") {log}"
)
SAMTOOLS CALMD¶
Calculates MD and NM tags. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_calmd:
input:
aln = "{sample}.bam", # Can be 'sam', 'bam', or 'cram'
ref = "genome.fasta"
output:
"{sample}.calmd.bam"
params:
"-E" # optional params string
threads: 2
wrapper:
"0.73.0/bio/samtools/calmd"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.11
- Filipe G. Vieira
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
out_name, out_ext = path.splitext(snakemake.output[0])
out_ext = out_ext[1:].upper()
shell(
"samtools calmd --threads {snakemake.threads} {snakemake.params} --output-fmt {out_ext} {snakemake.input.aln} {snakemake.input.ref} > {snakemake.output[0]} {log}"
)
SAMTOOLS DEPTH¶
Compute the read depth at each position or region using samtools. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_depth:
input:
bams=["mapped/A.bam", "mapped/B.bam"],
bed="regionToCalcDepth.bed", # optional
output:
"depth.txt"
params:
# the optional bed file from the rule's input is passed to -b automatically
extra="" # optional additional parameters as string
wrapper:
"0.73.0/bio/samtools/depth"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Dayne Filer
"""Snakemake wrapper for running samtools depth."""
__author__ = "Dayne L Filer"
__copyright__ = "Copyright 2020, Dayne L Filer"
__email__ = "dayne.filer@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
params = snakemake.params.get("extra", "")
# check for optional bed file
bed = snakemake.input.get("bed", "")
if bed:
bed = "-b " + bed
shell(
"samtools depth {params} {bed} "
"-o {snakemake.output[0]} {snakemake.input.bams} {log}"
)
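Since the bed input is optional, omitting it makes the same wrapper report depth across the whole alignment; samtools' -a flag can be added via extra to also emit zero-coverage positions. A sketch under those assumptions:
rule samtools_depth_genomewide:
    input:
        bams=["mapped/A.bam", "mapped/B.bam"]
    output:
        "depth_genomewide.txt"
    params:
        extra="-a"  # also print positions with zero depth
    wrapper:
        "0.73.0/bio/samtools/depth"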
SAMTOOLS FAIDX¶
Index reference sequences in FASTA format with samtools faidx. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_faidx:
input:
"{sample}.fa"
output:
"{sample}.fa.fai"
params:
"" # optional params string
wrapper:
"0.73.0/bio/samtools/faidx"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Michael Chambers
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "greenkidneybean@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]} {log}"
)
SAMTOOLS FIXMATE¶
Use samtools to correct mate information after BWA mapping. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_fixmate:
input:
"mapped/{input}"
output:
"fixed/{input}"
message:
"Fixing mate information in {wildcards.input}"
threads:
1
params:
extra = ""
wrapper:
"0.73.0/bio/samtools/fixmate/"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Thibault Dayris
"""Snakemake wrapper for samtools fixmate"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
import os.path as op
from snakemake.shell import shell
from snakemake.utils import makedirs
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
# Samtools' threads parameter lists ADDITIONAL threads.
# That is why threads - 1 is passed to the -@ parameter.
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)
makedirs(op.dirname(snakemake.output[0]))
shell(
"samtools fixmate {extra} {threads}"
" {snakemake.input[0]} {snakemake.output[0]} {log}"
)
SAMTOOLS FLAGSTAT¶
Use samtools to create a flagstat file from a bam or sam file. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_flagstat:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.bam.flagstat"
wrapper:
"0.73.0/bio/samtools/flagstat"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Christopher Preusch
__author__ = "Christopher Preusch"
__copyright__ = "Copyright 2017, Christopher Preusch"
__email__ = "cpreusch[at]ust.hk"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell("samtools flagstat {snakemake.input[0]} > {snakemake.output[0]} {log}")
SAMTOOLS IDXSTATS¶
Use samtools to retrieve and print stats from indexed bam, sam or cram files. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_idxstats:
input:
bam="mapped/{sample}.bam",
idx="mapped/{sample}.bam.bai"
output:
"mapped/{sample}.bam.idxstats"
log:
"logs/samtools/idxstats/{sample}.log"
wrapper:
"0.73.0/bio/samtools/idxstats"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
Input:
- indexed sam, bam or cram file (.sam, .bam, .cram)
- corresponding index files
Output:
- idxstat file (.idxstats)
- Antonie Vietor
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell("samtools idxstats {snakemake.input.bam} > {snakemake.output[0]} {log}")
SAMTOOLS INDEX¶
Index bam file with samtools. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_index:
input:
"mapped/{sample}.sorted.bam"
output:
"mapped/{sample}.sorted.bam.bai"
params:
"" # optional params string
wrapper:
"0.73.0/bio/samtools/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
SAMTOOLS MERGE¶
Merge two bam files with samtools. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_merge:
input:
["mapped/A.bam", "mapped/B.bam"]
output:
"merged.bam"
params:
"" # optional additional parameters as string
threads: # Samtools takes additional threads through its option -@
8 # This value - 1 will be sent to -@
wrapper:
"0.73.0/bio/samtools/merge"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Samtools' -@/--threads option takes one integer: the number of additional threads, not the total thread count.
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Samtools takes additional threads through its option -@
# One thread for samtools merge
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)
shell(
"samtools merge {threads} {snakemake.params} "
"{snakemake.output[0]} {snakemake.input} "
"{log}"
)
SAMTOOLS MPILEUP¶
Generate pileup using samtools. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule mpileup:
input:
# single or list of bam files
bam="mapped/{sample}.bam",
reference_genome="genome.fasta"
output:
"mpileup/{sample}.mpileup.gz"
log:
"logs/samtools/mpileup/{sample}.log"
params:
extra="-d 10000", # optional
wrapper:
"0.73.0/bio/samtools/mpileup"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
pigz==2.3.4
- Patrik Smeds
"""Snakemake wrapper for running mpileup."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
bam_input = snakemake.input.bam
reference_genome = snakemake.input.reference_genome
extra = snakemake.params.get("extra", "")
if not snakemake.output[0].endswith(".gz"):
raise Exception(
'output file will be compressed and therefore filename should end with ".gz"'
)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"samtools mpileup "
"{extra} "
"-f {reference_genome} "
"{bam_input} "
" | pigz > {snakemake.output} "
"{log}"
)
SAMTOOLS SORT¶
Sort bam file with samtools. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_sort:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.sorted.bam"
params:
extra = "-m 4G",
tmp_dir = "/tmp/"
threads: # Samtools takes additional threads through its option -@
8 # This value - 1 will be sent to -@.
wrapper:
"0.73.0/bio/samtools/sort"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Samtools' -@/--threads option takes one integer: the number of additional threads, not the total thread count.
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
out_name, out_ext = os.path.splitext(snakemake.output[0])
tmp_dir = snakemake.params.get("tmp_dir", "")
if tmp_dir:
prefix = os.path.join(tmp_dir, os.path.basename(out_name))
else:
prefix = out_name
# Samtools takes additional threads through its option -@
# One thread for samtools
# Other threads are *additional* threads passed to the argument -@
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)
shell(
"samtools sort {extra} {threads} -o {snakemake.output[0]} "
"-T {prefix} {snakemake.input[0]} "
"{log}"
)
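The tmp_dir param only changes where the intermediate sort chunks go: the wrapper joins it with the output's basename to build the -T prefix, so an output of mapped/A.sorted.bam with tmp_dir="/scratch/tmp/" sorts with -T /scratch/tmp/A.sorted. A sketch, with the scratch path as a placeholder:
rule samtools_sort_scratch:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        extra="-m 4G",
        tmp_dir="/scratch/tmp/"  # hypothetical fast local storage
    threads: 8
    wrapper:
        "0.73.0/bio/samtools/sort"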
SAMTOOLS STATS¶
Generate stats using samtools. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_stats:
input:
"mapped/{sample}.bam"
output:
"samtools_stats/{sample}.txt"
params:
extra="", # Optional: extra arguments.
region="xx:1000000-2000000" # Optional: region string.
log:
"logs/samtools_stats/{sample}.log"
wrapper:
"0.73.0/bio/samtools/stats"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Julian de Ruiter
"""Snakemake wrapper for trimming paired-end reads using cutadapt."""
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}")
SAMTOOLS VIEW¶
Convert or filter SAM/BAM. For more information see SAMtools documentation.
This wrapper can be used in the following way:
rule samtools_view:
input:
"{sample}.sam"
output:
"{sample}.bam"
params:
"-b" # optional params string
wrapper:
"0.73.0/bio/samtools/view"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
samtools==1.10
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"samtools view {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]} {log}"
)
SEQTK¶
For seqtk, the following wrappers are available:
SEQTK-SUBSAMPLE-PE¶
Subsample reads from paired FASTQ files
This wrapper can be used in the following way:
rule seqtk_subsample_pe:
input:
f1="{sample}.1.fastq.gz",
f2="{sample}.2.fastq.gz"
output:
f1="{sample}.1.subsampled.fastq.gz",
f2="{sample}.2.subsampled.fastq.gz"
params:
n=3,
seed=12345
log:
"logs/seqtk_subsample/{sample}.log"
threads:
1
wrapper:
"0.73.0/bio/seqtk/subsample/pe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
seqtk==1.3
pigz=2.3
Input:
- paired fastq files (can be gzip compressed)
Output:
- subsampled paired fastq files (gzip compressed)
- Fabian Kilpert
"""Snakemake wrapper for subsampling reads from paired FASTQ files using seqtk."""
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "fkilpert@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell()
shell(
"( "
"seqtk sample "
"-s {snakemake.params.seed} "
"{snakemake.input.f1} "
"{snakemake.params.n} "
"| pigz -9 -p {snakemake.threads} "
"> {snakemake.output.f1} "
"&& "
"seqtk sample "
"-s {snakemake.params.seed} "
"{snakemake.input.f2} "
"{snakemake.params.n} "
"| pigz -9 -p {snakemake.threads} "
"> {snakemake.output.f2} "
") {log} "
)
SEQTK-SUBSAMPLE-SE¶
Subsample reads from a FASTQ file
This wrapper can be used in the following way:
rule seqtk_subsample_se:
input:
"{sample}.fastq.gz"
output:
"{sample}.subsampled.fastq.gz"
params:
n=3,
seed=12345
log:
"logs/seqtk_subsample/{sample}.log"
threads:
1
wrapper:
"0.73.0/bio/seqtk/subsample/se"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
seqtk==1.3
pigz=2.3
Input:
- fastq file (can be gzip compressed)
Output:
- subsampled fastq file (gzip compressed)
- Fabian Kilpert
"""Snakemake wrapper for subsampling reads from FASTQ file using seqtk."""
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "fkilpert@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell()
shell(
"( "
"seqtk sample "
"-s {snakemake.params.seed} "
"{snakemake.input} "
"{snakemake.params.n} "
"| pigz -9 -p {snakemake.threads} "
"> {snakemake.output} "
") {log} "
)
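seqtk sample accepts either an absolute read count or a fraction between 0 and 1 as its size argument, so n can also request proportional subsampling. A sketch keeping roughly 10% of the reads:
rule seqtk_subsample_fraction:
    input:
        "{sample}.fastq.gz"
    output:
        "{sample}.subsampled.fastq.gz"
    params:
        n=0.1,  # fraction of reads to keep instead of an absolute count
        seed=12345
    log:
        "logs/seqtk_subsample/{sample}.log"
    threads: 4
    wrapper:
        "0.73.0/bio/seqtk/subsample/se"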
SHOVILL¶
Assemble bacterial isolate genomes from Illumina paired-end reads.
Example¶
This wrapper can be used in the following way:
rule shovill:
input:
r1="reads/{sample}_R1.fq.gz",
r2="reads/{sample}_R2.fq.gz"
output:
raw_assembly="assembly/{sample}.{assembler}.assembly.fa",
contigs="assembly/{sample}.{assembler}.contigs.fa"
params:
extra=""
log:
"logs/shovill/{sample}.{assembler}.log"
threads: 1
wrapper:
"0.73.0/bio/shovill"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
shovill==1.1.0
Authors¶
- Sangram Keshari Sahu
Code¶
"""Snakemake wrapper for shovill."""
__author__ = "Sangram Keshari Sahu"
__copyright__ = "Copyright 2020, Sangram Keshari Sahu"
__email__ = "sangramsahu15@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from tempfile import TemporaryDirectory
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
params = snakemake.params.get("extra", "")
with TemporaryDirectory() as tempdir:
shell(
"(shovill"
" --assembler {snakemake.wildcards.assembler}"
" --outdir {tempdir} --force"
" --R1 {snakemake.input.r1}"
" --R2 {snakemake.input.r2}"
" --cpus {snakemake.threads}"
" {params}) {log}"
)
shell(
"mv {tempdir}/{snakemake.wildcards.assembler}.fasta {snakemake.output.raw_assembly}"
" && mv {tempdir}/contigs.fa {snakemake.output.contigs}"
)
SICKLE¶
For sickle, the following wrappers are available:
SICKLE PE¶
Trim paired-end reads with sickle.
This wrapper can be used in the following way:
rule sickle_pe:
input:
r1="input_R1.fq",
r2="input_R2.fq"
output:
r1="output_R1.fq",
r2="output_R2.fq",
rs="output_single.fq",
params:
qual_type="sanger",
# optional extra parameters
extra=""
log:
# optional log file
"logs/sickle/job.log"
wrapper:
"0.73.0/bio/sickle/pe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sickle-trim==1.33
- Wibowo Arindrarto
__author__ = "Wibowo Arindrarto"
__copyright__ = "Copyright 2016, Wibowo Arindrarto"
__email__ = "bow@bow.web.id"
__license__ = "BSD"
from snakemake.shell import shell
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell()
shell(
"(sickle pe -f {snakemake.input.r1} -r {snakemake.input.r2} "
"-o {snakemake.output.r1} -p {snakemake.output.r2} "
"-s {snakemake.output.rs} -t {snakemake.params.qual_type} "
"{extra}) {log}"
)
SICKLE SE¶
Trim single-end reads with sickle.
This wrapper can be used in the following way:
rule sickle_se:
input:
"input_R1.fq"
output:
"output_R1.fq"
params:
qual_type="sanger",
# optional extra parameters
extra=""
log:
"logs/sickle/job.log"
wrapper:
"0.73.0/bio/sickle/pe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sickle-trim==1.33
- Wibowo Arindrarto
__author__ = "Wibowo Arindrarto"
__copyright__ = "Copyright 2016, Wibowo Arindrarto"
__email__ = "bow@bow.web.id"
__license__ = "BSD"
from snakemake.shell import shell
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell()
shell(
"(sickle se -f {snakemake.input[0]} -o {snakemake.output[0]} "
"-t {snakemake.params.qual_type} {extra}) {log}"
)
SNP-MUTATOR¶
Generate mutated sequence files from a reference genome.
Example¶
This wrapper can be used in the following way:
NUM_SIMULATIONS = 2
rule snpmutator:
input:
"{sample}.fa"
output:
vcf = "{sample}.mutated.vcf",
sequences = expand(
"{{sample}}_mutated_{simulation_number}.fasta",
simulation_number=range(1, NUM_SIMULATIONS + 1)
)
params:
num_simulations = NUM_SIMULATIONS,
extra = " ".join([
"--num-substitutions 2",
"--num-insertions 2",
"--num-deletions 0"
]),
log:
"logs/snp-mutator/test/{sample}.log"
wrapper:
"0.73.0/bio/snp-mutator"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
snp-mutator==1.2.0
Authors¶
- Michael Hall
Code¶
"""Snakemake wrapper for SNP Mutator."""
__author__ = "Michael Hall"
__copyright__ = "Copyright 2019, Michael Hall"
__email__ = "mbhall88@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
from pathlib import Path
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
num_simulations = snakemake.params.get("num_simulations", 100)
fasta_outdir = Path(snakemake.output.sequences[0]).absolute().parent
# Format the log redirection string
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Executed shell command
shell(
"snpmutator {extra} "
"--num-simulations {num_simulations} "
"--vcf {snakemake.output.vcf} "
"-F {fasta_outdir} "
"{snakemake.input} {log} "
)
SNPEFF¶
For snpeff, the following wrappers are available:
SNPEFF¶
Annotate predicted effect of nucleotide changes with SnpEff
This wrapper can be used in the following way:
rule snpeff:
input:
calls="{sample}.vcf", # (vcf, bcf, or vcf.gz)
db="resources/snpeff/ebola_zaire" # path to reference db downloaded with the snpeff download wrapper
output:
calls="snpeff/{sample}.vcf", # annotated calls (vcf, bcf, or vcf.gz)
stats="snpeff/{sample}.html", # summary statistics (in HTML), optional
csvstats="snpeff/{sample}.csv" # summary statistics in CSV, optional
log:
"logs/snpeff/{sample}.log"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=4096
wrapper:
"0.73.0/bio/snpeff/annotate"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
snpeff==4.3.1t
bcftools=1.11
snakemake-wrapper-utils==0.1.3
- Bradford Powell
__author__ = "Bradford Powell"
__copyright__ = "Copyright 2018, Bradford Powell"
__email__ = "bpow@unc.edu"
__license__ = "BSD"
from snakemake.shell import shell
from os import path
import shutil
import tempfile
from pathlib import Path
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
outcalls = snakemake.output.calls
if outcalls.endswith(".vcf.gz"):
outprefix = "| bcftools view -Oz"
elif outcalls.endswith(".bcf"):
outprefix = "| bcftools view -Ob"
else:
outprefix = ""
incalls = snakemake.input[0]
if incalls.endswith(".bcf"):
incalls = "< <(bcftools view {})".format(incalls)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
data_dir = Path(snakemake.input.db).parent.resolve()
stats = snakemake.output.get("stats", "")
csvstats = snakemake.output.get("csvstats", "")
csvstats_opt = "" if not csvstats else "-csvStats {}".format(csvstats)
stats_opt = "-noStats" if not stats else "-stats {}".format(stats)
reference = path.basename(snakemake.input.db)
shell(
"snpEff {java_opts} -dataDir {data_dir} "
"{stats_opt} {csvstats_opt} {extra} "
"{reference} {incalls} "
"{outprefix} > {outcalls} {log}"
)
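Since the wrapper pipes through bcftools whenever the output ends in .vcf.gz or .bcf, requesting compressed calls only requires changing the output suffix; no extra params are needed. A minimal sketch:
rule snpeff_compressed:
    input:
        calls="{sample}.vcf",
        db="resources/snpeff/ebola_zaire"
    output:
        calls="snpeff/{sample}.vcf.gz",  # triggers "| bcftools view -Oz"
        stats="snpeff/{sample}.html"
    log:
        "logs/snpeff/{sample}.log"
    resources:
        mem_mb=4096
    wrapper:
        "0.73.0/bio/snpeff/annotate"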
SNPEFF DOWNLOAD¶
Download snpeff DB for a given species.
This wrapper can be used in the following way:
rule snpeff_download:
output:
# wildcard {reference} may be anything listed in `snpeff databases`
directory("resources/snpeff/{reference}")
log:
"logs/snpeff/download/{reference}.log"
params:
reference="{reference}"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/snpeff/download"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
snpeff==4.3.1t
bcftools=1.11
snakemake-wrapper-utils==0.1.3
- Johannes Köster
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
from snakemake.shell import shell
from pathlib import Path
from snakemake_wrapper_utils.java import get_java_opts
java_opts = get_java_opts(snakemake)
reference = snakemake.params.reference
outdir = Path(snakemake.output[0]).parent.resolve()
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell("snpEff download {java_opts} -dataDir {outdir} {reference} {log}")
SNPSIFT¶
For snpsift, the following wrappers are available:
SNPSIFT ANNOTATE¶
Annotate using fields from another VCF file with SnpSift
This wrapper can be used in the following way:
rule test_snpsift_annotate:
input:
call="in.vcf",
database="annotation.vcf"
output:
call="annotated/out.vcf"
log:
"annotate.log"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/snpsift/annotate"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
snpsift==4.3.1t
bcftools==1.10.2
pbgzip==2016.08.04
snakemake-wrapper-utils==0.1.3
Input:
- A VCF-formatted file that is to be annotated
- A VCF-formatted annotation file
Output:
- A VCF-formatted file
- Thibault Dayris
"""Snakemake wrapper for SnpSift annotate"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
min_threads = 1
incall = snakemake.input["call"]
if snakemake.input["call"].endswith("bcf"):
min_threads += 1
incall = "< <(bcftools view {})".format(incall)
elif snakemake.input["call"].endswith("gz"):
min_threads += 1
incall = "< <(gunzip -c {})".format(incall)
outcall = snakemake.output["call"]
if snakemake.output["call"].endswith("gz"):
min_threads += 1
outcall = "| gzip -c > {}".format(outcall)
elif snakemake.output["call"].endswith("bcf"):
min_threads += 1
outcall = "| bcftools view > {}".format(outcall)
else:
outcall = "> {}".format(outcall)
if snakemake.threads < min_threads:
raise ValueError(
"At least {} threads required, {} provided".format(
min_threads, snakemake.threads
)
)
shell(
"SnpSift annotate" # Tool and its subcommand
" {java_opts} {extra}" # Extra parameters
" {snakemake.input.database}" # Path to annotation vcf file
" {incall} " # Path to input vcf file
" {outcall} " # Path to output vcf file
" {log}" # Logging behaviour
)
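Each compressed stream adds one helper process, so a rule that reads and writes gzipped VCFs must reserve the corresponding threads or the wrapper raises the ValueError above. A sketch with both ends compressed (paths are placeholders):
rule snpsift_annotate_gz:
    input:
        call="calls/{sample}.vcf.gz",
        database="annotation.vcf"
    output:
        call="annotated/{sample}.vcf.gz"
    log:
        "logs/snpsift/annotate/{sample}.log"
    threads: 3  # 1 for SnpSift + 1 for gunzip + 1 for gzip
    resources:
        mem_mb=1024
    wrapper:
        "0.73.0/bio/snpsift/annotate"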
SNPSIFT DBNSFP¶
Annotate using integrated annotation from dbNSFP with SnpSift
This wrapper can be used in the following way:
rule test_snpsift_dbnsfp:
input:
call = "in.vcf",
dbNSFP = "dbNSFP.txt.gz"
output:
call = "out.vcf"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/snpsift/dbnsfp"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
snpsift=4.3.1t
bcftools==1.10.2
snakemake-wrapper-utils==0.1.3
- Thibault Dayris
"""Snakemake wrapper for SnpSift dbNSFP"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# Using user-defined file if requested
db = snakemake.input.get("dbNSFP", "")
if db != "":
db = "-db {}".format(db)
min_threads = 1
# Decompression is chosen based on the input file extension
incall = snakemake.input["call"]
if incall.endswith("bcf"):
min_threads += 1
incall = "< <(bcftools view {})".format(incall)
elif incall.endswith("gz"):
min_threads += 1
incall = "< <(gunzip -c {})".format(incall)
# Compression shall be done according to user-defined output
outcall = snakemake.output["call"]
if outcall.endswith("gz"):
min_threads += 1
outcall = "| gzip -c > {}".format(outcall)
elif outcall.endswith("bcf"):
min_threads += 1
outcall = "| bcftools view > {}".format(outcall)
else:
outcall = "> {}".format(outcall)
# Each (un)compression step raises the thread requirement
if snakemake.threads < min_threads:
raise ValueError(
"At least {} threads required, {} provided".format(
min_threads, snakemake.threads
)
)
shell(
"SnpSift dbnsfp" # Tool and its subcommand
" {java_opts} {extra}" # Extra parameters
" {db}" # Path to annotation vcf file
" {incall}" # Path to input vcf file
" {outcall}" # Path to output vcf file
" {log}" # Logging behaviour
)
SNPSIFT GENES SETS¶
Annotate using GMT genes sets with SnpSift
This wrapper can be used in the following way:
rule test_snpsift_gmt:
input:
call = "in.vcf",
gmt = "fake_set.gmt"
output:
call = "annotated/out.vcf"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/snpsift/genesets"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
snpsift==4.3.1t
bcftools==1.10.2
snakemake-wrapper-utils==0.1.3
Input:
- Calls that are to be annotated
- A GMT-formatted annotation file
Output:
- Annotated calls
- Thibault Dayris
"""Snakemake wrapper for SnpSift geneSets"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
min_threads = 1
# Uncompression shall be done according to user-defined input
incall = snakemake.input["call"]
if snakemake.input["call"].endswith("bcf"):
min_threads += 1
incall = "< <(bcftools view {})".format(incall)
elif snakemake.input["call"].endswith("gz"):
min_threads += 1
incall = "< <(gunzip -c {})".format(incall)
# Compression shall be done according to user-defined output
outcall = snakemake.output["call"]
if snakemake.output["call"].endswith("gz"):
min_threads += 1
outcall = "| gzip -c > {}".format(outcall)
elif snakemake.output["call"].endswith("bcf"):
min_threads += 1
outcall = "| bcftools view > {}".format(outcall)
else:
outcall = "> {}".format(outcall)
# Each (un)compression step raises the thread requirement
if snakemake.threads < min_threads:
raise ValueError(
"At least {} threads required, {} provided".format(
min_threads, snakemake.threads
)
)
shell(
"SnpSift geneSets" # Tool and its subcommand
" {java_opts} {extra}" # Extra parameters
" {snakemake.input.gmt}" # Path to annotation vcf file
" {incall}" # Path to input vcf file
" {outcall}" # Path to output vcf file
" {log}" # Logging behaviour
)
SNPSIFT GWAS CATALOG¶
Annotate using GWAS catalog with SnpSift
This wrapper can be used in the following way:
rule test_snpsift_gwascat:
input:
call = "in.vcf",
gwascat = "gwascatalog.txt"
output:
call = "annotated/out.vcf"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/snpsift/gwascat"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
snpsift==4.3.1t
bcftools==1.10.2
snakemake-wrapper-utils==0.1.3
Input:
- Calls that are to be annotated (vcf, bcf, vcf.gz)
- A GWAS Catalog TSV-formatted file
Output:
- Annotated calls (vcf, bcf, vcf.gz)
- Thibault Dayris
"""Snakemake wrapper for SnpSift gwasCat"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
min_threads = 1
# Uncompression shall be done based on user input
incall = snakemake.input["call"]
if incall.endswith("bcf"):
min_threads += 1
incall = "< <(bcftools view {})".format(incall)
elif incall.endswith("gz"):
min_threads += 1
incall = "< <(gunzip -c {})".format(incall)
# Compression shall be done based on user-defined output
outcall = snakemake.output["call"]
if outcall.endswith("bcf"):
min_threads += 1
outcall = "| bcftools view {}".format(outcall)
elif outcall.endswith("gz"):
min_threads += 1
outcall = "| gzip -c > {}".format(outcall)
else:
outcall = "> {}".format(outcall)
# Each additional (un)compression step requires more threads
if snakemake.threads < min_threads:
raise ValueError(
"At least {} threads required, {} provided".format(
min_threads, snakemake.threads
)
)
shell(
"SnpSift gwasCat " # Tool and its subcommand
" {java_opts} {extra} " # Extra parameters
" -db {snakemake.input.gwascat} " # Path to gwasCat file
" {incall} " # Path to input vcf file
" {outcall} " # Path to output vcf file
" {log} " # Logging behaviour
)
SNPSIFT VARTYPE¶
Add an INFO field denoting variant type with SnpSift
This wrapper can be used in the following way:
rule test_snpsift_vartype:
input:
vcf="in.vcf"
output:
vcf="annotated/out.vcf"
message:
"Testing SnpSift varType"
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
log:
"varType.log"
wrapper:
"0.73.0/bio/snpsift/varType"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
snpsift=4.3.1t
snakemake-wrapper-utils==0.1.3
- Thibault Dayris
"""Snakemake wrapper for SnpSift varType"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
shell(
"SnpSift varType" # Tool and its subcommand
" {java_opts} {extra}" # Extra parameters
" {snakemake.input.vcf}" # Path to input vcf file
" > {snakemake.output.vcf}" # Path to output vcf file
" {log}" # Logging behaviour
)
SOURMASH¶
For sourmash, the following wrappers are available:
SOURMASH_COMPUTE¶
Build a MinHash signature for a transcriptome, genome, or reads
This wrapper can be used in the following way:
rule sourmash_reads:
input:
"reads/a.fastq"
output:
"reads.sig"
log:
"logs/sourmash/sourmash_compute_reads.log"
threads: 2
params:
# optional parameters
k = "31",
scaled = "1000",
extra = ""
wrapper:
"0.73.0/bio/sourmash/compute"
rule sourmash_transcriptome:
input:
"assembly/transcriptome.fasta"
output:
"transcriptome.sig"
log:
"logs/sourmash/sourmash_compute_transcriptome.log"
threads: 2
params:
# optional parameters
k = "31",
scaled = "1000",
extra = ""
wrapper:
"0.73.0/bio/sourmash/compute"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sourmash==2.0.0a7
- Lisa K. Johnson
"""Snakemake wrapper for sourmash compute."""
__author__ = "Lisa K. Johnson"
__copyright__ = "Copyright 2018, Lisa K. Johnson"
__email__ = "ljcohen@ucdavis.edu"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
scaled = snakemake.params.get("scaled", "1000")
k = snakemake.params.get("k", "31")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"sourmash compute --scaled {scaled} -k {k} {snakemake.input} -o {snakemake.output}"
" {extra} {log}"
)
SRA-TOOLS¶
For sra-tools, the following wrappers are available:
SRA-TOOLS FASTERQ-DUMP¶
Download FASTQ files from SRA.
This wrapper can be used in the following way:
rule get_fastq_pe:
output:
# the wildcard name must be accession, pointing to an SRA number
"data/{accession}_1.fastq",
"data/{accession}_2.fastq"
params:
# optional extra arguments
extra=""
threads: 6 # defaults to 6
wrapper:
"0.73.0/bio/sra-tools/fasterq-dump"
rule get_fastq_se:
output:
"data/{accession}.fastq"
params:
extra=""
threads: 6
wrapper:
"0.73.0/bio/sra-tools/fasterq-dump"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
sra-tools>2.9.1
- Johannes Köster
- Derek Croote
__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import os
import tempfile
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
outdir = os.path.dirname(snakemake.output[0])
if outdir:
outdir = "--outdir {}".format(outdir)
extra = snakemake.params.get("extra", "")
with tempfile.TemporaryDirectory() as tmp:
shell(
"fasterq-dump --temp {tmp} --threads {snakemake.threads} "
"{extra} {outdir} {snakemake.wildcards.accession} {log}"
)
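The accession is taken from the output wildcard, so downloading a run is just a matter of requesting the matching target files (the SRR number below is a placeholder):
snakemake --use-conda --cores 6 data/SRR123456_1.fastq data/SRR123456_2.fastq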
STAR¶
For star, the following wrappers are available:
STAR¶
Map reads with STAR.
This wrapper can be used in the following way:
rule star_pe_multi:
input:
# use a list for multiple fastq files for one sample
# usually technical replicates across lanes/flowcells
fq1 = ["reads/{sample}_R1.1.fastq", "reads/{sample}_R1.2.fastq"],
# paired-end reads need to be ordered so each item in the two lists matches
fq2 = ["reads/{sample}_R2.1.fastq", "reads/{sample}_R2.2.fastq"] #optional
output:
# see STAR manual for additional output files
"star/pe/{sample}/Aligned.out.sam"
log:
"logs/star/pe/{sample}.log"
params:
# path to STAR reference genome index
index="index",
# optional parameters
extra=""
threads: 8
wrapper:
"0.73.0/bio/star/align"
rule star_se:
input:
fq1 = "reads/{sample}_R1.1.fastq"
output:
# see STAR manual for additional output files
"star/{sample}/Aligned.out.sam"
log:
"logs/star/{sample}.log"
params:
# path to STAR reference genome index
index="index",
# optional parameters
extra=""
threads: 8
wrapper:
"0.73.0/bio/star/align"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
star==2.7.5c
- Johannes Köster
- Tomás Di Domenico
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
[snakemake.input.fq1]
if isinstance(snakemake.input.fq1, str)
else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
fq2 = (
[snakemake.input.fq2]
if isinstance(snakemake.input.fq2, str)
else snakemake.input.fq2
)
assert len(fq1) == len(
fq2
), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])
if fq1[0].endswith(".gz"):
readcmd = "--readFilesCommand zcat"
else:
readcmd = ""
outprefix = os.path.dirname(snakemake.output[0]) + "/"
shell(
"STAR "
"{extra} "
"--runThreadN {snakemake.threads} "
"--genomeDir {snakemake.params.index} "
"--readFilesIn {input_str} "
"{readcmd} "
"--outFileNamePrefix {outprefix} "
"--outStd Log "
"{log}"
)
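The wrapper inspects the first fq1 file and switches on --readFilesCommand zcat for gzipped input, so compressed reads only need the .gz suffix. A minimal paired-end sketch:
rule star_pe_gz:
    input:
        fq1="reads/{sample}_R1.fastq.gz",
        fq2="reads/{sample}_R2.fastq.gz"
    output:
        "star/pe/{sample}/Aligned.out.sam"
    log:
        "logs/star/pe/{sample}.log"
    params:
        index="index",
        extra=""
    threads: 8
    wrapper:
        "0.73.0/bio/star/align"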
STAR INDEX¶
Index fasta sequences with STAR
This wrapper can be used in the following way:
rule star_index:
input:
fasta = "{genome}.fasta"
output:
directory("{genome}")
message:
"Testing STAR index"
threads:
1
params:
extra = ""
log:
"logs/star_index_{genome}.log"
wrapper:
"0.73.0/bio/star/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
star==2.7.5c
Input:
- A (multi)fasta formatted file
Output:
- A directory containing the indexed sequence for downstream STAR mapping
- Thibault Dayris
- Tomás Di Domenico
"""Snakemake wrapper for STAR index"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake.utils import makedirs
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")
gtf = snakemake.input.get("gtf")
if gtf is not None:
gtf = "--sjdbGTFfile " + gtf
sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
gtf = sjdb_overhang = ""
makedirs(snakemake.output)
shell(
"STAR " # Tool
"--runMode genomeGenerate " # Indexation mode
"{extra} " # Optional parameters
"--runThreadN {snakemake.threads} " # Number of threads
"--genomeDir {snakemake.output} " # Path to output
"--genomeFastaFiles {snakemake.input.fasta} " # Path to fasta files
"{sjdb_overhang} " # Read-len - 1
"{gtf} " # Highly recommended GTF
"{log}" # Logging
)
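Adding the optional gtf input switches on splice-junction insertion, with sjdbOverhang taken from params (defaulting to 100; ideally read length minus one). A sketch assuming 100 bp reads and a matching annotation file:
rule star_index_with_gtf:
    input:
        fasta="{genome}.fasta",
        gtf="{genome}.gtf"  # optional; enables --sjdbGTFfile
    output:
        directory("{genome}_index")
    params:
        sjdbOverhang="99",  # read length - 1
        extra=""
    log:
        "logs/star_index_{genome}.log"
    threads: 8
    wrapper:
        "0.73.0/bio/star/index"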
STRELKA¶
For strelka, the following wrappers are available:
STRELKA GERMLINE¶
Call germline variants with Strelka.
This wrapper can be used in the following way:
rule strelka_germline:
input:
# the required bam file
bam="mapped/{sample}.bam",
# path to reference genome fasta and index
fasta="genome.fasta",
fasta_index="genome.fasta.fai"
output:
# Strelka results - either use directory or complete file path
directory("strelka/{sample}")
log:
"logs/strelka/germline/{sample}.log"
params:
# optional parameters
config_extra="",
run_extra=""
threads: 8
wrapper:
"0.73.0/bio/strelka/germline"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
strelka==2.9.10
- Jan Forster
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "jan.forster@uk-essen.de"
__license__ = "MIT"
import os
from pathlib import Path
from snakemake.shell import shell
config_extra = snakemake.params.get("config_extra", "")
run_extra = snakemake.params.get("run_extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
bam = snakemake.input.get("bam") # input bam file, required
assert bam is not None, "input-> bam is a required input parameter"
if snakemake.output[0].endswith(".vcf.gz"):
run_dir = Path(snakemake.output[0]).parents[2]
else:
run_dir = snakemake.output
shell(
"configureStrelkaGermlineWorkflow.py " # configure the strelka run
"--bam {bam} " # input bam
"--referenceFasta {snakemake.input.fasta} " # reference genome
"--runDir {run_dir} " # output directory
"{config_extra} " # additional parameters for the configuration
"&& {run_dir}/runWorkflow.py " # run the strelka workflow
"-m local " # run in local mode
"-j {snakemake.threads} " # number of threads
"{run_extra} " # additional parameters for the run
"{log}"
) # logging
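Alternatively, the rule can request the final VCF directly; the wrapper then derives the run directory from the output path (three levels up, matching Strelka's results/variants/variants.vcf.gz layout). A sketch with placeholder paths:
rule strelka_germline_vcf:
    input:
        bam="mapped/{sample}.bam",
        fasta="genome.fasta",
        fasta_index="genome.fasta.fai"
    output:
        "strelka/{sample}/results/variants/variants.vcf.gz"
    log:
        "logs/strelka/germline/{sample}.log"
    params:
        config_extra="",
        run_extra=""
    threads: 8
    wrapper:
        "0.73.0/bio/strelka/germline"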
STRELKA¶
Strelka calls somatic and germline small variants from mapped sequencing reads
This wrapper can be used in the following way:
rule strelka:
input:
# The normal bam and its index
# are optional input
# normal = "data/b.bam",
# normal_index = "data/b.bam.bai"
tumor = "data/{tumor}.bam",
tumor_index = "data/{tumor}.bam.bai",
fasta = "data/genome.fasta",
fasta_index = "data/genome.fasta.fai"
output:
# Strelka output - can be directory or full file path
directory("{tumor}_vcf")
threads:
1
params:
run_extra = "",
config_extra = ""
log:
"logs/strelka_{tumor}.log"
wrapper:
"0.73.0/bio/strelka/somatic"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
strelka==2.9.10
Input:
- A tumor bam file, with its index.
- A reference genome sequence in fasta format, with its index.
- An optional normal bam file for somatic calling, with its index.
Output:
- Statistics about calling results
- Variants called
- Thibault Dayris
"""Snakemake wrapper for Strelka"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from pathlib import Path
from snakemake.shell import shell
from snakemake.utils import makedirs
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
config_extra = snakemake.params.get("config_extra", "")
run_extra = snakemake.params.get("run_extra", "")
# If a normal bam is used, it should be given
# in the input block, so Snakemake performs its
# additional checks on file existence.
normal = (
"--normalBam {}".format(snakemake.input["normal"])
if "normal" in snakemake.input.keys()
else ""
)
if snakemake.output[0].endswith("vcf.gz"):
run_dir = Path(snakemake.output[0]).parents[2]
else:
run_dir = snakemake.output
shell(
"(configureStrelkaSomaticWorkflow.py " # Configuration script
"{normal} " # Path to normal bam (if any)
"--tumorBam {snakemake.input.tumor} " # Path to tumor bam
"--referenceFasta {snakemake.input.fasta} " # Path to fasta file
"--runDir {run_dir} " # Path to output directory
"{config_extra} " # Extra parametersfor configuration
" && "
"{run_dir}/runWorkflow.py " # Run the pipeline
"--mode local " # Stop internal job submission
"--jobs {snakemake.threads} " # Nomber of threads
"{run_extra}) " # Extra parameters for runWorkflow
"{log}" # Logging behaviour
)
STRLING¶
For strling, the following wrappers are available:
STRLING CALL¶
STRling (pronounced like “sterling”) is a method to detect large short tandem repeat (STR) expansions from short-read sequencing data. call
genotypes/estimates allele sizes for all loci in each sample. Documentation at: https://strling.readthedocs.io/en/latest/run.html
This wrapper can be used in the following way:
rule strling_call:
input:
bam="mapped/{sample}.bam",
bai="mapped/{sample}.bam.bai",
bin="extract/{sample}.bin",
reference="reference/genome.fasta",
fai="reference/genome.fasta.fai",
bounds="merged/group-bounds.txt" # optional, produced by strling merge
output:
"call/{sample}-bounds.txt", # must end with -bounds.txt
"call/{sample}-genotype.txt", # must end with -genotype.txt
"call/{sample}-unplaced.txt" # must end with -unplaced.txt
params:
extra="" # optional extra command line arguments
log:
"log/strling/call/{sample}.log"
wrapper:
"0.73.0/bio/strling/call"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
strling==0.3
Authors¶
- Christopher Schröder
Code¶
"""Snakemake wrapper for strling call"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
bam = snakemake.input.get("bam", None)
bin = snakemake.input.get("bin", None)
reference = snakemake.input.get("reference", None)
bounds = snakemake.input.get("bounds", None)
if not bam or (isinstance(bam, list) and len(bam) != 1):
raise ValueError("Please provide exactly one 'bam' as input.")
if not path.exists(bam + ".bai"):
raise ValueError(
"Please index the bam file. The index file must have same file name as the bam file, with '.bai' appended."
)
if not reference:
raise ValueError("Please provide a fasta 'reference' input.")
if not bounds: # optional
bounds_string = ""
else:
bounds_string = "-b {}".format(bounds)
if not path.exists(reference + ".fai"):
raise ValueError(
"Please index the reference. The index file must have same file name as the reference file, with '.fai' appended."
)
if not any(o.endswith("-bounds.txt") for o in snakemake.output):
raise ValueError("Please provide a file that ends with -bounds.txt in the output.")
for filename in snakemake.output:
if filename.endswith("-bounds.txt"):
prefix = filename[: -len("-bounds.txt")]
break
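# For example (hypothetical names), with outputs
# ["call/S1-bounds.txt", "call/S1-genotype.txt", "call/S1-unplaced.txt"],
# prefix becomes "call/S1"; strling receives -o call/S1 and writes all
# three files as call/S1-*.txt, matching the declared outputs.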
if not any(o == "{}-genotype.txt".format(prefix) for o in snakemake.output):
raise ValueError(
"Please provide an output file that ends with -genotype.txt and has the same prefix as -bounds.txt"
)
if not any(o == "{}-unplaced.txt".format(prefix) for o in snakemake.output):
raise ValueError(
"Please provide an output file that ends with -unplaced.txt and has the same prefix as -bounds.txt"
)
shell(
"(strling call "
"{bam} "
"{bin} "
"{bounds_string} "
"-o {prefix} "
"{extra}) {log}"
)
STRLING EXTRACT¶
STRling (pronounced “sterling”) is a method to detect large short tandem repeat (STR) expansions from short-read sequencing data. extract
retrieves informative read pairs to a binary format for a single sample (same as above, you can use the same bin files). Documentation at: https://strling.readthedocs.io/en/latest/run.html
This wrapper can be used in the following way:
rule strling_extract:
input:
bam="mapped/{sample}.bam",
bai="mapped/{sample}.bam.bai",
reference="reference/genome.fasta",
fai="reference/genome.fasta.fai",
index="reference/genome.fasta.str" # optional
output:
"extract/{sample}.bin"
log:
"log/strling/extract/{sample}.log"
params:
extra="" # optionally add further command line arguments
wrapper:
"0.73.0/bio/strling/extract"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
strling==0.3
Authors¶
- Christopher Schröder
Code¶
"""Snakemake wrapper for strling extract"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
bam = snakemake.input.get("bam", None)
reference = snakemake.input.get("reference", None)
index = snakemake.input.get("index", None)
if not bam or (isinstance(bam, list) and len(bam) != 1):
raise ValueError("Please provide exactly one 'bam' input.")
if not path.exists(bam + ".bai"):
raise ValueError(
"Please index the bam file. The index file must have same file name as the bam file, with '.bai' appended."
)
if not reference:
raise ValueError("Please provide a fasta 'reference' input.")
if not path.exists(reference + ".fai"):
raise ValueError(
"Please index the reference. The index file must have same file name as the reference file, with '.fai' appended."
)
if not index: # optional
index_string = ""
else:
index_string = "-g {}".format(index)
if len(snakemake.output) != 1:
raise ValueError("Please provide exactly one output file (.bin).")
shell(
"(strling extract "
"{bam} "
"{snakemake.output[0]} "
"-f {reference} "
"{index_string} "
"{extra}) {log}"
)
STRLING INDEX¶
STRling (pronounced like “sterling”) is a method to detect large short tandem repeat (STR) expansions from short-read sequencing data. index
creates a bed file of large STR regions in the reference genome. This step is performed automatically as part of strling extract
. However, when running multiple samples, it is more efficient to do it once, then pass the file to strling extract using the -g
option. Documentation at: https://strling.readthedocs.io/en/latest/run.html
This wrapper can be used in the following way:
rule strling_index:
input:
"reference/genome.fasta"
output:
index="reference/genome.fasta.str",
fai="reference/genome.fasta.fai"
params:
extra="" # optionally add further command line arguments
log:
"log/strling/index.log"
wrapper:
"0.73.0/bio/strling/index"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
strling==0.3
Authors¶
- Christopher Schröder
Code¶
"""Snakemake wrapper for strling index"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
if len(snakemake.input) != 1:
raise ValueError("Please provide exactly one reference genome.")
shell(
"(strling index {snakemake.input[0]} "
"-g {snakemake.output.index} "
"{extra}) {log}"
)
STRLING MERGE¶
STRling (pronounced “sterling”) is a method to detect large short tandem repeat (STR) expansions from short-read sequencing data. merge
prepares joint calling of STR loci across all given samples. Requires minimum read evidence from at least one sample. Documentation at: https://strling.readthedocs.io/en/latest/run.html
This wrapper can be used in the following way:
rule strling_merge:
input:
bins=["extract/A.bin", "extract/B.bin"],
reference="reference/genome.fasta",
fai="reference/genome.fasta.fai",
output:
"merged/group-bounds.txt" # must end with "-bounds.txt"
params:
extra="" # optionally add further command line arguments
log:
"log/strling/merge/group.log"
wrapper:
"0.73.0/bio/strling/merge"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
strling==0.3
Authors¶
- Christopher Schröder
Code¶
"""Snakemake wrapper for strling merge"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroede@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
from os import path
# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Check inputs/arguments.
bins = snakemake.input.get("bins", None)
reference = snakemake.input.get("reference", None)
fai = snakemake.input.get("fai", None)
if not bins or len(bins) < 2:
raise ValueError("Please provide at least two 'bins' as input.")
if not reference:
raise ValueError("Please provide a fasta 'reference' input.")
if not path.exists(reference + ".fai"):
raise ValueError(
"Please index the reference. The index file must have same file name as the reference file, with '.fai' appended."
)
if len(snakemake.output) != 1:
raise ValueError("Please provide exactly one output file (.bin).")
if not snakemake.output[0].endswith("-bounds.txt"):
raise ValueError(
"Output file must end with '-bounds.txt'. Please change the output file name."
)
prefix = snakemake.output[0][: -len("-bounds.txt")]
shell("(strling merge " "{bins} " "-o {prefix} " "{extra}) {log}")
SUBREAD¶
For subread, the following wrappers are available:
SUBREAD FEATURECOUNTS¶
FeatureCounts assigns mapped reads or fragments (paired-end data) to genomic features such as genes, exons and promoters. For more information, please see the featureCounts tutorial, the subread documentation and the command line help.
This wrapper can be used in the following way:
rule feature_counts:
input:
sam="{sample}.bam", # list of sam or bam files
annotation="annotation.gtf",
# optional input
# chr_names="", # implicitly sets the -A flag
# fasta="genome.fasta" # implicitly sets the -G flag
output:
multiext("results/{sample}",
".featureCounts",
".featureCounts.summary",
".featureCounts.jcounts")
threads:
2
params:
tmp_dir="", # implicitly sets the --tmpDir flag
r_path="", # implicitly sets the --Rpath flag
extra="-O --fracOverlap 0.2"
log:
"logs/{sample}.log"
wrapper:
"0.73.0/bio/subread/featurecounts"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
subread=2.0
Input/Output¶
Input:
- a list of .sam or .bam files
- a GTF, GFF or SAF annotation file
- optionally, a tab-separated file that determines the sorting order and contains the chromosome names in the first column (sets the -A flag)
- optionally, a reference genome fasta file (sets the -G flag)
Output:
- .featureCounts file including read counts (tab separated)
- .featureCounts.summary file including summary statistics (tab separated)
- .featureCounts.jcounts file including the number of reads supporting each exon-exon junction (tab separated)
Authors¶
- Antonie Vietor
Code¶
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "antonie.v@gmx.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
# optional input files and directories
fasta = snakemake.input.get("fasta", "")
chr_names = snakemake.input.get("chr_names", "")
tmp_dir = snakemake.params.get("tmp_dir", "")
r_path = snakemake.params.get("r_path", "")
if fasta:
extra += " -G {}".format(fasta)
if chr_names:
extra += " -A {}".format(chr_names)
if tmp_dir:
extra += " --tmpDir {}".format(tmp_dir)
if r_path:
extra += " --Rpath {}".format(r_path)
shell(
"(featureCounts"
" {extra}"
" -T {snakemake.threads}"
" -J"
" -a {snakemake.input.annotation}"
" -o {snakemake.output[0]}"
" {snakemake.input.sam})"
" {log}"
)
TABIX¶
Process given file with tabix (e.g., create index).
Example¶
This wrapper can be used in the following way:
rule tabix:
input:
"{prefix}.vcf.gz"
output:
"{prefix}.vcf.gz.tbi"
params:
# pass arguments to tabix (e.g. index a vcf)
"-p vcf"
log:
"logs/tabix/{prefix}.log"
wrapper:
"0.73.0/bio/tabix"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
htslib==1.10
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell("tabix {snakemake.params} {snakemake.input[0]} {log}")
TRANSDECODER¶
For transdecoder, the following wrappers are available:
TRANSDECODER LONGORFS¶
TransDecoder.LongOrfs will identify coding regions within transcript sequences (ORFs) that are at least 100 amino acids long. You can lower this via the ‘-m’ parameter, but know that the rate of false positive ORF predictions increases drastically with shorter minimum length criteria.
This wrapper can be used in the following way:
rule transdecoder_longorfs:
input:
fasta="test.fa.gz", # required
gene_trans_map="test.gtm" # optional gene-to-transcript identifier mapping file (tab-delimited, gene_id<tab>trans_id<return> )
output:
"test.fa.transdecoder_dir/longest_orfs.pep"
log:
"logs/transdecoder/test-longorfs.log"
params:
extra=""
wrapper:
"0.73.0/bio/transdecoder/longorfs"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
transdecoder=5.5.0
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for Transdecoder LongOrfs"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
gtm_cmd = ""
gtm = snakemake.input.get("gene_trans_map", "")
if gtm:
gtm_cmd = " --gene_trans_map " + gtm
output_dir = path.dirname(str(snakemake.output))
# transdecoder fails if output already exists. No force option available
shell("rm -rf {output_dir}")
input_fasta = str(snakemake.input.fasta)
if input_fasta.endswith("gz"):
input_fa = input_fasta.rsplit(".gz")[0]
shell("gunzip -c {input_fasta} > {input_fa}")
else:
input_fa = input_fasta
shell("TransDecoder.LongOrfs -t {input_fa} {gtm_cmd} {log}")
TRANSDECODER PREDICT¶
Predict the likely coding regions from the ORFs identified by Transdecoder.LongOrfs. Optionally include results from homology searches (blast/hmmer results) as ORF retention criteria.
This wrapper can be used in the following way:
rule transdecoder_predict:
input:
fasta="test.fa.gz", # required input; optionally gzipped
pfam_hits="pfam_hits.txt", # optionally retain ORFs with hits by inputting pfam results here (run separately)
blastp_hits="blastp_hits.txt", # optionally retain ORFs with hits by inputting blastp results here (run separately)
# you may also want to add your transdecoder longorfs result here - predict will fail if you haven't first run longorfs
#longorfs="test.fa.transdecoder_dir/longest_orfs.pep"
output:
"test.fa.transdecoder.bed",
"test.fa.transdecoder.cds",
"test.fa.transdecoder.pep",
"test.fa.transdecoder.gff3"
log:
"logs/transdecoder/test-predict.log"
params:
extra=""
wrapper:
"0.73.0/bio/transdecoder/predict"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
transdecoder=5.5.0
Input/Output¶
Input:
- fasta assembly
Output:
- candidate coding regions (pep, cds, gff3, bed output formats)
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for Transdecoder Predict"""
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
addl_outputs = ""
pfam = snakemake.input.get("pfam_hits", "")
if pfam:
addl_outputs += " --retain_pfam_hits " + pfam
blast = snakemake.input.get("blastp_hits", "")
if blast:
addl_outputs += " --retain_blastp_hits " + blast
input_fasta = str(snakemake.input.fasta)
if input_fasta.endswith("gz"):
input_fa = input_fasta.rsplit(".gz")[0]
shell("gunzip -c {input_fasta} > {input_fa}")
else:
input_fa = input_fasta
shell("TransDecoder.Predict -t {input_fa} {addl_outputs} {extra} {log}")
TRIM_GALORE¶
For trim_galore, the following wrappers are available:
TRIM_GALORE-PE¶
Trim paired-end reads using trim_galore.
This wrapper can be used in the following way:
rule trim_galore_pe:
input:
["reads/{sample}.1.fastq.gz", "reads/{sample}.2.fastq.gz"]
output:
"trimmed/{sample}.1_val_1.fq.gz",
"trimmed/{sample}.1.fastq.gz_trimming_report.txt",
"trimmed/{sample}.2_val_2.fq.gz",
"trimmed/{sample}.2.fastq.gz_trimming_report.txt"
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.73.0/bio/trim_galore/pe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
trim-galore==0.4.5
Input/Output¶
Input:
- two (paired-end) fastq files (can be gzip compressed)
Output:
- two trimmed (paired-end) fastq files
- two trimming reports
Notes¶
- It is expected that the fastqc Snakemake wrapper be used in place of the --fastqc option.
- All output files must be placed in the same directory.
Authors¶
- Kerrin Mendler
Code¶
"""Snakemake wrapper for trimming paired-end reads using trim_galore."""
__author__ = "Kerrin Mendler"
__copyright__ = "Copyright 2018, Kerrin Mendler"
__email__ = "mendlerke@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
import os.path
log = snakemake.log_fmt_shell()
# Check that two input files were supplied
n = len(snakemake.input)
assert n == 2, "Input must contain 2 files. Given: %r." % n
# Don't run with `--fastqc` flag
if "--fastqc" in snakemake.params.get("extra", ""):
raise ValueError(
"The trim_galore Snakemake wrapper cannot "
"be run with the `--fastqc` flag. Please "
"remove the flag from extra params. "
"You can use the fastqc Snakemake wrapper on "
"the input and output files instead."
)
# Check that four output files were supplied
m = len(snakemake.output)
assert m == 4, "Output must contain 4 files. Given: %r." % m
# Check that all output files are in the same directory
out_dir = os.path.dirname(snakemake.output[0])
for file_path in snakemake.output[1:]:
assert out_dir == os.path.dirname(file_path), (
"trim_galore can only output files to a single directory."
" Please indicate only one directory for the output files."
)
shell(
"(trim_galore"
" {snakemake.params.extra}"
" --paired"
" -o {out_dir}"
" {snakemake.input})"
" {log}"
)
TRIM_GALORE-SE¶
Trim unpaired reads using trim_galore.
This wrapper can be used in the following way:
rule trim_galore_se:
input:
"reads/{sample}.fastq.gz"
output:
"trimmed/{sample}_trimmed.fq.gz",
"trimmed/{sample}.fastq.gz_trimming_report.txt"
params:
extra="--illumina -q 20"
log:
"logs/trim_galore/{sample}.log"
wrapper:
"0.73.0/bio/trim_galore/se"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
trim-galore==0.4.3
Input/Output¶
Input:
- fastq file with untrimmed reads (can be gzip compressed)
Output:
- trimmed fastq file
- trimming report
Notes¶
- It is expected that the fastqc Snakemake wrapper be used in place of the --fastqc option.
- All output files must be placed in the same directory.
Authors¶
- Kerrin Mendler
Code¶
"""Snakemake wrapper for trimming unpaired reads using trim_galore."""
__author__ = "Kerrin Mendler"
__copyright__ = "Copyright 2018, Kerrin Mendler"
__email__ = "mendlerke@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
import os.path
log = snakemake.log_fmt_shell()
# Don't run with `--fastqc` flag
if "--fastqc" in snakemake.params.get("extra", ""):
raise ValueError(
"The trim_galore Snakemake wrapper cannot "
"be run with the `--fastqc` flag. Please "
"remove the flag from extra params. "
"You can use the fastqc Snakemake wrapper on "
"the input and output files instead."
)
# Check that two output files were supplied
m = len(snakemake.output)
assert m == 2, "Output must contain 2 files. Given: %r." % m
# Check that all output files are in the same directory
out_dir = os.path.dirname(snakemake.output[0])
for file_path in snakemake.output[1:]:
assert out_dir == os.path.dirname(file_path), (
"trim_galore can only output files to a single directory."
" Please indicate only one directory for the output files."
)
shell(
"(trim_galore"
" {snakemake.params.extra}"
" -o {out_dir}"
" {snakemake.input})"
" {log}"
)
TRIMMOMATIC¶
For trimmomatic, the following wrappers are available:
TRIMMOMATIC PE¶
Trim paired-end reads with trimmomatic. (De)compress with pigz.
This wrapper can be used in the following way:
rule trimmomatic_pe:
input:
r1="reads/{sample}.1.fastq.gz",
r2="reads/{sample}.2.fastq.gz"
output:
r1="trimmed/{sample}.1.fastq.gz",
r2="trimmed/{sample}.2.fastq.gz",
# reads where trimming entirely removed the mate
r1_unpaired="trimmed/{sample}.1.unpaired.fastq.gz",
r2_unpaired="trimmed/{sample}.2.unpaired.fastq.gz"
log:
"logs/trimmomatic/{sample}.log"
params:
# list of trimmers (see manual)
trimmer=["TRAILING:3"],
# optional parameters
extra="",
compression_level="-9"
threads:
32
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/trimmomatic/pe"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
trimmomatic==0.36
pigz==2.3.4
snakemake-wrapper-utils==0.1.3
Authors¶
- Johannes Köster
- Jorge Langa
Code¶
"""
bio/trimmomatic/pe
Snakemake wrapper to trim reads with trimmomatic in PE mode with the help of pigz.
pigz is a parallel implementation of gzip. Trimmomatic spends most of its time
compressing and decompressing instead of trimming sequences. By using process
substitution (<(command), >(command)), we can speed trimmomatic up considerably.
Consider providing this wrapper with at least 1 extra thread for each gzipped
input or output file.
"""
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
potential_threads_per_process = available_threads // (
1 + gzipped_input_files + gzipped_output_files
)
if potential_threads_per_process > 0:
# decompressing pigz creates at most 4 threads
pigz_input_threads = (
min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
)
pigz_output_threads = (
(available_threads - pigz_input_threads * gzipped_input_files)
// (1 + gzipped_output_files)
if gzipped_output_files != 0
else 0
)
trimmomatic_threads = (
available_threads
- pigz_input_threads * gzipped_input_files
- pigz_output_threads * gzipped_output_files
)
else:
# not enough threads for pigz
pigz_input_threads = 0
pigz_output_threads = 0
trimmomatic_threads = available_threads
return trimmomatic_threads, pigz_input_threads, pigz_output_threads
def compose_input_gz(filename, threads):
if filename.endswith(".gz") and threads > 0:
return "<(pigz -p {threads} --decompress --stdout {filename})".format(
threads=threads, filename=filename
)
return filename
def compose_output_gz(filename, threads, compression_level):
if filename.endswith(".gz") and threads > 0:
return ">(pigz -p {threads} {compression_level} > {filename})".format(
threads=threads, compression_level=compression_level, filename=filename
)
return filename
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)
# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [
snakemake.output.r1,
snakemake.output.r1_unpaired,
snakemake.output.r2,
snakemake.output.r2_unpaired,
]
trimmomatic_threads, input_threads, output_threads = distribute_threads(
input_files, output_files, snakemake.threads
)
input_r1, input_r2 = [
compose_input_gz(filename, input_threads) for filename in input_files
]
output_r1, output_r1_unp, output_r2, output_r2_unp = [
compose_output_gz(filename, output_threads, compression_level)
for filename in output_files
]
shell(
"trimmomatic PE -threads {trimmomatic_threads} {java_opts} {extra} "
"{input_r1} {input_r2} "
"{output_r1} {output_r1_unp} "
"{output_r2} {output_r2_unp} "
"{trimmer} "
"{log}"
)
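For intuition on how the helper above splits threads, a worked example (thread count is hypothetical; the PE rule has 2 gzipped inputs and 4 gzipped outputs):
# With 8 available threads:
# potential_threads_per_process = 8 // (1 + 2 + 4) = 1
# pigz threads per input = min(4, 1) = 1
# pigz threads per output = (8 - 1 * 2) // (1 + 4) = 1
# trimmomatic threads = 8 - 1 * 2 - 1 * 4 = 2
distribute_threads(
["in.1.fq.gz", "in.2.fq.gz"],
["out.1.fq.gz", "out.1.unpaired.fq.gz", "out.2.fq.gz", "out.2.unpaired.fq.gz"],
8,
) # -> (2, 1, 1)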
TRIMMOMATIC SE¶
Trim single-end reads with trimmomatic. (De)compress with pigz.
This wrapper can be used in the following way:
rule trimmomatic:
input:
"reads/{sample}.fastq.gz" # input and output can be uncompressed or compressed
output:
"trimmed/{sample}.fastq.gz"
log:
"logs/trimmomatic/{sample}.log"
params:
# list of trimmers (see manual)
trimmer=["TRAILING:3"],
# optional parameters
extra="",
# optional compression levels from -0 to -9 and -11
compression_level="-9"
threads:
32
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
wrapper:
"0.73.0/bio/trimmomatic/se"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
trimmomatic==0.36
pigz==2.3.4
snakemake-wrapper-utils==0.1.3
Authors¶
- Johannes Köster
- Jorge Langa
Code¶
"""
bio/trimmomatic/se
Snakemake wrapper to trim reads with trimmomatic in SE mode with the help of pigz.
pigz is a parallel implementation of gzip. Trimmomatic spends most of its time
compressing and decompressing instead of trimming sequences. By using process
substitution (<(command), >(command)), we can speed trimmomatic up considerably.
Consider providing this wrapper with at least 1 extra thread for each gzipped
input or output file.
"""
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_file, output_file, available_threads):
gzipped_input_files = 1 if input_file.endswith(".gz") else 0
gzipped_output_files = 1 if output_file.endswith(".gz") else 0
potential_threads_per_process = available_threads // (
1 + gzipped_input_files + gzipped_output_files
)
if potential_threads_per_process > 0:
# decompressing pigz creates at most 4 threads
pigz_input_threads = (
min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
)
pigz_output_threads = (
(available_threads - pigz_input_threads * gzipped_input_files)
// (1 + gzipped_output_files)
if gzipped_output_files != 0
else 0
)
trimmomatic_threads = (
available_threads
- pigz_input_threads * gzipped_input_files
- pigz_output_threads * gzipped_output_files
)
else:
# not enough threads for pigz
pigz_input_threads = 0
pigz_output_threads = 0
trimmomatic_threads = available_threads
return trimmomatic_threads, pigz_input_threads, pigz_output_threads
def compose_input_gz(filename, threads):
if filename.endswith(".gz") and threads > 0:
return "<(pigz -p {threads} --decompress --stdout {filename})".format(
threads=threads, filename=filename
)
return filename
def compose_output_gz(filename, threads, compression_level):
if filename.endswith(".gz") and threads > 0:
return ">(pigz -p {threads} {compression_level} > {filename})".format(
threads=threads, compression_level=compression_level, filename=filename
)
return filename
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)
# Distribute threads
trimmomatic_threads, input_threads, output_threads = distribute_threads(
snakemake.input[0], snakemake.output[0], snakemake.threads
)
# Collect files
input = compose_input_gz(snakemake.input[0], input_threads)
output = compose_output_gz(snakemake.output[0], output_threads, compression_level)
shell(
"trimmomatic SE -threads {trimmomatic_threads} "
"{java_opts} {extra} {input} {output} {trimmer} {log}"
)
TRINITY¶
Generate transcriptome assembly with Trinity
Example¶
This wrapper can be used in the following way:
rule trinity:
input:
left=["reads/reads.left.fq.gz", "reads/reads2.left.fq.gz"],
right=["reads/reads.right.fq.gz", "reads/reads2.right.fq.gz"]
output:
"trinity_out_dir/Trinity.fasta"
log:
'logs/trinity/trinity.log'
params:
extra=""
threads: 4
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_gb=10
wrapper:
"0.73.0/bio/trinity"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
trinity==2.8.4
Authors¶
- Tessa Pierce
Code¶
"""Snakemake wrapper for Trinity."""
__author__ = "Tessa Pierce"
__copyright__ = "Copyright 2018, Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from os import path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
# Previous wrapper reserved 10 Gigabytes by default. This behaviour is
# preserved below:
max_memory = "10G"
# If mem_mb is given in resources, convert it to gigabytes for Trinity.
# This preserves backward compatibility.
if "mem_mb" in snakemake.resources.keys():
# max_memory from trinity expects a value in gigabytes.
rounded_mb_to_gb = int(snakemake.resources["mem_mb"] / 1024)
max_memory = "{}G".format(rounded_mb_to_gb)
# Getting memory in gigabytes, for user convenience. Please prefer the use
# of mem_mb over mem_gb as advised in documentation.
elif "mem_gb" in snakemake.resources.keys():
max_memory = "{}G".format(snakemake.resources["mem_gb"])
# allow multiple input files for single assembly
left = snakemake.input.get("left")
assert left is not None, "input-> left is a required input parameter"
left = (
[snakemake.input.left]
if isinstance(snakemake.input.left, str)
else snakemake.input.left
)
right = snakemake.input.get("right")
if right:
right = (
[snakemake.input.right]
if isinstance(snakemake.input.right, str)
else snakemake.input.right
)
assert len(left) >= len(
right
), "left input needs to contain at least the same number of files as the right input (can contain additional, single-end files)"
input_str_left = " --left " + ",".join(left)
input_str_right = " --right " + ",".join(right)
else:
input_str_left = " --single " + ",".join(left)
input_str_right = ""
input_cmd = " ".join([input_str_left, input_str_right])
# infer seqtype from input files:
seqtype = snakemake.params.get("seqtype")
if not seqtype:
if "fq" in left[0] or "fastq" in left[0]:
seqtype = "fq"
elif "fa" in left[0] or "fasta" in left[0]:
seqtype = "fa"
else: # assertion is redundant - warning or error instead?
assert (
seqtype is not None
), "cannot infer 'fq' or 'fa' seqtype from input files. Please specify 'fq' or 'fa' in 'seqtype' parameter"
outdir = path.dirname(snakemake.output[0])
assert "trinity" in outdir, "output directory name must contain 'trinity'"
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
shell(
"Trinity {input_cmd} --CPU {snakemake.threads} "
" --max_memory {max_memory} --seqType {seqtype} "
" --output {outdir} {snakemake.params.extra} "
" {log}"
)
TXIMPORT¶
Import and summarize transcript-level estimates for both transcript-level and gene-level analysis.
Example¶
This wrapper can be used in the following way:
rule tximport:
input:
quant = expand("quant/A/quant.sf")
# Optional transcript/gene links as described in tximport
# tx_to_gene = "/path/to/tx2gene"
output:
txi = "txi.RDS"
params:
extra = "type='salmon', txOut=TRUE"
wrapper:
"0.73.0/bio/tximport"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
bioconductor-tximport==1.14.0
r-readr==1.3.1
r-jsonlite==1.6
Notes¶
Add any tximport options in the params; they will be passed through to the R wrapper. Unsupported options will cause an unknown-parameter error.
Authors¶
- Thibault Dayris
Code¶
#!/bin/R
# Loading library
base::library("tximport"); # Perform actual count importation in R
base::library("readr"); # Read faster!
base::library("jsonlite"); # Importing inferential replicates
# Cast input paths as character to avoid errors
samples_paths <- sapply( # Sequentially apply
snakemake@input[["quant"]], # ... to all quantification paths
function(quant) as.character(quant) # ... a cast as character
);
# Collapse path into a character vector
samples_paths <- base::paste0(samples_paths, collapse = '", "');
# Building function arguments
extra <- base::paste0('files = c("', samples_paths, '")');
# Check if user provided optional transcript to gene table
if ("tx_to_gene" %in% names(snakemake@input)) {
tx2gene <- readr::read_tsv(snakemake@input[["tx_to_gene"]]);
extra <- base::paste(
extra, # Forward existing arguments
", tx2gene = ", # Argument name
"tx2gene" # Add tx2gene to parameters
);
}
# Add user defined arguments
if ("extra" %in% names(snakemake@params)) {
if (snakemake@params[["extra"]] != "") {
extra <- base::paste(
extra, # Forward existing parameters
snakemake@params[["extra"]], # Add user parameters
sep = ", " # Field separator
);
}
}
print(extra);
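# At this point, for the example rule above, extra looks like (sketch):
#   files = c("quant/A/quant.sf"), type='salmon', txOut=TRUE
# so the eval below effectively runs:
#   tximport::tximport(files = c("quant/A/quant.sf"), type='salmon', txOut=TRUE);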
# Perform tximport work
txi <- base::eval( # Evaluate the following
base::parse( # ... parsed expression
text = base::paste0(
"tximport::tximport(", extra, ");" # ... of tximport and its arguments
)
)
);
# Save results
base::saveRDS( # Save R object
object = txi, # The txi object
file = snakemake@output[["txi"]] # Output path is provided by Snakemake
);
UCSC¶
For ucsc, the following wrappers are available:
BEDGRAPHTOBIGWIG¶
Convert *.bedGraph file to *.bw file (see http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt)
This wrapper can be used in the following way:
rule bedGraphToBigWig:
input:
bedGraph="{sample}.bedGraph",
chromsizes="genome.chrom.sizes"
output:
"{sample}.bw"
log:
"logs/{sample}.bed-graph_to_big-wig.log"
params:
"" # optional params string
wrapper:
"0.73.0/bio/ucsc/bedGraphToBigWig"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
ucsc-bedgraphtobigwig==377
Input/Output¶
Input:
- bedGraph: path to the *.bedGraph file
- chromsizes: chrom sizes file; can be generated by twoBitInfo or downloaded from UCSC
Output:
- path to the output *.bw file
Authors¶
- Roman Cherniatchik
Code¶
"""Snakemake wrapper for *.bedGraph to *.bw conversion using UCSC bedGraphToBigWig tool."""
# http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
shell(
"bedGraphToBigWig {extra}"
" {snakemake.input.bedGraph} {snakemake.input.chromsizes}"
" {snakemake.output} {log}"
)
FATOTWOBIT¶
Convert *.fa file to *.2bit file (see http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt)
This wrapper can be used in the following way:
# Example: from *.fa file
rule faToTwoBit_fa:
input:
"{sample}.fa"
output:
"{sample}.2bit"
log:
"logs/{sample}.fa_to_2bit.log"
params:
"" # optional params string
wrapper:
"0.73.0/bio/ucsc/faToTwoBit"
# Example: from *.fa.gz file
rule faToTwoBit_fa_gz:
input:
"{sample}.fa.gz"
output:
"{sample}.2bit"
log:
"logs/{sample}.fa-gz_to_2bit.log"
params:
"" # optional params string
wrapper:
"0.73.0/bio/ucsc/faToTwoBit"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
ucsc-fatotwobit==377
Authors¶
- Roman Cherniatchik
Code¶
"""Snakemake wrapper for *.fa to *.2bit conversion using UCSC faToTwoBit tool."""
# http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
shell("faToTwoBit {extra} {snakemake.input} {snakemake.output} {log}")
TWOBITINFO¶
Generate a *.chrom.sizes file from a *.2bit file (see http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt)
This wrapper can be used in the following way:
rule twoBitInfo:
input:
"{sample}.2bit"
output:
"{sample}.chrom.sizes"
log:
"logs/{sample}.chrom.sizes.log"
params:
"" # optional params string
wrapper:
"0.73.0/bio/ucsc/twoBitInfo"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
ucsc-twobitinfo==377
Authors¶
- Roman Cherniatchik
Code¶
"""Snakemake wrapper for generating a *.chrom.sizes file from a *.2bit file using UCSC twoBitInfo tool."""
# http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
shell("twoBitInfo {extra} {snakemake.input} {snakemake.output} {log}")
TWOBITTOFA¶
Convert *.2bit file to *.fa file (see http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt)
This wrapper can be used in the following way:
rule twoBitToFa:
input:
"{sample}.2bit"
output:
"{sample}.fa"
log:
"logs/{sample}.2bit_to_fa.log"
params:
"" # optional params string
wrapper:
"0.73.0/bio/ucsc/twoBitToFa"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
ucsc-twobittofa==377
Authors¶
- Roman Cherniatchik
Code¶
"""Snakemake wrapper for *.2bit to *.fa conversion using UCSC twoBitToFa tool."""
# http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
shell("twoBitToFa {extra} {snakemake.input} {snakemake.output} {log}")
UMIS¶
For umis, the following wrappers are available:
UMIS BAMTAG¶
Convert a BAM/SAM with fastqtransformed read names to have UMI and cellular barcode tags.
This wrapper can be used in the following way:
rule umis_bamtag:
input:
"data/{sample}.bam"
output:
"data/{sample}.annotated.bam"
log:
"logs/umis/bamtag/{sample}.log"
params:
extra=""
threads: 1
wrapper:
"0.73.0/bio/umis/bamtag"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
umis==1.0.3
samtools==1.9
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
bam_input = snakemake.input[0]
if bam_input is None:
raise ValueError("Missing bam input file!")
elif not len(snakemake.input) == 1:
raise ValueError("Only expecting one input file: " + str(snakemake.input) + "!")
output_file = snakemake.output[0]
if output_file is None:
raise ValueError("Missing output file")
elif not len(snakemake.output) == 1:
raise ValueError("Only expecting one output file: " + str(output_file) + "!")
in_pipe = ""
if bam_input.endswith(".sam"):
in_pipe = "cat "
else:
in_pipe = "samtools view -h "
out_pipe = ""
if not output_file.endswith(".sam"):
out_pipe = " | samtools view -S -b - "
shell(
" {in_pipe} {bam_input} | " # decompress or pass through the input
"umis bamtag - " # add UMI tags
"{out_pipe} " # optionally convert back to bam
"> {output_file} " # path to output file
"{log}" # logging behaviour
)
UNICYCLER¶
Assemble bacterial genomes with Unicycler.
You may find additional information on Unicycler’s github page.
Example¶
This wrapper can be used in the following way:
rule test_unicycler:
input:
# R1 and R2 short reads:
paired = expand(
"reads/{sample}.{read}.fq.gz",
read=["R1", "R2"],
allow_missing=True
)
# Long reads:
# long = long_reads/{sample}.fq.gz
# Unpaired reads:
# unpaired = reads/{sample}.fq.gz
output:
"result/{sample}/assembly.fasta"
log:
"logs/{sample}.log"
params:
extra=""
wrapper:
"0.73.0/bio/unicycler"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
bowtie2==2.4.1
bcftools==1.10.2
spades==3.14.1
samtools==1.10
pilon==1.23
racon==1.4.13
blast==2.10.1
unicycler==0.4.8
Authors¶
- Thibault Dayris
Code¶
"""Snakemake wrapper for Unicycler"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2020, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from os.path import dirname
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
input_reads = ""
if "paired" in snakemake.input.keys():
input_reads += " --short1 {} --short2 {}".format(*snakemake.input.paired)
if "unpaired" in snakemake.input.keys():
input_reads += " --unpaired {} ".format(snakemake.input["unpaired"])
if "long" in snakemake.input.keys():
input_reads += " --long {} ".format(snakemake.input["long"])
output_dir = " --out {} ".format(dirname(snakemake.output[0]))
shell(
" unicycler "
" {input_reads} "
" --threads {snakemake.threads} "
" {output_dir} "
" {extra} "
" {log} "
)
VARDICT¶
Run VarDict to call genomic variants.
Example¶
This wrapper can be used in the following way:
rule vardict_single_mode:
input:
reference="data/genome.fasta",
regions="regions.bed",
bam="mapped/{sample}.bam",
output:
vcf="vcf/{sample}.s.vcf",
params:
extra="",
bed_columns="-c 1 -S 2 -E 3 -g 4", # Optional, default is -c 1 -S 2 -E 3 -g 4
af_th="0.01", # Optional, default is 0.01
threads: 1
log:
"logs/varscan_{sample}_s_.log",
wrapper:
"0.73.0/bio/vardict"
rule vardict_paired_mode:
input:
reference="data/genome.fasta",
regions="regions.bed",
bam="mapped/{sample}.bam",
normal="mapped/b.bam",
output:
vcf="vcf/{sample}.tn.vcf",
params:
extra="",
threads: 1
log:
"logs/varscan_{sample}_tn.log",
wrapper:
"0.73.0/bio/vardict"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vardict-java==1.8.2
Input/Output¶
Input:
- reference file
- bam file
- normal file, optional (must be set for tumor/normal mode)
- region file
Output:
- A VCF file
Authors¶
- Patrik Smeds
Code¶
"""Snakemake wrapper for VarDict Single sample mode"""
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2021, Patrik Smeds"
__email__ = "patrik.smeds@scilifelab.uu.se"
__license__ = "MIT"
from pathlib import Path
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
reference = snakemake.input.reference
regions = snakemake.input.regions
bam = snakemake.input.bam
normal = snakemake.input.get("normal", None)
vcf = snakemake.output.vcf
extra = snakemake.params.get("extra", "")
bed_columns = snakemake.params.get("bed_columns", "-c 1 -S 2 -E 3 -g 4")
af_th = snakemake.params.get("allele_frequency_threshold", "0.01")
if normal is None:
input_bams = bam
name = snakemake.params.get("sample_name", Path(bam).stem)
post_scripts = (
"teststrandbias.R | var2vcf_valid.pl -A -N '" + name + "' -E -f " + af_th
)
else:
input_bams = "'" + bam + "|" + normal + "'"
name = snakemake.params.get("sample_name", Path(bam).stem + "|" + Path(normal).stem)
post_scripts = 'testsomatic.R | var2vcf_paired.pl -N "' + name + '" -f ' + af_th
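# Sketch of the two resulting pipelines (hypothetical file names):
#   single: vardict-java -G ref.fa ... -b tumor.bam regions.bed | teststrandbias.R | var2vcf_valid.pl -A -N 'tumor' -E -f 0.01 > out.vcf
#   paired: vardict-java -G ref.fa ... -b 'tumor.bam|normal.bam' regions.bed | testsomatic.R | var2vcf_paired.pl -N 'tumor|normal' -f 0.01 > out.vcf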
shell(
"vardict-java -G {reference} "
"-f {af_th} "
"-th {snakemake.threads} "
"{bed_columns} "
"-N '{name}' "
"-b {input_bams} "
"{regions} |"
"{post_scripts} "
"> {vcf}"
"{log}"
)
VARSCAN¶
For varscan, the following wrappers are available:
VARSCAN MPILEUP2INDEL¶
Detect indel in NGS data from mpileup files with VarScan
This wrapper can be used in the following way:
rule mpileup_to_vcf:
input:
"mpileup/{sample}.mpileup.gz"
output:
"vcf/{sample}.vcf"
message:
"Calling Indel with Varscan2"
threads: # Varscan does not take any threading information
1 # However, mpileup might have to be unzipped.
# Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
log:
"logs/varscan_{sample}.log"
wrapper:
"0.73.0/bio/varscan/mpileup2indel"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
varscan==2.4.3
snakemake-wrapper-utils==0.1.3
Notes¶
Varscan does not take any threading information by itself. However, the mpileup files given as input might be gzipped.
If so, it is recommended to use two threads:
- 1 for varscan itself
- 1 for zcat
Authors¶
- Thibault Dayris
Code¶
"""Snakemake wrapper for Varscan2 mpileup2indel"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
import os.path as op
from snakemake.shell import shell
from snakemake.utils import makedirs
from snakemake_wrapper_utils.java import get_java_opts
# Gathering extra parameters and logging behaviour
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
# In case input files are gzipped mpileup files,
# they are being unzipped and piped
# In that case, it is recommended to use at least 2 threads:
# - One for unzipping with zcat
# - One for running varscan
pileup = (
" cat {} ".format(snakemake.input[0])
if not snakemake.input[0].endswith("gz")
else " zcat {} ".format(snakemake.input[0])
)
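# e.g. for a gzipped mpileup the final command is (sketch, hypothetical paths):
#   varscan mpileup2indel <( zcat mpileup/S1.mpileup.gz ) ... > vcf/S1.vcf 2> log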
# Building output directories
makedirs(op.dirname(snakemake.output[0]))
shell(
"varscan mpileup2indel " # Tool and its subprocess
"<( {pileup} ) "
"{java_opts} {extra} " # Extra parameters
"> {snakemake.output[0]} " # Path to vcf file
"{log}" # Logging behaviour
)
VARSCAN MPILEUP2SNP¶
Detect variants in NGS data from Samtools mpileup with VarScan
This wrapper can be used in the following way:
rule mpileup_to_vcf:
input:
"mpileup/{sample}.mpileup.gz"
output:
"vcf/{sample}.vcf"
message:
"Calling SNP with Varscan2"
threads: # Varscan does not take any threading information
1 # However, mpileup might have to be unzipped.
# Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
log:
"logs/varscan_{sample}.log"
wrapper:
"0.73.0/bio/varscan/mpileup2snp"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
varscan==2.4.3
snakemake-wrapper-utils==0.1.3
Notes¶
Varscan does not take any threading information by itself. However, the mpileup files given as input might be gzipped.
If so, it is recommended to use two threads:
- 1 for varscan itself
- 1 for zcat
Authors¶
- Thibault Dayris
Code¶
"""Snakemake wrapper for Varscan2 mpileup2snp"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
import os.path as op
from snakemake.shell import shell
from snakemake.utils import makedirs
from snakemake_wrapper_utils.java import get_java_opts
# Gathering extra parameters and logging behaviour
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
# In case input files are gzipped mpileup files,
# they are being unzipped and piped
# In that case, it is recommended to use at least 2 threads:
# - One for unzipping with zcat
# - One for running varscan
pileup = (
" cat {} ".format(snakemake.input[0])
if not snakemake.input[0].endswith("gz")
else " zcat {} ".format(snakemake.input[0])
)
# Building output directories
makedirs(op.dirname(snakemake.output[0]))
shell(
"varscan mpileup2snp " # Tool and its subprocess
"<( {pileup} ) "
"{java_opts} {extra} " # Extra parameters
"> {snakemake.output[0]} " # Path to vcf file
"{log}" # Logging behaviour
)
VARSCAN SOMATIC¶
Varscan Somatic calls variants and identifies their somatic status (Germline/LOH/Somatic) using pileup files from a matched tumor-normal pair.
This wrapper can be used in the following way:
rule varscan_somatic:
input:
# A pair of pileup files can be used *instead* of the mpileup
# normal_pileup = ""
# tumor_pileup = ""
mpileup = "mpileup/{sample}.mpileup.gz"
output:
snp = "vcf/{sample}.snp.vcf",
indel = "vcf/{sample}.indel.vcf"
message:
"Calling somatic variants {wildcards.sample}"
threads:
1
# optional specification of memory usage of the JVM that snakemake will respect with global
# resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
# and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
# https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
resources:
mem_mb=1024
params:
extra = ""
wrapper:
"0.73.0/bio/varscan/somatic"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
varscan==2.4.3
snakemake-wrapper-utils==0.1.3
Authors¶
- Thibault Dayris
Code¶
"""Snakemake wrapper for varscan somatic"""
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
import os.path as op
from snakemake.shell import shell
from snakemake.utils import makedirs
from snakemake_wrapper_utils.java import get_java_opts
# Defining logging and gathering extra parameters
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
# Building output dirs
makedirs(op.dirname(snakemake.output.snp))
makedirs(op.dirname(snakemake.output.indel))
# Output prefix
prefix = op.splitext(snakemake.output.snp)[0]
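# e.g. snp = "vcf/S1.snp.vcf" (hypothetical) gives prefix "vcf/S1.snp",
# which varscan uses as the basename for its output files.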
# Searching for input files
pileup_pair = ["normal_pileup", "tumor_pileup"]
in_pileup = ""
mpileup = ""
if "mpileup" in snakemake.input.keys():
# Case there is a mpileup with both normal and tumor
in_pileup = snakemake.input.mpileup
mpileup = "--mpileup 1"
elif all(pileup in snakemake.input.keys() for pileup in pileup_pair):
# Case there are two separate pileup files
in_pileup = " {snakemake.input.normal_pileup}" " {snakemakeinput.tumor_pileup} "
else:
raise KeyError("Could not find either a mpileup, or a pair of pileup files")
shell(
"varscan somatic" # Tool and its subcommand
" {in_pileup}" # Path to input file(s)
" {prefix}" # Path to output
" {java_opts} {extra}" # Extra parameters
" {mpileup}"
" --output-snp {snakemake.output.snp}" # Path to snp output file
" --output-indel {snakemake.output.indel}" # Path to indel output file
)
VCFTOOLS¶
For vcftools, the following wrappers are available:
VCFTOOLS FILTER¶
Filter vcf files using vcftools
This wrapper can be used in the following way:
rule filter_vcf:
input:
"{sample}.vcf"
output:
"{sample}.filtered.vcf"
params:
extra="--chr 1 --recode-INFO-all"
wrapper:
"0.73.0/bio/vcftools/filter"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vcftools==0.1.16
Authors¶
- Patrik Smeds
Code¶
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2018, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
input_flag = "--vcf"
if snakemake.input[0].endswith(".gz"):
input_flag = "--gzvcf"
output = " > " + snakemake.output[0]
if output.endswith(".gz"):
output = " | gzip" + output
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
shell(
"vcftools "
"{input_flag} "
"{snakemake.input} "
"{extra} "
"--recode "
"--stdout "
"{output} "
"{log}"
)
VEMBRANE¶
For vembrane, the following wrappers are available:
VEMBRANE FILTER¶
Vembrane filter allows you to simultaneously filter variants based on any INFO field, CHROM, POS, REF, ALT, QUAL, and the annotation field ANN. When filtering based on ANN, annotation entries are filtered first. If no annotation entry remains, the entire variant is deleted. https://github.com/vembrane/vembrane
This wrapper can be used in the following way:
rule vembrane_filter:
input:
vcf="in.vcf",
output:
vcf="filtered/out.vcf"
params:
expression="POS > 4000",
extra=""
log:
"logs/vembrane.log"
wrapper:
"0.73.0/bio/vembrane/filter"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vembrane=0.5.1
Authors¶
- Christopher Schröder
Code¶
"""Snakemake wrapper for vembrane"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroeder@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
shell(
"vembrane filter" # Tool and its subcommand
" {extra}" # Extra parameters
" {snakemake.params.expression:q}"
" {snakemake.input}" # Path to input file
" > {snakemake.output}" # Path to output file
" {log}" # Logging behaviour
)
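Since vembrane evaluates the expression as Python code, filters on other fields can be written the same way; a sketch filtering on read depth, assuming the input VCF defines INFO/DP:
rule vembrane_filter_depth:
    input:
        vcf="in.vcf"
    output:
        vcf="filtered/deep.vcf"
    params:
        # keep only records with depth above 10 (assumes INFO/DP exists)
        expression='INFO["DP"] > 10',
        extra=""
    log:
        "logs/vembrane_depth.log"
    wrapper:
        "0.73.0/bio/vembrane/filter"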
VEMBRANE TABLE¶
Vembrane table allows you to generate table-like text files from VCFs based on any INFO field, CHROM, POS, REF, ALT, QUAL, and the annotation field ANN. When filtering based on ANN, annotation entries are filtered first. https://github.com/vembrane/vembrane
This wrapper can be used in the following way:
rule vembrane_table:
input:
vcf="in.vcf",
output:
vcf="table/out.tsv"
params:
expression="CHROM, POS, ALT, REF",
extra=""
log:
"logs/vembrane.log"
wrapper:
"0.73.0/bio/vembrane/table"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vembrane=0.5.1
Authors¶
- Christopher Schröder
Code¶
"""Snakemake wrapper for vembrane"""
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "christopher.schroeder@tu-dortmund.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
shell(
"vembrane table" # Tool and its subcommand
" {extra}" # Extra parameters
" {snakemake.params.expression:q}"
" {snakemake.input}" # Path to input file
" > {snakemake.output}" # Path to output file
" {log}" # Logging behaviour
)
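Because the expression is evaluated as Python, derived columns are possible as well; a sketch that adds the Phred error probability next to QUAL (not taken from the wrapper's test case):
rule vembrane_table_qual:
    input:
        vcf="in.vcf"
    output:
        vcf="table/qual.tsv"
    params:
        # CHROM, POS, QUAL plus a derived error-probability column
        expression='CHROM, POS, QUAL, 10 ** (-QUAL / 10)',
        extra=""
    log:
        "logs/vembrane_table_qual.log"
    wrapper:
        "0.73.0/bio/vembrane/table"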
VEP¶
For vep, the following wrappers are available:
VEP ANNOTATE¶
Annotate variant calls with VEP.
This wrapper can be used in the following way:
rule annotate_variants:
input:
calls="variants.bcf", # .vcf, .vcf.gz or .bcf
cache="resources/vep/cache", # can be omitted if fasta and gff are specified
plugins="resources/vep/plugins",
# optionally add reference genome fasta
# fasta="genome.fasta",
# fai="genome.fasta.fai", # fasta index
# gff="annotation.gff",
# csi="annotation.gff.csi", # tabix index
output:
calls="variants.annotated.bcf", # .vcf, .vcf.gz or .bcf
stats="variants.html"
params:
# Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
# Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
plugins=["LoFtool"],
extra="--everything" # optional: extra arguments
log:
"logs/vep/annotate.log"
threads: 4
wrapper:
"0.73.0/bio/vep/annotate"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
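As the commented input lines above indicate, the cache can be replaced by a reference fasta and gff; a sketch of that variant, with file names mirroring the comments and otherwise hypothetical:
rule annotate_variants_nocache:
    input:
        calls="variants.bcf",
        plugins="resources/vep/plugins",
        fasta="genome.fasta",
        fai="genome.fasta.fai",
        gff="annotation.gff",
        csi="annotation.gff.csi"
    output:
        calls="variants.annotated.bcf",
        stats="variants.html"
    params:
        plugins=["LoFtool"],
        extra=""
    log:
        "logs/vep/annotate_nocache.log"
    threads: 4
    wrapper:
        "0.73.0/bio/vep/annotate"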
Software dependencies¶
ensembl-vep=102
bcftools=1.10
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
from pathlib import Path
from snakemake.shell import shell
def get_only_child_dir(path):
children = [child for child in path.iterdir() if child.is_dir()]
assert (
len(children) == 1
), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper"
return children[0]
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else ""
stats = snakemake.output.stats
cache = snakemake.input.get("cache", "")
plugins = snakemake.input.plugins
load_plugins = " ".join(map("--plugin {}".format, snakemake.params.plugins))
if snakemake.output.calls.endswith(".vcf.gz"):
fmt = "z"
elif snakemake.output.calls.endswith(".bcf"):
fmt = "b"
else:
fmt = "v"
fasta = snakemake.input.get("fasta", "")
if fasta:
fasta = "--fasta {}".format(fasta)
gff = snakemake.input.get("gff", "")
if gff:
gff = "--gff {}".format(gff)
if cache:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")
cache = (
"--offline --cache --dir_cache {cache} --cache_version {release} --species {species} --assembly {build}"
).format(cache=cache, release=release, build=build, species=species)
shell(
"(bcftools view {snakemake.input.calls} | "
"vep {extra} {fork} "
"--format vcf "
"--vcf "
"{cache} "
"{gff} "
"{fasta} "
"--dir_plugins {plugins} "
"{load_plugins} "
"--output_file STDOUT "
"--stats_file {stats} | "
"bcftools view -O{fmt} > {snakemake.output.calls}) {log}"
)
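The nested get_only_child_dir calls assume that the cache holds exactly one species directory containing exactly one <release>_<build> directory, as created by the bio/vep/cache wrapper; a sketch of how the metadata is derived (species, release, and build are hypothetical):
from pathlib import Path

# Expected layout: resources/vep/cache/<species>/<release>_<build>/
entry = Path("resources/vep/cache/homo_sapiens/102_GRCh38")
species = entry.parent.name  # "homo_sapiens"
release, build = entry.name.split("_")  # "102", "GRCh38"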
VEP DOWNLOAD CACHE¶
Download VEP cache for given species, build and release.
This wrapper can be used in the following way:
rule get_vep_cache:
output:
directory("resources/vep/cache")
params:
species="saccharomyces_cerevisiae",
build="R64-1-1",
release="98"
log:
"logs/vep/cache.log"
cache: True # save space and time with between-workflow caching (see docs)
wrapper:
"0.73.0/bio/vep/cache"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
ensembl-vep=101
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
from pathlib import Path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"vep_install --AUTO cf "
"--SPECIES {snakemake.params.species} "
"--ASSEMBLY {snakemake.params.build} "
"--VERSION {snakemake.params.release} "
"--CACHEDIR {snakemake.output} "
"--CONVERT "
"--NO_UPDATE "
"{extra} {log}"
)
VEP DOWNLOAD PLUGINS¶
Download VEP plugins.
This wrapper can be used in the following way:
rule download_vep_plugins:
output:
directory("resources/vep/plugins")
params:
release=100
wrapper:
"0.73.0/bio/vep/plugins"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
python=3
Authors¶
- Johannes Köster
Code¶
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import sys
from pathlib import Path
from urllib.request import urlretrieve
from zipfile import ZipFile
from tempfile import NamedTemporaryFile
if snakemake.log:
sys.stderr = open(snakemake.log[0], "w")
outdir = Path(snakemake.output[0])
outdir.mkdir()
with NamedTemporaryFile() as tmp:
urlretrieve(
"https://github.com/Ensembl/VEP_plugins/archive/release/{release}.zip".format(
release=snakemake.params.release
),
tmp.name,
)
with ZipFile(tmp.name) as f:
for member in f.infolist():
memberpath = Path(member.filename)
if len(memberpath.parts) == 1:
# skip root dir
continue
targetpath = outdir / memberpath.relative_to(memberpath.parts[0])
if member.is_dir():
targetpath.mkdir()
else:
with open(targetpath, "wb") as out:
out.write(f.read(member.filename))
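The extraction loop strips the archive's single root directory by re-rooting every member path at the output directory; a sketch of the path arithmetic (the archive member name is hypothetical):
from pathlib import Path

member = Path("VEP_plugins-release-100/LoFtool.pm")  # hypothetical zip member
outdir = Path("resources/vep/plugins")
target = outdir / member.relative_to(member.parts[0])
print(target)  # resources/vep/plugins/LoFtool.pm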
VG¶
For vg, the following wrappers are available:
VG CONSTRUCT¶
Construct variation graphs from a reference and variant calls.
This wrapper can be used in the following way:
rule construct:
input:
ref="c.fa",
vcfgz="c.vcf.gz"
output:
vg="graph/c.vg"
params:
"--node-max 10"
log:
"logs/vg/construct/c.log"
threads:
4
wrapper:
"0.73.0/bio/vg/construct"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2017, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False)
shell(
"(vg construct {snakemake.params} --reference {snakemake.input.ref}"
" --vcf {snakemake.input.vcfgz} --threads {snakemake.threads}"
" > {snakemake.output.vg}) {log}"
)
VG IDS¶
Manipulate the id space of input graphs. NOTE: use bio/vg/merge to create a joint id space across multiple graphs.
This wrapper can be used in the following way:
rule ids:
input:
vgs="c.vg"
output:
mod="graph/c_mod.vg"
log:
"logs/vg/ids/c.log"
wrapper:
"0.73.0/bio/vg/ids"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2017, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False)
shell(
"(vg ids {snakemake.params} {snakemake.input.vgs}"
" > {snakemake.output.mod}) {log}"
)
VG INDEX GCSA¶
Build GCSA index for variation graphs.
This wrapper can be used in the following way:
rule gcsa:
input:
vgs=["x.vg", "c.vg"]
output:
gcsa="index/wg.gcsa"
params:
"-Z 3000 -X 3"
log:
"logs/vg/index/gcsa/wg.log"
threads:
4
wrapper:
"0.73.0/bio/vg/index/gcsa"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2017, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell()
shell(
"(vg index -g {snakemake.output.gcsa} --threads {snakemake.threads}"
" {snakemake.params} {snakemake.input.vgs}) {log}"
)
VG INDEX XG¶
Create an xg index on variation graphs.
This wrapper can be used in the following way:
rule xg:
input:
vgs="x.vg"
output:
xg="index/x.xg"
log:
"logs/vg/index/xg/x.log"
threads:
4
wrapper:
"0.73.0/bio/vg/index/xg"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2017, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell()
shell(
"(vg index --xg-name {snakemake.output.xg} --threads {snakemake.threads}"
" {snakemake.params} {snakemake.input.vgs}) {log}"
)
VG KMERS¶
Generates kmers from both strands of variation graphs.
This wrapper can be used in the following way:
rule kmers:
input:
vgs="c.vg"
output:
kmers="kmers/c.kmers"
params:
"-gBk 16 -H 1000000000 -T 1000000001"
log:
"logs/vg/kmers/c.log"
threads:
4
wrapper:
"0.73.0/bio/vg/kmers"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2017, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False)
shell(
"(vg kmers {snakemake.params} --threads {snakemake.threads}"
" {snakemake.input.vgs} > {snakemake.output.kmers}) {log}"
)
VG MERGE¶
Generate a joint id space across the input graphs and merge them all.
This wrapper can be used in the following way:
rule merge:
input:
vgs=["c.vg", "x.vg"]
output:
merged="graph/wg.vg"
log:
"logs/vg/merge/wg.log"
wrapper:
"0.73.0/bio/vg/merge"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2017, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False)
shell(
"(vg ids --join {snakemake.input.vgs} &&"
" for VGFILE in {snakemake.input.vgs};"
" do cat $VGFILE >> {snakemake.output.merged};"
" done) {log}"
)
VG PRUNE¶
Prunes the complex regions of the graph for GCSA2 indexing.
This wrapper can be used in the following way:
rule prune:
input:
vg="c.vg"
output:
pruned="graph/c.pruned.vg"
params:
"-r"
log:
"logs/vg/prune/c.log"
threads:
4
wrapper:
"0.73.0/bio/vg/prune"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2017, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False)
shell(
"(vg prune --threads {snakemake.threads} {snakemake.params}"
" {snakemake.input.vg} > {snakemake.output.pruned}) {log}"
)
VG SIM¶
Samples sequences from the xg-indexed graph.
This wrapper can be used in the following way:
rule sim:
input:
xg="x.xg"
output:
reads="reads/x.seq"
params:
"--read-length 100 --num-reads 100 -f"
log:
"logs/vg/sim/x.log"
threads:
4
wrapper:
"0.73.0/bio/vg/sim"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
vg==1.27.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2018, Ali Ghaffaari"
__email__ = "ghaffari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell(stdout=False)
shell(
"(vg sim {snakemake.params} --xg-name {snakemake.input.xg}"
" --threads {snakemake.threads} > {snakemake.output.reads}) {log}"
)
WGSIM¶
Short read simulator.
Example¶
This wrapper can be used in the following way:
rule wgsim:
input:
ref="genome.fa"
output:
read1="reads/1.fq",
read2="reads/2.fq"
log:
"logs/wgsim/sim.log"
params:
"-X 0 -R 0 -r 0.1 -h"
wrapper:
"0.73.0/bio/wgsim"
Note that input, output and log file paths can be chosen freely. When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies¶
wgsim==1.0.0
Authors¶
- Ali Ghaffaari
Code¶
__author__ = "Ali Ghaffaari"
__copyright__ = "Copyright 2018, Ali Ghaffaari"
__email__ = "ali.ghaffaari@mpi-inf.mpg.de"
__license__ = "MIT"
from snakemake.shell import shell
log = snakemake.log_fmt_shell()
shell(
"(wgsim {snakemake.params} {snakemake.input.ref}"
" {snakemake.output.read1} {snakemake.output.read2}) {log}"
)
Meta-Wrappers¶
Meta-wrappers offer curated and tested combinations of wrappers that fulfil common tasks with popular tools in a best-practice way. To use them, simply copy-paste the offered snippets into your Snakemake workflow.
The menu on the left (expand by clicking (+) if necessary) lists all available meta-wrappers.
BWA_MAPPING¶
Map reads with bwa-mem and index the resulting BAM with samtools index (this meta-wrapper mainly serves as a simple test case for combining wrappers).
Example¶
This meta-wrapper can be used by integrating the following into your workflow:
rule bwa_mem:
input:
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
output:
"mapped/{sample}.bam"
log:
"logs/bwa_mem/{sample}.log"
params:
index="genome",
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sort="samtools", # Can be 'none', 'samtools' or 'picard'.
sort_order="coordinate", # Can be 'queryname' or 'coordinate'.
sort_extra="" # Extra args for samtools/picard.
threads: 8
wrapper:
"0.73.0/bio/bwa/mem"
rule samtools_index:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.bam.bai"
params:
"" # optional params string
wrapper:
"0.73.0/bio/samtools/index"
Note that input, output and log file paths can be chosen freely, as long as the dependencies between the rules remain as listed here. For additional parameters in each individual wrapper, please refer to their corresponding documentation (see links below).
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Used wrappers¶
The following individual wrappers are used in this meta-wrapper:
- BWA_MEM
- SAMTOOLS_INDEX
Please refer to each wrapper in the above list for additional configuration parameters and information about the executed code.
Authors¶
- Jan Forster
DADA2-PE¶
A subworkflow for processing paired-end sequences from metabarcoding projects in order to construct ASV tables using DADA2. The example is based on the data provided with the DADA2 R package. For more details, see the official website and the tutorial.
Example¶
This meta-wrapper can be used by integrating the following into your workflow:
# Make sure that you set the `truncLen=` option in the rule `dada2_filter_and_trim_pe` according
# to the results of the quality profile checks (after rule `dada2_quality_profile_pe` has finished on all samples).
# If in doubt, check https://benjjneb.github.io/dada2/tutorial.html#inspect-read-quality-profiles
rule all:
input:
# In a first run of this meta-wrapper, comment out all other inputs and only keep this one.
# Looking at the resulting plot, adjust the `truncLen` in rule `dada2_filter_trim_pe` and then
# rerun with all inputs uncommented.
expand(
"reports/dada2/quality-profile/{sample}-quality-profile.png",
sample=["a","b"]
),
"results/dada2/taxa.RDS"
rule dada2_quality_profile_pe:
input:
# FASTQ file without primer sequences
expand("trimmed/{{sample}}.{orientation}.fastq.gz",orientation=[1,2])
output:
"reports/dada2/quality-profile/{sample}-quality-profile.png"
log:
"logs/dada2/quality-profile/{sample}-quality-profile-pe.log"
wrapper:
"0.73.0/bio/dada2/quality-profile"
rule dada2_filter_trim_pe:
input:
# Paired-end files without primer sequences
fwd="trimmed/{sample}.1.fastq.gz",
rev="trimmed/{sample}.2.fastq.gz"
output:
filt="filtered-pe/{sample}.1.fastq.gz",
filt_rev="filtered-pe/{sample}.2.fastq.gz",
stats="reports/dada2/filter-trim-pe/{sample}.tsv"
params:
# Set the maximum expected errors tolerated in filtered reads
maxEE=1,
# Set the number of kept bases in forward and reverse reads
truncLen=[240,200]
log:
"logs/dada2/filter-trim-pe/{sample}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/filter-trim"
rule dada2_learn_errors:
input:
# Quality filtered and trimmed forward FASTQ files (potentially compressed)
expand("filtered-pe/{sample}.{{orientation}}.fastq.gz", sample=["a","b"])
output:
err="results/dada2/model_{orientation}.RDS",# save the error model
plot="reports/dada2/errors_{orientation}.png",# plot observed and estimated rates
params:
randomize=True
log:
"logs/dada2/learn-errors/learn-errors_{orientation}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/learn-errors"
rule dada2_dereplicate_fastq:
input:
# Quality filtered FASTQ file
"filtered-pe/{fastq}.fastq.gz"
output:
# Dereplicated sequences stored as `derep-class` object in a RDS file
"uniques/{fastq}.RDS"
log:
"logs/dada2/dereplicate-fastq/{fastq}.log"
wrapper:
"0.73.0/bio/dada2/dereplicate-fastq"
rule dada2_sample_inference:
input:
# Dereplicated (aka unique) sequences of the sample
derep="uniques/{sample}.{orientation}.RDS",
err="results/dada2/model_{orientation}.RDS" # Error model
output:
"denoised/{sample}.{orientation}.RDS" # Inferred sample composition
log:
"logs/dada2/sample-inference/{sample}.{orientation}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/sample-inference"
rule dada2_merge_pairs:
input:
dadaF="denoised/{sample}.1.RDS",# Inferred composition
dadaR="denoised/{sample}.2.RDS",
derepF="uniques/{sample}.1.RDS",# Dereplicated sequences
derepR="uniques/{sample}.2.RDS"
output:
"merged/{sample}.RDS"
log:
"logs/dada2/merge-pairs/{sample}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/merge-pairs"
rule dada2_make_table_pe:
input:
# Merged composition
expand("merged/{sample}.RDS", sample=['a','b'])
output:
"results/dada2/seqTab-pe.RDS"
params:
names=['a','b'], # Sample names instead of paths
orderBy="nsamples" # Change the ordering of samples
log:
"logs/dada2/make-table/make-table-pe.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/make-table"
rule dada2_remove_chimeras:
input:
"results/dada2/seqTab-pe.RDS" # Sequence table
output:
"results/dada2/seqTab.nochimeras.RDS" # Chimera-free sequence table
log:
"logs/dada2/remove-chimeras/remove-chimeras.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/remove-chimeras"
rule dada2_collapse_nomismatch:
input:
"results/dada2/seqTab.nochimeras.RDS" # Chimera-free sequence table
output:
"results/dada2/seqTab.collapsed.RDS"
log:
"logs/dada2/collapse-nomismatch/collapse-nomismatch.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/collapse-nomismatch"
rule dada2_assign_taxonomy:
input:
seqs="results/dada2/seqTab.collapsed.RDS", # Chimera-free sequence table
refFasta="resources/example_train_set.fa.gz" # Reference FASTA for taxonomy
output:
"results/dada2/taxa.RDS" # Taxonomic assignments
log:
"logs/dada2/assign-taxonomy/assign-taxonomy.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/assign-taxonomy"
Note that input, output and log file paths can be chosen freely, as long as the dependencies between the rules remain as listed here. For additional parameters in each individual wrapper, please refer to their corresponding documentation (see links below).
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Used wrappers¶
The following individual wrappers are used in this meta-wrapper:
- DADA2_QUALITY_PROFILES
- DADA2_FILTER_TRIM
- DADA2_LEARN_ERRORS
- DADA2_DEREPLICATE_FASTQ
- DADA2_SAMPLE_INFERENCE
- DADA2_MERGE_PAIRS
- DADA2_MAKE_TABLE
- DADA2_REMOVE_CHIMERAS
- DADA2_COLLAPSE_NOMISMATCH
- DADA2_ASSIGN_TAXONOMY
Please refer to each wrapper in the above list for additional configuration parameters and information about the executed code.
Authors¶
- Charlie Pauvert
DADA2-SE¶
A subworkflow for processing single-end sequences from metabarcoding projects in order to construct ASV tables using DADA2. The example is based on the data provided in the DADA2 R package. For more details, see the official website. While the tutorial is tailored to paired-end sequences, it contains useful information on the functions that are also used for single-end processing.
Example¶
This meta-wrapper can be used by integrating the following into your workflow:
# Make sure that you set the `truncLen=` option in the rule `dada2_filter_and_trim_se` according
# to the results of the quality profile checks (after rule `dada2_quality_profile_se` has finished on all samples).
# If in doubt, check https://benjjneb.github.io/dada2/tutorial.html#inspect-read-quality-profiles
rule all:
input:
# In a first run of this meta-wrapper, comment out all other inputs and only keep this one.
# Looking at the resulting plot, adjust the `truncLen` in rule `dada2_filter_trim_se` and then
# rerun with all inputs uncommented.
expand(
"reports/dada2/quality-profile/{sample}.{orientation}-quality-profile.png",
sample=["a","b"], orientation=1
),
"results/dada2/taxa.RDS"
rule dada2_quality_profile_se:
input:
# FASTQ file without primer sequences
"trimmed/{sample}.{orientation}.fastq.gz"
output:
"reports/dada2/quality-profile/{sample}.{orientation}-quality-profile.png"
log:
"logs/dada2/quality-profile/{sample}.{orientation}-quality-profile-se.log"
wrapper:
"0.73.0/bio/dada2/quality-profile"
rule dada2_filter_trim_se:
input:
# Single-end files without primer sequences
fwd="trimmed/{sample}.1.fastq.gz"
output:
filt="filtered-se/{sample}.1.fastq.gz",
stats="reports/dada2/filter-trim-se/{sample}.tsv"
params:
# Set the maximum expected errors tolerated in filtered reads
maxEE=1,
# Set the number of kept bases
truncLen=240
log:
"logs/dada2/filter-trim-se/{sample}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/filter-trim"
rule dada2_learn_errors:
input:
# Quality filtered and trimmed forward FASTQ files (potentially compressed)
expand("filtered-se/{sample}.{{orientation}}.fastq.gz", sample=["a","b"])
output:
err="results/dada2/model_{orientation}.RDS",# save the error model
plot="reports/dada2/errors_{orientation}.png",# plot observed and estimated rates
params:
randomize=True
log:
"logs/dada2/learn-errors/learn-errors_{orientation}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/learn-errors"
rule dada2_dereplicate_fastq:
input:
# Quality filtered FASTQ file
"filtered-se/{fastq}.fastq.gz"
output:
# Dereplicated sequences stored as `derep-class` object in a RDS file
"uniques/{fastq}.RDS"
log:
"logs/dada2/dereplicate-fastq/{fastq}.log"
wrapper:
"0.73.0/bio/dada2/dereplicate-fastq"
rule dada2_sample_inference:
input:
# Dereplicated (aka unique) sequences of the sample
derep="uniques/{sample}.{orientation}.RDS",
err="results/dada2/model_{orientation}.RDS" # Error model
output:
"denoised/{sample}.{orientation}.RDS" # Inferred sample composition
log:
"logs/dada2/sample-inference/{sample}.{orientation}.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/sample-inference"
rule dada2_make_table_se:
input:
# Inferred composition
expand("denoised/{sample}.1.RDS", sample=['a','b'])
output:
"results/dada2/seqTab-se.RDS"
params:
names=['a','b'] # Sample names instead of paths
log:
"logs/dada2/make-table/make-table-se.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/make-table"
rule dada2_remove_chimeras:
input:
"results/dada2/seqTab-se.RDS" # Sequence table
output:
"results/dada2/seqTab.nochimeras.RDS" # Chimera-free sequence table
log:
"logs/dada2/remove-chimeras/remove-chimeras.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/remove-chimeras"
rule dada2_collapse_nomismatch:
input:
"results/dada2/seqTab.nochimeras.RDS" # Chimera-free sequence table
output:
"results/dada2/seqTab.collapsed.RDS"
log:
"logs/dada2/collapse-nomismatch/collapse-nomismatch.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/collapse-nomismatch"
rule dada2_assign_taxonomy:
input:
seqs="results/dada2/seqTab.collapsed.RDS", # Chimera-free sequence table
refFasta="resources/example_train_set.fa.gz" # Reference FASTA for taxonomy
output:
"results/dada2/taxa.RDS" # Taxonomic assignments
log:
"logs/dada2/assign-taxonomy/assign-taxonomy.log"
threads: 1 # set desired number of threads here
wrapper:
"0.73.0/bio/dada2/assign-taxonomy"
Note that input, output and log file paths can be chosen freely, as long as the dependencies between the rules remain as listed here. For additional parameters in each individual wrapper, please refer to their corresponding documentation (see links below).
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Used wrappers¶
The following individual wrappers are used in this meta-wrapper:
- DADA2_QUALITY_PROFILES
- DADA2_FILTER_TRIM
- DADA2_LEARN_ERRORS
- DADA2_DEREPLICATE_FASTQ
- DADA2_SAMPLE_INFERENCE
- DADA2_MAKE_TABLE
- DADA2_REMOVE_CHIMERAS
- DADA2_COLLAPSE_NOMISMATCH
- DADA2_ASSIGN_TAXONOMY
Please refer to each wrapper in the above list for additional configuration parameters and information about the executed code.
Authors¶
- Charlie Pauvert
STAR-ARRIBA¶
A subworkflow for fusion detection from RNA-seq data with arriba. The fusion calling is based on splice-aware, chimeric alignments done with STAR. STAR is used with specific parameters to ensure optimal functionality of the arriba fusion detection; for details, see the documentation.
Example¶
This meta-wrapper can be used by integrating the following into your workflow:
rule star_index:
input:
fasta="resources/genome.fasta",
annotation="resources/genome.gtf"
output:
directory("resources/star_genome")
threads: 4
params:
extra="--sjdbGTFfile resources/genome.gtf --sjdbOverhang 100"
log:
"logs/star_index_genome.log"
cache: True
wrapper:
"0.73.0/bio/star/index"
rule star_align:
input:
# use a list for multiple fastq files for one sample
# usually technical replicates across lanes/flowcells
fq1="reads/{sample}_R1.1.fastq",
fq2="reads/{sample}_R2.1.fastq", #optional
index="resources/star_genome"
output:
# see STAR manual for additional output files
"star/{sample}/Aligned.out.bam",
"star/{sample}/ReadsPerGene.out.tab"
log:
"logs/star/{sample}.log"
params:
# path to STAR reference genome index
index="resources/star_genome",
# specific parameters to work well with arriba
extra="--quantMode GeneCounts --sjdbGTFfile resources/genome.gtf"
" --outSAMtype BAM Unsorted --chimSegmentMin 10 --chimOutType WithinBAM SoftClip"
" --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0"
" --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3"
threads: 12
wrapper:
"0.73.0/bio/star/align"
rule arriba:
input:
bam="star/{sample}/Aligned.out.bam",
genome="resources/genome.fasta",
annotation="resources/genome.gtf"
output:
fusions="results/arriba/{sample}.fusions.tsv",
discarded="results/arriba/{sample}.fusions.discarded.tsv"
params:
# A tsv containing identified artifacts, such as read-through fusions of neighbouring genes, see https://arriba.readthedocs.io/en/latest/input-files/#blacklist
blacklist="arriba_blacklist.tsv",
extra="-T -P -i 1,2" # -i describes the wanted contigs, remove if you want to use all hg38 chromosomes
log:
"logs/arriba/{sample}.log"
threads: 1
wrapper:
"0.73.0/bio/arriba"
Note that input, output and log file paths can be chosen freely, as long as the dependencies between the rules remain as listed here. For additional parameters in each individual wrapper, please refer to their corresponding documentation (see links below).
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Used wrappers¶
The following individual wrappers are used in this meta-wrapper:
- STAR_INDEX
- STAR_ALIGN
- ARRIBA
Please refer to each wrapper in the above list for additional configuration parameters and information about the executed code.
Authors¶
- Jan Forster