PICARD MARKDUPLICATES

Mark PCR and optical duplicates with picard tools.

URL: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates

Example

This wrapper can be used in the following way:

rule markduplicates_bam:
    input:
        bams="mapped/{sample}.bam",
    # optional to specify a list of BAMs; this has the same effect
    # of marking duplicates on separate read groups for a sample
    # and then merging
    output:
        bam="dedup_bam/{sample}.bam",
        metrics="dedup_bam/{sample}.metrics.txt",
    log:
        "logs/dedup_bam/{sample}.log",
    params:
        extra="--REMOVE_DUPLICATES true",
    # optional specification of memory usage of the JVM that snakemake will respect with global
    # resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
    # and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
    # https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
    resources:
        mem_mb=1024,
    wrapper:
        "v3.8.0-49-g6f33607/bio/picard/markduplicates"


use rule markduplicates_bam as markduplicateswithmatecigar_bam with:
    output:
        bam="dedup_bam/{sample}.matecigar.bam",
        idx="dedup_bam/{sample}.mcigar.bai",
        metrics="dedup_bam/{sample}.matecigar.metrics.txt",
    log:
        "logs/dedup_bam/{sample}.matecigar.log",
    params:
        withmatecigar=True,
        extra="--REMOVE_DUPLICATES true",


use rule markduplicates_bam as markduplicates_sam with:
    output:
        bam="dedup_sam/{sample}.sam",
        metrics="dedup_sam/{sample}.metrics.txt",
    log:
        "logs/dedup_sam/{sample}.log",
    params:
        extra="--REMOVE_DUPLICATES true",


use rule markduplicates_bam as markduplicates_cram with:
    input:
        bams="mapped/{sample}.bam",
        ref="ref/genome.fasta",
    output:
        bam="dedup_cram/{sample}.cram",
        idx="dedup_cram/{sample}.cram.crai",
        metrics="dedup_cram/{sample}.metrics.txt",
    log:
        "logs/dedup_cram/{sample}.log",
    params:
        extra="--REMOVE_DUPLICATES true",
        embed_ref=True,  # set true if the fasta reference should be embedded into the cram
        withmatecigar=False,

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

–TMP_DIR is automatically set by resources.tmpdir

Software dependencies

picard=3.1.1
samtools=1.19.2
snakemake-wrapper-utils=0.6.2

Input/Output

Input:

bam/cram file(s)

Output:

bam/cram file with marked or removed duplicates

Params

java_opts: allows for additional arguments to be passed to the java compiler, e.g. “-XX:ParallelGCThreads=10” (not for -XmX or -Djava.io.tmpdir, since they are handled automatically).
extra: allows for additional program arguments.
embed_ref: allows to embed the fasta reference into the cram
withmatecigar: allows to run MarkDuplicatesWithMateCigar instead.

Authors

Johannes Köster
Christopher Schröder
Filipe G. Vieira

Code

__author__ = "Johannes Köster, Christopher Schröder"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import tempfile
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts, infer_out_format


log = snakemake.log_fmt_shell()
extra = snakemake.params.get("extra", "")
# the --SORTING_COLLECTION_SIZE_RATIO default of 0.25 might
# indicate a JVM memory overhead of that proportion
java_opts = get_java_opts(snakemake, java_mem_overhead_factor=0.3)
samtools_opts = get_samtools_opts(snakemake)


tool = "MarkDuplicates"
if snakemake.params.get("withmatecigar", False):
    tool = "MarkDuplicatesWithMateCigar"


bams = snakemake.input.bams
if isinstance(bams, str):
    bams = [bams]
bams = list(map("--INPUT {}".format, bams))


output = snakemake.output.bam
output_fmt = infer_out_format(output)
convert = ""
if output_fmt == "CRAM":
    output = "/dev/stdout"

    # NOTE: output format inference should be done by snakemake-wrapper-utils. Keeping it here for backwards compatibility.
    if snakemake.params.get("embed_ref", False):
        samtools_opts += ",embed_ref"

    convert = f" | samtools view {samtools_opts}"
elif output_fmt == "BAM" and snakemake.output.get("idx"):
    extra += " --CREATE_INDEX"


with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(picard {tool}"  # Tool and its subcommand
        " {java_opts}"  # Automatic java option
        " {extra}"  # User defined parmeters
        " {bams}"  # Input bam(s)
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {output}"  # Output bam
        " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
        " {convert}) {log}"  # Logging
    )


output_prefix = Path(snakemake.output.bam).with_suffix("")
if snakemake.output.get("idx"):
    if output_fmt == "BAM" and snakemake.output.idx != str(output_prefix) + ".bai":
        shell("mv {output_prefix}.bai {snakemake.output.idx}")