.. _`bio/picard/markduplicates`:

PICARD MARKDUPLICATES
=====================


.. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/picard/markduplicates?label=version%20update%20pull%20requests
   :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/picard/markduplicates

Mark PCR and optical duplicates with picard tools.


**URL**: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates

Example
-------

This wrapper can be used in the following way:

.. code-block:: python

    rule markduplicates_bam:
        input:
            bams="mapped/{sample}.bam",
        # optional to specify a list of BAMs; this has the same effect
        # of marking duplicates on separate read groups for a sample
        # and then merging
        output:
            bam="dedup_bam/{sample}.bam",
            metrics="dedup_bam/{sample}.metrics.txt",
        log:
            "logs/dedup_bam/{sample}.log",
        params:
            extra="--REMOVE_DUPLICATES true",
        # optional specification of memory usage of the JVM that snakemake will respect with global
        # resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
        # and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
        # https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
        resources:
            mem_mb=1024,
        wrapper:
            "v3.0.4/bio/picard/markduplicates"


    use rule markduplicates_bam as markduplicateswithmatecigar_bam with:
        output:
            bam="dedup_bam/{sample}.matecigar.bam",
            idx="dedup_bam/{sample}.mcigar.bai",
            metrics="dedup_bam/{sample}.matecigar.metrics.txt",
        log:
            "logs/dedup_bam/{sample}.matecigar.log",
        params:
            withmatecigar=True,
            extra="--REMOVE_DUPLICATES true",


    use rule markduplicates_bam as markduplicates_sam with:
        output:
            bam="dedup_sam/{sample}.sam",
            metrics="dedup_sam/{sample}.metrics.txt",
        log:
            "logs/dedup_sam/{sample}.log",
        params:
            extra="--REMOVE_DUPLICATES true",


    use rule markduplicates_bam as markduplicates_cram with:
        input:
            bams="mapped/{sample}.bam",
            ref="ref/genome.fasta",
        output:
            bam="dedup_cram/{sample}.cram",
            idx="dedup_cram/{sample}.cram.crai",
            metrics="dedup_cram/{sample}.metrics.txt",
        log:
            "logs/dedup_cram/{sample}.log",
        params:
            extra="--REMOVE_DUPLICATES true",
            embed_ref=True,  # set true if the fasta reference should be embedded into the cram
            withmatecigar=False,

Note that input, output and log file paths can be chosen freely.

When running with

.. code-block:: bash

    snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.


Notes
-----

* `--TMP_DIR` is automatically set by `resources.tmpdir`


Software dependencies
---------------------

* ``picard=3.1.1``
* ``samtools=1.18``
* ``snakemake-wrapper-utils=0.6.2``

Input/Output
------------
**Input:**

* bam/cram file(s)

**Output:**

* bam/cram file with marked or removed duplicates


Params
------

* ``java_opts``: allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).

* ``extra``: allows for additional program arguments.

* ``embed_ref``: allows to embed the fasta reference into the cram

* ``withmatecigar``: allows to run `MarkDuplicatesWithMateCigar <https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar>`_ instead.


Authors
-------

* Johannes Köster
* Christopher Schröder
* Filipe G. Vieira


Code
----

.. code-block:: python

    __author__ = "Johannes Köster, Christopher Schröder"
    __copyright__ = "Copyright 2016, Johannes Köster"
    __email__ = "koester@jimmy.harvard.edu"
    __license__ = "MIT"


    import tempfile
    from pathlib import Path
    from snakemake.shell import shell
    from snakemake_wrapper_utils.java import get_java_opts
    from snakemake_wrapper_utils.samtools import get_samtools_opts, infer_out_format


    log = snakemake.log_fmt_shell()
    extra = snakemake.params.get("extra", "")
    # the --SORTING_COLLECTION_SIZE_RATIO default of 0.25 might
    # indicate a JVM memory overhead of that proportion
    java_opts = get_java_opts(snakemake, java_mem_overhead_factor=0.3)
    samtools_opts = get_samtools_opts(snakemake)


    tool = "MarkDuplicates"
    if snakemake.params.get("withmatecigar", False):
        tool = "MarkDuplicatesWithMateCigar"


    bams = snakemake.input.bams
    if isinstance(bams, str):
        bams = [bams]
    bams = list(map("--INPUT {}".format, bams))


    output = snakemake.output.bam
    output_fmt = infer_out_format(output)
    convert = ""
    if output_fmt == "CRAM":
        output = "/dev/stdout"

        # NOTE: output format inference should be done by snakemake-wrapper-utils. Keeping it here for backwards compatibility.
        if snakemake.params.get("embed_ref", False):
            samtools_opts += ",embed_ref"

        convert = f" | samtools view {samtools_opts}"
    elif output_fmt == "BAM" and snakemake.output.get("idx"):
        extra += " --CREATE_INDEX"


    with tempfile.TemporaryDirectory() as tmpdir:
        shell(
            "(picard {tool}"  # Tool and its subcommand
            " {java_opts}"  # Automatic java option
            " {extra}"  # User defined parmeters
            " {bams}"  # Input bam(s)
            " --TMP_DIR {tmpdir}"
            " --OUTPUT {output}"  # Output bam
            " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
            " {convert}) {log}"  # Logging
        )


    output_prefix = Path(snakemake.output.bam).with_suffix("")
    if snakemake.output.get("idx"):
        if output_fmt == "BAM" and snakemake.output.idx != str(output_prefix) + ".bai":
            shell("mv {output_prefix}.bai {snakemake.output.idx}")


.. |nl| raw:: html

   <br>