.. _`bio/gatk/markduplicatesspark`:

GATK MARKDUPLICATESSPARK
========================


.. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/gatk/markduplicatesspark?label=version%20update%20pull%20requests
   :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/gatk/markduplicatesspark

Spark implementation of Picard MarkDuplicates. It can run in parallel on multiple cores of a local machine or on multiple machines in a Spark cluster, while still matching the output of the non-Spark Picard version of the tool. Because the tool must hold all read names in memory while it groups read information, machine configuration and the starting sort order both affect performance.


**URL**: https://gatk.broadinstitute.org/hc/en-us/articles/9570319741083-MarkDuplicatesSpark

Example
-------

This wrapper can be used in the following way:

.. code-block:: python

    rule mark_duplicates_spark:
        input:
            "mapped/{sample}.bam",
        output:
            bam="dedup/{sample}.bam",
            metrics="dedup/{sample}.metrics.txt",
        log:
            "logs/dedup/{sample}.log",
        params:
            extra="--remove-sequencing-duplicates",  # optional
            java_opts="",  # optional
            #spark_runner="",  # optional, local by default
            #spark_master="",  # optional
            #spark_extra="", # optional
        resources:
            # Spark needs at least 471859200 bytes of memory, i.e. 589824000 bytes
            # once the default JVM overhead of 20% is accounted for. We round up to 650 MB.
            mem_mb=lambda wildcards, input: max([input.size_mb * 0.25, 650]),
        threads: 8
        wrapper:
            "v3.0.1/bio/gatk/markduplicatesspark"

Note that input, output and log file paths can be chosen freely.
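
The memory floor quoted in the ``resources`` comment can be checked with a little arithmetic. The sketch below is not part of the wrapper: the helper names are hypothetical, and it assumes that the 20% JVM overhead is subtracted from the memory granted to the rule, which is what the figures in the comment imply.

.. code-block:: python

    SPARK_MIN_BYTES = 471_859_200  # minimum quoted in the rule comment
    JVM_OVERHEAD_PERCENT = 20  # default overhead quoted in the rule comment


    def min_total_bytes(heap_bytes=SPARK_MIN_BYTES, overhead_percent=JVM_OVERHEAD_PERCENT):
        """Total memory needed so that heap_bytes remain after the overhead (hypothetical helper)."""
        # Integer ceiling division avoids floating-point rounding.
        return -(-heap_bytes * 100 // (100 - overhead_percent))


    def mem_mb(input_size_mb, floor_mb=650, fraction=0.25):
        """Same expression as the rule's mem_mb lambda, written as a plain function."""
        return max([input_size_mb * fraction, floor_mb])


    print(min_total_bytes())  # 589824000
    print(mem_mb(1000))  # small input: the 650 MB floor applies
    print(mem_mb(8000))  # large input: a quarter of the input size applies

With these assumptions, the 650 MB floor sits comfortably above the computed minimum, and it only matters for inputs smaller than 2600 MB.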

When running with

.. code-block:: bash

    snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.


Notes
-----

* The `java_opts` param allows for additional arguments to be passed to the Java virtual machine (JVM), e.g. "-XX:ParallelGCThreads=10" (not for `-Xmx` or `-Djava.io.tmpdir`, since they are handled automatically).
* The `extra` param allows for additional program arguments.
* The `spark_runner` param ("LOCAL", "SPARK" or "GCS") sets the Spark runner. Set it to "LOCAL", or leave it unset, to run on the local machine.
* The `spark_master` param sets the URL of the Spark master to which the job is submitted. Set it to "local[number_of_cores]" for local execution with a fixed number of cores, or leave it unset for local execution with the number of cores taken from the rule's `threads`.
* The `spark_extra` param allows for additional spark arguments.
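
The defaults described above can be illustrated with a plain dictionary standing in for the rule's ``params``. This is a minimal sketch rather than the wrapper itself: ``resolve_spark_params`` is a hypothetical helper, and only the fallback values are taken from the wrapper code.

.. code-block:: python

    def resolve_spark_params(params, threads):
        """Apply the wrapper's fallbacks to a dict of rule params (hypothetical helper)."""
        return {
            # An unset runner means local execution.
            "spark_runner": params.get("spark_runner", "LOCAL"),
            # An unset master means local execution on as many cores as the rule has threads.
            "spark_master": params.get("spark_master", f"local[{threads}]"),
            "spark_extra": params.get("spark_extra", ""),
        }


    # No Spark params set: everything falls back to local execution on 8 cores.
    print(resolve_spark_params({}, threads=8))

    # An explicit master overrides the thread-derived default.
    print(resolve_spark_params({"spark_master": "local[4]"}, threads=8))

Setting `spark_master` explicitly decouples the number of Spark cores from `threads`, so the two values need to be kept consistent by hand.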


Software dependencies
---------------------

* ``gatk4=4.4.0.0``
* ``snakemake-wrapper-utils=0.6.2``

Input/Output
------------

**Input:**

* bam file
* reference file

**Output:**

* bam file with marked or removed duplicates


Authors
-------

* Filipe G. Vieira


Code
----

.. code-block:: python

    __author__ = "Filipe G. Vieira"
    __copyright__ = "Copyright 2021, Filipe G. Vieira"
    __license__ = "MIT"

    import tempfile
    from snakemake.shell import shell
    from snakemake_wrapper_utils.java import get_java_opts

    extra = snakemake.params.get("extra", "")
    spark_runner = snakemake.params.get("spark_runner", "LOCAL")
    spark_master = snakemake.params.get(
        "spark_master", "local[{}]".format(snakemake.threads)
    )
    spark_extra = snakemake.params.get("spark_extra", "")
    java_opts = get_java_opts(snakemake)

    metrics = snakemake.output.get("metrics", "")
    if metrics:
        metrics = f"--metrics-file {metrics}"

    log = snakemake.log_fmt_shell(stdout=True, stderr=True)

    with tempfile.TemporaryDirectory() as tmpdir:
        shell(
            "gatk --java-options '{java_opts}' MarkDuplicatesSpark"
            " --input {snakemake.input}"
            " {extra}"
            " --tmp-dir {tmpdir}"
            " --output {snakemake.output.bam}"
            " {metrics}"
            " -- --spark-runner {spark_runner} --spark-master {spark_master} {spark_extra}"
            " {log}"
        )


.. |nl| raw:: html

   <br>