.. _`bio/picard/markduplicates`: PICARD MARKDUPLICATES ===================== .. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/picard/markduplicates?label=version%20update%20pull%20requests :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/picard/markduplicates Mark PCR and optical duplicates with picard tools. **URL**: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates Example ------- This wrapper can be used in the following way: .. code-block:: python rule markduplicates_bam: input: bams="mapped/{sample}.bam", # optional to specify a list of BAMs; this has the same effect # of marking duplicates on separate read groups for a sample # and then merging output: bam="dedup_bam/{sample}.bam", metrics="dedup_bam/{sample}.metrics.txt", log: "logs/dedup_bam/{sample}.log", params: extra="--REMOVE_DUPLICATES true", # optional specification of memory usage of the JVM that snakemake will respect with global # resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources) # and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`: # https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties resources: mem_mb=1024, wrapper: "v3.0.4/bio/picard/markduplicates" use rule markduplicates_bam as markduplicateswithmatecigar_bam with: output: bam="dedup_bam/{sample}.matecigar.bam", idx="dedup_bam/{sample}.mcigar.bai", metrics="dedup_bam/{sample}.matecigar.metrics.txt", log: "logs/dedup_bam/{sample}.matecigar.log", params: withmatecigar=True, extra="--REMOVE_DUPLICATES true", use rule markduplicates_bam as markduplicates_sam with: output: bam="dedup_sam/{sample}.sam", metrics="dedup_sam/{sample}.metrics.txt", log: "logs/dedup_sam/{sample}.log", params: extra="--REMOVE_DUPLICATES true", use rule markduplicates_bam as markduplicates_cram with: input: bams="mapped/{sample}.bam", ref="ref/genome.fasta", output: bam="dedup_cram/{sample}.cram", idx="dedup_cram/{sample}.cram.crai", metrics="dedup_cram/{sample}.metrics.txt", log: "logs/dedup_cram/{sample}.log", params: extra="--REMOVE_DUPLICATES true", embed_ref=True, # set true if the fasta reference should be embedded into the cram withmatecigar=False, Note that input, output and log file paths can be chosen freely. When running with .. code-block:: bash snakemake --use-conda the software dependencies will be automatically deployed into an isolated environment before execution. Notes ----- * `--TMP_DIR` is automatically set by `resources.tmpdir` Software dependencies --------------------- * ``picard=3.1.1`` * ``samtools=1.18`` * ``snakemake-wrapper-utils=0.6.2`` Input/Output ------------ **Input:** * bam/cram file(s) **Output:** * bam/cram file with marked or removed duplicates Params ------ * ``java_opts``: allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically). * ``extra``: allows for additional program arguments. * ``embed_ref``: allows to embed the fasta reference into the cram * ``withmatecigar``: allows to run `MarkDuplicatesWithMateCigar `_ instead. Authors ------- * Johannes Köster * Christopher Schröder * Filipe G. Vieira Code ---- .. code-block:: python __author__ = "Johannes Köster, Christopher Schröder" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" import tempfile from pathlib import Path from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts from snakemake_wrapper_utils.samtools import get_samtools_opts, infer_out_format log = snakemake.log_fmt_shell() extra = snakemake.params.get("extra", "") # the --SORTING_COLLECTION_SIZE_RATIO default of 0.25 might # indicate a JVM memory overhead of that proportion java_opts = get_java_opts(snakemake, java_mem_overhead_factor=0.3) samtools_opts = get_samtools_opts(snakemake) tool = "MarkDuplicates" if snakemake.params.get("withmatecigar", False): tool = "MarkDuplicatesWithMateCigar" bams = snakemake.input.bams if isinstance(bams, str): bams = [bams] bams = list(map("--INPUT {}".format, bams)) output = snakemake.output.bam output_fmt = infer_out_format(output) convert = "" if output_fmt == "CRAM": output = "/dev/stdout" # NOTE: output format inference should be done by snakemake-wrapper-utils. Keeping it here for backwards compatibility. if snakemake.params.get("embed_ref", False): samtools_opts += ",embed_ref" convert = f" | samtools view {samtools_opts}" elif output_fmt == "BAM" and snakemake.output.get("idx"): extra += " --CREATE_INDEX" with tempfile.TemporaryDirectory() as tmpdir: shell( "(picard {tool}" # Tool and its subcommand " {java_opts}" # Automatic java option " {extra}" # User defined parmeters " {bams}" # Input bam(s) " --TMP_DIR {tmpdir}" " --OUTPUT {output}" # Output bam " --METRICS_FILE {snakemake.output.metrics}" # Output metrics " {convert}) {log}" # Logging ) output_prefix = Path(snakemake.output.bam).with_suffix("") if snakemake.output.get("idx"): if output_fmt == "BAM" and snakemake.output.idx != str(output_prefix) + ".bai": shell("mv {output_prefix}.bai {snakemake.output.idx}") .. |nl| raw:: html