MAPDAMAGE2

https://img.shields.io/badge/wrapper_version-v7.3.0-10785b https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/mapdamage2?label=version%20update%20pull%20requests&color=1cb481

mapDamage2 is a computational framework written in Python and R, which tracks and quantifies DNA damage patterns among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms.

URL: https://ginolhac.github.io/mapDamage/

Example

This wrapper can be used in the following way:

rule mapdamage2_plot:
    input:
        bam="mapped/{sample}.bam",
        ref="genome.fasta",
    output:
        GtoA3p="results/plot/{sample}.3pGtoA_freq.txt",
        CtoT5p="results/plot/{sample}.5pCtoT_freq.txt",
        dnacomp="results/plot/{sample}.dnacomp.txt",
        lg_dist="results/plot/{sample}.lgdistribution.txt",
        misinc="results/plot/{sample}.misincorporation.txt",
        plot_misinc="results/plot/{sample}.Fragmisincorporation_plot.pdf",
        plot_len="results/plot/{sample}.Length_plot.pdf",
        log="results/plot/{sample}.log",
    params:
        extra="-m 20",
    log:
        "logs/plot/{sample}.log",
    threads: 1
    wrapper:
        "v7.3.0/bio/mapdamage2"


rule mapdamage2_stats:
    input:
        GtoA3p=rules.mapdamage2_plot.output.GtoA3p,
        CtoT5p=rules.mapdamage2_plot.output.CtoT5p,
        misinc=rules.mapdamage2_plot.output.misinc,
        ref="genome.fasta",
    output:
        stats_ref="results/stats/{sample}.dnacomp_genome.csv",
        stats_prob="results/stats/{sample}.Stats_out_MCMC_correct_prob.csv",
        stats_hist="results/stats/{sample}.Stats_out_MCMC_hist.pdf",
        stats_iter="results/stats/{sample}.Stats_out_MCMC_iter.csv",
        stats_summ="results/stats/{sample}.Stats_out_MCMC_iter_summ_stat.csv",
        stats_plot_freq="results/stats/{sample}.Stats_out_MCMC_post_pred.pdf",
        stats_plot_trace="results/stats/{sample}.Stats_out_MCMC_trace.pdf",
        log="results/stats/{sample}.log",
    params:
        extra="--burn 10 --iter 10",
    log:
        "logs/stats/{sample}.log",
    threads: 1
    wrapper:
        "v7.3.0/bio/mapdamage2"


rule mapdamage2_rescale:
    input:
        bam="mapped/{sample}.bam",
        stats_prob=rules.mapdamage2_stats.output.stats_prob,
        ref="genome.fasta",
    output:
        bam="results/rescale/{sample}.bam",
        log="results/rescale/{sample}.log",
    params:
        extra="",
    log:
        "logs/rescale/{sample}.log",
    threads: 1
    wrapper:
        "v7.3.0/bio/mapdamage2"


rule mapdamage2_all:
    input:
        bam="mapped/{sample}.bam",
        ref="genome.fasta",
    output:
        GtoA3p="results/all/{sample}.3pGtoA_freq.txt",
        CtoT5p="results/all/{sample}.5pCtoT_freq.txt",
        dnacomp="results/all/{sample}.dnacomp.txt",
        lg_dist="results/all/{sample}.lgdistribution.txt",
        misinc="results/all/{sample}.misincorporation.txt",
        plot_misinc="results/all/{sample}.Fragmisincorporation_plot.pdf",
        plot_len="results/all/{sample}.Length_plot.pdf",
        stats_ref="results/all/{sample}.dnacomp_genome.csv",
        stats_prob="results/all/{sample}.Stats_out_MCMC_correct_prob.csv",
        stats_hist="results/all/{sample}.Stats_out_MCMC_hist.pdf",
        stats_iter="results/all/{sample}.Stats_out_MCMC_iter.csv",
        stats_summ="results/all/{sample}.Stats_out_MCMC_iter_summ_stat.csv",
        stats_plot_freq="results/all/{sample}.Stats_out_MCMC_post_pred.pdf",
        stats_plot_trace="results/all/{sample}.Stats_out_MCMC_trace.pdf",
        bam="results/all/{sample}.bam",
        log="results/all/{sample}.log",
    params:
        extra="--burn 10 --iter 10",
    log:
        "logs/all/{sample}.log",
    threads: 1
    wrapper:
        "v7.3.0/bio/mapdamage2"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

  • All output files can also be used as input, when using existing results.

  • All output files are optional.

  • Optional output flags (–plots-only, –stats-only, –rescale-only, and –no-stats) are automatically inferred.

Software dependencies

  • mapdamage2=2.2.3

Input/Output

Input:

  • reference genome

  • SAM/BAM/CRAM alignemnt

Output:

  • log: log file with a summary of command lines used and timestamps.

  • misinc: table with occurrences for each type of mutations and relative positions from the reads ends.

  • CtoT5p: frequencies of Cytosine to Thymine mutations per position from the 5’-ends.

  • GtoA3p: frequencies of Guanine to Adenine mutations per position from the 3’-ends.

  • dnacomp: table of the reference genome base composition per position, inside reads and adjacent regions.

  • lg_dist: table with read length distributions per strand.

  • plot_misinc: pdf file that displays both fragmentation and misincorporation patterns.

  • plot_len: pdf file that displays length distribution of singleton reads per strand and cumulative frequencies of C->T at 5’-end and G->A at 3’-end are also displayed per strand.

  • stats_hist: MCMC histogram for the damage parameters and log likelihood.

  • stats_iter: damage parameters and log likelihood in each MCMC iteration.

  • stats_summ: summary statistics for the damage parameters estimated posterior distributions.

  • stats_prob: position specific probability of a C->T and G->A misincorporation is due to damage.

  • stats_ref: global reference genome base composition (computed by seqtk).

  • stats_plot_trace: MCMC trace plot for the damage parameters and log likelihood.

  • stats_plot_freq: empirical misincorporation frequency and posterior predictive intervals from the fitted model.

  • bam: rescaled BAM file, where likely post-mortem damaged bases have downscaled quality scores.

Params

  • extra: additional program arguments.

Authors

  • Filipe G. Vieira

Code

__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"

import tempfile
from pathlib import Path
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)

# Input BAM fle
in_bam = snakemake.input.get("bam", "")
if in_bam:
    in_bam = f"--input {in_bam}"

# Output
out_plot = any([in_tag.startswith("plot_") for in_tag, _ in snakemake.output.items()])
out_stats = any([in_tag.startswith("stats_") for in_tag, _ in snakemake.output.items()])
out_bam = snakemake.output.get("bam", "")
if out_bam:
    out_bam = f"--rescale --rescale-out {out_bam}"

# Check if only tpye of output is specified
# https://stackoverflow.com/questions/16801322/how-can-i-check-that-a-list-has-one-and-only-one-truthy-value
i = iter([out_plot, out_stats, out_bam])
if any(i) and not any(i):
    if out_plot:
        extra += " --no-stats"
    elif out_stats:
        extra += " --stats-only"
    elif out_bam:
        extra += " --rescale-only"
    else:
        raise ValueError("unexpected error.")


with tempfile.TemporaryDirectory() as tmpdir:
    fnames_tmp = {
        "GtoA3p": "3pGtoA_freq.txt",
        "CtoT5p": "5pCtoT_freq.txt",
        "dnacomp": "dnacomp.txt",
        "lg_dist": "lgdistribution.txt",
        "misinc": "misincorporation.txt",
        "plot_misinc": "Fragmisincorporation_plot.pdf",
        "plot_len": "Length_plot.pdf",
        "log": "Runtime_log.txt",
        "stats_ref": "dnacomp_genome.csv",
        "stats_prob": "Stats_out_MCMC_correct_prob.csv",
        "stats_hist": "Stats_out_MCMC_hist.pdf",
        "stats_iter": "Stats_out_MCMC_iter.csv",
        "stats_summ": "Stats_out_MCMC_iter_summ_stat.csv",
        "stats_plot_freq": "Stats_out_MCMC_post_pred.pdf",
        "stats_plot_trace": "Stats_out_MCMC_trace.pdf",
    }

    # Symlink input files to temp (to use as results) folder
    for in_tag, fname_tmp in snakemake.input.items():
        if in_tag not in ["bam", "ref"]:
            (Path(tmpdir) / fnames_tmp[in_tag]).symlink_to(Path(fname_tmp).resolve())

    shell(
        "mapDamage"
        " {in_bam}"
        " --reference {snakemake.input.ref}"
        " --folder {tmpdir}"
        " {out_bam}"
        " {extra}"
        " {log}"
    )

    for in_tag, fname_tmp in fnames_tmp.items():
        fname_dest = snakemake.output.get(in_tag)
        if fname_dest:
            shell("cp --verbose {tmpdir}/{fname_tmp} {fname_dest} {log}")