MAPDAMAGE2
mapDamage2 is a computational framework written in Python and R, which tracks and quantifies DNA damage patterns among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms.
URL: https://ginolhac.github.io/mapDamage/
Example
This wrapper can be used in the following way:
rule mapdamage2_plot:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
output:
GtoA3p="results/plot/{sample}.3pGtoA_freq.txt",
CtoT5p="results/plot/{sample}.5pCtoT_freq.txt",
dnacomp="results/plot/{sample}.dnacomp.txt",
lg_dist="results/plot/{sample}.lgdistribution.txt",
misinc="results/plot/{sample}.misincorporation.txt",
plot_misinc="results/plot/{sample}.Fragmisincorporation_plot.pdf",
plot_len="results/plot/{sample}.Length_plot.pdf",
log="results/plot/{sample}.log",
params:
extra="-m 20",
log:
"logs/plot/{sample}.log",
threads: 1
wrapper:
"v7.3.0/bio/mapdamage2"
rule mapdamage2_stats:
input:
GtoA3p=rules.mapdamage2_plot.output.GtoA3p,
CtoT5p=rules.mapdamage2_plot.output.CtoT5p,
misinc=rules.mapdamage2_plot.output.misinc,
ref="genome.fasta",
output:
stats_ref="results/stats/{sample}.dnacomp_genome.csv",
stats_prob="results/stats/{sample}.Stats_out_MCMC_correct_prob.csv",
stats_hist="results/stats/{sample}.Stats_out_MCMC_hist.pdf",
stats_iter="results/stats/{sample}.Stats_out_MCMC_iter.csv",
stats_summ="results/stats/{sample}.Stats_out_MCMC_iter_summ_stat.csv",
stats_plot_freq="results/stats/{sample}.Stats_out_MCMC_post_pred.pdf",
stats_plot_trace="results/stats/{sample}.Stats_out_MCMC_trace.pdf",
log="results/stats/{sample}.log",
params:
extra="--burn 10 --iter 10",
log:
"logs/stats/{sample}.log",
threads: 1
wrapper:
"v7.3.0/bio/mapdamage2"
rule mapdamage2_rescale:
input:
bam="mapped/{sample}.bam",
stats_prob=rules.mapdamage2_stats.output.stats_prob,
ref="genome.fasta",
output:
bam="results/rescale/{sample}.bam",
log="results/rescale/{sample}.log",
params:
extra="",
log:
"logs/rescale/{sample}.log",
threads: 1
wrapper:
"v7.3.0/bio/mapdamage2"
rule mapdamage2_all:
input:
bam="mapped/{sample}.bam",
ref="genome.fasta",
output:
GtoA3p="results/all/{sample}.3pGtoA_freq.txt",
CtoT5p="results/all/{sample}.5pCtoT_freq.txt",
dnacomp="results/all/{sample}.dnacomp.txt",
lg_dist="results/all/{sample}.lgdistribution.txt",
misinc="results/all/{sample}.misincorporation.txt",
plot_misinc="results/all/{sample}.Fragmisincorporation_plot.pdf",
plot_len="results/all/{sample}.Length_plot.pdf",
stats_ref="results/all/{sample}.dnacomp_genome.csv",
stats_prob="results/all/{sample}.Stats_out_MCMC_correct_prob.csv",
stats_hist="results/all/{sample}.Stats_out_MCMC_hist.pdf",
stats_iter="results/all/{sample}.Stats_out_MCMC_iter.csv",
stats_summ="results/all/{sample}.Stats_out_MCMC_iter_summ_stat.csv",
stats_plot_freq="results/all/{sample}.Stats_out_MCMC_post_pred.pdf",
stats_plot_trace="results/all/{sample}.Stats_out_MCMC_trace.pdf",
bam="results/all/{sample}.bam",
log="results/all/{sample}.log",
params:
extra="--burn 10 --iter 10",
log:
"logs/all/{sample}.log",
threads: 1
wrapper:
"v7.3.0/bio/mapdamage2"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
All output files can also be used as input, when using existing results.
All output files are optional.
Optional output flags (–plots-only, –stats-only, –rescale-only, and –no-stats) are automatically inferred.
Software dependencies
mapdamage2=2.2.3
Input/Output
Input:
reference genome
SAM/BAM/CRAM alignemnt
Output:
log: log file with a summary of command lines used and timestamps.misinc: table with occurrences for each type of mutations and relative positions from the reads ends.CtoT5p: frequencies of Cytosine to Thymine mutations per position from the 5’-ends.GtoA3p: frequencies of Guanine to Adenine mutations per position from the 3’-ends.dnacomp: table of the reference genome base composition per position, inside reads and adjacent regions.lg_dist: table with read length distributions per strand.plot_misinc: pdf file that displays both fragmentation and misincorporation patterns.plot_len: pdf file that displays length distribution of singleton reads per strand and cumulative frequencies of C->T at 5’-end and G->A at 3’-end are also displayed per strand.stats_hist: MCMC histogram for the damage parameters and log likelihood.stats_iter: damage parameters and log likelihood in each MCMC iteration.stats_summ: summary statistics for the damage parameters estimated posterior distributions.stats_prob: position specific probability of a C->T and G->A misincorporation is due to damage.stats_ref: global reference genome base composition (computed by seqtk).stats_plot_trace: MCMC trace plot for the damage parameters and log likelihood.stats_plot_freq: empirical misincorporation frequency and posterior predictive intervals from the fitted model.bam: rescaled BAM file, where likely post-mortem damaged bases have downscaled quality scores.
Params
extra: additional program arguments.
Code
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"
import tempfile
from pathlib import Path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
# Input BAM fle
in_bam = snakemake.input.get("bam", "")
if in_bam:
in_bam = f"--input {in_bam}"
# Output
out_plot = any([in_tag.startswith("plot_") for in_tag, _ in snakemake.output.items()])
out_stats = any([in_tag.startswith("stats_") for in_tag, _ in snakemake.output.items()])
out_bam = snakemake.output.get("bam", "")
if out_bam:
out_bam = f"--rescale --rescale-out {out_bam}"
# Check if only tpye of output is specified
# https://stackoverflow.com/questions/16801322/how-can-i-check-that-a-list-has-one-and-only-one-truthy-value
i = iter([out_plot, out_stats, out_bam])
if any(i) and not any(i):
if out_plot:
extra += " --no-stats"
elif out_stats:
extra += " --stats-only"
elif out_bam:
extra += " --rescale-only"
else:
raise ValueError("unexpected error.")
with tempfile.TemporaryDirectory() as tmpdir:
fnames_tmp = {
"GtoA3p": "3pGtoA_freq.txt",
"CtoT5p": "5pCtoT_freq.txt",
"dnacomp": "dnacomp.txt",
"lg_dist": "lgdistribution.txt",
"misinc": "misincorporation.txt",
"plot_misinc": "Fragmisincorporation_plot.pdf",
"plot_len": "Length_plot.pdf",
"log": "Runtime_log.txt",
"stats_ref": "dnacomp_genome.csv",
"stats_prob": "Stats_out_MCMC_correct_prob.csv",
"stats_hist": "Stats_out_MCMC_hist.pdf",
"stats_iter": "Stats_out_MCMC_iter.csv",
"stats_summ": "Stats_out_MCMC_iter_summ_stat.csv",
"stats_plot_freq": "Stats_out_MCMC_post_pred.pdf",
"stats_plot_trace": "Stats_out_MCMC_trace.pdf",
}
# Symlink input files to temp (to use as results) folder
for in_tag, fname_tmp in snakemake.input.items():
if in_tag not in ["bam", "ref"]:
(Path(tmpdir) / fnames_tmp[in_tag]).symlink_to(Path(fname_tmp).resolve())
shell(
"mapDamage"
" {in_bam}"
" --reference {snakemake.input.ref}"
" --folder {tmpdir}"
" {out_bam}"
" {extra}"
" {log}"
)
for in_tag, fname_tmp in fnames_tmp.items():
fname_dest = snakemake.output.get(in_tag)
if fname_dest:
shell("cp --verbose {tmpdir}/{fname_tmp} {fname_dest} {log}")