MAPDAMAGE2
mapDamage2 is a computational framework written in Python and R, which tracks and quantifies DNA damage patterns among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms.
Example
This wrapper can be used in the following way:
rule mapdamage2:
input:
ref="genome.fasta",
bam="mapped/{sample}.bam",
output:
log="results/{sample}/Runtime_log.txt", # output folder is infered from this file, so it needs to be the same folder for all output files
GtoA3p="results/{sample}/3pGtoA_freq.txt",
CtoT5p="results/{sample}/5pCtoT_freq.txt",
dnacomp="results/{sample}/dnacomp.txt",
frag_misincorp="results/{sample}/Fragmisincorporation_plot.pdf",
len="results/{sample}/Length_plot.pdf",
lg_dist="results/{sample}/lgdistribution.txt",
misincorp="results/{sample}/misincorporation.txt",
# rescaled_bam="results/{sample}.rescaled.bam", # uncomment if you want the rescaled BAM file
params:
extra="--no-stats", # optional parameters for mapdamage2 (except -i, -r, -d, --rescale)
log:
"logs/{sample}/mapdamage2.log",
threads: 1 # MapDamage2 is not threaded
wrapper:
"v5.5.2/bio/mapdamage2"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
The extra param allows for additional program arguments.
For more information see, https://ginolhac.github.io/mapDamage/
Software dependencies
mapdamage2=2.2.2
python=3.12.8
pysam=0.22.1
Input/Output
Input:
reference genome
SAM/BAM/CRAM alignemnt
Output:
Runtime_log.txt
: log file with a summary of command lines used and timestamps.Fragmisincorporation_plot.pdf, a pdf file that displays both fragmentation and misincorporation patterns.
Length_plot.pdf, a pdf file that displays length distribution of singleton reads per strand and cumulative frequencies of C->T at 5’-end and G->A at 3’-end are also displayed per strand.
misincorporation.txt, contains a table with occurrences for each type of mutations and relative positions from the reads ends.
5pCtoT_freq.txt, contains frequencies of Cytosine to Thymine mutations per position from the 5’-ends.
3pGtoA_freq.txt, contains frequencies of Guanine to Adenine mutations per position from the 3’-ends.
dnacomp.txt, contains a table of the reference genome base composition per position, inside reads and adjacent regions.
lgdistribution.txt, contains a table with read length distributions per strand.
Stats_out_MCMC_hist.pdf, MCMC histogram for the damage parameters and log likelihood.
Stats_out_MCMC_iter.csv, values for the damage parameters and log likelihood in each MCMC iteration.
Stats_out_MCMC_trace.pdf, a MCMC trace plot for the damage parameters and log likelihood.
Stats_out_MCMC_iter_summ_stat.csv, summary statistics for the damage parameters estimated posterior distributions.
Stats_out_post_pred.pdf, empirical misincorporation frequency and posterior predictive intervals from the fitted model.
Stats_out_MCMC_correct_prob.csv, position specific probability of a C->T and G->A misincorporation is due to damage.
dnacomp_genome.txt, contains the global reference genome base composition (computed by seqtk).
Rescaled BAM file, where likely post-mortem damaged bases have downscaled quality scores.
Code
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2020, Filipe G. Vieira"
__license__ = "MIT"
import os.path
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
in_bam = snakemake.input.get("bam", "")
if in_bam:
in_bam = "--input " + in_bam
output_folder = os.path.dirname(snakemake.output.get("log", ""))
if not output_folder:
raise ValueError("mapDamage2 rule needs output 'log'.")
rescaled_bam = snakemake.output.get("rescaled_bam", "")
if rescaled_bam:
rescaled_bam = "--rescale-out " + rescaled_bam
shell(
"mapDamage "
"{in_bam} "
"--reference {snakemake.input.ref} "
"--folder {output_folder} "
"{rescaled_bam} "
"{extra} "
"{log}"
)