BAM2NUC#
Calculate mono- and di-nucleotide coverage of the reads and compares them with average genomic sequence composition (see https://github.com/FelixKrueger/Bismark/blob/master/bam2nuc).
Example#
This wrapper can be used in the following way:
# Nucleotide stats for genome is required for further stats for BAM file
rule bam2nuc_for_genome:
input:
genome_fa="indexes/{genome}/{genome}.fa.gz"
output:
"indexes/{genome}/genomic_nucleotide_frequencies.txt"
log:
"logs/indexes/{genome}/genomic_nucleotide_frequencies.txt.log"
wrapper:
"v3.0.2/bio/bismark/bam2nuc"
# Nucleotide stats for BAM file
rule bam2nuc_for_bam:
input:
genome_fa="indexes/{genome}/{genome}.fa.gz",
bam="bams/{sample}_{genome}.bam"
output:
report="bams/{sample}_{genome}.nucleotide_stats.txt"
log:
"logs/{sample}_{genome}.nucleotide_stats.txt.log"
wrapper:
"v3.0.2/bio/bismark/bam2nuc"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies#
bowtie2=2.5.2
bismark=0.24.2
samtools=1.18
Input/Output#
Input:
genome_fa
: Path to genome in FastA format (e.g. *.fa, *.fasta, *.fa.gz, *.fasta.gz). All genomes FastA from it’s parent folder will be takenbam
: Optional BAM or CRAM file (or multiple space separated files). If bam arg isn’t provided, option –genomic_composition_only will be used to generate genomic composition table genomic_nucleotide_frequencies.txt.
Output:
Genome nucleotide frequencies genomic_nucleotide_frequencies.txt will be generated in ‘genome_fa’ directory, optional output.
report
: Report file (or space separated files), pattern ‘{bam_file_name}.nucleotide_stats.txt’.
Params#
extra
: Any additional args
Code#
"""Snakemake wrapper for bam2nuc tool that calculates mono- and di-nucleotide coverage of the reads and compares them with average genomic sequence
composition."""
# https://github.com/FelixKrueger/Bismark/blob/master/bam2nuc
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "roman.chernyatchik@jetbrains.com"
__license__ = "MIT"
import os
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
cmdline_args = ["bam2nuc {extra}"]
genome_fa = snakemake.input.get("genome_fa", None)
if not genome_fa:
raise ValueError("bismark/bam2nuc: Error 'genome_fa' input not specified.")
genome_folder = os.path.dirname(genome_fa)
cmdline_args.append("--genome_folder {genome_folder:q}")
bam = snakemake.input.get("bam", None)
if bam:
cmdline_args.append("{bam}")
bams = bam if isinstance(bam, list) else [bam]
report = snakemake.output.get("report", None)
if not report:
raise ValueError("bismark/bam2nuc: Error 'report' output isn't specified.")
reports = report if isinstance(report, list) else [report]
if len(reports) != len(bams):
raise ValueError(
"bismark/bam2nuc: Error number of paths in output:report ({} files)"
" should be same as in input:bam ({} files).".format(
len(reports), len(bams)
)
)
output_dir = os.path.dirname(reports[0])
if any(output_dir != os.path.dirname(p) for p in reports):
raise ValueError(
"bismark/bam2nuc: Error all reports should be in same directory:"
" {}".format(output_dir)
)
if output_dir:
cmdline_args.append("--dir {output_dir:q}")
else:
cmdline_args.append("--genomic_composition_only")
# log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
cmdline_args.append("{log}")
# run
shell(" ".join(cmdline_args))
# Move outputs into proper position.
if bam:
log_append = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)
expected_2_actual_paths = []
for bam_path, report_path in zip(bams, reports):
bam_name = os.path.basename(bam_path)
bam_basename = os.path.splitext(bam_name)[0]
expected_2_actual_paths.append(
(
report_path,
os.path.join(
output_dir, "{}.nucleotide_stats.txt".format(bam_basename)
),
)
)
for exp_path, actual_path in expected_2_actual_paths:
if exp_path and (exp_path != actual_path):
shell("mv {actual_path:q} {exp_path:q} {log_append}")