NONPAREIL INFER
Nonpareil uses the redundancy of the reads in metagenomic datasets to estimate the average coverage and predict the amount of sequences that will be required to achieve “nearly complete coverage”.
URL: https://nonpareil.readthedocs.io/en/latest/
Example
This wrapper can be used in the following way:
rule nonpareil:
    input:
        "reads/{sample}",
    output:
        redund_sum="results/{sample}.npo",
        redund_val="results/{sample}.npa",
        mate_distr="results/{sample}.npc",
        log="results/{sample}.log",
    log:
        "logs/{sample}.log",
    params:
        alg="kmer",
        infer_X=True,
        extra="-k 3 -F",
    threads: 2
    resources:
        mem_mb=50,
    wrapper:
        "v3.9.0/bio/nonpareil/infer"
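Note that input, output and log file paths can be chosen freely, with one constraint: the wrapper infers the input format from the file name, so the input must end in .fa, .fas, .fasta, .fq or .fastq, optionally followed by .gz or .bz2. In the rule above, the {sample} wildcard therefore has to include that extension.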
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
For a PDF version of the manual, see https://nonpareil.readthedocs.io/_/downloads/en/latest/pdf/
Software dependencies
nonpareil=3.4.1
pigz
pbzip2
snakemake-wrapper-utils=0.6.2
Input/Output
Input:
reads in FASTA/Q format (can be gzipped or bzipped)
Output:
redund_sum
: redundancy summary TSV file with six columns: sequencing effort, followed by a summary of the distribution of redundancy (average redundancy, standard deviation, quartile 1, median, and quartile 3).
redund_val
: redundancy values TSV file with three columns (similar to the redundancy summary, but provides ALL results): sequencing effort, ID of the replicate, and estimated redundancy value.
mate_distr
: mate distribution file, with the number of reads in the dataset matching a query read.
log
: log of internal Nonpareil processing.
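As an illustration of the column layout described for redund_sum, the following Python sketch reads such a file into named rows. It is not part of the wrapper. It assumes that any header or comment lines start with "#" and that the six columns appear in the order listed above; the function name, the row type and the sample values are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class RedundancyRow(NamedTuple):
    """One row of the redundancy summary, in the column order listed above."""
    effort: float
    mean: float
    sd: float
    q1: float
    median: float
    q3: float


def read_redund_sum(handle):
    """Parse a six-column TSV, skipping blank lines and '#' comment lines.

    Assumption: comment/header lines start with '#'.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        rows.append(RedundancyRow(*(float(x) for x in fields[:6])))
    return rows


# Made-up values for illustration only, not real Nonpareil output.
example = io.StringIO(
    "# hypothetical header line\n"
    "0\t0.0\t0.0\t0.0\t0.0\t0.0\n"
    "1000\t0.25\t0.05\t0.2\t0.25\t0.3\n"
)
rows = read_redund_sum(example)
print(rows[-1].effort, rows[-1].median)
```

In a workflow, the same function could be given an open file handle on the path declared as redund_sum.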
Params
alg
: nonpareil algorithm, either kmer or alignment (mandatory).
infer_X
: automatically infer the value of -X (defaults to True; adds a couple of minutes, because the reads must be counted first).
extra
: additional program arguments (do not include -X when infer_X is True).
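The arithmetic behind infer_X can be sketched as a standalone function. This mirrors the calculation in the wrapper code (10% of the reads minus one, at least 1, capped at 1000 for alignment and 10000 for kmer); the function name and its arguments are hypothetical and exist only for this illustration. Note that dividing the line count by 2 for FASTA presumes one sequence line per record.

```python
def infer_sample_size(total_n_lines, in_format, alg):
    """Return the value that would be passed to -X.

    total_n_lines: number of lines in the uncompressed input
    in_format:     "fastq" (4 lines per read) or "fasta" (2 lines per read,
                   i.e. single-line sequences are assumed)
    alg:           "kmer" or "alignment"
    """
    lines_per_read = 4 if in_format == "fastq" else 2
    total_n_reads = total_n_lines / lines_per_read
    # Sample 10% of the reads, minus one, but never fewer than one read
    sample_n_reads = max(1, int(total_n_reads * 0.1) - 1)
    # Upper bound depends on the algorithm
    cap = 1000 if alg == "alignment" else 10000
    return min(cap, sample_n_reads)


for lines, fmt, alg in [
    (8, "fastq", "kmer"),
    (4020, "fastq", "kmer"),
    (4_000_000, "fastq", "kmer"),
    (2_000_000, "fasta", "alignment"),
]:
    print(lines, fmt, alg, infer_sample_size(lines, fmt, alg))
```

Because the wrapper appends the computed -X to extra, passing another -X through extra would give the option twice, which is why the two should not be combined.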
Code
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2023, Filipe G. Vieira"
__license__ = "MIT"
from os import path
import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.snakemake import get_mem
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")
mem_mb = get_mem(snakemake, out_unit="MiB")
# Pick a decompression command if the input is compressed
uncomp = ""
in_name, in_ext = path.splitext(snakemake.input[0])
if in_ext in [".gz", ".bz2"]:
    uncomp = (
        f"pigz --processes {snakemake.threads} --decompress --stdout"
        if in_ext == ".gz"
        else f"pbzip2 -p{snakemake.threads} --decompress --stdout"
    )
    in_name, in_ext = path.splitext(in_name)

# Infer input format
if in_ext in [".fa", ".fas", ".fasta"]:
    in_format = "fasta"
elif in_ext in [".fq", ".fastq"]:
    in_format = "fastq"
else:
    raise ValueError("invalid input format")
# Redundancy summary
redund_sum = snakemake.output.get("redund_sum", "")
if redund_sum:
    redund_sum = f"-o {redund_sum}"

# Redundancy values
redund_val = snakemake.output.get("redund_val", "")
if redund_val:
    redund_val = f"-a {redund_val}"

# Mate distribution
mate_distr = snakemake.output.get("mate_distr", "")
if mate_distr:
    mate_distr = f"-C {mate_distr}"

# Log
out_log = snakemake.output.get("log", "")
if out_log:
    out_log = f"-l {out_log}"
with tempfile.NamedTemporaryFile() as tmp:
    if uncomp:
        in_uncomp = tmp.name
        shell("{uncomp} {snakemake.input[0]} > {tmp.name}")
    else:
        in_uncomp = snakemake.input[0]

    # Auto infer -X value
    if snakemake.params.get("infer_X", True):
        # Get total number of lines
        with open(in_uncomp, "rb") as reads:
            total_n_lines = sum(1 for line in reads)
        # Get total number of reads (depends on format)
        total_n_reads = total_n_lines / 4 if in_format == "fastq" else total_n_lines / 2
        # Get total number of reads to sample
        sample_n_reads = max(1, int(total_n_reads * 0.1) - 1)
        # Get total number of reads to sample, depending on defaults
        sample_n_reads = (
            min(1000, sample_n_reads)
            if snakemake.params.alg == "alignment"
            else min(10000, sample_n_reads)
        )
        extra += f" -X {sample_n_reads}"

    # Run inside the "with" block so the temporary file still exists
    shell(
        "nonpareil"
        " -t {snakemake.threads}"
        " -R {mem_mb}"
        " -T {snakemake.params.alg}"
        " -s {in_uncomp}"
        " -f {in_format}"
        " {extra}"
        " {redund_sum}"
        " {redund_val}"
        " {mate_distr}"
        " {out_log}"
        " {log}"
    )
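The extension handling at the top of the wrapper can be tried in isolation, which is a quick way to check whether a given input path will be accepted before running the workflow. The sketch below reproduces that logic as a plain function; the function name and the example paths are hypothetical, and only the extension lists are taken from the wrapper code.

```python
from os import path

COMPRESSED_EXT = {".gz", ".bz2"}
FASTA_EXT = {".fa", ".fas", ".fasta"}
FASTQ_EXT = {".fq", ".fastq"}


def classify_input(filename):
    """Return (compression, format) for a reads file, based on its name only.

    compression is "gz", "bz2" or None; format is "fasta" or "fastq".
    Raises ValueError for any other extension, as the wrapper does.
    """
    stem, ext = path.splitext(filename)
    compression = None
    if ext in COMPRESSED_EXT:
        compression = ext.lstrip(".")
        # Look at the extension underneath the compression suffix
        stem, ext = path.splitext(stem)
    if ext in FASTA_EXT:
        return compression, "fasta"
    if ext in FASTQ_EXT:
        return compression, "fastq"
    raise ValueError(f"invalid input format: {filename!r}")


for name in ["reads/a.fa", "reads/a.fastq.gz", "reads/a.fq.bz2"]:
    print(name, classify_input(name))
```

A path without a recognised extension, such as reads/a, raises ValueError.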