.. _`bio/nonpareil/infer`: NONPAREIL INFER =============== .. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/nonpareil/infer?label=version%20update%20pull%20requests :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/nonpareil/infer Nonpareil uses the redundancy of the reads in metagenomic datasets to estimate the average coverage and predict the amount of sequences that will be required to achieve “nearly complete coverage”. **URL**: https://nonpareil.readthedocs.io/en/latest/ Example ------- This wrapper can be used in the following way: .. code-block:: python rule nonpareil: input: "reads/{sample}", output: redund_sum="results/{sample}.npo", redund_val="results/{sample}.npa", mate_distr="results/{sample}.npc", log="results/{sample}.log", log: "logs/{sample}.log", params: alg="kmer", infer_X=True, extra="-k 3 -F", threads: 2 resources: mem_mb=50, wrapper: "v3.0.1/bio/nonpareil/infer" Note that input, output and log file paths can be chosen freely. When running with .. code-block:: bash snakemake --use-conda the software dependencies will be automatically deployed into an isolated environment before execution. Notes ----- * For a PDF version of the manual, see https://nonpareil.readthedocs.io/_/downloads/en/latest/pdf/ Software dependencies --------------------- * ``nonpareil=3.4.1`` * ``pigz`` * ``pbzip2`` * ``snakemake-wrapper-utils=0.6.2`` Input/Output ------------ **Input:** * reads in FASTA/Q format (can be gziped or bziped) **Output:** * ``redund_sum``: redundancy summary TSV file with six columns, representing sequencing effort, summary of the distribution of redundancy (average redundancy, standard deviation, quartile 1, median, and quartile 3). * ``redund_val``: redundancy values TSV file with three columns (similar to redundancy summary, but provides ALL results), representing sequencing effort, ID of the replicate and estimated redundancy value. * ``mate_distr``: mate distribution file, with the number of reads in the dataset matching a query read. * ``log``: log of internal Nonpareil processing. Params ------ * ``alg``: nonpareil algorithm, either `kmer` or `alignment` (mandatory). * ``infer_X``: automatically infer value of `-X` (couple of minutes slower to count number of reads) * ``extra``: additional program arguments (not `-X` if infer_X == True) Authors ------- * Filipe G. Vieira Code ---- .. code-block:: python __author__ = "Filipe G. Vieira" __copyright__ = "Copyright 2023, Filipe G. Vieira" __license__ = "MIT" from os import path import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.snakemake import get_mem log = snakemake.log_fmt_shell(stdout=True, stderr=True) extra = snakemake.params.get("extra", "") mem_mb = get_mem(snakemake, out_unit="MiB") uncomp = "" in_name, in_ext = path.splitext(snakemake.input[0]) if in_ext in [".gz", ".bz2"]: uncomp = ( f"pigz --processes {snakemake.threads} --decompress --stdout" if in_ext == ".gz" else f"pbzip2 -p{snakemake.threads} --decompress --stdout" ) in_name, in_ext = path.splitext(in_name) # Infer output format if in_ext in [".fa", ".fas", ".fasta"]: in_format = "fasta" elif in_ext in [".fq", ".fastq"]: in_format = "fastq" else: raise ValueError("invalid input format") # Redundancy summary redund_sum = snakemake.output.get("redund_sum", "") if redund_sum: redund_sum = f"-o {redund_sum}" # Redundancy values redund_val = snakemake.output.get("redund_val", "") if redund_val: redund_val = f"-a {redund_val}" # Mate distribution mate_distr = snakemake.output.get("mate_distr", "") if mate_distr: mate_distr = f"-C {mate_distr}" # Log out_log = snakemake.output.get("log", "") if out_log: out_log = f"-l {out_log}" with tempfile.NamedTemporaryFile() as tmp: if uncomp: in_uncomp = tmp.name shell("{uncomp} {snakemake.input[0]} > {tmp.name}") else: in_uncomp = snakemake.input[0] # Auto infer -X value if snakemake.params.get("infer_X", True): # Get total number of lines total_n_lines = sum(1 for line in open(in_uncomp, "rb")) # Get total number of reads (depends on format) total_n_reads = total_n_lines / 4 if in_format == "fastq" else total_n_lines / 2 # Get total number of reads to sample sample_n_reads = max(1, int(total_n_reads * 0.1) - 1) # Get total number of reads to sample, depending on defaults sample_n_reads = ( min(1000, sample_n_reads) if snakemake.params.alg == "alignment" else min(10000, sample_n_reads) ) extra += f" -X {sample_n_reads}" shell( "nonpareil" " -t {snakemake.threads}" " -R {mem_mb}" " -T {snakemake.params.alg}" " -s {in_uncomp}" " -f {in_format}" " {extra}" " {redund_sum}" " {redund_val}" " {mate_distr}" " {out_log}" " {log}" ) .. |nl| raw:: html