.. _`bio/nonpareil/infer`:

NONPAREIL INFER
===============


.. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/nonpareil/infer?label=version%20update%20pull%20requests
   :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/nonpareil/infer

Nonpareil uses the redundancy of the reads in metagenomic datasets to estimate the average coverage and predict the amount of sequences that will be required to achieve “nearly complete coverage”.

**URL**: https://nonpareil.readthedocs.io/en/latest/

Example
-------

This wrapper can be used in the following way:

.. code-block:: python

    rule nonpareil:
        input:
            "reads/{sample}",
        output:
            redund_sum="results/{sample}.npo",
            redund_val="results/{sample}.npa",
            mate_distr="results/{sample}.npc",
            log="results/{sample}.log",
        log:
            "logs/{sample}.log",
        params:
            alg="kmer",
            infer_X=True,
            extra="-k 3 -F",
        threads: 2
        resources:
            mem_mb=50,
        wrapper:
            "v3.0.1/bio/nonpareil/infer"

Note that input, output and log file paths can be chosen freely.

When running with

.. code-block:: bash

    snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.


Notes
-----

* For a PDF version of the manual, see https://nonpareil.readthedocs.io/_/downloads/en/latest/pdf/


Software dependencies
---------------------

* ``nonpareil=3.4.1``
* ``pigz``
* ``pbzip2``
* ``snakemake-wrapper-utils=0.6.2``

Input/Output
------------
**Input:**

* reads in FASTA/Q format (can be gziped or bziped)

**Output:**

* ``redund_sum``: redundancy summary TSV file with six columns, representing sequencing effort, summary of the distribution of redundancy (average redundancy, standard deviation, quartile 1, median, and quartile 3).
* ``redund_val``: redundancy values TSV file with three columns (similar to redundancy summary, but provides ALL results), representing sequencing effort, ID of the replicate and estimated redundancy value.
* ``mate_distr``: mate distribution file, with the number of reads in the dataset matching a query read.
* ``log``: log of internal Nonpareil processing.


Params
------

* ``alg``: nonpareil algorithm, either `kmer` or `alignment` (mandatory).

* ``infer_X``: automatically infer value of `-X` (couple of minutes slower to count number of reads)

* ``extra``: additional program arguments (not `-X` if infer_X == True)


Authors
-------

* Filipe G. Vieira


Code
----

.. code-block:: python

    __author__ = "Filipe G. Vieira"
    __copyright__ = "Copyright 2023, Filipe G. Vieira"
    __license__ = "MIT"

    from os import path
    import tempfile
    from snakemake.shell import shell
    from snakemake_wrapper_utils.snakemake import get_mem


    log = snakemake.log_fmt_shell(stdout=True, stderr=True)
    extra = snakemake.params.get("extra", "")
    mem_mb = get_mem(snakemake, out_unit="MiB")

    uncomp = ""
    in_name, in_ext = path.splitext(snakemake.input[0])
    if in_ext in [".gz", ".bz2"]:
        uncomp = (
            f"pigz --processes {snakemake.threads} --decompress --stdout"
            if in_ext == ".gz"
            else f"pbzip2 -p{snakemake.threads} --decompress --stdout"
        )
        in_name, in_ext = path.splitext(in_name)

    # Infer output format
    if in_ext in [".fa", ".fas", ".fasta"]:
        in_format = "fasta"
    elif in_ext in [".fq", ".fastq"]:
        in_format = "fastq"
    else:
        raise ValueError("invalid input format")

    # Redundancy summary
    redund_sum = snakemake.output.get("redund_sum", "")
    if redund_sum:
        redund_sum = f"-o {redund_sum}"

    # Redundancy values
    redund_val = snakemake.output.get("redund_val", "")
    if redund_val:
        redund_val = f"-a {redund_val}"

    # Mate distribution
    mate_distr = snakemake.output.get("mate_distr", "")
    if mate_distr:
        mate_distr = f"-C {mate_distr}"

    # Log
    out_log = snakemake.output.get("log", "")
    if out_log:
        out_log = f"-l {out_log}"


    with tempfile.NamedTemporaryFile() as tmp:
        if uncomp:
            in_uncomp = tmp.name
            shell("{uncomp} {snakemake.input[0]} > {tmp.name}")
        else:
            in_uncomp = snakemake.input[0]

        # Auto infer -X value
        if snakemake.params.get("infer_X", True):
            # Get total number of lines
            total_n_lines = sum(1 for line in open(in_uncomp, "rb"))
            # Get total number of reads (depends on format)
            total_n_reads = total_n_lines / 4 if in_format == "fastq" else total_n_lines / 2
            # Get total number of reads to sample
            sample_n_reads = max(1, int(total_n_reads * 0.1) - 1)
            # Get total number of reads to sample, depending on defaults
            sample_n_reads = (
                min(1000, sample_n_reads)
                if snakemake.params.alg == "alignment"
                else min(10000, sample_n_reads)
            )
            extra += f" -X {sample_n_reads}"

        shell(
            "nonpareil"
            " -t {snakemake.threads}"
            " -R {mem_mb}"
            " -T {snakemake.params.alg}"
            " -s {in_uncomp}"
            " -f {in_format}"
            " {extra}"
            " {redund_sum}"
            " {redund_val}"
            " {mate_distr}"
            " {out_log}"
            " {log}"
        )


.. |nl| raw:: html

   <br>