.. _`bio/salmon/decoys`:

DECOYS
======


.. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/salmon/decoys?label=version%20update%20pull%20requests
   :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/salmon/decoys

Generate gentrome sequences and gather decoy sequences name

**URL**: https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/

Example
-------

This wrapper can be used in the following way:

.. code-block:: python

    rule test_salmon_decoy:
        input:
            transcriptome="transcriptome.fasta.gz",
            genome="genome.fasta.gz",
        output:
            gentrome="gentrome.fasta.gz",
            decoys="decoys.txt",
        threads: 2
        log:
            "decoys.log"
        wrapper:
            "v3.0.1/bio/salmon/decoys"

Note that input, output and log file paths can be chosen freely.

When running with

.. code-block:: bash

    snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.


Notes
-----

Provide transcriptome and genome under the same format (raw fasta, gzipped
or bgzipped). In case of compressed input, this wrapper requires 2 threads:
one for on-the-fly decompression and one for actual decoy sequences
acquisition.


Software dependencies
---------------------

* ``bzip2=1.0.8``
* ``gzip=1.13``

Input/Output
------------
**Input:**

* ``transcriptome``: Path to transcriptome sequences, fasta (gz/bz2) formatted.
* ``genome``: Path to genome sequences, fasta (gz/bz2) formatted.

**Output:**

* ``gentrome``: Path to gentrome, fasta (gz/bz2) formatted.
* ``decoys``: Path to text file contianing decoy sequence names.


Authors
-------

* Thibault Dayris


Code
----

.. code-block:: python

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-

    """Snakemake wrapper for gentrome and decoy sequences acquisition"""

    __author__ = "Thibault Dayris"
    __copyright__ = "Copyright 2022, Thibault Dayris"
    __email__ = "thibault.dayris@gustaveroussy.fr"
    __license__ = "MIT"


    from snakemake.shell import shell

    log = snakemake.log_fmt_shell(stdout=False, stderr=True, append=True)
    required_thread_nb = 1

    genome = snakemake.input["genome"]
    if genome.endswith(".gz"):
        genome = f"<( gzip --stdout --decompress {genome} )"
        required_thread_nb += 1  # Add a thread for gzip uncompression
    elif genome.endswith(".bz2"):
        genome = f"<( bzip2 --stdout --decompress {genome} )"
        required_thread_nb += 1  # Add a thread for bzip2 uncompression

    if snakemake.threads < required_thread_nb:
        raise ValueError(
            f"Salmon decoy wrapper requires exactly {required_thread_nb} threads, "
            f"but only {snakemake.threads} were provided"
        )

    sequences = [
        snakemake.input["transcriptome"],
        snakemake.input["genome"],
        snakemake.output["gentrome"],
    ]
    if all(fasta.endswith(".gz") for fasta in sequences):
        # Then all input sequences are gzipped. The output will also be gzipped.
        pass
    elif all(fasta.endswith(".bz2") for fasta in sequences):
        # Then all input sequences are bgzipped. The output will also be bgzipped.
        pass
    elif all(fasta.endswith((".fa", ".fna", ".fasta")) for fasta in sequences):
        # Then all input sequences are raw fasta. The output will also be raw fasta.
        pass
    else:
        raise ValueError(
            "Mixed compression status: Either all fasta sequences are compressed "
            "with the *same* compression algorithm, or none of them are compressed."
        )

    # Gathering decoy sequences names
    # Sed command works as follow:
    # -n       = do not print all lines
    # s/ .*//g = Remove anything after spaces. (remove comments)
    # s/>//p  = Remove '>' character at the begining of sequence names. Print names.
    shell("( sed -n 's/ .*//g;s/>//p' {genome} ) > {snakemake.output.decoys} {log}")

    # Building big gentrome file
    shell(
        "cat {snakemake.input.transcriptome} {snakemake.input.genome} "
        "> {snakemake.output.gentrome} {log}"
    )


.. |nl| raw:: html

   <br>