.. _`bio/salmon/decoys`: DECOYS ====== .. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/salmon/decoys?label=version%20update%20pull%20requests :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/salmon/decoys Generate gentrome sequences and gather decoy sequences name **URL**: https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ Example ------- This wrapper can be used in the following way: .. code-block:: python rule test_salmon_decoy: input: transcriptome="transcriptome.fasta.gz", genome="genome.fasta.gz", output: gentrome="gentrome.fasta.gz", decoys="decoys.txt", threads: 2 log: "decoys.log" wrapper: "v3.0.1/bio/salmon/decoys" Note that input, output and log file paths can be chosen freely. When running with .. code-block:: bash snakemake --use-conda the software dependencies will be automatically deployed into an isolated environment before execution. Notes ----- Provide transcriptome and genome under the same format (raw fasta, gzipped or bgzipped). In case of compressed input, this wrapper requires 2 threads: one for on-the-fly decompression and one for actual decoy sequences acquisition. Software dependencies --------------------- * ``bzip2=1.0.8`` * ``gzip=1.13`` Input/Output ------------ **Input:** * ``transcriptome``: Path to transcriptome sequences, fasta (gz/bz2) formatted. * ``genome``: Path to genome sequences, fasta (gz/bz2) formatted. **Output:** * ``gentrome``: Path to gentrome, fasta (gz/bz2) formatted. * ``decoys``: Path to text file contianing decoy sequence names. Authors ------- * Thibault Dayris Code ---- .. code-block:: python #!/usr/bin/env python3 # -*- coding: utf-8 -*- """Snakemake wrapper for gentrome and decoy sequences acquisition""" __author__ = "Thibault Dayris" __copyright__ = "Copyright 2022, Thibault Dayris" __email__ = "thibault.dayris@gustaveroussy.fr" __license__ = "MIT" from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=False, stderr=True, append=True) required_thread_nb = 1 genome = snakemake.input["genome"] if genome.endswith(".gz"): genome = f"<( gzip --stdout --decompress {genome} )" required_thread_nb += 1 # Add a thread for gzip uncompression elif genome.endswith(".bz2"): genome = f"<( bzip2 --stdout --decompress {genome} )" required_thread_nb += 1 # Add a thread for bzip2 uncompression if snakemake.threads < required_thread_nb: raise ValueError( f"Salmon decoy wrapper requires exactly {required_thread_nb} threads, " f"but only {snakemake.threads} were provided" ) sequences = [ snakemake.input["transcriptome"], snakemake.input["genome"], snakemake.output["gentrome"], ] if all(fasta.endswith(".gz") for fasta in sequences): # Then all input sequences are gzipped. The output will also be gzipped. pass elif all(fasta.endswith(".bz2") for fasta in sequences): # Then all input sequences are bgzipped. The output will also be bgzipped. pass elif all(fasta.endswith((".fa", ".fna", ".fasta")) for fasta in sequences): # Then all input sequences are raw fasta. The output will also be raw fasta. pass else: raise ValueError( "Mixed compression status: Either all fasta sequences are compressed " "with the *same* compression algorithm, or none of them are compressed." ) # Gathering decoy sequences names # Sed command works as follow: # -n = do not print all lines # s/ .*//g = Remove anything after spaces. (remove comments) # s/>//p = Remove '>' character at the begining of sequence names. Print names. shell("( sed -n 's/ .*//g;s/>//p' {genome} ) > {snakemake.output.decoys} {log}") # Building big gentrome file shell( "cat {snakemake.input.transcriptome} {snakemake.input.genome} " "> {snakemake.output.gentrome} {log}" ) .. |nl| raw:: html