PRINSEQ++

A C++ implementation of the prinseq-lite.pl program. It can be used to filter, reformat, or trim genomic and metagenomic sequence data.

URL: https://github.com/Adrian-Cantu/PRINSEQ-plus-plus

Example

This wrapper can be used in the following way:

rule prinseq_plus_plus_fas2fq:
    input:
        "reads/{prefix}.fas",
    output:
        good="results/{prefix}.fq",
        bad="results/{prefix}.bad.fq",
    log:
        "logs/fas2fq/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"


rule prinseq_plus_plus_fas2fqgz:
    input:
        "reads/{prefix}.fas",
    output:
        good="results/{prefix}.fq.gz",
        bad="results/{prefix}.bad.fq.gz",
    log:
        "logs/fas2fqgz/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"


rule prinseq_plus_plus_fqgz2fas:
    input:
        "reads/{prefix}.fastq.gz",
    output:
        good="results/{prefix}.fasta",
        bad="results/{prefix}.bad.fasta",
    log:
        "logs/fqgz2fas/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"


rule prinseq_plus_plus_fq2fasgz:
    input:
        "reads/{prefix}.fastq",
    output:
        good="results/{prefix}.fas.gz",
        bad="results/{prefix}.bad.fas.gz",
    log:
        "logs/fq2fasgz/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"


rule prinseq_plus_plus_fqpe:
    input:
        "reads/{prefix}.1.fastq.gz",
        "reads/{prefix}.2.fastq.gz",
    output:
        good="results/{prefix}.R1.fq.gz",
        good2="results/{prefix}.R2.fq.gz",
        single="results/{prefix}.single.R1.fq.gz",
        single2="results/{prefix}.single.R2.fq.gz",
        bad="results/{prefix}.bad.R1.fq.gz",
        bad2="results/{prefix}.bad.R2.fq.gz",
    log:
        "logs/fqpe/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

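  • The number of threads set in the rule is passed to prinseq++ and, when the output is gzipped, to bgzip for compression.

  • Both input and output files can be gzipped. The format is inferred from the file extension: .fq or .fastq for FASTQ and .fas, .fna or .fasta for FASTA, each optionally followed by .gz.

  • Each output file is optional, but all output files must share the same format and compression, because both are inferred from the first output file.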
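The extension-based format detection described in these notes can be sketched as follows. This is a minimal illustration that mirrors the logic in the wrapper code further down, not part of the wrapper itself; the function name sequence_format and the constants FASTQ_EXTS and FASTA_EXTS are hypothetical.

```python
from pathlib import Path

# Extensions recognized by the wrapper (taken from its ext_fq / ext_fas lists).
FASTQ_EXTS = {".fq", ".fastq"}
FASTA_EXTS = {".fas", ".fna", ".fasta"}


def sequence_format(path: str) -> tuple[str, bool]:
    """Return (format, is_gzipped) inferred from the file name alone.

    Hypothetical helper for illustration; requires Python 3.9+.
    """
    gzipped = path.endswith(".gz")
    # Drop a trailing ".gz" so the actual sequence extension is inspected.
    suffix = Path(path.removesuffix(".gz")).suffix
    if suffix in FASTQ_EXTS:
        return "fastq", gzipped
    if suffix in FASTA_EXTS:
        return "fasta", gzipped
    raise ValueError(f"Unrecognized sequence file extension: {path}")


print(sequence_format("reads/sample.1.fastq.gz"))
print(sequence_format("results/sample.bad.fas"))
```

A file name such as reads/sample.txt would raise ValueError here, which corresponds to the "Invalid input file format" and "Invalid output file format" errors raised by the wrapper.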

Software dependencies

  • prinseq-plus-plus=1.2.4

  • htslib=1.21

Input/Output

Input:

  • fastx file(s)

Output:

  • good: fastx file (reads passing the filters)

  • good2: fastx file (if PE)

  • single: fastx file (if PE)

  • single2: fastx file (if PE)

  • bad: fastx file (reads failing the filters)

  • bad2: fastx file (if PE)
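The named outputs used in the example rules (good, good2, single, single2, bad, bad2) are each turned into a prinseq++ option of the form -out_<name>. The sketch below illustrates that mapping under the same assumption the wrapper code makes, namely that compression is decided from the first output file. The function name output_options is hypothetical, and the process substitution syntax assumes the command is run by bash.

```python
def output_options(outputs: dict[str, str], threads: int = 1) -> list[str]:
    """Build one "-out_<key> <target>" option per named output file.

    Hypothetical helper for illustration; it mirrors the loop in the wrapper.
    """
    # Compression is decided once, from the first output file only.
    first = next(iter(outputs.values()), "")
    compress = first.endswith(".gz")

    options = []
    for key, path in outputs.items():
        if compress:
            # Stream the output through bgzip via bash process substitution.
            target = f">(bgzip --threads {threads} > {path})"
        else:
            target = path
        options.append(f"-out_{key} {target}")
    return options


for option in output_options(
    {"good": "results/a.fq.gz", "bad": "results/a.bad.fq.gz"}, threads=2
):
    print(option)
```

Because the option name is built directly from the output key, a key that prinseq++ does not recognize would produce an invalid command line.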

Params

  • extra: additional program options.

Authors

  • Filipe G. Vieira

Code

"""Snakemake wrapper for Prinseq++"""

__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2023, Filipe G. Vieira"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell
from pathlib import Path


log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=False)
extra = snakemake.params.get("extra", "")


ext_fq = [".fq", ".fastq"]
ext_fas = [".fas", ".fna", ".fasta"]


# Input files
input_pe = len(snakemake.input) == 2
file2 = f"-fastq2 {snakemake.input[1]}" if input_pe else ""

# Check input format
in_fmt = Path(snakemake.input[0].removesuffix(".gz")).suffix
if in_fmt in ext_fq:
    pass
elif in_fmt in ext_fas:
    extra += " -FASTA"
else:
    raise ValueError("Invalid input file format")


# Output files
out_fmt = Path(snakemake.output[0]).suffix
for key, value in snakemake.output.items():
    if out_fmt == ".gz":
        extra += f" -out_{key} >(bgzip --threads {snakemake.threads} > {value})"
    else:
        extra += f" -out_{key} {value}"

# Check output format
if out_fmt == ".gz":
    out_fmt = Path(snakemake.output[0].removesuffix(".gz")).suffix
if out_fmt in ext_fq:
    extra += " -out_format 0"
elif out_fmt in ext_fas:
    extra += " -out_format 1"
else:
    raise ValueError("Invalid output file format")


# Run Prinseq++
with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "prinseq++ -threads {snakemake.threads} -fastq {snakemake.input[0]} {file2} {extra} -out_name {tmpdir}/tmp {log}"
    )