PRINSEQ++
C++ implementation of the prinseq-lite.pl program. It can be used to filter, reformat or trim genomic and metagenomic sequence data.
URL: https://github.com/Adrian-Cantu/PRINSEQ-plus-plus
Example
This wrapper can be used in the following way:
rule prinseq_plus_plus_fas2fq:
    input:
        "reads/{prefix}.fas",
    output:
        good="results/{prefix}.fq",
        bad="results/{prefix}.bad.fq",
    log:
        "logs/fas2fq/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"

rule prinseq_plus_plus_fas2fqgz:
    input:
        "reads/{prefix}.fas",
    output:
        good="results/{prefix}.fq.gz",
        bad="results/{prefix}.bad.fq.gz",
    log:
        "logs/fas2fqgz/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"

rule prinseq_plus_plus_fqgz2fas:
    input:
        "reads/{prefix}.fastq.gz",
    output:
        good="results/{prefix}.fasta",
        bad="results/{prefix}.bad.fasta",
    log:
        "logs/fqgz2fas/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"

rule prinseq_plus_plus_fq2fasgz:
    input:
        "reads/{prefix}.fastq",
    output:
        good="results/{prefix}.fas.gz",
        bad="results/{prefix}.bad.fas.gz",
    log:
        "logs/fq2fasgz/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"

rule prinseq_plus_plus_fqpe:
    input:
        "reads/{prefix}.1.fastq.gz",
        "reads/{prefix}.2.fastq.gz",
    output:
        good="results/{prefix}.R1.fq.gz",
        good2="results/{prefix}.R2.fq.gz",
        single="results/{prefix}.single.R1.fq.gz",
        single2="results/{prefix}.single.R2.fq.gz",
        bad="results/{prefix}.bad.R1.fq.gz",
        bad2="results/{prefix}.bad.R2.fq.gz",
    log:
        "logs/fqpe/{prefix}.log",
    params:
        extra="-min_len 2",
    threads: 2
    wrapper:
        "v5.5.2-17-g33d5b76/bio/prinseq-plus-plus"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
Multiple threads can be used during compression of the output files.
Both input and output files can be gzipped.
Output files are optional, but all declared outputs must share the same format.
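The same-format requirement can be checked before a rule is run. The following is a minimal sketch, not part of the wrapper: the helper names classify and check_outputs are hypothetical, the extension lists are copied from the wrapper code, and treating gzip compression as part of the format is an assumption based on the wrapper inspecting only its first output. It requires Python 3.9 or later for str.removesuffix.

```python
from pathlib import Path

# Extension lists copied from the wrapper code in this document.
EXT_FQ = {".fq", ".fastq"}
EXT_FAS = {".fas", ".fna", ".fasta"}


def classify(path):
    """Return (format, gzipped) for a path, e.g. ("fastq", True).

    Hypothetical helper, for illustration only.
    """
    gzipped = path.endswith(".gz")
    suffix = Path(path.removesuffix(".gz")).suffix
    if suffix in EXT_FQ:
        return "fastq", gzipped
    if suffix in EXT_FAS:
        return "fasta", gzipped
    raise ValueError(f"Unrecognised sequence file extension: {path}")


def check_outputs(paths):
    """Raise ValueError unless all paths share one format and compression.

    Returns the shared (format, gzipped) pair, or None for an empty list.
    """
    kinds = {classify(p) for p in paths}
    if len(kinds) > 1:
        raise ValueError(f"Mixed output formats: {sorted(kinds)}")
    return kinds.pop() if kinds else None


# Two of the paired-end example outputs: both are gzipped FASTQ.
print(check_outputs(["results/s.R1.fq.gz", "results/s.bad.R1.fq.gz"]))
```

A list that mixes, say, "a.fq" and "b.fasta" raises ValueError, which surfaces the problem before prinseq++ is started.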
Software dependencies
prinseq-plus-plus=1.2.4
htslib=1.21
Input/Output
Input:
fastx file(s)
Output:
r1
: fastx file
r2
: fastx file (if PE)
r1_single
: fastx file (if PE)
r2_single
: fastx file (if PE)
r1_bad
: fastx file
r2_bad
: fastx file (if PE)
Params
extra
: additional program options.
Code
"""Snakemake wrapper for Prinseq++"""
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2023, Filipe G. Vieira"
__license__ = "MIT"
import tempfile
from snakemake.shell import shell
from pathlib import Path
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=False)
extra = snakemake.params.get("extra", "")
ext_fq = [".fq", ".fastq"]
ext_fas = [".fas", ".fna", ".fasta"]

# Input files
input_pe = len(snakemake.input) == 2
file2 = f"-fastq2 {snakemake.input[1]}" if input_pe else ""

# Check input format
in_fmt = Path(snakemake.input[0].removesuffix(".gz")).suffix
if in_fmt in ext_fq:
    pass
elif in_fmt in ext_fas:
    extra += " -FASTA"
else:
    raise ValueError("Invalid input file format")

# Output files
out_fmt = Path(snakemake.output[0]).suffix
for key, value in snakemake.output.items():
    if out_fmt == ".gz":
        extra += f" -out_{key} >(bgzip --threads {snakemake.threads} > {value})"
    else:
        extra += f" -out_{key} {value}"

# Check output format
if out_fmt == ".gz":
    out_fmt = Path(snakemake.output[0].removesuffix(".gz")).suffix
if out_fmt in ext_fq:
    extra += " -out_format 0"
elif out_fmt in ext_fas:
    extra += " -out_format 1"
else:
    raise ValueError("Invalid output file format")

# Run Prinseq++
with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "prinseq++ -threads {snakemake.threads} -fastq {snakemake.input[0]} {file2} {extra} -out_name {tmpdir}/tmp {log}"
    )
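
To see which options the wrapper code would pass for a given rule, its logic can be replayed outside Snakemake. The following is a rough sketch under stated assumptions: build_args is a hypothetical helper that mirrors the wrapper code without importing snakemake, it uses only the prinseq++ and bgzip options that appear in that code, and it leaves out the -out_name temporary prefix and the log redirection. It requires Python 3.9 or later for str.removesuffix.

```python
from pathlib import Path

# Extension lists copied from the wrapper code.
EXT_FQ = [".fq", ".fastq"]
EXT_FAS = [".fas", ".fna", ".fasta"]


def build_args(inputs, outputs, threads=1, extra=""):
    """Return the prinseq++ arguments the wrapper logic would assemble.

    `inputs` is a list of one or two paths; `outputs` maps keys such as
    "good" or "bad" to paths and must contain at least one entry.
    Hypothetical helper, for illustration only.
    """
    args = ["prinseq++", "-threads", str(threads), "-fastq", inputs[0]]
    if len(inputs) == 2:
        args += ["-fastq2", inputs[1]]
    args += extra.split()

    # Input format: FASTQ needs no flag, FASTA adds -FASTA.
    in_fmt = Path(inputs[0].removesuffix(".gz")).suffix
    if in_fmt in EXT_FAS:
        args.append("-FASTA")
    elif in_fmt not in EXT_FQ:
        raise ValueError("Invalid input file format")

    # As in the wrapper, only the first output decides compression and format.
    first = next(iter(outputs.values()))
    gzipped = first.endswith(".gz")
    for key, value in outputs.items():
        if gzipped:
            # Bash process substitution: prinseq++ writes into a bgzip pipe.
            args += [f"-out_{key}", f">(bgzip --threads {threads} > {value})"]
        else:
            args += [f"-out_{key}", value]

    out_fmt = Path(first.removesuffix(".gz")).suffix
    if out_fmt in EXT_FQ:
        args += ["-out_format", "0"]
    elif out_fmt in EXT_FAS:
        args += ["-out_format", "1"]
    else:
        raise ValueError("Invalid output file format")
    return args


# Replay the first example rule (FASTA in, FASTQ out) for prefix "a".
print(" ".join(build_args(
    ["reads/a.fas"],
    {"good": "results/a.fq", "bad": "results/a.bad.fq"},
    threads=2,
    extra="-min_len 2",
)))
```

Because each output key is turned into an -out_ option, the names used under output in a rule (good, bad, single, and so on) have to match the corresponding prinseq++ option names.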