SORTMERNA

SortMeRNA is a local sequence alignment tool for filtering, mapping and OTU clustering.

URL: https://sortmerna.readthedocs.io/

Example

This wrapper can be used in the following way:

rule sortmerna_pe:
    input:
        ref=["database1.fa", "database2.fa"],
        reads=["reads_1.fq.gz", "reads_2.fq.gz"],
    output:
        aligned=["aligned_1.fastq.gz", "aligned_2.fastq.gz"],
        other=["unpaired_1.fastq.gz", "unpaired_2.fastq.gz"],
        stats="sortmerna_pe_stats.log",
    log:
        "logs/sortmerna/reads_pe.log",
    threads: 16
    resources:
        mem_mb=3072,  # amount of memory for building the index
    params:
        # --out2 is injected automatically for list (split) outputs
        extra="--paired_in",
    wrapper:
        "v9.9.0/bio/sortmerna"


rule sortmerna_pe_interleaved:
    input:
        ref=["database1.fa", "database2.fa"],
        reads=["reads_1.fq.gz", "reads_2.fq.gz"],
    output:
        # single-file outputs with paired input -> interleaved (no --out2)
        aligned="aligned_interleaved.fastq.gz",
        other="unpaired_interleaved.fastq.gz",
        stats="sortmerna_pe_interleaved_stats.log",
    log:
        "logs/sortmerna/reads_pe_interleaved.log",
    threads: 16
    resources:
        mem_mb=3072,  # amount of memory for building the index
    params:
        extra="--paired_in",
    wrapper:
        "v9.9.0/bio/sortmerna"


rule sortmerna_se:
    input:
        ref=["database1.fa", "database2.fa"],
        reads="reads.fq",
        # optional: reuse a pre-built index dir instead of rebuilding it (adds --idx-dir)
        idx_dir="idx",
    output:
        aligned="aligned.fastq",
        other="unpaired.fastq",
        stats="sortmerna_se_stats.log",
    log:
        "logs/sortmerna/reads_se.log",
    threads: 16
    resources:
        mem_mb=3072,  # amount of memory for building the index
    params:
        extra="",
    wrapper:
        "v9.9.0/bio/sortmerna"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

–fastx is added automatically whenever aligned or other is requested.
–out2 is added automatically when an output is declared as a list of two files (split _fwd/_rev output); declaring single output files with paired input instead produces interleaved single files. aligned and other, when both requested, must use the same shape.
The kvdb (key value database) is created in a temporary directory, so it does not need pruning before each run.
idx_dir is an optional input pointing at a reference index directory. When supplied, the wrapper adds –idx-dir <idx_dir>; sortmerna reuses an index already present there (matching ref) and otherwise builds it into that directory, so the index persists and is reused across runs instead of being rebuilt in the (discarded) workdir each time. Because idx_dir is a Snakemake input, the directory path must exist before the job runs (provide it as an upstream rule’s directory() output, or pre-create it) — an empty directory is fine, sortmerna will populate it. When idx_dir is omitted, the index is built in the temporary workdir and discarded.
Caution: if several jobs run in parallel against the same empty idx_dir, they will each try to build the index into it concurrently, which can corrupt it. To reuse a shared index safely, build it once in a separate index-only step (sortmerna -task 5) and let the aligning jobs depend on that step’s directory() output, so they only ever read a complete index.

Software dependencies

sortmerna=6.0.2

Input/Output

Input:

ref: Reference FASTA file(s) (one or more)
reads: Query FASTA/Q file (single file, or a list of two files for paired-end)
idx_dir: Pre-built reference index directory (optional); reused across runs instead of rebuilt each time

Output:

aligned: Aligned reads (single file, or a list of two files for split paired output)
other: Unaligned reads (single file, or a list of two files for split paired output)
stats: Alignment statistics log (single file, optional)

Params

extra: additional program arguments

Authors

Curro Campuzano Jiménez

Code

__author__ = "Curro Campuzano Jiménez"
__copyright__ = "Copyright 2023, Curro Campuzano Jiménez"
__email__ = "campuzanocurro@gmail.com"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
mem_mb = snakemake.resources.get("mem_mb", 3072)  # Default value


ref = snakemake.input.ref
if isinstance(ref, list):
    ref = " --ref ".join(ref)


reads = snakemake.input.reads
aligned = snakemake.output.get("aligned")
if aligned:
    if isinstance(aligned, list):
        assert (
            len(aligned) == 2
        ), "if paired input, aligned must be a list of two files, if any"
        assert isinstance(
            reads, list
        ), "if paired input, reads must be a list of two files"


other = snakemake.output.get("other")
if other:
    if isinstance(other, list):
        assert (
            len(other) == 2
        ), "if paired input, other must be a list of two files, if any"
        assert isinstance(
            reads, list
        ), "if paired input, reads must be a list of two files"


if isinstance(reads, list):
    assert len(reads) == 2, "if paired input, reads must be a list of two files"
    reads = " --reads ".join(reads)


# A list output means split paired output (_fwd/_rev files), which requires
# --out2; a single output file with paired input produces one interleaved file.
split_output = isinstance(aligned, list) or isinstance(other, list)


# fastx output is required whenever aligned or other reads are requested
if aligned or other:
    extra = f"--fastx {extra}"


if split_output:
    extra = f"--out2 {extra}"


# A pre-built index can be supplied as the optional `idx_dir` input so it is
# reused across runs instead of rebuilt in the (ephemeral) workdir each time.
# The per-run kvdb still lives in the temporary workdir, so it stays fresh.
idx_dir = snakemake.input.get("idx_dir")
if idx_dir:
    extra = f"--idx-dir {idx_dir} {extra}"


with tempfile.TemporaryDirectory() as temp_workdir:
    shell(
        " sortmerna --ref {ref}"
        " --reads {reads}"
        " --workdir {temp_workdir}"
        " --threads {snakemake.threads}"
        " -m {mem_mb}"
        " --aligned {temp_workdir}/aligned_reads"
        " --other {temp_workdir}/other_reads"
        " {extra}"
        " {log}"
    )

    if split_output:
        if aligned:
            # Handle the case were no alignment
            shell("mv {temp_workdir}/aligned_reads_fwd.* {aligned[0]}")
            shell("mv {temp_workdir}/aligned_reads_rev.* {aligned[1]}")
        if other:
            shell("mv {temp_workdir}/other_reads_fwd.* {other[0]}")
            shell("mv {temp_workdir}/other_reads_rev.* {other[1]}")
    else:
        if aligned:
            shell("mv {temp_workdir}/aligned_reads.f* {aligned}")
        if other:
            shell("mv {temp_workdir}/other_reads.f* {other}")

    stats = snakemake.output.get("stats")
    if stats:
        shell("mv {temp_workdir}/aligned_reads.log {stats}")