SWARM

A robust and fast clustering method for amplicon-based studies

URL: https://github.com/torognes/swarm

Example

This wrapper can be used in the following way:

rule swarm:
    input:
        "{sample}.fas",
    output:
        structure="out/{sample}.struct.tsv",
        network="out/{sample}.network.tsv",
        output="out/{sample}.clusters",
        statistics="stats/{sample}.stats.tsv",
        uclust="out/{sample}.uclust.tsv",
        seeds="out/{sample}.seeds.fas",
        log="out/{sample}.log",
    log:
        "logs/{sample}.log",
    threads: 4
    params:
        extra="--fastidious --differences 1 --usearch-abundance",
    wrapper:
        "v9.6.0/bio/swarm"


rule swarm_gz:
    input:
        "{sample}.fas.gz",
    output:
        structure="out/{sample}.gz.struct.tsv",
        network="out/{sample}.gz.network.tsv",
        output="out/{sample}.gz.clusters",
        statistics="stats/{sample}.gz.stats.tsv",
        uclust="out/{sample}.gz.uclust.tsv",
        seeds="out/{sample}.gz.seeds.fas",
        log="out/{sample}.gz.log",
    log:
        "logs/{sample}.gz.log",
    threads: 4
    params:
        extra="--fastidious --differences 1 --usearch-abundance",
    wrapper:
        "v9.6.0/bio/swarm"


rule swarm_bz2:
    input:
        "{sample}.fas.bz2",
    output:
        structure="out/{sample}.bz2.struct.tsv",
        network="out/{sample}.bz2.network.tsv",
        output="out/{sample}.bz2.clusters",
        statistics="stats/{sample}.bz2.stats.tsv",
        uclust="out/{sample}.bz2.uclust.tsv",
        seeds="out/{sample}.bz2.seeds.fas",
        log="out/{sample}.bz2.log",
    log:
        "logs/{sample}.bz2.log",
    threads: 4
    params:
        extra="--fastidious --differences 1 --usearch-abundance",
    wrapper:
        "v9.6.0/bio/swarm"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

Input files can be compressed with gzip or bzip2.
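
To check an input file before clustering, the same extension-based dispatch can be reproduced in plain Python. The following is a minimal sketch using only the standard library; the helper names open_fasta and count_records are hypothetical and not part of the wrapper, which streams the file through the gzip or bzip2 command-line tools instead.

```python
import bz2
import gzip


def open_fasta(path):
    """Open a plain, gzip- or bzip2-compressed FASTA file in text mode.

    Hypothetical helper: the compression type is guessed from the file
    extension only, as the wrapper does.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_records(path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling count_records on a .fas, .fas.gz or .fas.bz2 file with the same content should return the same number in all three cases.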

Software dependencies

swarm=3.1.6
gzip=1.14
bzip2=1.0.8

Input/Output

Input:

input file in FASTA format, optionally compressed with gzip or bzip2

Output:

structure: pairs of nearly-identical amplicons in TSV format with five columns.
network: raw amplicon network in TSV format with two columns.
output: list of clusters.
statistics: statistics in TSV format with one cluster per row and seven columns.
uclust: clustering results in TSV format with 10 columns and three types of entries (S, H or C).
seeds: cluster representative sequences in FASTA format.
log: file that receives swarm's messages instead of stderr (error messages are still written to stderr).
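
As an illustration of how the "output" file might be consumed downstream, here is a minimal sketch that summarises clusters. It assumes one cluster per line with space-separated amplicon labels, the seed listed first, and labels annotated as `<id>;size=<abundance>` (the style used with --usearch-abundance). The function name parse_clusters is hypothetical.

```python
def parse_clusters(lines):
    """Summarise clusters from an iterable of lines.

    Assumptions (not checked against a real swarm run here): one cluster
    per line, labels separated by whitespace, seed first, and each label
    carrying a ';size=<int>' abundance annotation.
    """
    summaries = []
    for line in lines:
        labels = line.split()
        if not labels:
            continue  # skip blank lines
        total = 0
        for label in labels:
            # Take the text after the last ';size=', ignoring a trailing ';'.
            # Labels without ';size=' make int() raise ValueError.
            _, _, size = label.rstrip(";").rpartition(";size=")
            total += int(size)
        summaries.append(
            {"seed": labels[0], "amplicons": len(labels), "abundance": total}
        )
    return summaries


if __name__ == "__main__":
    for row in parse_clusters(["a;size=10 b;size=2", "c;size=1"]):
        print(row)
```

For a file on disk, an open file handle can be passed directly as the lines argument.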

Params

extra: additional arguments passed to swarm (e.g. --fastidious --differences 1)

Authors

Filipe G. Vieira

Code

__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2026, Filipe G. Vieira"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Pick the command that streams the (possibly compressed) input to stdout
in_cmd = "cat"
if snakemake.input[0].endswith(".gz"):
    in_cmd = "gzip --decompress --stdout"
elif snakemake.input[0].endswith(".bz2"):
    in_cmd = "bzip2 --decompress --stdout"

# Map each named output to the corresponding swarm option
output = []
for key, value in snakemake.output.items():
    if key in ["structure"]:
        output.append(f"--internal-{key} {value}")
    elif key in ["network", "output", "statistics", "uclust"]:
        output.append(f"--{key}-file {value}")
    elif key in ["seeds", "log"]:
        output.append(f"--{key} {value}")
    else:
        raise ValueError(f"Unknown named output '{key}' with file name '{value}'.")

shell(
    "{in_cmd} {snakemake.input[0]} | swarm --threads {snakemake.threads} {extra} {output} {log}"
)
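
To preview the command line this wrapper assembles without running Snakemake, the same logic can be sketched as a standalone function. The names FLAGS, decompress_command and build_command are hypothetical; the option names are the ones used in the wrapper code, and the log redirection added by log_fmt_shell is left out.

```python
# Option names as used by the wrapper code; this table is only a
# restatement of its if/elif chain.
FLAGS = {
    "structure": "--internal-structure",
    "network": "--network-file",
    "output": "--output-file",
    "statistics": "--statistics-file",
    "uclust": "--uclust-file",
    "seeds": "--seeds",
    "log": "--log",
}


def decompress_command(path):
    """Return the command that streams the input file to stdout."""
    if path.endswith(".gz"):
        return "gzip --decompress --stdout"
    if path.endswith(".bz2"):
        return "bzip2 --decompress --stdout"
    return "cat"


def build_command(input_path, outputs, threads=1, extra=""):
    """Build the shell pipeline as a string (no log redirection)."""
    args = []
    for key, value in outputs.items():
        if key not in FLAGS:
            raise ValueError(
                f"Unknown named output '{key}' with file name '{value}'."
            )
        args.append(f"{FLAGS[key]} {value}")
    head = f"{decompress_command(input_path)} {input_path} | swarm --threads {threads}"
    # Drop empty parts so an empty 'extra' does not leave a double space.
    return " ".join(part for part in [head, extra, *args] if part)


if __name__ == "__main__":
    print(
        build_command(
            "s.fas.gz",
            {"seeds": "out/s.seeds.fas"},
            threads=4,
            extra="--differences 1",
        )
    )
```

Keeping the key-to-option mapping in a single dictionary is a design choice of this sketch; it makes the set of supported output names easy to see at a glance, while an unknown key still raises the same ValueError as in the wrapper.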