SWARM
A robust and fast clustering method for amplicon-based studies
URL: https://github.com/torognes/swarm
Example
This wrapper can be used in the following way:
    rule swarm:
        input:
            "{sample}.fas",
        output:
            structure="out/{sample}.struct.tsv",
            network="out/{sample}.network.tsv",
            output="out/{sample}.clusters",
            statistics="stats/{sample}.stats.tsv",
            uclust="out/{sample}.uclust.tsv",
            seeds="out/{sample}.seeds.fas",
            log="out/{sample}.log",
        log:
            "logs/{sample}.log",
        threads: 4
        params:
            extra="--fastidious --differences 1 --usearch-abundance",
        wrapper:
            "v9.6.0/bio/swarm"


    rule swarm_gz:
        input:
            "{sample}.fas.gz",
        output:
            structure="out/{sample}.gz.struct.tsv",
            network="out/{sample}.gz.network.tsv",
            output="out/{sample}.gz.clusters",
            statistics="stats/{sample}.gz.stats.tsv",
            uclust="out/{sample}.gz.uclust.tsv",
            seeds="out/{sample}.gz.seeds.fas",
            log="out/{sample}.gz.log",
        log:
            "logs/{sample}.gz.log",
        threads: 4
        params:
            extra="--fastidious --differences 1 --usearch-abundance",
        wrapper:
            "v9.6.0/bio/swarm"


    rule swarm_bz2:
        input:
            "{sample}.fas.bz2",
        output:
            structure="out/{sample}.bz2.struct.tsv",
            network="out/{sample}.bz2.network.tsv",
            output="out/{sample}.bz2.clusters",
            statistics="stats/{sample}.bz2.stats.tsv",
            uclust="out/{sample}.bz2.uclust.tsv",
            seeds="out/{sample}.bz2.seeds.fas",
            log="out/{sample}.bz2.log",
        log:
            "logs/{sample}.bz2.log",
        threads: 4
        params:
            extra="--fastidious --differences 1 --usearch-abundance",
        wrapper:
            "v9.6.0/bio/swarm"
Note that input, output and log file paths can be chosen freely.
When running with
    snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
Input files can be compressed with gzip or bzip2.
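The suffix-based handling of compressed input can be sketched as follows. This is a minimal illustration that mirrors the logic in the Code section; the function name `pick_reader` is hypothetical and is not part of the wrapper.

```python
def pick_reader(path: str) -> str:
    """Return the shell command that streams `path` to swarm's stdin.

    `pick_reader` is an illustrative helper, not a wrapper function.
    """
    if path.endswith(".gz"):
        return "gzip --decompress --stdout"
    if path.endswith(".bz2"):
        return "bzip2 --decompress --stdout"
    # Any other suffix is treated as plain, uncompressed FASTA.
    return "cat"


for name in ("reads.fas", "reads.fas.gz", "reads.fas.bz2"):
    print(f"{name}: {pick_reader(name)}")
```

Detection relies on the file name only, so a compressed file without a `.gz` or `.bz2` suffix would be passed through `cat` unchanged.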
Software dependencies
swarm=3.1.6
gzip=1.14
bzip2=1.0.8
Input/Output
Input:
input file in FASTA format
Output:
structure: pairs of nearly identical amplicons in TSV format with five columns.
network: raw amplicon network in TSV format with two columns.
output: list of clusters.
statistics: statistics in TSV format with one cluster per row and seven columns.
uclust: clustering results in TSV format with 10 columns and three different types of entries (S, H or C).
seeds: cluster representative sequences in FASTA format.
log: log messages, written to this file instead of stderr (error messages are still sent to stderr).
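Each named output is translated into one swarm command-line option. The sketch below reproduces that mapping as it appears in the Code section; the function name `swarm_output_flags` and the example paths are hypothetical and used only for illustration.

```python
def swarm_output_flags(outputs: dict) -> list:
    """Map named outputs to swarm options, following the wrapper's rules.

    `swarm_output_flags` is an illustrative helper, not a wrapper function.
    """
    flags = []
    for key, path in outputs.items():
        if key == "structure":
            flags.append(f"--internal-structure {path}")
        elif key in ("network", "output", "statistics", "uclust"):
            flags.append(f"--{key}-file {path}")
        elif key in ("seeds", "log"):
            flags.append(f"--{key} {path}")
        else:
            # Unrecognised output names are rejected instead of ignored.
            raise ValueError(f"Unknown named output '{key}' with file name '{path}'.")
    return flags


# Hypothetical paths; only the output names matter for the mapping.
example = {"structure": "out/A.struct.tsv", "output": "out/A.clusters", "seeds": "out/A.seeds.fas"}
print(" ".join(swarm_output_flags(example)))
```

Because the mapping iterates over whatever named outputs the rule declares, a rule that needs only some of these files can presumably declare just those names.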
Params
extra: additional program arguments
Code
    __author__ = "Filipe G. Vieira"
    __copyright__ = "Copyright 2026, Filipe G. Vieira"
    __license__ = "MIT"

    from snakemake.shell import shell

    extra = snakemake.params.get("extra", "")
    log = snakemake.log_fmt_shell(stdout=True, stderr=True)

    # Check input files
    in_cmd = "cat"
    if snakemake.input[0].endswith(".gz"):
        in_cmd = "gzip --decompress --stdout"
    if snakemake.input[0].endswith(".bz2"):
        in_cmd = "bzip2 --decompress --stdout"

    # Parse output files
    output = list()
    for key, value in snakemake.output.items():
        if key in ["structure"]:
            output.append(f"--internal-{key} {value}")
        elif key in ["network", "output", "statistics", "uclust"]:
            output.append(f"--{key}-file {value}")
        elif key in ["seeds", "log"]:
            output.append(f"--{key} {value}")
        else:
            raise ValueError(f"Unknown named output '{key}' with file name '{value}'.")

    shell(
        "{in_cmd} {snakemake.input[0]} | swarm --threads {snakemake.threads} {extra} {output} {log}"
    )
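To see roughly what command the wrapper ends up running, the pipeline can be assembled outside Snakemake. This is a sketch under stated assumptions: `preview_command` is a hypothetical helper, the paths are invented for the example, and the log redirection that `snakemake.log_fmt_shell` appends is left out.

```python
def preview_command(input_path: str, threads: int, extra: str, output_flags: list) -> str:
    """Build an approximation of the wrapper's shell pipeline as a string.

    Illustrative only: the real wrapper also appends a log redirection.
    """
    if input_path.endswith(".gz"):
        reader = "gzip --decompress --stdout"
    elif input_path.endswith(".bz2"):
        reader = "bzip2 --decompress --stdout"
    else:
        reader = "cat"
    parts = [reader, input_path, "|", "swarm", "--threads", str(threads), extra, *output_flags]
    # Drop empty pieces (for example an empty `extra`) to avoid double spaces.
    return " ".join(part for part in parts if part)


print(
    preview_command(
        "A.fas.gz",
        4,
        "--fastidious --differences 1",
        ["--output-file out/A.gz.clusters", "--seeds out/A.gz.seeds.fas"],
    )
)
```

The input is always streamed through a pipe, so swarm reads from stdin regardless of whether the file was compressed.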