GALAH
More scalable dereplication for metagenome assembled genomes
URL: https://github.com/wwood/galah
Example
This wrapper can be used in the following way:
rule galah_fas:
    input:
        fas=["fas/a.fa", "fas/b.fas.bz2", "fas/c.fasta.gz"],
    output:
        clusters="results.fas.tsv",
        repres="results.fas.list",
    log:
        "logs/out.fas.log",
    params:
        extra="--precluster-ani 0.9 --ani 0.95",
    threads: 2
    resources:
        mem_mb=50,
    wrapper:
        "v5.3.0-16-g710597c/bio/galah"


use rule galah_fas as galah_fas_list with:
    input:
        fas_list="fas/a.fas_list",
    output:
        clusters="results.fas_list.tsv",
        repres="results.fas_list.list",
    log:
        "logs/out.fas_list.log",
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies
galah=0.4.2
Input/Output
Input:
FASTA files, given either directly (fas) or as a file listing their paths (fas_list)
Output:
clusters
: representative FASTA<TAB>member lines
repres
: paths to representative FAS files
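The clusters output can be read back into a mapping from each representative to its members. The following is a minimal sketch, assuming the file holds one representative<TAB>member pair per line as described above. The function name `read_clusters` and the sample rows are hypothetical and only stand in for a real results.fas.tsv.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group members by representative from a two-column TSV stream."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical content standing in for results.fas.tsv
example = io.StringIO(
    "fas/a.fa\tfas/a.fa\n"
    "fas/a.fa\tfas/b.fas.bz2\n"
    "fas/c.fasta.gz\tfas/c.fasta.gz\n"
)
clusters = read_clusters(example)
for representative, members in clusters.items():
    print(representative, len(members))
```

For a real run, the same function can be given an open file handle, for example `open("results.fas.tsv", newline="")`.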
Params
extra
: additional program arguments
Code
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2023, Filipe G. Vieira"
__license__ = "MIT"

from snakemake.shell import shell

# Redirect both stdout and stderr of the command to the rule's log file.
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

# Genomes given directly as a list of FASTA paths.
fas = snakemake.input.get("fas", "")
if fas:
    fas = f"--genome-fasta-files {' '.join(fas)}"

# Genomes given as a single file that lists the FASTA paths.
fas_list = snakemake.input.get("fas_list", "")
if fas_list:
    fas_list = f"--genome-fasta-list {fas_list}"

# Both outputs are optional; a flag is only passed if the output is declared.
clusters = snakemake.output.get("clusters", "")
if clusters:
    clusters = f"--output-cluster-definition {clusters}"

repres = snakemake.output.get("repres", "")
if repres:
    repres = f"--output-representative-list {repres}"

shell(
    "galah cluster"
    " --threads {snakemake.threads}"
    " {fas}"
    " {fas_list}"
    " {extra}"
    " {clusters}"
    " {repres}"
    " {log}"
)
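To see which command line the wrapper ends up running for a given rule, the assembly logic above can be mirrored in a plain function that does not need Snakemake. This is only a sketch: `build_galah_command` is a hypothetical helper, not part of the wrapper. It omits the log redirection that `log_fmt_shell` appends and skips empty parts instead of leaving extra spaces.

```python
def build_galah_command(threads, fas=(), fas_list="", clusters="", repres="", extra=""):
    """Mirror the wrapper's argument assembly (hypothetical helper)."""
    parts = ["galah cluster", f"--threads {threads}"]
    if fas:
        # fas is expected to be a list of paths, joined with spaces
        parts.append(f"--genome-fasta-files {' '.join(fas)}")
    if fas_list:
        parts.append(f"--genome-fasta-list {fas_list}")
    if extra:
        parts.append(extra)
    if clusters:
        parts.append(f"--output-cluster-definition {clusters}")
    if repres:
        parts.append(f"--output-representative-list {repres}")
    return " ".join(parts)


# Values taken from the galah_fas example rule
cmd = build_galah_command(
    threads=2,
    fas=["fas/a.fa", "fas/b.fas.bz2", "fas/c.fasta.gz"],
    clusters="results.fas.tsv",
    repres="results.fas.list",
    extra="--precluster-ani 0.9 --ani 0.95",
)
print(cmd)
```

Because the paths in `fas` are joined with spaces, `fas` should be a list of paths, as in the example rule; a bare string would be split into single characters.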