LASTAL
LAST finds similar regions between sequences, and aligns them. It is designed for comparing large datasets to each other (e.g. vertebrate genomes and/or large numbers of DNA reads)
Example
This wrapper can be used in the following way:
rule lastal_nucl_x_nucl:
input:
data="test-transcript.fa",
lastdb="test-transcript.fa.prj"
output:
# only one of these outputs is allowed
maf="test-transcript.maf",
#tab="test-transcript.tab",
#blasttab="test-transcript.blasttab",
#blasttabplus="test-transcript.blasttabplus",
params:
#Report alignments that are expected by chance at most once per LENGTH query letters. By default, LAST reports alignments that are expected by chance at most once per million query letters (for a given database). http://last.cbrc.jp/doc/last-evalues.html
D_length=1000000,
extra=""
log:
"logs/lastal/test.log"
threads: 8
wrapper:
"v5.8.0/bio/last/lastal"
rule lastal_nucl_x_prot:
input:
data="test-transcript.fa",
lastdb="test-protein.fa.prj"
output:
# only one of these outputs is allowed
maf="test-tr-x-prot.maf"
#tab="test-tr-x-prot.tab",
#blasttab="test-tr-x-prot.blasttab",
#blasttabplus="test-tr-x-prot.blasttabplus",
params:
frameshift_cost=15, #Align DNA queries to protein reference sequences using specified frameshift cost. 15 is reasonable. Special case, -F0 means DNA-versus-protein alignment without frameshifts, which is faster.)
extra="",
log:
"logs/lastal/test.log"
threads: 8
wrapper:
"v5.8.0/bio/last/lastal"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies
last=1608
Code
""" Snakemake wrapper for lastal """
__author__ = "N. Tessa Pierce"
__copyright__ = "Copyright 2019, N. Tessa Pierce"
__email__ = "ntpierce@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
# http://last.cbrc.jp/doc/last-evalues.html
d_len = float(snakemake.params.get("D_length", 1000000)) # last default
# set output file formats
maf_out = snakemake.output.get("maf", "")
tab_out = snakemake.output.get("tab", "")
btab_out = snakemake.output.get("blasttab", "")
btabplus_out = snakemake.output.get("blasttabplus", "")
outfiles = [maf_out, tab_out, btab_out, btabplus_out]
# TAB, MAF, BlastTab, BlastTab+ (default=MAF)
assert (
list(map(bool, outfiles)).count(True) == 1
), "please specify ONE output file using one of: 'maf', 'tab', 'blasttab', or 'blasttabplus' keywords in the output field)"
out_cmd = ""
if maf_out:
out_cmd = "-f {}".format("MAF")
outF = maf_out
elif tab_out:
out_cmd = "-f {}".format("TAB")
outF = tab_out
if btab_out:
out_cmd = "-f {}".format("BlastTab")
outF = btab_out
if btabplus_out:
out_cmd = "-f {}".format("BlastTab+")
outF = btabplus_out
frameshift_cost = snakemake.params.get("frameshift_cost", "")
if frameshift_cost:
f_cmd = f"-F {frameshift_cost}"
lastdb_name = str(snakemake.input["lastdb"]).rsplit(".", 1)[0]
shell(
"lastal -D {d_len} -P {snakemake.threads} {extra} {lastdb_name} {snakemake.input.data} > {outF} {log}"
)