VEP ANNOTATE
Annotate variant calls with Ensemble’s Variant Effect Predictor (VEP).
URL: https://www.ensembl.org/info/docs/tools/vep/index.html
Example
This wrapper can be used in the following way:
rule annotate_variants:
input:
calls="variants.bcf", # .vcf, .vcf.gz or .bcf
cache="resources/vep/cache", # can be omitted if fasta and gff are specified
plugins="resources/vep/plugins",
# optionally add reference genome fasta
# fasta="genome.fasta",
# fai="genome.fasta.fai", # fasta index
# gff="annotation.gff", # note: GFF files must be sorted, gzipped and indexed
# csi="annotation.gff.csi", # tabix index
# add mandatory aux-files required by some plugins if not present in the VEP plugin directory specified above.
# aux files must be defined as following: "<plugin> = /path/to/file" where plugin must be in lowercase
# revel = path/to/revel_scores.tsv.gz
output:
calls="variants.annotated.bcf", # .vcf, .vcf.gz or .bcf
stats="variants.html",
params:
# Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
# Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
plugins=["LoFtool"],
extra="--everything", # optional: extra arguments
log:
"logs/vep/annotate.log",
threads: 4
wrapper:
"v8.0.0/bio/vep/annotate"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
In order ro predict variant effects, VEP makes use of variant data bases supplied with the cache argument.
Alternatively, genome annotation can be supplied as GFF and Fasta files.
In that case, VEP does not accept plain GFF files as e.g. obtained from NCBI. GFF files need to be sorted, gzipped, and indexed as documented on the [VEP homepage](https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gff). It is recommended to add a rule preparing the GFF similar to this: ``` rule vep_prepare:
- input:
gff=”genome.gff”,
- output:
gff=”genome.gff.gz”, index=”genome.gff.gz.tbi”,
- conda:
“../envs/vep.yml” # env with htslib
- log:
“tabix.log”,
threads: 1 shell:
“grep -v ‘#’ {input.gff} | sort -k1,1 -k4,4n -k5,5n -t$’t’ | bgzip -c > {output.gff};” “tabix -p gff {output.gff}”
Software dependencies
ensembl-vep=115.2bcftools=1.21
Input/Output
Input:
calls: Variant calls in .vcf, .vcf.gz or .bcf format.cache: Path to cache dir. Can be output from the wrapper VEP / cache.plugins: Path to plugins dir. Can be output from the wrapper VEP / plugins.gff: Genome features in GFF format (optional). See notes for details on how to prepare.csi: Tabix index file for the GFF (optional). See notes for details on how to prepare.fasta: Genome sequence in fasta format (optional).fai: Fasta file index (optional).<plugin>: Aux-files required by some plugins (optional).
Output:
calls: Annotated variant calls in .vcf, .vcf.gz or .bcf format.stats: HTML file with stats that can be interpreted by e.g. MultiQC.
Params
plugins: Optional list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.htmlextra: additional program arguments.
Code
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"
import os
from pathlib import Path
from snakemake.shell import shell
def get_only_child_dir(path):
children = [child for child in path.iterdir() if child.is_dir()]
assert (
len(children) == 1
), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper"
return children[0]
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else ""
stats = snakemake.output.stats
cache = snakemake.input.get("cache", "")
plugins = snakemake.input.plugins
plugin_aux_files = {"LoFtool": "LoFtool_scores.txt", "ExACpLI": "ExACpLI_values.txt"}
load_plugins = []
for plugin in snakemake.params.plugins:
if plugin in plugin_aux_files.keys():
aux_path = os.path.join(plugins, plugin_aux_files[plugin])
load_plugins.append(",".join([plugin, aux_path]))
else:
load_plugins.append(",".join([plugin, snakemake.input.get(plugin.lower(), "")]))
load_plugins = " ".join(map("--plugin {}".format, load_plugins))
if snakemake.output.calls.endswith(".vcf.gz"):
fmt = "z"
elif snakemake.output.calls.endswith(".bcf"):
fmt = "b"
else:
fmt = "v"
fasta = snakemake.input.get("fasta", "")
if fasta:
fasta = "--fasta {}".format(fasta)
gff = snakemake.input.get("gff", "")
if gff:
gff = "--gff {}".format(gff)
if cache:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = (
entrypath.parent.name[:-7]
if entrypath.parent.name.endswith("_refseq")
else entrypath.parent.name
)
release, build = entrypath.name.split("_")
cache = (
"--offline --cache --dir_cache {cache} --cache_version {release} --species {species} --assembly {build}"
).format(cache=cache, release=release, build=build, species=species)
shell(
"(bcftools view '{snakemake.input.calls}' | "
"vep {extra} {fork} "
"--format vcf "
"--vcf "
"{cache} "
"{gff} "
"{fasta} "
"--dir_plugins {plugins} "
"{load_plugins} "
"--output_file STDOUT "
"--stats_file {stats} | "
"bcftools view -O{fmt} > {snakemake.output.calls}) {log}"
)