ENSEMBL-ANNOTATION

Download annotations of genomic sites (e.g. transcripts) from the Ensembl FTP servers and store them in a single .gtf or .gff3 file, optionally gzip-compressed.

Example

This wrapper can be used in the following way:

rule get_annotation:
    output:
        "refs/annotation.gtf",
    params:
        species="homo_sapiens",
        release="105",
        build="GRCh37",
        flavor="",  # optional, e.g. chr_patch_hapl_scaff, see Ensembl FTP.
        # branch="plants",  # optional: specify branch
    log:
        "logs/get_annotation.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
    wrapper:
        "v7.6.1/bio/reference/ensembl-annotation"


rule get_annotation_gz:
    output:
        "refs/annotation.gtf.gz",
    params:
        species="homo_sapiens",
        release="105",
        build="GRCh37",
        flavor="",  # optional, e.g. chr_patch_hapl_scaff, see Ensembl FTP.
        # branch="plants",  # optional: specify branch
        url="http://ftp.ensembl.org/pub",  # optional: alternative base URL
    log:
        "logs/get_annotation_gz.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
    wrapper:
        "v7.6.1/bio/reference/ensembl-annotation"

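Note that output and log file paths can be chosen freely, as long as the output file name ends in .gtf or .gff3, optionally followed by .gz. The wrapper derives the annotation format and the compression from these suffixes.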
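
How the output suffixes select format and compression can be sketched in isolation. The helper below is illustrative only: parse_output is a hypothetical name, not part of the wrapper, and it merely mirrors the suffix handling shown in the Code section.

```python
from pathlib import Path


def parse_output(path):
    """Return (format, is_gzipped) for an output path.

    Hypothetical helper mirroring the wrapper's suffix check.
    """
    suffixes = Path(path).suffixes  # e.g. ['.gtf', '.gz']
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        # Drop the compression suffix so the last one names the format.
        suffixes = suffixes[:-1]
    fmt = suffixes[-1].lstrip(".") if suffixes else ""
    if fmt not in ("gtf", "gff3"):
        raise ValueError(f"unsupported output format for {path!r}")
    return fmt, gzipped


print(parse_output("refs/annotation.gtf"))
print(parse_output("refs/annotation.gff3.gz"))
```

An output such as refs/annotation.bed would raise ValueError in this sketch, just as the wrapper rejects anything other than gtf or gff3.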

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Software dependencies

  • curl

Params

  • url: base URL from which to download the annotation (optional; defaults to https://ftp.ensembl.org/pub)
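
To see how the base URL combines with the other params, the following sketch rebuilds the download URL the same way the wrapper source does. The function name annotation_url and its keyword arguments are assumptions made for illustration; only the URL layout is taken from the wrapper code.

```python
def annotation_url(species, build, release, fmt, flavor="", branch="",
                   base="https://ftp.ensembl.org/pub"):
    """Compose the Ensembl download URL (hypothetical helper, not wrapper API)."""
    species = species.lower()
    release = int(release)
    file_release = release

    branch_part = ""
    if build == "GRCh37":
        if release >= 81:
            # Newer releases serve GRCh37 from a dedicated branch.
            branch_part = "grch37/"
        if release > 87:
            # File names for GRCh37 stay at release 87.
            file_release = 87
    elif branch:
        branch_part = branch + "/"

    flavor_part = f"{flavor}." if flavor else ""
    return (
        f"{base}/{branch_part}release-{release}/{fmt}/{species}/"
        f"{species.capitalize()}.{build}.{file_release}.{flavor_part}{fmt}.gz"
    )


print(annotation_url("homo_sapiens", "GRCh37", "105", "gtf"))
```

With the params from the example rules, this yields a path under grch37/release-105/ whose file name still carries release 87. As in the wrapper, a branch given together with build GRCh37 is ignored.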

Authors

  • Johannes Köster

Code

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

import subprocess
import sys
from pathlib import Path
from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)


species = snakemake.params.species.lower()
build = snakemake.params.build
release = int(snakemake.params.release)
gtf_release = release
# Split the output suffixes into compression (.gz) and annotation format.
out_fmt = Path(snakemake.output[0]).suffixes
out_gz = bool(out_fmt) and out_fmt[-1] == ".gz"
if out_gz:
    out_fmt.pop()
out_fmt = out_fmt.pop().lstrip(".") if out_fmt else ""


branch = ""
if build == "GRCh37":
    if release >= 81:
        # use the special grch37 branch for new releases
        branch = "grch37/"
    if release > 87:
        # file names on the grch37 branch stay at release 87
        gtf_release = 87
elif snakemake.params.get("branch"):
    branch = snakemake.params.branch + "/"


flavor = snakemake.params.get("flavor", "")
if flavor:
    flavor += "."


suffix = ""
if out_fmt == "gtf":
    suffix = "gtf.gz"
elif out_fmt == "gff3":
    suffix = "gff3.gz"
else:
    raise ValueError(
        "invalid format specified. Only 'gtf[.gz]' and 'gff3[.gz]' are currently supported."
    )

url = snakemake.params.get("url", "https://ftp.ensembl.org/pub")
url = f"{url}/{branch}release-{release}/{out_fmt}/{species}/{species.capitalize()}.{build}.{gtf_release}.{flavor}{suffix}"
ftp_url = url.replace("https://", "ftp://")

try:
    if out_gz:
        shell("curl --fail -L {url} > {snakemake.output[0]} {log}")
    else:
        shell("(curl --fail -L {url} | gzip -d > {snakemake.output[0]}) {log}")
except subprocess.CalledProcessError:
    try:
        if out_gz:
            shell("curl --fail -L {ftp_url} > {snakemake.output[0]} {log}")
        else:
            shell("(curl --fail -L {ftp_url} | gzip -d > {snakemake.output[0]}) {log}")
    except subprocess.CalledProcessError:
        if snakemake.log:
            sys.stderr = open(snakemake.log[0], "a")
        print(
            "Unable to download annotation data from Ensembl. "
            "Did you check that this combination of species, build, and release is actually provided?",
            file=sys.stderr,
        )
        sys.exit(1)
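
The HTTPS-then-FTP fallback above can be expressed as a small loop over candidate URLs. The sketch below is not the wrapper's implementation: candidate_urls, download_with_fallback and the injected fetch callable are hypothetical names, and fetch stands in for the curl call so the logic can be exercised without network access. Unlike the wrapper, the sketch skips the second attempt when the scheme replacement leaves the URL unchanged, which happens for a base URL that starts with http://.

```python
def candidate_urls(url):
    """Return the URL plus its ftp:// variant, if that differs."""
    urls = [url]
    ftp_url = url.replace("https://", "ftp://")
    if ftp_url != url:
        urls.append(ftp_url)
    return urls


def download_with_fallback(url, fetch):
    """Call fetch(candidate) for each candidate until one succeeds.

    fetch is assumed to raise OSError on failure. Returns the URL that
    worked, or raises RuntimeError listing every failed attempt.
    """
    errors = []
    for candidate in candidate_urls(url):
        try:
            fetch(candidate)
            return candidate
        except OSError as exc:
            errors.append(f"{candidate}: {exc}")
    raise RuntimeError(
        "Unable to download annotation data:\n" + "\n".join(errors)
    )
```

Passing the downloader in as a parameter keeps the retry logic separate from the transport, so a real caller could supply a function that shells out to curl while a test supplies a stub.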