ENSEMBL-REGULATION


Download annotation of regulatory features (e.g. promoters) for genomes from the ENSEMBL FTP servers and store them in a single .gff or .gff3 file. If the output file name ends in .gz, the download is kept compressed, which saves space and skips decompression after the download. From release 112 onwards, gff3 files are available and the wrapper requires the .gff3 extension. For older releases (>=87) and for the human GRCh37 build, only gff files under a different file path are available and the wrapper requires the .gff extension. For the available species (human and mouse as of writing), see the “Regulation (GFF)” column on the FTP download site: https://www.ensembl.org/info/data/ftp/index.html
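The extension rule can be summarised in a few lines of Python. This is a minimal sketch that mirrors the checks in the wrapper script listed under "Code"; the function name `expected_extension` is hypothetical and not part of the wrapper.

```python
def expected_extension(release: int, build: str) -> str:
    """Return the output format the wrapper accepts for a release and build.

    Hypothetical helper for illustration; it mirrors the wrapper's checks.
    """
    if release < 87:
        raise ValueError("Regulatory feature GFF files require release 87 or newer.")
    # Releases before 112 and the human GRCh37 build only provide .gff files.
    if release < 112 or build == "GRCh37":
        return "gff"
    return "gff3"


for release, build in [(112, "GRCh38"), (112, "GRCh37"), (98, "GRCm38")]:
    print(release, build, expected_extension(release, build))
```

An output name may additionally end in .gz in either case, so `regulatory_features.gff3.gz` and `regulatory_features.gff3` are both valid for release 112 on GRCh38.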

Example

This wrapper can be used in the following way:

rule get_regulatory_features_gff3_gz:
    output:
        "resources/regulatory_features.gff3.gz",  # a trailing .gz determines whether the download is kept compressed
    params:
        species="homo_sapiens", # for available species, release and build, search via "Regulation (GFF)" column at: https://www.ensembl.org/info/data/ftp/index.html
        release="112",
        build="GRCh38",
    log:
        "logs/get_regulatory_features.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs); for data downloads, software does not affect the resulting data
    wrapper:
        "v5.0.1/bio/reference/ensembl-regulation"


rule get_regulatory_features_grch37_gff:
    output:
        "resources/regulatory_features.gff",
    params:
        species="homo_sapiens",
        release="112",
        build="GRCh37",
    log:
        "logs/get_regulatory_features.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs); for data downloads, software does not affect the resulting data
    wrapper:
        "v5.0.1/bio/reference/ensembl-regulation"


rule get_regulatory_features_mouse_gff_gz:
    output:
        "resources/regulatory_features.mouse.gff.gz",
    params:
        species="mus_musculus",
        release="98",
        build="GRCm38",
    log:
        "logs/get_regulatory_features.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs); for data downloads, software does not affect the resulting data
    wrapper:
        "v5.0.1/bio/reference/ensembl-regulation"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Software dependencies

  • wget=1.21.4

Params

  • url: Base URL from which to download the regulatory feature files (optional; defaults to ftp://ftp.ensembl.org/pub).
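To illustrate how the `url` parameter enters the download location, the following sketch reproduces the two path templates found in the wrapper script under "Code". The function name `regulation_url` and the mirror address in the last call are made up for illustration; only the default base URL comes from the wrapper.

```python
DEFAULT_BASE = "ftp://ftp.ensembl.org/pub"


def regulation_url(species, build, release, base=DEFAULT_BASE):
    """Compose the download location the way the wrapper script does.

    Hypothetical helper for illustration, not part of the wrapper.
    """
    species = species.lower()
    if release < 112 or build == "GRCh37":
        # Older layout: the file name carries a date stamp, matched with "*".
        prefix = "grch37/" if build == "GRCh37" else ""
        return (
            f"{base}/{prefix}release-{release}/regulation/{species}/"
            f"{species}.{build}.Regulatory_Build.regulatory_features.*.gff.gz"
        )
    # Layout from release 112 onwards: per-build directory and gff3 files.
    return (
        f"{base}/release-{release}/regulation/{species}/{build}/annotation/"
        f"{species.capitalize()}.{build}.regulatory_features.v{release}.gff3.gz"
    )


print(regulation_url("homo_sapiens", "GRCh38", 112))
# A made-up mirror address, only to show where the base URL is substituted:
print(regulation_url("homo_sapiens", "GRCh37", 112, base="ftp://mirror.example.org/ensembl"))
```

Printing the composed location before running the workflow is a cheap way to check that a given combination of species, build and release points at a path that exists on the server.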

Authors

  • Johannes Köster

  • David Lähnemann

Code

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2024, Johannes Köster"
__email__ = "johannes.koester@uni-due.de"
__license__ = "MIT"

import subprocess
import sys
from pathlib import Path
from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)


species = snakemake.params.species.lower()
build = snakemake.params.build
release = int(snakemake.params.release)
# Split the output file name into its format and an optional trailing ".gz".
suffixes = Path(snakemake.output[0]).suffixes
out_gz = bool(suffixes) and suffixes[-1] == ".gz"
if out_gz:
    suffixes.pop()
out_fmt = suffixes.pop().lstrip(".") if suffixes else ""

if release < 87:
    raise ValueError(
        "Comprehensive GFF files are only available for release 87 or newer."
    )

if build == "GRCh37":
    grch37 = "grch37/"
else:
    grch37 = ""


suffix = ""
if out_fmt == "gff":
    suffix = "gff.gz"
elif out_fmt == "gff3":
    suffix = "gff3.gz"
else:
    raise ValueError(
        "Invalid format specified. "
        "Only 'gff[.gz]' (for releases before 112, and for build GRCh37) and "
        "'gff3[.gz]' (for any release from 112 onwards) are currently supported."
    )


url = snakemake.params.get("url", "ftp://ftp.ensembl.org/pub")
if release < 112 or build == "GRCh37":
    if out_fmt != "gff":
        raise ValueError(
            f"Invalid suffix for output file '{snakemake.output[0]}'. "
            "For releases older than 112 and for human build GRCh37, only .gff or .gff.gz are valid."
        )
    url = f"{url}/{grch37}release-{release}/regulation/{species}/{species}.{build}.Regulatory_Build.regulatory_features.*.{suffix}"
else:
    if out_fmt != "gff3":
        raise ValueError(
            f"Invalid suffix for output file '{snakemake.output[0]}'. "
            "For (non-GRCh37) releases from 112 onwards, only .gff3 or .gff3.gz are valid."
        )
    url = f"{url}/release-{release}/regulation/{species}/{build}/annotation/{species.capitalize()}.{build}.regulatory_features.v{release}.{suffix}"

try:
    if out_gz:
        shell('wget "{url}" -O {snakemake.output[0]} {log}')
    else:
        shell('(wget "{url}" -O - | gzip -d > {snakemake.output[0]}) {log}')
except subprocess.CalledProcessError as e:
    if snakemake.log:
        sys.stderr = open(snakemake.log[0], "a")
    print(
        "Unable to download regulatory feature data from Ensembl. "
        "Did you check that this combination of species, build, and release is actually provided? "
        "A good entry point for a search is: https://www.ensembl.org/info/data/ftp/index.html",
        file=sys.stderr,
    )
    sys.exit(1)
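The output path handling at the top of the script can be tried outside Snakemake. The following is a sketch that follows the same logic rather than the wrapper's exact code; the function name `split_output_suffix` is hypothetical.

```python
from pathlib import Path


def split_output_suffix(path):
    """Return (format, is_gzipped) for an output path such as 'x.gff3.gz'."""
    suffixes = Path(path).suffixes
    is_gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if is_gzipped:
        # Drop the trailing ".gz" so the last suffix is the actual format.
        suffixes = suffixes[:-1]
    fmt = suffixes[-1].lstrip(".") if suffixes else ""
    return fmt, is_gzipped


for name in (
    "resources/regulatory_features.gff3.gz",
    "resources/regulatory_features.mouse.gff.gz",
    "resources/regulatory_features.gff",
):
    print(name, split_output_suffix(name))
```

Only the last one or two suffixes matter, so extra dotted parts in the file name (such as `.mouse`) do not affect the detected format.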