.. _`bio/reference/ensembl-sequence`: ENSEMBL-SEQUENCE ================ .. image:: https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/reference/ensembl-sequence?label=version%20update%20pull%20requests :target: https://github.com/snakemake/snakemake-wrappers/pulls?q=is%3Apr+is%3Aopen+label%3Abio/reference/ensembl-sequence Download sequences (e.g. genome) from ENSEMBL FTP servers, and store them in a single .fasta file. Example ------- This wrapper can be used in the following way: .. code-block:: python rule get_genome: output: "refs/genome.fasta", params: species="saccharomyces_cerevisiae", datatype="dna", build="R64-1-1", release="98", log: "logs/get_genome.log", cache: "omit-software" # save space and time with between workflow caching (see docs) wrapper: "v3.0.1/bio/reference/ensembl-sequence" rule get_single_chromosome: output: "refs/chr2.fasta", params: species="saccharomyces_cerevisiae", datatype="dna", build="R64-1-1", release="101", chromosome="II", # optional: restrict to one or multiple chromosomes, for multiple see below # branch="plants", # optional: specify branch log: "logs/get_genome.log", cache: "omit-software" # save space and time with between workflow caching (see docs) wrapper: "v3.0.1/bio/reference/ensembl-sequence" rule get_multiple_chromosome: output: "refs/chr1_and_chr2.fasta", params: species="saccharomyces_cerevisiae", datatype="dna", build="R64-1-1", release="101", chromosome=["I", "II"], # optional: restrict to one or multiple chromosomes # branch="plants", # optional: specify branch log: "logs/get_genome.log", cache: "omit-software" # save space and time with between workflow caching (see docs) wrapper: "v3.0.1/bio/reference/ensembl-sequence" Note that input, output and log file paths can be chosen freely. When running with .. code-block:: bash snakemake --use-conda the software dependencies will be automatically deployed into an isolated environment before execution. Software dependencies --------------------- * ``curl`` Authors ------- * Johannes Köster Code ---- .. code-block:: python __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import subprocess as sp import sys from itertools import product from snakemake.shell import shell species = snakemake.params.species.lower() release = int(snakemake.params.release) build = snakemake.params.build branch = "" if release >= 81 and build == "GRCh37": # use the special grch37 branch for new releases branch = "grch37/" elif snakemake.params.get("branch"): branch = snakemake.params.branch + "/" log = snakemake.log_fmt_shell(stdout=False, stderr=True) spec = ("{build}" if int(release) > 75 else "{build}.{release}").format( build=build, release=release ) suffixes = "" datatype = snakemake.params.get("datatype", "") chromosome = snakemake.params.get("chromosome", "") if datatype == "dna": if chromosome: suffixes = [f"dna.chromosome.{chrom}.fa.gz" for chrom in chromosome] else: suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"] elif datatype == "cdna": suffixes = ["cdna.all.fa.gz"] elif datatype == "cds": suffixes = ["cds.all.fa.gz"] elif datatype == "ncrna": suffixes = ["ncrna.fa.gz"] elif datatype == "pep": suffixes = ["pep.all.fa.gz"] else: raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep") if chromosome: if not datatype == "dna": raise ValueError( "invalid datatype, to select a single chromosome the datatype must be dna" ) spec = spec.format(build=build, release=release) url_prefix = f"ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}" success = False for suffix in suffixes: url = f"{url_prefix}.{suffix}" try: shell("curl -sSf {url} > /dev/null 2> /dev/null") except sp.CalledProcessError: continue shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}") success = True break if not success: if len(suffixes) > 1: url = f"{url_prefix}.[{'|'.join(suffixes)}]" else: url = f"{url_prefix}.{suffixes[0]}" print( f"Unable to download requested sequence data from Ensembl ({url}). " "Please check whether above URL is currently available (might be a temporal server issue). " "Apart from that, did you check that this combination of species, build, and release is actually provided?", file=sys.stderr, ) exit(1) .. |nl| raw:: html