ENSEMBL-BIOMART-TABLE

Create a table of annotations available via bioconductor-biomart, with one column per specified annotation (for example ensembl_gene_id, ensembl_transcript_id, ext_gene, … for the human reference). For reference, see the Ensembl BioMart online or the biomaRt package documentation linked in the URL field.

URL: https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html

Example

This wrapper can be used in the following way:

rule create_transcripts_to_genes_mapping:
    output:
        table="resources/ensembl_transcripts_to_genes_mapping.tsv.gz",  # .gz extension is optional, but recommended
    params:
        biomart="genes",
        species="homo_sapiens",
        build="GRCh38",
        release="112",
        attributes=[
            "ensembl_transcript_id",
            "ensembl_gene_id",
            "external_gene_name",
            "genecards",
            "chromosome_name",
        ],
        filters={ "chromosome_name": ["22", "X"] }, # optional: restrict output by using filters
    log:
        "logs/create_transcripts_to_genes_mapping.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
    wrapper:
        "v5.5.2-17-g33d5b76/bio/reference/ensembl-biomart-table"


rule create_transcripts_to_genes_mapping_parquet:
    output:
        table="resources/ensembl_transcripts_to_genes_mapping.parquet.gz",  # .gz extension is optional, but recommended
    params:
        biomart="genes",
        species="mus_musculus",
        build="GRCm39",
        release="112",
        attributes=["ensembl_transcript_id", "ensembl_gene_id"],
        # filters={ "chromosome_name": "19"}, # optional: restrict output by using filters
    log:
        "logs/create_transcripts_to_genes_mapping_parquet.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
    wrapper:
        "v5.5.2-17-g33d5b76/bio/reference/ensembl-biomart-table"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.
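
Although the paths can be chosen freely, the wrapper code further down picks the output format from the file extension of the table output. The following Python sketch restates that dispatch for illustration only. The names `detect_output_format` and `PARQUET_COMPRESSION` are hypothetical and do not exist in the wrapper; the extension-to-compression mapping is copied from the R code shown on this page.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup table, copied from the case_match() call in the R code.
PARQUET_COMPRESSION = {
    "parquet": "uncompressed",
    "gz": "gzip",
    "zst": "zstd",
    "sz": "snappy",
}


def detect_output_format(filename: str) -> Tuple[str, Optional[str]]:
    """Return (format, parquet_compression) for an output path.

    Raises ValueError for extensions the wrapper would reject.
    """
    # TSV, optionally compressed; the TSV writer infers compression itself,
    # so no explicit compression value is returned here.
    if re.search(r"tsv(\.(gz|bz2|xz))?$", filename):
        return ("tsv", None)
    # Parquet: the last extension decides the compression codec.
    if re.search(r"\.parquet", filename):
        last_ext = filename.rsplit(".", 1)[-1]
        compression = PARQUET_COMPRESSION.get(last_ext)
        if compression is None:
            raise ValueError(f"Unsupported parquet extension: '{last_ext}'")
        return ("parquet", compression)
    raise ValueError(f"Unsupported output file format: '{filename}'")
```

Under these assumptions, `detect_output_format("resources/table.parquet.gz")` yields `("parquet", "gzip")`, while a path ending in `.csv` raises an error, matching the final `cli_abort()` branch of the wrapper.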

Software dependencies

  • bioconductor-biomart=2.58

  • r-nanoparquet=0.3

  • r-tidyverse=2.0

Params

  • biomart: for example, ‘genes’; for options, see the documentation on identifying databases

  • species: species that has a ‘genes’ database / dataset available via the Ensembl BioMart, for example ‘homo_sapiens’; to check availability, see the Ensembl species list

  • build: build available for the selected species, for example ‘GRCh38’

  • release: release from which the species and build are available, for example ‘112’

  • attributes: A list of wanted annotation columns (“database attributes”). For finding available attributes, see the instructions in the biomaRt documentation. Note that these need to be available for the combination of species, build and release from the specified biomart database.

  • filters: (optional) This will restrict the download and output to the filters you specify. The format is a dictionary, for example {"chromosome_name": ["X", "Y"]}. Note that nonexistent filter values (for example a chromosome_name of "Z") will simply be ignored without error or warning. For finding available filters, see the instructions in the biomaRt documentation.
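
To illustrate the filters format, here is a minimal Python sketch of how such a dictionary breaks down into filter names and filter values, which is the shape the wrapper hands to `getBM()` via `names()` and `unname()`. The helper `split_filters` is hypothetical and not part of the wrapper. Wrapping a single string into a one-element list is an assumption made here for uniformity; the wrapper itself passes values through unchanged.

```python
from typing import Dict, List, Tuple, Union

FilterValue = Union[str, List[str]]


def split_filters(
    filters: Dict[str, FilterValue]
) -> Tuple[List[str], List[List[str]]]:
    """Split a filters dict into parallel lists of names and value lists."""
    names = list(filters.keys())
    # Wrap bare strings so that every filter carries a list of values.
    values = [[v] if isinstance(v, str) else list(v) for v in filters.values()]
    return names, values


names, values = split_filters({"chromosome_name": ["22", "X"]})
print(names)   # ['chromosome_name']
print(values)  # [['22', 'X']]
```

Both forms used in the example rules, a list such as `["22", "X"]` and a single value such as `"19"`, end up as one list of values per filter name.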

Authors

  • David Lähnemann

Code

# __author__ = "David Lähnemann"
# __copyright__ = "Copyright 2024, David Lähnemann"
# __email__ = "david.laehnemann@hhu.de"
# __license__ = "MIT"

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")
library("nanoparquet")
rlang::global_entrace()
library("fs")
library("cli")

library("biomaRt")

wanted_biomart <- snakemake@params[["biomart"]]
# bioconductor-biomart needs the species as something like `hsapiens` instead
# of `homo_sapiens`, and `chyarkandensis` instead of `cervus_hanglu_yarkandensis`
species_name_components <- str_split(snakemake@params[["species"]], "_")[[1]]
if (length(species_name_components) == 2) {
  wanted_species <- str_c(
    str_sub(species_name_components[1], 1, 1),
    species_name_components[2]
  )
} else if (length(species_name_components) == 3) {
  wanted_species <- str_c(
    str_sub(species_name_components[1], 1, 1),
    str_sub(species_name_components[2], 1, 1),
    species_name_components[3]
  )
} else {
  cli_abort(c(
          "Unsupported species name '{snakemake@params[['species']]}'.",
    "x" = "Splitting on underscores led to unexpected number of name components: {length(species_name_components)}.",
    "i" = "Expected species name with 2 (e.g. `homo_sapiens`) or 3 (e.g. `cervus_hanglu_yarkandensis`) components.",
          "Anything else either does not exist in Ensembl, or we don't yet handle it properly.",
          "In case you are sure the species you specified is correct and exists in Ensembl, please",
          "file a bug report as an issue on GitHub, referencing this file: ",
          "https://github.com/snakemake/snakemake-wrappers/blob/master/bio/reference/ensembl-biomart-table/wrapper.R"
  ))
}

wanted_release <- snakemake@params[["release"]]
wanted_build <- snakemake@params[["build"]]

wanted_filters <- snakemake@params[["filters"]]

wanted_columns <- snakemake@params[["attributes"]]

output_filename <- snakemake@output[["table"]]

if (wanted_build == "GRCh37") {
  grch <- "37"
  version <- NULL
  cli_warn(c(
    "As you specified build 'GRCh37' in your configuration yaml, biomart forces",
    "us to ignore the release you specified ('{wanted_release}')."
  ))
} else {
  grch <- NULL
  version <- wanted_release
}

get_mart <- function(biomart, species, build, version, grch, dataset) {
  mart <- useEnsembl(
    biomart = biomart,
    dataset = str_c(species, "_", dataset),
    version = version,
    GRCh = grch
  )

  if (build == "GRCh37") {
    retrieved_build <- str_remove(listDatasets(mart)$version, "\\..*")
  } else {
    retrieved_build <- str_remove(searchDatasets(mart, species)$version, "\\..*")
  }

  if (retrieved_build != build) {
    cli_abort(c(
            "The Ensembl release and genome build number you specified are not compatible.",
      "x" = "Genome build '{build}' not available via biomart for Ensembl release '{wanted_release}'.",
      "i" = "Ensembl release '{wanted_release}' only provides build '{retrieved_build}'.",
      " " = "Please fix your configuration yaml file's reference entry, you have two options:",
      "*" = "Change the build entry to '{retrieved_build}'.",
      "*" = "Change the release entry to one that provides build '{build}'. You have to determine this from biomart by yourself."
    ))
  }
  mart
}

gene_ensembl <- get_mart(wanted_biomart, wanted_species, wanted_build, version, grch, "gene_ensembl")

if ( !is.null(wanted_filters) ) {
  table <- getBM(
    attributes = wanted_columns,
    filters = names(wanted_filters),
    values = unname(wanted_filters),
    mart = gene_ensembl
  ) |> as_tibble()
} else {
  table <- getBM(
    attributes = wanted_columns,
    mart = gene_ensembl
  ) |> as_tibble()
}



if ( str_detect(output_filename, "tsv(\\.(gz|bz2|xz))?$") ) {
  write_tsv(
    x = table,
    file = output_filename
  )
} else if ( str_detect(output_filename, "\\.parquet") ) {
  last_ext <- path_ext(output_filename)
  compression <- case_match(
    last_ext,
    "parquet" ~ "uncompressed",
    "gz" ~ "gzip",
    "zst" ~ "zstd",
    "sz" ~ "snappy"
  )
  if ( is.na(compression) ) {
    # The message parts must be combined into one vector with c(); otherwise
    # only the first string is used as the message.
    cli_abort(c(
            "File extension '{last_ext}' not supported for writing with the used nanoparquet version.",
      "x" = "Cannot write to a file '{output_filename}', because the version of the package",
            "nanoparquet used does not support writing files of type '{last_ext}'.",
      "i" = "For supported file types, see: https://r-lib.github.io/nanoparquet/reference/write_parquet.html"
    ))
  }
  write_parquet(
    x = table,
    file = output_filename,
    compression = compression
  )
} else {
  cli_abort(c(
    "Unsupported file format in output file '{output_filename}'.",
    "x" = "Only '.tsv' and '.parquet' files are supported, with certain compression variants each.",
    "i" = "For supported compression extensions, see:",
    "*" = "tsv: https://readr.tidyverse.org/reference/write_delim.html#output",
    "*" = "parquet: https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments"
  ))
}
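
For readers more at home in Snakemake's Python than in R, the species-name handling near the top of the wrapper can be read as the following sketch. It is an illustrative restatement, not code the wrapper runs; `biomart_species` and `biomart_dataset` are hypothetical names, and the two expected abbreviations are taken from the comment in the R code.

```python
def biomart_species(species: str) -> str:
    """Abbreviate an Ensembl species name the way the wrapper does.

    Two components: first letter of the genus plus the species epithet.
    Three components: first letters of the first two plus the third in full.
    """
    parts = species.split("_")
    if len(parts) == 2:
        return parts[0][:1] + parts[1]
    if len(parts) == 3:
        return parts[0][:1] + parts[1][:1] + parts[2]
    raise ValueError(
        f"Unsupported species name '{species}': "
        f"expected 2 or 3 underscore-separated components, got {len(parts)}."
    )


def biomart_dataset(species: str, dataset: str = "gene_ensembl") -> str:
    """Build the dataset name requested from the mart, e.g. for useEnsembl()."""
    return f"{biomart_species(species)}_{dataset}"


print(biomart_species("homo_sapiens"))                # hsapiens
print(biomart_species("cervus_hanglu_yarkandensis"))  # chyarkandensis
print(biomart_dataset("mus_musculus"))                # mmusculus_gene_ensembl
```

Names with any other number of components raise an error, mirroring the `cli_abort()` branch in the wrapper.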