ENSEMBL-MYSQL-TABLE
Create a table of annotations available via the versioned Ensembl MySQL databases. Rows of the resulting table are generated from the MySQL database tables specified under 'main_tables'. If you specify multiple main_tables, they should share some or most of their columns, as they will be stacked on top of each other with bind_rows(). Additional annotation columns can then be pulled in from MySQL tables specified under 'join_tables'. Each of these has to share a column with all of the main_tables, so that its columns can be left_join'ed onto them (see the sketch below). The main documentation for the Ensembl MySQL databases, including the 'Database schemas', is here: https://www.ensembl.org/info/docs/api/index.html
URL: https://www.ensembl.org/info/docs/api/index.html
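As a rough sketch of these stacking and joining semantics, this is what the wrapper effectively does, shown here on toy tibbles instead of real Ensembl tables:

library(tidyverse)

# two toy 'main tables' that share (some of) their columns
repeats_a <- tibble(seq_region_id = c(1, 2), repeat_start = c(100, 200))
repeats_b <- tibble(seq_region_id = 3, repeat_start = 300)

# a toy 'join table' that shares a column with all main tables
seq_region <- tibble(seq_region_id = c(1, 2, 3), name = c("1", "2", "X"))

# main tables are stacked with bind_rows(), then annotation columns are
# left_join()ed onto the stack via the shared column
bind_rows(repeats_a, repeats_b) |>
  left_join(seq_region, by = "seq_region_id")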
Example
This wrapper can be used in the following way:
rule create_repeat_annotations:
    output:
        table="resources/ensembl_repeat_annotations.tsv.gz",  # .gz extension is optional, but recommended
    params:
        species="homo_sapiens",
        build="GRCh37",
        release="105",
        main_tables={
            "repeat_feature": {
                "database": "core",
            },
        },  # choose the main table(s) to retrieve, specifying { table_name: { "database": database } }
        join_tables={
            "seq_region": {
                "database": "core",
                "join_column": "seq_region_id",
            },
            "repeat_consensus": {
                "database": "core",
                "join_column": "repeat_consensus_id",
            },
        },
        # optional: add tables to join in for further annotations,
        # specifying { table_name: { "database": database, "join_column": join_column } }
    log:
        "logs/create_repeat_annotations.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
    wrapper:
        "v5.0.1/bio/reference/ensembl-mysql-table"
rule create_regulatory_annotations_parquet:
    output:
        table="resources/ensembl_regulatory_annotations.parquet.gz",  # .gz extension is optional, but recommended
    params:
        species="mus_musculus",
        build="GRCm39",
        release="112",
        main_tables={
            "regulatory_feature": {
                "database": "funcgen",
            },
            "mirna_target_feature": {
                "database": "funcgen",
            },
        },
        # choose the main tables to retrieve, specifying { table_name: { "database": database } }
        join_tables={
            "seq_region": {
                "database": "core",
                "join_column": "seq_region_id",
            },
            "feature_type": {
                "database": "funcgen",
                "join_column": "feature_type_id",
            },
        },
        # optional: add tables to join in for further annotations,
        # specifying { table_name: { "database": database, "join_column": join_column } }
    log:
        "logs/create_regulatory_annotations_parquet.log",
    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
    wrapper:
        "v5.0.1/bio/reference/ensembl-mysql-table"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies
r-nanoparquet=0.3.1
r-tidyverse=2.0.0
r-dbplyr=2.5.0
r-rmariadb=1.3.2
Params
species
: Species that is available via the Ensembl MySQL databases. For a quick check, see the Ensembl species list. For full valid species names, consult the respective table for the release you specify; for example, for release '112' this is at: https://ftp.ensembl.org/pub/release-112/species_EnsemblVertebrates.txt
build
: Build available for the selected species, for example 'GRCh38'.
release
: Release from which the species and build are available, for example '112'.
main_tables
: A dictionary of wanted main tables, specified in the format { table_name: { "database": database } }. You can find available tables in the 'Database schemas' in the Ensembl documentation: https://mart.ensembl.org/info/docs/api/index.html Available database names at the time of writing are 'core', 'cdna', 'funcgen', 'compara', 'rnaseq', 'variation' and 'otherfeatures'. You can also interactively explore the available databases and tables, for example following this walkthrough: https://tavareshugo.github.io/data_carpentry_extras/dbplyr_ensembl/dbplyr_ensembl.html
join_tables
: (optional) A dictionary of join tables to add further annotation columns, specified in the nested format { table_name: { "database": database, "join_column": join_column_name } }. As for main_tables, most info can be found in the 'Database schemas': https://mart.ensembl.org/info/docs/api/index.html For more detailed info, you will have to interactively explore the MySQL databases, as shown in the sketch below: https://tavareshugo.github.io/data_carpentry_extras/dbplyr_ensembl/dbplyr_ensembl.html
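As a minimal sketch of such an interactive exploration session (the database name 'homo_sapiens_core_112_38' is only an assumed example; pick a real one from the SHOW DATABASES output):

library(DBI)
library(RMariaDB)
library(dplyr)
library(dbplyr)

# anonymous, read-only connection to the public Ensembl MySQL server
con <- dbConnect(
  MariaDB(),
  host = "ensembldb.ensembl.org",
  user = "anonymous",
  port = 3306,
  password = ""
)
dbGetQuery(con, "SHOW DATABASES")  # all databases, across species and releases

# connect to one specific database to list its tables and peek at one of them
core <- dbConnect(
  MariaDB(),
  dbname = "homo_sapiens_core_112_38",  # assumed example name
  host = "ensembldb.ensembl.org",
  user = "anonymous",
  port = 3306,
  password = ""
)
dbListTables(core)
tbl(core, "meta") |> head() |> collect()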
Code
# __author__ = "David Lähnemann"
# __copyright__ = "Copyright 2024, David Lähnemann"
# __email__ = "david.laehnemann@hhu.de"
# __license__ = "MIT"
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
library("tidyverse")
library("nanoparquet")
rlang::global_entrace()
library("fs")
library("cli")
library("dbplyr")
library("RMariaDB")
wanted_species <- snakemake@params[["species"]]
wanted_release <- snakemake@params[["release"]]
wanted_build <- snakemake@params[["build"]]
main_tables <- snakemake@params[["main_tables"]]
join_tables <- snakemake@params[["join_tables"]]
output_filename <- snakemake@output[["table"]]
if (wanted_build == "GRCh37") {
  port <- 3337
} else if (wanted_build == "GRCm38") {
  # releases arrive as strings via params, so compare them numerically
  if (as.integer(wanted_release) > 99) {
    cli_abort(c(
      "From Ensembl release 100 and upwards, genome build GRCm38 is not available any more.",
      "i" = "Please choose a release of 99 or lower, or choose a newer mouse genome build."
    ))
  }
  port <- 3337
} else {
  port <- 3306
}
# general connection to ensembl databases
ensembl_connection <- dbConnect(
  MariaDB(),
  host = "ensembldb.ensembl.org",
  user = "anonymous",
  port = port,
  password = "",
  timeout = 45
)
get_and_check_db_name <- function(connection, species, database, release, wanted_build) {
  species_dbs <- dbGetQuery(connection, "SHOW DATABASES") |> filter(str_starts(Database, species))
  dbname_prefix <- str_c(
    species,
    database,
    release,
    sep = "_"
  )
  dbname <- species_dbs |> filter(str_starts(Database, dbname_prefix)) |> pull(Database)
  core_dbname_prefix <- str_c(
    species,
    "core",
    release,
    sep = "_"
  )
  core_dbname <- species_dbs |> filter(str_starts(Database, core_dbname_prefix)) |> pull(Database)
  if ( length(dbname) == 1 ) {
    # CHECK THAT WE ARE GETTING THE CORRECT BUILD
    # 1st option: the meta table of the core database contains build / assembly
    # identifiers that match what is requested
    core_connection <- dbConnect(
      MariaDB(),
      dbname = core_dbname,
      host = "ensembldb.ensembl.org",
      user = "anonymous",
      port = port,
      password = "",
      timeout = 45
    )
    assembly_default <- tbl(core_connection, "meta") |> filter(meta_key == "assembly.default") |> pull("meta_value")
    assembly_name <- tbl(core_connection, "meta") |> filter(meta_key == "assembly.name") |> pull("meta_value")
    if (!wanted_build %in% c(assembly_default, assembly_name)) {
      # 2nd option: the build number from the retrieved database name matches
      # the systematically parsed version number at the end of the build
      # requested via params:
      retrieved_build_num <- dbname |> first() |> str_split("_") |> first() |> last()
      # Canonicalize the build number to a single number as used in the mysql
      # database names. At the time of writing, this code fails for 20 species,
      # but build/assembly name numbers in those have no systematic relationship
      # with the build numbers used in the respective mysql databases. Thus, we
      # throw an error in those cases and ask users to check the correct
      # matchup manually.
      wanted_build_num <- wanted_build |>
        # remove patch suffixes like ".p14" in "GRCh38.p14", although usually
        # we expect users of the wrapper to only specify "GRCh38" as the build
        str_replace("\\.p\\d+$", "") |>
        # look for trailing combinations of digits and dots
        str_extract("[\\d\\.]+$") |>
        # remove all dots, as mysql database numbers contain no dots
        str_replace_all("\\.", "") |>
        # remove leading zeros, for example from "PKINGS_0.1"
        str_replace("^0+", "") |>
        # remove trailing zeros, as this is needed for the vast majority of
        # such cases (in some it causes a mismatch instead, but there is no
        # systematic way of telling when)
        str_replace("0$", "")
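      # Worked examples of this canonicalization (illustrative):
      #   "GRCh38.p14" -> "GRCh38" -> "38" (patch suffix stripped, then
      #                                     trailing digits extracted)
      #   "PKINGS_0.1" -> "01"     -> "1"  (dots removed, then leading zeros)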
      if (retrieved_build_num != wanted_build_num) {
        # 3rd option: try matching up by downloading the release's species
        # summary file from the FTP servers... ¯\_(ツ)_/¯
        species_file_address <- str_c(
          "https://ftp.ensembl.org/pub/release-",
          release,
          "/species_EnsemblVertebrates.txt"
        )
        species_summary <- read_tsv(species_file_address) |>
          filter( str_starts(core_db, species) ) |>
          select(assembly, core_db)
        assembly <- species_summary |> pull(assembly) |> first()
        summary_build_num <- species_summary |> pull(core_db) |> first() |> str_split("_") |> first() |> last()
        if ( (! str_starts(wanted_build, assembly)) & (summary_build_num != wanted_build_num) ) {
          cli_abort(c(
            "The build we could retrieve for the specified combination of species, database, and release does",
            "not match the build you specified.",
            "x" = "You specified:",
            "*" = "genome build: {wanted_build}",
            "*" = "species: {species}",
            "*" = "Ensembl release: {release}",
            "x" = "But the 'meta' table of the respective 'core' database ('{core_dbname}') gave:",
            "*" = "assembly.default: {assembly_default}",
            "*" = "assembly.name: {assembly_name}",
            " " = "Neither matches the requested genome build.",
            " " = "",
            "x" = "Also, build number '{retrieved_build_num}' at the end of the determined database name",
            " " = "does not match the build number we systematically extracted from '{wanted_build}',",
            " " = "which is '{wanted_build_num}'.",
            " " = "",
            "x" = "Finally, we also checked against the Ensembl species summary file at:",
            " " = "{species_file_address}",
            " " = "It listed the following info, which doesn't match the wanted build specification:",
            "*" = "build / assembly: '{assembly}'",
            "*" = "build number for core_db: '{summary_build_num}'",
            " " = "We give up.",
            " " = "",
            "i" = "Please ensure that you specify an existing combination of species, build and release.",
            " " = "The Ensembl species summary file for the release you want is a good starting point.",
            " " = "It includes correct 'species' names and valid 'assembly' (build) names:",
            " " = "{species_file_address}",
            " " = ""
          ))
        }
      }
    }
    dbname
  } else {
    cli_abort(c(
      "Could not retrieve a unique database name (found none or several) for the specified combination of species, database, release, and build.",
      "x" = "You requested:",
      "*" = "species: {species}",
      "*" = "database: {database}",
      "*" = "Ensembl release: {release}",
      "*" = "genome build: {wanted_build}",
      " " = "",
      " " = "As builds are simple numbers in the mysql database names, we looked for a database with prefix:",
      " " = "{dbname_prefix}",
      " " = "",
      "x" = "Here's what we found (can be empty):",
      " " = "{dbname}",
      " " = "",
      "i" = "Please ensure that you specify an existing and unique combination of species, database, release, and build.",
      " " = "The following species summary file can be a good starting point:",
      " " = "https://ftp.ensembl.org/pub/release-{release}/species_EnsemblVertebrates.txt"
    ))
  }
}
get_table <- function(dbname, port, table_name) {
  table_connection <- dbConnect(
    MariaDB(),
    dbname = dbname,
    host = "ensembldb.ensembl.org",
    user = "anonymous",
    port = port,
    password = "",
    timeout = 45
  )
  # TODO: This is where we could do filtering, to reduce the amount of actual
  # downloading at the earliest possible point. But this would need extra
  # params-based syntax, and it is unclear whether that is really necessary,
  # or if most use cases are about dumping (and joining) full tables.
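  # A hypothetical sketch of such pre-collect filtering (the params-based
  # syntax and 'wanted_seq_region_ids' are assumptions, not implemented):
  # with dbplyr, a filter() placed before collect() is translated to SQL
  # and runs server-side, e.g.:
  #   tbl(table_connection, table_name) |>
  #     filter(seq_region_id %in% wanted_seq_region_ids) |>
  #     collect()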
  table <- tbl(table_connection, table_name) |> collect()
  table
}
main_table <- tibble()
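# Stack all requested main tables on top of each other. bind_rows() fills
# columns that are missing from one of the tables with NA, which is why the
# main tables only need to share some (not all) of their columns.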
for (table in names(main_tables)) {
  main_table_db_name <- get_and_check_db_name(ensembl_connection, wanted_species, main_tables[[table]][["database"]], wanted_release, wanted_build)
  main_table <- main_table |>
    bind_rows(
      get_table(main_table_db_name, port, table)
    )
}
if ( !is.null(join_tables) ) {
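  # Left-join each annotation table onto the stacked main table via its
  # declared join_column; non-join columns with clashing names get
  # disambiguated by left_join()'s default .x/.y suffixes.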
  for (table in names(join_tables)) {
    table_db_name <- get_and_check_db_name(ensembl_connection, wanted_species, join_tables[[table]][["database"]], wanted_release, wanted_build)
    join_table <- get_table(table_db_name, port, table)
    main_table <- main_table |>
      left_join(
        join_table,
        by = join_tables[[table]][["join_column"]]
      )
  }
}
if ( str_detect(output_filename, "\\.tsv(\\.(gz|bz2|xz))?$") ) {
  write_tsv(
    x = main_table,
    file = output_filename
  )
} else if ( str_detect(output_filename, "\\.parquet") ) {
  last_ext <- path_ext(output_filename)
  compression <- case_match(
    last_ext,
    "parquet" ~ "uncompressed",
    "gz" ~ "gzip",
    "zst" ~ "zstd",
    "sz" ~ "snappy"
  )
  if ( is.na(compression) ) {
    cli_abort(c(
      "File extension '{last_ext}' is not supported for writing with the used nanoparquet version.",
      "x" = "Cannot write to file '{output_filename}', because the version of the",
      " " = "nanoparquet package used does not support writing files of type '{last_ext}'.",
      "i" = "For supported file types, see: https://r-lib.github.io/nanoparquet/reference/write_parquet.html"
    ))
  }
  write_parquet(
    x = main_table,
    file = output_filename,
    compression = compression
  )
} else {
  cli_abort(c(
    "Unsupported file format in output file '{output_filename}'.",
    "x" = "Only '.tsv' and '.parquet' files are supported, with certain compression variants each.",
    "i" = "For supported compression extensions, see:",
    "*" = "tsv: https://readr.tidyverse.org/reference/write_delim.html#output",
    "*" = "parquet: https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments"
  ))
}