MEHARI BUILD TRANSCRIPT DB

Build a transcript database for mehari by running mehari db create.

URL: https://github.com/varfish-org/mehari

Example

This wrapper can be used in the following way:

rule mehari_build_transcript_database:
    input:
        annotation="resources/{prefix}.gff3.gz",
        sequences="resources/{prefix}.cdna.fasta",
    output:
        db="{prefix}.bin.zst",
    log:
        "logs/mehari/build_transcript_db/{prefix}.log",
    threads: 4
    params:
        assembly="GRCh38",
        assembly_version="GRCh38.p14",
        transcript_source="Ensembl",
        transcript_source_version="115",
        annotation_version="115",
    wrapper:
        "v9.8.0/bio/mehari/build-transcript-db"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Software dependencies

mehari=0.43.2
snakemake-wrapper-utils=0.8.0

Input/Output

Input:

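annotation: Transcript annotation file, e.g., a (gzipped) GFF3 file. Required.
sequences: Transcript sequences. A FASTA file is passed to mehari as --transcript-sequences; any other path is passed as --seqrepo. Required.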
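
The wrapper code chooses the mehari option for sequences from the file format. A minimal sketch of that dispatch follows. Note that guess_format is a hypothetical, simplified stand-in for get_format from snakemake-wrapper-utils: it is assumed here to look only at the file extension and to skip a trailing compression suffix, which may differ from the real helper. The seqrepo path in the example is also made up for illustration.

```python
from pathlib import Path

# Assumption: which extensions count as FASTA or as compression suffixes.
FASTA_EXTENSIONS = {"fa", "fas", "fna", "fasta"}
COMPRESSION_EXTENSIONS = {"gz", "bgz", "bz2", "zst"}


def guess_format(path):
    """Hypothetical stand-in for get_format: classify a path by its extension."""
    suffixes = [s.lstrip(".").lower() for s in Path(path).suffixes]
    # Skip a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        suffixes = suffixes[:-1]
    if not suffixes:
        return ""
    ext = suffixes[-1]
    return "fasta" if ext in FASTA_EXTENSIONS else ext


def sequences_argument(path):
    """Mirror the wrapper's choice between the two sequence options."""
    if guess_format(path) == "fasta":
        return f"--transcript-sequences {path}"
    return f"--seqrepo {path}"


# FASTA input from the example rule, then a hypothetical seqrepo directory.
print(sequences_argument("resources/GRCh38.cdna.fasta"))
print(sequences_argument("resources/seqrepo/main"))
```

Under these assumptions, anything that is not recognised as FASTA falls through to --seqrepo, so a mistyped FASTA extension would be handed to mehari as a seqrepo path.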

Output:

db: Transcript database written by mehari db create, e.g., “{prefix}.bin.zst”. Required.

Params

assembly: Assembly name, e.g., “GRCh38”. Required.
assembly_version: Assembly version, e.g., “GRCh38.p14”. Optional.
annotation_version: Version of the annotation, e.g., “115”. Optional.
transcript_source: Source of the transcript sequences, e.g., “Ensembl” or “RefSeq”. Required.
transcript_source_version: Version of the transcript sequences, e.g., “115”. Optional.
extra: Extra arguments appended to the mehari db create invocation. Optional.
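
The wrapper code requires assembly and transcript_source, and passes the three version parameters only when they are set. The sketch below illustrates that mapping from params to command-line options. The function params_to_args and the plain dict it takes are hypothetical and are not part of the wrapper. Unlike the wrapper, the sketch shell-quotes optional values with shlex.quote.

```python
import shlex

# Parameter names as used in the rule; each maps to "--name-with-dashes".
REQUIRED = ("assembly", "transcript_source")
OPTIONAL = ("assembly_version", "annotation_version", "transcript_source_version")


def params_to_args(params):
    """Translate a params dict into mehari db create options (illustrative only)."""
    args = []
    for name in REQUIRED:
        if not params.get(name):
            raise ValueError(f"Parameter '{name}' is required but not specified")
        args += ["--" + name.replace("_", "-"), shlex.quote(str(params[name]))]
    for name in OPTIONAL:
        value = params.get(name, "")
        if value:  # unset or empty optional params produce no option at all
            args += ["--" + name.replace("_", "-"), shlex.quote(str(value))]
    extra = params.get("extra", "")
    if extra:  # extra is appended verbatim, without quoting
        args.append(extra)
    return " ".join(args)


example = {
    "assembly": "GRCh38",
    "assembly_version": "GRCh38.p14",
    "transcript_source": "Ensembl",
    "transcript_source_version": "115",
    "annotation_version": "115",
}
print(params_to_args(example))
```

With the params from the example rule, this prints the options in the order required first, then optional: --assembly GRCh38 --transcript-source Ensembl --assembly-version GRCh38.p14 --annotation-version 115 --transcript-source-version 115.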

Authors

Till Hartmann

Code

__author__ = "Till Hartmann"
__copyright__ = "Copyright 2025, Till Hartmann"
__email__ = "till.hartmann@bih-charite.de"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.snakemake import get_format

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# required inputs and outputs
if not snakemake.input.get("annotation"):
    raise ValueError("Input 'annotation' is required but not specified")

if not snakemake.output.get("db"):
    raise ValueError("Output 'db' is required but not specified")

sequences = snakemake.input.get("sequences")
if not sequences:
    raise ValueError("Input 'sequences' is required but not specified")

# FASTA files are passed as transcript sequences; any other path is passed via --seqrepo
if get_format(sequences) == "fasta":
    sequences = f"--transcript-sequences {sequences}"
else:
    sequences = f"--seqrepo {sequences}"

# required params
if not snakemake.params.get("assembly"):
    raise ValueError("Parameter 'assembly' is required but not specified")

if not snakemake.params.get("transcript_source"):
    raise ValueError("Parameter 'transcript_source' is required but not specified")


# optional params
assembly_version = snakemake.params.get("assembly_version", "")
if assembly_version:
    assembly_version = f"--assembly-version {assembly_version}"

annotation_version = snakemake.params.get("annotation_version", "")
if annotation_version:
    annotation_version = f"--annotation-version {annotation_version}"

transcript_source_version = snakemake.params.get("transcript_source_version", "")
if transcript_source_version:
    transcript_source_version = (
        f"--transcript-source-version {transcript_source_version}"
    )

shell(
    "mehari db create"
    " --threads {snakemake.threads}"
    " --annotation {snakemake.input.annotation:q}"
    " --assembly {snakemake.params.assembly:q}"
    " --transcript-source {snakemake.params.transcript_source:q}"
    " {sequences}"
    " {assembly_version}"
    " {annotation_version}"
    " {transcript_source_version}"
    " {extra}"
    " --output {snakemake.output.db:q}"
    " {log}"
)