GFFREAD

Validate, filter, convert and perform various other operations on GFF/GTF files with Gffread

URL: http://ccb.jhu.edu/software/stringtie/gff.shtml

Example

This wrapper can be used in the following way:

rule test_gffread:
    input:
        fasta="genome.fasta",
        annotation="annotation.gtf",
        # ids="",  # Optional path to records to keep
        # nids="",  # Optional path to records to drop
        # seq_info="",  # Optional path to sequence information
        # sort_by="",  # Optional path to the ordered list of reference sequences
        # attr="",  # Optional annotation attributes to keep.
        # chr_replace="",  # Optional path to <original_ref_ID> <new_ref_ID>
    output:
        records="transcripts.fa",
        # dupinfo="",  # Optional path to clustering/merging information
    threads: 1
    log:
        "logs/gffread.log",
    params:
        extra="",
    wrapper:
        "v3.8.0/bio/gffread"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

Input/output formats are automatically detected from their file extension.
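
Detection relies only on the file suffix, so a misnamed file is either rejected or handled as the wrong format. The sketch below mirrors the suffix checks made in the wrapper script under Code. The helper names `input_format_flag` and `output_path_flag` are hypothetical and are not part of the wrapper.

```python
def input_format_flag(annotation: str) -> str:
    """Return the gffread input-format flag implied by the annotation suffix."""
    if annotation.endswith(".bed"):
        return "--in-bed"
    if annotation.endswith(".tlf"):
        return "--in-tlf"
    if annotation.endswith(".gtf"):
        return ""  # GTF input needs no extra flag
    raise ValueError(f"Unknown annotation format: {annotation}")


def output_path_flag(records: str) -> str:
    """Return the flag that introduces the output path for this suffix."""
    if records.endswith((".fasta", ".fa", ".fna")):
        return "-w"  # FASTA output is written with -w
    if records.endswith((".gtf", ".gff", ".gff3", ".bed", ".tlf")):
        return "-o"  # annotation output is written with -o
    raise ValueError(f"Unknown records format: {records}")


print(input_format_flag("annotation.bed"))
print(output_path_flag("transcripts.fa"))
```

Renaming the output from `transcripts.fa` to `transcripts.bed` is therefore enough to switch from sequence extraction to annotation conversion.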

Software dependencies

gffread=0.12.7

Input/Output

Input:

fasta: Path to genome file (FASTA formatted).
annotation: Path to genome annotation (GTF, BED or TLF formatted).
ids: Optional path to records/transcripts to keep.
nids: Optional path to records/transcripts to discard.
seq_info: Optional path to sequence information, a TSV-formatted text file containing <seq-name> <seq-length> <seq-description>.
sort_by: Optional path to a text file containing the ordered list of reference sequences.
attr: Optional path to a text file containing a comma-separated list of annotation attributes to keep.
chr_replace: Optional path to a TSV-formatted text file containing <original_ref_ID> <new_ref_ID>.

Output:

records: Path to genome sequence/annotation in the requested format, containing the requested information.
dupinfo: Optional path to clustering/merging information.
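
The tab-separated inputs described above (`seq_info` and `chr_replace`) can be generated programmatically. This is a minimal sketch, assuming one record per line, no header line, and the column order listed above. The file names and example rows are illustrative only.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, rows):
    """Write rows as tab-separated lines (assumption: no header line)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Scratch directory, so that the example does not touch a real project
workdir = Path(tempfile.mkdtemp())

# seq_info: <seq-name> <seq-length> <seq-description>
write_tsv(
    workdir / "seq_info.tsv",
    [("chr1", 248956422, "chromosome 1"), ("chrM", 16569, "mitochondrion")],
)

# chr_replace: <original_ref_ID> <new_ref_ID>
write_tsv(workdir / "chr_replace.tsv", [("1", "chr1"), ("MT", "chrM")])

print((workdir / "chr_replace.tsv").read_text())
```

The resulting paths can then be passed to the rule as `seq_info=` and `chr_replace=` inputs.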

Authors

Thibault Dayris

Code

__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2023, Thibault Dayris"
__mail__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

annotation = snakemake.input.annotation
records = snakemake.output.records

# Input format control
if annotation.endswith(".bed"):
    extra += " --in-bed "
elif annotation.endswith(".tlf"):
    extra += " --in-tlf "
elif annotation.endswith(".gtf"):
    pass
else:
    raise ValueError(
        f"Unknown annotation format: {annotation} (expected .gtf, .bed or .tlf)"
    )

# In most cases, output can be specified with -o
out_flag = " -o "

# Output format control
if records.endswith(".gtf"):
    # -T switches the main output from GFF3 (the gffread default) to GTF
    extra += " -T "
elif records.endswith((".gff", ".gff3")):
    # GFF3 is the default output format: no flag needed
    pass
elif records.endswith(".bed"):
    extra += " --bed "
elif records.endswith(".tlf"):
    extra += " --tlf "
elif records.endswith((".fasta", ".fa", ".fna")):
    # Fasta output must be specified with -w
    out_flag = " -w "
else:
    raise ValueError(f"Unknown records format: {records}")


# Optional input files
ids = snakemake.input.get("ids", "")
if ids:
    extra += f" --ids {ids} "

nids = snakemake.input.get("nids", "")
if nids:
    if ids:
        raise ValueError(
            "Provide either record IDs to keep (ids) or to drop (nids), "
            "not both: combining them produces an empty file."
        )
    extra += f" --nids {nids} "

seq_info = snakemake.input.get("seq_info", "")
if seq_info:
    extra += f" -s {seq_info} "

sort_by = snakemake.input.get("sort_by", "")
if sort_by:
    extra += f" --sort-by {sort_by} "

attr = snakemake.input.get("attr", "")
if attr:
    if not records.endswith((".gtf", ".gff", ".gff3")):
        raise ValueError(
            "GTF attributes specified in input, "
            "but records are not in GTF/GFF format."
        )
    extra += f" --attrs {attr} "

chr_replace = snakemake.input.get("chr_replace", "")
if chr_replace:
    extra += f" -m {chr_replace} "


# Optional output files
dupinfo = snakemake.output.get("dupinfo", "")
if dupinfo:
    extra += f" -d {dupinfo} "


shell(
    "gffread {extra} "
    "{out_flag} {records} "
    "-g {snakemake.input.fasta} "
    "{annotation} "
    "{log} "
)
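
To preview the command line that results from a given set of inputs, the assembly logic can be reproduced outside Snakemake. This is a minimal sketch covering FASTA output only (format flags are left out). The `build_command` helper and the `keep.txt` file name are hypothetical. Unlike the wrapper, which concatenates strings, the sketch builds an argument list and quotes it with `shlex`.

```python
import shlex

# Optional input name -> gffread flag, as used in the wrapper script above
OPTIONAL_INPUT_FLAGS = {
    "ids": "--ids",
    "nids": "--nids",
    "seq_info": "-s",
    "sort_by": "--sort-by",
    "attr": "--attrs",
    "chr_replace": "-m",
}


def build_command(inputs, records, extra=""):
    """Assemble a gffread command for FASTA output (hypothetical helper)."""
    if inputs.get("ids") and inputs.get("nids"):
        raise ValueError("Provide either 'ids' or 'nids', not both.")
    args = ["gffread", *shlex.split(extra)]
    for name, flag in OPTIONAL_INPUT_FLAGS.items():
        if inputs.get(name):
            args += [flag, inputs[name]]
    # FASTA output is introduced by -w, the genome by -g
    args += ["-w", records, "-g", inputs["fasta"], inputs["annotation"]]
    return shlex.join(args)


command = build_command(
    {"fasta": "genome.fasta", "annotation": "annotation.gtf", "ids": "keep.txt"},
    "transcripts.fa",
)
print(command)
```

Printing the command this way makes it easier to check which flags a rule would pass before running the full workflow.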