GFFREAD

https://img.shields.io/github/issues-pr/snakemake/snakemake-wrappers/bio/gffread?label=version%20update%20pull%20requests

Validate, filter, convert and perform various other operations on GFF/GTF files with Gffread

URL: http://ccb.jhu.edu/software/stringtie/gff.shtml

Example

This wrapper can be used in the following way:

rule test_gffread_gtf:
    input:
        fasta="genome.fasta",
        annotation="annotation.gtf",
        # ids="",  # Optional path to records to keep
        # nids="",  # Optional path to records to drop
        # seq_info="",  # Optional path to sequence information
        # sort_by="",  # Optional path to the ordered list of reference sequences
        # attr="",  # Optional annotation attributes to keep.
        # chr_replace="",  # Optional path to <original_ref_ID> <new_ref_ID>
    output:
        # All optional, select based on desired output
        records="annotation.gff",
        transcript_fasta="transcripts.fa",  # Path containing gffread -w output
        cds_fasta="cds.fa",  # Path containing gffread -x output
        # dupinfo="",  # Path to clustering/merging information
    threads: 1
    log:
        "logs/gffread_gtf.log",
    params:
        fasta_flag="-w",
        extra="",
    wrapper:
        "v5.8.0-3-g915ba34/bio/gffread"


rule test_gffread_gff:
    input:
        fasta="genome.fasta",
        annotation="annotation.gff3",
    output:
        protein_fasta="proteins.fa",  # Path containing gffread -y output
    threads: 1
    log:
        "logs/gffread_gff.log",
    params:
        extra="-S",
    wrapper:
        "v5.8.0-3-g915ba34/bio/gffread"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

Input/output formats are detected from their file extension.

Software dependencies

  • gffread=0.12.7

Input/Output

Input:

  • fasta: FASTA genome file.

  • annotation: GTF/GTF/BED genome annotation.

  • ids: Records/transcript to keep (optional).

  • nids: Records/transcripts to discard (optional).

  • seq_info: Sequence information, a TSV formatted text file containing <seq-name> <seq-length> <seq-description> (optional).

  • sort_by: Text file containing the ordered list of reference sequences (optional).

  • attr: Text file containing comma-separated list of annotation attributes to keep (optional).

  • chr_replace: TSV-formatted text file containing <original_ref_ID> <new_ref_ID> (optional).

Output:

  • records: Genome annotation (optional).

  • transcript_fasta: FASTA containing -w output (optional).

  • cds_fasta: FASTA containing -x output (optional).

  • protein_fasta: FASTA containing -y output (optional).

  • dupinfo: Clustering/merging information (optional).

Params

  • extra: additional program arguments (except for –in-bed, –in-tlf, -T, –bed, –tlf, -o, -w, -x, -y, –ids, –nids, -s, –sort-by, –attrs, -m, -d, or -g).

Authors

Code

__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2023, Thibault Dayris"
__mail__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Input files
annotation = snakemake.input.annotation

# Output files
records = snakemake.output.get("records", "")
if records:
    extra += f" -o {records}"

dupinfo = snakemake.output.get("dupinfo", "")
if dupinfo:
    extra += f" -d {dupinfo}"

transcript_fasta = snakemake.output.get("transcript_fasta", "")
if transcript_fasta:
    extra += f" -w {transcript_fasta}"

cds_fasta = snakemake.output.get("cds_fasta", "")
if cds_fasta:
    extra += f" -x {cds_fasta}"

protein_fasta = snakemake.output.get("protein_fasta", "")
if protein_fasta:
    extra += f" -y {protein_fasta}"


# Input format control
if annotation.endswith(".bed"):
    extra += " --in-bed"
elif annotation.endswith(".tlf"):
    extra += " --in-tlf"
elif annotation.endswith((".gtf", ".gff", ".gff3")):
    pass
else:
    raise ValueError("Unknown annotation format")

# Output format control
if not records:
    pass
elif records.endswith((".gtf", ".gff", ".gff3")):
    extra += " -T"
elif records.endswith(".bed"):
    extra += " --bed"
elif records.endswith(".tlf"):
    extra += " --tlf"
else:
    raise ValueError("Unknown records format")


# Optional input files
ids = snakemake.input.get("ids", "")
if ids:
    extra += f" --ids {ids}"

nids = snakemake.input.get("nids", "")
if nids:
    if ids:
        raise ValueError(
            "Provide either sequences ids to keep, or to drop."
            " Or else, an empty file is produced."
        )
    extra += f" --nids {nids}"

seq_info = snakemake.input.get("seq_info", "")
if seq_info:
    extra += f" -s {seq_info}"

sort_by = snakemake.input.get("sort_by", "")
if sort_by:
    extra += f" --sort-by {sort_by}"

attr = snakemake.input.get("attr", "")
if attr:
    if not records.endswith((".gtf", ".gff", ".gff3")):
        raise ValueError(
            "GTF attributes specified in input, "
            "but records are not in GTF/GFF format."
        )
    extra += f" --attrs {attr}"

chr_replace = snakemake.input.get("chr_replace", "")
if chr_replace:
    extra += f" -m {chr_replace}"


shell("gffread {extra} -g {snakemake.input.fasta} {annotation} {log}")