GFFREAD
Validate, filter, convert and perform various other operations on GFF/GTF files with gffread.
URL: http://ccb.jhu.edu/software/stringtie/gff.shtml
Example
This wrapper can be used in the following way:
rule test_gffread:
    input:
        fasta="genome.fasta",
        annotation="annotation.gtf",
        # ids="",  # Optional path to records to keep
        # nids="",  # Optional path to records to drop
        # seq_info="",  # Optional path to sequence information
        # sort_by="",  # Optional path to the ordered list of reference sequences
        # attr="",  # Optional annotation attributes to keep.
        # chr_replace="",  # Optional path to <original_ref_ID> <new_ref_ID>
    output:
        records="transcripts.fa",
        # dupinfo="",  # Optional path to clustering/merging information
    threads: 1
    log:
        "logs/gffread.log",
    params:
        extra="",
    wrapper:
        "v2.6.0/bio/gffread"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
Input/output formats are automatically detected from their file extension.
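The extension-based detection can be sketched as a small pure function. This is only an illustration that mirrors the checks in the wrapper code listed under Code; the name `format_flags` is hypothetical and is not part of the wrapper or of gffread.

```python
def format_flags(annotation, records):
    """Return the gffread format flags implied by the two file extensions.

    Hypothetical helper that mirrors the wrapper's checks; not part of the wrapper.
    """
    flags = []

    # Input side: GTF needs no flag, BED and TLF each need an explicit one.
    if annotation.endswith(".bed"):
        flags.append("--in-bed")
    elif annotation.endswith(".tlf"):
        flags.append("--in-tlf")
    elif not annotation.endswith(".gtf"):
        raise ValueError("Unknown annotation format")

    # Output side: FASTA extensions add no flag.
    if records.endswith((".gtf", ".gff", ".gff3")):
        flags.append("-T")
    elif records.endswith(".bed"):
        flags.append("--bed")
    elif records.endswith(".tlf"):
        flags.append("--tlf")
    elif not records.endswith((".fasta", ".fa", ".fna")):
        raise ValueError("Unknown records format")

    return flags
```

For example, `format_flags("annotation.bed", "transcripts.gtf")` yields the two flags `--in-bed` and `-T`, while an unrecognised extension on either side raises a `ValueError`, as the wrapper does.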
Software dependencies
gffread=0.12.7
Input/Output
Input:
fasta
: Path to genome file (FASTA formatted).
annotation
: Path to genome annotation (GTF, BED or TLF formatted, detected from the file extension).
ids
: Optional path to records/transcripts to keep.
nids
: Optional path to records/transcripts to discard. Cannot be combined with ids.
seq_info
: Optional path to sequence information, a TSV-formatted text file containing <seq-name> <seq-length> <seq-description>.
sort_by
: Optional path to a text file containing the ordered list of reference sequences.
attr
: Optional path to a text file containing a comma-separated list of annotation attributes to keep. Only valid when records is written in GTF/GFF format.
chr_replace
: Optional path to a TSV-formatted text file containing <original_ref_ID> <new_ref_ID>.
Output:
records
: Path to genome sequence/annotation in the requested format, containing the requested information.
dupinfo
: Optional path to clustering/merging information.
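As an illustration of the seq_info layout described above (<seq-name> <seq-length> <seq-description>, tab-separated), the following minimal sketch derives such a table from FASTA text. The function name `fasta_to_seq_info` is hypothetical, and treating everything after the first whitespace in a header line as the description is an assumption, not something the wrapper or gffread prescribes.

```python
def fasta_to_seq_info(fasta_text):
    """Build a <seq-name>\t<seq-length>\t<seq-description> table from FASTA text.

    Hypothetical helper for illustration; assumes the whole FASTA fits in memory.
    """
    rows = []
    name, desc, length = None, "", 0

    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                rows.append((name, length, desc))
            header = line[1:].split(None, 1) or [""]
            name = header[0]
            desc = header[1] if len(header) > 1 else ""
            length = 0
        elif name is not None:
            # Sequence lines only contribute their length.
            length += len(line)

    # Flush the last record.
    if name is not None:
        rows.append((name, length, desc))

    return "".join(f"{n}\t{l}\t{d}\n" for n, l, d in rows)
```

The returned string can be written to a file and passed as the seq_info input of the rule.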
Authors
Thibault Dayris
Code
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2023, Thibault Dayris"
__mail__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

annotation = snakemake.input.annotation
records = snakemake.output.records

# Input format control
if annotation.endswith(".bed"):
    extra += " --in-bed "
elif annotation.endswith(".tlf"):
    extra += " --in-tlf "
elif annotation.endswith(".gtf"):
    pass
else:
    raise ValueError("Unknown annotation format")

# Output format control
if records.endswith((".gtf", ".gff", ".gff3")):
    extra += " -T "
elif records.endswith(".bed"):
    extra += " --bed "
elif records.endswith(".tlf"):
    extra += " --tlf "
elif records.endswith((".fasta", ".fa", ".fna")):
    pass
else:
    raise ValueError("Unknown records format")

# Optional input files
ids = snakemake.input.get("ids", "")
if ids:
    extra += f" --ids {ids} "

nids = snakemake.input.get("nids", "")
if nids:
    if ids:
        raise ValueError(
            "Provide either sequences ids to keep, or to drop."
            " Or else, an empty file is produced."
        )
    extra += f" --nids {nids} "

seq_info = snakemake.input.get("seq_info", "")
if seq_info:
    extra += f" -s {seq_info} "

sort_by = snakemake.input.get("sort_by", "")
if sort_by:
    extra += f" --sort-by {sort_by} "

attr = snakemake.input.get("attr", "")
if attr:
    if not records.endswith((".gtf", ".gff", ".gff3")):
        raise ValueError(
            "GTF attributes specified in input, "
            "but records are not in GTF/GFF format."
        )
    extra += f" --attrs {attr} "

chr_replace = snakemake.input.get("chr_replace", "")
if chr_replace:
    extra += f" -m {chr_replace} "

# Optional output files
dupinfo = snakemake.output.get("dupinfo", "")
if dupinfo:
    extra += f" -d {dupinfo} "

shell(
    "gffread {extra} "
    "-o {records} "
    "{snakemake.input.fasta} "
    "{annotation} "
    "{log} "
)
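To inspect how the optional inputs translate into command-line flags without running Snakemake, the handling above can be approximated with a plain function. This is a sketch under stated assumptions: `optional_input_flags` is a hypothetical name, and `inputs` is an ordinary dict standing in for `snakemake.input`.

```python
def optional_input_flags(inputs, records):
    """Translate optional wrapper inputs into a list of gffread arguments.

    Hypothetical helper; `inputs` is a plain dict standing in for snakemake.input.
    """
    # Keeping and dropping records at the same time is rejected, as in the wrapper.
    if inputs.get("ids") and inputs.get("nids"):
        raise ValueError("Provide either sequences ids to keep, or to drop.")

    # Attribute filtering only makes sense for GTF/GFF output.
    if inputs.get("attr") and not records.endswith((".gtf", ".gff", ".gff3")):
        raise ValueError(
            "GTF attributes specified in input, "
            "but records are not in GTF/GFF format."
        )

    # Same input-name to flag pairs, in the same order, as the wrapper code.
    mapping = [
        ("ids", "--ids"),
        ("nids", "--nids"),
        ("seq_info", "-s"),
        ("sort_by", "--sort-by"),
        ("attr", "--attrs"),
        ("chr_replace", "-m"),
    ]

    flags = []
    for key, flag in mapping:
        path = inputs.get(key, "")
        if path:
            flags += [flag, path]
    return flags
```

Returning a list rather than a concatenated string keeps each flag paired with its path, which makes the result easy to assert on or to quote before it is joined into a shell command.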