GFFREAD
Validate, filter, convert and perform various other operations on GFF/GTF files with Gffread
URL: http://ccb.jhu.edu/software/stringtie/gff.shtml
Example
This wrapper can be used in the following way:
rule test_gffread_gtf:
input:
fasta="genome.fasta",
annotation="annotation.gtf",
# ids="", # Optional path to records to keep
# nids="", # Optional path to records to drop
# seq_info="", # Optional path to sequence information
# sort_by="", # Optional path to the ordered list of reference sequences
# attr="", # Optional annotation attributes to keep.
# chr_replace="", # Optional path to <original_ref_ID> <new_ref_ID>
output:
# All optional, select based on desired output
records="annotation.gff",
transcript_fasta="transcripts.fa", # Path containing gffread -w output
cds_fasta="cds.fa", # Path containing gffread -x output
# dupinfo="", # Path to clustering/merging information
threads: 1
log:
"logs/gffread_gtf.log",
params:
fasta_flag="-w",
extra="",
wrapper:
"v5.8.0-3-g915ba34/bio/gffread"
rule test_gffread_gff:
input:
fasta="genome.fasta",
annotation="annotation.gff3",
output:
protein_fasta="proteins.fa", # Path containing gffread -y output
threads: 1
log:
"logs/gffread_gff.log",
params:
extra="-S",
wrapper:
"v5.8.0-3-g915ba34/bio/gffread"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Notes
Input/output formats are detected from their file extension.
Software dependencies
gffread=0.12.7
Input/Output
Input:
fasta
: FASTA genome file.annotation
: GTF/GTF/BED genome annotation.ids
: Records/transcript to keep (optional).nids
: Records/transcripts to discard (optional).seq_info
: Sequence information, a TSV formatted text file containing <seq-name> <seq-length> <seq-description> (optional).sort_by
: Text file containing the ordered list of reference sequences (optional).attr
: Text file containing comma-separated list of annotation attributes to keep (optional).chr_replace
: TSV-formatted text file containing <original_ref_ID> <new_ref_ID> (optional).
Output:
records
: Genome annotation (optional).transcript_fasta
: FASTA containing -w output (optional).cds_fasta
: FASTA containing -x output (optional).protein_fasta
: FASTA containing -y output (optional).dupinfo
: Clustering/merging information (optional).
Params
extra
: additional program arguments (except for –in-bed, –in-tlf, -T, –bed, –tlf, -o, -w, -x, -y, –ids, –nids, -s, –sort-by, –attrs, -m, -d, or -g).
Code
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2023, Thibault Dayris"
__mail__ = "thibault.dayris@gustaveroussy.fr"
__license__ = "MIT"
from snakemake.shell import shell
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Input files
annotation = snakemake.input.annotation
# Output files
records = snakemake.output.get("records", "")
if records:
extra += f" -o {records}"
dupinfo = snakemake.output.get("dupinfo", "")
if dupinfo:
extra += f" -d {dupinfo}"
transcript_fasta = snakemake.output.get("transcript_fasta", "")
if transcript_fasta:
extra += f" -w {transcript_fasta}"
cds_fasta = snakemake.output.get("cds_fasta", "")
if cds_fasta:
extra += f" -x {cds_fasta}"
protein_fasta = snakemake.output.get("protein_fasta", "")
if protein_fasta:
extra += f" -y {protein_fasta}"
# Input format control
if annotation.endswith(".bed"):
extra += " --in-bed"
elif annotation.endswith(".tlf"):
extra += " --in-tlf"
elif annotation.endswith((".gtf", ".gff", ".gff3")):
pass
else:
raise ValueError("Unknown annotation format")
# Output format control
if not records:
pass
elif records.endswith((".gtf", ".gff", ".gff3")):
extra += " -T"
elif records.endswith(".bed"):
extra += " --bed"
elif records.endswith(".tlf"):
extra += " --tlf"
else:
raise ValueError("Unknown records format")
# Optional input files
ids = snakemake.input.get("ids", "")
if ids:
extra += f" --ids {ids}"
nids = snakemake.input.get("nids", "")
if nids:
if ids:
raise ValueError(
"Provide either sequences ids to keep, or to drop."
" Or else, an empty file is produced."
)
extra += f" --nids {nids}"
seq_info = snakemake.input.get("seq_info", "")
if seq_info:
extra += f" -s {seq_info}"
sort_by = snakemake.input.get("sort_by", "")
if sort_by:
extra += f" --sort-by {sort_by}"
attr = snakemake.input.get("attr", "")
if attr:
if not records.endswith((".gtf", ".gff", ".gff3")):
raise ValueError(
"GTF attributes specified in input, "
"but records are not in GTF/GFF format."
)
extra += f" --attrs {attr}"
chr_replace = snakemake.input.get("chr_replace", "")
if chr_replace:
extra += f" -m {chr_replace}"
shell("gffread {extra} -g {snakemake.input.fasta} {annotation} {log}")