GDC API-BASED DATA DOWNLOAD OF BAM SLICES
Download slices of GDC BAM files using curl and the GDC API for BAM Slicing.
Example
This wrapper can be used in the following way:
rule gdc_api_bam_slice_download:
    output:
        bam="raw/{sample}.bam",
    log:
        "logs/gdc-api/bam-slicing/{sample}.log"
    params:
        # to use this rule flexibly, make uuid a function that maps your
        # sample names of choice to the UUIDs they correspond to (they are
        # the column `id` in the GDC manifest files, which can be used to
        # systematically construct sample sheets)
        uuid="092c8a6d-aad5-41bf-b186-e68e613c0e89",
        # a gdc_token is required for controlled access and all BAM files
        # on GDC seem to be controlled access (adjust if this changes)
        gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
        # provide wanted `region=` or `gencode=` slices joined with `&`
        slices="region=chr22&region=chr5:1000-2000&region=unmapped&gencode=BRCA2",
        # extra command line arguments passed to curl
        extra=""
    wrapper:
        "v3.0.2/bio/gdc-api/bam-slicing"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
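The comment on `uuid` in the rule above suggests mapping sample names to UUIDs via a GDC manifest. A minimal sketch of such a mapping could look as follows. It assumes a tab-separated manifest with `id` and `filename` columns, and it assumes that the sample name is the file name up to the first dot. The names `read_manifest`, `get_uuid` and `SAMPLE_TO_UUID` and the example manifest row are hypothetical, they are not part of the wrapper.

```python
import csv
import io


def read_manifest(handle):
    """Map sample names to the UUIDs in the manifest's `id` column."""
    uuids = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # assumption: the sample name is the file name up to the first dot
        sample = row["filename"].split(".")[0]
        uuids[sample] = row["id"]
    return uuids


# hypothetical one-entry manifest, made up for illustration; in a workflow
# you would open the downloaded manifest file instead
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "092c8a6d-aad5-41bf-b186-e68e613c0e89\tsampleA.rna_seq.bam\tmd5\t1\treleased\n"
)

SAMPLE_TO_UUID = read_manifest(io.StringIO(EXAMPLE_MANIFEST))


def get_uuid(wildcards):
    """Look up the UUID for the `sample` wildcard of the rule."""
    return SAMPLE_TO_UUID[wildcards.sample]
```

With these definitions in the Snakefile, the rule could use `uuid=get_uuid` in its `params`, as Snakemake calls such functions with the wildcards of the job.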
Notes
BAM file UUIDs can be found via the GDC repository query, either by clicking on individual files or systematically by creating a cart and downloading a manifest file.
Slicing can be performed using region syntax like ‘region=chr20:3000-4000’, gene name syntax like ‘gencode=BRCA2’ (this uses gene symbols of GENCODE v22) or ‘region=unmapped’ to get unmapped reads. Multiple such entries can be joined with ampersands (e.g. region=chr5:200-300&region=unmapped&gencode=BRCA1).
All BAM data files in GDC are controlled access according to the GDC repository query, thus a GDC access token file is always required and must be provided via params: gdc_token="path/to/access_token.txt". Should this change in the future, feel free to adjust this wrapper or contact the original author.
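The slice syntax described in the notes above can also be assembled programmatically, for example when regions and genes come from a sample sheet. The following is only a sketch: `build_slices` is a hypothetical helper that is not part of the wrapper, and it does no validation of region or gene names.

```python
def build_slices(regions=(), genes=()):
    """Join `region=` and `gencode=` entries with ampersands."""
    parts = ["region={}".format(region) for region in regions]
    parts += ["gencode={}".format(gene) for gene in genes]
    if not parts:
        raise ValueError("Provide at least one region or gene.")
    return "&".join(parts)


slices = build_slices(regions=["chr5:200-300", "unmapped"], genes=["BRCA1"])
print(slices)  # region=chr5:200-300&region=unmapped&gencode=BRCA1
```

The resulting string can be passed to the wrapper via `slices` in `params`.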
Software dependencies
curl=8.4.0
Code
__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"

import os

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# UUID of the BAM file to slice (column `id` in GDC manifest files)
uuid = snakemake.params.get("uuid", "")
if uuid == "":
    raise ValueError("You need to provide a GDC UUID via the 'uuid' in 'params'.")

# path to the file that holds the GDC data access token
token_file = snakemake.params.get("gdc_token", "")
if token_file == "":
    raise ValueError(
        "You need to provide a GDC data access token file via the 'gdc_token' in 'params'."
    )
# read the token, dropping surrounding whitespace such as a trailing newline
with open(token_file) as tf:
    token = tf.read().strip()

# hand the header to curl via an environment variable; it is double-quoted
# in the shell command below, so that it arrives as a single argument
os.environ["CURL_HEADER_TOKEN"] = "X-Auth-Token: {}".format(token)

slices = snakemake.params.get("slices", "")
if slices == "":
    raise ValueError(
        "You need to provide 'region=chr1:1000-2000' or 'gencode=BRCA2' slice(s) via the 'slices' in 'params'."
    )

extra = snakemake.params.get("extra", "")

shell(
    "curl --silent"
    ' --header "$CURL_HEADER_TOKEN"'
    " 'https://api.gdc.cancer.gov/slicing/view/{uuid}?{slices}'"
    " {extra}"
    " --output {snakemake.output.bam} {log}"
)
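# only inspect small files: an API error comes back as a short message
# that is written to the output path instead of BAM data
if os.path.getsize(snakemake.output.bam) < 100000:
    # read bytes, as a small but valid BAM slice cannot be decoded as text
    with open(snakemake.output.bam, "rb") as f:
        if b"error" in f.read():
            shell("cat {snakemake.output.bam} {log}")
            raise RuntimeError(
                "Your GDC API request returned an error, check your log file for the error message."
            )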
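The final check in the code above relies on file size and on the word "error" appearing in the output. As a possible complement, and not something the wrapper does, one could test whether the output starts with the gzip magic bytes, since BAM files are BGZF-compressed and BGZF is gzip-compatible. The sketch below uses the hypothetical helpers `looks_like_bam` and `write_temp`, and both example files are made up for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_bam(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(GZIP_MAGIC)) == GZIP_MAGIC


def write_temp(content):
    """Write bytes to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(content)
    return path


# stand-ins for a downloaded slice and for an API error response
compressed = write_temp(gzip.compress(b"BAM\x01"))
error_body = write_temp(b'{"error": "made-up message"}')

print(looks_like_bam(compressed), looks_like_bam(error_body))
```

Note that this only confirms gzip compression, it does not validate the BAM content itself.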