GDC API-BASED DATA DOWNLOAD OF BAM SLICES
Download slices of GDC BAM files using curl and the GDC API for BAM Slicing.
Example
This wrapper can be used in the following way:
rule gdc_api_bam_slice_download:
    output:
        bam="raw/{sample}.bam",
    log:
        "logs/gdc-api/bam-slicing/{sample}.log"
    params:
        # to use this rule flexibly, make uuid a function that maps your
        # sample names of choice to the UUIDs they correspond to (they are
        # the column `id` in the GDC manifest files, which can be used to
        # systematically construct sample sheets)
        uuid="092c8a6d-aad5-41bf-b186-e68e613c0e89",
        # a gdc_token is required for controlled access and all BAM files
        # on GDC seem to be controlled access (adjust if this changes)
        gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
        # provide wanted `region=` or `gencode=` slices joined with `&`
        slices="region=chr22&region=chr5:1000-2000&region=unmapped&gencode=BRCA2",
        # extra command line arguments passed to curl
        extra=""
    wrapper:
        "v4.6.0/bio/gdc-api/bam-slicing"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
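The comment on `uuid` in the rule above suggests mapping sample names to UUIDs. A minimal sketch of such a mapping follows. It assumes a tab-separated sample sheet that pairs a hypothetical `sample` column with the `id` column from a GDC manifest. The function name `load_uuid_map` and the sample name `tumor_A` are illustrative and not part of the wrapper.

```python
import csv
import io


def load_uuid_map(handle):
    """Read a tab-separated sample sheet and return {sample name: GDC UUID}.

    Assumes a header line with a hypothetical `sample` column and the `id`
    column taken from a GDC manifest file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row["id"] for row in reader}


# Inline sample sheet for illustration; in practice this would be a file.
sheet = io.StringIO(
    "sample\tid\n"
    "tumor_A\t092c8a6d-aad5-41bf-b186-e68e613c0e89\n"
)
uuid_map = load_uuid_map(sheet)
print(uuid_map["tumor_A"])
```

In the rule, the fixed UUID string could then be replaced by a function of the wildcards, for example `uuid=lambda wildcards: uuid_map[wildcards.sample]`.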
Notes
BAM file UUIDs can be found via the GDC repository query, either by clicking on individual files or systematically by creating a cart and downloading a manifest file.
Slicing can be performed using region syntax like 'region=chr20:3000-4000', gene name syntax like 'gencode=BRCA2' (this uses gene symbols of GENCODE v22) or 'region=unmapped' to get unmapped reads. Multiple such entries can be joined with ampersands (e.g. region=chr5:200-300&region=unmapped&gencode=BRCA1).
All BAM data files in GDC are controlled access according to the GDC repository query, so a GDC access token file is always required and must be provided via the gdc_token entry under params (e.g. gdc_token="path/to/access_token.txt"). Should this change in the future, feel free to adjust this wrapper or contact the original author.
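The slicing syntax described in the notes above can be assembled programmatically. The following is a small sketch, assuming the regions and gene symbols are already available as Python lists. The helper name `build_slices` is hypothetical and not part of the wrapper.

```python
def build_slices(regions=(), genes=()):
    """Join `region=` and `gencode=` entries with ampersands.

    `regions` holds strings such as "chr5:200-300" or "unmapped";
    `genes` holds gene symbols such as "BRCA1".
    """
    parts = [f"region={region}" for region in regions]
    parts += [f"gencode={gene}" for gene in genes]
    if not parts:
        raise ValueError("Provide at least one region or gene symbol.")
    return "&".join(parts)


print(build_slices(regions=["chr5:200-300", "unmapped"], genes=["BRCA1"]))
# region=chr5:200-300&region=unmapped&gencode=BRCA1
```

The resulting string is what the `slices` entry under `params` expects.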
Software dependencies
curl=8.10.1
Code
__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"

from snakemake.shell import shell
import os

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# the UUID of the BAM file to slice is mandatory
uuid = snakemake.params.get("uuid", "")
if uuid == "":
    raise ValueError("You need to provide a GDC UUID via the 'uuid' in 'params'.")

# the access token file is mandatory as well
token_file = snakemake.params.get("gdc_token", "")
if token_file == "":
    raise ValueError(
        "You need to provide a GDC data access token file via the 'gdc_token' in 'params'."
    )

# expose the auth header to the shell via an environment variable
token = ""
with open(token_file) as tf:
    token = tf.read()
os.environ["CURL_HEADER_TOKEN"] = "'X-Auth-Token: {}'".format(token)

slices = snakemake.params.get("slices", "")
if slices == "":
    raise ValueError(
        "You need to provide 'region=chr1:1000-2000' or 'gencode=BRCA2' slice(s) via the 'slices' in 'params'."
    )

extra = snakemake.params.get("extra", "")

shell(
    "curl --silent"
    " --header $CURL_HEADER_TOKEN"
    " 'https://api.gdc.cancer.gov/slicing/view/{uuid}?{slices}'"
    " {extra}"
    " --output {snakemake.output.bam} {log}"
)

# for outputs below 100000 bytes, check whether the file contains an
# error message instead of BAM data
if os.path.getsize(snakemake.output.bam) < 100000:
    with open(snakemake.output.bam) as f:
        if "error" in f.read():
            shell("cat {snakemake.output.bam} {log}")
            raise RuntimeError(
                "Your GDC API request returned an error, check your log file for the error message."
            )
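The size-and-content check at the end of the wrapper can be tried out in isolation. The following is a sketch rather than the wrapper's own code. The helper name `looks_like_api_error` and the example file contents are made up for illustration. Unlike the text-mode read above, it opens the file in binary mode, on the assumption that a small file with non-text content might otherwise raise a decoding error.

```python
import os
import tempfile


def looks_like_api_error(path, max_size=100000):
    """Return True if a file below `max_size` bytes contains b"error"."""
    if os.path.getsize(path) >= max_size:
        return False
    with open(path, "rb") as handle:
        return b"error" in handle.read()


with tempfile.TemporaryDirectory() as tmp:
    # a small text file standing in for a failed download (made-up payload)
    failed = os.path.join(tmp, "failed.bam")
    with open(failed, "w") as handle:
        handle.write('{"error": "example message"}')

    # a small binary file standing in for a successful download
    binary = os.path.join(tmp, "ok.bam")
    with open(binary, "wb") as handle:
        handle.write(b"\x1f\x8b\x08\x04" + b"\x00" * 16)

    failed_result = looks_like_api_error(failed)
    binary_result = looks_like_api_error(binary)

print(failed_result, binary_result)
```

As in the wrapper, files at or above the size threshold are not inspected, so the check stays cheap for large slices.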