GDC DATA TRANSFER TOOL DATA DOWNLOAD
Download GDC data files with the gdc-client.
Example
This wrapper can be used in the following way:
rule gdc_download:
    output:
        # the file extension (up to two components, here .maf.gz) has
        # to map uniquely to one of the files downloaded for that UUID
        "raw/{sample}.maf.gz"
    log:
        "logs/gdc-client/download/{sample}.log"
    params:
        # to use this rule flexibly, make uuid a function that maps your
        # sample names of choice to the UUIDs they correspond to (they are
        # the column `id` in the GDC manifest files, which can be used to
        # systematically construct sample sheets)
        uuid="34b80c89-c41e-47be-84fb-0c0ea493b5bb",
        # a gdc_token is only required for controlled access samples,
        # leave blank otherwise (`gdc_token=""`) or skip this param entirely
        gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
        # for valid extra command line arguments, check command line help or:
        # https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/
        extra="",
    threads: 4
    wrapper:
        "v3.0.2/bio/gdc-client/download"
rule gdc_download_bam:
    output:
        # specify all the downloaded files you want to keep, as all other
        # downloaded files will be removed automatically; e.g. for
        # BAM data this could be:
        "raw/{sample}.bam",
        "raw/{sample}.bam.bai",
        "raw/{sample}.annotations.txt",
        directory("raw/{sample}/logs")
    log:
        "logs/gdc-client/download/{sample}.log"
    params:
        # to use this rule flexibly, make uuid a function that maps your
        # sample names of choice to the UUIDs they correspond to (they are
        # the column `id` in the GDC manifest files, which can be used to
        # systematically construct sample sheets)
        uuid="34b80c89-c41e-47be-84fb-0c0ea493b5bb",
        # a gdc_token is only required for controlled access samples,
        # leave blank otherwise (`gdc_token=""`) or skip this param entirely
        gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
        # for valid extra command line arguments, check command line help or:
        # https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/
        extra="",
    threads: 4
    wrapper:
        "v3.0.2/bio/gdc-client/download"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
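The rule comments suggest making uuid a function that maps sample names to the `id` column of a GDC manifest. A minimal sketch of that idea follows. The names `load_uuid_map`, `get_uuid` and `EXAMPLE_MANIFEST` are hypothetical, the manifest content is made up for illustration, and deriving the sample name from the file name up to the first dot is an assumption you would replace with your own naming scheme.

```python
import csv
import io


def load_uuid_map(manifest_handle):
    """Map a sample name to its GDC UUID, read from a tab-separated manifest.

    Assumption: the sample name is the `filename` column up to the first dot.
    """
    reader = csv.DictReader(manifest_handle, delimiter="\t")
    return {row["filename"].split(".")[0]: row["id"] for row in reader}


# Hypothetical one-entry manifest, inlined so the sketch runs without a file.
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "34b80c89-c41e-47be-84fb-0c0ea493b5bb\tsample_a.maf.gz\tabc\t123\treleased\n"
)

UUIDS = load_uuid_map(io.StringIO(EXAMPLE_MANIFEST))


def get_uuid(wildcards):
    """Params function: look up the UUID for the `sample` wildcard."""
    return UUIDS[wildcards.sample]
```

With such a function defined in the Snakefile, the rule would use `uuid=get_uuid` in place of the hard-coded string, and Snakemake would call it with the wildcards of each job.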
Software dependencies
gdc-client=1.6.1
Code
__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"

from snakemake.shell import shell
import os.path as path
from tempfile import TemporaryDirectory
import glob

uuid = snakemake.params.get("uuid", "")
if uuid == "":
    raise ValueError("You need to provide a GDC UUID via the 'uuid' in 'params'.")

extra = snakemake.params.get("extra", "")
# a token file is only passed on if one was provided
token = snakemake.params.get("gdc_token", "")
if token != "":
    token = "--token-file {}".format(token)

# download into a temporary directory, so that any file not
# requested as output is removed automatically afterwards
with TemporaryDirectory() as tempdir:
    shell(
        "gdc-client download"
        " {token}"
        " {extra}"
        " -n {snakemake.threads} "
        " --log-file {snakemake.log} "
        " --dir {tempdir}"
        " {uuid}"
    )

    for out_path in snakemake.output:
        # first try an exact match on the file (or directory) name
        tmp_path = path.join(tempdir, uuid, path.basename(out_path))
        if not path.exists(tmp_path):
            # otherwise match on the last file extension component
            (root, ext1) = path.splitext(out_path)
            paths = glob.glob(path.join(tempdir, uuid, "*" + ext1))
            if len(paths) > 1:
                # ambiguous, so retry with the last two components
                (root, ext2) = path.splitext(root)
                paths = glob.glob(path.join(tempdir, uuid, "*" + ext2 + ext1))
            if len(paths) == 0:
                raise ValueError(
                    "{} file extension {} does not match any downloaded file.\n"
                    "Are you sure that UUID {} provides a file of such format?\n".format(
                        out_path, ext1, uuid
                    )
                )
            if len(paths) > 1:
                raise ValueError(
                    "Found more than one downloaded file with extension '{}':\n"
                    "{}\n"
                    "Cannot match requested output file {} unambiguously.\n".format(
                        ext2 + ext1, paths, out_path
                    )
                )
            tmp_path = paths[0]
        shell("mv {tmp_path} {out_path}")
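To illustrate how a requested output is matched to a downloaded file, the following sketch isolates that logic in a standalone function and runs it on two empty placeholder files. The function name `match_download` and the file names are hypothetical, and the two error cases of the wrapper are merged into a single check for brevity, so this is an approximation of the matching step rather than the wrapper code itself.

```python
import glob
import os.path as path
from tempfile import TemporaryDirectory


def match_download(download_dir, out_path):
    """Return the file in download_dir that corresponds to out_path."""
    # An exact match on the file name wins.
    exact = path.join(download_dir, path.basename(out_path))
    if path.exists(exact):
        return exact
    # Otherwise match on the last extension component, e.g. ".gz".
    root, ext1 = path.splitext(out_path)
    suffix = ext1
    paths = glob.glob(path.join(download_dir, "*" + suffix))
    if len(paths) > 1:
        # Ambiguous: retry with the last two components, e.g. ".maf.gz".
        _, ext2 = path.splitext(root)
        suffix = ext2 + ext1
        paths = glob.glob(path.join(download_dir, "*" + suffix))
    if len(paths) != 1:
        raise ValueError(
            "Expected exactly one file ending in '{}', found: {}".format(
                suffix, paths
            )
        )
    return paths[0]


with TemporaryDirectory() as tmp:
    # Two hypothetical downloads that share the final ".gz" component.
    for name in ("abc.maf.gz", "abc.vcf.gz"):
        open(path.join(tmp, name), "w").close()
    matched_name = path.basename(match_download(tmp, "raw/sample_a.maf.gz"))
```

Here `.gz` alone matches both files, so the second component is added and `.maf.gz` selects a single file. This is why the output file extension, with up to two components, has to identify exactly one downloaded file.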