GDC DATA TRANSFER TOOL DATA DOWNLOAD

Download GDC data files with the gdc-client.

Example

This wrapper can be used in the following way:

rule gdc_download:
    output:
        # the file extension (up to two components, here .maf.gz) has to
        # uniquely match one of the files downloaded for that UUID
        "raw/{sample}.maf.gz"
    log:
        "logs/gdc-client/download/{sample}.log"
    params:
        # to use this rule flexibly, make uuid a function that maps your
        # sample names of choice to the UUIDs they correspond to (UUIDs are
        # listed in the `id` column of GDC manifest files, which can be used
        # to systematically construct sample sheets)
        uuid="34b80c89-c41e-47be-84fb-0c0ea493b5bb",
        # a gdc_token is only required for controlled-access samples;
        # otherwise leave it blank (`gdc_token=""`) or skip this param entirely
        gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
        # for valid extra command line arguments, check the command line help or:
        # https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/
        extra=""
    threads: 4
    wrapper:
        "v1.9.0/bio/gdc-client/download"

rule gdc_download_bam:
    output:
        # specify all the downloaded files you want to keep; all other
        # downloaded files will be removed automatically. For BAM data,
        # this could be:
        "raw/{sample}.bam",
        "raw/{sample}.bam.bai",
        "raw/{sample}.annotations.txt",
        directory("raw/{sample}/logs")
    log:
        "logs/gdc-client/download/{sample}.log"
    params:
        # to use this rule flexibly, make uuid a function that maps your
        # sample names of choice to the UUIDs they correspond to (UUIDs are
        # listed in the `id` column of GDC manifest files, which can be used
        # to systematically construct sample sheets)
        uuid="34b80c89-c41e-47be-84fb-0c0ea493b5bb",
        # a gdc_token is only required for controlled-access samples;
        # otherwise leave it blank (`gdc_token=""`) or skip this param entirely
        gdc_token="gdc/gdc-user-token.2020-05-07T10_00_00.555Z.txt",
        # for valid extra command line arguments, check the command line help or:
        # https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/
        extra=""
    threads: 4
    wrapper:
        "v1.9.0/bio/gdc-client/download"

Note that input, output and log file paths can be chosen freely.
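
The comments in the rules above suggest making uuid a function that maps sample names to UUIDs taken from a GDC manifest. A minimal sketch of that idea follows. It assumes a tab-separated manifest with `id` and `filename` columns, and it assumes, purely for illustration, that the sample name is the part of the file name before the first dot. The helper names `load_uuid_map` and `get_uuid`, and the manifest row itself, are hypothetical and not part of the wrapper.

```python
import csv
import io
from types import SimpleNamespace


def load_uuid_map(manifest_lines):
    """Map sample names to UUIDs using the `id` and `filename` manifest columns."""
    reader = csv.DictReader(manifest_lines, delimiter="\t")
    # Assumption: the sample name is the file name up to the first dot.
    return {row["filename"].split(".")[0]: row["id"] for row in reader}


# Made-up single-entry manifest; in a workflow this would be an open file handle.
example_manifest = io.StringIO(
    "id\tfilename\tmd5\tsize\tstate\n"
    "34b80c89-c41e-47be-84fb-0c0ea493b5bb\tsample_a.maf.gz\t0123abcd\t1024\treleased\n"
)
uuid_by_sample = load_uuid_map(example_manifest)


def get_uuid(wildcards):
    """Params function: Snakemake passes the rule's wildcards as first argument."""
    return uuid_by_sample[wildcards.sample]


# Stand-in for the wildcards object that Snakemake would provide.
print(get_uuid(SimpleNamespace(sample="sample_a")))
```

In a Snakefile, such a function would be passed as `uuid=get_uuid` in `params`, so that each value of `{sample}` resolves to its own UUID.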

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Software dependencies

  • gdc-client==1.5.0

Authors

  • David Lähnemann

Code

__author__ = "David Lähnemann"
__copyright__ = "Copyright 2020, David Lähnemann"
__email__ = "david.laehnemann@uni-due.de"
__license__ = "MIT"

from snakemake.shell import shell
import os.path as path
from tempfile import TemporaryDirectory
import glob

uuid = snakemake.params.get("uuid", "")
if uuid == "":
    raise ValueError("You need to provide a GDC UUID via the 'uuid' in 'params'.")

extra = snakemake.params.get("extra", "")
token = snakemake.params.get("gdc_token", "")
if token != "":
    token = "--token-file {}".format(token)

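# gdc-client places the downloaded files in a subdirectory named after the
# UUID, so download into a temporary directory and pick out the requested files
with TemporaryDirectory() as tempdir:
    shell(
        "gdc-client download"
        " {token}"
        " {extra}"
        " -n {snakemake.threads}"
        " --log-file {snakemake.log}"
        " --dir {tempdir}"
        " {uuid}"
    )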

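    for out_path in snakemake.output:
        # first, look for a downloaded file with exactly the requested file name
        tmp_path = path.join(tempdir, uuid, path.basename(out_path))
        if not path.exists(tmp_path):
            # otherwise, match by the last file extension component (e.g. .gz)
            (root, ext1) = path.splitext(out_path)
            paths = glob.glob(path.join(tempdir, uuid, "*" + ext1))
            if len(paths) > 1:
                # if ambiguous, narrow down using two components (e.g. .maf.gz)
                (root, ext2) = path.splitext(root)
                paths = glob.glob(path.join(tempdir, uuid, "*" + ext2 + ext1))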
            if len(paths) == 0:
                raise ValueError(
                    "{} file extension {} does not match any downloaded file.\n"
                    "Are you sure that UUID {} provides a file of such format?\n".format(
                        out_path, ext1, uuid
                    )
                )
            if len(paths) > 1:
                raise ValueError(
                    "Found more than one downloaded file with extension '{}':\n"
                    "{}\n"
                    "Cannot match requested output file {} unambiguously.\n".format(
                        ext2 + ext1, paths, out_path
                    )
                )
            tmp_path = paths[0]
        shell("mv {tmp_path} {out_path}")
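
The way the loop above matches requested output files to downloaded files can be tried out without downloading anything. The following is a simplified sketch of the same logic that works on a plain list of file names instead of globbing a directory. The function name `match_download` and all file names are hypothetical, and suffix comparison stands in for the wrapper's glob patterns.

```python
import os.path as path


def match_download(out_path, downloaded):
    """Return the downloaded file name that corresponds to out_path."""
    base = path.basename(out_path)
    # An exact file name match takes precedence.
    if base in downloaded:
        return base
    # Otherwise match on the last extension component, e.g. ".gz".
    root, ext1 = path.splitext(base)
    suffix = ext1
    candidates = [name for name in downloaded if name.endswith(suffix)]
    if len(candidates) > 1:
        # Ambiguous: narrow down with two components, e.g. ".maf.gz".
        _, ext2 = path.splitext(root)
        suffix = ext2 + ext1
        candidates = [name for name in downloaded if name.endswith(suffix)]
    if len(candidates) != 1:
        raise ValueError(
            "Extension '{}' matches {} downloaded files, expected exactly one.".format(
                suffix, len(candidates)
            )
        )
    return candidates[0]


# Two files share ".gz", so the two-component extension ".maf.gz" decides.
print(match_download("raw/s1.maf.gz", ["a.maf.gz", "a.vcf.gz"]))
# ".bai" is already unique among these names.
print(match_download("raw/s1.bam.bai", ["x.bam", "x.bam.bai", "annotations.txt"]))
```

If two downloaded files still share the same two-component extension, the sketch raises a ValueError, mirroring the wrapper's refusal to pick one of them arbitrarily.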