GATK SPLITINTERVALS

This tool takes intervals via the standard arguments of IntervalArgumentCollection and splits them into interval files for scattering. The resulting files contain an equal number of bases. Standard GATK engine arguments include -L and -XL, interval padding, and the interval set rule. For example, the -L argument accepts GATK-style intervals (.list or .intervals), BED files and VCF files. See the --subdivision-mode parameter for more options.

URL: https://gatk.broadinstitute.org/hc/en-us/articles/9570513631387-SplitIntervals

Example

This wrapper can be used in the following way:

rule gatk_split_interval_list:
    input:
        intervals="genome.interval_list",
        ref="genome.fasta",
    output:
        bed=multiext("out/genome", ".00.bed", ".01.bed", ".02.bed"),
    log:
        "logs/genome.log",
    params:
        extra="--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW",
        java_opts="",  # optional
    resources:
        mem_mb=1024,
    wrapper:
        "v3.9.0-1-gc294552/bio/gatk/splitintervals"

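Note that input, output and log file paths can be chosen freely, provided that the output files share a common prefix and extension and differ only in a numeric label of equal width (the wrapper code asserts this).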

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Notes

  • The java_opts param allows additional arguments to be passed to the Java virtual machine, e.g. "-XX:ParallelGCThreads=10" (but not -Xmx or -Djava.io.tmpdir, since they are handled automatically).

  • The extra param allows additional program arguments, but not --scatter-count, --output, --interval-file-prefix, --interval-file-num-digits, or --extension (these are automatically inferred from the output files).
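
As a rough illustration of how these arguments are inferred from the output files, the sketch below mirrors the logic of the wrapper code in a standalone function. The helper name infer_split_options and the returned dictionary keys are hypothetical and chosen only for this example; the string handling follows the wrapper itself (Python 3.9 or later is assumed for removeprefix and removesuffix).

```python
import os
from pathlib import Path


def infer_split_options(outputs):
    """Hypothetical helper: derive SplitIntervals options from output paths."""
    # Longest character-wise prefix and suffix shared by all output paths
    prefix = os.path.commonprefix(outputs)
    suffix = os.path.commonprefix([name[::-1] for name in outputs])[::-1]
    # Whatever differs between the files is treated as the chunk label
    labels = [name.removeprefix(prefix).removesuffix(suffix) for name in outputs]
    if not all(label.isnumeric() for label in labels):
        raise ValueError("all chunk labels have to be numeric!")
    widths = {len(label) for label in labels}
    if len(widths) != 1:
        raise ValueError("all chunk labels must have the same length!")
    return {
        "scatter_count": len(outputs),
        "output": str(Path(prefix).parent),
        "interval_file_prefix": Path(prefix).name,
        "interval_file_num_digits": widths.pop(),
        "extension": suffix,
    }


options = infer_split_options(
    ["out/genome.00.bed", "out/genome.01.bed", "out/genome.02.bed"]
)
for name, value in options.items():
    print(name, value)
```

Because the common prefix is computed character by character, the leading zero shared by the three labels in the example rule becomes part of the prefix: the sketch yields the prefix genome.0, a label width of 1 and the extension .bed, which still reproduces the requested file names.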

Software dependencies

  • gatk4=4.5.0.0

  • snakemake-wrapper-utils=0.6.2

Input/Output

Input:

  • Intervals/BED file

Output:

  • Several Intervals/BED files
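
Since the number of output files determines the scatter count, the output list can also be generated rather than typed by hand. The following is a minimal sketch, assuming zero-based, zero-padded labels as in the example rule above; the helper name scatter_outputs and its parameters are hypothetical and not part of the wrapper.

```python
def scatter_outputs(prefix, scatter_count, digits=2, extension=".bed"):
    """Hypothetical helper: build zero-padded output names for N chunks."""
    return [
        f"{prefix}.{index:0{digits}d}{extension}" for index in range(scatter_count)
    ]


# Reproduces the three output files of the example rule
print(scatter_outputs("out/genome", 3))
```

In a Snakefile, the returned list could be passed directly to the output directive in place of the multiext(...) call.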

Authors

  • Filipe G. Vieira

Code

__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2022, Filipe G. Vieira"
__license__ = "MIT"

import os
import tempfile
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n_out_files = len(snakemake.output)
assert n_out_files > 1, "you need to specify at least 2 output files!"

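# Longest character-wise prefix shared by all outputs (directory plus file name stem)
prefix = Path(os.path.commonprefix(snakemake.output))
# Longest shared suffix, found via the common prefix of the reversed names
suffix = os.path.commonprefix([file[::-1] for file in snakemake.output])[::-1]
# What remains between prefix and suffix is the per-file chunk label
chunk_labels = [
    out.removeprefix(str(prefix)).removesuffix(suffix) for out in snakemake.output
]
assert all(
    [chunk_label.isnumeric() for chunk_label in chunk_labels]
), "all chunk labels have to be numeric!"
# All labels must have the same width; that width becomes the number of digits
len_chunk_labels = set([len(chunk_label) for chunk_label in chunk_labels])
assert len(len_chunk_labels) == 1, "all chunk labels must have the same length!"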

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "gatk --java-options '{java_opts}' SplitIntervals"
        " --intervals {snakemake.input.intervals}"
        " --reference {snakemake.input.ref}"
        " --scatter-count {n_out_files}"
        " {extra}"
        " --tmp-dir {tmpdir}"
        " --output {prefix.parent}"
        " --interval-file-prefix {prefix.name:q}"
        " --interval-file-num-digits {len_chunk_labels}"
        " --extension {suffix:q}"
        " {log}"
    )