RASUSA

Randomly subsample sequencing reads to a specified coverage.

URL: https://github.com/mbhall88/rasusa

Example

This wrapper can be used in the following way:

rule subsample:
    input:
        r1="{sample}.r1.fq",
        r2="{sample}.r2.fq",
    output:
        r1="{sample}.subsampled.r1.fq",
        r2="{sample}.subsampled.r2.fq",
    params:
        extra="--seed 15",  # optional additional rasusa arguments
        genome_size="3mb",  # required, unless `bases` is given
        coverage=20,  # required, unless `bases` is given
        #bases="2gb"
    log:
        "logs/subsample/{sample}.log",
    wrapper:
        "v4.6.0/bio/rasusa"

Note that input, output and log file paths can be chosen freely.

When running with

snakemake --use-conda

the software dependencies will be automatically deployed into an isolated environment before execution.

Software dependencies

  • rasusa=2.1.0

Input/Output

Input:

  • Reads to subsample in FASTA/Q format. Input files can be named or unnamed.

Output:

  • File paths to write subsampled reads to. If using paired-end data, make sure there are two output files in the same order as the input.

Params

  • bases: Explicitly set the number of bases to keep, e.g., 4.3kb, 7Tb, 9000, 4.1MB.
    If this option is given, coverage and genome_size are ignored.

  • coverage: The desired coverage to subsample the reads to.
    If bases is not provided, this option and genome_size are required.

  • genome_size: Genome size to calculate coverage with respect to, e.g., 4.3kb, 7Tb, 9000, 4.1MB.
    Alternatively, a FASTA/Q index file can be provided and the genome size will be set to the sum of all reference sequences.
    If bases is not provided, this option and coverage are required.

  • extra: Additional program arguments.
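
The relationship between these parameters is simple arithmetic: the number of bases kept is roughly genome_size multiplied by coverage. The sketch below illustrates this. It assumes size strings use case-insensitive metric suffixes (powers of 1000), and `parse_size` and `target_bases` are hypothetical helpers written for illustration; they are not part of rasusa or of this wrapper.

```python
import re

# Assumed metric multipliers (powers of 1000), matched case-insensitively.
_MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(size):
    """Convert a size such as '4.3kb', '7Tb', 9000 or '4.1MB' to a number of bases."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?)b?\s*", str(size).lower())
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    number, prefix = match.groups()
    return round(float(number) * _MULTIPLIERS[prefix])


def target_bases(genome_size, coverage):
    """Approximate number of bases needed to reach `coverage` over `genome_size`."""
    return round(parse_size(genome_size) * coverage)


print(target_bases("3mb", 20))  # 60000000
```

Under these assumptions, the example rule above (genome_size="3mb", coverage=20) keeps about 60 million bases, so passing bases="60mb" instead should request a comparable amount of data.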

Authors

  • Michael Hall

Code

__author__ = "Michael Hall"
__copyright__ = "Copyright 2020, Michael Hall"
__email__ = "michael@mbh.sh"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

bases = snakemake.params.get("bases")
if bases:
    extra += f" --bases {bases}"
else:
    covg = snakemake.params.get("coverage")
    gsize = snakemake.params.get("genome_size")
    if covg is None or gsize is None:
        raise ValueError(
            "If `bases` is not given, then `coverage` and `genome_size` must be provided"
        )
    extra += f" --genome-size {gsize} --coverage {covg}"

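# One --output flag per output file, so single-end (one file) and
# paired-end (two files) data are both handled.
outputs = " ".join(f"--output {out}" for out in snakemake.output)

shell("rasusa reads {extra} {outputs} {snakemake.input} {log}")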
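
To check which command a given set of params would produce without running Snakemake, the wrapper's logic can be mirrored in a standalone function. This is a minimal sketch, not the wrapper itself: `build_rasusa_command` is a hypothetical name, plain lists and a dict stand in for the `snakemake` object, and emitting one `--output` flag per output file is an assumption that generalizes the two-file case.

```python
def build_rasusa_command(inputs, outputs, params):
    """Return the argument list the wrapper logic would produce (illustrative only)."""
    args = ["rasusa", "reads"]

    # Free-form additional arguments, e.g. "--seed 15".
    extra = params.get("extra", "")
    if extra:
        args += extra.split()

    bases = params.get("bases")
    if bases:
        # `bases` takes precedence; coverage and genome_size are ignored.
        args += ["--bases", str(bases)]
    else:
        covg = params.get("coverage")
        gsize = params.get("genome_size")
        if covg is None or gsize is None:
            raise ValueError(
                "If `bases` is not given, then `coverage` and `genome_size` must be provided"
            )
        args += ["--genome-size", str(gsize), "--coverage", str(covg)]

    # Outputs must be listed in the same order as the inputs.
    for path in outputs:
        args += ["--output", path]
    args += list(inputs)
    return args


cmd = build_rasusa_command(
    inputs=["a.r1.fq", "a.r2.fq"],
    outputs=["a.subsampled.r1.fq", "a.subsampled.r2.fq"],
    params={"extra": "--seed 15", "genome_size": "3mb", "coverage": 20},
)
print(" ".join(cmd))
```

The file names here are placeholders. Omitting both `bases` and one of `coverage` or `genome_size` raises the same `ValueError` as the wrapper.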