MILLER
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
URL: https://miller.readthedocs.io/en/6.13.0/
Example
This wrapper can be used in the following way:
### Cat ###
rule test_miller_cat:
    input:
        "table.csv",
        "table2.csv",
    output:
        "miller/cat.tsv",
    log:
        "logs/cat.log",
    params:
        extra="cat",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Summary ###
rule test_miller_summary_csv:
    input:
        table="table.csv",
    output:
        "miller/summary.tsv",
    log:
        "logs/summary_csv.log",
    params:
        extra="summary",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"


rule test_miller_summary_tsv:
    input:
        table="table.tsv",
    output:
        "miller/summary.csv",
    log:
        "logs/summary_tsv.log",
    params:
        extra="summary",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Histogram ###
rule test_miller_histogram:
    input:
        table="table.csv",
    output:
        "miller/histogram.tsv",
    log:
        "logs/histogram.log",
    params:
        extra="histogram -f s1,s2 --auto --nbins 3",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Join ###
rule test_miller_join:
    input:
        csv1="right.csv",
        csv2="table.csv",
    output:
        "miller/join.csv",
    log:
        "logs/join.log",
    params:
        extra=lambda w, input: f"join -f {input.csv2} -u -j gene_id",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Sample ###
rule test_miller_sample:
    input:
        table="table.csv",
    output:
        "miller/sample.csv",
    log:
        "logs/sample.log",
    params:
        extra="sample -k 3",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Grep ###
rule test_miller_grep:
    input:
        table="table.csv",
    output:
        "miller/grep.csv",
    log:
        "logs/grep.log",
    params:
        extra="grep -i gene_id=ENSG01",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Cut ###
rule test_miller_cut:
    input:
        table="table.csv",
    output:
        "miller/cut.csv",
    log:
        "logs/cut.log",
    params:
        extra="cut -f gene_id",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Sort ###
rule test_miller_sort:
    input:
        table="table.csv",
    output:
        "miller/sort.csv",
    log:
        "logs/sort.log",
    params:
        extra="sort -r gene_id",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Split ###
rule test_miller_split:
    input:
        table="table.csv",
    output:
        "miller/split_1.csv",
        "miller/split_2.csv",
    log:
        "logs/split.log",
    params:
        extra=lambda w, output: f"split -m {len(output)}",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Uniq ###
rule test_miller_uniq:
    input:
        table="table.csv",
    output:
        "miller/uniq.tsv",
    log:
        "logs/uniq.log",
    params:
        extra="uniq -g gene_id",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
### Pipe ###
rule test_miller_pipe:
    input:
        table="table.csv",
    output:
        "miller/pipe.tsv",
    log:
        "logs/pipe.log",
    params:
        extra="summary then sort -nr max",
    threads: 2
    wrapper:
        "v9.8.0/utils/miller"
Note that input, output and log file paths can be chosen freely.
When running with
snakemake --use-conda
the software dependencies will be automatically deployed into an isolated environment before execution.
Software dependencies
miller=6.18.1
snakemake-wrapper-utils=0.8.0
Input/Output
Input:
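Path to input file(s). The wrapper passes only the first file to Miller unless the verb is cat, in which case all input files are used. The input format is inferred from the file extension.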
Output:
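Path to output file. The output format is inferred from the file extension. With split, list every expected output file; the wrapper passes their common prefix to Miller as --prefix.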
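The wrapper code listed under Code derives Miller's format flags from the file extensions of the first input and the first output. The following is a minimal sketch of that lookup in plain Python, not the wrapper itself: `guess_format` and `infer_io_opts` are hypothetical names introduced here, and `guess_format` is only an assumed stand-in for `get_format` from snakemake-wrapper-utils (it is assumed to drop one compression suffix and return the remaining extension). Compression handling is left out of the sketch.

```python
from pathlib import Path

# Compression suffixes that are skipped when looking for the data format.
COMPRESS_SUFFIXES = (".gz", ".bgz", ".bz2", ".xz")

# Miller I/O formats, copied from the io_formats list in the wrapper code.
IO_FORMATS = (
    "asv", "asvlite", "csv", "csvlite", "tsv", "tsvlite", "json", "jsonl",
    "md", "markdown", "nidx", "pprint", "usv", "usvlite", "xtab", "dkvp",
)


def guess_format(path):
    """Hypothetical stand-in for get_format: extension minus compression suffix."""
    suffixes = Path(path).suffixes
    if suffixes and suffixes[-1] in COMPRESS_SUFFIXES:
        suffixes = suffixes[:-1]
    return suffixes[-1].lstrip(".").lower() if suffixes else ""


def infer_io_opts(input_path, output_path):
    """Build the --i<format> and --o<format> flags from two file names."""
    opts = ""
    in_format = guess_format(input_path)
    if in_format in IO_FORMATS:
        opts += f" --i{in_format}"
    out_format = guess_format(output_path)
    if out_format in IO_FORMATS:
        opts += f" --o{out_format}"
    return opts


print(infer_io_opts("table.csv", "miller/cat.tsv"))
```

For the Cat example this yields the flags `--icsv --otsv`. An extension that is not in the format list adds no flag, so in that case the format would have to be given through `extra`.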
Params
extra: Miller verb and its arguments, for example cat or sort -r gene_id. Verbs can be chained with then, as in the Pipe example.
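For the split verb, the wrapper appends a `--prefix` option to `extra`, computed from the common prefix of all output paths. The sketch below reproduces that computation outside Snakemake; `split_extra` is a hypothetical helper name used only for illustration.

```python
import os


def split_extra(outputs):
    """Mimic how extra is extended for split: count outputs, derive the prefix."""
    # Same expression as in the Split example rule: one bucket per output file.
    extra = f"split -m {len(outputs)}"
    # Character-wise common prefix, with trailing underscores removed.
    prefix = os.path.commonprefix(outputs).rstrip("_")
    return f"{extra} --prefix {prefix}"


print(split_extra(["miller/split_1.csv", "miller/split_2.csv"]))
# split -m 2 --prefix miller/split
```

Because `os.path.commonprefix` compares character by character, output names such as `split_1.csv` and `split_10.csv` would share the prefix `split_1`, so this sketch suggests keeping the numeric parts of the output names distinct from their first character onward.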
Code
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2024, Filipe G. Vieira"
__license__ = "MIT"
import os
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.snakemake import get_format, is_arg
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")
compress_formats = {
    ".gz": "gzip",
    ".bgz": "gzip",
    ".bz2": "bzip2",
    ".xz": "xz",
}
io_formats = [
    "asv",
    "asvlite",
    "csv",
    "csvlite",
    "tsv",
    "tsvlite",
    "json",
    "jsonl",
    "md",
    "markdown",
    "nidx",
    "pprint",
    "usv",
    "usvlite",
    "xtab",
    "dkvp",
]
io_opts = ""
### INPUT
inputs = Path(snakemake.input[0])
# Compressed
for ext, prog in compress_formats.items():
    if inputs.suffix == ext:
        io_opts += f" --prepipe {prog}"
        break
# Delimiter
for in_format in io_formats:
    if get_format(inputs) == in_format:
        io_opts += f" --i{in_format}"
        break
### OUTPUT
output = f"> {snakemake.output[0]}"
# Compressed
for ext, prog in compress_formats.items():
    if Path(snakemake.output[0]).suffix == ext:
        output = f" | {prog} {output}"
        break
# Delimiter
for out_format in io_formats:
    if get_format(snakemake.output[0]) == out_format:
        io_opts += f" --o{out_format}"
        break
if is_arg("cat", extra):
    # For cat operations, use all input files
    inputs = snakemake.input
elif is_arg("split", extra):
    # For split operations, add prefix based on common output prefix and clear output redirection
    extra += f" --prefix {os.path.commonprefix(snakemake.output).rstrip('_')}"
    output = ""
shell("GOMAXPROCS={snakemake.threads} mlr {io_opts} {extra} {inputs} {output} {log}")
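To see what command line the code above assembles for a given rule, the following sketch rebuilds the string without a `snakemake` object. It is an illustration under assumptions, not the wrapper's implementation: `build_command` and `has_arg` are hypothetical names, `has_arg` is a simplified stand-in for `is_arg`, the format flags are passed in ready-made as `io_opts`, and the split branch and log redirection are omitted.

```python
from pathlib import Path

# Output compression programs, as in the wrapper's compress_formats table.
COMPRESS_FORMATS = {".gz": "gzip", ".bgz": "gzip", ".bz2": "bzip2", ".xz": "xz"}


def has_arg(arg, cmd):
    """Hypothetical stand-in for is_arg: whitespace-token membership test."""
    return arg in cmd.split()


def build_command(inputs, outputs, extra, threads=1, io_opts=""):
    """Return the mlr command string for one rule (split and logging omitted)."""
    # Redirect Miller's stdout to the first output, compressing it if needed.
    redirect = f"> {outputs[0]}"
    prog = COMPRESS_FORMATS.get(Path(outputs[0]).suffix)
    if prog:
        redirect = f"| {prog} {redirect}"
    # Only cat receives every input file; other verbs get the first one.
    files = " ".join(inputs) if has_arg("cat", extra) else inputs[0]
    parts = [f"GOMAXPROCS={threads}", "mlr", io_opts.strip(), extra, files, redirect]
    return " ".join(part for part in parts if part)


print(build_command(["table.csv", "table2.csv"], ["miller/cat.tsv"], "cat",
                    threads=2, io_opts="--icsv --otsv"))
# GOMAXPROCS=2 mlr --icsv --otsv cat table.csv table2.csv > miller/cat.tsv
```

Printing the string instead of running it can be a convenient way to check how a new `extra` value would be combined with the inferred flags before adding the rule to a workflow.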