datalad foreach-dataset
Synopsis
datalad foreach-dataset [-h] [--cmd-type {auto|external|exec|eval}] [-d DATASET]
                        [--state {present|absent|any}] [-r] [-R LEVELS] [--contains PATH]
                        [--bottomup] [-s] [--output-streams {capture|pass-through|relpath}]
                        [--chpwd {ds|pwd}] [--safe-to-consume {auto|all-subds-done|superds-done|always}]
                        [-J NJOBS] [--version] ...
Description
Run a command or Python code on the dataset and/or each of its sub-datasets.
This command provides a convenience for the cases where no dedicated DataLad command is provided to operate across the hierarchy of datasets. It is very similar to the git submodule foreach command, with the following major differences:
by default (unless --subdatasets-only is given) it also operates on the original dataset,
subdatasets can be traversed in bottom-up order,
commands can be executed in parallel (see the --jobs option), while still accounting for the order, e.g. in bottom-up order a command is executed in a super-dataset only after it has been executed in all of its subdatasets.
Additional notes:
for the execution of "external" commands, the same environment is used as for executing external git and git-annex commands.
Command format
--cmd-type external: a few placeholders are supported in the command via the Python format specification:
"{pwd}" will be replaced with the full path of the current working directory.
"{ds}" and "{refds}" will provide instances of the dataset currently operated on and the reference "context" dataset which was provided via the --dataset argument.
"{tmpdir}" will be replaced with the full path of a temporary directory.
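For instance, placeholders can be interpolated into the arguments of an external command; a minimal sketch (hypothetical invocation, quoting assumes a POSIX shell):

% datalad foreach-dataset -r echo "operating in {pwd}"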
Examples
Aggressively git clean all datasets, running 5 parallel jobs:
% datalad foreach-dataset -r -J 5 git clean -dfx
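Two further sketches (hypothetical invocations, not taken from the upstream examples). Run on subdatasets only and prefix every output line with the subdataset's relative path, for a quick per-dataset overview:

% datalad foreach-dataset -r -s --output-streams relpath git status --short

Evaluate a Python expression in each dataset; with --cmd-type eval the value of the expression is placed into the 'result' field of each result record ('ds' being the provided Dataset instance):

% datalad foreach-dataset -r --cmd-type eval 'ds.path'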
Options
COMMAND
command for execution. A leading '--' can be used to disambiguate this command from the preceding options to DataLad. For --cmd-type exec or eval only a single command argument (Python code) is supported.
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message.
--cmd-type {auto|external|exec|eval}
type of the command. 'external': to be run in a child process using the dataset's runner; 'exec': Python source code to execute using 'exec()', no value returned; 'eval': Python source code to evaluate using 'eval()', the return value is placed into the 'result' field. 'auto': if used via the Python API and cmd is a Python function, 'eval' is used, otherwise 'external' is assumed. Constraints: value must be one of ('auto', 'external', 'exec', 'eval') [Default: 'auto']
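For illustration, a minimal Python API sketch, assuming the usual CLI-to-Python mapping of the command name to datalad.api.foreach_dataset and of --cmd-type to a cmd_type keyword (both names are assumptions based on that convention):

from datalad.api import foreach_dataset

# evaluate a Python expression in each (sub)dataset; with cmd_type='eval'
# the value of the expression is placed into the 'result' field of each
# result record
for res in foreach_dataset('ds.path', cmd_type='eval', recursive=True):
    print(res['result'])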
-d DATASET, --dataset DATASET
specify the dataset to operate on. If no dataset is given, an attempt is made to identify the dataset based on the input and/or the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--state {present|absent|any}
indicate which (sub)datasets to consider: either only locally present, absent, or any of those two kinds. Constraints: value must be one of (‘present’, ‘absent’, ‘any’) [Default: ‘present’]
-r, --recursive
if set, recurse into potential subdatasets.
-R LEVELS, --recursion-limit LEVELS
limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE
--contains PATH
limit to the subdatasets containing the given path. If a root path of a subdataset is given, the last considered dataset will be the subdataset itself. This option can be given multiple times, in which case datasets that contain any of the given paths will be considered. Constraints: value must be a string or value must be NONE
--bottomup
whether to report subdatasets in bottom-up order along each branch in the dataset tree, and not top-down.
-s, --subdatasets-only
whether to exclude the top-level dataset. It is implied if a non-empty CONTAINS is used.
--output-streams {capture|pass-through|relpath}, --o-s {capture|pass-through|relpath}
how to handle output streams. 'capture': return outputs from 'cmd' in the result record ('stdout', 'stderr' fields); 'pass-through': pass outputs through to the screen (and thus they are absent from the returned record); 'relpath': prefix each captured output line with a relative path (similar to what grep does) and write it to stdout and stderr. With 'relpath', the path is relative to the top of the dataset if DATASET is specified, otherwise relative to the current directory. Constraints: value must be one of ('capture', 'pass-through', 'relpath') [Default: 'pass-through']
--chpwd {ds|pwd}
'ds' will change the working directory to the top of the corresponding dataset. With 'pwd' no change of working directory will happen. Note that for Python commands, due to the use of threads, chpwd='ds' is not allowed with more than one job. Hint: use the methods of the 'ds' and 'refds' objects to execute commands in the context of those datasets. Constraints: value must be one of ('ds', 'pwd') [Default: 'ds']
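For example, with parallel jobs one can keep the working directory fixed and act through the provided dataset object instead; a hypothetical sketch ("batch save" is an arbitrary commit message):

% datalad foreach-dataset -r -J 5 --chpwd pwd --cmd-type exec 'ds.save(message="batch save")'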
--safe-to-consume {auto|all-subds-done|superds-done|always}
important only in the case of parallel (jobs greater than 1) execution. 'all-subds-done' instructs to not consider a superdataset until the command finished execution in all of its subdatasets (this is the value 'auto' resolves to if traversal is bottom-up). 'superds-done' instructs to not process subdatasets until the command finished in their super-dataset (this is the value 'auto' resolves to if traversal is not bottom-up, which is the default). With 'always' there is no ordering constraint between sub- and super-datasets. Constraints: value must be one of ('auto', 'all-subds-done', 'superds-done', 'always') [Default: 'auto']
-J NJOBS, --jobs NJOBS
how many parallel jobs (where possible) to use. 'auto' corresponds to the number defined by the 'datalad.runtime.max-annex-jobs' configuration item. NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type 'int' or value must be NONE or value must be one of ('auto',)
--version
show the module and its version which provides the command