datalad.api.foreach_dataset

datalad.api.foreach_dataset(cmd, *, cmd_type='auto', dataset=None, state='present', recursive=False, recursion_limit=None, contains=None, bottomup=False, subdatasets_only=False, output_streams='pass-through', chpwd='ds', safe_to_consume='auto', jobs=None)

Run a command or Python code on the dataset and/or each of its sub-datasets.

This command provides a convenience for cases where no dedicated DataLad command exists to operate across a hierarchy of datasets. It is very similar to the git submodule foreach command, with the following major differences:

  • by default (unless subdatasets_only=True), the command also operates on the original dataset itself,

  • subdatasets can be traversed in bottom-up order,

  • commands can be executed in parallel (see the jobs option) while still honoring traversal order, e.g. with bottom-up traversal the command is executed in a super-dataset only after it has finished in all of its subdatasets.
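The bottom-up ordering guarantee can be sketched as a plain depth-first traversal (illustrative only, not DataLad's implementation; the hierarchy below is hypothetical):

```python
# Illustrative sketch (not DataLad's code) of the bottom-up ordering
# guarantee: a dataset is yielded only after all of its subdatasets.
def bottomup_order(tree, root):
    """Yield dataset paths depth-first, children before parents."""
    for sub in tree.get(root, []):
        yield from bottomup_order(tree, sub)
    yield root

# hypothetical hierarchy: a top dataset with two subdatasets, one nested
tree = {"ds": ["ds/sub1", "ds/sub2"], "ds/sub1": ["ds/sub1/subsub"]}
order = list(bottomup_order(tree, "ds"))
# the super-dataset "ds" comes last, after all of its subdatasets
```

Under parallel execution, the same constraint is what the safe_to_consume option controls.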

Additional notes:

  • “external” commands are executed in the same environment that is used to run external git and git-annex commands.

Command format

For cmd_type=’external’, a few placeholders are supported in the command via the Python format specification:

  • “{pwd}” will be replaced with the full path of the current working directory.

  • “{ds}” and “{refds}” will be replaced with instances of the dataset currently operated on and the reference “context” dataset provided via the dataset argument, respectively.

  • “{tmpdir}” will be replaced with the full path of a temporary directory.
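As a sketch of the substitution behavior (DataLad performs the substitution internally; the command string and paths below are hypothetical), the placeholders follow Python’s str.format semantics:

```python
# Hedged sketch: the documented placeholders behave like Python
# str.format fields. The command string and paths are hypothetical.
cmd = "cp {pwd}/config.yml {tmpdir}/config.yml"
expanded = cmd.format(pwd="/home/me/study", tmpdir="/tmp/foreach-ab12")
# → "cp /home/me/study/config.yml /tmp/foreach-ab12/config.yml"
```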

Examples

Aggressively git clean all datasets, running 5 parallel jobs:

> foreach_dataset(['git', 'clean', '-dfx'], recursive=True, jobs=5)
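With cmd_type=’eval’ (or ’auto’), the command can be an actual Python function. A hedged sketch of what such a function looks like (the function name and logic are hypothetical):

```python
# Hypothetical function usable as cmd: with cmd_type='eval' it receives
# the documented placeholders ('ds', 'refds', 'pwd', 'tmpdir', ...) as
# keyword arguments, and its return value lands in the 'result' field.
def dataset_path(ds=None, **placeholders):
    return getattr(ds, "path", None)

# Invocation (requires DataLad; shown for illustration, not executed):
# foreach_dataset(dataset_path, recursive=True)
```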
Parameters:
  • cmd – command to execute. For cmd_type=’exec’ or cmd_type=’eval’ (Python code), it should be either a string or a list with only a single item. With ’eval’, an actual Python function can be passed, which will receive all placeholders as keyword arguments.

  • cmd_type ({'auto', 'external', 'exec', 'eval'}, optional) – type of the command. ’external’: to be run in a child process using the dataset’s runner; ’exec’: Python source code to execute using exec(), no value is returned; ’eval’: Python source code to evaluate using eval(), the return value is placed into the ’result’ field; ’auto’: if used via the Python API and cmd is a Python function, ’eval’ is used, otherwise ’external’ is assumed. [Default: ’auto’]

  • dataset (Dataset or None, optional) – specify the dataset to operate on. If no dataset is given, an attempt is made to identify the dataset based on the input and/or the current working directory. [Default: None]

  • state ({'present', 'absent', 'any'}, optional) – indicate which (sub)datasets to consider: either only locally present, absent, or any of those two kinds. [Default: ‘present’]

  • recursive (bool, optional) – if set, recurse into potential subdatasets. [Default: False]

  • recursion_limit (int or None, optional) – limit recursion into subdatasets to the given number of levels. [Default: None]

  • contains (list of str or None, optional) – limit to the subdatasets containing the given path. If a root path of a subdataset is given, the last considered dataset will be the subdataset itself. Can be a list with multiple paths, in which case datasets that contain any of the given paths will be considered. [Default: None]

  • bottomup (bool, optional) – whether to report subdatasets in bottom-up order along each branch in the dataset tree, and not top-down. [Default: False]

  • subdatasets_only (bool, optional) – whether to exclude top level dataset. It is implied if a non-empty contains is used. [Default: False]

  • output_streams ({'capture', 'pass-through', 'relpath'}, optional) – how to handle outputs. ’capture’: capture outputs from ’cmd’ and return them in the result record (’stdout’, ’stderr’); ’pass-through’: pass outputs through to the screen (and thus they are absent from the returned record); ’relpath’: prefix captured output with the relative path (similar to what grep does) and write it to stdout and stderr. For ’relpath’, the path is relative to the top of the dataset if a dataset is specified, and otherwise relative to the current directory. [Default: ’pass-through’]

  • chpwd ({'ds', 'pwd'}, optional) – ’ds’ will change the working directory to the top of the corresponding dataset. With ’pwd’ no change of working directory will happen. Note that for Python commands, due to the use of threads, chpwd=’ds’ cannot be combined with jobs > 1. Hint: use the ’ds’ and ’refds’ objects’ methods to execute commands in the context of those datasets. [Default: ’ds’]

  • safe_to_consume ({'auto', 'all-subds-done', 'superds-done', 'always'}, optional) – important only in the case of parallel execution (jobs greater than 1). ’all-subds-done’ instructs to not consider a superdataset until the command has finished executing in all of its subdatasets (this is the value chosen by ’auto’ if traversal is bottom-up). ’superds-done’ instructs to not process subdatasets until the command has finished in their super-dataset (this is the value chosen by ’auto’ if traversal is not bottom-up, which is the default). With ’always’ no ordering constraint is imposed between sub- and super-datasets. [Default: ’auto’]

  • jobs (int or None or {'auto'}, optional) – how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by the ’datalad.runtime.max-annex-jobs’ configuration item. NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. [Default: None]

  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ’ignore’: any failure is reported, but does not cause an exception; ’continue’: if any failure occurs, an exception will be raised at the end, but processing of other actions will continue for as long as possible; ’stop’: processing will stop on the first failure and an exception is raised. A failure is any result with status ’impossible’ or ’error’. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its ’failed’ attribute. [Default: ’continue’]

  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]

  • result_renderer – select rendering mode for command results. ’tailored’ enables a command-specific rendering style that is typically tailored to human consumption, if there is one for a specific command, and otherwise falls back on the ’generic’ result renderer; ’generic’ renders each result in one line with key info like action, status, path, and an optional message; ’json’ a complete JSON line serialization of the full result record; ’json_pp’ like ’json’, but pretty-printed spanning multiple lines; ’disabled’ turns off result rendering entirely; ’<template>’ reports any value(s) of any result properties in any format indicated by the template (e.g. ’{path}’, compare with JSON output for all key-value choices). The template syntax follows the Python “format() language”. It is possible to report individual dictionary values, e.g. ’{metadata[name]}’. If a 2nd-level key contains a colon, e.g. ’music:Genre’, ’:’ must be substituted by ’#’ in the template, like so: ’{metadata[music#Genre]}’. [Default: ’tailored’]

  • result_xfm ({'datasets', 'successdatasets-or-none', 'paths', 'relpaths', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]

  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ’item-or-list’, a single value is returned instead of a one-item list, or a list in case of multiple return values; None is returned in case of an empty list. [Default: ’list’]
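The documented ’auto’ defaults for cmd_type and safe_to_consume can be sketched as plain helper functions (illustrative only; these helpers are hypothetical and not part of the DataLad API):

```python
# Illustrative sketches of the documented 'auto' resolution rules;
# the helper names are hypothetical, not DataLad's source code.

def resolve_cmd_type(cmd, cmd_type="auto"):
    # a Python callable implies 'eval'; anything else is run as 'external'
    if cmd_type != "auto":
        return cmd_type
    return "eval" if callable(cmd) else "external"

def resolve_safe_to_consume(safe_to_consume="auto", bottomup=False):
    # bottom-up traversal waits on all subdatasets; top-down traversal
    # waits on the super-dataset
    if safe_to_consume != "auto":
        return safe_to_consume
    return "all-subds-done" if bottomup else "superds-done"
```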