datalad.api.clone

datalad.api.clone(source, path=None, git_clone_opts=None, *, dataset=None, description=None, reckless=None)

Obtain a dataset (copy) from a URL or local directory

The purpose of this command is to obtain a new clone (copy) of a dataset and place it into a not-yet-existing or empty directory. As such clone provides a strict subset of the functionality offered by install. Only a single dataset can be obtained, and immediate recursive installation of subdatasets is not supported. However, once a (super)dataset is installed via clone, any content, including subdatasets can be obtained by a subsequent get command.

Primary differences over a direct git clone call are 1) the automatic initialization of a dataset annex (pure Git repositories are equally supported); 2) automatic registration of the newly obtained dataset as a subdataset (submodule), if a parent dataset is specified; 3) support for additional resource identifiers (DataLad resource identifiers as used on datasets.datalad.org, and RIA store URLs as used for store.datalad.org - optionally in specific versions as identified by a branch or a tag; see examples); and 4) automatic configurable generation of alternative access URL for common cases (such as appending ‘.git’ to the URL in case the accessing the base URL failed).

In case the clone is registered as a subdataset, the original URL passed to clone is recorded in .gitmodules of the parent dataset in addition to the resolved URL used internally for git-clone. This allows to preserve datalad specific URLs like ria+ssh://… for subsequent calls to get if the subdataset was locally removed later on.

By default, the command returns a single Dataset instance for an installed dataset, regardless of whether it was newly installed (‘ok’ result), or found already installed from the specified source (‘notneeded’ result).

URL mapping configuration

‘clone’ supports the transformation of URLs via (multi-part) substitution specifications. A substitution specification is defined as a configuration setting ‘datalad.clone.url-substition.<seriesID>’ with a string containing a match and substitution expression, each following Python’s regular expression syntax. Both expressions are concatenated to a single string with an arbitrary delimiter character. The delimiter is defined by prefixing the string with the delimiter. Prefix and delimiter are stripped from the expressions (Example: “,^http://(.*)$,https://1”). This setting can be defined multiple times, using the same ‘<seriesID>’. Substitutions in a series will be applied incrementally, in order of their definition. The first substitution in such a series must match, otherwise no further substitutions in a series will be considered. However, following the first match all further substitutions in a series are processed, regardless whether intermediate expressions match or not. Substitution series themselves have no particular order, each matching series will result in a candidate clone URL. Consequently, the initial match specification in a series should be as precise as possible to prevent inflation of candidate URLs.

See also

handbook:3-001 (http://handbook.datalad.org/symbols): More information on Remote Indexed Archive (RIA) stores

Examples

Install a dataset from GitHub into the current directory:

> clone(source='https://github.com/datalad-datasets/longnow-podcasts.git')

Install a dataset into a specific directory:

> clone(source='https://github.com/datalad-datasets/longnow-podcasts.git',
        path='myfavpodcasts')

Install a dataset as a subdataset into the current dataset:

> clone(dataset='.',
        source='https://github.com/datalad-datasets/longnow-podcasts.git')

Install the main superdataset from datasets.datalad.org:

> clone(source='///')

Install a dataset identified by a literal alias from store.datalad.org:

> clone(source='ria+http://store.datalad.org#~hcp-openaccess')

Install a dataset in a specific version as identified by a branch or tag name from store.datalad.org:

> clone(source='ria+http://store.datalad.org#76b6ca66-36b1-11ea-a2e6-f0d5bf7b5561@myidentifier')

Install a dataset with group-write access permissions:

> clone(source='http://example.com/dataset', reckless='shared-group')

Parameters:

source (str) – URL, DataLad resource identifier, local path or instance of dataset to be cloned.
path – path to clone into. If no path is provided a destination path will be derived from a source URL similar to git clone. [Default: None]
git_clone_opts – A list of command line arguments to pass to git clone. Note that not all options will lead to viable results. For example ‘–single- branch’ will not result in a functional annex repository because both a regular branch and the git-annex branch are required. Note that a version in a RIA URL takes precedence over ‘–branch’. [Default: None]
dataset (Dataset or None, optional) – (parent) dataset to clone into. If given, the newly cloned dataset is registered as a subdataset of the parent. Also, if given, relative paths are interpreted as being relative to the parent dataset, and not relative to the working directory. [Default: None]
description (str or None, optional) – short description to use for a dataset location. Its primary purpose is to help humans to identify a dataset copy (e.g., “mike’s dataset on lab server”). Note that when a dataset is published, this information becomes available on the remote side. [Default: None]
reckless ({None, True, False, 'auto', 'ephemeral'} or shared-..., optional) – Obtain a dataset or subdatset and set it up in a potentially unsafe way for performance, or access reasons. Use with care, any dataset is marked as ‘untrusted’. The reckless mode is stored in a dataset’s local configuration under ‘datalad.clone.reckless’, and will be inherited to any of its subdatasets. Supported modes are: [‘auto’]: hard-link files between local clones. In-place modification in any clone will alter original annex content. [‘ephemeral’]: symlink annex to origin’s annex and discard local availability info via git- annex-dead ‘here’ and declares this annex private. Shares an annex between origin and clone w/o git-annex being aware of it. In case of a change in origin you need to update the clone before you’re able to save new content on your end. Alternative to ‘auto’ when hardlinks are not an option, or number of consumed inodes needs to be minimized. Note that this mode can only be used with clones from non-bare repositories or a RIA store! Otherwise two different annex object tree structures (dirhashmixed vs dirhashlower) will be used simultaneously, and annex keys using the respective other structure will be inaccessible. [‘shared-<mode>’]: set up repository and annex permission to enable multi-user access. This disables the standard write protection of annex’ed files. <mode> can be any value support by ‘git init –shared=’, such as ‘group’, or ‘all’. [Default: None]
on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’ any failure is reported, but does not cause an exception; ‘continue’ if any failure occurs an exception will be raised at the end, but processing other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. Raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]
result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: constraint:action:{install}]
result_renderer – select rendering mode command results. ‘tailored’ enables a command- specific rendering style that is typically tailored to human consumption, if there is one for a specific command, or otherwise falls back on the the ‘generic’ result renderer; ‘generic’ renders each result in one line with key info like action, status, path, and an optional message); ‘json’ a complete JSON line serialization of the full result record; ‘json_pp’ like ‘json’, but pretty-printed spanning multiple lines; ‘disabled’ turns off result rendering entirely; ‘<template>’ reports any value(s) of any result properties in any format indicated by the template (e.g. ‘{path}’, compare with JSON output for all key-value choices). The template syntax follows the Python “format() language”. It is possible to report individual dictionary values, e.g. ‘{metadata[name]}’. If a 2nd-level key contains a colon, e.g. ‘music:Genre’, ‘:’ must be substituted by ‘#’ in the template, like so: ‘{metadata[music#Genre]}’. [Default: ‘tailored’]
result_xfm ({'datasets', 'successdatasets-or-none', 'paths', 'relpaths', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top- level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: ‘successdatasets-or- none’]
return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’ a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is return in case of an empty list. [Default: ‘item-or-list’]