datalad.plugin.addurls

Create and update a dataset from a list of URLs.

class datalad.plugin.addurls.Addurls[source]

Bases: datalad.interface.base.Interface

Create and update a dataset from a list of URLs.

Format specification

Several arguments take format strings. These are similar to normal Python format strings where the names from URL-FILE (column names for a CSV or properties for JSON) are available as placeholders. If URL-FILE is a CSV file, a positional index can also be used (e.g., “{0}” for the first column). Note that a placeholder cannot contain a ‘:’ or ‘!’.
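
As a rough illustration of how such placeholders resolve, the following sketch fills named and positional placeholders for a single CSV row. It uses plain str.format rather than the plugin's own formatter, and the helper name format_row is hypothetical.

```python
# One row from a CSV file with the columns who, ext, link.
columns = ["who", "ext", "link"]
row = {
    "who": "datalad",
    "ext": "png",
    "link": "https://avatars1.githubusercontent.com/u/8927200",
}


def format_row(fmt, row, columns):
    """Fill `fmt` using column names ({who}) or column positions ({0})."""
    positional = [row[name] for name in columns]
    return fmt.format(*positional, **row)


by_name = format_row("{who}.{ext}", row, columns)
by_index = format_row("{0}.{1}", row, columns)
url = format_row("{link}", row, columns)
```

Both by_name and by_index refer to the same two columns, once by name and once by position.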

In addition, the FILENAME-FORMAT argument has a few special placeholders.

  • _repindex

    The constructed file names must be unique across all rows. To avoid collisions, the special placeholder “_repindex” can be added to the formatter. Its value will start at 0 and increment every time a file name repeats.

  • _url_hostname, _urlN, _url_basename*

    Various parts of the formatted URL are available. Take “http://datalad.org/asciicast/seamless_nested_repos.sh” as an example.

    “datalad.org” is stored as “_url_hostname”. Components of the URL’s path can be referenced as “_urlN”. “_url0” and “_url1” would map to “asciicast” and “seamless_nested_repos.sh”, respectively. The final part of the path is also available as “_url_basename”.

    This name is broken down further. “_url_basename_root” and “_url_basename_ext” provide access to the root name and extension. These values are similar to the result of os.path.splitext, but, in the case of multiple periods, the extension is identified using the same length heuristic that git-annex uses. As a result, the extension of “file.tar.gz” would be “.tar.gz”, not “.gz”. In addition, the fields “_url_basename_root_py” and “_url_basename_ext_py” provide access to the result of os.path.splitext.

  • _url_filename*

    These are similar to _url_basename* fields, but they are obtained with a server request. This is useful if the file name is set in the Content-Disposition header.
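
The “_repindex” behavior described in the list above can be sketched in a few lines. This is a minimal illustration under the assumption that a repeat is detected by formatting each row with an index of 0 first; the function name format_unique is hypothetical and this is not the plugin's actual code.

```python
def format_unique(fmt, rows):
    """Format each row, bumping {_repindex} whenever a name repeats."""
    repeats = {}  # name formatted with index 0 -> highest index used so far
    names = []
    for row in rows:
        key = fmt.format(_repindex=0, **row)
        if key in repeats:
            repeats[key] += 1
        else:
            repeats[key] = 0
        names.append(fmt.format(_repindex=repeats[key], **row))
    return names


rows = [{"who": "a"}, {"who": "b"}, {"who": "a"}]
names = format_unique("{who}-{_repindex}.png", rows)
```

Here the third row would collide with the first, so it receives an index of 1 while the other two keep 0.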

Examples

Consider a file “avatars.csv” that contains:

who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200

To download each link into a file name composed of the ‘who’ and ‘ext’ fields, we could run:

$ datalad addurls -d avatar_ds --fast avatars.csv '{link}' '{who}.{ext}'

The -d avatar_ds is used to create a new dataset in “$PWD/avatar_ds”.

If we were already in a dataset and wanted to create a new subdataset in an “avatars” subdirectory, we could use “//” in the FILENAME-FORMAT argument:

$ datalad addurls --fast avatars.csv '{link}' 'avatars//{who}.{ext}'

Note

For users familiar with ‘git annex addurl’: A large part of this plugin’s functionality can be viewed as transforming data from URL-FILE into a “url filename” format that is fed to ‘git annex addurl --batch --with-files’.
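
To make that transformation concrete, here is a small sketch that turns the “avatars.csv” content from the example into “url filename” lines. The helper to_batch_lines is hypothetical, it ignores the special placeholders and subdataset handling, and it does not call git-annex.

```python
import csv
import io

CSV_TEXT = """who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200
"""


def to_batch_lines(stream, url_format, filename_format):
    """Return one "url filename" line per CSV row in `stream`."""
    reader = csv.DictReader(stream)
    return [
        "{} {}".format(url_format.format(**row), filename_format.format(**row))
        for row in reader
    ]


lines = to_batch_lines(io.StringIO(CSV_TEXT), "{link}", "{who}.{ext}")
```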

class EnsureChoice(*values)

Bases: datalad.support.constraints.Constraint

Ensure an input is an element of a set of possible values

long_description()
short_description()
class EnsureDataset

Bases: datalad.support.constraints.Constraint

long_description()
short_description()
class EnsureNone

Bases: datalad.support.constraints.Constraint

Ensure an input is of value None

long_description()
short_description()
class EnsureStr(min_len=0)

Bases: datalad.support.constraints.Constraint

Ensure an input is a string.

No automatic conversion is attempted.

long_description()
short_description()
class Parameter(constraints=None, doc=None, args=None, **kwargs)

Bases: object

This class shall serve as a representation of a parameter.

get_autodoc(name, indent=' ', width=70, default=None, has_default=False)

Docstring for the parameter to be used in lists of parameters

Return type: string or list of strings (if indent is None)
datasetmethod(name=None, dataset_argname='dataset')
eval_results()

Decorator for return value evaluation of datalad commands.

Note, this decorator is only compatible with commands that return status dict sequences!

Two basic modes of operation are supported: 1) “generator mode” that yields individual results, and 2) “list mode” that returns a sequence of results. The behavior can be selected via the kwarg return_type. Default is “list mode”.

This decorator implements common functionality for result rendering/output, error detection/handling, and logging.

Result rendering/output can be triggered via the datalad.api.result-renderer configuration variable, or the result_renderer keyword argument of each decorated command. Supported modes are: ‘default’ (one line per result with action, status, path, and an optional message); ‘json’ (one object per result, like git-annex); ‘json_pp’ (like ‘json’, but pretty-printed spanning multiple lines); ‘tailored’ (custom output formatting provided by each command class, if any).

Error detection works by inspecting the status item of all result dictionaries. Any occurrence of a status other than ‘ok’ or ‘notneeded’ will cause an IncompleteResultsError exception to be raised that carries the failed actions’ status dictionaries in its failed attribute.

Status messages will be logged automatically. By default, the following association of result status and log channel is used: ‘ok’ (debug), ‘notneeded’ (debug), ‘impossible’ (warning), ‘error’ (error). Logger instances included in the results are used to capture the origin of a status report.

Parameters:func (function) – __call__ method of a subclass of Interface, i.e. a datalad command definition
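
The error-detection rule described for eval_results can be sketched without DataLad itself. In this minimal sketch the exception class and the helper check_results are hypothetical stand-ins, not the real IncompleteResultsError or the decorator.

```python
class IncompleteResultsSketch(RuntimeError):
    """Hypothetical stand-in that carries the failed result dicts."""

    def __init__(self, failed):
        super().__init__("{} result(s) did not succeed".format(len(failed)))
        self.failed = failed


def check_results(results):
    """Return all results, or raise if any status is not ok/notneeded."""
    results = list(results)
    failed = [r for r in results if r.get("status") not in ("ok", "notneeded")]
    if failed:
        raise IncompleteResultsSketch(failed)
    return results
```

A caller would typically catch the exception and inspect its failed attribute to see which actions need attention.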
class datalad.plugin.addurls.Formatter(idx_to_name=None, missing_value=None)[source]

Bases: string.Formatter

Formatter that gives precedence to custom keys.

The first positional argument to the format call should be a mapping whose keys are exposed as placeholders (e.g., “{key1}.py”).

Parameters:
  • idx_to_name (dict) – A mapping from a positional index to a key. If not provided, “{N}” elements are not supported.
  • missing_value (str, optional) – When column lookup results in an empty string, use this value in its place.
convert_field(value, conversion)[source]
format(format_string, *args, **kwargs)[source]
get_value(key, args, kwargs)[source]

Look for key’s value in args[0] mapping first.
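
The lookup order can be illustrated with a small subclass of string.Formatter. This is a sketch of the idea only: the class name MappingFirstFormatter is hypothetical, and missing-value handling is left out.

```python
import string


class MappingFirstFormatter(string.Formatter):
    """Resolve placeholders from the first positional argument (a mapping)."""

    def __init__(self, idx_to_name=None):
        super().__init__()
        self.idx_to_name = idx_to_name or {}

    def get_value(self, key, args, kwargs):
        mapping = args[0]
        # A numeric placeholder such as {0} arrives here as the int 0;
        # translate it to a column name when a mapping for it is known.
        name = self.idx_to_name.get(key, key)
        if name in mapping:
            return mapping[name]
        # Fall back to the standard positional/keyword lookup.
        return super().get_value(key, args, kwargs)


fmt = MappingFirstFormatter(idx_to_name={0: "who", 1: "ext"})
row = {"who": "datalad", "ext": "png"}
mixed = fmt.format("{0}.{ext}", row)
precedence = fmt.format("{who}", row, who="ignored")
```

In the last call the keyword argument loses to the key of the same name in the mapping, which is the precedence the class description refers to.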

class datalad.plugin.addurls.RepFormatter(*args, **kwargs)[source]

Bases: datalad.plugin.addurls.Formatter

Extend Formatter to support a {_repindex} placeholder.

format(*args, **kwargs)[source]
get_value(key, args, kwargs)[source]

Look for key’s value in args[0] mapping first.

datalad.plugin.addurls.add_extra_filename_values(filename_format, rows, urls, dry_run)[source]

Extend rows with values for special formatting fields.

datalad.plugin.addurls.clean_meta_args(args)[source]

Process metadata arguments.

Parameters:args (iterable of str) – Formatted metadata arguments for ‘git-annex metadata --set’.
Returns: A dict mapping field names to values.
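
As a rough sketch of this kind of processing, assuming each argument has the simple “field=value” form, the arguments could be split into a dict as follows. The function parse_meta_args is hypothetical and does not validate field names the way the plugin may.

```python
def parse_meta_args(args):
    """Split "field=value" strings into a dict of field -> value."""
    result = {}
    for arg in args:
        field, sep, value = arg.partition("=")
        if not sep:
            raise ValueError("expected 'field=value', got {!r}".format(arg))
        result[field.strip()] = value.strip()
    return result


meta = parse_meta_args(["species=mouse", "group=control"])
```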
datalad.plugin.addurls.extract(stream, input_type, url_format='{0}', filename_format='{1}', exclude_autometa=None, meta=None, dry_run=False, missing_value=None)[source]

Extract and format information from url_file.

Parameters:
  • stream (file object) – Items used to construct the file names and URLs.
  • input_type ({'csv', 'json'}) –
  • All other parameters match those described in Addurls.
Returns:

A tuple where the first item is a list with a dict of extracted information for each row in stream and the second item is a set that contains all the subdataset paths.

Remove illegal names from fields.

Note: This is like filter(is_legal_metafield, fields) but the dropped values are logged.

datalad.plugin.addurls.fmt_to_name(format_string, num_to_name)[source]

Try to map a format string to a single name.

Parameters:
  • format_string (string) –
  • num_to_name (dict) – A dictionary that maps from an integer to a column name. This enables mapping the format string to an integer to a name.
Returns:

A placeholder name if format_string consists of a single placeholder and no other text. Otherwise, None is returned.
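
One way to implement such a check is with string.Formatter().parse from the standard library. The sketch below is an assumption about how the mapping could work, not the plugin's code, and the name single_placeholder is hypothetical.

```python
import string


def single_placeholder(format_string, num_to_name):
    """Return the sole placeholder name in `format_string`, else None."""
    parsed = list(string.Formatter().parse(format_string))
    if len(parsed) != 1:
        return None
    literal, field, spec, conversion = parsed[0]
    # Reject surrounding text, empty placeholders, and format modifiers.
    if literal or not field or spec or conversion:
        return None
    if field.isdigit():
        # Map a positional index such as "0" to its column name.
        return num_to_name.get(int(field))
    return field
```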

datalad.plugin.addurls.get_file_parts(filename, prefix='name')[source]

Assign a name to various parts of a file.

Parameters:
  • filename (str) – A file name (no leading path is permitted).
  • prefix (str) – Prefix to prepend to the key names.
Returns:

A dict mapping each part to a value.
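
The “_py” variants of the root and extension mentioned in the format specification come from os.path.splitext, which can be sketched as below. The function file_parts_py is hypothetical, and the git-annex length heuristic for multi-part extensions is deliberately not reproduced here.

```python
import os.path


def file_parts_py(filename, prefix="name"):
    """Name the os.path.splitext parts of `filename` under `prefix`."""
    root, ext = os.path.splitext(filename)
    return {
        prefix: filename,
        prefix + "_root_py": root,
        prefix + "_ext_py": ext,
    }


parts = file_parts_py("file.tar.gz", prefix="_url_basename")
```

With os.path.splitext only the last period counts, so the extension of “file.tar.gz” is “.gz” in this sketch.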

datalad.plugin.addurls.get_fmt_names(format_string)[source]

Yield field names in format_string.

datalad.plugin.addurls.get_subpaths(filename)[source]

Convert “//” marker in filename to a list of subpaths.

>>> from datalad.plugin.addurls import get_subpaths
>>> get_subpaths("p1/p2//p3/p4//file")
('p1/p2/p3/p4/file', ['p1/p2', 'p1/p2/p3/p4'])

Note: With Python 3, the subpaths could be generated with

itertools.accumulate(filename.split("//")[:-1], os.path.join)
Parameters:filename (str) – File name with “//” marking subpaths.
Returns: A tuple of the filename with any “//” collapsed to a single separator and a list of subpaths (str).
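
A minimal sketch of the itertools.accumulate approach from the note follows. It uses posixpath.join so the result is the same on every platform, which is an assumption made for this example; the name subpaths_sketch is hypothetical.

```python
import itertools
import posixpath


def subpaths_sketch(filename):
    """Collapse "//" markers and list the subdataset paths they imply."""
    parts = filename.split("//")
    # Each prefix before a "//" marker is one subpath, joined cumulatively.
    subpaths = list(itertools.accumulate(parts[:-1], posixpath.join))
    return posixpath.join(*parts), subpaths


result = subpaths_sketch("p1/p2//p3/p4//file")
```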
datalad.plugin.addurls.get_url_parts(url)[source]

Assign a name to various parts of the URL.

Parameters:url (str) –
Returns: A dict with keys ‘_url_hostname’ and, for a path with N+1 parts, ‘_url0’ through ‘_urlN’. There is also a ‘_url_basename’ key for the rightmost part of the path.
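
The naming scheme can be sketched with urllib.parse from the standard library. This is an illustration only: the function url_parts_sketch is hypothetical, and its handling of empty paths and empty path components is an assumption.

```python
from urllib.parse import urlparse


def url_parts_sketch(url):
    """Name the hostname and path components of `url`."""
    parsed = urlparse(url)
    names = {"_url_hostname": parsed.netloc}
    # Drop empty components produced by leading or doubled slashes.
    parts = [part for part in parsed.path.split("/") if part]
    for index, part in enumerate(parts):
        names["_url{}".format(index)] = part
    names["_url_basename"] = parts[-1] if parts else ""
    return names


url_names = url_parts_sketch(
    "http://datalad.org/asciicast/seamless_nested_repos.sh"
)
```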

Test whether name is a valid metadata field.

The set of permitted characters is taken from git-annex’s MetaData.hs:legalField.