datalad.utils

class datalad.utils.File(name, executable=False)[source]

Bases: object

Helper for a file entry in the create_tree/@with_tree

It allows defining additional settings for entries.

class datalad.utils.SequenceFormatter(separator=' ', element_formatter=<string.Formatter object>, *args, **kwargs)[source]

Bases: string.Formatter

string.Formatter subclass with special behavior for sequences.

This class delegates formatting of individual elements to another formatter object. Non-list objects are formatted by calling the delegate formatter’s “format_field” method. List-like objects (list, tuple, set, frozenset) are formatted by formatting each element of the list according to the specified format spec using the delegate formatter and then joining the resulting strings with a separator (space by default).

format_element(elem, format_spec)[source]

Format a single element

For sequences, this is called once for each element in a sequence. For anything else, it is called on the entire object. It is intended to be overridden in subclasses.

format_field(value, format_spec)[source]
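
A minimal sketch of the delegation described for SequenceFormatter, using only the standard library. The class name SequenceFormatterSketch is hypothetical and this is not datalad's implementation; it only illustrates how list-like values could be formatted element by element and then joined.

```python
import string


class SequenceFormatterSketch(string.Formatter):
    """Illustrative stand-in, not the datalad class."""

    def __init__(self, separator=" ", element_formatter=None):
        self.separator = separator
        # Delegate formatter used for individual elements.
        self.element_formatter = element_formatter or string.Formatter()

    def format_element(self, elem, format_spec):
        # Called once per element of a sequence, or once for a scalar.
        return self.element_formatter.format_field(elem, format_spec)

    def format_field(self, value, format_spec):
        if isinstance(value, (list, tuple, set, frozenset)):
            return self.separator.join(
                self.format_element(v, format_spec) for v in value
            )
        return self.format_element(value, format_spec)


fmt = SequenceFormatterSketch(separator=", ")
formatted = fmt.format("{:.2f} | {}", [1.0, 2.5], "x")
```

Here the format spec ".2f" is applied to each list element separately, while the plain string is passed through the delegate unchanged.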
datalad.utils.all_same(items)[source]

Quick check if all items are the same.

Identical to a check like len(set(items)) == 1, but should be more efficient when working on generators, since it returns False as soon as any difference is detected, thus possibly avoiding unnecessary evaluations.
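
The early exit can be sketched as follows. The name all_same_sketch and the tracked helper are hypothetical and exist only to show that a generator is not consumed past the first difference; this is not the datalad source.

```python
def all_same_sketch(items):
    """Return True if items is non-empty and all elements compare equal."""
    first = True
    first_item = None
    for item in items:
        if first:
            first = False
            first_item = item
        elif item != first_item:
            return False  # stop at the first difference
    # An empty input yields False, matching len(set(items)) == 1.
    return not first


consumed = []


def tracked(values):
    # Hypothetical helper that records which values were actually pulled.
    for v in values:
        consumed.append(v)
        yield v


result = all_same_sketch(tracked([1, 2, 1, 1]))
```

After the call, only the first two values have been pulled from the generator.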

Return whether any of the regexes (list or str) searches successfully for value

datalad.utils.as_unicode(val, cast_types=<type 'object'>)[source]

Given an arbitrary value, try to obtain its unicode value

For unicode, the original value is returned; for a python2 str or python3 bytes, assure_unicode is used; for None, an empty (unicode) string is returned; and for any other type (see cast_types), the unicode constructor is applied. If the value is not an instance of cast_types, TypeError is raised.

Parameters:cast_types (type) – Which types to cast to unicode by providing to constructor
datalad.utils.assert_no_open_files(*args, **kwargs)[source]
datalad.utils.assure_bool(s)[source]

Convert value into boolean, following the convention for strings to recognize on, True, yes as True and off, False, no as False
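
A rough sketch of that string convention. The function name is hypothetical, and the case-insensitive matching, the ValueError for unrecognized strings, and the bool() fallback for non-strings are assumptions rather than documented behavior.

```python
def assure_bool_sketch(s):
    """Map on/true/yes to True and off/false/no to False (illustrative)."""
    if isinstance(s, str):
        lowered = s.strip().lower()  # assumption: matching ignores case
        if lowered in ("on", "true", "yes"):
            return True
        if lowered in ("off", "false", "no"):
            return False
        # Assumption: unrecognized strings are rejected rather than guessed.
        raise ValueError("Do not know how to treat %r as a boolean" % (s,))
    # Assumption: non-string values go through plain truthiness.
    return bool(s)
```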

datalad.utils.assure_bytes(s, encoding='utf-8')[source]

Convert/encode unicode to str (PY2) or bytes (PY3) if of ‘text_type’

Parameters:encoding (str, optional) – Encoding to use. “utf-8” is the default
datalad.utils.assure_dict_from_str(s, **kwargs)[source]

Given a multiline string with key=value items convert it to a dictionary

Parameters:s (str or dict) –
Returns:None if input s is empty
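
A minimal sketch of parsing key=value lines into a dictionary. The function name is hypothetical; skipping blank lines, splitting on the first "=" only, and raising ValueError for a line without "=" are assumptions made for the illustration.

```python
def dict_from_str_sketch(s):
    """Convert a multiline 'key=value' string to a dict (illustrative)."""
    if not s:
        return None  # empty input, as stated in the description
    if isinstance(s, dict):
        return s  # already a dictionary, nothing to parse
    out = {}
    for line in s.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if "=" not in line:
            raise ValueError("%r is not in key=value form" % (line,))
        key, value = line.split("=", 1)  # later '=' stay in the value
        out[key] = value
    return out
```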
datalad.utils.assure_dir(*args)[source]

Make sure directory exists.

Joins the list of arguments to an os-specific path to the desired directory and creates it, if it does not exist yet.

datalad.utils.assure_iter(s, cls, copy=False, iterate=True)[source]

Given something that is not a list, place it into a list. If None, an empty list is returned

Parameters:
  • s (list or anything) –
  • cls (class) – Which iterable class to assure
  • copy (bool, optional) – If correct iterable is passed, it would generate its shallow copy
  • iterate (bool, optional) – If it is not a list, but something iterable (but not a text_type) iterate over it.
datalad.utils.assure_list(s, copy=False, iterate=True)[source]

Given something that is not a list, place it into a list. If None, an empty list is returned

Parameters:
  • s (list or anything) –
  • copy (bool, optional) – If list is passed, it would generate a shallow copy of the list
  • iterate (bool, optional) – If it is not a list, but something iterable (but not a text_type) iterate over it.
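
The interplay of the copy and iterate parameters might look like this. The function name is hypothetical and the sketch follows only the description above; it is not the datalad source.

```python
def assure_list_sketch(s, copy=False, iterate=True):
    """Coerce a value to a list following the documented rules (illustrative)."""
    if isinstance(s, list):
        return list(s) if copy else s  # shallow copy only on request
    if s is None:
        return []
    if isinstance(s, str):
        return [s]  # text is wrapped whole, never split into characters
    if iterate and hasattr(s, "__iter__"):
        return list(s)
    return [s]


original = [1, 2]
```

With copy=False the very same list object is returned, so callers that intend to mutate the result may prefer copy=True.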
datalad.utils.assure_list_from_str(s, sep='\n')[source]

Given a multiline string, convert it to a list, or return None if empty

Parameters:s (str or list) –
datalad.utils.assure_tuple_or_list(obj)[source]

Given an object, wrap into a tuple if not list or tuple

datalad.utils.assure_unicode(s, encoding=None, confidence=None)[source]

Convert/decode to unicode (PY2) or str (PY3) if of ‘binary_type’

Parameters:
  • encoding (str, optional) – Encoding to use. If None, “utf-8” is tried, and then if not a valid UTF-8, encoding will be guessed
  • confidence (float, optional) – A value between 0 and 1, so if guessing of encoding is of lower than specified confidence, ValueError is raised
datalad.utils.auto_repr(cls)[source]

Decorator for a class to assign it an automagic quick and dirty __repr__

It uses public class attributes to prepare repr of a class

Original idea: http://stackoverflow.com/a/27799004/1265472
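
A sketch of such a class decorator. The names auto_repr_sketch and Point are hypothetical, and reading instance attributes through vars() while skipping names that start with an underscore is an assumption about what "public attributes" means here.

```python
def auto_repr_sketch(cls):
    """Attach a quick __repr__ built from public attributes (illustrative)."""

    def __repr__(self):
        items = ", ".join(
            "%s=%r" % (name, value)
            for name, value in sorted(vars(self).items())
            if not name.startswith("_")  # assumption: underscore means private
        )
        return "%s(%s)" % (type(self).__name__, items)

    cls.__repr__ = __repr__
    return cls


@auto_repr_sketch
class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self._cache = None  # private, so left out of the repr


text = repr(Point(1, 2))
```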

datalad.utils.better_wraps(to_be_wrapped)[source]

Decorator to replace functools.wraps

This is based on wrapt instead of functools and, in contrast to wraps, preserves the correct signature of the decorated function. It is written with the intention to replace the use of wraps without any need to rewrite the actual decorators.

class datalad.utils.chpwd(path, mkdir=False, logsuffix='')[source]

Bases: object

Wrapper around os.chdir which also adjusts environ[‘PWD’]

The reason is that otherwise PWD is simply inherited from the shell and we have no ability to assess directory path without dereferencing symlinks.

If used as a context manager, it temporarily changes directory to the given path

datalad.utils.create_tree(path, tree, archives_leading_dir=True, remove_existing=False)[source]

Given a list of tuples (name, load) create such a tree

If load is a tuple itself, that would create either a subtree or, if name ends with .tar.gz, an archive with that content, and place it into the tree

datalad.utils.create_tree_archive(path, name, load, overwrite=False, archives_leading_dir=True)[source]

Given an archive name, create under path with specified load tree

datalad.utils.decode_input(s)[source]

Given input string/bytes, decode according to stdin codepage (or UTF-8 if not defined)

If decoding fails, issue a warning and decode allowing for errors being replaced

datalad.utils.disable_logger(*args, **kwds)[source]

context manager to temporarily disable logging

This is to provide one of swallow_logs’ purposes without unnecessarily creating temp files (see gh-1865)

Parameters:logger (Logger) – Logger whose handlers will be ordered to not log anything. Default: datalad’s topmost Logger (‘datalad’)
datalad.utils.dlabspath(path, norm=False)[source]

Symlinks-in-the-cwd aware abspath

os.path.abspath relies on os.getcwd() which would not know about symlinks in the path

TODO: we might want to norm=True by default to match behavior of os.path.abspath?

datalad.utils.encode_filename(filename)[source]

Encode unicode filename

datalad.utils.escape_filename(filename)[source]

Surround filename in “” and escape ” in the filename

datalad.utils.expandpath(path, force_absolute=True)[source]

Expand all variables and user handles in a path.

By default return an absolute path

datalad.utils.file_basename(name, return_ext=False)[source]

Strips up to 2 extensions, each up to 4 characters long and starting with a letter rather than a digit, so we could get rid of .tar.gz etc
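
One way to express that rule with a regular expression. The function name and the exact pattern are assumptions: here "length up to 4" is read as at most four characters after the dot, which may differ in detail from the pattern datalad actually uses.

```python
import os.path
import re

# Up to two trailing extensions, each a dot, a letter, then at most three
# further alphanumeric characters (one interpretation of the description).
_EXT_RE = re.compile(r"(\.[A-Za-z][A-Za-z0-9]{0,3}){0,2}$")


def file_basename_sketch(name, return_ext=False):
    bname = os.path.basename(name)
    stem = _EXT_RE.sub("", bname)
    if return_ext:
        # Everything after the stem and its separating dot.
        return stem, bname[len(stem) + 1:]
    return stem
```

A trailing component such as ".3" in a version string is left alone because it starts with a digit.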

datalad.utils.find_files(regex, topdir='.', exclude=None, exclude_vcs=True, exclude_datalad=False, dirs=False)[source]

Generator to find files matching regex

Parameters:
  • regex (basestring) –
  • exclude (basestring, optional) – Matches to exclude
  • exclude_vcs – If True, excludes commonly known VCS subdirectories. If string, used as regex to exclude those files (regex: ‘/.(?:git|gitattributes|svn|bzr|hg)(?:/|$)’)
  • exclude_datalad – If True, excludes files known to be datalad meta-data files (e.g. under .datalad/ subdirectory) (regex: ‘/.(?:datalad)(?:/|$)’)
  • topdir (basestring, optional) – Directory where to search
  • dirs (bool, optional) – Whether to match directories as well as files
datalad.utils.generate_chunks(container, size)[source]

Given a container, generate chunks from it with size up to size
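
A minimal sketch of chunking by slicing. The function name is hypothetical, and it assumes the container supports len() and slicing, as lists, tuples, and strings do.

```python
def generate_chunks_sketch(container, size):
    """Yield consecutive slices of container with at most size items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(container), size):
        yield container[start:start + size]


chunks = list(generate_chunks_sketch([0, 1, 2, 3, 4], 2))
```

The last chunk is shorter whenever the container length is not a multiple of size.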

datalad.utils.get_dataset_pwds(dataset)[source]

Return the current directory for the dataset.

Parameters:dataset (Dataset) –
Returns:A tuple, where the first item is the absolute path of the pwd and the second is the pwd relative to the dataset’s path.
datalad.utils.get_dataset_root(path)[source]

Return the root of an existent dataset containing a given path

The root path is returned in the same absolute or relative form as the input argument. If no associated dataset exists, or the input path doesn’t exist, None is returned.

datalad.utils.get_encoding_info()[source]

Return a dictionary with various encoding/locale information

datalad.utils.get_envvars_info()[source]
datalad.utils.get_func_kwargs_doc(func)[source]

Provides args for a function

Parameters:func (str) – name of the function from which args are being requested
Returns:list of the args that a function takes in
Return type:list
datalad.utils.get_ipython_shell()[source]

Detect whether running within IPython and return its ip (shell) object

Returns None if not under ipython (no get_ipython function)

datalad.utils.get_logfilename(dspath, cmd='datalad')[source]

Return a filename to use for logging under a dataset/repository

The directory will be created if it doesn’t exist, but dspath must exist and be a directory

datalad.utils.get_open_files(path, log_open=False)[source]

Get open files under a path

Parameters:
  • path (str) – File or directory to check for open files under
  • log_open (bool or int) – If set - logger level to use
Returns:

path : pid

Return type:

dict

datalad.utils.get_path_prefix(path, pwd=None)[source]

Get path prefix (for current directory)

Returns the relative path to the topdir if we are under topdir, and the absolute path to topdir otherwise. If pwd is not specified, the current directory is assumed

datalad.utils.get_tempfile_kwargs(tkwargs=None, prefix='', wrapped=None)[source]

Updates kwargs to be passed to tempfile.* calls, depending on env vars

datalad.utils.get_timestamp_suffix(time_=None, prefix='-')[source]

Return a time stamp (full date and time up to second)

primarily to be used for generation of log files names

datalad.utils.get_trace(edges, start, end, trace=None)[source]

Return the trace/path to reach a node in a tree.

Parameters:
  • edges (sequence(2-tuple)) – The tree given by a sequence of edges (parent, child) tuples. The nodes can be identified by any value and data type that supports the ‘==’ operation.
  • start – Identifier of the start node. Must be present as a value in the parent location of an edge tuple in order to be found.
  • end – Identifier of the target/end node. Must be present as a value in the child location of an edge tuple in order to be found.
  • trace (list) – Mostly useful for recursive calls, and used internally.
Returns:

Returns a list with the trace to the target (the start and the target are not included in the trace, hence if start and end are directly connected an empty list is returned), or None when no trace to the target can be found, or start and end are identical.

Return type:

None or list
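
The return convention can be illustrated with a small recursive search. The function name is hypothetical, and the guard against revisiting nodes is an addition for safety on non-tree input; neither is taken from the datalad source.

```python
def get_trace_sketch(edges, start, end, trace=None):
    """Return intermediate nodes between start and end, or None (illustrative)."""
    if start == end:
        return None  # identical start and end count as "no trace"
    if trace is None:
        trace = []
    for parent, child in edges:
        if parent != start:
            continue
        if child == end:
            return trace  # start and end themselves are not included
        if child in trace:
            continue  # assumption: skip nodes already visited
        found = get_trace_sketch(edges, child, end, trace + [child])
        if found is not None:
            return found
    return None


edges = [(1, 2), (2, 3), (3, 4), (1, 5)]
```

An empty list and None mean different things here, so callers should test the result with "is None" rather than truthiness.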

datalad.utils.getpwd()[source]

Try to return a CWD without dereferencing possible symlinks

This function will try to use the PWD environment variable to provide a current working directory, possibly with some directories along the path being symlinks to other directories. Unfortunately, PWD is used/set only by the shell, and functions such as os.chdir and os.getcwd neither use nor modify it, thus os.getcwd() returns the path with links dereferenced.

While returning current working directory based on PWD env variable we verify that the directory is the same as os.getcwd() after resolving all symlinks. If that verification fails, we fall back to always use os.getcwd().

Initial decision to either use PWD env variable or os.getcwd() is done upon the first call of this function.
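
The verification step can be sketched as below. The function name and the environ parameter are hypothetical (the parameter only makes the example testable), and unlike the function described above this sketch re-checks on every call instead of deciding once.

```python
import os


def getpwd_sketch(environ=os.environ):
    """Prefer PWD when it resolves to the real cwd, else use os.getcwd()."""
    cwd = os.getcwd()
    pwd = environ.get("PWD")
    # Trust PWD only if it points at the same directory once symlinks
    # are resolved; otherwise it is stale and os.getcwd() wins.
    if pwd and os.path.realpath(pwd) == os.path.realpath(cwd):
        return pwd
    return cwd


stale = getpwd_sketch({"PWD": "/definitely/not/the/cwd"})
```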

datalad.utils.import_module_from_file(modpath, pkg=None, log=<bound method Logger.debug of <logging.Logger object>>)[source]

Import provided module given a path

TODO: - RF/make use of it in pipeline.py which has similar logic - join with import_modules above?

Parameters:pkg (module, optional) – If provided, and modpath is under pkg.__path__, relative import will be used
datalad.utils.import_modules(modnames, pkg, msg='Failed to import {module}', log=<bound method Logger.debug of <logging.Logger object>>)[source]

Helper to import a list of modules without failing if N/A

Parameters:
  • modnames (list of str) – List of module names to import
  • pkg (str) – Package under which to import
  • msg (str, optional) – Message template for .format() to log at DEBUG level if import fails. Keys {module} and {package} will be provided and ‘: {exception}’ appended
  • log (callable, optional) – Logger call to use for logging messages
datalad.utils.is_explicit_path(path)[source]

Return whether a path explicitly points to a location

Any absolute path, or relative path starting with either ‘../’ or ‘./’ is assumed to indicate a location on the filesystem. Any other path format is not considered explicit.
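
A sketch of that rule for POSIX-style paths. The function name is hypothetical, and using posixpath (rather than the platform's os.path) is an assumption made so that the example behaves the same everywhere.

```python
import posixpath


def is_explicit_path_sketch(path):
    """True for absolute paths and for paths starting with './' or '../'."""
    return posixpath.isabs(path) or path.startswith(("./", "../"))
```

A bare name such as "data/file" is not explicit under this rule, which leaves room for interpreting it as something other than a filesystem location.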

datalad.utils.is_interactive()[source]

Return True if all in/outs are tty

datalad.utils.knows_annex(path)[source]

Returns whether at a given path there is information about an annex

It is just a thin wrapper around GitRepo.is_with_annex() classmethod which also checks for path to exist first.

This includes actually present annexes, but also uninitialized ones, or even the presence of a remote annex branch.

datalad.utils.line_profile(func)[source]
datalad.utils.lmtime(filepath, mtime)[source]

Set mtime for files, while not de-referencing symlinks.

To overcome absence of os.lutime

Works only on linux and OSX ATM

datalad.utils.make_tempfile(*args, **kwds)[source]

Helper class to provide a temporary file name and remove it at the end (context manager)

Parameters:
  • mkdir (bool, optional (default: False)) – If True, temporary directory created using tempfile.mkdtemp()
  • content (str or bytes, optional) – Content to be stored in the file created
  • wrapped (function, optional) – If set, function name used to prefix temporary file name
  • **tkwargs – All other arguments are passed into the call to tempfile.mk{,d}temp(), and the resultant temporary filename is passed as the first argument into the function t. If no ‘prefix’ argument is provided, it will be constructed using module and function names (‘.’ replaced with ‘_’).

To change the used directory without providing keyword argument ‘dir’, set DATALAD_TESTS_TEMP_DIR.

Examples

>>> from os.path import exists
>>> from datalad.utils import make_tempfile
>>> with make_tempfile() as fname:
...    k = open(fname, 'w').write('silly test')
>>> assert not exists(fname)  # was removed
>>> with make_tempfile(content="blah") as fname:
...    assert open(fname).read() == "blah"
datalad.utils.map_items(func, v)[source]

A helper to apply func to all elements (keys and values) within dict

No type checking of values passed to func is done, so func should be resilient to values which it should not handle

Initial usecase - apply_recursive(url_fragment, assure_unicode)

datalad.utils.md5sum(filename)[source]
datalad.utils.not_supported_on_windows(msg=None)[source]

A little helper to be invoked to consistently fail whenever functionality is not supported (yet) on Windows

datalad.utils.nothing_cm(*args, **kwds)[source]

Just a dummy cm to programmatically switch context managers

datalad.utils.open_r_encdetect(fname, readahead=1000)[source]

Return a file object in read mode with auto-detected encoding

This is helpful when dealing with files of unknown encoding.

Parameters:readahead (int, optional) – How many bytes to read for guessing the encoding type. If negative - full file will be read
datalad.utils.optional_args(decorator)[source]

Allows a decorator to take optional positional and keyword arguments. Assumes that taking a single, callable, positional argument means that it is decorating a function, i.e. something like this:

@my_decorator
def function(): pass

Calls decorator with decorator(f, *args, **kwargs)
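
A sketch of how such a helper could be written and used. The names optional_args_sketch, tagged, one, and two are hypothetical, and functools.wraps is used here purely for illustration.

```python
import functools


def optional_args_sketch(decorator):
    """Let decorator be used both bare and with arguments (illustrative)."""

    @functools.wraps(decorator)
    def wrapper(*args, **kwargs):
        # A single callable positional argument means bare usage.
        if not kwargs and len(args) == 1 and callable(args[0]):
            return decorator(args[0])
        # Otherwise the arguments are settings; return the real decorator.
        return lambda f: decorator(f, *args, **kwargs)

    return wrapper


@optional_args_sketch
def tagged(f, label="default"):
    @functools.wraps(f)
    def inner(*a, **kw):
        return (label, f(*a, **kw))
    return inner


@tagged
def one():
    return 1


@tagged(label="custom")
def two():
    return 2
```

The heuristic cannot tell a decorated function from a single callable passed as a setting, so such settings are best given as keyword arguments.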

datalad.utils.partition(items, predicate=<type 'bool'>)[source]

Partition items by predicate.

Parameters:
  • items (iterable) –
  • predicate (callable) – A function that will be mapped over each element in items. The elements will be partitioned based on whether the return value is false or true.
Returns:

A tuple with two generators, the first for ‘false’ items and the second for ‘true’ ones.

Notes

Taken from Peter Otten’s snippet posted at https://nedbatchelder.com/blog/201306/filter_a_list_into_two_parts.html
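
A sketch of the tee-based approach. The function name is hypothetical; itertools.tee lets both result generators share a single pass over the predicate results.

```python
from itertools import tee


def partition_sketch(items, predicate=bool):
    """Split items into (false_items, true_items) generators (illustrative)."""
    # Evaluate the predicate once per item and share the results.
    a, b = tee((predicate(item), item) for item in items)
    return (
        (item for pred, item in a if not pred),
        (item for pred, item in b if pred),
    )


falses, trues = partition_sketch(range(6), lambda n: n % 2 == 0)
odd_items = list(falses)
even_items = list(trues)
```

Because tee buffers items for the slower consumer, draining one generator completely before the other keeps all items in memory.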

datalad.utils.path_is_subpath(path, prefix)[source]

Return True if path is a subpath of prefix

It will return False if path == prefix.

Parameters:
  • path (str) –
  • prefix (str) –
datalad.utils.path_startswith(path, prefix)[source]

Return True if path starts with prefix path

Parameters:
  • path (str) –
  • prefix (str) –
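
The difference between path_startswith and path_is_subpath can be sketched as follows. Both function names are hypothetical, and the sketch assumes already normalized paths with "/" as the separator.

```python
def path_startswith_sketch(path, prefix):
    """True if path equals prefix or lies underneath it (illustrative)."""
    # Compare with a trailing separator so '/ab' does not match '/a'.
    path = path.rstrip("/") + "/"
    prefix = prefix.rstrip("/") + "/"
    return path.startswith(prefix)


def path_is_subpath_sketch(path, prefix):
    """Like path_startswith_sketch, but False when the paths are equal."""
    same = path.rstrip("/") == prefix.rstrip("/")
    return not same and path_startswith_sketch(path, prefix)
```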
datalad.utils.posix_relpath(path, start=None)[source]

Behave like os.path.relpath, but always return POSIX paths on any platform.

datalad.utils.read_csv_lines(fname, dialect=None, readahead=16384, **kwargs)[source]

A generator of dict records from a CSV/TSV

Automatically guesses the encoding for each record to convert to UTF-8

Parameters:
  • fname (str) – Filename
  • dialect (str, optional) – Dialect to specify to csv.reader. If not specified – guessed from the file, if fails to guess, “excel-tab” is assumed
  • readahead (int, optional) – How many bytes to read from the file to guess the type
  • **kwargs – Passed to csv.reader
datalad.utils.rmdir(path, *args, **kwargs)[source]

os.rmdir with our optional checking for open files

datalad.utils.rmtemp(f, *args, **kwargs)[source]

Wrapper to centralize removing of temp files so we could keep them around

It will not remove the temporary file/directory if DATALAD_TESTS_TEMP_KEEP environment variable is defined

datalad.utils.rmtree(path, chmod_files='auto', children_only=False, *args, **kwargs)[source]

To remove a git-annex .git directory, all files and directories need to be made writable again first

Parameters:
  • chmod_files (string or bool, optional) – Whether to make files writable also before removal. Usually it is just a matter of directories to have write permissions. If ‘auto’ it would chmod files on windows by default
  • children_only (bool, optional) – If set, all files and subdirectories would be removed while the path itself (must be a directory) would be preserved
  • *args
  • **kwargs – Passed into shutil.rmtree call
datalad.utils.rotree(path, ro=True, chmod_files=True)[source]

To make tree read-only or writable

Parameters:
  • path (string) – Path to the tree/directory to chmod
  • ro (bool, optional) – Whether to make it R/O (default) or RW
  • chmod_files (bool, optional) – Whether to operate also on files (not just directories)
datalad.utils.safe_print(s)[source]

Print with protection against UTF-8 encoding errors

datalad.utils.saved_generator(gen)[source]

Given a generator returns two generators, where 2nd one just replays

So the first one would be going through the generated items and 2nd one would be yielding saved items
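
A sketch of the record-and-replay idea. The function name is hypothetical; note that the second generator only yields what the first one has already produced.

```python
def saved_generator_sketch(gen):
    """Return (live, replay): live records items, replay yields them again."""
    saved = []

    def live():
        for item in gen:
            saved.append(item)  # remember each item as it passes through
            yield item

    def replay():
        for item in saved:
            yield item

    return live(), replay()


live, replay = saved_generator_sketch(iter(range(3)))
first_pass = list(live)
second_pass = list(replay)
```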

datalad.utils.setup_exceptionhook(ipython=False)[source]

Overloads default sys.excepthook with our exceptionhook handler.

If interactive, our exceptionhook handler will invoke pdb.post_mortem; if not interactive, then invokes default handler.

datalad.utils.shortened_repr(value, l=30)[source]
datalad.utils.slash_join(base, extension)[source]

Join two strings with a ‘/’, avoiding duplicate slashes

If any of the strings is None the other is returned as is.
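
A minimal sketch of this behavior; the function name is hypothetical.

```python
def slash_join_sketch(base, extension):
    """Join two strings with a single '/', passing through None inputs."""
    if extension is None:
        return base
    if base is None:
        return extension
    # Strip the slashes facing the joint so exactly one remains.
    return "/".join((base.rstrip("/"), extension.lstrip("/")))
```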

datalad.utils.sorted_files(dout)[source]

Return a (sorted) list of files under dout

datalad.utils.swallow_logs(*args, **kwds)[source]

Context manager to consume all logs.

datalad.utils.swallow_outputs(*args, **kwds)[source]

Context manager to help consuming both stdout and stderr, and print()

stdout is available as cm.out and stderr as cm.err whenever cm is the yielded context manager. Internally uses temporary files to guarantee absent side-effects of swallowing into StringIO which lacks .fileno.

print mocking is necessary for some uses where sys.stdout was already bound to original sys.stdout, thus mocking it later had no effect. Overriding print function had desired effect

datalad.utils.try_multiple(ntrials, exception, base, f, *args, **kwargs)[source]

Call f multiple times making exponentially growing delay between the calls
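
A sketch of retrying with exponentially growing delays. The function name, the keyword-only sleep hook (added so the example runs without waiting), and the delay formula base ** trial are all assumptions, not documented behavior.

```python
import time


def try_multiple_sketch(ntrials, exception, base, f, *args,
                        sleep=time.sleep, **kwargs):
    """Call f up to ntrials times, sleeping base ** trial between failures."""
    for trial in range(1, ntrials + 1):
        try:
            return f(*args, **kwargs)
        except exception:
            if trial == ntrials:
                raise  # out of attempts, propagate the last error
            sleep(base ** trial)  # assumed delay: base, base**2, ...


delays = []
attempts = []


def flaky():
    # Hypothetical function that fails twice before succeeding.
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("not yet")
    return "done"


outcome = try_multiple_sketch(5, OSError, 2, flaky, sleep=delays.append)
```

With this formula a base greater than 1 is needed for the delays to grow.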

datalad.utils.try_multiple_dec(f, ntrials=None, duration=0.1, exceptions=None, increment_type=None)[source]
datalad.utils.unique(seq, key=None)[source]

Given a sequence return a list only with unique elements while maintaining order

This is the fastest solution. See https://www.peterbe.com/plog/uniqifiers-benchmark and http://stackoverflow.com/a/480227/1265472 for more information. Enhancement – added ability to compare for uniqueness using a key function

Parameters:
  • seq – Sequence to analyze
  • key (callable, optional) – Function to call on each element so we could decide not on a full element, but on its member etc
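
An order-preserving sketch using a set of already seen keys. The function name is hypothetical, and the sketch assumes the elements (or their keys) are hashable.

```python
def unique_sketch(seq, key=None):
    """Return the first occurrence of each element, keeping order."""
    seen = set()
    out = []
    for item in seq:
        marker = item if key is None else key(item)
        if marker in seen:
            continue
        seen.add(marker)
        out.append(item)
    return out
```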

‘Robust’ unlink. Would try multiple times

On Windows boxes there is evidence for a latency of more than a second until a file is considered no longer “in-use”. WindowsError is not defined on Linux, so if IOError or any other exception is raised while an except statement names WindowsError, a NameError results. See also gh-2533

datalad.utils.updated(d, update)[source]

Return a copy of the input with the ‘update’ applied

Primarily for updating dictionaries

datalad.utils.with_pathsep(path)[source]

Little helper to guarantee that path ends with /