File URL handling
DataLad datasets can record URLs for file content access as metadata. This is a feature provided by git-annex and is available for any annexed file. DataLad improves upon the git-annex functionality in two ways:
Support for a variety of (additional) protocols and authentication methods.
Support for special URLs pointing to individual files located in registered (annexed) archives, such as tarballs and ZIP files.
These additional features are available to all functionality that is processing
URLs, such as get
, addurls
, or download-url
.
Extensible protocol and authentication support
DataLad ships with a dedicated implementation of an external git-annex special
remote named git-annex-remote-datalad
. This is a somewhat atypical special
remote, because it cannot receive files and store them, but only supports
read operations.
Specifically, it uses the CLAIMURL
feature of the external special remote
protocol to take over processing of URLs with supported protocols in all
datasets that have this special remote configured and enabled.
This special remote is automatically configured and enabled in DataLad dataset
as a datalad
remote, by commands that utilize its features, such as
download-url
. Once enabled, DataLad (but also git-annex) is able to act on
additional protocols, such as s3://
, and the respective URLs can be given
directly to commands like git annex addurl
, or datalad download-url
.
Beyond additional protocol support, the datalad
special remote also
interfaces with DataLad’s Credential management. It can identify a
particular credential required for a given URL (based on something called a
“provider” configuration), ask for the credential or retrieve it from a
credential store, and supply it to the respective service in an appropriate
form. Importantly, this feature neither requires the necessary credential or
provider configuration to be encoded in a URL (where it would become part of
the git-annex metadata), nor to be committed to a dataset. Hence all
information that may depend on which entity is performing a URL request
and in what environment is completely separated from the location information
on a particular file content. This minimizes the required dataset maintenance
effort (when credentials change), and offers a clean separation of identity
and availability tracking vs. authentication management.
Indexing and access of archive content
Another git-annex special remote, named
git-annex-remote-datalad-archives
, is used to enable file content retrieval
from annexed archive files, such as tarballs and ZIP files. Its implementation
concept is closely related to the git-annex-remote-datalad
, described
above. Its main difference is that it claims responsibility for a particular
type of “URL” (starting with dl+archive:
). These URLs encode the identity
of an archive file, in terms of its git-annex key name, and a relative path
inside this archive pointing to a particular file.
Like git-annex-remote-datalad
, only read operations are supported. When
a request to a dl+archive:
“URL” is made, the special remote identifies
the archive file, if necessary obtains it at the precise version needed, and
extracts the respected file content from the archive at the correct location.
This special remote is automatically configured and enabled as
datalad-archives
by the add-archive-content
command. This command
indexes annexed archives, extracts, and registers their content to a
dataset. File content availability information is recorded in terms of the
dl+archive:
“URLs”, which are put into the git-annex metadata on a file’s
content.