.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/learn/1-dataset-dataloader.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_learn_1-dataset-dataloader.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_learn_1-dataset-dataloader.py:

.. _learn-tutorial-dataset-dataloader:

Datasets and data loaders
=========================

.. py:currentmodule:: metatensor.learn.data

This tutorial shows how to define a :py:class:`Dataset` and a
:py:class:`DataLoader` that are compatible with PyTorch and can contain
metatensor data (i.e. data stored in :py:class:`metatensor.torch.TensorMap`)
in addition to more usual types of data.

.. GENERATED FROM PYTHON SOURCE LINES 13-22

.. code-block:: Python

    import os

    import torch

    from metatensor.learn.data import DataLoader, Dataset
    from metatensor.torch import Labels, TensorBlock, TensorMap

.. GENERATED FROM PYTHON SOURCE LINES 23-29

Let's define a simple dummy dataset with two fields, named ``x`` and ``y``.
Every field in the :py:class:`Dataset` must be a list of objects corresponding
to the different samples in the dataset. Let's define our ``x`` data as a list
of random tensors, and our ``y`` data as a list of integers enumerating the
samples.

.. GENERATED FROM PYTHON SOURCE LINES 30-35

.. code-block:: Python

    n_samples = 5
    x_data = [torch.randn(3) for _ in range(n_samples)]
    y_data = [i for i in range(n_samples)]

.. GENERATED FROM PYTHON SOURCE LINES 36-44

In-memory dataset
-----------------

We are ready to build our first dataset. The simplest use case is when all the
data is in memory. In this case, we can pass the data directly to the
:py:class:`Dataset` constructor as keyword arguments, named and ordered
according to how we want the data to be returned when we access samples in the
dataset.

.. GENERATED FROM PYTHON SOURCE LINES 45-48

.. code-block:: Python

    in_memory_dataset = Dataset(x=x_data, y=y_data)

.. GENERATED FROM PYTHON SOURCE LINES 49-52

We can now access samples in the dataset. The returned object is a named tuple
with fields corresponding to the keyword arguments given to the
:py:class:`Dataset` constructor (here ``x`` and ``y``).

.. GENERATED FROM PYTHON SOURCE LINES 53-56

.. code-block:: Python

    print(in_memory_dataset[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Sample(x=tensor([-0.0362,  0.4042, -1.5136]), y=0)

.. GENERATED FROM PYTHON SOURCE LINES 57-58

One can also iterate over the samples in the dataset as follows:

.. GENERATED FROM PYTHON SOURCE LINES 59-63

.. code-block:: Python

    for sample in in_memory_dataset:
        print(sample)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Sample(x=tensor([-0.0362,  0.4042, -1.5136]), y=0)
    Sample(x=tensor([ 0.4930, -0.8115, -0.5711]), y=1)
    Sample(x=tensor([ 1.5987, -0.5422, -0.7144]), y=2)
    Sample(x=tensor([ 2.3143,  0.8797, -1.2336]), y=3)
    Sample(x=tensor([ 1.0564, -0.6846,  0.0085]), y=4)

.. GENERATED FROM PYTHON SOURCE LINES 64-72

Any number of named data fields can be passed to the :py:class:`Dataset`
constructor, as long as they are all uniquely named and are all lists of the
same length. The elements of each list can be any type of object (integer,
string, ``torch.Tensor``, etc.), as long as the type is the same for all
samples in the respective field. For example, here we are creating a dataset
of torch tensors (``x``), integers (``y``), and strings (``z``).

.. GENERATED FROM PYTHON SOURCE LINES 73-78

.. code-block:: Python

    bigger_dataset = Dataset(x=x_data, y=y_data, z=["a", "b", "c", "d", "e"])
    print(bigger_dataset[0])
    print("Sample 4, z field:", bigger_dataset[4].z)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Sample(x=tensor([-0.0362,  0.4042, -1.5136]), y=0, z='a')
    Sample 4, z field: e
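
Since all fields must be lists of the same length, we can expect the
constructor to reject mismatched inputs. Below is a quick sketch of this; the
exact exception type raised by :py:class:`Dataset` is not specified in this
tutorial, so we catch broadly:

.. code-block:: Python

    try:
        # hypothetical misuse: ``y`` has only 3 samples while ``x`` has 5
        Dataset(x=x_data, y=y_data[:3])
    except Exception as error:
        print("Dataset construction failed:", error)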
.. GENERATED FROM PYTHON SOURCE LINES 79-87

Mixed in-memory / on-disk dataset
---------------------------------

Now suppose we have a large dataset, where the ``x`` data is too large to fit
in memory. In this case, we might want to lazily load data when training a
model with minibatches.

Let's save the ``x`` data to disk to simulate this use case.

.. GENERATED FROM PYTHON SOURCE LINES 88-95

.. code-block:: Python

    # Create a directory to save the dummy ``x`` data to disk
    os.makedirs("data", exist_ok=True)

    for i, x in enumerate(x_data):
        torch.save(x, f"data/x_{i}.pt")

.. GENERATED FROM PYTHON SOURCE LINES 96-101

In order for the ``x`` data to be loaded lazily, we need to give the
:py:class:`Dataset` a ``load`` function that loads a single sample into
memory. This can be a function of arbitrary complexity, taking a single
argument: the numeric index (between ``0`` and ``len(dataset) - 1``) of the
sample to load.

.. GENERATED FROM PYTHON SOURCE LINES 102-115

.. code-block:: Python

    def load_x(sample_id):
        """
        Loads the x data for the sample indexed by `sample_id` from disk and
        returns the object in memory
        """
        print(f"loading x for sample {sample_id}")
        return torch.load(f"data/x_{sample_id}.pt")


    print("load_x called with sample index 0:", load_x(0))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    loading x for sample 0
    load_x called with sample index 0: tensor([-0.0362,  0.4042, -1.5136])

.. GENERATED FROM PYTHON SOURCE LINES 116-118

Now when we define a dataset, the ``x`` data field can be passed as a
callable.

.. GENERATED FROM PYTHON SOURCE LINES 119-123

.. code-block:: Python

    mixed_dataset = Dataset(x=load_x, y=y_data)
    print(mixed_dataset[3])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    loading x for sample 3
    Sample(x=tensor([ 2.3143,  0.8797, -1.2336]), y=3)

.. GENERATED FROM PYTHON SOURCE LINES 124-132

On-disk dataset
---------------

Finally, suppose we have a large dataset, where both the ``x`` and ``y`` data
are too large to fit in memory. In this case, we might want to lazily load all
data when training a model with minibatches.

Let's save the ``y`` data to disk as well to simulate this use case.

.. GENERATED FROM PYTHON SOURCE LINES 133-149

.. code-block:: Python

    for i, y in enumerate(y_data):
        torch.save(y, f"data/y_{i}.pt")


    def load_y(sample_id):
        """
        Loads the y data for the sample indexed by `sample_id` from disk and
        returns the object in memory
        """
        print(f"loading y for sample {sample_id}")
        return torch.load(f"data/y_{sample_id}.pt")


    print("load_y called with sample index 0:", load_y(0))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    loading y for sample 0
    load_y called with sample index 0: 0

.. GENERATED FROM PYTHON SOURCE LINES 150-158

Now, as all the fields of the dataset are to be lazily loaded, we need to
indicate how many samples are in the dataset with the ``size`` argument.
Internally, the :py:class:`Dataset` class infers the unique sample indexes as
a contiguous integer sequence running from ``0`` to ``size - 1`` (inclusive).
In this case, the sample indexes are therefore ``[0, 1, 2, 3, 4]``. These
indexes are used to lazily load the data upon access.

.. GENERATED FROM PYTHON SOURCE LINES 159-163

.. code-block:: Python

    on_disk_dataset = Dataset(x=load_x, y=load_y, size=n_samples)
    print(on_disk_dataset[2])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    loading x for sample 2
    loading y for sample 2
    Sample(x=tensor([ 1.5987, -0.5422, -0.7144]), y=2)
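
The ``size`` argument also determines the length of the dataset, so iterating
over it should work just like the in-memory case, with each access triggering
the load functions for a single sample:

.. code-block:: Python

    # len() reflects the ``size`` argument; every iteration step calls
    # load_x and load_y for one sample only
    print("number of samples:", len(on_disk_dataset))
    for sample in on_disk_dataset:
        print(sample.y)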
.. GENERATED FROM PYTHON SOURCE LINES 164-174

Building a DataLoader
---------------------

Now let's see how we can use the :py:class:`Dataset` class to build a
:py:class:`DataLoader`.

Metatensor's :py:class:`DataLoader` class is a wrapper around the PyTorch
``DataLoader`` class, and as such can be initialized with a
:py:class:`Dataset` object. It also inherits all of the default arguments from
the PyTorch ``DataLoader`` class.

.. GENERATED FROM PYTHON SOURCE LINES 175-178

.. code-block:: Python

    in_memory_dataloader = DataLoader(in_memory_dataset)

.. GENERATED FROM PYTHON SOURCE LINES 179-182

We can now iterate over the ``DataLoader`` to access batches of samples from
the dataset. With no arguments passed, the default batch size is 1 and the
samples are not shuffled.

.. GENERATED FROM PYTHON SOURCE LINES 183-187

.. code-block:: Python

    for batch in in_memory_dataloader:
        print(batch.y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (0,)
    (1,)
    (2,)
    (3,)
    (4,)

.. GENERATED FROM PYTHON SOURCE LINES 188-190

As an alternative syntax, the data fields can be unpacked into separate
variables in the for loop.

.. GENERATED FROM PYTHON SOURCE LINES 191-195

.. code-block:: Python

    for x, y in in_memory_dataloader:
        print(x, y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    tensor([[-0.0362,  0.4042, -1.5136]]) (0,)
    tensor([[ 0.4930, -0.8115, -0.5711]]) (1,)
    tensor([[ 1.5987, -0.5422, -0.7144]]) (2,)
    tensor([[ 2.3143,  0.8797, -1.2336]]) (3,)
    tensor([[ 1.0564, -0.6846,  0.0085]]) (4,)

.. GENERATED FROM PYTHON SOURCE LINES 196-198

We can also pass arguments to the ``DataLoader`` constructor to change the
batch size and the shuffling of the samples.

.. GENERATED FROM PYTHON SOURCE LINES 199-204

.. code-block:: Python

    in_memory_dataloader = DataLoader(in_memory_dataset, batch_size=2, shuffle=True)

    for batch in in_memory_dataloader:
        print(batch.y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (0, 1)
    (4, 2)
    (3,)

.. GENERATED FROM PYTHON SOURCE LINES 205-212

Data loaders for cross-validation
---------------------------------

One can use the usual :py:func:`torch.utils.data.random_split` function to
split a ``Dataset`` into train, validation, and test subsets for
cross-validation purposes. A ``DataLoader`` can then be constructed for each
subset.

.. GENERATED FROM PYTHON SOURCE LINES 213-233

.. code-block:: Python

    # Perform a random train/val/test split of the Dataset,
    # in the relative proportions (60% / 20% / 20%)
    train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
        in_memory_dataset, [0.6, 0.2, 0.2]
    )

    # Construct DataLoaders for each subset
    train_dataloader = DataLoader(train_dataset)
    val_dataloader = DataLoader(val_dataset)
    test_dataloader = DataLoader(test_dataset)

    # As the Dataset was initialized with 5 samples, the split should be 3:1:1
    print(f"Dataset size: {len(in_memory_dataset)}")
    print(f"Training set size: {len(train_dataloader)}")
    print(f"Validation set size: {len(val_dataloader)}")
    print(f"Test set size: {len(test_dataloader)}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Dataset size: 5
    Training set size: 3
    Validation set size: 1
    Test set size: 1
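
These loaders can then feed a standard PyTorch training loop. Below is a
minimal sketch of what that might look like; the linear model, optimizer, and
mean-squared-error loss are illustrative assumptions, not part of the
metatensor API:

.. code-block:: Python

    # A toy model mapping each 3-component ``x`` vector to a single value
    model = torch.nn.Linear(3, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    for epoch in range(2):
        for x, y in train_dataloader:
            optimizer.zero_grad()
            # ``y`` comes out of the loader as a tuple of integers, so we
            # convert it to a float tensor before computing the loss
            target = torch.tensor(y, dtype=torch.float32)
            loss = torch.nn.functional.mse_loss(model(x).squeeze(-1), target)
            loss.backward()
            optimizer.step()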
.. GENERATED FROM PYTHON SOURCE LINES 234-247

Working with :py:class:`torch.Tensor` and :py:class:`metatensor.torch.TensorMap`
----------------------------------------------------------------------------------

As the :py:class:`Dataset` and :py:class:`DataLoader` classes exist to
interface metatensor with PyTorch, let's explore how they behave when using
:py:class:`torch.Tensor` and :py:class:`metatensor.torch.TensorMap` objects as
the data.

We'll consider some dummy data consisting of the following fields:

- **descriptor**: a list of random ``TensorMap`` objects
- **scalar**: a list of random floats
- **vector**: a list of random torch ``Tensor`` objects

.. GENERATED FROM PYTHON SOURCE LINES 248-280

.. code-block:: Python

    # Create a dummy descriptor as a ``TensorMap``
    descriptor = [
        TensorMap(
            keys=Labels(
                names=["key_1", "key_2"],
                values=torch.tensor([[1, 2]]),
            ),
            blocks=[
                TensorBlock(
                    values=torch.randn((1, 3)),
                    samples=Labels("sample_id", torch.tensor([[sample_id]])),
                    components=[],
                    properties=Labels("p", torch.tensor([[1], [4], [5]])),
                )
            ],
        )
        for sample_id in range(n_samples)
    ]

    # Create dummy scalar and vectorial target properties as ``torch.Tensor``
    scalar = [float(torch.rand(1, 1)) for _ in range(n_samples)]
    vector = [torch.rand(1, 3) for _ in range(n_samples)]

    # Build the ``Dataset``
    dataset = Dataset(
        scalar=scalar,
        vector=vector,
        descriptor=descriptor,
    )

    print(dataset[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Sample(scalar=0.9521020650863647, vector=tensor([[0.6514, 0.3126, 0.8538]]), descriptor=TensorMap with 1 blocks
    keys: key_1  key_2
              1      2)

.. GENERATED FROM PYTHON SOURCE LINES 281-296

Merging samples in a batch
--------------------------

As is customary when working with torch tensors, we want to vertically stack
the samples in a minibatch into a single ``torch.Tensor`` object. This allows
passing a single ``torch.Tensor`` object to a model, rather than a tuple of
``torch.Tensor`` objects. In a similar way, sparse data stored in metatensor
``TensorMap`` objects can also be vertically stacked, i.e. joined along the
samples axis, into a single ``TensorMap`` object.

The default ``collate_fn`` used by :py:class:`DataLoader`
(:py:func:`metatensor.learn.data.group_and_join`) vertically stacks data
fields that are :py:class:`torch.Tensor` objects, and joins along the samples
axis those that are :py:class:`metatensor.torch.TensorMap` objects. For all
other data types, the data is left as a tuple containing all samples in the
current batch, in order.

.. GENERATED FROM PYTHON SOURCE LINES 297-301

.. code-block:: Python

    batch_size = 2
    dataloader = DataLoader(dataset, batch_size=batch_size)

.. GENERATED FROM PYTHON SOURCE LINES 302-304

We can look at a single ``Batch`` object (i.e. a named tuple, returned by
``DataLoader.__iter__()``) to see this in action.

.. GENERATED FROM PYTHON SOURCE LINES 305-320

.. code-block:: Python

    batch = next(iter(dataloader))

    # ``TensorMap``s for each sample in the batch are joined along the samples
    # axis into a single ``TensorMap``
    print("batch.descriptor =", batch.descriptor)

    # ``scalar`` data are float objects, so are just grouped and returned in a
    # tuple
    print("batch.scalar =", batch.scalar)
    assert len(batch.scalar) == batch_size

    # ``vector`` data are ``torch.Tensor``s, so are vertically stacked into a
    # single ``torch.Tensor``
    print("batch.vector =", batch.vector)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    batch.descriptor = TensorMap with 1 blocks
    keys: key_1  key_2
              1      2
    batch.scalar = (0.9521020650863647, 0.3561684489250183)
    batch.vector = tensor([[0.6514, 0.3126, 0.8538],
            [0.5025, 0.7564, 0.7443]])
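
Since the metatensor ``DataLoader`` wraps the PyTorch one, it should also be
possible to swap in a different collate function. As a sketch (assuming the
``collate_fn`` keyword argument is forwarded to the underlying PyTorch
``DataLoader``), an identity collate would keep each batch as the raw list of
``Sample`` named tuples, with no stacking or joining:

.. code-block:: Python

    # Hypothetical override: keep the samples of each batch as-is
    raw_dataloader = DataLoader(
        dataset, batch_size=2, collate_fn=lambda samples: samples
    )

    for raw_batch in raw_dataloader:
        # each element of raw_batch is an individual Sample named tuple
        print(len(raw_batch), raw_batch[0].scalar)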
.. GENERATED FROM PYTHON SOURCE LINES 321-327

Advanced functionality: IndexedDataset
--------------------------------------

What if we wanted to explicitly define the sample indexes used to store and
access samples in the dataset? See the next tutorial,
:ref:`learn-tutorial-indexed-dataset-dataloader`, for more details!

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.374 seconds)

.. _sphx_glr_download_examples_learn_1-dataset-dataloader.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 1-dataset-dataloader.ipynb <1-dataset-dataloader.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 1-dataset-dataloader.py <1-dataset-dataloader.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 1-dataset-dataloader.zip <1-dataset-dataloader.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_