.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/learn/2-indexed-dataset.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_learn_2-indexed-dataset.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_learn_2-indexed-dataset.py:


.. _learn-tutorial-indexed-dataset-dataloader:

Using IndexedDataset
====================

.. py:currentmodule:: metatensor.torch.learn.data

.. GENERATED FROM PYTHON SOURCE LINES 10-18

.. code-block:: Python


    import os

    import torch

    from metatensor.learn.data import DataLoader, Dataset, IndexedDataset


.. GENERATED FROM PYTHON SOURCE LINES 19-36

Review of the standard :py:class:`Dataset`
------------------------------------------

The previous tutorial, :ref:`learn-tutorial-dataset-dataloader`, showed how to define
a :py:class:`Dataset` which can handle both ```torch.Tensor``` and metatensor's
``TensorMap``. We saw that in-memory, on-disk, or mixed in-memory/on-disk datasets
can be defined. ``DataLoaders`` are then defined on top of these ``Dataset`` objects.

In all cases, however, each sample is accessed by a numeric integer index,
which ranges from ``0`` to ``len(dataset) - 1``. Let us use a simple example
to review this.

Again let's define some dummy data. Our ``x`` data is a list of random tensors,
and our ``y`` data is a list of integers that enumerate the samples.

For the purposes of this tutorial, we will only focus on an in-memory dataset, though
the same principles apply to on-disk and mixed datasets.

.. GENERATED FROM PYTHON SOURCE LINES 37-44

.. code-block:: Python


    n_samples = 5
    x_data = [torch.randn(3) for _ in range(n_samples)]
    y_data = [i for i in range(n_samples)]

    dataset = Dataset(x=x_data, y=y_data)


.. GENERATED FROM PYTHON SOURCE LINES 45-50

A sample is accessed by its numeric index. As the length of the lists passed as kwargs
is 5, both for ``x`` and ``y``, the valid indices are ``[0, 1, 2, 3, 4]``.

Let's retrieve the 4th sample (index 3) and print it. The value of the ``y`` data
field should be 3.

.. GENERATED FROM PYTHON SOURCE LINES 51-54

.. code-block:: Python


    print(dataset[3])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(x=tensor([ 0.4012,  1.7670, -0.1331]), y=3)


.. GENERATED FROM PYTHON SOURCE LINES 55-63

What if we wanted to access samples by something other than an integer index part of a
continuous range?

For instance, what if we wanted to access samples by:
   1. a string id, or other arbitrary hashable objects?
   2. an integer index that is not defined inside a continuous range?

In these cases, we can use an :py:class:`IndexedDataset` instead.

.. GENERATED FROM PYTHON SOURCE LINES 66-73

IndexedDataset
--------------

First let's define a ``Dataset`` where the samples are indexed by arbitrary unique
indexes, such as strings, integers, and tuples.

Suppose the unique indexes for our 5 samples are:

.. GENERATED FROM PYTHON SOURCE LINES 74-90

.. code-block:: Python


    sample_id = [
        "cat",
        4,
        ("small", "cow"),
        "dog",
        0,
    ]

    # Build an 'IndexedDataset', specifying the unique sample indexes with 'sample_id'
    dataset = IndexedDataset(
        x=x_data,
        y=y_data,
        sample_id=sample_id,
    )


.. GENERATED FROM PYTHON SOURCE LINES 91-94

Now, when we access the dataset, we can access samples by their unique sample index
using the ``get_sample`` method. This method takes a single argument, the sample
index, and returns the corresponding sample.

.. GENERATED FROM PYTHON SOURCE LINES 95-100

.. code-block:: Python


    print(dataset.get_sample("dog"))
    print(dataset.get_sample(4))
    print(dataset.get_sample(("small", "cow")))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(sample_id='dog', x=tensor([ 0.4012,  1.7670, -0.1331]), y=3)
    Sample(sample_id=4, x=tensor([-0.2741,  1.9667,  0.2642]), y=1)
    Sample(sample_id=('small', 'cow'), x=tensor([-0.8153, -0.0136,  0.9017]), y=2)


.. GENERATED FROM PYTHON SOURCE LINES 101-113

Note that using ``__getitem__``, i.e. ``dataset[4]``, will return the sample passed to
the constructor at position 5. In this case, the sample indexes map to the numeric
indices as follows:

0. ``"cat"``
1. ``4``
2. ``("small", "cow")``
3. ``"dog"``
4. ``0``

Thus, accessing the unique sample index ``"cat"`` can be done equivalently with
either of:

.. GENERATED FROM PYTHON SOURCE LINES 114-119

.. code-block:: Python


    print(dataset[0])
    print(dataset.get_sample("cat"))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(sample_id='cat', x=tensor([-2.4215,  1.0809,  0.2023]), y=0)
    Sample(sample_id='cat', x=tensor([-2.4215,  1.0809,  0.2023]), y=0)


.. GENERATED FROM PYTHON SOURCE LINES 120-128

Note that the named tuple returned in both cases contains the unique sample index as
the ``sample_id`` field, which precedes all other data fields. This is in contrast to
the standard ``Dataset``, which only returns the passed data fields and not the index.

A :py:class:`DataLoader` can be constructed on top of an :py:class:`IndexedDataset`
in the same way as a :py:class:`Dataset`. Batches are accessed by iterating over the
:py:class:`DataLoader`, though this time the ``Batch`` named tuple returned by the
data loader will contain the unique sample indexes ``sample_id`` as the first field.

.. GENERATED FROM PYTHON SOURCE LINES 129-136

.. code-block:: Python


    dataloader = DataLoader(dataset, batch_size=2)

    # Iterate over batches
    for batch in dataloader:
        print(batch)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Batch(sample_id=('cat', 4), x=tensor([[-2.4215,  1.0809,  0.2023],
            [-0.2741,  1.9667,  0.2642]]), y=(0, 1))
    Batch(sample_id=(('small', 'cow'), 'dog'), x=tensor([[-0.8153, -0.0136,  0.9017],
            [ 0.4012,  1.7670, -0.1331]]), y=(2, 3))
    Batch(sample_id=(0,), x=tensor([[ 0.0110, -0.9424, -1.4410]]), y=(4,))


.. GENERATED FROM PYTHON SOURCE LINES 137-138

As before, we can create separate variables in the iteration pattern

.. GENERATED FROM PYTHON SOURCE LINES 139-142

.. code-block:: Python

    for ids, x, y in dataloader:
        print(ids, x, y)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ('cat', 4) tensor([[-2.4215,  1.0809,  0.2023],
            [-0.2741,  1.9667,  0.2642]]) (0, 1)
    (('small', 'cow'), 'dog') tensor([[-0.8153, -0.0136,  0.9017],
            [ 0.4012,  1.7670, -0.1331]]) (2, 3)
    (0,) tensor([[ 0.0110, -0.9424, -1.4410]]) (4,)


.. GENERATED FROM PYTHON SOURCE LINES 143-155

On-disk :py:class:`IndexedDataset` with arbitrary sample indexes
-----------------------------------------------------------------

When defining an :py:class:`IndexedDataset` with data fields on-disk, i.e. to be
loaded lazily, the sample indexes passed as the ``sample_id`` kwarg to the
constructor are used as the arguments to the load function.

To demonstrate this, as we did in the previous tutorial, let's save the ``x`` data to
disk and build a mixed in-memory/on-disk :py:class:`IndexedDataset`.

For instance, the below code will save some ``x`` data for the sample ``"dog"`` at
relative path ``"data/x_dog.pt"``.

.. GENERATED FROM PYTHON SOURCE LINES 156-163

.. code-block:: Python


    # Create a directory to save the dummy x data to disk
    os.makedirs("data", exist_ok=True)

    for i, x in zip(sample_id, x_data):
        torch.save(x, f"data/x_{i}.pt")


.. GENERATED FROM PYTHON SOURCE LINES 164-166

We can now define a load function to load data from disk. This should take the unique
sample index as a single argument, and return the corresponding data in memory.

.. GENERATED FROM PYTHON SOURCE LINES 167-178

.. code-block:: Python


    def load_x(sample_id):
        """
        Loads the x data for the sample indexed by `sample_id` from disk and
        returns the object in memory
        """
        print(f"loading x for sample {sample_id}")
        return torch.load(f"data/x_{sample_id}.pt")


.. GENERATED FROM PYTHON SOURCE LINES 179-181

Now when we define an IndexedDataset, the ``x`` data field can be passed as a
callable.

.. GENERATED FROM PYTHON SOURCE LINES 182-188

.. code-block:: Python


    mixed_dataset = IndexedDataset(x=load_x, y=y_data, sample_id=sample_id)
    print(mixed_dataset.get_sample("dog"))
    print(mixed_dataset.get_sample(("small", "cow")))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    loading x for sample dog
    Sample(sample_id='dog', x=tensor([ 0.4012,  1.7670, -0.1331]), y=3)
    loading x for sample ('small', 'cow')
    Sample(sample_id=('small', 'cow'), x=tensor([-0.8153, -0.0136,  0.9017]), y=2)


.. GENERATED FROM PYTHON SOURCE LINES 189-199

Using an IndexedDataset: subset integer ranges
----------------------------------------------

One could also define an ``IndexedDataset`` where the samples indices are integers
forming a possibly shuffled and non-continuous subset of a larger continuous range
of numeric indices.

For instance, imagine we have a global Dataset of 1000 samples, with indices
``[0, ..., 999]``, but only want to build a dataset for samples with indices
``[4, 7, 200, 5, 999]``, in that order. We can pass these indices kwarg ``sample_id``.

.. GENERATED FROM PYTHON SOURCE LINES 200-205

.. code-block:: Python


    # Build an IndexedDataset, specifying the subset sample indexes in a specific order
    sample_id = [4, 7, 200, 5, 999]
    dataset = IndexedDataset(x=x_data, y=y_data, sample_id=sample_id)


.. GENERATED FROM PYTHON SOURCE LINES 206-212

Now, when we access the dataset, we can access samples by their unique sample index
using the ``get_sample`` method. This method takes a single argument, the sample
index, and returns the corresponding sample.

Again, the numeric index can be used equivalently to access the sample, and again note
that the ``Sample`` named tuple includes the ``sample_id`` field.

.. GENERATED FROM PYTHON SOURCE LINES 213-218

.. code-block:: Python


    # These return the same sample
    print(dataset.get_sample(5))
    print(dataset[4])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(sample_id=5, x=tensor([ 0.4012,  1.7670, -0.1331]), y=3)
    Sample(sample_id=999, x=tensor([ 0.0110, -0.9424, -1.4410]), y=4)


.. GENERATED FROM PYTHON SOURCE LINES 219-220

And finally, the ``DataLoader`` behaves as expected:

.. GENERATED FROM PYTHON SOURCE LINES 221-226

.. code-block:: Python


    dataloader = DataLoader(dataset, batch_size=2)

    for batch in dataloader:
        print(batch)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Batch(sample_id=(4, 7), x=tensor([[-2.4215,  1.0809,  0.2023],
            [-0.2741,  1.9667,  0.2642]]), y=(0, 1))
    Batch(sample_id=(200, 5), x=tensor([[-0.8153, -0.0136,  0.9017],
            [ 0.4012,  1.7670, -0.1331]]), y=(2, 3))
    Batch(sample_id=(999,), x=tensor([[ 0.0110, -0.9424, -1.4410]]), y=(4,))


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.013 seconds)


.. _sphx_glr_download_examples_learn_2-indexed-dataset.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 2-indexed-dataset.ipynb <2-indexed-dataset.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 2-indexed-dataset.py <2-indexed-dataset.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 2-indexed-dataset.zip <2-indexed-dataset.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_