.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/learn/1-dataset-dataloader.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_learn_1-dataset-dataloader.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_learn_1-dataset-dataloader.py:


.. _learn-tutorial-dataset-dataloader:

Datasets and data loaders
=========================

.. py:currentmodule:: metatensor.learn.data

This tutorial shows how to define a :py:class:`Dataset` and a
:py:class:`DataLoader` that are compatible with PyTorch and that can contain
metatensor data (i.e. data stored in :py:class:`metatensor.torch.TensorMap`) in
addition to more usual types of data.

.. GENERATED FROM PYTHON SOURCE LINES 13-22

.. code-block:: Python


    import os

    import torch

    from metatensor.learn.data import DataLoader, Dataset
    from metatensor.torch import Labels, TensorBlock, TensorMap


.. GENERATED FROM PYTHON SOURCE LINES 23-29

Let's define a simple dummy dataset with two fields, named ``x`` and ``y``.
Every field in the ``Dataset`` must be a list of objects, with one entry per
sample in the dataset.

Let's define our x data as a list of random tensors, and our y data as a list of
integers enumerating the samples.

.. GENERATED FROM PYTHON SOURCE LINES 30-35

.. code-block:: Python


    n_samples = 5
    x_data = [torch.randn(3) for _ in range(n_samples)]
    y_data = [i for i in range(n_samples)]


.. GENERATED FROM PYTHON SOURCE LINES 36-43

In-memory dataset
-----------------

We are ready to build our first dataset. The simplest use case is when all data
is in memory. In this case, we can pass the data directly to the
:py:class:`Dataset` constructor as keyword arguments, named and ordered
according to how we want the data to be returned when we access samples in the
dataset.

.. GENERATED FROM PYTHON SOURCE LINES 44-47

.. code-block:: Python


    in_memory_dataset = Dataset(x=x_data, y=y_data)


.. GENERATED FROM PYTHON SOURCE LINES 48-51

We can now access samples in the dataset. The returned object is a named tuple
with fields corresponding to the keyword arguments given to the
:py:class:`Dataset` constructor (here ``x`` and ``y``).

.. GENERATED FROM PYTHON SOURCE LINES 52-55

.. code-block:: Python


    print(in_memory_dataset[0])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(x=tensor([0.8755, 1.1659, 0.4774]), y=0)


.. GENERATED FROM PYTHON SOURCE LINES 56-57

One can also iterate over the samples in the dataset as follows:

.. GENERATED FROM PYTHON SOURCE LINES 58-62

.. code-block:: Python


    for sample in in_memory_dataset:
        print(sample)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(x=tensor([0.8755, 1.1659, 0.4774]), y=0)
    Sample(x=tensor([ 0.2203,  0.5828, -1.4850]), y=1)
    Sample(x=tensor([ 0.5390, -0.4458, -0.5234]), y=2)
    Sample(x=tensor([-1.6205,  0.2696,  0.8789]), y=3)
    Sample(x=tensor([-1.2561,  0.8285,  1.4113]), y=4)


.. GENERATED FROM PYTHON SOURCE LINES 63-70

Any number of named data fields can be passed to the ``Dataset`` constructor, as
long as they are all uniquely named and are all lists of the same length. The
elements of each list can be any type of object (integer, string, torch Tensor,
etc.), as long as the type is the same for all samples in the respective field.

For example, here we create a dataset of torch tensors (``x``), integers
(``y``), and strings (``z``).

.. GENERATED FROM PYTHON SOURCE LINES 71-76

.. code-block:: Python


    bigger_dataset = Dataset(x=x_data, y=y_data, z=["a", "b", "c", "d", "e"])
    print(bigger_dataset[0])
    print("Sample 4, z field:", bigger_dataset[4].z)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(x=tensor([0.8755, 1.1659, 0.4774]), y=0, z='a')
    Sample 4, z field: e

.. GENERATED FROM PYTHON SOURCE LINES 77-85

Mixed in-memory / on-disk dataset
---------------------------------

Now suppose we have a large dataset, where the x data is too large to fit in
memory. In this case, we might want to lazily load data when training a model
with minibatches.

Let's save the x data to disk to simulate this use case.

.. GENERATED FROM PYTHON SOURCE LINES 86-93

.. code-block:: Python


    # Create a directory to save the dummy x data to disk
    os.makedirs("data", exist_ok=True)

    for i, x in enumerate(x_data):
        torch.save(x, f"data/x_{i}.pt")


.. GENERATED FROM PYTHON SOURCE LINES 94-98

In order for the x data to be loaded lazily, we need to give the ``Dataset`` a
``load`` function that loads a single sample into memory. This can be a function
of arbitrary complexity, taking a single argument which is the numeric index
(between ``0`` and ``len(dataset) - 1``) of the sample to load.

.. GENERATED FROM PYTHON SOURCE LINES 99-112

.. code-block:: Python


    def load_x(sample_id):
        """
        Loads the x data for the sample indexed by `sample_id` from disk and returns the
        object in memory
        """
        print(f"loading x for sample {sample_id}")
        return torch.load(f"data/x_{sample_id}.pt")


    print("load_x called with sample index 0:", load_x(0))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    loading x for sample 0
    /home/runner/work/metatensor/metatensor/python/examples/learn/1-dataset-dataloader.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
      return torch.load(f"data/x_{sample_id}.pt")
    load_x called with sample index 0: tensor([0.8755, 1.1659, 0.4774])


.. GENERATED FROM PYTHON SOURCE LINES 113-114

Now when we define a dataset, the ``x`` data field can be passed as a callable.

.. GENERATED FROM PYTHON SOURCE LINES 115-119

.. code-block:: Python


    mixed_dataset = Dataset(x=load_x, y=y_data)
    print(mixed_dataset[3])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    loading x for sample 3
    Sample(x=tensor([-1.6205,  0.2696,  0.8789]), y=3)


.. GENERATED FROM PYTHON SOURCE LINES 120-128

On-disk dataset
---------------

Finally, suppose we have a large dataset, where both the x and y data are too
large to fit in memory. In this case, we might want to lazily load all data when
training a model with minibatches.

Let's save the y data to disk as well to simulate this use case.

.. GENERATED FROM PYTHON SOURCE LINES 129-145

.. code-block:: Python


    for i, y in enumerate(y_data):
        torch.save(y, f"data/y_{i}.pt")


    def load_y(sample_id):
        """
        Loads the y data for the sample indexed by `sample_id` from disk and returns the
        object in memory
        """
        print(f"loading y for sample {sample_id}")
        return torch.load(f"data/y_{sample_id}.pt")


    print("load_y called with sample index 0:", load_y(0))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    loading y for sample 0
    /home/runner/work/metatensor/metatensor/python/examples/learn/1-dataset-dataloader.py:140: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
      return torch.load(f"data/y_{sample_id}.pt")
    load_y called with sample index 0: 0


.. GENERATED FROM PYTHON SOURCE LINES 146-152

Since all the fields are now loaded lazily, there is no in-memory list from
which the number of samples could be inferred, so we need to indicate how many
samples are in the dataset with the ``size`` argument.

Internally, the ``Dataset`` class infers the unique sample indexes as a
contiguous integer sequence from ``0`` to ``size - 1`` (inclusive). In this
case, the sample indexes are therefore ``[0, 1, 2, 3, 4]``. These indexes are
used to lazily load the data upon access.

.. GENERATED FROM PYTHON SOURCE LINES 153-156

.. code-block:: Python


    on_disk_dataset = Dataset(x=load_x, y=load_y, size=n_samples)
    print(on_disk_dataset[2])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    loading x for sample 2
    loading y for sample 2
    Sample(x=tensor([ 0.5390, -0.4458, -0.5234]), y=2)


.. GENERATED FROM PYTHON SOURCE LINES 157-165

Building a DataLoader
---------------------

Now let's see how we can use the ``Dataset`` class to build a ``DataLoader``.

Metatensor's ``DataLoader`` class is a wrapper around the PyTorch ``DataLoader``
class, and as such can be initialized with a ``Dataset`` object. It will also
inherit all of the default arguments from the PyTorch ``DataLoader`` class.

.. GENERATED FROM PYTHON SOURCE LINES 166-169

.. code-block:: Python


    in_memory_dataloader = DataLoader(in_memory_dataset)


.. GENERATED FROM PYTHON SOURCE LINES 170-173

We can now iterate over the ``DataLoader`` to access batches of samples from the
dataset. With no arguments passed, the default batch size is 1 and the samples
are not shuffled.

.. GENERATED FROM PYTHON SOURCE LINES 174-178

.. code-block:: Python


    for batch in in_memory_dataloader:
        print(batch.y)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (0,)
    (1,)
    (2,)
    (3,)
    (4,)


.. GENERATED FROM PYTHON SOURCE LINES 179-181

As an alternative syntax, the data fields can be unpacked into separate
variables in the for loop.

.. GENERATED FROM PYTHON SOURCE LINES 182-186

.. code-block:: Python


    for x, y in in_memory_dataloader:
        print(x, y)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    tensor([[0.8755, 1.1659, 0.4774]]) (0,)
    tensor([[ 0.2203,  0.5828, -1.4850]]) (1,)
    tensor([[ 0.5390, -0.4458, -0.5234]]) (2,)
    tensor([[-1.6205,  0.2696,  0.8789]]) (3,)
    tensor([[-1.2561,  0.8285,  1.4113]]) (4,)


.. GENERATED FROM PYTHON SOURCE LINES 187-189

We can also pass arguments to the ``DataLoader`` constructor to change the batch
size and to shuffle the samples.

.. GENERATED FROM PYTHON SOURCE LINES 190-195

.. code-block:: Python


    in_memory_dataloader = DataLoader(in_memory_dataset, batch_size=2, shuffle=True)

    for batch in in_memory_dataloader:
        print(batch.y)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (2, 3)
    (1, 0)
    (4,)


.. GENERATED FROM PYTHON SOURCE LINES 196-202

Data loaders for cross-validation
---------------------------------

One can use the usual :py:func:`torch.utils.data.random_split` function to split
a ``Dataset`` into train, validation, and test subsets for cross-validation
purposes. A ``DataLoader`` can then be constructed for each subset.

.. GENERATED FROM PYTHON SOURCE LINES 203-223

.. code-block:: Python


    # Perform a random train/val/test split of the Dataset,
    # in the relative proportions (60% / 20% / 20%)
    train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
        in_memory_dataset, [0.6, 0.2, 0.2]
    )

    # Construct DataLoaders for each subset
    train_dataloader = DataLoader(train_dataset)
    val_dataloader = DataLoader(val_dataset)
    test_dataloader = DataLoader(test_dataset)

    # As the Dataset was initialized with 5 samples, the split should be 3:1:1
    print(f"Dataset size: {len(in_memory_dataset)}")
    print(f"Training set size: {len(train_dataloader)}")
    print(f"Validation set size: {len(val_dataloader)}")
    print(f"Test set size: {len(test_dataloader)}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Dataset size: 5
    Training set size: 3
    Validation set size: 1
    Test set size: 1


.. GENERATED FROM PYTHON SOURCE LINES 224-237

Working with :py:class:`torch.Tensor` and :py:class:`metatensor.torch.TensorMap`
--------------------------------------------------------------------------------

As the :py:class:`Dataset` and :py:class:`DataLoader` classes exist to interface
metatensor and torch, let's explore how they behave when using
:py:class:`torch.Tensor` and :py:class:`metatensor.torch.TensorMap` objects as
the data.

We'll consider some dummy data consisting of the following fields:

- **descriptor**: a list of random TensorMap objects
- **scalar**: a list of random floats
- **vector**: a list of random torch Tensors

.. GENERATED FROM PYTHON SOURCE LINES 238-270

.. code-block:: Python


    # Create a dummy descriptor as a TensorMap
    descriptor = [
        TensorMap(
            keys=Labels(
                names=["key_1", "key_2"],
                values=torch.tensor([[1, 2]]),
            ),
            blocks=[
                TensorBlock(
                    values=torch.randn((1, 3)),
                    samples=Labels("sample_id", torch.tensor([[sample_id]])),
                    components=[],
                    properties=Labels("p", torch.tensor([[1], [4], [5]])),
                )
            ],
        )
        for sample_id in range(n_samples)
    ]

    # Create dummy scalar and vectorial target properties as torch Tensors
    scalar = [float(torch.rand(1, 1)) for _ in range(n_samples)]
    vector = [torch.rand(1, 3) for _ in range(n_samples)]

    # Build the Dataset
    dataset = Dataset(
        scalar=scalar,
        vector=vector,
        descriptor=descriptor,
    )
    print(dataset[0])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sample(scalar=0.4385954737663269, vector=tensor([[0.8823, 0.5315, 0.5599]]), descriptor=TensorMap with 1 blocks
    keys: key_1  key_2
            1      2)


.. GENERATED FROM PYTHON SOURCE LINES 271-285

Merging samples in a batch
--------------------------

As is customary when working with torch Tensors, we want to vertically stack the
samples in a minibatch into a single Tensor object. This allows passing a single
Tensor object to a model, rather than a tuple of Tensor objects. In a similar
way, sparse data stored in metatensor TensorMap objects can also be vertically
stacked, i.e. joined along the samples axis, into a single TensorMap object.

The default ``collate_fn`` used by :py:class:`DataLoader`
(:py:func:`metatensor.learn.data.group_and_join`) vstacks (respectively joins
along the samples axis) data fields that correspond to :py:class:`torch.Tensor`
(respectively :py:class:`metatensor.torch.TensorMap`). For all other data types,
the data is left as a tuple containing all samples in the current batch, in
order.

.. GENERATED FROM PYTHON SOURCE LINES 286-290

.. code-block:: Python


    batch_size = 2
    dataloader = DataLoader(dataset, batch_size=batch_size)


.. GENERATED FROM PYTHON SOURCE LINES 291-293

We can look at a single ``Batch`` object (i.e. a named tuple, returned by
``DataLoader.__iter__()``) to see this in action.

.. GENERATED FROM PYTHON SOURCE LINES 294-309

.. code-block:: Python


    batch = next(iter(dataloader))

    # TensorMaps for each sample in the batch joined along the samples axis
    # into a single TensorMap
    print("batch.descriptor =", batch.descriptor)

    # `scalar` data are float objects, so are just grouped and returned in a tuple
    print("batch.scalar =", batch.scalar)
    assert len(batch.scalar) == batch_size

    # `vector` data are torch Tensors, so are vertically stacked into a single
    # Tensor
    print("batch.vector =", batch.vector)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    batch.descriptor = TensorMap with 1 blocks
    keys: key_1  key_2
            1      2
    batch.scalar = (0.4385954737663269, 0.5116015672683716)
    batch.vector = tensor([[0.8823, 0.5315, 0.5599],
            [0.3168, 0.5620, 0.9536]])


.. GENERATED FROM PYTHON SOURCE LINES 310-316

Advanced functionality: IndexedDataset
--------------------------------------

What if we wanted to explicitly define the sample indexes used to store and
access samples in the dataset? See the next tutorial,
:ref:`learn-tutorial-indexed-dataset-dataloader`, for more details!


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.433 seconds)


.. _sphx_glr_download_examples_learn_1-dataset-dataloader.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 1-dataset-dataloader.ipynb <1-dataset-dataloader.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 1-dataset-dataloader.py <1-dataset-dataloader.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 1-dataset-dataloader.zip <1-dataset-dataloader.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_