Note
Go to the end to download the full example code.
Handling sparsity¶
The one sentence introduction to metatensor mentions that this is a “self-describing sparse tensor data format”. The previous tutorial explained the self-describing part of the format, and in this tutorial we will explore what makes metatensor a sparse format; and how to remove the corresponding sparsity when required.
Like in the previous tutorial, we will load the data we need from a file. The code used to generate this file can be found below:
Show the code used to generate the radial-spectrum.npz
file
The data was generated with featomic, a package to compute atomistic representations for machine learning applications.
import ase from featomic import SoapRadialSpectrum import metatensor atoms = ase.Atoms( "COO2N2", positions=[(0, 0, 0), (1.2, 0, 0), (0, 6, 0), (1.1, 6, 0), (6, 0, 0), (7.3, 0, 0)], ) calculator = SoapRadialSpectrum( cutoff={ "radius": 2.5, "smoothing": {"type": "ShiftedCosine", "width": 0.5}, }, density={ "type": "Gaussian", "width": 0.2, }, basis={ "radial": {"type": "Gto", "max_radial": 2}, }, ) descriptor = calculator.compute(atoms, gradients=["positions"]) metatensor.save("radial-spectrum.npz", descriptor)
The file contains a representation of two molecules called the radial spectrum. The atom \(i\) is represented by the radial spectrum \(R_i^\alpha\), which is an expansion of the neighbor density \(\rho_i^\alpha(r)\) on a set of radial basis functions \(f_n(r)\)
The density \(\rho_i^\alpha(r)\) associated with all neighbors of species \(\alpha\) of the atom \(i\) (each neighbor is replaced with a Gaussian function centered on the neighbor \(g(r_{ij})\)) is defined as:
The exact mathematical details above don’t matter too much for this tutorial, the main point being that this representation treats atomic species as completely independent, effectively using the neighbor species \(\alpha\) for one-hot encoding.
import ase
import ase.visualize.plot
import matplotlib.pyplot as plt
import numpy as np
import metatensor
We will work on the radial spectrum representation of three molecules in our system: a carbon monoxide, an oxygen molecule and a nitrogen molecule.
atoms = ase.Atoms(
"COO2N2",
positions=[(0, 0, 0), (1.2, 0, 0), (0, 6, 0), (1.1, 6, 0), (6, 0, 0), (7.3, 0, 0)],
)
fig, ax = plt.subplots(figsize=(3, 3))
ase.visualize.plot.plot_atoms(atoms, ax)
ax.set_axis_off()
plt.show()
Sparsity in TensorMap
¶
The radial spectrum representation has two keys: central_species
indicating the
species of the central atom (atom \(i\) in the equations); and
neighbor_type
indicating the species of the neighboring atoms (atom \(j\)
in the equations)
radial_spectrum = metatensor.load("radial-spectrum.npz")
print(radial_spectrum)
TensorMap with 5 blocks
keys: center_type neighbor_type
6 6
6 8
7 7
8 6
8 8
This shows the first level of sparsity in TensorMap
: block sparsity.
Out of all possible combinations of central_species
and neighbor_type
, some
are missing such as central_species=7, neighbor_type=8
. This is because we are
using a spherical cutoff of 2.5 Å, and as such there are no oxygen neighbor atoms
close enough to the nitrogen centers. This means that all the corresponding radial
spectrum coefficients \(R_i^\alpha(n)\) will be zero (since the neighbor density
\(\rho_i^\alpha(r)\) is zero everywhere).
Instead of wasting memory space by storing all of these zeros explicitly, we simply avoid creating the corresponding blocks from the get-go and save a lot of memory!
Let’s now look at the block containing the representation for oxygen centers and carbon neighbors:
block = radial_spectrum.block(center_type=8, neighbor_type=6)
Naively, this block should contain samples for all oxygen atoms (since
center_type=8
); in practice we only have a single sample!
print(block.samples)
Labels(
system atom
0 1
)
There is a second level of sparsity happening here, using a format related to coordinate sparse arrays (COO format). Since there is only one oxygen atom with carbon neighbors, we only include this atom in the samples, and the density/radial spectrum coefficient for all the other oxygen atoms is assumed to be zero.
Making the data dense again¶
Sometimes, we might have to use data in a sparse metatensor format with code that does not understands this sparsity. One solution is to convert the data to a dense format, making the zeros explicit as much as possible. Metatensor provides functionalities to convert sparse data to a dense format for the keys sparsity; and metadata to convert to a dense format for sample sparsity.
First, the sample sparsity can be removed block by block by creating a new array full
of zeros, and copying the data according to the indices in block.samples
dense_block_data = np.zeros((len(atoms), block.values.shape[1]))
# only copy the non-zero data stored in the block
dense_block_data[block.samples["atom"]] = block.values
print(dense_block_data)
[[0. 0. 0. ]
[0.05052215 0.13111562 0.01025378]
[0. 0. 0. ]
[0. 0. 0. ]
[0. 0. 0. ]
[0. 0. 0. ]]
Alternatively, we can undo the keys sparsity with
TensorMap.keys_to_samples()
and TensorMap.keys_to_properties()
,
which merge multiple blocks along the samples or properties dimensions respectively.
Which one of these functions to call will depend on the data you are handling.
Typically, one-hot encoding (the neighbor_types
key here) should be merged
along the properties dimension; and keys that define subsets of the samples
(center_type
) should be merged along the samples dimension.
dense_radial_spectrum = radial_spectrum.keys_to_samples("center_type")
dense_radial_spectrum = dense_radial_spectrum.keys_to_properties("neighbor_type")
After calling these two functions, we now have a TensorMap
with a single
block and no keys:
print(dense_radial_spectrum)
block = dense_radial_spectrum.block()
TensorMap with 1 blocks
keys: _
0
We can see that the resulting dense data array contains a lot of zeros (and has a well defined block-sparse structure):
with np.printoptions(precision=3):
print(block.values)
[[ 0.556 -0.282 0.016 0.051 0.131 0.01 0. 0. 0. ]
[ 0.051 0.131 0.01 0.556 -0.282 0.016 0. 0. 0. ]
[ 0. 0. 0. 0.633 -0.152 0.017 0. 0. 0. ]
[ 0. 0. 0. 0.633 -0.152 0.017 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0.586 -0.157 0.036]
[ 0. 0. 0. 0. 0. 0. 0.586 -0.157 0.036]]
And using the metadata attached to the block, we can understand which part of the data is zero and why. For example, the lower-right corner of the array corresponds to nitrogen atoms (the last two samples):
print(block.samples.print(max_entries=-1))
system atom center_type
0 0 6
0 1 8
0 2 8
0 3 8
0 4 7
0 5 7
And these two bottom rows are zero everywhere, except in the part representing the nitrogen neighbor density:
print(block.properties.print(max_entries=-1))
neighbor_type n
6 0
6 1
6 2
8 0
8 1
8 2
7 0
7 1
7 2
Total running time of the script: (0 minutes 0.034 seconds)