Training YAML Reference¶
Overview¶
metatrain uses a YAML file to specify the parameters for model training, accessed
via mtt train options.yaml. In this section, we provide a complete reference for the
parameters provided by the training YAML input. For a minimal example of a YAML input
file, suitable for starting a first training, we refer the reader to the sample YAML file in
the Quickstart section.
The YAML input file can be divided into five sections: computational parameters, architecture, loss, data, and WandB integration.
Computational Parameters¶
The computational parameters define the computational device, base_precision and
seed. These parameters are optional.
device: cuda
base_precision: 32
seed: 0
- BaseHypers.device: NotRequired[str]
The computational device used for model training. If not provided,
metatrain automatically chooses the best option by default. The available devices and the best device option depend on the model architecture. The easiest way to use this parameter is to set it to either "cpu", "gpu", or "multi-gpu". Internally, under the choice "gpu", the script will automatically choose between "cuda" and "mps".
- BaseHypers.base_precision: NotRequired[Literal[16, 32, 64]]
The base precision for float values. For example, a value of
16 corresponds to the data type float16. The data types that are supported, as well as the default data type, depend on the model architecture used.
- BaseHypers.seed: NotRequired[Annotated[int, Ge(ge=0)]]
The seed used for non-deterministic operations. It sets the seed of
numpy.random, random, torch and torch.cuda. This parameter is important for ensuring reproducibility. If not specified, the seed is generated randomly and reported in the log.
Architecture¶
The next section of the YAML file covers options pertaining to the architecture. The main skeleton is as follows:
architecture:
  name: architecture_name
  model:
    ...
  training:
    ...
The options for the architecture.model and architecture.training sections are
highly specific to the architecture used. You can refer to the architecture
documentation page to find the options for your desired
architecture.
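As a concrete sketch, the skeleton above could be filled in for the PET architecture using the hyperparameters that appear later in this reference (cutoff, batch_size, epochs); the available options vary between architectures, so check the architecture documentation before copying this:

```yaml
architecture:
  name: pet
  model:
    cutoff: 5.0      # neighbor cutoff radius, in the dataset's length unit
  training:
    batch_size: 32
    epochs: 100
```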
Loss¶
A special parameter that you will find in the architecture.training section is
the one dedicated to the loss. There is a plethora of loss functions used in different
ML workflows, and you can refer to the loss functions documentation
to see which of these different cases metatrain supports.
Data¶
The final section of the YAML file focuses on options regarding the data used in model training. This section can be broken down into three subsections:
- training_set
- validation_set
- test_set (optional)
The training set is the data that will be used for model training. The validation set is used to track the generalizability of the model during training, and usually to decide on the best model. The test set is used only after training, to evaluate the model’s performance on an unseen dataset. If not specified, no test set will be created. Each subsection has the same parameter configuration. As an example, the configuration of the training set is usually divided into three main sections:
training_set:
  systems:
    ...
  targets:
    ...
  extra_data:
    ...
with the three sections being:
- systems: defines the molecular/crystal structures, which are the inputs to the model.
- targets: defines the outputs to be predicted by the model.
- extra_data: defines any additional data required by the loss function during training.
The validation and test set sections can also be fully specified in the same way as the training set section, but they can also be simply a fraction of the training set. For example:
training_set:
  ... # Training set specification
validation_set: 0.1
test_set: 0.2
will randomly select 10% of the training set for validation and 20% for testing.
The selected indices for the training, validation and test subset will be
available in the outputs directory.
Note
If you don’t need a test set, you can simply omit the test_set parameter entirely.
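Conversely, a validation or test set can be fully specified with its own systems and targets, just like the training set. A minimal sketch, assuming a hypothetical separate file dataset_val.xyz holding the validation structures and energies:

```yaml
training_set:
  systems: dataset.xyz
  targets:
    energy:
      key: energy
      unit: eV
validation_set:
  systems: dataset_val.xyz   # hypothetical separate validation file
  targets:
    energy:
      key: energy
      unit: eV
```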
Systems¶
The systems section can be defined as simply as:
training_set:
  systems: dataset.xyz
  ... # Rest of training set specification
which would instruct metatrain to read the systems from the file
dataset.xyz using the default reader inferred from the file extension. If one
requires more control over the way the systems are read, one can provide a
specification that is defined by the following parameters:
- class metatrain.share.base_hypers.SystemsHypers[source]
Hyperparameters for the systems in the dataset.
- read_from: NotRequired[str]
Path to the file containing the systems.
- reader: NotRequired[Literal['ase', 'metatensor'] | None]
The reader library to use for parsing.
If null or not provided, the reader will be guessed from the file extension. For example, .xyz and .extxyz will be read by ase, and .mts will be read by metatensor.
- length_unit: NotRequired[str | None]
Unit of lengths in the system file, optional but highly recommended for running simulations. If not given, no unit conversion will be performed when running simulations, which may lead to severe errors.
The list of possible length units is available here.
As an example, the simple configuration that we saw previously is equivalent to:
training_set:
  systems:
    read_from: dataset.xyz
    reader: null
    length_unit: null
  ... # Rest of training set specification
Targets¶
In the targets category, one can define any number of target sections, each with a
unique name, i.e. something like:
training_set:
  targets:
    energy:
      ... # Energy target specification
    mtt::dipole:
      ... # Dipole target specification
  ... # Rest of training set specification
The name of the target should either be a standard output of metatomic
(see metatomic outputs documentation)
or begin with mtt::; see the full data example below for a fully fledged
version of a training set specification.
Each target can be specified with the following parameters:
- class metatrain.share.base_hypers.TargetHypers[source]
Hyperparameters for the targets in the dataset.
- quantity: NotRequired[str] = ''
The quantity that the target represents (e.g.,
energy, dipole). Currently only energy gets a special treatment from metatrain; for any other quantity there is no need to specify it.
- read_from: NotRequired[str]
The path to the file containing the target data, defaults to
the systems.read_from path if not provided.
- reader: NotRequired[Literal['ase', 'metatensor'] | None | dict]
The reader library to use for parsing.
If
null or not provided, the reader will be guessed from the file extension. For example, .xyz and .extxyz will be read by ase, and .mts will be read by metatensor.
- key: NotRequired[str]
The key under which the target is stored in the file.
If not provided, it defaults to the key of the target in the yaml dataset specification.
- unit: NotRequired[str] = ''
Unit of the target, optional but highly recommended for running simulations. If not given, no unit conversion will be performed when running simulations, which may lead to severe errors.
The list of possible units is available here.
- per_atom: NotRequired[bool] = False
Whether the target is a per-atom quantity, as opposed to a global (per-structure) quantity.
- type: NotRequired[Literal['scalar'] | CartesianTargetTypeHypers | SphericalTargetTypeHypers]
Specifies the type of the target.
See Fitting Generic Targets to understand in detail how to specify each target type.
- num_subtargets: NotRequired[int] = 1
Specifies the number of sub-targets that need to be learned as part of this target.
Each subtarget is treated as entirely equivalent by models in metatrain, and they will often be represented as outputs of the same neural network layer. A common use case for this field is when you are learning a discretization of a continuous target, such as the grid points of a function. In the full data example below, there are 4000 sub-targets for the density of states (DOS). In metatensor, these correspond to the number of properties of the target.
- description: NotRequired[str] = ''
A description of this target. A description is highly recommended if there is more than one target with the same
quantity.
- forces: NotRequired[bool | str | GradientDict]
Specification for the forces associated with the target.
See Gradient Subsection.
- stress: NotRequired[bool | str | GradientDict]
Specification for the stress associated with the target.
See Gradient Subsection.
- virial: NotRequired[bool | str | GradientDict]
Specification for the virial associated with the target.
See Gradient Subsection.
A single string in a target section automatically expands, using the string as the
read_from parameter.
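For example, the following shorthand specification:

```yaml
targets:
  energy: dataset.xyz
```

is equivalent to spelling out the read_from parameter:

```yaml
targets:
  energy:
    read_from: dataset.xyz
```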
Gradient Subsection¶
Each gradient subsection (like forces or stress) has similar parameters:
- class metatrain.share.base_hypers.GradientDict[source]
- read_from: NotRequired[str]
The path to the file for gradient data.
If not provided, the path from its associated target is used.
- reader: NotRequired[Literal['ase', 'metatensor'] | None | dict]
The reader library to use for parsing.
If
null or not provided, the reader will be guessed from the file extension. For example, .xyz and .extxyz will be read by ase, and .mts will be read by metatensor.
- key: NotRequired[str]
The key under which the target is stored in the file.
If not provided, it defaults to the key of the gradient in the yaml dataset specification.
A single string in a gradient section automatically expands, using the string as the
read_from parameter.
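For instance, a gradient subsection given as a single string (here with a hypothetical file name forces.xyz):

```yaml
forces: forces.xyz
```

expands to:

```yaml
forces:
  read_from: forces.xyz
```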
Sections set to true or on automatically expand with default parameters. A
warning is raised if the requisite data for a gradient is missing, but training
proceeds without it. For instance,
targets:
  energy:
    quantity: energy
    read_from: dataset.xyz
    reader: ase
    key: energy
    unit: null
    forces:
      read_from: dataset.xyz
      reader: ase
      key: forces
    stress:
      read_from: dataset.xyz
      reader: ase
      key: stress
can be condensed into
targets:
  energy:
    quantity: energy
    read_from: dataset.xyz
    reader: ase
    key: energy
    unit: null
    forces: on
    stress: on
Note
During dataset parsing, unknown keys in any section are ignored rather than deleted.
Datasets requiring additional data¶
Some targets require additional data to be passed to the loss function for training. In
the full data example below, we include a mask for the density of states, which defines the
regions of the DOS that are well-defined based on the eigenvalues of the underlying
electronic structure calculation. This is important when the DOS is computed over a
finite energy range, as the DOS near the edges of this range may be inaccurate due to
the lack of states computed beyond this range. metatrain supports passing such additional
data in the options.yaml file, through the extra_data section shown in the full data
example below.
As another example, training a model to predict the polarization for extended systems
under periodic boundary conditions might require the quantum of polarization to be
provided for each system in the dataset. For this, you can add the following section to
your options.yaml file:
training_set:
  systems:
    read_from: dataset_0.xyz
    length_unit: angstrom
  targets:
    mtt::polarization:
      read_from: polarization.mts
  extra_data:
    polarization_quantum:
      read_from: polarization_quantum.mts
Warning
While the extra_data section can always be present, it is only used by specific
loss functions; if the loss function you picked does not support the extra data,
it will be ignored.
The extra_data section supports the same parameters as the target sections. In this
case, we have also read the targets and extra data from files other than the systems
file.
Full data example¶
Here is a full-fledged example of a training set specification, in this case for learning the electronic density of states (DOS) along with energies, forces and stresses:
training_set:
  systems:
    read_from: dataset.xyz
    reader: ase
    length_unit: null
  targets:
    energy:
      quantity: energy
      read_from: dataset.xyz
      reader: ase
      key: energy
      unit: null
      per_atom: True
      type: scalar
      num_subtargets: 1
      forces:
        read_from: dataset.xyz
        reader: ase
        key: forces
      stress:
        read_from: dataset.xyz
        reader: ase
        key: stress
    non_conservative_forces:
      quantity: null
      read_from: nonconservative_force.mts
      reader: metatensor
      key: forces
      unit: null
      per_atom: True
      type:
        cartesian:
          rank: 1
      num_subtargets: 1
    mtt::dos:
      quantity: null
      read_from: DOS.mts
      reader: metatensor
      key: dos
      unit: null
      per_atom: False
      type: scalar
      num_subtargets: 4000
  extra_data:
    mtt::dos_mask:
      quantity: null
      read_from: dataset.xyz
      reader: ase
      key: dos_mask
      unit: null
      per_atom: False
      type: scalar
      num_subtargets: 4000
Using Multiple Files for Training¶
For some applications, it is simpler to provide more than one dataset for model
training. metatrain supports stacking several datasets together using the YAML
list syntax, which consists of lines at the same indentation level, each starting
with "- " (a dash and a space):
training_set:
  - systems:
      read_from: dataset_0.xyz
      length_unit: angstrom
    targets:
      energy:
        quantity: energy
        key: my_energy_label0
        unit: eV
  - systems:
      read_from: dataset_1.xyz
      length_unit: angstrom
    targets:
      energy:
        quantity: energy
        key: my_energy_label1
        unit: eV
      free-energy:
        quantity: energy
        key: my_free_energy
        unit: hartree
test_set: 0.1
validation_set: 0.1
The required test and validation splits are performed consistently for each
element in training_set.
The length_unit has to be the same for each element of the list. If target section
names are the same for different elements of the list, their unit also has to be the
same. In the example above, the target section energy exists in both list
elements and therefore has the same unit eV. The target section free-energy
only exists in the second element, so its unit does not have to match anything in the
first element of the list.
Typically, the global atomic types that the model is defined for are inferred from the
training and validation datasets. Sometimes, due to shuffling of datasets with low
representation of some types, these datasets may not contain all atomic types that you
want to use in your model. To explicitly control the atomic types the model is defined
for, specify the atomic_types key in the architecture section of the options
file:
architecture:
  name: pet
  model:
    cutoff: 5.0
  training:
    batch_size: 32
    epochs: 100
  atomic_types: [1, 6, 7, 8, 16] # i.e. for H, C, N, O, S
Warning
Even though parsing several datasets is supported by the library, it may not work with every architecture. Check the documentation of your desired architecture to see whether it supports multiple datasets.
WandB Integration¶
This optional section deals with integration with Weights and Biases (wandb) logging. Leaving this section out simply disables wandb integration. The parameters for this section are the same as those of wandb.init. Here we provide a minimal example for the YAML input:
wandb:
  project: my_project
  name: my_run_name
  tags:
    - tag1
    - tag2
  notes: This is a test run
All parameters of your options.yaml file will be automatically added to the wandb
run so you don’t have to set the config parameter.
Important
You need to install wandb with pip install wandb if you want to use this logger.
Before running, also set up your credentials with wandb login from the
command line. See the wandb login documentation for details on the setup.