Build database¶

Collect paths of VASP calculations¶

Find all directories containing OUTCAR file:
```
find . -name OUTCAR > paths.log
```
Remove the string ‘OUTCAR’ in paths.log.
```
sed -i 's/OUTCAR$//g' paths.log
```
Specify the absolute paths in paths.log.
```
sed -i "s#^.#${PWD}#g" paths.log
```

You may want to remove lines with string: sed -i '/string/d' paths.log

Python script¶

Modify data_config for your own purposes. See default_data_config to know how to use the parameter settings.

from agat.data import BuildDatabase
data_config =  {
    'species': ['H', 'Ni', 'Co', 'Fe', 'Pd', 'Pt'],
    'path_file': 'paths.log', # A file of absolute paths where OUTCAR and XDATCAR files exist.
    'build_properties': {'energy': True,
                         'forces': True,
                         'cell': True,
                         'cart_coords': False,
                         'frac_coords': True,
                         'constraints': True,
                         'stress': True,
                         'distance': True,
                         'direction': True,
                         'path': False}, # Properties needed to be built into graph.
    'dataset_path': 'dataset', # Path where the collected data to save.
    'mode_of_NN': 'ase_dist', # How to identify connections between atoms. 'ase_natural_cutoffs', 'pymatgen_dist', 'ase_dist', 'voronoi'. Note that pymatgen is much faster than ase.
    'cutoff': 5.0, # Cutoff distance to identify connections between atoms. Deprecated if ``mode_of_NN`` is ``'ase_natural_cutoffs'``
    'load_from_binary': False, # Read graphs from binary graphs that are constructed before. If this variable is ``True``, these above variables will be depressed.
    'num_of_cores': 8,
    'super_cell': False,
    'has_adsorbate': False,
    'keep_readable_structural_files': False,
    'mask_similar_frames': False,
    'mask_reversed_magnetic_moments': False, # or -0.5 # Frames with atomic magnetic moments lower than this value will be masked.
    'energy_stride': 0.05,
    'scale_prop': False
             }

if __name__ == '__main__': # encapsulate the following line in '__main__' because of the `multiprocessing`
    database = BuildDatabase(**data_config)
    database.build()

Outputs¶

A new folder is created, which is defined by the data_config['dataset_path']. The structure of this folder is:

dataset
├── all_graphs.bin
├── fname_prop.csv
└── graph_build_scheme.json

File name	Explanation
`all_graphs.bin`	Binary file of the DGL graphs
`fname_prop.csv`	A file storing the structural file name, properties, and paths. This file will not be used in the training, but is useful for checking the raw data.
`graph_build_scheme.json`	An information file tells you how to build the database. When deploying the well-trained model, this file is useful to construct new graphs.