3. Running the DeepH@FHI-aims interface

In this part, we showcase examples of dataset generation with the DeepH@FHI-aims interface.

Note

This chapter should be done in the aims2deeph environment.

3.1. Running on a single structure: molecule

Note

These paths have been set up in advance on the tutorial virtual machine and HPC resources.

Before executing the DeepH@FHI-aims interface, you need to set several environment variables, for example with the following commands:

export AIMS_LIB_PATH="/path/to/fhi-aims/build/libaims.240507.scalapack.mpi.so"
export AIMS_SPECIES_PATH="/path/to/fhi-aims/species_defaults/defaults_2020/light"
export DEEPH_INTERFACE_PATH="/path/to/deeph_interface/Src"

Please remember to change the three paths to the paths specific to your machine.
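A missing or mistyped path usually only shows up later as an import or library-loading error, so it can help to check the three variables before launching a run. The following is a minimal sketch using only the Python standard library. The helper name missing_env_vars is hypothetical and not part of the interface.

```python
import os

# The three environment variables named in this section.
REQUIRED_VARS = ("AIMS_LIB_PATH", "AIMS_SPECIES_PATH", "DEEPH_INTERFACE_PATH")

def missing_env_vars(env=None):
    """Return the required variables that are unset or point to a non-existent path.

    Hypothetical helper for illustration; not part of the DeepH@FHI-aims interface.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        path = env.get(name, "")
        if not path or not os.path.exists(path):
            problems.append(name)
    return problems

bad = missing_env_vars()
if bad:
    print("Please set or fix:", ", ".join(bad))
else:
    print("All three paths are set and exist.")
```

Running this at the top of run.py (or once in the shell session) gives an early, readable message instead of a failure inside the calculation.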

The DeepH@FHI-aims interface is executed through the ASI Python package via a Python call. This leaves considerable flexibility in the setup, including how the structure is provided. The example below assumes a geometry.in file as geometry input, but other formats such as POSCAR or *.cif can be read as well.

For testing purposes, we provide a geometry.in file for the \(\text{H}_2\text{O}\) molecule:

atom 0.0000000000000000  0.0000000000000000  0.1192620000000000 O
atom 0.0000000000000000  0.7632390000000000 -0.4770470000000000 H
atom 0.0000000000000000 -0.7632390000000000 -0.4770470000000000 H

A minimal example of a Python script for running the DeepH@FHI-aims interface is as follows:

# The ASI part of the code is modified from ASI package's examples.
# For details please refer to the documentation: https://pvst.gitlab.io/asi/asi_8h.html
import sys, os
import numpy as np
from asi4py.asecalc import ASI_ASE_calculator
from ase.io import read
from ase.calculators.aims import Aims
sys.path.append(os.environ["DEEPH_INTERFACE_PATH"])
from aims2DeepH import aims_get_data

deeph_output_dir = "preprocessed" # Output directory of DeepH's preprocessed files
atoms = read("geometry.in") # The structure to be computed
work_dir = "asi.temp" # Default work directory of ASI_ASE_calculator (no need to change)
logfile = "asi.log" # Default output file of ASI_ASE_calculator (no need to change)

def init_via_ase(asi):
  calc = Aims(xc='pbe',
    relativistic="atomic_zora scalar",
    occupation_type="gaussian 0.010",
    density_update_method="density_matrix",
    species_dir = os.environ['AIMS_SPECIES_PATH'],
    output = ['h_s_matrices'], # Mandatory for the DeepH interface
  )
  calc.write_input(asi.atoms)

# read path to ASI-implementing shared library from environment variable
AIMS_LIB_PATH = os.environ["AIMS_LIB_PATH"]

# initialize ASI library via ASE calculators
atoms.calc = ASI_ASE_calculator(AIMS_LIB_PATH, init_via_ase, None, atoms, work_dir=work_dir, logfile=logfile)

# Ask to save Hamiltonian and overlap matrices
atoms.calc.asi.keep_hamiltonian = True
atoms.calc.asi.keep_overlap = True
print(f'basis size = {atoms.calc.asi.n_basis}')
print(f'E = {atoms.get_potential_energy():.6f} eV') # actual calculation

aims_get_data(atoms, work_dir, logfile, deeph_output_dir) # The DeepH interface

This Python script can be used as a template for running the DeepH@FHI-aims interface on molecules. When adapting it for your own use, it helps to understand the following parameters:

deeph_output_dir: Output directory for DeepH-formatted files.
atoms = read("..."): The file from which the structural information is read. Since read is imported from ase.io, many input formats are supported.
calc = Aims(...): Settings for the FHI-aims calculation, which would normally be placed in the control.in file of FHI-aims. Please note that output = ['h_s_matrices'] is mandatory.
The basis set is controlled by the species_dir argument of the Aims() function, and the basis sets of the relevant atomic species are automatically pasted into the ASI-generated control.in. In the example above, the species directory is taken from the environment variable AIMS_SPECIES_PATH.

In summary, to execute the interface, please follow this step-by-step guide:

Setup on Workshop Resources

The setup of DeepH and its environment variables has been prepared in the tutorial resources, both on HPC and in the virtual machine!

Set the environment variables AIMS_LIB_PATH, AIMS_SPECIES_PATH and DEEPH_INTERFACE_PATH.
Activate the Python environment for the interface that you created during installation. Also load the MPI environment you used to install FHI-aims.
In the directory containing the template Python script (e.g., named run.py) and the input geometry.in, execute python run.py for a serial run, or mpirun -np {your_process_number} python run.py for a parallel run.

Note

On certain servers, if the h5py package cannot be installed with parallel support, you may only be able to run with mpirun -np 1 python run.py.

Whether the execution was successful can be checked as follows:

In the subdirectory named asi.temp, you will see files as if a regular FHI-aims run had been performed, but with ASI-generated control.in and geometry.in files, and with the standard output written to the file you specified with logfile = in the Python script.
In the directory you specified for deeph_output_dir, you will see a series of DeepH-formatted files (to be introduced later). In particular, the execution was likely successful if the files hamiltonians.h5 and overlaps.h5 are present in that directory.
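The second check above is easy to automate when many runs have to be inspected. The sketch below only tests that the two HDF5 files named in the text exist and are non-empty. It does not open them or verify SCF convergence. The function name looks_successful is a hypothetical choice, and the default directory name follows the deeph_output_dir used in the template script.

```python
import os

def looks_successful(run_dir, deeph_output_dir="preprocessed"):
    """Heuristic check that a run produced the two HDF5 files named in the text.

    Hypothetical helper: it tests presence and non-zero size only,
    not the content of the files or SCF convergence.
    """
    out = os.path.join(run_dir, deeph_output_dir)
    for name in ("hamiltonians.h5", "overlaps.h5"):
        path = os.path.join(out, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            return False
    return True

# Example: check the current working directory.
print(looks_successful("."))
```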

3.2. Running on a single structure: periodic system

Since both FHI-aims and DeepH support periodic systems, running the interface on a periodic system is very similar, except for two mandatory changes:

The output argument of the Aims() calculator must be set to output = ['h_s_matrices', "k_point_list"].
An argument k_grid = (kx, ky, kz) must be added to the Aims() calculator to specify the Brillouin zone sampling. Make sure the k_grid is reasonably converged to guarantee data quality.
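If one template is meant to serve both molecules and periodic systems, the two changes above can be applied conditionally. The sketch below builds the keyword arguments as a plain dictionary. The helper name aims_keywords is hypothetical, the physical settings are simply copied from the example scripts in this chapter, and species_dir is left out so that it can be passed separately.

```python
def aims_keywords(pbc, k_grid=None):
    """Build keyword arguments for Aims() for a molecule or a periodic system.

    pbc    : three booleans, e.g. the pbc attribute of an ASE Atoms object
    k_grid : (kx, ky, kz), required when any direction is periodic
    Hypothetical helper; settings are copied from the tutorial examples.
    """
    kwargs = dict(
        xc="pbe",
        relativistic="atomic_zora scalar",
        occupation_type="gaussian 0.010",
        density_update_method="density_matrix",
        output=["h_s_matrices"],  # mandatory for the DeepH interface
    )
    if any(pbc):
        if k_grid is None:
            raise ValueError("k_grid is required for periodic systems")
        kwargs["output"].append("k_point_list")  # extra output for periodic runs
        kwargs["k_grid"] = tuple(k_grid)
    return kwargs

print(aims_keywords((True, True, True), k_grid=(7, 7, 1)))
```

Inside init_via_ase, this could then be used along the lines of Aims(species_dir=os.environ['AIMS_SPECIES_PATH'], **aims_keywords(asi.atoms.pbc, k_grid=(7, 7, 1))).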

Here we provide a minimal example for executing the interface with \(\text{H-MoS}_2\), with the geometry.in as follows:

lattice_vector  3.1962229999999998 0.0000000000000000  0.0000000000000000 
lattice_vector -1.5981114999999999 2.7680103141601093  0.0000000000000000 
lattice_vector  0.0000000000000000 0.0000000000000000 23.1298299999999983 
atom_frac 0.0000000000000000 0.0000000000000000 0.5000000000000000 Mo
atom_frac 0.6666666667000001 0.3333333332999999 0.4217892221000000 S
atom_frac 0.6666666667000001 0.3333333332999999 0.5782107779000000 S

and the run.py:

# The ASI part of the code is modified from ASI package's examples.
# For details please refer to the documentation: https://pvst.gitlab.io/asi/asi_8h.html
import sys, os
import numpy as np
from asi4py.asecalc import ASI_ASE_calculator
from ase.io import read
from ase.calculators.aims import Aims
sys.path.append(os.environ["DEEPH_INTERFACE_PATH"])
from aims2DeepH import aims_get_data

deeph_output_dir = "preprocessed" # Output directory of DeepH's preprocessed files
atoms = read("geometry.in") # The structure to be computed
work_dir = "asi.temp" # Default work directory of ASI_ASE_calculator (no need to change)
logfile = "asi.log" # Default output file of ASI_ASE_calculator (no need to change)

def init_via_ase(asi):
  calc = Aims(xc='pbe',
    relativistic="atomic_zora scalar",
    occupation_type="gaussian 0.010",
    density_update_method="density_matrix",
    species_dir = os.environ['AIMS_SPECIES_PATH'],
    output = ['h_s_matrices', "k_point_list"], # Mandatory for the DeepH interface
    k_grid = (7, 7, 1)
  )
  calc.write_input(asi.atoms)

# read path to ASI-implementing shared library from environment variable
AIMS_LIB_PATH = os.environ["AIMS_LIB_PATH"]

# initialize ASI library via ASE calculators
atoms.calc = ASI_ASE_calculator(AIMS_LIB_PATH, init_via_ase, None, atoms, work_dir=work_dir, logfile=logfile)

# Ask to save Hamiltonian and overlap matrices
atoms.calc.asi.keep_hamiltonian = True
atoms.calc.asi.keep_overlap = True
print(f'basis size = {atoms.calc.asi.n_basis}')
print(f'E = {atoms.get_potential_energy():.6f} eV') # actual calculation

aims_get_data(atoms, work_dir, logfile, deeph_output_dir) # The DeepH interface

The step-by-step execution guide is the same as in section 3.1.

We kindly remind users to choose the calculation parameters carefully and to double-check the convergence of the calculations.

3.3. Guidance for dataset generation

It should be noted that dataset generation for DeepH training is largely an open question and strongly depends on what you intend to study. Below we provide guidance for several simple scenarios for reference:

Generation of perturbed datasets. Training DeepH on a perturbed dataset is useful in applications such as studying electron-phonon coupling. You can generate such datasets either by random perturbations or from ab initio molecular dynamics. DeepH has been trained on both setups in our example studies.
Dataset generation for Moiré-twisted materials. DeepH is able to learn from the Hamiltonians of small structures and generalize to large ones, as exemplified by going from small non-twisted bilayers to large-scale Moiré-twisted ones. Due to the twist, different local stackings (e.g. AA, AB and BA) are present in the Moiré-twisted structure. The training dataset can therefore be generated by including structures with random interlayer shifts plus random atomic perturbations.
Dataset generation for more complicated structures. It is worthwhile to explore DeepH's performance on structures from existing databases, with a view to formulating universal DeepH models. Studying the electronic structure of defects, interfaces and alloys with DeepH could also be valuable. For generating structures, we recommend using specialized structure-generation packages that can be seamlessly integrated with the DeepH@FHI-aims interface.

Several rules of thumb may be helpful for dataset generation:

DeepH typically requires 50 to 500 structures for perturbed structures or Moiré-twisted materials, depending on the complexity of the material and the magnitude of the perturbation you imposed. We still recommend that users check whether the data are sufficient for their specific use case.
As a rough guide, a DeepH model can be considered "accurate" when the mean absolute error (MAE) of its Hamiltonian matrix elements is below about 1 meV. On most datasets you can expect DeepH to do considerably better than that.
When generating a dataset, please avoid including unphysical structures, such as structures with atoms that are too close to each other, or calculations that did not reach SCF convergence. Including even one such structure can be disastrous for DeepH training.
As mentioned in section 1, the Hamiltonians H(R) are obtained from an inverse Fourier transform of H(k), and therefore require a reasonably dense k-mesh for convergence. A very coarse k-mesh may lead to absurd results for H(R)! There is a hard lower limit on the number of k-points \(n_k\) along each crystal axis. For the first crystal axis it reads:

\[n_k\cdot\frac{V_{\text{cell}}}{|\textbf{a}_2\times\textbf{a}_3|}>2\cdot R_c\]

Here \(n_k\) is the number of k-points along that axis, \(R_c\) denotes the maximal cutoff radius of the FHI-aims basis functions, \(V_\text{cell}\) represents the volume of the unit cell, and \(\textbf{a}_i\) denotes the \(i\)-th crystal axis.

Since the predicted Hamiltonian is expressed in a non-orthogonal basis, post-processing of the Hamiltonian is likely to be ill-conditioned if the basis set is large, leading to large numerical errors due to near-linear dependence of the basis functions. We therefore recommend using basis sets no larger than intermediate in routine usage.
Working with Hamiltonian matrix elements of high-angular-momentum basis functions is time-consuming for DeepH, and we therefore recommend using basis sets only up to f-orbitals in practice. Also, if you include basis functions up to \(l_{\text{max}}\), the irreducible representations of DeepH should go at least up to \(2l_{\text{max}}\), as will be described in section 4.
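The lower limit on the number of k-points given in the list above can be evaluated directly from the lattice vectors. The sketch below applies the same inequality to all three axes by cyclic permutation, which is our reading of "e.g. the first crystal axis". The function name min_k_points is hypothetical, and the cutoff radius in the usage line is a placeholder rather than an FHI-aims default. This gives only the hard lower limit, so a converged k-grid will normally be denser.

```python
import math

def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))

def min_k_points(cell, r_cut):
    """Smallest integer n_k per axis with n_k * V_cell / |a_j x a_k| > 2 * r_cut.

    cell  : three lattice vectors (a1, a2, a3) in Angstrom
    r_cut : maximal basis cutoff radius in Angstrom (user-supplied)
    """
    a1, a2, a3 = cell
    volume = abs(dot(a1, cross(a2, a3)))
    result = []
    for pair in ((a2, a3), (a3, a1), (a1, a2)):
        normal = cross(*pair)
        spacing = volume / math.sqrt(dot(normal, normal))  # V_cell / |a_j x a_k|
        result.append(math.floor(2.0 * r_cut / spacing) + 1)  # strict inequality
    return tuple(result)

# Orthorhombic test cell; r_cut = 6.0 Angstrom is a placeholder value.
print(min_k_points([(4.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 30.0)], 6.0))
```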

Although the previous examples generate DeepH files in each calculation subdirectory, it is not necessary to transfer the scripts or FHI-aims standard outputs for training, and only the DeepH-formatted files are mandatory. It is recommended to collect only the preprocessed directories from your dataset into a single parent directory. Each preprocessed directory should be renamed according to your structure identifier (for example, an integer index for each structure).

3.4. A practical example for generating datasets

Note

Since aims2DeepH is a Python-based workflow, there is great flexibility in how the dataset generation workflow is set up. Below we provide a basic example for generating a dataset of perturbed graphene structures, together with Python scripts to execute it.

Step 1: Prepare the structures

An example script is provided alongside the tutorial, assuming you have the aims2deeph environment activated.

The script generates a 4 \(\times\) 4 \(\times\) 1 supercell of a graphene structure, read from graphene.cif, and adds a random perturbation of up to 0.1 Angstrom along each Cartesian direction of every carbon atom. The script can be divided into two parts:

Specifying the input parameters, in which we define the input structure, supercell size, maximal value of the perturbation, and the number of the structures:

import os
import numpy as np
from ase.io import read, write
from ase.build import make_supercell

# Input Parameters
input_cif = "./graphene.cif"
supercell_matrix = np.diag([4, 4, 1])  # 4x4x1
n_structures = 100 # You usually want more!
perturb_range = 0.1  # Å
template_script = "./run.py" # template python script
dataset_root = "batch_calc"

Generating the dataset structures:

# Read input structure and make supercell
atoms = read(input_cif)
supercell = make_supercell(atoms, supercell_matrix)

# Generate perturbed copies
for i in range(n_structures):
    # Create a perturbed copy
    perturbed = supercell.copy()
    perturbations = np.random.uniform(-perturb_range, perturb_range, size=perturbed.positions.shape)
    perturbed.positions += perturbations

    # Create output directory
    outdir = os.path.join(dataset_root, f"{i}")
    os.makedirs(outdir, exist_ok=True)

    # Write to geometry.in in FHI-aims format
    outfile = os.path.join(outdir, "geometry.in")
    write(outfile, perturbed, format="aims")
    os.system(f"cp {template_script} {outdir}/")

print(f"Generated {n_structures} perturbed structures under '{dataset_root}/'")

After executing the script, there will be n_structures subdirectories in your specified dataset_root, each containing a geometry.in of a perturbed graphene supercell, as well as a copy of the template_script you provided.
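Section 3.3 advises against structures with atoms that are too close to each other, so a quick distance check on each generated structure can be worthwhile before spending compute time on it. The sketch below uses only the standard library. The function name min_distance is hypothetical, and the search over the 26 neighbouring periodic images is a simplification that is adequate for cells that are not extremely skewed. The acceptance threshold is left to the user, since a sensible value depends on the material.

```python
import itertools
import math

def min_distance(positions, cell=None):
    """Shortest interatomic distance in Angstrom.

    positions : list of (x, y, z) Cartesian coordinates
    cell      : optional three lattice vectors; if given, the neighbouring
                periodic images (shifts of -1, 0, +1 per axis) are searched too
    Hypothetical helper for illustration.
    """
    shifts = [(0.0, 0.0, 0.0)]
    if cell is not None:
        shifts = [
            tuple(sum(n * cell[k][c] for k, n in enumerate(ns)) for c in range(3))
            for ns in itertools.product((-1, 0, 1), repeat=3)
        ]
    best = math.inf
    for i, p in enumerate(positions):
        for j in range(i, len(positions)):
            q = positions[j]
            for s in shifts:
                if i == j and not any(s):
                    continue  # skip the distance of an atom to itself
                image = tuple(q[c] + s[c] for c in range(3))
                best = min(best, math.dist(p, image))
    return best

# Two atoms close to opposite faces of a 2 Angstrom cubic cell.
cubic = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
print(min_distance([(0.1, 0.0, 0.0), (1.9, 0.0, 0.0)], cubic))
```

In the generation loop, the positions and cell of each perturbed structure could be passed in as plain lists (for example perturbed.positions.tolist()), and structures below your chosen threshold skipped or regenerated.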

Step 2: Carry out FHI-aims simulations with ASI and aims2DeepH interface

How you run the FHI-aims simulations for the dataset is largely up to you. Since you usually do not want to run hundreds of simulations on one compute node, you may create a run.sh that executes the run.py in each subdirectory, and submit these in batch to your HPC server.

If your HPC server prohibits submitting a large number of jobs, you may wrap several run.py executions into one run.sh. For example:

# Headers for slurm/PBS/...

# Setup environments for FHI-aims@DeepH interface

# The generation script numbers the subdirectories from 0 to n_structures-1
for ((istru=0;istru<100;istru++))
do 
    echo $istru
    cd batch_calc/$istru 
    mpirun -np xxx python run.py
    cd ../..
done
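When the structures are to be spread over several such job scripts, the index ranges can be generated rather than typed by hand. The following sketch splits the structure indices into contiguous chunks and renders the loop of one run.sh per chunk. Both function names and the default of 4 MPI processes are hypothetical choices, and the scheduler headers and environment setup still have to be added for your own machine.

```python
def chunk_indices(n_structures, n_jobs):
    """Split indices 0..n_structures-1 into n_jobs contiguous, near-equal chunks."""
    base, extra = divmod(n_structures, n_jobs)
    chunks, start = [], 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

def render_job_script(indices, dataset_root="batch_calc", n_procs=4):
    """Return the loop of one run.sh as text (hypothetical layout).

    The subshell (...) keeps the working directory unchanged between
    iterations, even if one cd fails.
    """
    lines = [
        f"for istru in {' '.join(str(i) for i in indices)}",
        "do",
        f"    (cd {dataset_root}/$istru && mpirun -np {n_procs} python run.py)",
        "done",
    ]
    return "\n".join(lines) + "\n"

# Print the loop for the first of 4 jobs covering 100 structures.
print(render_job_script(chunk_indices(100, 4)[0]))
```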

Step 3: Check convergence and collect computed structures

Before collecting the dataset, it is necessary to check whether each FHI-aims run has completed. This is usually indicated by the presence of Self-consistency cycle converged. in the ASI logfile (e.g. asi.temp/asi.log). After that, we recommend collecting the output files of the DeepH@FHI-aims interface into one directory and transferring them to your GPU server for subsequent DeepH training. An example script for collecting them is as follows:

# Input parameters
calc_dir="batch_calc"
dataset_root="collect_preprocessed"
n_structures=100

mkdir -p "$dataset_root"

for ((i=0; i<n_structures; i++)); do
    log_file="$calc_dir/$i/asi.temp/asi.log"
    src_preprocessed="$calc_dir/$i/preprocessed"
    dst_preprocessed="$dataset_root/$i"

    if [[ -f "$log_file" ]]; then
        if grep -q "Self-consistency cycle converged." "$log_file"; then
            echo "[$i] Converged — moving preprocessed/"
            mkdir -p "$dataset_root"
            mv "$src_preprocessed" "$dst_preprocessed"
        else
            echo "[$i] Not converged."
        fi
    else
        echo "[$i] No log file found."
    fi
done

Appendix: Explanation of DeepH-formatted files

The current version of the DeepH@FHI-aims interface produces files in the "traditional" DeepH format, compatible with the original DeepH-pack and DeepH-E3. In this format, a directory contains all structural and electronic-structure information of one structure, as specified below:

Structural information:

element.dat: File containing atomic number (Z) for each atom of the structure
lat.dat: File specifying the lattice vectors, in Angstrom. Each column corresponds to a lattice vector.
rlat.dat (optional): File specifying the reciprocal lattice vectors. This file is optional for training and subsequent usage, since the reciprocal lattice can always be computed from the lattice vectors.
site_positions.dat: File specifying Cartesian atomic positions, in Angstrom. Each column corresponds to the x, y, and z coordinates of an atom.

Electronic structure information:

info.json: Meta information for the structure. The fermi_level entry has unit eV.
orbital_types.dat: Each row specifies the basis set combinations of an atom, represented by a sequence of non-negative integers. An integer 0 indicates the presence of one s-type basis; 1 corresponds to a set of three p-type bases; 2 and 3 represent d- and f-type basis sets, respectively. For example, an atom with three sets of s-bases, two sets of p-bases, two sets of d-bases, and a set of f-bases is encoded as 0 0 0 1 1 2 2 3, which represents 8 sets of basis functions, totaling 26 individual basis functions.
hamiltonians.h5: An HDF5 file specifying the electronic Hamiltonian, where each key points to the Hamiltonian matrix elements between the orbitals of an atom pair. Each key is a string of the form "[Rx, Ry, Rz, iatom, jatom]". Here Rx, Ry and Rz indicate how the hopping from the starting atom to the terminating atom spans across unit cells, while iatom and jatom denote the indices of the starting and terminating atoms within their respective unit cells. Each key corresponds to a matrix of dimension (nbasis_i, nbasis_j), where nbasis_i and nbasis_j are the numbers of basis functions of atoms i and j, respectively.
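The encoding of orbital_types.dat and of the HDF5 keys described above can be illustrated with a few lines of standard-library Python. The sketch counts the basis functions of an atom from its row of angular momenta (each l contributes 2l + 1 functions) and splits a key string into its lattice shift and atom indices. The function names are hypothetical, and whether atom indices start at 0 or 1 is not assumed here, so please check against your own files.

```python
import json

def parse_orbital_types_row(row):
    """Turn one row of orbital_types.dat, e.g. "0 0 1", into a list of l values."""
    return [int(token) for token in row.split()]

def count_basis_functions(orbital_types):
    """Each angular momentum l contributes 2l + 1 basis functions."""
    return sum(2 * l + 1 for l in orbital_types)

def parse_hopping_key(key):
    """Split a key such as "[0, 0, 1, 1, 2]" into ((Rx, Ry, Rz), iatom, jatom)."""
    rx, ry, rz, iatom, jatom = json.loads(key)
    return (rx, ry, rz), iatom, jatom

# The example from the text: 3 s, 2 p, 2 d and 1 f set of basis functions.
row = parse_orbital_types_row("0 0 0 1 1 2 2 3")
print(len(row), count_basis_functions(row))
print(parse_hopping_key("[0, 0, 1, 1, 2]"))
```

With the basis counts of atoms i and j known, the expected shape (nbasis_i, nbasis_j) of each block can be compared against the datasets stored in hamiltonians.h5, for instance when iterating over the keys of the file opened with h5py.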

Note

We're preparing the official release of DeepH-pack in a completely reconstructed form, with improved accuracy and efficiency. This version will be introduced in parts 4 and 5 of the tutorial. The new DeepH-pack uses a more compact DeepH format. At the start of part 4, we'll showcase how to convert to that compact format.