4. Training DeepH models

In this part, we demonstrate quick examples for DeepH training.

Note

The steps in this chapter should be run in the deeph environment.

For demonstration purposes, we include two example datasets: \(\text{H}_2\text{O}\) and \(\text{Mo}\text{Te}_2\). Please note that, due to data size limitations, only a small subset of the \(\text{Mo}\text{Te}_2\) dataset is provided, which is insufficient for most practical training runs. Consequently, the lightweight \(\text{H}_2\text{O}\) dataset is used to demonstrate training, while inference is exemplified on the more complex \(\text{Mo}\text{Te}_2\), including visualization of DeepH-predicted band structures.

Please do not confuse the two datasets involved. The corresponding training configuration files are located in the subdirectories for H2O and MoTe2. You can also distinguish them by checking the dataset_name field in each configuration (*.toml) file. Nevertheless, since DeepH is designed to handle both molecules and solid-state systems seamlessly, the configuration files for these two examples are largely similar.

Note

The training data for the H2O and MoTe2 cases must be downloaded from this link before continuing with the tutorial.

4.1. Migrate DeepH legacy data to the updated format seamlessly

Upon completing the operations in Section 3, you will obtain training data fully compatible with the DeepH-E3 (PyTorch) implementation developed by Xiaoxun Gong. During our development of the new DeepH version, user feedback consistently highlighted the extremely low I/O efficiency of Hamiltonian and overlap matrix storage formats. To address this, we've optimized the file storage architecture to create a new format. Recognizing that existing DeepH users possess substantial legacy datasets, we've integrated data conversion interfaces within deeph-dock. This enables seamless conversion of both traditional DeepH-E3 training data and FHI-aims-generated datasets to the optimized new format.

dock.io.DeepHOldToNew -p [PARALLEL_NUM] <legacy_dir> <updated_dir>

For example:

dock.io.DeepHOldToNew -p 32 ./collect_preprocessed ./dft

The folder tree structure looks like this:

collect_preprocessed/ # (legacy data folder)
dft/ # (updated data folder)
  |- 0/
     |- POSCAR
     |- info.json
     |- overlap.h5
     |- hamiltonian.h5 # (optional)
  |- 1/
  |- ...

where overlap.h5 and hamiltonian.h5 store the corresponding matrices in the localized atomic orbital (AO) basis, the POSCAR file contains the structural information, and the info.json file stores critical metadata for the current structure. The root directory for DFT raw data must be named dft/, whereas the subfolders inside it may be named more freely, subject to some naming conventions (e.g., free-form labels instead of numerical indices such as 0, 1, 2...).
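
Before building graphs, it can help to confirm that every structure folder contains the files listed above. The following is a minimal sketch, not part of deeph-dock or DeepH-pack; the function name find_incomplete_structures is hypothetical, and treating POSCAR, info.json, and overlap.h5 as required (with hamiltonian.h5 optional) is taken from the folder tree shown here.

```python
from pathlib import Path

# Files expected in every structure folder; hamiltonian.h5 is optional.
REQUIRED_FILES = ("POSCAR", "info.json", "overlap.h5")


def find_incomplete_structures(dft_root):
    """Return {subfolder name: [missing files]} for incomplete structure folders.

    Hypothetical helper for illustration only.
    """
    root = Path(dft_root).resolve()
    if root.name != "dft":
        raise ValueError(f"root directory must be named 'dft', got '{root.name}'")
    problems = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [name for name in REQUIRED_FILES if not (sub / name).is_file()]
        if missing:
            problems[sub.name] = missing
    return problems
```

Calling find_incomplete_structures("./dft") returns an empty dictionary when every subfolder is complete.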

4.2. Build training data Graph files

DeepH models are graph neural networks (GNNs). They take atomic structures as input and predict physical quantities. The input structures are treated as graphs with atoms as nodes. Any pair of atoms \(i\) and \(j\) is connected by directed edges \(i \rightarrow j\) and \(j \rightarrow i\) if the atoms are sufficiently close (i.e., their atomic orbital basis functions overlap). There are also self-loops \(i \rightarrow i\) in the graph. Physical quantities, such as Hamiltonian matrix elements, are interpreted as "features" associated with the nodes and edges of the graph.
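
The edge rule described above can be sketched in a few lines. This is an illustration only, not DeepH-pack's graph builder: it assumes a single distance cutoff in place of the orbital-overlap criterion, ignores periodic images, and uses made-up coordinates for a water-like molecule. The function name build_edges is hypothetical.

```python
import math


def build_edges(positions, cutoff):
    """Return directed edges (i, j) for atom pairs closer than `cutoff`, plus self-loops.

    Assumption: one global cutoff stands in for "basis functions overlap".
    """
    edges = []
    for i, pos_i in enumerate(positions):
        for j, pos_j in enumerate(positions):
            if i == j or math.dist(pos_i, pos_j) < cutoff:
                edges.append((i, j))
    return edges


# Water-like geometry in angstrom (illustrative values): O, H, H
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
edges = build_edges(water, cutoff=5.0)
print(len(edges))  # 3 self-loops + 6 directed pair edges = 9
```

With all three atoms mutually connected, each structure contributes 9 edges, which is consistent with the 27000 edges reported for 3000 training structures in the log later in this chapter.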

Technically, graph files are constructed directly from the DFT data and contain all the information that DeepH training needs, so the two are equivalent for training purposes. Compared with traditional storage of DFT data spread across many folders, graph files offer several technical advantages:

Numerical Precision Flexibility: DeepH-pack supports user-defined 32-bit or 64-bit floating point precision storage, significantly enhancing storage efficiency through optimized data type configurations.
Unified Data Portability: Leveraging a single-file integrated architecture, graph files should be prioritized over raw fragmented data during cross-server cluster transfers to streamline data mobility.
Generalized Compatibility: Designed with a universal data structure, the graph file format is not only compatible with the DeepH framework but also theoretically extensible to training workflows of diverse neural network architectures.
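
As a rough illustration of the first point, the snippet below estimates the raw size of the stored numbers at each precision. The entry count is the one reported in the H2O training log later in this chapter; the helper name raw_size_mb is hypothetical, and the estimate ignores file metadata, indices, and any complex-valued entries.

```python
# Bytes needed per real-valued entry at each supported precision.
BYTES_PER_ENTRY = {"fp32": 4, "fp64": 8}


def raw_size_mb(num_entries, float_type):
    """Raw payload size in MB (10^6 bytes), ignoring all file overhead."""
    return num_entries * BYTES_PER_ENTRY[float_type] / 1e6


num_entries = 3_267_000  # "data entries" reported for the H2O training set
for float_type in ("fp32", "fp64"):
    print(f"{float_type}: {raw_size_mb(num_entries, float_type):.3f} MB")
# fp32: 13.068 MB
# fp64: 26.136 MB
```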

In DeepH-pack, the graph folder layout looks like this:

graph
  |- <GRAPH_NAME>.<GRAPH_TYPE>.memory.pt
  |- <GRAPH_NAME>.<GRAPH_TYPE>.disk.pt
  |- <GRAPH_NAME>.<GRAPH_TYPE>.disk.part1-of-1.db/
  |- <GRAPH_NAME>.<GRAPH_TYPE>.disk.part1-of-1.info.pt

The root directory for graph files must be named graph/, and all graph files must reside within it.

DeepH-pack currently supports two storage modes for graph files: memory mode and disk mode. Memory mode pre-loads the entire graph file into node memory when DeepH training initializes, which is the most efficient choice when the dataset fits in the available memory. Disk mode streams data on demand from database-backed storage on disk and is designed for over-sized graph files that exceed node memory capacity (e.g., >10 TiB). Together, the two modes allow training to proceed regardless of dataset size: disk mode provides real-time access to the data during computation without loading the whole file into memory.

To initiate DeepH training:

Either prepare the DFT directory (enabling automatic graph construction at training start)
Or provide pre-built graph files (transferred to the GPU cluster from external sources).

Both approaches are fully supported by DeepH-pack.

inputs
  |- dft/ # (optional, if graph folder exist)
  |- graph/ # (optional)
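
A quick pre-flight check of the inputs folder can be sketched as follows. This is not DeepH-pack's own logic: the function name detect_training_inputs is hypothetical, and recognizing pre-built graphs by the presence of *.pt files in graph/ is an assumption based on the graph folder layout shown earlier.

```python
from pathlib import Path


def detect_training_inputs(inputs_dir):
    """Return 'graph' if pre-built graph files are present, otherwise 'dft'.

    Hypothetical helper; assumes graph files are the *.pt files inside graph/.
    """
    inputs = Path(inputs_dir)
    graph_dir = inputs / "graph"
    if graph_dir.is_dir() and any(graph_dir.glob("*.pt")):
        return "graph"
    if (inputs / "dft").is_dir():
        return "dft"
    raise FileNotFoundError(f"neither graph files nor a dft/ folder found in {inputs}")
```

Calling detect_training_inputs("./inputs") before submitting a job shows which of the two routes the prepared data corresponds to.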

4.3. Building the Graph Separately

Upon initiating a standard DeepH training session, the framework automatically constructs graph files from the DFT data stored in the designated dft/ directory and generates the corresponding graph dataloader. Because graph construction runs exclusively on CPUs and graph files are easy to transfer, DeepH-pack also supports generating graphs separately from the GPU-accelerated training process. If graph files already exist, training skips the raw DFT data and reads the graph files directly.

build_graph.toml:

# ----------------------------- SYSTEM -----------------------------
[system]
note = "Welcome to DeepH-pack!"
device = "cpu"
float_type = "fp32" # or `fp64`
random_seed = 137

# ----------------------------- DATA -------------------------------
[data]
inputs_dir = "." # Inputs path that contains `dft` and `graph`
outputs_dir = "./build_graph_logs" # Logging path

[data.graph]
dataset_name = "H2O_5K"
graph_type = "HS" # Graph will include both Hamiltonian and Overlap matrices
storage_type = "memory" # or `disk`
common_orbital_types = "" # See the Doc. for more detailed info.
parallel_num = -1 # Parallel processes during build graph
only_save_graph = true # Only generate and save the graph, without training

You can then use the following command to build the graph files:

deeph-train build_graph.toml

Note: For the only_save_graph task, GPUs are not required. You may create the Python environment using DEEPH-env --install --cpu as specified in Section 2.

4.4. Start Training DeepH models

With prepared training data in the inputs/ directory, configure a minimal TOML file train.toml and execute the following command to run model training:

deeph-train train.toml

The TOML configuration file comprises four core sections:

system: handles hardware and computational environment declarations, etc.
data: specifies training dataset locations, features, and metadata, etc.
model: defines base architecture components and target physical quantities (not limited to the Hamiltonian – future releases will progressively support force fields, interatomic potentials, charge density, density matrices, GW calculations, etc.), including loss function selection.
process: controls training/inference workflows through convergence criteria, data loader configurations, optimizers, restart settings, etc.

Due to space and time constraints, this section cannot exhaustively cover all TOML configuration details. Only the most critical parameters for the current model are presented below.

system.device: The device configuration follows the syntax <type>*<num>:<id>, where <type> specifies hardware type (cpu, gpu, tpu, rocm, dcu, or cuda), <num> denotes either the total devices per node (for accelerators like GPU) or the number of CPU partitions (when using cpu), and <id> defines target device indices (e.g., gpu*8:1-4,7 selects 5 GPUs from an 8-device node using indices 1,2,3,4,7). Note: For CPU configurations, <id> is ignored while <num> controls thread partitioning.
model.net_type: The neural network architecture. In the Light version of DeepH-pack, two architectures are available:
sparrow is a lightweight architecture (typically <1M parameters) with both node and edge features, which is suitable for small DFT Hamiltonian learning tasks. (DeepH-E3 like networks)
eagle is an advanced architecture (typically \(\sim\) 5M parameters) with both node and edge features, which is suitable for DFT Hamiltonian learning tasks that require high accuracy.
model.advanced.net_irreps: Irreducible representations (irreps) of the neural network features, which ensure the equivariance of the network. They are specified as a string in the e3nn.Irreps format, which describes the symmetry of the features. Note that for Hamiltonian prediction tasks, the maximum \(l\) specified in the Irreps must be at least twice the highest angular momentum quantum number present in the Hamiltonian's basis set. This requirement arises because Irreps transform the uncoupled representation (direct product basis) of the Hamiltonian into a coupled representation (direct sum basis). For example, when f-orbitals (\(l=3\)) are included, the Irreps must support \(l_{\text{max}} \geq 6\).
process.train.optimizer.init_learning_rate: Starting learning rate.
process.train.scheduler.min_learning_rate_scale: The minimum scaling factor for the learning rate. Training automatically terminates when the learning rate multiplier reaches this threshold, at which point the effective learning rate becomes min_learning_rate_scale \(\times\) init_learning_rate.
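
Two of the rules above can be made concrete with a short sketch. The code is illustrative and is not DeepH-pack's parser: parse_device_spec and required_lmax are hypothetical names, returning None when `*<num>` is omitted is an assumption, and the second function only restates the rule that coupling two orbitals with angular momenta \(l_1\) and \(l_2\) produces components up to \(l_1 + l_2\).

```python
def parse_device_spec(spec):
    """Split '<type>*<num>:<id>' into (type, num, ids).

    Assumptions: num is None when '*<num>' is omitted; ids is empty when
    ':<id>' is omitted. Index ranges such as '1-4' are inclusive.
    """
    head, _, id_part = spec.partition(":")
    dev_type, _, num_part = head.partition("*")
    num = int(num_part) if num_part else None
    ids = []
    for chunk in filter(None, id_part.split(",")):
        start, _, end = chunk.partition("-")
        ids.extend(range(int(start), int(end or start) + 1))
    return dev_type, num, ids


def required_lmax(orbital_ls):
    """Smallest l_max the irreps must support: twice the largest orbital l."""
    return 2 * max(orbital_ls)


print(parse_device_spec("gpu*8:1-4,7"))  # ('gpu', 8, [1, 2, 3, 4, 7])
print(required_lmax([0, 1, 2, 3]))       # 6, the f-orbital case mentioned above
```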

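The complete configuration file for training is shown below. Note that device is set to "cpu" in this listing; to train on GPUs, set system.device using the syntax described above (e.g., gpu*8:1-4,7).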

# ---------------------------------- SYSTEM ----------------------------------
[system]
note = "DeepH-JAX"
device = "cpu"
float_type = "fp32"
random_seed = 137
log_level = "info"
jax_memory_preallocate = true
show_train_process_bar = true

# ----------------------------------- DATA ------------------------------------
[data]
inputs_dir = "./inputs"
outputs_dir = "./outputs"

[data.dft]
data_dir_depth = 0
validation_check = false

[data.graph]
dataset_name = "H2O_5K"
graph_type = "HS"
storage_type = "memory"
common_orbital_types = ""
parallel_num = -1
only_save_graph = false

[data.model_save]
best = true
latest = true
latest_interval = 1
latest_num = 10

# ----------------------------- MODEL -----------------------------------------
[model]
net_type = "eagle"
target_type = "H"
loss_type = "mae"

[model.advanced]
gaussian_basis_rmax = 10.0
net_irreps = "64x0e+48x1e+32x2e+16x3e+8x4e"
num_blocks = 4
num_heads = 2
enable_bs3b_layer = false
bs3b_orbital_types = ""
consider_parity = false
standardize_gauge = false

# ------------------------------ PROCESS --------------------------------------
[process.train]
max_epoch = 10000

multi_way_jit_num = 1
ahead_of_time_compile = true
do_remat = false

[process.train.dataloader]
batch_size = 100

train_size = 3000
validate_size = 1000
test_size = 999
dateset_split_json = ""
only_use_train_loss = false

[process.train.drop]
dropout_rate = 0.0
stochastic_depth = 0.0
proj_rate = 0.0

[process.train.optimizer]
type = "adamw"
init_learning_rate = 1E-3
make_clip = false
betas = [0.9, 0.999]
weight = 0.001
eps = 1E-8

[process.train.scheduler]
min_learning_rate_scale = 1E-5
type = "reduce_on_plateau"
factor = 0.5
patience = 500
rtol = 0.05
cooldown = 120
accum_size = -1

[process.train.continued]
enable = false
new_training_data = false
new_optimizer = false
previous_output_dir = ""
load_model_type = "latest"
load_model_epoch = -1

Note

In the process.train.dataloader configuration, the sum of train_size, validate_size, and test_size must not exceed the total number of snapshots in the dataset. Violating this constraint will trigger a ValueError() exception.
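
The constraint can be checked ahead of time with a few lines. This is a sketch of the rule stated above, not the check DeepH-pack performs internally; the function name check_split_sizes is hypothetical, and the snapshot count of 4999 is the number of structures reported in the H2O training log in the next section.

```python
def check_split_sizes(train_size, validate_size, test_size, num_snapshots):
    """Return the number of unused snapshots, or raise ValueError if the split is too large."""
    total = train_size + validate_size + test_size
    if total > num_snapshots:
        raise ValueError(
            f"split needs {total} snapshots but the dataset has only {num_snapshots}"
        )
    return num_snapshots - total


# Values from the configuration above.
print(check_split_sizes(3000, 1000, 999, num_snapshots=4999))  # 0
```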

Note

Currently, DeepH does not support cross-node CPU inference. This functionality will be implemented in future releases.

4.5. Monitoring the training process

After training commences, DeepH automatically constructs a structured output directory with the following base hierarchy:

outputs/<TIME_STAMP>
  |- dataset_split.json
  |- deepx.log
  |- model
      |- train.toml  
      |- variables.json
      |- params
        |- best.pytree
            |- epoch_124/
        |- latest.pytree
            |- epoch_120/
            |- epoch_110/
            |- epoch_100/
            |- ...
      |- states
        |- best.pytree
        |- latest.pytree

The deepx.log file enables real-time monitoring of training progress throughout the execution.

tail -f outputs/2025-10-14_10-34-43/deepx.log

[ 10.14-10:34:45 ]                                                    
[ 10.14-10:34:45 ]            Welcome to DeepH-pack (deepx)!     
[ 10.14-10:34:45 ]                  Version 1.0.3+light             
[ 10.14-10:34:45 ]                                                    
[ 10.14-10:34:45 ] ...................................................
[ 10.14-10:34:45 ] ........_____....................._...._.[PACK]....
[ 10.14-10:34:45 ] .......|  __ \...................| |..| |..........
[ 10.14-10:34:45 ] .......| |  | | ___  ___ ._ _ ...| |..| |..........
[ 10.14-10:34:45 ] .......| |  | |/ _ \/ _ \| '_ \ .|X'><'X|..........
[ 10.14-10:34:45 ] .......| |__| |. __/. __/| |_) |.| |..| |..........
[ 10.14-10:34:45 ] .......|_____/ \___|\___|| .__/ .|_|..|_|..........
[ 10.14-10:34:45 ] .........................| |.......................
[ 10.14-10:34:45 ] .........................|_|.......................
[ 10.14-10:34:45 ] ...................................................
[ 10.14-10:34:45 ]                                                    
[ 10.14-10:34:45 ]             Copyright CMT@Phys.Tsinghua            
[ 10.14-10:34:45 ]                  Powered by JAX                    
[ 10.14-10:34:45 ]                                                    
[ 10.14-10:34:45 ]                                                    
[ 10.14-10:34:45 ] [system] Under the machine `liyang@bee4`, with `x86_64 (64 cores)` CPU, and `251GB` RAM.
[ 10.14-10:34:46 ] [system] Use the GPU device(s) `[1]` of totally `4` device(s). Succeeded test on the head device `cuda:1`!
[ 10.14-10:34:46 ] [system] Totally use `16` CPU cores.
[ 10.14-10:34:46 ] [system] The calculation will be sharding across `Mesh('data': 1, axis_types=(Auto,))`.
[ 10.14-10:34:46 ] [system] Set random stream with seed `137`, type `key<fry>`.
[ 10.14-10:34:46 ] [system] Using the float type `fp32`. The testing results on JAX and PyTorch are `jnp.float32` and `torch.float32`
[ 10.14-10:34:50 ] [graph] Building the graph with type: `train-HS`.
[ 10.14-10:34:50 ] [graph] Reading graph from the graph pt file: `/data/Workshop/FHIaims_workshop_202511/DeepH-FHI-aims-tutorial/files/4.DeepH_training/H2O_5K/inputs/graph/H2O_5K.train-HS.memory.pt`...
[ 10.14-10:34:50 ] [graph] Finish loading the graph. Totally processed `4999` structures, with `2` kind of elements.
[ 10.14-10:34:50 ] [graph] Using the common orbital types: `[0, 0, 0, 1, 1, 2, 3]`
[ 10.14-10:34:50 ] [graph] Split the graph set with batch size: `100`
[ 10.14-10:34:51 ] [dataloader] Train size: `3000`. Val size: `1000`. Test size: `999`.
[ 10.14-10:34:51 ] [dataloader] Data sharding way: `1`. Batch size: `100`. Number of nodes each batch: `[302]`, `[302]`, `[302]`. Number of edges each batch: `[1202]`, `[1202]`, `[1202]`.
[ 10.14-10:34:51 ] [dataloader] The training dataset encompasses `27000` edges, aggregating a total of `3267000` data entries.
[ 10.14-10:34:51 ] [model] Building the model `eagle-H` with loss `mae`.
[ 10.14-10:34:52 ] [model] Initializing the net parameters with dummy data...
[ 10.14-10:35:38 ] [model] The parameters size is `7684960`.
[ 10.14-10:35:47 ] [optimizer] Using the optimizer `AdamW` with: betas `[0.9, 0.999]`, eps `1e-08`, weight decay strength `0.001`, and initial learning rate `0.001`.
[ 10.14-10:35:47 ] [optimizer] The global CLIP norm algo factor is NOT USED.
[ 10.14-10:35:47 ] [optimizer] Using the scheduler `ReduceOnPlateau` with: factor `0.5`, patience `500`, rtol `0.05`, cooldown `120`, and accumulation size `30`.
[ 10.14-10:35:50 ] [model] We will save the model into `/data/Workshop/FHIaims_workshop_202511/DeepH-FHI-aims-tutorial/files/4.DeepH_training/H2O_5K/outputs/2025-10-14_10-34-43/model`. The best model will be saved. The latest model (keep `10` each `1` epoch) will be saved.
[ 10.14-10:35:50 ] [train] JAX networks: Parallel threads AOT-compiling `1` frameworks for training and `1` for validation.
[ 10.14-10:36:50 ] [train] Compile networks done!
[ 10.14-10:36:50 ] [train] Starting the training process ... 
[ 10.14-10:37:20 ] [train] Epoch 1 | Time 23.39 s | Train-Loss 2.618185e+00 | Val-Loss 1.979444e+00 | Scale 1.0
[ 10.14-10:37:33 ] [train] Epoch 2 | Time 4.16 s | Train-Loss 1.599524e+00 | Val-Loss 1.268336e+00 | Scale 1.0
[ 10.14-10:37:44 ] [train] Epoch 3 | Time 4.63 s | Train-Loss 1.048272e+00 | Val-Loss 8.516293e-01 | Scale 1.0
[ 10.14-10:37:55 ] [train] Epoch 4 | Time 4.41 s | Train-Loss 7.483740e-01 | Val-Loss 6.374618e-01 | Scale 1.0
[ 10.14-10:38:07 ] [train] Epoch 5 | Time 3.80 s | Train-Loss 4.998804e-01 | Val-Loss 4.011233e-01 | Scale 1.0
[ 10.14-10:38:19 ] [train] Epoch 6 | Time 3.79 s | Train-Loss 3.648625e-01 | Val-Loss 3.389998e-01 | Scale 1.0

Meanwhile, the model directory stores all critical data for restarting computations, fine-tuning, and model inference – including complete model parameters. We will provide detailed guidance on utilizing this data in the next section.
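
For plotting or quick inspection, the per-epoch lines of deepx.log shown above can be parsed with a regular expression. This is a minimal sketch that assumes the line format stays exactly as in this log excerpt; the names EPOCH_LINE and parse_epochs are hypothetical and not part of DeepH-pack.

```python
import re

# Matches lines such as:
# [ 10.14-10:37:20 ] [train] Epoch 1 | Time 23.39 s | Train-Loss ... | Val-Loss ... | Scale 1.0
EPOCH_LINE = re.compile(
    r"\[train\] Epoch (?P<epoch>\d+) \| Time (?P<time>[\d.]+) s \| "
    r"Train-Loss (?P<train>\S+) \| Val-Loss (?P<val>\S+) \| Scale (?P<scale>\S+)"
)


def parse_epochs(lines):
    """Extract one record per epoch line; all other log lines are skipped."""
    records = []
    for line in lines:
        match = EPOCH_LINE.search(line)
        if match:
            records.append({
                "epoch": int(match["epoch"]),
                "time_s": float(match["time"]),
                "train_loss": float(match["train"]),
                "val_loss": float(match["val"]),
                "scale": float(match["scale"]),
            })
    return records


sample = [
    "[ 10.14-10:36:50 ] [train] Starting the training process ... ",
    "[ 10.14-10:37:20 ] [train] Epoch 1 | Time 23.39 s | Train-Loss 2.618185e+00 | Val-Loss 1.979444e+00 | Scale 1.0",
    "[ 10.14-10:37:33 ] [train] Epoch 2 | Time 4.16 s | Train-Loss 1.599524e+00 | Val-Loss 1.268336e+00 | Scale 1.0",
]
for record in parse_epochs(sample):
    print(record["epoch"], record["val_loss"])
# 1 1.979444
# 2 1.268336
```

To process a real run, pass the open log file instead of the sample list, for example parse_epochs(open("outputs/2025-10-14_10-34-43/deepx.log")).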