`output_bw_512.hdf5` is the training dataset downloaded in the previous step. Please change the file name if necessary. The other parameters, such as `--epochs`, `--batch-size`, and `--arch`, can also be changed according to the size of the benchmark. For example, the `EfficientNetB0` deep neural network is intended for small HPC systems, `EfficientNetB4` for medium-sized ones, and `EfficientNetB7` for large systems. Also, should the system memory permit, increasing the `--batch-size` can improve the throughput; if `--batch-size` is set too large, however, an out-of-memory error may occur.
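As a rough illustration, a single-process run on a small system might be invoked as sketched below. The script name `train.py` and the way the dataset file is passed are assumptions; only the flags `--epochs`, `--batch-size`, `--arch`, the architecture names, and the dataset file come from the description above.

```bash
# Hypothetical invocation; "train.py" stands in for the benchmark's actual training script.
# EfficientNetB0 targets a small HPC system; raise --batch-size only while memory permits,
# otherwise an out-of-memory error may occur.
python train.py \
    --arch EfficientNetB0 \
    --epochs 10 \
    --batch-size 32 \
    output_bw_512.hdf5
```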
The placeholders `number_of_cores_per_sockets` and `number_of_mpi_workers` should be replaced by the number of CPU cores in a CPU socket and the number of copies of the neural network trained in parallel, respectively. For example, if a simulation runs on 4 nodes, each with two CPU sockets and 64 cores per CPU, then `number_of_cores_per_sockets = 64` and `number_of_mpi_workers = 8` (4 nodes, 2 MPI workers per node). The `batch_size` parameter is specific to machine learning rather than HPC, but users should choose a batch size that keeps the hardware resources fully utilised without overloading them.
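Filling in the 4-node example above (two sockets per node, 64 cores per socket), the launch could look like the sketch below. The `--map-by` placement option is Open MPI syntax and, like the script name `train.py`, is an assumption; only the worker count and the per-socket core count follow from the text.

```bash
# Hypothetical MPI launch: 4 nodes x 2 sockets = 8 workers, 64 cores per socket.
# "--map-by ppr:1:socket:pe=64" places one worker per socket with 64 cores each (Open MPI syntax).
mpirun -np 8 --map-by ppr:1:socket:pe=64 \
    python train.py --arch EfficientNetB4 --epochs 10 --batch-size 32 output_bw_512.hdf5
```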
The benchmark data from the training run are written to the file `train_log.txt`.