In the `DeepGalaxy` directory, download the training dataset.
**Step 3**: Run the code on different numbers of workers. For example, the following command executes the code on `<number_of_mpi_workers>` MPI workers:
```
export OMP_NUM_THREADS=<number_of_cores_per_socket>
HOROVOD_FUSION_THRESHOLD=134217728 \
mpirun --np <number_of_mpi_workers> \
--map-by ppr:1:socket:pe=$OMP_NUM_THREADS \
--report-bindings \
--oversubscribe \
-x LD_LIBRARY_PATH \
-x HOROVOD_FUSION_THRESHOLD \
-x OMP_NUM_THREADS=$OMP_NUM_THREADS \
python dg_train.py -f output_bw_512.hdf5 --num-camera 3 --arch EfficientNetB4 \
--epochs 5 --batch-size <batch_size>
```
The placeholders `<number_of_cores_per_socket>` and `<number_of_mpi_workers>` should be replaced with the number of CPU cores per socket and the number of copies of the neural network trained in parallel (one copy per MPI worker), respectively. For example, if the training runs on 4 nodes, each with two CPU sockets of 64 cores each, then `number_of_cores_per_socket = 64` and `number_of_mpi_workers = 8` (4 nodes, 2 MPI workers per node).

`output_bw_512.hdf5` is the training dataset downloaded in the previous step; please change the file name if necessary. The other parameters, such as `--epochs`, `--batch-size`, and `--arch`, can be adjusted according to the size of the benchmark. For example, the `EfficientNetB0` deep neural network is intended for small HPC systems, `EfficientNetB4` for medium-sized systems, and `EfficientNetB7` for large systems. The `--batch-size` parameter is specific to machine learning rather than HPC, but it should be chosen so that the hardware resources are fully utilised without being overloaded: should the system memory permit, increasing `--batch-size` can improve the throughput, whereas a value that is too large will cause an out-of-memory error.
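For illustration, a fully filled-in launch command for the example configuration above (4 nodes, 2 sockets per node, 64 cores per socket, hence 8 MPI workers) might look as follows; the batch size of 8 is only an assumed placeholder value and should be tuned to the available memory:
```
# Hypothetical system: 4 nodes x 2 sockets x 64 cores per socket
export OMP_NUM_THREADS=64
HOROVOD_FUSION_THRESHOLD=134217728 \
mpirun --np 8 \
--map-by ppr:1:socket:pe=$OMP_NUM_THREADS \
--report-bindings \
--oversubscribe \
-x LD_LIBRARY_PATH \
-x HOROVOD_FUSION_THRESHOLD \
-x OMP_NUM_THREADS=$OMP_NUM_THREADS \
python dg_train.py -f output_bw_512.hdf5 --num-camera 3 --arch EfficientNetB4 \
--epochs 5 --batch-size 8
```
This mapping places one MPI worker per socket (`ppr:1:socket`) and binds `OMP_NUM_THREADS` cores to each worker (`pe=$OMP_NUM_THREADS`), so each copy of the network trains on a full socket.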
The benchmark data of the training run are written to the file `train_log.txt`.
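To get a quick look at the progress while the training is running, the log file can be followed with standard shell tools, e.g.:
```
# Follow the benchmark log as it is written
tail -f train_log.txt
```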