diff --git a/tensorflow/README.md b/tensorflow/README.md
index f2dbd5923cef884a97c7bbbb2f7f4641422233fb..8e1582783107e6abfcc2fc8bcd4d1305501181b0 100644
--- a/tensorflow/README.md
+++ b/tensorflow/README.md
@@ -48,9 +48,20 @@ In the `DeepGalaxy` directory, download the training dataset. Depending on the b
 **Step 3**: Run the code on different numbers of workers. For example, the following command executes the code on `np = 4` workers:
 ```
-mpirun -np 4 python dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 --batch-size 4 --arch EfficientNetB4
+export OMP_NUM_THREADS=<number_of_cores_per_sockets>
+HOROVOD_FUSION_THRESHOLD=134217728 \
+mpirun --np <number_of_mpi_workers> \
+       --map-by ppr:1:socket:pe=$OMP_NUM_THREADS \
+       --report-bindings \
+       --oversubscribe \
+       -x LD_LIBRARY_PATH \
+       -x HOROVOD_FUSION_THRESHOLD \
+       -x OMP_NUM_THREADS=$OMP_NUM_THREADS \
+       python dg_train.py -f output_bw_512.hdf5 --num-camera 3 --arch EfficientNetB4 \
+       --epochs 5 --batch-size <batch_size>
+
 ```
-`output_bw_512.hdf5` is the training dataset downloaded in the previous step. Please change the file name if necessary. One could also change the other parameters, such as `--epochs`, `--batch-size`, and `--arch` according to the size of the benchmark. For example, the `EfficientNetB0` deep neural network is for small HPC systems, `EfficientNetB4` is for medium-size ones, and `EfficientNetB7` is for large systems. Also, shoudl the system memory permits, increasing the `--batch-size` could improve the throughput. If the `--batch-size` parameter is too large, an out-of-memory error could occur.
+The placeholders `number_of_cores_per_sockets` and `number_of_mpi_workers` should be replaced with the number of CPU cores in a CPU socket and the number of copies of the neural network that are trained in parallel, respectively. For example, if a training run uses 4 nodes, each with two CPU sockets, and each CPU has 64 cores, then `number_of_cores_per_sockets = 64` and `number_of_mpi_workers = 8` (4 nodes, 2 MPI workers per node). The `batch_size` parameter is specific to machine learning rather than HPC, but users should choose a proper batch size to make sure that the hardware resources are fully utilised but not overloaded. `output_bw_512.hdf5` is the training dataset downloaded in the previous step. Please change the file name if necessary. One could also change the other parameters, such as `--epochs`, `--batch-size`, and `--arch`, according to the size of the benchmark. For example, the `EfficientNetB0` deep neural network is for small HPC systems, `EfficientNetB4` is for medium-size ones, and `EfficientNetB7` is for large systems. Also, if the system memory permits, increasing the `--batch-size` could improve the throughput. If the `--batch-size` parameter is too large, an out-of-memory error could occur. The benchmark data of the training are written to the file `train_log.txt`.
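+
+As a concrete illustration, the sketch below fills in the placeholders for the example system above (4 nodes, 2 sockets per node, 64 cores per socket, hence 8 MPI workers); the batch size of 4 is only a starting value, borrowed from the command this change replaces, and should be tuned to the available memory:
+
+```
+# 64 cores per CPU socket: one OpenMP thread per core
+export OMP_NUM_THREADS=64
+
+# 4 nodes x 2 sockets = 8 MPI workers, pinned one per socket via --map-by
+HOROVOD_FUSION_THRESHOLD=134217728 \
+mpirun --np 8 \
+       --map-by ppr:1:socket:pe=$OMP_NUM_THREADS \
+       --report-bindings \
+       --oversubscribe \
+       -x LD_LIBRARY_PATH \
+       -x HOROVOD_FUSION_THRESHOLD \
+       -x OMP_NUM_THREADS=$OMP_NUM_THREADS \
+       python dg_train.py -f output_bw_512.hdf5 --num-camera 3 --arch EfficientNetB4 \
+       --epochs 5 --batch-size 4
+```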