From 7a7a697dbc6f335b6b3eca0ea91e1ae2956e467f Mon Sep 17 00:00:00 2001
From: Maxwell Cai
Date: Fri, 22 Jan 2021 17:02:35 +0200
Subject: [PATCH] Update tensorflow/README.md

---
 tensorflow/README.md | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/tensorflow/README.md b/tensorflow/README.md
index 66c6d72..6f8d7f6 100644
--- a/tensorflow/README.md
+++ b/tensorflow/README.md
@@ -28,3 +28,44 @@
pip install scikit-image
pip install pandas
```
Note: there is no guarantee of optimal performance when `tensorflow` is installed using `pip`. It is better to compile `tensorflow` from source, in which case the compiler can take advantage of the advanced instruction sets supported by the processor (e.g., AVX512). Official build instructions can be found at https://www.tensorflow.org/install/source. Sometimes an HPC center provides a `tensorflow` module optimized for its hardware, in which case the `pip install tensorflow` line can be replaced with a line like `module load `.


## How to benchmark the throughput of an HPC system
**Step 1**: Download the benchmark code:
```
git clone https://github.com/maxwelltsai/DeepGalaxy.git
```
This clones the full benchmark code into a local directory called `DeepGalaxy`. Enter this directory with `cd DeepGalaxy`.

**Step 2**: Download the training dataset.
In the `DeepGalaxy` directory, download one of the training datasets. Depending on the benchmark size, three datasets are available:

- (512, 512) pixels: https://edu.nl/r3wh3 (2 GB)
- (1024, 1024) pixels: https://edu.nl/gcy96 (6.1 GB)
- (2048, 2048) pixels: https://edu.nl/bruf6 (14 GB)

**Step 3**: Run the code with different numbers of workers.
For example, the following command executes the code on `np = 4` workers:
```
mpirun -np 4 dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 --batch-size 4 --arch EfficientNetB4
```
where `output_bw_512.hdf5` is the training dataset downloaded in the previous step; change the file name if necessary. The other parameters, such as `--epochs`, `--batch-size`, and `--arch`, can be adjusted to the size of the benchmark: the `EfficientNetB0` deep neural network suits small HPC systems, `EfficientNetB4` medium-sized ones, and `EfficientNetB7` large ones. If plenty of memory is available, increasing `--batch-size` can improve the throughput; if `--batch-size` is too large, however, an out-of-memory error will occur.

It is wise to save the output of the `mpirun` command to a text file, for example `DeepGalaxy.np_4.out`.

**Step 4**: Repeat Step 3 with different `np`.
Once all the desired `np` settings have been run, there should be a collection of output files in the local directory, for example `DeepGalaxy.np_4.out`, `DeepGalaxy.np_8.out`, and so on.
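Repeating the runs by hand is error-prone, so the sweep can be scripted. The following is a minimal sketch, assuming a bash-like shell and the dataset from Step 2; the list of worker counts is an example and should be adjusted to the system:

```shell
#!/bin/bash
# Sweep over worker counts and save each run's output to its own file.
# The np values below are illustrative; pick ones that suit your system.
for np in 4 8 16 32; do
  mpirun -np "$np" dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 \
      --batch-size 4 --arch EfficientNetB4 > "DeepGalaxy.np_${np}.out" 2>&1
done
```

Redirecting both stdout and stderr (`2>&1`) keeps the Keras progress lines, which are needed for the throughput numbers in the next step, in the per-run output files.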
We can then extract the throughput with the following command:
```
grep sample DeepGalaxy.np_4.out
```
A sample output looks like this:
```
7156/7156 [==============================] - 1435s 201ms/sample - loss: 5.9885 - sparse_categorical_accuracy: 0.0488 - val_loss: 5.8073 - val_sparse_categorical_accuracy: 0.1309
7156/7156 [==============================] - 1141s 160ms/sample - loss: 3.0371 - sparse_categorical_accuracy: 0.3376 - val_loss: 2.0614 - val_sparse_categorical_accuracy: 0.5666
7156/7156 [==============================] - 1237s 173ms/sample - loss: 0.5927 - sparse_categorical_accuracy: 0.8506 - val_loss: 0.0503 - val_sparse_categorical_accuracy: 0.9835
7156/7156 [==============================] - 1123s 157ms/sample - loss: 0.0245 - sparse_categorical_accuracy: 0.9963 - val_loss: 0.0033 - val_sparse_categorical_accuracy: 0.9994
7156/7156 [==============================] - 1236s 173ms/sample - loss: 0.0026 - sparse_categorical_accuracy: 0.9998 - val_loss: 9.3778e-07 - val_sparse_categorical_accuracy: 1.0000
```
The throughput can be read from the timing here, for example `173ms/sample`. This number is usually a bit larger in the first epoch, because `TensorFlow` performs some initialization then, so it is better to pick the number from the 3rd or even the 5th epoch, when it has stabilized.

Extract this number for each `np` and see how it changes as a function of `np`. In a system with perfect (i.e., linear) scaling, this number should remain constant; in reality it increases due to communication overhead. The growth of this number as a function of `np` therefore tells us something about the scaling efficiency of the underlying system.
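Collecting these figures by hand works, but a short shell loop can pull the per-sample timing from every output file at once. The sketch below assumes the `DeepGalaxy.np_*.out` naming from Step 3 and picks the 3rd match, i.e. the 3rd epoch, to skip the slower warm-up epochs:

```shell
#!/bin/bash
# For every output file, extract the ms/sample figure of the 3rd epoch.
# "sed -n 3p" prints only the 3rd matching line from grep's output.
for f in DeepGalaxy.np_*.out; do
  t=$(grep -o '[0-9]*ms/sample' "$f" | sed -n 3p)
  echo "$f: $t"
done
```

With the sample output above, the line for `DeepGalaxy.np_4.out` would report `173ms/sample`; comparing the reported values across the files gives the scaling behavior directly.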