Commit d4226fdf authored by Maxwell Cai

Update README.md

parent 14d9360f
TensorFlow
===
TensorFlow (https://www.tensorflow.org) is a popular open-source library for symbolic math and linear algebra, with particular optimization for neural-network-based machine learning workflows. Maintained by Google, it is widely used for research and production in both academia and industry.
TensorFlow supports a wide variety of hardware platforms (CPUs, GPUs, TPUs) and can be scaled up to utilize multiple computing devices on a single compute node or across multiple nodes. The main objective of this benchmark is to profile the scaling behavior of TensorFlow on different hardware, and thereby provide a reference baseline of its performance for different sizes of applications.
DeepGalaxy
===
- Website: https://github.com/maxwelltsai/DeepGalaxy
- Code download: https://github.com/maxwelltsai/DeepGalaxy
- [Prerequisites installation](#prerequisites-installation)
- [Test Case A (small)](Testcase_A/README.md)
- [Test Case B (medium)](Testcase_B/README.md)
- [Test Case C (large)](Testcase_C/README.md)
```
git clone https://github.com/maxwelltsai/DeepGalaxy.git
```
This clones the latest version of the `DeepGalaxy` benchmark suite into a local directory called `DeepGalaxy`; enter it with `cd DeepGalaxy`. Note that the latest version is not necessarily the most stable version, and there is no guarantee of backward compatibility with older TensorFlow versions.
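Since the benchmark code evolves, it can help to record exactly which revision was used for a given set of measurements, for example with a standard `git` command (a minimal sketch):
```
cd DeepGalaxy
git rev-parse --short HEAD   # note this revision down alongside the benchmark results
```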
**Step 2**: Download the training dataset.
In the `DeepGalaxy` directory, download the training dataset. Depending on the benchmark size, there are three datasets available (an example download command is given after the list):
- (1024, 1024) pixels: https://edu.nl/gcy96 (6.1GB)
- (2048, 2048) pixels: https://edu.nl/bruf6 (14GB)
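For example, the (1024, 1024) dataset could be fetched on the command line as follows. This is only a sketch: the output file name `output_bw_1024.hdf5` is an assumption and should follow whatever naming the training script expects.
```
# Hypothetical download command; adjust the URL and file name to the chosen dataset size
wget https://edu.nl/gcy96 -O output_bw_1024.hdf5
```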
**Step 3**: Run the code on different numbers of workers. For example, the following command executes the code on `np = 4` workers:
```
mpirun -np 4 dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 --batch-size 4 --arch EfficientNetB4
```
where `output_bw_512.hdf5` is the training dataset downloaded in the previous step. Please change the file name if necessary. One could also change the other parameters, such as `--epochs`, `--batch-size`, and `--arch`, according to the size of the benchmark. For example, the `EfficientNetB0` deep neural network is meant for small HPC systems, `EfficientNetB4` for medium-size ones, and `EfficientNetB7` for large systems. Also, should the system memory permit, increasing the `--batch-size` could improve the throughput. If `--batch-size` is too large, an out-of-memory error could occur.
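As an illustration, two possible invocations are sketched below; the worker counts and batch sizes are only assumptions to be adapted to the system under test, and a larger dataset file can be substituted for `output_bw_512.hdf5` where available.
```
# Small HPC system: smaller network (EfficientNetB0), modest worker count and batch size
mpirun -np 2 dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 --batch-size 4 --arch EfficientNetB0

# Large HPC system: bigger network (EfficientNetB7), more workers, larger batch size if memory permits
mpirun -np 16 dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 --batch-size 8 --arch EfficientNetB7
```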
It is wise to save the output of the `mpirun` command to a text file, for example, `DeepGalaxy.np_4.out`.
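One possible way to do this is to pipe the output through `tee` (plain shell redirection works just as well):
```
mpirun -np 4 dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 --batch-size 4 --arch EfficientNetB4 2>&1 | tee DeepGalaxy.np_4.out
```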
The benchmark data of the training run are written to the file `train_log.txt`.
**Step 4**: Repeat Step 3 with different `np`.
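This step can be automated with a simple shell loop (a sketch; the `np` values below are examples and should match the hardware being benchmarked):
```
for np in 1 2 4 8 16; do
    mpirun -np ${np} dg_train.py -f output_bw_512.hdf5 --epochs 20 --noise 0.1 --batch-size 4 --arch EfficientNetB4 2>&1 | tee DeepGalaxy.np_${np}.out
done
```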
By the time all the desired `np` settings are completed, we should have all the throughput data written to the `train_log.txt` file. The content of the file looks like this:
```
Time is now 2021-06-21 14:26:08.581012
Parallel training enabled.
batch_size = 4, global_batch_size = 16, num_workers = 4
hvd_rank = 0, hvd_local_rank = 0
Loading part of the dataset since distributed training is enabled ...
Shape of X: (319, 512, 512, 1)
Shape of Y: (319,)
Number of classes: 213
[Performance] Epoch 0 takes 107.60 seconds. Throughput: 2.37 images/sec (per worker), 9.48 images/sec (total)
[Performance] Epoch 1 takes 17.15 seconds. Throughput: 14.87 images/sec (per worker), 59.47 images/sec (total)
[Performance] Epoch 2 takes 10.95 seconds. Throughput: 23.29 images/sec (per worker), 93.15 images/sec (total)
[Performance] Epoch 3 takes 10.99 seconds. Throughput: 23.21 images/sec (per worker), 92.82 images/sec (total)
[Performance] Epoch 4 takes 11.01 seconds. Throughput: 23.17 images/sec (per worker), 92.67 images/sec (total)
[Performance] Epoch 5 takes 11.00 seconds. Throughput: 23.18 images/sec (per worker), 92.72 images/sec (total)
[Performance] Epoch 6 takes 11.05 seconds. Throughput: 23.08 images/sec (per worker), 92.31 images/sec (total)
[Performance] Epoch 7 takes 11.16 seconds. Throughput: 22.86 images/sec (per worker), 91.44 images/sec (total)
[Performance] Epoch 8 takes 11.11 seconds. Throughput: 22.96 images/sec (per worker), 91.85 images/sec (total)
[Performance] Epoch 9 takes 11.10 seconds. Throughput: 22.97 images/sec (per worker), 91.87 images/sec (total)
On hostname r38n1.lisa.surfsara.nl - After training using 4.195556640625 GB of memory
```
This output contains several pieces of information that are useful for deriving the scaling efficiency of the HPC system:
- `num_workers`: the number of (MPI) workers. This is essentially equal to the `-np` parameter in the `mpirun` command. Do not confuse this with the number of (CPU) cores used, because one worker may make use of multiple cores. If GPUs are used, one worker is typically associated with one GPU card.
- images/sec (per worker): this is the throughput per worker
- images/sec (total): this is the total throughput of the system
Due to the initialization effect, the throughputs of the first two epochs are lower, so please read the throughput data from the third epoch onwards.
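For convenience, the per-epoch throughput can be extracted from `train_log.txt` with one-liners like the following. This is a sketch that assumes the `[Performance]` line format shown above and skips the first two epochs (indices 0 and 1); the field positions would need adjusting if the log format changes.
```
# Columns printed: epoch, images/sec (per worker), images/sec (total)
awk '/\[Performance\]/ && $3 >= 2 { print $3, $8, $12 }' train_log.txt

# Mean total throughput over the same epochs
awk '/\[Performance\]/ && $3 >= 2 { sum += $12; n++ } END { if (n) printf "mean total throughput: %.2f images/sec\n", sum / n }' train_log.txt
```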
With the total-throughput data, we can calculate the scaling efficiency. In an ideal system, the total throughput scales linearly as a function of `num_workers`, and hence the scaling efficiency is 1. In practice, the scaling efficiency drops as more workers are added, due to the communication overhead. The better the connectivity of the HPC system, the better the scaling efficiency.
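One common way to quantify this (the exact definition is an assumption here, not something fixed by the benchmark itself) is `efficiency(np) = T_total(np) / (np * T_total(1))`, where `T_total(np)` is the total throughput measured with `np` workers. A quick sketch with placeholder numbers:
```
# Placeholder throughputs (images/sec, total); replace with measured values from train_log.txt
awk 'BEGIN {
    t1 = 24.0;   # total throughput measured with np = 1 (placeholder)
    t4 = 92.7;   # total throughput measured with np = 4 (cf. the sample log above)
    np = 4;
    printf "speed-up: %.2f, scaling efficiency: %.2f\n", t4 / t1, t4 / (np * t1);
}'
```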