TensorFlow
===
TensorFlow (https://www.tensorflow.org) is a popular open-source library for symbolic math and linear algebra, with particular optimizations for neural-network-based machine learning workflows. Maintained by Google, it is widely used for research and production in both academia and industry.

TensorFlow supports a wide variety of hardware platforms (CPUs, GPUs, TPUs) and can be scaled up to utilize multiple computing devices on a single compute node or across multiple nodes. The main objective of this benchmark is to profile the scaling behavior of TensorFlow on different hardware, and thereby provide a reference performance baseline for applications of different sizes.

DeepGalaxy
===
There are many open-source datasets available for benchmarking TensorFlow, such as `mnist`, `fashion_mnist`, `cifar`, `imagenet`, and so on. This benchmark suite, however, focuses on a scientific research use case. `DeepGalaxy` is a code built on TensorFlow that uses a deep neural network to classify galaxy mergers in the Universe, as observed by the Hubble Space Telescope and the Sloan Digital Sky Survey.

- Website: https://github.com/maxwelltsai/DeepGalaxy
- Code download: https://github.com/maxwelltsai/DeepGalaxy
- [Prerequisites installation](#prerequisites-installation)
- [Test Case A (small)](Testcase_A/README.md)
- [Test Case B (medium)](Testcase_B/README.md)
- [Test Case C (large)](Testcase_C/README.md)




## Prerequisites Installation
The prerequisites consist of the Python packages listed below. It is recommended to create a Python virtual environment (either with `pyenv` or `conda`). In general, the following packages can be installed using the `pip` package management tool:
```
pip install tensorflow
pip install horovod
pip install scikit-learn
pip install scikit-image
pip install pandas
```
Note: there is no guarantee of optimal performance when TensorFlow is installed using `pip`. It is better to compile TensorFlow from source, in which case the compiler will likely be able to take advantage of the advanced instruction sets supported by the processor (e.g., AVX512). Official build instructions can be found at https://www.tensorflow.org/install/source. Sometimes, an HPC center may provide a TensorFlow module optimized for its hardware, in which case the `pip install tensorflow` line can be replaced with a line like `module load <name_of_the_tensorflow_module>`.

For multi-node training, `MPI` and/or `NCCL` should be installed as well.
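
Once the packages are installed, a quick sanity check helps catch environment problems before running the benchmark. The commands below are a minimal sketch (the reported versions and the set of available controllers will of course differ per system):

```
# Confirm that TensorFlow imports and report its version
python -c "import tensorflow as tf; print(tf.__version__)"
# Confirm that Horovod imports and report its version
python -c "import horovod; print(horovod.__version__)"
# Report which frameworks and controllers (e.g., MPI, Gloo, NCCL) Horovod was built with
horovodrun --check-build
```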


## How to benchmark the throughput of an HPC system
**Step 1**: Download the benchmark code:
```
git clone https://github.com/maxwelltsai/DeepGalaxy.git
```
This downloads the latest version of the `DeepGalaxy` benchmark suite. Note that the latest version is not necessarily the most stable one, and there is no guarantee of backward compatibility with older TensorFlow versions.
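
For reproducibility, it can be helpful to record the exact revision used for a benchmark run, for example:

```
cd DeepGalaxy
git log -1 --format='%H %cd'
```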
Maxwell Cai's avatar
Maxwell Cai committed
41
42
43
44
45
46
47
48

**Step 2**: Download the training dataset.
In the `DeepGalaxy` directory, download one of the training datasets below, depending on the benchmark size (a command-line example follows the list):

- (512, 512) pixels: https://edu.nl/r3wh3 (2GB)
- (1024, 1024) pixels: https://edu.nl/gcy96 (6.1GB)
- (2048, 2048) pixels: https://edu.nl/bruf6 (14GB)
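
For example, the small dataset can be fetched on the command line as sketched below. This assumes the short link resolves directly to the HDF5 file (otherwise, download it via a browser); the target file name matches the one used in Step 3.

```
cd DeepGalaxy
wget https://edu.nl/r3wh3 -O output_bw_512.hdf5
```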

**Step 3**: Run the code with different numbers of workers. The following command template executes the code on `<number_of_mpi_workers>` MPI workers:
```
export OMP_NUM_THREADS=<number_of_cores_per_socket>

HOROVOD_FUSION_THRESHOLD=134217728 \
mpirun --np <number_of_mpi_workers> \
    --map-by ppr:1:socket:pe=$OMP_NUM_THREADS \
    --report-bindings \
    --oversubscribe \
    -x LD_LIBRARY_PATH \
    -x HOROVOD_FUSION_THRESHOLD \
    -x OMP_NUM_THREADS=$OMP_NUM_THREADS \
    python dg_train.py -f output_bw_512.hdf5 --num-camera 3 --arch EfficientNetB4 \
        --epochs 5 --batch-size <batch_size>

```
The placeholders `number_of_cores_per_socket` and `number_of_mpi_workers` should be replaced with the number of CPU cores per socket and the number of copies of the neural network trained in parallel, respectively. For example, if a run uses 4 nodes, each with two CPU sockets of 64 cores, then `number_of_cores_per_socket = 64` and `number_of_mpi_workers = 8` (4 nodes, 2 MPI workers per node). The `batch_size` parameter is specific to machine learning rather than HPC, but users should choose a batch size that keeps the hardware fully utilized without overloading it. `output_bw_512.hdf5` is the training dataset downloaded in the previous step; change the file name if necessary. The other parameters, such as `--epochs`, `--batch-size`, and `--arch`, can also be changed according to the size of the benchmark. For example, the `EfficientNetB0` deep neural network is intended for small HPC systems, `EfficientNetB4` for medium-sized ones, and `EfficientNetB7` for large systems. Also, should the system memory permit, increasing `--batch-size` can improve the throughput; if `--batch-size` is too large, an out-of-memory error may occur. A filled-in command for the 4-node example above is sketched below.
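
As an illustration, the command for the 4-node scenario above might look as follows. This is a sketch only: it assumes an Open MPI `mpirun` launched inside a batch allocation that already provides the host list, 64-core sockets, and a (hypothetical) batch size of 8 per worker; adjust these values to the actual system.

```
export OMP_NUM_THREADS=64

HOROVOD_FUSION_THRESHOLD=134217728 \
mpirun --np 8 \
    --map-by ppr:1:socket:pe=$OMP_NUM_THREADS \
    --report-bindings \
    --oversubscribe \
    -x LD_LIBRARY_PATH \
    -x HOROVOD_FUSION_THRESHOLD \
    -x OMP_NUM_THREADS=$OMP_NUM_THREADS \
    python dg_train.py -f output_bw_512.hdf5 --num-camera 3 --arch EfficientNetB4 \
        --epochs 5 --batch-size 8
```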

The benchmark data of the training are written to the file `train_log.txt`.

**Step 4**: Repeat Step 3 with different `np`. 
Once all the desired `np` settings have been run, we should have all the throughput data written to the `train_log.txt` file. The content of the file looks like this:
```
Time is now 2021-06-21 14:26:08.581012

Parallel training enabled.
batch_size = 4, global_batch_size = 16, num_workers = 4

hvd_rank = 0, hvd_local_rank = 0
Loading part of the dataset since distributed training is enabled ...
Shape of X: (319, 512, 512, 1)
Shape of Y: (319,)
Number of classes: 213
[Performance] Epoch 0 takes 107.60 seconds. Throughput: 2.37 images/sec (per worker), 9.48 images/sec (total)
[Performance] Epoch 1 takes 17.15 seconds. Throughput: 14.87 images/sec (per worker), 59.47 images/sec (total)
[Performance] Epoch 2 takes 10.95 seconds. Throughput: 23.29 images/sec (per worker), 93.15 images/sec (total)
[Performance] Epoch 3 takes 10.99 seconds. Throughput: 23.21 images/sec (per worker), 92.82 images/sec (total)
[Performance] Epoch 4 takes 11.01 seconds. Throughput: 23.17 images/sec (per worker), 92.67 images/sec (total)
[Performance] Epoch 5 takes 11.00 seconds. Throughput: 23.18 images/sec (per worker), 92.72 images/sec (total)
[Performance] Epoch 6 takes 11.05 seconds. Throughput: 23.08 images/sec (per worker), 92.31 images/sec (total)
[Performance] Epoch 7 takes 11.16 seconds. Throughput: 22.86 images/sec (per worker), 91.44 images/sec (total)
[Performance] Epoch 8 takes 11.11 seconds. Throughput: 22.96 images/sec (per worker), 91.85 images/sec (total)
[Performance] Epoch 9 takes 11.10 seconds. Throughput: 22.97 images/sec (per worker), 91.87 images/sec (total)
On hostname r38n1.lisa.surfsara.nl - After training using 4.195556640625 GB of memory
``` 

This output contains several pieces of information that are useful for deriving the scaling efficiency of the HPC system:

- `num_workers`: the number of (MPI) workers. This is essentially equal to the `--np` parameter of the `mpirun` command. Do not confuse this with the number of (CPU) cores used, because one worker may make use of multiple cores. If GPUs are used, one worker is typically associated with one GPU card.
- `images/sec (per worker)`: the throughput per worker.
- `images/sec (total)`: the total throughput of the system.

Due to the initialization effect, the throughputs of the first two epochs are lower, so please read the throughput data from the third epoch onwards.
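
The steady-state throughput can be extracted from the log with standard shell tools. The sketch below assumes `train_log.txt` contains a single run in the exact format shown above (if several runs are appended to the file, select the relevant block first):

```
# Average the total throughput, skipping the first two (warm-up) epochs
grep '\[Performance\]' train_log.txt | tail -n +3 | \
    awk '{ sum += $(NF-2); n++ } END { printf "Average total throughput: %.2f images/sec\n", sum / n }'
```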

With the total-throughput data, we can calculate the scaling efficiency. In an ideal system, the total throughput scales linearly with `num_workers`, and hence the scaling efficiency is 1. In practice, the scaling efficiency drops as more workers are added, due to communication overhead. The better the interconnect of the HPC system, the better the scaling efficiency.
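
For example (illustrative numbers only): if a single worker sustains 25 images/sec and 8 workers sustain 160 images/sec in total, the scaling efficiency at 8 workers is 160 / (8 × 25) = 0.8, i.e., 80% of ideal linear scaling.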