diff --git a/tensorflow/.gitkeep b/tensorflow/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/tensorflow/README.md b/tensorflow/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..66c6d72671ed84f5983447583ea36bf8aa8af803
--- /dev/null
+++ b/tensorflow/README.md
@@ -0,0 +1,30 @@
+TensorFlow
+===
+TensorFlow (https://www.tensorflow.org) is a popular open-source library for symbolic math and linear algebra, with particular optimization for neural-network-based machine learning workflows. Maintained by Google, it is widely used for research and production in both academia and industry.
+
+TensorFlow supports a wide variety of hardware platforms (CPUs, GPUs, TPUs), and can be scaled up to utilize multiple compute devices on a single node or across multiple nodes. The main objective of this benchmark is to profile the scaling behavior of TensorFlow on different hardware, and thereby provide a reference baseline of its performance for different sizes of applications.
+
+DeepGalaxy
+===
+There are many open-source datasets available for benchmarking TensorFlow, such as `mnist`, `fashion_mnist`, `cifar`, `imagenet`, and so on. This benchmark suite, however, focuses on a scientific research use case. `DeepGalaxy` is a code built with TensorFlow, which uses deep neural networks to classify galaxy mergers in the Universe, as observed by the Hubble Space Telescope and the Sloan Digital Sky Survey.
+
+- Website: https://github.com/maxwelltsai/DeepGalaxy
+- Code download: https://github.com/maxwelltsai/DeepGalaxy
+- [Prerequisites installation](#prerequisites-installation)
+- [Test Case A](Testcase_A/README.md)
+- [Test Case B](Testcase_B/README.md)
+- [Test Case C](Testcase_C/README.md)
+
+
+
+## Prerequisites Installation
+The prerequisites consist of the Python packages listed below. It is recommended to create a Python virtual environment (either with `pyenv` or `conda`). The packages can then be installed using the `pip` package management tool:
+```
+pip install tensorflow
+pip install horovod
+pip install scikit-learn
+pip install scikit-image
+pip install pandas
+pip install h5py
+```
+Note: there is no guarantee of optimal performance when `tensorflow` is installed using `pip`. It is better if `tensorflow` is compiled from source, in which case the compiler will likely be able to take advantage of the advanced instruction sets supported by the processor (e.g., AVX512). Official build instructions can be found at https://www.tensorflow.org/install/source. An HPC center may also provide a TensorFlow module optimized for its hardware, in which case the `pip install tensorflow` line can be replaced with a line like `module load `.
diff --git a/tensorflow/Testcase_A/README.md b/tensorflow/Testcase_A/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..15385a3b5b80e3fceb62b946b2e5a55158bbb2ea
--- /dev/null
+++ b/tensorflow/Testcase_A/README.md
@@ -0,0 +1,18 @@
+## Test Case A
+
+This test case is designed to benchmark TensorFlow with small-to-medium-sized datasets using a medium-sized deep neural network (DNN). The image resolution is set at (512, 512) px. The training can be carried out on one or more nodes. The DNN is relatively small (about 17 million parameters), so it fits into most GPUs with a `batch_size` of 8 or even 16.
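+
+As a quick sanity check of the quoted model size, the parameter count can be approximated with the stock Keras EfficientNet. This is only a hedged sketch: it assumes a TensorFlow version that ships `tf.keras.applications.EfficientNetB4` (2.3+) and a 3-channel input, whereas `DeepGalaxy` builds its network in its own code, so the exact count may differ slightly:
+```
+import tensorflow as tf
+
+# Headless EfficientNetB4 at the (512, 512) resolution used in this test case.
+model = tf.keras.applications.EfficientNetB4(
+    include_top=False, weights=None, input_shape=(512, 512, 3))
+print(f"{model.count_params() / 1e6:.1f} M parameters")  # roughly 17-18 M
+```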
+
+The dataset can be downloaded at: https://surfdrive.surf.nl/files/index.php/s/Mzm28FQ1udG3FG7 (2 GB)
+
+If the training is done on a single node, running the following command (after the necessary compute resources have been allocated) is enough:
+```
+python dg_train.py -f output_bw_512.hdf5 --arch EfficientNetB4 --epochs 10 --noise 0.3 --batch-size 4
+```
+Please replace `output_bw_512.hdf5` with the actual dataset file name, and modify the other parameters as necessary. A `--batch-size` of 8 may be used if the GPU has 32 GB of memory.
+
+If multiple nodes are used, it is necessary to run the code with `mpirun` or `mpiexec`. For example, to train the DNN on 2 nodes, each with 4 GPUs:
+```
+mpirun -np 8 python dg_train.py -f output_bw_512.hdf5 --arch EfficientNetB4 --epochs 10 --noise 0.3 --batch-size 4
+```
+
+If NVIDIA GPUs are used, `DeepGalaxy` can automatically bind an MPI process to a GPU, so no explicit setting of `CUDA_VISIBLE_DEVICES` is needed.
diff --git a/tensorflow/Testcase_B/README.md b/tensorflow/Testcase_B/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..16597640f59fff17b03d241576f578024273fc3f
--- /dev/null
+++ b/tensorflow/Testcase_B/README.md
@@ -0,0 +1,18 @@
+## Test Case B
+
+This test case is designed to benchmark TensorFlow with small-to-medium-sized datasets using a large deep neural network (DNN). The image resolution is set at (512, 512) px. The training can be carried out on one or more nodes. The DNN is moderately large (about 64 million parameters); in comparison, the popular `ResNet50` CNN has about 23 million parameters. With a network of this size, a batch size of 2 or 4 is recommended for most GPUs.
+
+The dataset can be downloaded at: https://surfdrive.surf.nl/files/index.php/s/Mzm28FQ1udG3FG7 (2 GB)
+
+If the training is done on a single node, running the following command (after the necessary compute resources have been allocated) is enough:
+```
+python dg_train.py -f output_bw_512.hdf5 --arch EfficientNetB7 --epochs 10 --noise 0.3 --batch-size 4
+```
+Please replace `output_bw_512.hdf5` with the actual dataset file name, and modify the other parameters as necessary. A `--batch-size` of 8 may be used if the GPU has 32 GB of memory.
+
+If multiple nodes are used, it is necessary to run the code with `mpirun` or `mpiexec`. For example, to train the DNN on 2 nodes, each with 4 GPUs:
+```
+mpirun -np 8 python dg_train.py -f output_bw_512.hdf5 --arch EfficientNetB7 --epochs 10 --noise 0.3 --batch-size 4
+```
+
+If NVIDIA GPUs are used, `DeepGalaxy` can automatically bind an MPI process to a GPU, so no explicit setting of `CUDA_VISIBLE_DEVICES` is needed.
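+
+For reference, this automatic binding follows the standard Horovod device-pinning pattern sketched below. This is a minimal illustration of the mechanism, not `DeepGalaxy`'s actual code:
+```
+import tensorflow as tf
+import horovod.tensorflow.keras as hvd
+
+hvd.init()
+# Pin each MPI process to one GPU, based on the process's rank on its node,
+# which is why CUDA_VISIBLE_DEVICES does not need to be set manually.
+gpus = tf.config.experimental.list_physical_devices('GPU')
+if gpus:
+    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
+```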
diff --git a/tensorflow/Testcase_C/README.md b/tensorflow/Testcase_C/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9464c4c475f43a7e33468190ce53849eadaf9c94
--- /dev/null
+++ b/tensorflow/Testcase_C/README.md
@@ -0,0 +1,14 @@
+## Test Case C
+
+This test case aims to stress the underlying hardware with high-resolution images and a large deep neural network (DNN). For example, a run on a tier-0 cluster using 256 nodes (each with 2 CPU sockets, one MPI process per socket) would be (please allocate the compute resources first):
+
+```
+mpirun -np 512 python dg_train.py -f output_bw_2048.hdf5 --arch EfficientNetB7 --epochs 10 --noise 0.3 --batch-size 1
+```
+Here, `--batch-size 1` is used because the DNN is so large that it requires up to 160 GB of memory. This exceeds the memory of all currently available GPUs, so in principle the training should be performed on the CPU. On some large-memory nodes, a `batch-size` of 2 or even 4 might be possible.
+
+As a workaround, it is still possible to perform the training on GPUs by using CUDA's unified memory, although this will likely introduce a performance penalty. For example, to allocate 5x the size of the physical GPU memory and carry out the training on 256 nodes (each with 4 GPUs), the command line would be
+```
+mpirun -np 1024 python dg_train.py -f output_bw_2048.hdf5 --arch EfficientNetB7 --epochs 10 --noise 0.3 --batch-size 1 --gpu-mem-frac 5
+```
+Please note that the surplus memory requirement spills over to the host memory: in this example, with 32 GB GPUs, the host memory of each node is expected to absorb (160 - 32) * 4 = 512 GB. Please therefore choose `--gpu-mem-frac` according to the actual hardware specs, to ensure that the model has enough memory to operate.
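+
+For reference, oversubscribing GPU memory in TensorFlow can be done through `per_process_gpu_memory_fraction`, where values greater than 1.0 activate CUDA unified memory. The sketch below shows this mechanism with a TF1-style session configuration; it only illustrates what an option like `--gpu-mem-frac` may map to internally, not necessarily how `dg_train.py` implements it:
+```
+import tensorflow as tf
+
+# A fraction > 1.0 lets the process allocate more than the physical GPU
+# memory; the excess is backed by host memory via CUDA unified memory.
+gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=5.0)
+config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
+tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))
+```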
diff --git a/tensorflow/compile.sh b/tensorflow/compile.sh
new file mode 100755
index 0000000000000000000000000000000000000000..d0bc9e238b12c42b92f513b1ef9927f45aded231
--- /dev/null
+++ b/tensorflow/compile.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+echo "no need to compile, it's Python code with calls to TensorFlow and Horovod"
+echo 'To install the required libraries, please run the `install_prerequisites.sh` script.'
diff --git a/tensorflow/download.sh b/tensorflow/download.sh
new file mode 100755
index 0000000000000000000000000000000000000000..928965997326f4ccdb9e23937fe08c2c64cef9d0
--- /dev/null
+++ b/tensorflow/download.sh
@@ -0,0 +1,2 @@
+#!/bin/bash
+git clone https://github.com/maxwelltsai/DeepGalaxy.git
diff --git a/tensorflow/install_prerequisites.sh b/tensorflow/install_prerequisites.sh
new file mode 100644
index 0000000000000000000000000000000000000000..2a8a3d26530ea65ead0697c7c30a2a924a876e6d
--- /dev/null
+++ b/tensorflow/install_prerequisites.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+# Please create and activate a virtual environment first, if applicable.
+
+pip install tensorflow
+pip install horovod
+pip install scikit-learn
+pip install scikit-image
+pip install pandas
+pip install h5py
diff --git a/tensorflow/machines/.gitkeep b/tensorflow/machines/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/tensorflow/machines/jean-zay-gpu/.gitkeep b/tensorflow/machines/jean-zay-gpu/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/tensorflow/machines/jean-zay-gpu/batch_medium.slurm b/tensorflow/machines/jean-zay-gpu/batch_medium.slurm
new file mode 100644
index 0000000000000000000000000000000000000000..460fce118279c7245f0f2ba3eadcde5a9d820afc
--- /dev/null
+++ b/tensorflow/machines/jean-zay-gpu/batch_medium.slurm
@@ -0,0 +1,19 @@
+#!/bin/bash
+#SBATCH --job-name=dg_512_1_4_4_3_100   # job name
+#SBATCH --ntasks-per-node=4             # number of MPI tasks per node (one per GPU)
+#SBATCH -N 4                            # number of nodes
+#SBATCH --gres=gpu:4                    # number of GPUs per node
+#SBATCH --cpus-per-task=10              # number of CPU cores per task
+
+#SBATCH --hint=nomultithread            # use physical cores only
+#SBATCH --time=03:00:00                 # max walltime
+#SBATCH --exclusive
+#SBATCH -A qbg@gpu
+
+#SBATCH --output=dg.out                 # output filename
+#SBATCH --error=dg.err                  # error filename
+
+set -x
+source env_bench
+
+srun /gpfslocalsup/pub/idrtools/bind_gpu.sh python $DG_TRAIN
diff --git a/tensorflow/machines/jean-zay-gpu/env_bench b/tensorflow/machines/jean-zay-gpu/env_bench
new file mode 100644
index 0000000000000000000000000000000000000000..a426044e98cb080e3850d838c1a135a9b91832e7
--- /dev/null
+++ b/tensorflow/machines/jean-zay-gpu/env_bench
@@ -0,0 +1,11 @@
+module purge
+
+# load modules
+module load openmpi/4.0.2-cuda
+module load tensorflow-gpu/py3/2.1.0+hvd-0.19
+module load git
+
+# locate the DeepGalaxy training script relative to the repository root
+GIT_ROOT=`git rev-parse --show-toplevel`
+DG_DIR=$GIT_ROOT/Deepgalaxy/
+DG_TRAIN=$DG_DIR/DeepGalaxy-master/dg_train.py
diff --git a/tensorflow/machines/lisa-gpu-surf/batch_medium.slurm b/tensorflow/machines/lisa-gpu-surf/batch_medium.slurm
new file mode 100644
index 0000000000000000000000000000000000000000..460fce118279c7245f0f2ba3eadcde5a9d820afc
--- /dev/null
+++ b/tensorflow/machines/lisa-gpu-surf/batch_medium.slurm
@@ -0,0 +1,19 @@
+#!/bin/bash
+#SBATCH --job-name=dg_512_1_4_4_3_100   # job name
+#SBATCH --ntasks-per-node=4             # number of MPI tasks per node (one per GPU)
+#SBATCH -N 4                            # number of nodes
+#SBATCH --gres=gpu:4                    # number of GPUs per node
+#SBATCH --cpus-per-task=10              # number of CPU cores per task
+
+#SBATCH --hint=nomultithread            # use physical cores only
+#SBATCH --time=03:00:00                 # max walltime
+#SBATCH --exclusive
+#SBATCH -A qbg@gpu
+
+#SBATCH --output=dg.out                 # output filename
+#SBATCH --error=dg.err                  # error filename
+
+set -x
+source env_bench
+
+srun /gpfslocalsup/pub/idrtools/bind_gpu.sh python $DG_TRAIN
diff --git a/tensorflow/machines/lisa-gpu-surf/env_bench b/tensorflow/machines/lisa-gpu-surf/env_bench
new file mode 100644
index 0000000000000000000000000000000000000000..54cc270e85d3032539783936a4f91e6a08f02181
--- /dev/null
+++ b/tensorflow/machines/lisa-gpu-surf/env_bench
@@ -0,0 +1,6 @@
+module purge
+
+# load modules
+module load 2020
+module load TensorFlow/2.1.0-foss-2019b-Python-3.7.4-CUDA-10.1.243
+
diff --git a/tensorflow/machines/lisa-gpu-surf/submit.slurm b/tensorflow/machines/lisa-gpu-surf/submit.slurm
new file mode 100644
index 0000000000000000000000000000000000000000..4dfa645ba8d6f99793b2bab947fdcd4476ea7cec
--- /dev/null
+++ b/tensorflow/machines/lisa-gpu-surf/submit.slurm
@@ -0,0 +1,22 @@
+#!/bin/bash -l
+#SBATCH -J DeepGalaxyTrain
+#SBATCH -o DeepGalaxyTrain_out.txt
+#SBATCH -e DeepGalaxyTrain_err.txt
+
+#SBATCH -t 1:00:00
+
+#SBATCH --partition gpu_titanrtx
+
+#SBATCH --nodes 8
+#SBATCH --ntasks-per-node=4
+#SBATCH --gres=gpu:4
+
+
+
+module purge
+module load 2020
+module load TensorFlow/2.1.0-foss-2019b-Python-3.7.4-CUDA-10.1.243
+# module load TensorFlow/1.15.0-foss-2019b-Python-3.7.4-10.1.243
+
+# 8 nodes x 4 tasks per node = 32 MPI processes, one per GPU
+mpirun -np 32 python dg_train.py -f output_bw_512.hdf5 --arch EfficientNetB4 --epochs 10 --noise 0.3 --batch-size 4
+
diff --git a/tensorflow/testcase_medium/.gitignore b/tensorflow/testcase_medium/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..6b9728353a8a39e5ed5cab7842e0e78de88271e6
--- /dev/null
+++ b/tensorflow/testcase_medium/.gitignore
@@ -0,0 +1,8 @@
+batch_medium.slurm
+efn_b4.h5
+env_bench
+model_hvd_bw_512_B4_with_noise_n_p_4.h5
+output_bw_512.hdf5
+results-DG-medium/
+train_log.txt
+
diff --git a/tensorflow/testcase_medium/.gitkeep b/tensorflow/testcase_medium/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/tensorflow/testcase_medium/README.md b/tensorflow/testcase_medium/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..8f0eae751b92553aed24238bb9d4664c81848f47
--- /dev/null
+++ b/tensorflow/testcase_medium/README.md
@@ -0,0 +1,8 @@
+Medium test case presentation
+-----------------------------
+
+This test case performs a training run using 512x512 images, with 3 positions per image, as input.
+
+Reference time on Jean Zay with 4 nodes, 16 MPI processes, 16 GPUs, 3 positions and 100 epochs:
+
+* For 100 epochs: ~67 ms/sample, with a time to solution of about 32 min 30 s
diff --git a/tensorflow/testcase_medium/prepare.sh b/tensorflow/testcase_medium/prepare.sh
new file mode 100755
index 0000000000000000000000000000000000000000..528d2720aabf45e3740d9c62fd9b15a977bcbad1
--- /dev/null
+++ b/tensorflow/testcase_medium/prepare.sh
@@ -0,0 +1,16 @@
+#!/bin/bash
+
+if [ -z "$1" ]
+  then
+    echo "Please provide the targeted machine from:"
+    ls ../machines/
+    echo ""
+    echo "Example: ./prepare.sh jean-zay-gpu"
+    exit 1
+fi
+machine_dir="../machines/$1"
+
+cp $machine_dir/env_bench .
+cp $machine_dir/batch_medium.slurm .
+
+ln -s ../DeepGalaxy-master/output_bw_512.hdf5 .
diff --git a/tensorflow/testcase_medium/run.sh b/tensorflow/testcase_medium/run.sh
new file mode 100755
index 0000000000000000000000000000000000000000..2469b4581cf6f0ca145e98ee2d1dfd25d659f22b
--- /dev/null
+++ b/tensorflow/testcase_medium/run.sh
@@ -0,0 +1,2 @@
+#!/bin/bash
+sbatch batch_medium.slurm
diff --git a/tensorflow/testcase_medium/validate.sh b/tensorflow/testcase_medium/validate.sh
new file mode 100755
index 0000000000000000000000000000000000000000..cd3f906d4f2d9d5232aa712f2a02c56aa65ff54d
--- /dev/null
+++ b/tensorflow/testcase_medium/validate.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+
+set -e
+
+RESULT_DIR=results-DG-medium
+mkdir -p $RESULT_DIR
+
+# keep the raw job logs alongside the extracted results
+cp dg.err dg.out $RESULT_DIR
+
+# extract the per-epoch progress lines and the final epoch from the job output
+grep "Epoch" -A1 dg.out > $RESULT_DIR/epochs.results
+grep "Epoch 100/100" -A1 dg.out > $RESULT_DIR/last_epoch.results
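+
+# Optionally, also record the per-sample timing reported for the last epoch.
+# NOTE: this assumes the TF 2.1 Keras progress-bar format, which prints
+# timings such as "67ms/sample"; adjust the pattern if the format differs.
+grep -Eo '[0-9]+ms/sample' dg.out | tail -n 1 > $RESULT_DIR/ms_per_sample.results || true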