Commit 46913cc5 authored by Jacob Finkenrath

Update QUDA instructions, add QPhiX benchmark results from JUWELS

parent 1a1f4085
@@ -9,7 +9,7 @@ and on the QPhiX library
[^]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,
The library QUDA is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). Currently a HIP and a generic version are under development, which can be used for AMD GPUs and, once ready, for CPU architectures such as ARM. The generic QUDA kernel might replace the computational kernel of QPhiX in the future. The QPhiX library consists of routines developed for the Intel Xeon Phi architecture and also performs well on general x86 architectures, such as Intel Xeon and AMD Epyc CPUs. QPhiX is optimized to use Intel intrinsic functions for multiple vector lengths, including AVX512 (http://jeffersonlab.github.io/qphix/). In general, the benchmark kernels apply the Conjugate Gradient solver to the Wilson Dirac operator, a 4-dimensional stencil.
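For orientation, a brief sketch of the underlying problem (the symbols kappa, U_mu, gamma_mu, psi and eta are notation chosen here for illustration; the clover term and boundary conditions are omitted): the solvers in both libraries solve a lattice Dirac equation D psi = eta, where the Wilson Dirac operator acts as a nearest-neighbour stencil in the four space-time directions,
```math
D\psi(x) = \psi(x) - \kappa \sum_{\mu=1}^{4} \Big[ (1-\gamma_\mu)\, U_\mu(x)\, \psi(x+\hat\mu) + (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \psi(x-\hat\mu) \Big].
```
Since D is not hermitian, the Conjugate Gradient solver is typically applied to a hermitian form such as the normal equation D^dagger D psi = D^dagger eta, or to an even-odd preconditioned variant of it.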
## Table of Contents
@@ -33,16 +33,18 @@ and build `quda` via
cd quda
mkdir build
cd build
cmake -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF -DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON -DQUDA_GPU_ARCH=sm_XX ..
```
Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture (`sm_80` for NVIDIA A100, `sm_70` for NVIDIA V100, `sm_60` for Pascal, `sm_35` for Kepler, etc.).
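If the compute capability of the installed card is not known, it can usually be queried directly; a small sketch, assuming a reasonably recent NVIDIA driver that supports the `compute_cap` query field:
```shell
# prints the compute capability, e.g. "7.0" for a V100, which corresponds to sm_70
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```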
If CMake or the compilation fails, library paths and options can be set via `ccmake`, the interactive configuration tool provided by CMake.
Use `./PATH2CMAKE/ccmake PATH2BUILD_DIR` to see and edit the available options.
CMake generates the Makefiles; run them using `make`.
Now, in the folder `/test`, one can find the needed executable `invert_`.
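For orientation, a minimal sketch of these last steps (the build path and the number of parallel build jobs are placeholders, not taken from this guide):
```shell
cd PATH2BUILD_DIR   # the QUDA build directory created above
ccmake .            # optionally inspect or adjust the CMake cache
make -j 8           # build the library and the test executables
ls tests            # the invert benchmark executable should appear here (folder name may differ between QUDA versions)
```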
Note that currently (status 07/21) the master branch of QUDA fails to download `Eigen`, due to an updated hash of the `Eigen` version. To circumvent this, `Eigen` can be provided externally by switching off the automatic download via `-DQUDA_DOWNLOAD_EIGEN=OFF` and providing the external path via `-DEIGEN_INCLUDE_DIR=$PATH_TO_EIGEN`, as sketched below. Alternatively, it is possible to update the hash line within the CMake file of QUDA. It is expected that this issue will be solved in the near future with a new release of QUDA.
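A sketch of such a configure call (only a subset of the options shown above is repeated here, and `$PATH_TO_EIGEN` is a placeholder for the local Eigen installation):
```shell
# same configuration as above, but with the automatic Eigen download disabled
cmake -DQUDA_DIRAC_WILSON=ON -DQUDA_MPI=ON -DQUDA_GPU_ARCH=sm_70 \
      -DQUDA_DOWNLOAD_EIGEN=OFF \
      -DEIGEN_INCLUDE_DIR=$PATH_TO_EIGEN \
      ..
```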
#### 1.2 Run
@@ -204,13 +206,13 @@ QPhiX currently requires additional third party libraries, like the USQCD-librar
https://github.com/usqcd-software
```
`libxml2` is available under
```shell
http://xmlsoft.org/
```
while the repository of QPhiX is hosted under the Jefferson Lab GitHub account, see
``` shell
git clone https://github.com/JeffersonLab/qphix
@@ -218,8 +220,7 @@ git clone https://github.com/JeffersonLab/qphix
#### 2.1 Compile
The QPhiX library is based on QMP communication functions, which need to be provided. Note that you might have to regenerate the `configure` script using `autoreconf` from GNU Autotools. QMP can be configured using:
``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=" -xHOST -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
@@ -235,7 +236,7 @@ make install
to compile QMP and install the necessary files into `$QMP_INSTALL_DIR`.
For the current master branch of `QPhiX` it is required to provide the package `qdp++`, which has the sub-dependencies `qio`, `xpath_reader`, `c-lime` and `libxml2`. QDP++ can be configured using (here for a Skylake chip):
``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR --disable-filedb
@@ -453,17 +454,51 @@ of the target machine.
#### 2.3 Example Benchmark Results
Here we show some benchmark results for the Xeon(Phi) benchmark suite, which were performed on the JUWELS Cluster Module at JSC, on Frioul, a test cluster at CINES, and on the hybrid partition of MareNostrum III at BSC.
Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNCs per node. The JUWELS Cluster Module is equipped with two Intel Xeon Platinum 8168 CPUs (Intel Skylake generation) per node. The data show "Strong-Scaling" runs with a global lattice size of 32^3x96 for all three machines; for Frioul, additional "Weak-Scaling" results with a local lattice size of 48^3x24 per card are listed as well. The benchmark uses a random gauge configuration and applies the
Conjugate Gradient solver to solve a linear equation involving the clover Wilson Dirac operator.
```
---------------------
JUWELS Cluster Module
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
Per node: 8 MPI tasks with 6 OpenMP threads each

precision: single
Nodes    time to solution (s)    GFLOPS
1        6.38306                  276.41
2        3.45452                  514.21
4        1.91458                  915.26
8        1.02946                 1725.51
16       0.790274                2217.38
32       0.338817                5313.64
64       0.271364                7121.05
128      0.261831                7426.16
256      0.355913                6103.95
512      0.532555                4079.35

precision: double
Nodes    time to solution (s)    GFLOPS
1        22.8705                  132.26
2        12.3162                  245.60
4        6.63012                  456.23
8        3.49709                  864.96
16       1.77833                 1700.95
32       0.94527                 3199.98
64       0.585362                5167.48
128      0.379343                7973.9
256      0.966274                3130.42
512      0.884134                3421.25
```
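As a quick consistency check of the units (an observation added here, not part of the original data): the product of time to solution and sustained GFLOPS is roughly constant across node counts, e.g. 22.8705 s × 132.26 GFLOPS ≈ 3.0 TFLOP on 1 node and 0.884134 s × 3421.25 GFLOPS ≈ 3.0 TFLOP on 512 nodes in double precision, as expected for strong scaling of a fixed-size problem.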
```
---------------------
Frioul - KNLs
@@ -487,27 +522,6 @@ KNLs GFLOPS
2 616.467
4 1047.79
8 1616.37
Weak - Scaling:
local lattice size (48x48x48x24)
precision: single
KNLs GFLOPS
1 348.304
2 616.697
4 1214.82
8 2425.45
16 4404.63
precision: double
KNLs GFLOPS
1 172.303
2 320.761
4 629.79
8 1228.77
16 2310.63
```