# NEMO
## Summary Version
1.0
## Purpose of Benchmark
NEMO (Nucleus for European Modelling of the Ocean) is a mathematical modelling framework for research activities and prediction services in ocean and climate sciences, developed by a European consortium. It is intended to be a tool for studying the ocean and its interaction with the other components of the Earth climate system over a large range of space and time scales. It comprises the core engines OPA (ocean dynamics and thermodynamics), SI3 (sea ice dynamics and thermodynamics), TOP (oceanic tracers) and PISCES (biogeochemical processes).
Prognostic variables in NEMO are the three-dimensional velocity field, a linear or non-linear sea surface height, the temperature and the salinity. In the horizontal direction, the model uses a curvilinear orthogonal grid and in the vertical direction, a full or partial step z-coordinate, or s-coordinate, or a mixture of the two. Variables are distributed on a three-dimensional Arakawa C-type grid in most cases.
## Characteristics of Benchmark
The model is implemented in Fortran 90, with pre-processing (C pre-processor). It is optimized for vector computers and parallelized by domain decomposition with MPI. It supports modern C/C++ and Fortran compilers. All input and output is done with third-party software called XIOS, which depends on NetCDF (Network Common Data Form) and HDF5. It is highly scalable and well suited to measuring supercomputer performance in terms of compute capacity, memory subsystem, I/O and interconnect performance.
## Mechanics of Building Benchmark
### Building XIOS
1. Download the XIOS source code:
svn co https://forge.ipsl.jussieu.fr/ioserver/svn/XIOS/branchs/xios-2.5
2. The known architectures can be listed with the following command:
./make_xios --avail
If the target architecture is a known one, XIOS can be built with the following command:
./make_xios --arch X64_CURIE
Otherwise, arch-local.env, arch-local.fcm and arch-local.path files should be provided for the target architecture. Then build with:
./make_xios --arch local
Note that XIOS requires NetCDF4. Please load the appropriate HDF5 and NetCDF4 modules. You might have to change the paths in the configuration files.
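Before starting the build, it can help to confirm that the tools mentioned in this section are on the PATH. The following is a minimal sketch, not part of the official build procedure: the helper name `have_tool` is hypothetical, and the list of tools is an assumption (`nf-config` in particular only exists when the NetCDF Fortran package is installed).

```shell
# Report whether a command is on PATH; returns 0 if found, 1 otherwise.
# have_tool is a hypothetical helper name used only in this sketch.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# The tool list is an assumption based on the commands used in this document.
for tool in svn mpif90 nc-config nf-config; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# nc-config can report whether the NetCDF build includes NetCDF-4 support.
if have_tool nc-config; then
    echo "NetCDF-4 support: $(nc-config --has-nc4)"
fi
```

If a tool is reported missing, loading the corresponding module usually fixes it.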
### Building NEMO
1. Download the NEMO source code:
svn co https://forge.ipsl.jussieu.fr/nemo/svn/NEMO/releases/release-4.0
2. Copy and set up the appropriate architecture file in the arch folder. The following changes are recommended:
a. add the -lnetcdff and -lstdc++ flags to the NetCDF flags
b. use mpif90, which is an MPI binding of gfortran-4.9
c. add -cpp and -ffree-line-length-none to the Fortran flags
d. replace gmake with make
3. Then build the executable with the following command:
./makenemo -m MY_CONFIG -r GYRE_XIOS -n MY_GYRE add_key "key_nosignedzero"
4. Apply the patch described at the following link to measure the step time:
https://software.intel.com/en-us/articles/building-and-running-nemo-on-xeon-processors
## Mechanics of Running Benchmark
### Prepare input files
cd MY_GYRE/EXP00
sed -i '/using_server/s/false/true/' iodef.xml
sed -i '/&nameos/a ln_useCT = .false.' namelist_cfg
sed -i '/&namctl/a nn_bench = 1' namelist_cfg
### Run the experiment interactively
mpirun -n 4 ../BLD/bin/nemo.exe : -n 2 $PATH_TO_XIOS/bin/xios_server.exe
### GYRE configuration with higher resolution
Modify the configuration (for example, for test case A):
rm -f time.step solver.stat output.namelist.dyn ocean.output slurm-* GYRE_* mesh_mask_00*
jp_cfg=4
sed -i -r \
-e 's/^( *nn_itend *=).*/\1 21600/' \
-e 's/^( *nn_stock *=).*/\1 21600/' \
-e 's/^( *nn_write *=).*/\1 1000/' \
-e 's/^( *jp_cfg *=).*/\1 '"$jp_cfg"'/' \
-e 's/^( *jpidta *=).*/\1 '"$(( 30 * jp_cfg +2))"'/' \
-e 's/^( *jpjdta *=).*/\1 '"$(( 20 * jp_cfg +2))"'/' \
-e 's/^( *jpiglo *=).*/\1 '"$(( 30 * jp_cfg +2))"'/' \
-e 's/^( *jpjglo *=).*/\1 '"$(( 20 * jp_cfg +2))"'/' \
namelist_cfg
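To check what the substitutions do before touching the real namelist_cfg, the same sed expressions can be tried on a scratch file. This is a minimal sketch: the function name `set_gyre_resolution` and the three-line scratch file are made up for illustration, and the `-i -r` options assume GNU sed, as in the command above.

```shell
# Apply the resolution edits to a given namelist file.
# $1 = namelist file, $2 = jp_cfg (hypothetical helper for this sketch)
set_gyre_resolution() {
    file=$1
    cfg=$2
    sed -i -r \
        -e 's/^( *jp_cfg *=).*/\1 '"$cfg"'/' \
        -e 's/^( *jpiglo *=).*/\1 '"$(( 30 * cfg + 2 ))"'/' \
        -e 's/^( *jpjglo *=).*/\1 '"$(( 20 * cfg + 2 ))"'/' \
        "$file"
}

# Scratch file with made-up content standing in for namelist_cfg.
scratch=$(mktemp)
printf '%s\n' \
    '   jp_cfg      =   1' \
    '   jpiglo      =  32' \
    '   jpjglo      =  22' > "$scratch"

set_gyre_resolution "$scratch" 4
result=$(cat "$scratch")
rm -f "$scratch"
echo "$result"
```

Each expression keeps everything up to and including the `=` sign and replaces the rest of the line, so the key names and their indentation are preserved.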
## Verification of Results
The GYRE configuration is set through the namelist_cfg file. The horizontal resolution is determined by setting jp_cfg as follows:
jpiglo = 30 × jp_cfg + 2
jpjglo = 20 × jp_cfg + 2
In this configuration, we use the default of 30 ocean levels, set by jpk=31. The GYRE configuration is an ideal case for benchmark tests because it is very simple to increase the resolution and perform both weak and strong scalability experiments using the same input files. We use two configurations as follows:
Test Case A:

* jp_cfg = 128, suitable for up to 1000 cores
* Number of days: 20
* Number of time steps: 1440
* Time step size: 20 min
* Number of seconds per time step: 1200

Test Case B:

* jp_cfg = 256, suitable for up to 20,000 cores
* Number of days (real): 80
* Number of time steps: 4320
* Time step size (real): 20 min
* Number of seconds per time step: 1200
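The horizontal grid sizes implied by the jp_cfg values above follow from the resolution formulas, and the step count of Test Case A follows from its run length and step size. The sketch below works through that arithmetic; the function names `gyre_grid` and `steps_for` are hypothetical helpers, not part of NEMO.

```shell
# gyre_grid JP_CFG -> prints "jpiglo jpjglo" for the GYRE configuration
gyre_grid() {
    echo "$(( 30 * $1 + 2 )) $(( 20 * $1 + 2 ))"
}

# steps_for DAYS STEP_SECONDS -> number of time steps needed to cover DAYS
steps_for() {
    echo $(( $1 * 86400 / $2 ))
}

echo "Test Case A grid (jp_cfg=128): $(gyre_grid 128)"
echo "Test Case B grid (jp_cfg=256): $(gyre_grid 256)"
echo "Test Case A steps (20 days, 1200 s): $(steps_for 20 1200)"
```

Doubling jp_cfg roughly quadruples the number of horizontal grid points, which is what makes the two test cases suitable for different core counts.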
We performed scalability tests on 512 cores and 1024 cores for test case A, and on 4096 cores, 8192 cores and 16384 cores for test case B.
Both test cases give a good understanding of node performance and interconnect behavior. We switch off the generation of mesh files by setting the flag nn_mesh = 0 in the namelist_ref file. In addition, using_server = false is defined in the io_server file.
We report performance as step time, which is the total computational time averaged over the number of time steps for each test case. This allows systems to be compared in a standard manner across all combinations of system architectures. Reporting time per computational time step also makes results more reproducible and comparable.
Since NEMO supports both weak and strong scalability, test case A and test case B can both be scaled down to run on a smaller number of processors while keeping the memory per processor constant, achieving similar results for step time. To measure the step time, we inserted a patch that adds an ``MPI_Wtime()`` function call to the ``nemogcm.F90`` file for each step and accumulates the step time until the second-to-last step. We then divide the total cumulative time by the number of time steps to average out any overhead.
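The averaging itself is a single division. As a minimal sketch, assuming the cumulative time in seconds has already been read from the patched output, it can be done with awk; the function name `avg_step_time` and the input value 360.0 are made up for illustration and are not measured results.

```shell
# avg_step_time TOTAL_SECONDS NUM_STEPS -> mean seconds per time step
# (hypothetical helper; floating-point division is delegated to awk)
avg_step_time() {
    awk -v total="$1" -v steps="$2" 'BEGIN {
        if (steps <= 0) {
            print "number of steps must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.4f\n", total / steps
    }'
}

# Illustrative values only: 360.0 s accumulated over 1440 steps.
avg_step_time 360.0 1440
```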
## Sources
https://forge.ipsl.jussieu.fr/nemo/chrome/site/doc/NEMO/guide/html/install.html
https://forge.ipsl.jussieu.fr/ioserver/wiki/documentation
https://nemo-related.readthedocs.io/en/latest/compilation_notes/nemo37.html
<!--
GNU make and Python 2.x are required for the build process, as are a Fortran 2003 compiler and matching C compiler, e.g. gcc/gfortran (gcc >=4.6 works, later version is recommended).
CP2K can benefit from a number of external libraries for improved performance. It is advised to use vendor-optimized versions of these libraries. If these are not available on your machine, there exist freely available implementations of these libraries including but not limited to those listed below.
### Download the source code
Download a CP2K release from https://sourceforge.net/projects/cp2k/files/ or follow instructions at https://www.cp2k.org/download to check out the relevant branch of the CP2K GitHub repository.
### Install or locate required libraries
**LAPACK & BLAS**
Can be provided from:
netlib : http://netlib.org/lapack & http://netlib.org/blas
MKL : part of the Intel MKL installation
LibSci : installed on Cray platforms
ATLAS : http://math-atlas.sf.net
OpenBLAS : http://www.openblas.net
clBLAS : http://gpuopen.com/compute-product/clblas/
**SCALAPACK and BLACS**
Can be provided from:
netlib : http://netlib.org/scalapack/
MKL : part of the Intel MKL installation
LibSci : installed on Cray platforms
**LIBINT**
Available from - https://www.cp2k.org/static/downloads/libint-1.1.4.tar.gz
The following commands will uncompress and install the LIBINT library required for the UEABS benchmarks:
tar xzf libint-1.1.4.tar.gz
cd libint-1.1.4
./configure CC=cc CXX=CC --prefix=install_path : must not be this directory
make
make install
Note: The environment variables ``CC`` and ``CXX`` are optional and can be used to specify the C and C++ compilers to use for the build (the example above is configured to use the compiler wrappers ``cc`` and ``CC`` used on Cray systems).
### Install optional libraries
FFTW3 : http://www.fftw.org or provided as an interface by MKL
Libxc : http://www.tddft.org/programs/octopus/wiki/index.php/Libxc
ELPA : https://www.cp2k.org/static/downloads/elpa-2016.05.003.tar.gz
libgrid : within CP2K distribution - cp2k/tools/autotune_grid
libxsmm : https://www.cp2k.org/static/downloads/libxsmm-1.4.4.tar.gz
### Compile CP2K
Before compiling, the choice of compilers, the library locations, and the compilation and linker flags need to be specified. This is done in an arch (architecture) file. Example arch files for a number of common architectures can be found inside the ``cp2k/arch`` directory. The names of these files match the pattern architecture.version (e.g., Linux-x86-64-gfortran.sopt). The case "version=psmp" corresponds to the hybrid MPI + OpenMP version that you should build to run the UEABS benchmarks. Machine-specific examples can be found in the relevant subdirectory.
In most cases you need to create a custom arch file, either from scratch or by modifying an existing one that roughly fits the cpu type, compiler, and installation paths of libraries on your system. You can also consult https://dashboard.cp2k.org, which provides sample arch files as part of the testing reports for some platforms (click on the status field for a platform, and search for 'ARCH-file' in the resulting output).
As a guide for GNU compilers the following should be included in the ``arch`` file:
**Specification of which compiler and linker commands to use:**
CC = gcc
FC = mpif90
LD = mpif90
AR = ar -r
CP2K is primarily a Fortran code, so only the Fortran compiler needs to be MPI-enabled.
**Specification of the ``DFLAGS`` variable, which should include:**
-D__parallel : to build parallel CP2K executable
-D__SCALAPACK : to link SCALAPACK
-D__LIBINT : to link to LIBINT
-D__MKL : if relying on MKL for ScaLAPACK and/or an FFTW interface
-D__HAS_NO_SHARED_GLIBC : for convenience on HPC systems, see INSTALL.md file
Additional DFLAGS which are needed to link to performance libraries, such as -D__FFTW3 to link to FFTW3, are listed in the INSTALL file.
**Specification of compiler flags ``FCFLAGS`` (for gfortran):**
FCFLAGS = $(DFLAGS) -ffree-form -fopenmp : Required
FCFLAGS = $(DFLAGS) -ffree-form -fopenmp -O3 -ffast-math -funroll-loops : Recommended
If you want to link any libraries containing header files you should pass the path to the directory containing these to FCFLAGS in the format -I/path_to_include_dir.
**Specification of libraries to link to:**
-L{path_to_libint}/lib -lderiv -lint : Required for LIBINT
If you use MKL to provide ScaLAPACK and/or an FFTW interface the LIBS variable should be used to pass the relevant flags provided by the MKL Link Line Advisor (https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor), which you should use carefully in order to generate the right options for your system.
#### Building the executable
To build the hybrid MPI+OpenMP executable ``cp2k.psmp`` using *your_arch_file.psmp* run make in the ``cp2k/makefiles`` directory for v4-6 (or in the top-level cp2k directory for v7+).
make -j N ARCH=your_arch_file VERSION=psmp : on N threads
make ARCH=your_arch_file VERSION=psmp : serially
The executable ``cp2k.psmp`` will then be located in:
cp2k/exe/your_arch_file
### Compiling CP2K for CUDA enabled GPUs
Arch files for compiling CP2K for CUDA enabled GPUs can be found here:
In general the main steps are:
1. Load the cuda module.
2. Ensure that CUDA_PATH variable is set.
3. Add the following to the arch file:
**Additional required compiler and linker commands**
NVCC = nvcc
**Additional ``DFLAGS``**
-D__ACC -D__DBCSR_ACC -D__PW_CUDA
**Set ``NVFLAGS``**
NVFLAGS = $(DFLAGS) -O3 -arch sm_60
**Additional required libraries**
-lcudart -lcublas -lcufft -lrt
## Mechanics of Running Benchmark
The general way to run the benchmarks with the hybrid parallel executable is:
export OMP_NUM_THREADS=X
parallel_launcher launcher_options path_to_/cp2k.psmp -i inputfile.inp -o logfile
Where:
* The environment variable for the number of threads must be set before calling the executable.
* The parallel_launcher is mpirun, mpiexec, or some variant such as aprun on Cray systems or srun when using Slurm.
* launcher_options specifies parallel placement in terms of total numbers of nodes, MPI ranks/tasks, tasks per node, and OpenMP threads per task (which should be equal to the value given to OMP_NUM_THREADS). This is not necessary if parallel runtime options are picked up by the launcher from the job environment.
* You can try any combination of tasks per node and OpenMP threads per task to investigate absolute performance and scaling on the machine of interest.
* The inputfile usually has the extension .inp, and may specify within it further required files (such as basis sets, potentials, etc.)
You can try any combination of tasks per node and OpenMP threads per task to investigate absolute performance and scaling on the machine of interest. For tier-1 systems the best performance is usually obtained with pure MPI, while for tier-0 systems the best performance is typically obtained using 1 MPI task per node with the number of threads being equal to the number of cores per node.
**UEABS benchmarks:**
Test Case | System | Number of Atoms | Run type | Description | Location |
----------|------------|-----------------|---------------|------------------------------------------------------|---------------------------------|
a | H2O-512 | 1236 | MD | Uses the Born-Oppenheimer approach via Quickstep DFT | ``/tests/QS/benchmark/`` |
b | LiHFX | 216 | Single-energy | Must create wavefuntion first - see benchmark README | ``/tests/QS/benchmark_HFX/LiH`` |
c | H2O-DFT-LS | 6144 | Single-energy | Uses linear scaling DFT | ``/tests/QS/benchmark_DM_LS`` |
More information in the form of a README and an example job script is included in each benchmark tar file.
## Verification of Results
The run walltime is reported near the end of logfile:
grep "CP2K " logfile | awk -F ' ' '{print $7}'
-->