diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml deleted file mode 100644 index 5335a883cbb8295c980a3ac944d2de65d5d8f5cb..0000000000000000000000000000000000000000 --- a/.gitlab-ci.yml +++ /dev/null @@ -1,28 +0,0 @@ -iwoph17_paper: - image: aergus/latex - script: - - mkdir -vp doc/build/tex - - latexmk -help - - cd doc/iwoph17/ && latexmk -output-directory=../build/tex -pdf t72b.tex - artifacts: - paths: - - doc/build/tex/t72b.pdf - -pages: - image: alpine - script: - - apk --no-cache add py2-pip python-dev - - pip install sphinx - - apk --no-cache add make - - ls -al - - pwd - - cd doc/sphinx && make html && cd - - - mv doc/build/sphinx/html public - only: - - master - artifacts: - paths: - - public - - - diff --git a/Makefile b/Makefile deleted file mode 100644 index 92526d17a18789562fe1751f0443b68eb6e75655..0000000000000000000000000000000000000000 --- a/Makefile +++ /dev/null @@ -1,8 +0,0 @@ -.PHONY:doc clean - -doc: - $(MAKE) html -C doc/sphinx/ - -clean: - $(MAKE) clean -C doc/sphinx/ - diff --git a/README.md b/README.md index 1384692da0dde8e63fa5982b4449869fcb2dbfad..178e3a0a411724810982ea973249ae05e2329a30 100644 --- a/README.md +++ b/README.md @@ -53,21 +53,22 @@ The Alya System is a Computational Mechanics code capable of solving different p # Code_Saturne -Code_Saturne® is a multipurpose Computational Fluid Dynamics (CFD) software package, which has been developed by EDF (France) since 1997. The code was originally designed for industrial applications and research activities in several fields related to energy production; typical examples include nuclear power thermal-hydraulics, gas and coal combustion, turbo-machinery, heating, ventilation, and air conditioning. In 2007, EDF released the code as open-source and this provides both industry and academia to benefit from its extensive pedigree. Code_Saturne®’s open-source status allows for answers to specific needs that cannot easily be made available in commercial “black box” packages. It also makes it possible for industrial users and for their subcontractors to develop and maintain their own independent expertise and to fully control the software they use. +Code_Saturne is open-source multi-purpose CFD software, primarily developed by EDF R&D and maintained by them. It relies on the Finite Volume method and a collocated arrangement of unknowns to solve the Navier-Stokes equations, for incompressible or compressible flows, laminar or turbulent flows and non-Newtonian and Newtonian fluids. A highly parallel coupling library (Parallel Locator Exchange - PLE) is also available in the distribution to account for other physics, such as conjugate heat transfer and structure mechanics. For the incompressible solver, the pressure is solved using an integrated Algebraic Multi-Grid algorithm and the scalars are computed by conjugate gradient methods or Gauss-Seidel/Jacobi. -Code_Saturne® is based on a co-located finite volume approach that can handle three-dimensional meshes built with any type of cell (tetrahedral, hexahedral, prismatic, pyramidal, polyhedral) and with any type of grid structure (unstructured, block structured, hybrid). The code is able to simulate either incompressible or compressible flows, with or without heat transfer, and has a variety of models to account for turbulence. Dedicated modules are available for specific physics such as radiative heat transfer, combustion (e.g. with gas, coal and heavy fuel oil), magneto-hydro dynamics, and compressible flows, two-phase flows. 
The software comprises of around 350 000 lines of source code, with about 37% written in Fortran90, 50% in C and 15% in Python. The code is parallelised using MPI with some OpenMP. +The original version of the code is written in C for pre-postprocessing, IO handling, parallelisation handling, linear solvers and gradient computation, and Fortran 95 for most of the physics implementation. MPI is used on distributed memory machines and OpenMP pragmas have been added to the most costly parts of the code to handle potential shared memory. The version used in this work (also freely available) relies also on CUDA to take advantage of potential GPU acceleration. -- Web site: http://code-saturne.org -- Code download: http://code-saturne.org/cms/download or https://repository.prace-ri.eu/ueabs/Code_Saturne/1.3/Code_Saturne-4.0.6_UEABS.tar.gz +The equations are solved iteratively using time-marching algorithms, and most of the time spent during a time step is usually due to the computation of the velocity-pressure coupling, for simple physics. For this reason, the two test cases chosen for the benchmark suite have been designed to assess the velocity-pressure coupling computation, and rely on the same configuration, with a mesh 8 times larger for Test Case B than for Test Case A, the time step being halved to ensure a correct Courant number. + +- Web site: https://code-saturne.org +- Code download: https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS.tar.gz - Disclaimer: please note that by downloading the code from this website, you agree to be bound by the terms of the GPL license. -- Build instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/code_saturne/Code_Saturne_Build_Run_4.0.6.pdf -- Test Case A: https://repository.prace-ri.eu/ueabs/Code_Saturne/1.3/Code_Saturne_TestCaseA.tar.gz -- Test Case B: https://repository.prace-ri.eu/ueabs/Code_Saturne/1.3/Code_Saturne_TestCaseB.tar.gz -- Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/code_saturne/Code_Saturne_Build_Run_4.0.6.pdf +- Build and Run instructions: [code_saturne/Code_Saturne_Build_Run_5.3_UEABS.pdf](code_saturne/Code_Saturne_Build_Run_5.3_UEABS.pdf) +- Test Case A: https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_13M.tar.gz +- Test Case B: https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_111M.tar.gz # CP2K -CP2K is a freely available quantum chemistry and solid-state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. CP2K provides a general framework for different modelling methods such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, ...), and classical force fields (AMBER, CHARMM, ...). CP2K can do simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or dimer method. +CP2K is a freely available quantum chemistry and solid-state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. CP2K provides a general framework for different modelling methods such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. 
Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, ...), and classical force fields (AMBER, CHARMM, ...). CP2K can do simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or dimer method. CP2K is written in Fortran 2008 and can be run in parallel using a combination of multi-threading, MPI, and CUDA. All of CP2K is MPI parallelised, with some additional loops also being OpenMP parallelised. It is therefore most important to take advantage of MPI parallelisation, however running one MPI rank per CPU core often leads to memory shortage. At this point OpenMP threads can be used to utilise all CPU cores without suffering an overly large memory footprint. The optimal ratio between MPI ranks and OpenMP threads depends on the type of simulation and the system in question. CP2K supports CUDA, allowing it to offload some linear algebra operations including sparse matrix multiplications to the GPU through its DBCSR acceleration layer. FFTs can optionally also be offloaded to the GPU. Benefits of GPU offloading may yield improved performance depending on the type of simulation and the system in question. diff --git a/alya/ALYA_Build_README.txt b/alya/ALYA_Build_README.txt index 71aa30f0420dfdcf17125a7c332794d2d2509fd4..bd971b53b3abf7fcabc62a991e71da3fe0c51d52 100644 --- a/alya/ALYA_Build_README.txt +++ b/alya/ALYA_Build_README.txt @@ -1,8 +1,8 @@ In order to build ALYA (Alya.x), please follow these steps: -- Go to: Thirdparties/metis-4.0 and build the Metis library (libmetis.a) using 'make' - Go to the directory: Executables/unix -- Adapt the file: configure-marenostrum-mpi.txt to your own MPI wrappers and paths +- Build the Metis library (libmetis.a) using "make metis4" +- Adapt the file: configure.in to your own MPI wrappers and paths (examples on the configure.in folder) - Execute: - ./configure -x -f=configure-marenostrum-mpi.txt nastin parall + ./configure -x nastin parall make diff --git a/alya/ALYA_Run_README.txt b/alya/ALYA_Run_README.txt index 20c2ae7deb9ef34f1a60273f9fa4e7283317d1ed..ee01b98e941d6fecf9730303cf9735ff78a22b5a 100644 --- a/alya/ALYA_Run_README.txt +++ b/alya/ALYA_Run_README.txt @@ -1,18 +1,12 @@ Data sets --------- -The parameters used in the datasets try to represent at best typical industrial runs in order to obtain representative speedups. For example, the iterative solvers -are never converged to machine accuracy, as the system solution is inside a non-linear loop. - -The datasets represent the solution of the cavity flow at Re=100. A small mesh of 10M elements should be used for Tier-1 supercomputers while a 30M element mesh -is specifically designed to run on Tier-0 supercomputers. -However, the number of elements can be multiplied by using the mesh multiplication option in the file *.ker.dat (DIVISION=0,2,3...). The mesh multiplication is -carried out in parallel and the numebr of elements is multiplied by 8 at each of these levels. "0" means no mesh multiplication. +The parameters used in the datasets try to represent at best typical industrial runs in order to obtain representative speedups. For example, the iterative solvers are never converged to machine accuracy, but only as a percentage of the initial residual. The different datasets are: -cavity10_tetra ... 10M tetrahedra mesh -cavity30_tetra ... 30M tetrahedra mesh +SPHERE_16.7M ... 16.7M sphere mesh +SPHERE_132M .... 
132M sphere mesh How to execute Alya with a given dataset ---------------------------------------- @@ -20,30 +14,44 @@ How to execute Alya with a given dataset In order to run ALYA, you need at least the following input files per execution: X.dom.dat -X.typ.dat -X.geo.dat -X.bcs.dat -X.inflow_profile.bcs X.ker.dat X.nsi.dat X.dat -In our case, there are 2 different inputs, so X={cavity10_tetra,cavity30_tetra} -To execute a simulation, you must be inside the input directory and you should submit a job like: +In our case X=sphere -mpirun Alya.x cavity10_tetra -or -mpirun Alya.x cavity30_tetra +To execute a simulation, you must be inside the input directory and you should submit a job like: +mpirun Alya.x sphere How to measure the speedup -------------------------- -1. Edit the fensap.nsi.cvg file -2. You will see ten rows, each one corresponds to one simulation timestep -3. Go to the second row, it starts with a number 2 -4. Get the last number of this row, that corresponds to the elapsed CPU time of this timestep -5. Use this value in order to measure the speedup +There are many ways to compute the scalability of Nastin module. + +1. For the complete cycle including: element assembly + boundary assembly + subgrid scale assembly + solvers, etc. + +2. For single kernels: element assembly, boundary assembly, subgrid scale assembly, solvers + +3. Using overall times + + +1. In *.nsi.cvg file, column "30. Elapsed CPU time" + + +2. Single kernels. Here, average and maximum times are indicated in *.nsi.cvg at each iteration of each time step: + +Element assembly: 19. Ass. ave cpu time 20. Ass. max cpu time + +Boundary assembly: 33. Bou. ave cpu time 34. Bou. max cpu time + +Subgrid scale assembly: 31. SGS ave cpu time 32. SGS max cpu time + +Iterative solvers: 21. Sol. ave cpu time 22. Sol. max cpu time + +Note that in the case of using Runge-Kutta time integration (the case of the sphere), the element and boundary assembly times are this of the last assembly of current time step (out of three for third order). + +3. At the end of *.log file, total timings are shown for all modules. In this case we use the first value of the NASTIN MODULE. Contact ------- diff --git a/alya/README_ACC.md b/alya/README_ACC.md index 868598aeca27586d2c57c94c928dadd81744584f..d6ad50784bf4f3baf0269ec8c1c784f123162962 100644 --- a/alya/README_ACC.md +++ b/alya/README_ACC.md @@ -160,29 +160,6 @@ Alya can be used with just MPI or hybrid MPI-OpenMP parallelism. Standard execut make -j num_processors ``` -### KNL Usage - - - Extract the small one node test case. - -```shell - $ tar xvf cavity1_hexa_med.tar.bz2 && cd cavity1_hexa_med - $ cp ../Alya/Thirdparties/ninja/GPUconfig.dat . -``` - - - Edit the job script to submit the calculation to the batch system. - -```shell - job.sh: Modify the path where you have your Alya.x (compiled with MPI options) - sbatch job.sh -``` - Alternatively execute directly: - -```shell -OMP_NUM_THREADS=4 mpirun -np 16 Alya.x cavity1_hexa -``` - - - ## Remarks diff --git a/code_saturne/CS_4.2.2_FOR_GPUs/Code_Saturne_on_GPUs.txt b/code_saturne/CS_4.2.2_FOR_GPUs/Code_Saturne_on_GPUs.txt deleted file mode 100644 index 9f46a1b79800cc785dd9e834bb7839114dfb3e67..0000000000000000000000000000000000000000 --- a/code_saturne/CS_4.2.2_FOR_GPUs/Code_Saturne_on_GPUs.txt +++ /dev/null @@ -1,95 +0,0 @@ - -************************************************************************************ -Code_Saturne 4.2.2 is linked to PETSC developer's version, in order to benefit from -its GPU implementation. 
Note that the normal release of PETSC does not support GPU. -************************************************************************************ -Installation -************************************************************************************ - -The version has been tested for K80s, and with the following settings:- - --OPENMPI 2.0.2 --GCC 4.8.5 --CUDA 7.5 - -To install Code_Saturne 4.2.2, 4 libraries are required, BLAS, LAPACK, SOWING and CUSP. -The tests have been carried out with lapack-3.6.1 for BLAS and LAPACK, sowing-1.1.23-p1 -for SOWING and cusplibrary-0.5.1 for CUSP. - -PETSC is first installed, and PATH_TO_PETSC, PATH_TO_CUSP, PATH_TO_SOWING, PATH_TO_LAPACK -have to be updated in INSTALL_PETSC_GPU_sm37 under petsc-petsc-a31f61e8abd0 -PETSC is configured for K80s, ./INSTALL_PETSC_GPU_sm37 is used from petsc-petsc-a31f61e8abd0 -It is finally compiled and installed, by typing make and make install. - -Before installing Code_Saturne, adapt PATH_TO_PETSC in InstallHPC.sh under SATURNE_4.2.2 -and type ./InstallHPC.sh - -The code should be installed and code_saturne be found under: - -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne, which should return: - -Usage: ./code_saturne - -Topics: - help - autovnv - bdiff - bdump - compile - config - create - gui - info - run - salome - submit - -Options: - -h, --help show this help message and exit - -************************************************************************************ -Test case - Cavity 13M -************************************************************************************ - -In CAVITY_13M.tar.gz are found the mesh+partitions and 2 sets of subroutines, one for CPU and the -second one for GPU, i.e.: - -CAVITY_13M/PETSC_CPU/SRC/* -CAVITY_13M/PETSC_GPU/SRC/* -CAVITY_13M/MESH/mesh_input_13M -CAVITY_13M/MESH/partition_METIS_5.1.0/* - -To prepare a run, it is required to set up a "study" with 2 directories, one for CPU and the other one for GPU -as, for instance: - -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --study NEW_CAVITY_13M PETSC_CPU -cd NEW_CAVITY_13M -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --case PETSC_GPU - -The mesh has to be copied from CAVITY_13M/MESH/mesh_input_13M into NEW_CAVITY_13M/MESH/. -And the same has to be done for partition_METIS_5.1.0. - -The subroutines contained in CAVITY_13M/PETSC_CPU/SRC should be copied into NEW_CAVITY_13M/PETSC_CPU/SRC and -the subroutines contained in CAVITY_13M/PETSC_GPU/SRC should be copied into NEW_CAVITY_13M/PETSC_GPU/SRC. - -In each DATA subdirectory of NEW_CAVITY_13M/PETSC_CPU and NEW_CAVITY_13M/PETSC_GPU, the path -to the mesh+partition has to be set as: - -cd DATA -cp REFERENCE/cs_user_scripts.py . -edit cs_user_scripts.py -At line 138, change None to "../MESH/mesh_input_13M" -At line 139, change None to "../MESH/partition_METIS_5.1.0" - -At this stage, everything is set to run both simulations, one for the CPU and the other one for the GPU. 
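As an illustration only, the two `cs_user_scripts.py` edits described above could also be applied non-interactively from each DATA directory (this assumes lines 138 and 139 still contain the value `None`, as stated; otherwise simply edit the file by hand):

```shell
# Sketch: apply the documented mesh/partition settings to cs_user_scripts.py.
# Line numbers and replacement paths are the ones quoted in the instructions above.
sed -i '138s|None|"../MESH/mesh_input_13M"|' cs_user_scripts.py
sed -i '139s|None|"../MESH/partition_METIS_5.1.0"|' cs_user_scripts.py
```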
- -cd NEW_CAVITY_13M/PETSC_CPU -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -submit the job - -cd NEW_CAVITY_13M/PETSC_GPU -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -submit the job - diff --git a/code_saturne/CS_4.2.2_FOR_KNLs/Code_Saturne_on_KNLs.txt b/code_saturne/CS_4.2.2_FOR_KNLs/Code_Saturne_on_KNLs.txt deleted file mode 100644 index e0cabbf19f199242d5759d421da62eb50f4e6a86..0000000000000000000000000000000000000000 --- a/code_saturne/CS_4.2.2_FOR_KNLs/Code_Saturne_on_KNLs.txt +++ /dev/null @@ -1,103 +0,0 @@ - -************************************************************************************ -Code_Saturne 4.2.2 is installed for KNLs. It is also linked to PETSC, but the -default linear solvers are the native ones. -************************************************************************************ -Installation -************************************************************************************ - -The installation script is: - -SATURNE_4.2.2/InstallHPC_with_PETSc.sh - -The path to PETSC (official released version, and therefore assumed to be installed -on the machine) should be added to the aforementioned script. - -After typing ./InstallHPC_with_PETSc.sh the code should be installed and code_saturne be found under: - -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne, which should return: - -Usage: ./code_saturne - -Topics: - help - autovnv - bdiff - bdump - compile - config - create - gui - info - run - salome - submit - -Options: - -h, --help show this help message and exit - -************************************************************************************ -Two cases are dealt with, TGV_256_CS_OPENMP.tar.gz to test the native solvers, and -CAVITY_13M_FOR_KNLs_WITH_PETSC.tar.gz to test Code_Saturne and PETSC on KNLs -************************************************************************************ -First test case: TGV_256_CS_OPENMP.tar.gz -************************************************************************************ - -In TGV_256_CS_OPENMP.tar.gz are found the mesh and the set of subroutines for Code_Saturne on KNLs, i.e.: - -TGV_256_CS_OPENMP/MESH/mesh_input_256by256by256 -TGV_256_CS_OPENMP/ARCHER_KNL/SRC/* - -To prepare a run, it is required to set up a "study" as, for instance: - -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --study NEW_TGV_256_CS_OPENMP KNL - -The mesh has to be copied from TGV_256_CS_OPENMP/MESH/mesh_input_256by256by256 into NEW_TGV_256_CS_OPENMP/MESH/. - -The subroutines contained in TGV_256_CS_OPENMP/KNL/SRC should be copied into NEW_TGV_256_CS_OPENMP/KNL/SRC - -In the DATA subdirectory of NEW_TGV_256_CS_OPENMP/KNL the path to the mesh has to be set as: - -cd DATA -cp REFERENCE/cs_user_scripts.py . 
-edit cs_user_scripts.py -At line 138, change None to "../MESH/mesh_input_256by256by256" - -At this stage, everything is set to run the simulation: - -cd TGV_256_CS_OPENMP/KNL -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -submit the job - -************************************************************************************ -Second test case: CAVITY_13M_FOR_KNLs_WITH_PETSC.tar.gz -************************************************************************************ - -In CAVITY_13M.tar.gz are found the mesh and the set of subroutines for PETSC and KNLs, i.e.: - -CAVITY_13M/MESH/mesh_input -CAVITY_13M/KNL/SRC/* - -To prepare a run, it is required to set up a "study" as, for instance: - -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --study NEW_CAVITY_13M PETSC_KNL - -The mesh has to be copied from CAVITY_13M/MESH/mesh_input into NEW_CAVITY_13M/MESH/. - -The subroutines contained in CAVITY_13M/KNL/SRC should be copied into NEW_CAVITY_13M/PETSC_KNL/SRC - -In the DATA subdirectory of NEW_CAVITY_13M/PETSC_KNL the path to the mesh has to be set as: - -cd DATA -cp REFERENCE/cs_user_scripts.py . -edit cs_user_scripts.py -At line 138, change None to "../MESH/mesh_input" - -At this stage, everything is set to run the simulation: - -cd NEW_CAVITY_13M/PETSC_KNL -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -submit the job - diff --git a/code_saturne/Code_Saturne_Build_Run_4.0.6.pdf b/code_saturne/Code_Saturne_Build_Run_4.0.6.pdf deleted file mode 100644 index 4b2839f17a9751dfc228c16ccf8658767209b8a3..0000000000000000000000000000000000000000 Binary files a/code_saturne/Code_Saturne_Build_Run_4.0.6.pdf and /dev/null differ diff --git a/code_saturne/Code_Saturne_Build_Run_5.3_UEABS.pdf b/code_saturne/Code_Saturne_Build_Run_5.3_UEABS.pdf new file mode 100644 index 0000000000000000000000000000000000000000..3bc660dbe2357818a822bd42561e7459a8421537 Binary files /dev/null and b/code_saturne/Code_Saturne_Build_Run_5.3_UEABS.pdf differ diff --git a/code_saturne/README.md b/code_saturne/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3f1437b5728b39d0052db6b289db40d45ffb74d7 --- /dev/null +++ b/code_saturne/README.md @@ -0,0 +1,17 @@ +# Code_Saturne + +Code_Saturne is open-source multi-purpose CFD software, primarily developed by EDF R&D and maintained by them. It relies on the Finite Volume method and a collocated arrangement of unknowns to solve the Navier-Stokes equations, for incompressible or compressible flows, laminar or turbulent flows and non-Newtonian and Newtonian fluids. A highly parallel coupling library (Parallel Locator Exchange - PLE) is also available in the distribution to account for other physics, such as conjugate heat transfer and structure mechanics. For the incompressible solver, the pressure is solved using an integrated Algebraic Multi-Grid algorithm and the scalars are computed by conjugate gradient methods or Gauss-Seidel/Jacobi. + +The original version of the code is written in C for pre-postprocessing, IO handling, parallelisation handling, linear solvers and gradient computation, and Fortran 95 for most of the physics implementation. MPI is used on distributed memory machines and OpenMP pragmas have been added to the most costly parts of the code to handle potential shared memory. 
The version used in this work (also freely available) relies on CUDA to take advantage of potential GPU acceleration. + +The equations are solved iteratively using time-marching algorithms, and most of the time spent during a time step is usually due to the computation of the velocity-pressure coupling, for simple physics. For this reason, the two test cases ([CS_5.3_PRACE_UEABS_CAVITY_13M.tar.gz](https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_13M.tar.gz) and [CS_5.3_PRACE_UEABS_CAVITY_111M.tar.gz](https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_111M.tar.gz)) chosen for the benchmark suite have been designed to assess the velocity-pressure coupling computation, and rely on the same configuration, with a mesh 8 times larger for CAVITY_111M than for CAVITY_13M, the time step being halved to ensure a correct Courant number. + +## Building and running the code +Building and running the code is described in [Code_Saturne_Build_Run_5.3_UEABS.pdf](Code_Saturne_Build_Run_5.3_UEABS.pdf) + +## The test cases are to be found under: +https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_111M.tar.gz +https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_13M.tar.gz + +## The distribution is to be found under: +https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS.tar.gz diff --git a/code_saturne/README_ACC.md b/code_saturne/README_ACC.md deleted file mode 100644 index 27911369d3166fc03e8dd3e78015c7d47317ef17..0000000000000000000000000000000000000000 --- a/code_saturne/README_ACC.md +++ /dev/null @@ -1,243 +0,0 @@ -# Code_Saturn - -## GPU Version - -Code_Saturne 4.2.2 is linked to PETSC developer's version, in order to benefit from -its GPU implementation. Note that the normal release of PETSC does not support GPU. - -### Installation - -The version has been tested for K80s, and with the following settings: - - * OPENMPI 2.0.2 - * GCC 4.8.5 - * CUDA 7.5 - -To install Code_Saturne 4.2.2, 4 libraries are required, BLAS, LAPACK, SOWING and CUSP. -The tests have been carried out with lapack-3.6.1 for BLAS and LAPACK, sowing-1.1.23-p1 -for SOWING and cusplibrary-0.5.1 for CUSP. - -PETSC is first installed, and `PATH_TO_PETSC`, `PATH_TO_CUSP`, `PATH_TO_SOWING`, `PATH_TO_LAPACK` -have to be updated in `INSTALL_PETSC_GPU_sm37` under `petsc-petsc-a31f61e8abd0` -PETSC is configured for K80s, `./INSTALL_PETSC_GPU_sm37` is used from `petsc-petsc-a31f61e8abd0` -It is finally compiled and installed, by typing `make` and `make install`.
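A minimal sketch of that PETSc build sequence, assuming the `PATH_TO_*` placeholders have first been set by hand inside `INSTALL_PETSC_GPU_sm37` as stated above (the actual configure options live in that script):

```shell
# Sketch of the GPU-enabled PETSc installation, following the order given above.
cd petsc-petsc-a31f61e8abd0
./INSTALL_PETSC_GPU_sm37   # configure step, targets K80s (sm_37)
make
make install
```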
- -Before installing Code\_Saturne, adapt `PATH_TO_PETSC` in `InstallHPC.sh` under `SATURNE_4.2.2` -and type `./InstallHPC.sh` - -The code should be installed and code_saturne be found under: - -``` -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne -``` - -And should return: - -``` -Usage: ./code_saturne - -Topics: - help - autovnv - bdiff - bdump - compile - config - create - gui - info - run - salome - submit - -Options: - -h, --help show this help message and exit -``` - -### Test case - Cavity 13M - -In `CAVITY_13M.tar.gz` are found the mesh+partitions and 2 sets of subroutines, one for CPU and the -second one for GPU, i.e.: - -``` -CAVITY_13M/PETSC_CPU/SRC/* -CAVITY_13M/PETSC_GPU/SRC/* -CAVITY_13M/MESH/mesh_input_13M -CAVITY_13M/MESH/partition_METIS_5.1.0/* -``` - -To prepare a run, it is required to set up a "study" with 2 directories, one for CPU and the other one for GPU -as, for instance: - -``` -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --study NEW_CAVITY_13M PETSC_CPU -cd NEW_CAVITY_13M -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --case PETSC_GPU -``` - -The mesh has to be copied from `CAVITY_13M/MESH/mesh_input_13M` into `NEW_CAVITY_13M/MESH/`. -And the same has to be done for `partition_METIS_5.1.0`. - -The subroutines contained in `CAVITY_13M/PETSC_CPU/SRC` should be copied into `NEW_CAVITY_13M/PETSC_CPU/SRC` and -the subroutines contained in `CAVITY_13M/PETSC_GPU/SRC` should be copied into `NEW_CAVITY_13M/PETSC_GPU/SRC`. - -In each DATA subdirectory of `NEW_CAVITY_13M/PETSC_CPU` and `NEW_CAVITY_13M/PETSC_GPU`, the path -to the mesh+partition has to be set as: - -``` -cd DATA -cp REFERENCE/cs_user_scripts.py . -``` - -edit cs_user_scripts.py: - - * At line 138, change `None` to `../MESH/mesh_input_13M` - * At line 139, change `None` to `../MESH/partition_METIS_5.1.0` - -At this stage, everything is set to run both simulations, one for the CPU and the other one for the GPU. - -``` -cd NEW_CAVITY_13M/PETSC_CPU -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -``` - -Then submit the job. - - -``` -cd NEW_CAVITY_13M/PETSC_GPU -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -``` - -submit the job - -## KNL Version - -Code_Saturne 4.2.2 is installed for KNLs. It is also linked to PETSC, but the -default linear solvers are the native ones. - -### Installation - -The installation script is: - -``` -SATURNE_4.2.2/InstallHPC_with_PETSc.sh -``` - -The path to PETSC (official released version, and therefore assumed to be installed -on the machine) should be added to the aforementioned script. 
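For illustration, and assuming the script uses the same `PATH_TO_PETSC` placeholder as the GPU instructions (the actual variable name inside `InstallHPC_with_PETSc.sh` should be checked), this step could look like:

```shell
# Sketch only: point the installer at an existing PETSc installation, then run it.
# /path/to/petsc is a placeholder; the PATH_TO_PETSC variable name is assumed, not verified.
sed -i 's|^PATH_TO_PETSC=.*|PATH_TO_PETSC=/path/to/petsc|' SATURNE_4.2.2/InstallHPC_with_PETSc.sh
cd SATURNE_4.2.2 && ./InstallHPC_with_PETSc.sh
```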
- -After typing `./InstallHPC_with_PETSc.sh` the code should be installed and code_saturne be found under: - -``` -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne -``` - -And should return: - -``` -Usage: ./code_saturne - -Topics: - help - autovnv - bdiff - bdump - compile - config - create - gui - info - run - salome - submit - -Options: - -h, --help show this help message and exit -``` - -### Running the code - -Two cases are dealt with, `TGV_256_CS_OPENMP.tar.gz` to test the native solvers, and -`CAVITY_13M_FOR_KNLs_WITH_PETSC.tar.gz` to test Code_Saturne and PETSC on KNLs - -#### First test case: TGV_256_CS_OPENMP - -In `TGV_256_CS_OPENMP.tar.gz` are found the mesh and the set of subroutines for Code_Saturne on KNLs, i.e.: - -``` -TGV_256_CS_OPENMP/MESH/mesh_input_256by256by256 -TGV_256_CS_OPENMP/ARCHER_KNL/SRC/* -``` - -To prepare a run, it is required to set up a "study" as, for instance: - -``` -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --study NEW_TGV_256_CS_OPENMP KNL -``` - -The mesh has to be copied from `TGV_256_CS_OPENMP/MESH/mesh_input_256by256by256` into `NEW_TGV_256_CS_OPENMP/MESH/`. - -The subroutines contained in `TGV_256_CS_OPENMP/KNL/SRC` should be copied into `NEW_TGV_256_CS_OPENMP/KNL/SRC`. - -In the `DATA` subdirectory of `NEW_TGV_256_CS_OPENMP/KNL` the path to the mesh has to be set as: - -``` -cd DATA -cp REFERENCE/cs_user_scripts.py . -``` - -Then edit `cs_user_scripts.py`: -At line 138, change `None` to `../MESH/mesh_input_256by256by256` - -At this stage, everything is set to run the simulation: - -``` -cd TGV_256_CS_OPENMP/KNL -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -``` - -And then submit the job. - -#### Second test case: CAVITY_13M_FOR_KNLs_WITH_PETSC - -In `CAVITY_13M.tar.gz` are found the mesh and the set of subroutines for PETSC and KNLs, i.e.: - -``` -CAVITY_13M/MESH/mesh_input -CAVITY_13M/KNL/SRC/* -``` - -To prepare a run, it is required to set up a "study" as, for instance: - -``` -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne create --study NEW_CAVITY_13M PETSC_KNL -``` - -The mesh has to be copied from `CAVITY_13M/MESH/mesh_input` into `NEW_CAVITY_13M/MESH/`. - -The subroutines contained in `CAVITY_13M/KNL/SRC` should be copied into `NEW_CAVITY_13M/PETSC_KNL/SRC`. - -In the `DATA` subdirectory of `NEW_CAVITY_13M/PETSC_KNL` the path to the mesh has to be set as: - -``` -cd DATA -cp REFERENCE/cs_user_scripts.py . 
-``` - -Then edit `cs_user_scripts.py` -At line 138, change `None` to `../MESH/mesh_input` - -At this stage, everything is set to run the simulation: - -``` -cd NEW_CAVITY_13M/PETSC_KNL -PATH_TO_CODE_SATURNE/SATURNE_4.2.2/code_saturne-4.2.2/arch/Linux/bin/code_saturne run --initialize -cd RESU/YYYYMMDD-HHMM -``` - -Finaly submit the job - diff --git a/doc/.gitignore b/doc/.gitignore deleted file mode 100644 index 567609b1234a9b8806c5a05da6c866e480aa148d..0000000000000000000000000000000000000000 --- a/doc/.gitignore +++ /dev/null @@ -1 +0,0 @@ -build/ diff --git a/doc/d7.5_4IP_1.0.docx b/doc/d7.5_4IP_1.0.docx deleted file mode 100644 index 1c4d646334c0bdcb7836253c0d9aed897b3d252a..0000000000000000000000000000000000000000 Binary files a/doc/d7.5_4IP_1.0.docx and /dev/null differ diff --git a/doc/docx2latex/d7.5_4IP_1.0.csv b/doc/docx2latex/d7.5_4IP_1.0.csv deleted file mode 100644 index 53476fc9e8eea51994577526369247941bb6c394..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.csv +++ /dev/null @@ -1,16 +0,0 @@ -Caption; \Caption{ ; } -CommentReference; ; -Emphasis; ; -Heading10; ; -Heading2; ; -Heading3; ; -Heading4; ; -Hyperlink; ; -ListParagraph; ; -NormalPRACE; ; -PageNumber; ; -TableofFigures; ; -Title; ; -TOC1; ; -TOC2; ; -TOC3; ; diff --git a/doc/docx2latex/d7.5_4IP_1.0.d2t.log b/doc/docx2latex/d7.5_4IP_1.0.d2t.log deleted file mode 100644 index e2815fcc6da0690a76f75813ed209e487fec22a1..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.d2t.log +++ /dev/null @@ -1,48 +0,0 @@ -cp: '../CodeVault/ueabs_accelerator/doc/d7.5_4IP_1.0.docx' and '/home/cameo/git/CodeVault/ueabs_accelerator/doc/d7.5_4IP_1.0.docx' are the same file -Message: Mode: insert-xpath -Message: Mode: docx2hub:add-props -Message: Mode: docx2hub:props2atts -Message: Mode: docx2hub:remove-redundant-run-atts -Message: Mode: docx2hub:join-instrText-runs -Message: Mode: docx2hub:field-functions -Message: Mode: wml-to-dbk -INFO : xproc-util/xslt-mode/xpl/xslt-mode.xpl:100:27:## Error Code: W2D_020 -INFO : xproc-util/xslt-mode/xpl/xslt-mode.xpl:100:27:## Error Msg: W2D_020: "Unbekanntes Element im Mode 'wml-to-dbk': 'Element: a:srcRect Parent: pic:blipFill'" -INFO : xproc-util/xslt-mode/xpl/xslt-mode.xpl:100:27: -INT -wml-to-dbk -Element: a:srcRect Parent: pic:blipFill - -INFO : xproc-util/xslt-mode/xpl/xslt-mode.xpl:100:27:## Mode: wml-to-dbk -INFO : xproc-util/xslt-mode/xpl/xslt-mode.xpl:100:27:## Level: INT -INFO : xproc-util/xslt-mode/xpl/xslt-mode.xpl:100:27:## XPath: -INFO : xproc-util/xslt-mode/xpl/xslt-mode.xpl:100:27:## Info: Element: a:srcRect Parent: pic:blipFill -Message: Mode: docx2hub:join-runs -ERROR: xproc-util/load/xpl/load.xpl:0:load-error:Could not load file:/home/cameo/git/docx2tex/conf/conf.csv (file:///home/cameo/git/docx2tex/xproc-util/load/xpl/load.xpl) dtd-validate=false -Message: Mode: hub:twipsify-lengths -Message: Mode: hub:split-at-tab -Message: Mode: hub:identifiers -Message: Mode: hub:tabs-to-indent -Message: Mode: hub:handle-indent -Message: Mode: hub:prepare-lists -Message: Mode: hub:lists -Message: Mode: hub:postprocess-lists -Message: Mode: docx2tex-preprocess -Message: Mode: docx2tex-postprocess -Message: Mode: mml2tex-grouping -Message: Mode: mml2tex-preprocess -WARN : err:SXXP0005:The source document is in namespace http://docbook.org/ns/docbook, but none of the template rules match elements in this namespace (Use --suppressXsltNamespaceCheck:on to avoid this warning) -Message: Mode: escape-bad-chars -Message: Mode: apply-regex -Message: 
Mode: replace-chars -Message: Mode: dissolve-pi -Message: Mode: apply-xpath -Message: Mode: clean diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2hub-start.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2hub-start.txt deleted file mode 100644 index a35d6ee7315fb971faaf27be96908a0889e67cbc..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2hub-start.txt +++ /dev/null @@ -1 +0,0 @@ -Starting DOCX to flat Hub XML conversion diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2hub-success.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2hub-success.txt deleted file mode 100644 index 0718fdf8eb557fb251549aa57be7775cf83e073c..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2hub-success.txt +++ /dev/null @@ -1 +0,0 @@ -Successfully finished DOCX to flat Hub XML conversion diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-docx2hub.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-docx2hub.txt deleted file mode 100644 index c00fc63a81b09bd87a87801624b426d0375b115d..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-docx2hub.txt +++ /dev/null @@ -1 +0,0 @@ -Conversion from DOCX to Hub XML finished diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-evolve-hub-init.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-evolve-hub-init.txt deleted file mode 100644 index fa44ee9e783a4094bc73300bda17a81b00b8bfe1..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-evolve-hub-init.txt +++ /dev/null @@ -1 +0,0 @@ -Init XML normalization diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-finished.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-finished.txt deleted file mode 100644 index 5fd33e90d8f3d4e5fe03f9bea352606844f9cbfe..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-finished.txt +++ /dev/null @@ -1 +0,0 @@ -docx2tex conversion finished. 
diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-start.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-start.txt deleted file mode 100644 index 30b4bd4c6598bc8a23a1f62975efd961ec1ee9df..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_docx2tex-start.txt +++ /dev/null @@ -1 +0,0 @@ -Start docx2tex diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_error-pi2svrl-start.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_error-pi2svrl-start.txt deleted file mode 100644 index 9c4ad48c725d407b3f724bd651f1373c9ac47660..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_error-pi2svrl-start.txt +++ /dev/null @@ -1 +0,0 @@ -Creating SVRL documents from error/warning processing instructions diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_error-pi2svrl-success.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_error-pi2svrl-success.txt deleted file mode 100644 index 7cbdb4357c73e58f5ab770efbc45a31d7b819e6a..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_error-pi2svrl-success.txt +++ /dev/null @@ -1 +0,0 @@ -Successfully created SVRL documents from error/warning processing instructions diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-mml.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-mml.txt deleted file mode 100644 index 6064a64b8dfc4109d239bf5c2b21ebc7d0d55e81..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-mml.txt +++ /dev/null @@ -1 +0,0 @@ -Convert OMML equations to TeX diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-tables.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-tables.txt deleted file mode 100644 index b15e54df718605be226d0bb239382f143972bef6..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-tables.txt +++ /dev/null @@ -1 +0,0 @@ -Convert CALS tables to TeX tabular diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-xml.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-xml.txt deleted file mode 100644 index 17bffc02347b20e9b58e8f1214651ed2146fabc9..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-convert-xml.txt +++ /dev/null @@ -1 +0,0 @@ -Apply xml2tex configuration and convert XML to LaTeX diff --git a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-validate-config.txt b/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-validate-config.txt deleted file mode 100644 index dde57efdaaa9d56312600f8f39fc3f34fa1e2e38..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.debug/status/CBdd59b1c2604e3b21d29ee4f7794180f2_xml2tex-validate-config.txt +++ /dev/null @@ -1 +0,0 @@ -Validation of 
xml2tex configuration successfull diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/[Content_Types].xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/[Content_Types].xml deleted file mode 100644 index fbc60629f13c3c9dbaf7a6c13b2143b228ea4c38..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/[Content_Types].xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/_rels/.rels b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/_rels/.rels deleted file mode 100644 index 9cf3a5c1d49377f8a274fdf7d69fad7183297e6d..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/_rels/.rels +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/_rels/item1.xml.rels b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/_rels/item1.xml.rels deleted file mode 100644 index a9c831d4cbb816f4463e0261478495a5f4ba0751..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/_rels/item1.xml.rels +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/item1.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/item1.xml deleted file mode 100644 index 627b86b332b3b421a915533733c1a055414508dc..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/item1.xml +++ /dev/null @@ -1 +0,0 @@ - diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/itemProps1.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/itemProps1.xml deleted file mode 100644 index f37b97f711e00ebf0cc84425be3e35e3bb02b57a..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/customXml/itemProps1.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/docProps/app.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/docProps/app.xml deleted file mode 100644 index 02414a13b3a7bbf09c626a33c4cca33fe8397bdb..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/docProps/app.xml +++ /dev/null @@ -1,2 +0,0 @@ - -750541396879621Microsoft Macintosh Word0663186falseTitle1New TemplateHomefalse93403false176954812005http://www.prace-project.eu/170398710705_Toc19446707917695482405http://www.prace-project.eu/1769548905http://www.prace-project.eu/false15.0000 \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/docProps/core.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/docProps/core.xml deleted file mode 100644 index 3b6c39b328042738ebea8a634cf56ebf3807c7dc..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/docProps/core.xml +++ /dev/null @@ -1,2 +0,0 @@ - -New TemplateDietmar ErwinVictor Cameo4752017-03-10T17:05:00Z2017-03-10T19:54:00Z2017-03-27T09:52:00Z \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/_rels/document.xml.rels b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/_rels/document.xml.rels deleted file mode 100644 index 74e196950eec4c8a400b59b3c6f426083dc430ea..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/_rels/document.xml.rels +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/comments.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/comments.xml deleted file mode 100644 index 
bd8472a636fc1428f590d0b83fd648a1c05aeb7c..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/comments.xml +++ /dev/null @@ -1,2 +0,0 @@ - -Faire un tableau récap \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/commentsExtended.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/commentsExtended.xml deleted file mode 100644 index 0f1c28764921be200aceebc87d5f1d6074443f58..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/commentsExtended.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/document.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/document.xml deleted file mode 100644 index 878d18a2520bd97b145cc1d5c6ef542f7d3bab90..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/document.xml +++ /dev/null @@ -1,2 +0,0 @@ - -E-InfrastructuresH2020-EINFRA-2014-2015EINFRA-4-2014: Pan-European High Performance ComputingInfrastructure and ServicesPRACE-4IPPRACE Fourth Implementation Phase ProjectGrant Agreement Number: EINFRA-653838D7.5Application performance on acceleratorsFinal Version: 1.0Author(s): Victor Cameo Ponz, CINESDate:24.03.2016Project and Deliverable Information SheetPRACE ProjectProject Ref. №: REF ReferenceNo \h \* MERGEFORMAT EINFRA-653838Project Title: REF Title \h \* MERGEFORMAT PRACE Fourth Implementation Phase ProjectProject Web Site: http://www.prace-project.euDeliverable ID: < REF DeliverableNumber \* MERGEFORMAT D7.5>Deliverable Nature: <DOC_TYPE: Report / Other>Dissemination Level:PUContractual Date of Delivery:31 / 03 / 2017Actual Date of Delivery:DD / Month / YYYYEC Project Officer: Leonardo Flores Añover* - The dissemination level are indicated as follows: PU – Public, CO – Confidential, only for members of the consortium (including the Commission Services) CL – Classified, as referred to in Commission Decision 2991/844/EC.Document Control SheetDocumentTitle: REF DeliverableTitle \* MERGEFORMAT Application performance on acceleratorsID: REF DeliverableNumber \* MERGEFORMAT D7.5 Version: < REF Version \* MERGEFORMAT 1.0>Status: REF Status \* MERGEFORMAT FinalAvailable at: http://www.prace-project.euSoftware Tool: Microsoft Word 2010File(s): FILENAME d7.5_4IP_1.0.docxAuthorshipWritten by: REF Author Victor Cameo Ponz, CINESContributors:Adem Tekin, ITUAlan Grey, EPCCAndrew Emerson, CINECAAndrew Sunderland, STFCArno Proeme, EPCCCharles Moulinec, STFCDimitris Dellis, GRNETFiona Reid, EPCCGabriel Hautreux, INRIAJacob Finkenrath, CyIJames Clark, STFCJanko Strassburg, BSCJorge Rodriguez, BSCMartti Louhivuori, CSCPhilippe Segers, GENCIValeriu Codreanu, SURFSARAReviewed by:Filip Stanek, IT4IThomas Eickermann, FZJApproved by:MB/TBDocument Status SheetVersionDateStatusComments0.113/03/2017DraftFirst revision0.215/03/2017DraftInclude remark of the first review + new figures1.024/03/2017Final versionImproved the application performance sectionDocument Keywords Keywords:PRACE, HPC, Research Infrastructure, Accelerators, GPU, Xeon Phi, Benchmark suiteDisclaimerThis deliverable has been prepared by the responsible Work Package of the Project in accordance with the Consortium Agreement and the Grant Agreement n° REF ReferenceNo \* MERGEFORMAT EINFRA-653838. It solely reflects the opinion of the parties to such agreements on a collective basis in the context of the Project and to the extent foreseen in such agreements. 
Please note that even though all participants to the Project are members of PRACE AISBL, this deliverable has not been approved by the Council of PRACE AISBL and therefore does not emanate from it nor should it be considered to reflect PRACE AISBL’s individual opinion.Copyright notices 2016 PRACE Consortium Partners. All rights reserved. This document is a project document of the PRACE project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contract REF ReferenceNo \* MERGEFORMAT EINFRA-653838 for reviewing and dissemination purposes. All trademarks and other rights on third party products mentioned in this document are acknowledged as own by the respective holders.Table of Contents TOC \o "1-4" \t "Überschrift 2;1;Überschrift 3;2;Überschrift 4;3;Heading4;3;Heading2;1;Heading3;2" Project and Deliverable Information Sheet PAGEREF _Toc478378948 \h iDocument Control Sheet PAGEREF _Toc478378949 \h iDocument Status Sheet PAGEREF _Toc478378950 \h iiDocument Keywords PAGEREF _Toc478378951 \h iiiTable of Contents PAGEREF _Toc478378952 \h ivList of Figures PAGEREF _Toc478378953 \h vList of Tables PAGEREF _Toc478378954 \h viReferences and Applicable Documents PAGEREF _Toc478378955 \h viList of Acronyms and Abbreviations PAGEREF _Toc478378956 \h viiList of Project Partner Acronyms PAGEREF _Toc478378957 \h ixExecutive Summary PAGEREF _Toc478378958 \h 111Introduction PAGEREF _Toc478378959 \h 112Targeted architectures PAGEREF _Toc478378960 \h 112.1Co-processor description PAGEREF _Toc478378961 \h 112.2Systems description PAGEREF _Toc478378962 \h 122.2.1Cartesius K40 PAGEREF _Toc478378963 \h 122.2.2MareNostrum KNC PAGEREF _Toc478378964 \h 132.2.3Ouessant P100 PAGEREF _Toc478378965 \h 132.2.4Frioul KNL PAGEREF _Toc478378966 \h 133Benchmark suite description PAGEREF _Toc478378967 \h 143.1Alya PAGEREF _Toc478378968 \h 143.1.1Code description PAGEREF _Toc478378969 \h 143.1.2Test cases description PAGEREF _Toc478378970 \h 153.2Code_Saturne PAGEREF _Toc478378971 \h 153.2.1Code description PAGEREF _Toc478378972 \h 153.2.2Test cases description PAGEREF _Toc478378973 \h 163.3CP2K PAGEREF _Toc478378974 \h 163.3.1Code description PAGEREF _Toc478378975 \h 163.3.2Test cases description PAGEREF _Toc478378976 \h 173.4GPAW PAGEREF _Toc478378977 \h 173.4.1Code description PAGEREF _Toc478378978 \h 173.4.2Test cases description PAGEREF _Toc478378979 \h 173.5GROMACS PAGEREF _Toc478378980 \h 183.5.1Code description PAGEREF _Toc478378981 \h 183.5.2Test cases description PAGEREF _Toc478378982 \h 183.6NAMD PAGEREF _Toc478378983 \h 193.6.1Code description PAGEREF _Toc478378984 \h 193.6.2Test cases description PAGEREF _Toc478378985 \h 193.7PFARM PAGEREF _Toc478378986 \h 203.7.1Code description PAGEREF _Toc478378987 \h 203.7.2Test cases description PAGEREF _Toc478378988 \h 203.8QCD PAGEREF _Toc478378989 \h 213.8.1Code description PAGEREF _Toc478378990 \h 213.8.2Test cases description PAGEREF _Toc478378991 \h 223.9Quantum Espresso PAGEREF _Toc478378992 \h 223.9.1Code description PAGEREF _Toc478378993 \h 223.9.2Test cases description PAGEREF _Toc478378994 \h 233.10Synthetic benchmarks – SHOC PAGEREF _Toc478378995 \h 233.10.1Code description PAGEREF _Toc478378996 \h 243.10.2Test cases description PAGEREF _Toc478378997 \h 243.11SPECFEM3D PAGEREF _Toc478378998 \h 243.11.1Test cases definition PAGEREF _Toc478378999 \h 254Applications performances PAGEREF _Toc478379000 \h 254.1Alya PAGEREF _Toc478379001 \h 
254.2Code_Saturne PAGEREF _Toc478379002 \h 294.3CP2K PAGEREF _Toc478379003 \h 314.4GPAW PAGEREF _Toc478379004 \h 324.5GROMACS PAGEREF _Toc478379005 \h 344.6NAMD PAGEREF _Toc478379006 \h 364.7PFARM PAGEREF _Toc478379007 \h 384.8QCD PAGEREF _Toc478379008 \h 404.8.1First implementation PAGEREF _Toc478379009 \h 404.8.2Second implementation PAGEREF _Toc478379010 \h 424.9Quantum Espresso PAGEREF _Toc478379011 \h 474.10Synthetic benchmarks (SHOC) PAGEREF _Toc478379012 \h 504.11SPECFEM3D PAGEREF _Toc478379013 \h 525Conclusion and future work PAGEREF _Toc478379014 \h 52List of Figures TOC \h \z \c "Figure" Figure 1 Shows the matrix construction part of Alya that is parallelised with OpenMP and benefits significantly from the many cores available on KNL. PAGEREF _Toc478379015 \h 27Figure 2 Demonstrates the scalability of the code. As expected Haswell cores with K80 GPU are high-performing while the KNL port is currently being optimized further. PAGEREF _Toc478379016 \h 28Figure 3 Best performance is achieved with GPU in combination with powerful CPU cores. Single thread performance has a big impact on the speedup, both threading and vectorization are employed for additional performance. PAGEREF _Toc478379017 \h 29Figure 4 Code_Saturne's performance on KNL. AMG is used as a solver in V4.2.2. PAGEREF _Toc478379018 \h 30Figure 5 Test case 1 of CP2K on the ARCHER cluster PAGEREF _Toc478379019 \h 32Figure 6 Relative performance (to / t) of GPAW is shown for parallel jobs using an increasing number of CPU (blue) or Xeon Phi KNC (red). Single CPU SCF-cycle runtime (to) was used as the baseline for the normalisation. Ideal scaling is shown as a linear dashed line for comparison. Case 1 (Carbon Nanotube) is shown with square markers and Case 2 (Copper Filament) is shown with round markers. PAGEREF _Toc478379020 \h 34Figure 7 Scalability for GROMACS test case GluCL Ion Channel PAGEREF _Toc478379021 \h 35Figure 8 Scalability for GROMACS test case Lignocellulose PAGEREF _Toc478379022 \h 36Figure 9 Scalability for NAMD test case STMV.8M PAGEREF _Toc478379023 \h 37Figure 10 Scalability for NAMD test case STMV.28M PAGEREF _Toc478379024 \h 37Figure 11 Eigensolver performance on KNL and GPU PAGEREF _Toc478379025 \h 38Figure 12 Small test case results for QCD, first implementation PAGEREF _Toc478379026 \h 40Figure 13 Large test case results for QCD, first implementation PAGEREF _Toc478379027 \h 41Figure 14 shows the time taken by the full MILC 64x64x64x8 test cases on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures. 
PAGEREF _Toc478379028 \h 42Figure 15 Result of second implementation of QCD on K40m GPU PAGEREF _Toc478379029 \h 43Figure 16 Result of second implementation of QCD on P100 GPU PAGEREF _Toc478379030 \h 44Figure 17 Result of second implementation of QCD on P100 GPU on larger test case PAGEREF _Toc478379031 \h 45Figure 18 Result of second implementation of QCD on KNC PAGEREF _Toc478379032 \h 46Figure 19 Result of second implementation of QCD on KNL PAGEREF _Toc478379033 \h 47Figure 20 Scalability of Quantum Espresso on GPU for test case 1 PAGEREF _Toc478379034 \h 48Figure 21 Scalability of Quantum Espresso on GPU for test case 2 PAGEREF _Toc478379035 \h 48Figure 22 Scalability of Quantum Espresso on KNL for test case 1 PAGEREF _Toc478379036 \h 49Figure 23 Quantum Espresso - KNL vs BDW vs BGQ (at scale) PAGEREF _Toc478379037 \h 50List of Tables TOC \c "Table" Table 1 Main co-processors specifications PAGEREF _Toc478379038 \h 12Table 2 Codes and corresponding APIs available (in green) PAGEREF _Toc478379039 \h 14Table 3 Performance of Code_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK. The speedup is computed at the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks. PAGEREF _Toc478379040 \h 31Table 4 Performance of Code_Saturne and PETSc on 1 node of KNL. PETSc is built on the MKL library PAGEREF _Toc478379041 \h 31Table 5 GPAW runtimes (in seconds) for the smaller benchmark (Carbon Nanotube) measured on several architectures when using n sockets (i.e. processors or accelerators). PAGEREF _Toc478379042 \h 33Table 6 GPAW runtimes (in seconds) for the larger benchmark (Copper Filament) measured on several architectures when using n sockets (i.e. processors or accelerators). *Due to memory limitations on the GPU the grid spacing was increased from 0.22 to 0.28 to have a sparser grid. To account for this in the comparison, the K40 and K80 runtimes have been scaled up using a corresponding CPU runtime as a yardstick (scaling factor q=2.1132). PAGEREF _Toc478379043 \h 33Table 7 Overall EXDIG runtime performance on various accelerators (runtime, secs) PAGEREF _Toc478379044 \h 39Table 8 Overall EXDIG runtime parallel performance using MPI-GPU version PAGEREF _Toc478379045 \h 39Table 9 Synthetic benchmarks results on GPU and Xeon Phi PAGEREF _Toc478379046 \h 52Table 10 SPECFEM 3D GLOBE results (run time in second) PAGEREF _Toc478379047 \h 52References and Applicable Documentshttp://www.prace-ri.eu The Unified European Application Benchmark Suite – http://www.prace-ri.eu/ueabs/D7.4 Unified European Applications Benchmark Suite – Mark Bull et al. – 2013http://www.nvidia.com/object/quadro-design-and-manufacturing.html HYPERLINK "https://userinfo.surfsara.nl/systems/cartesius/description" https://userinfo.surfsara.nl/systems/cartesius/descriptionMareNostrum III User’s Guide Barcelona Supercomputing Center – https://www.bsc.es/support/MareNostrum3-ug.pdf HYPERLINK "http://www.idris.fr/eng/ouessant/" http://www.idris.fr/eng/ouessant/PFARM reference – https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm_long_lug.pdfSolvent-Driven Preferential Association of Lignin with Regions of Crystalline Cellulose in Molecular Dynamics Simulation – Benjamin Lindner et al. 
[10] NAMD website – http://www.ks.uiuc.edu/Research/namd/
[11] SHOC source repository – https://github.com/vetter/shoc
[12] Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics – R. Babich, M. Clark and B. Joo – SC 10 (Supercomputing 2010)
[13] Lattice QCD on Intel Xeon Phi – B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III – International Supercomputing Conference (ISC'13), 2013
[14] Extension of fractional step techniques for incompressible flows: The preconditioned Orthomin(1) for the pressure Schur complement – G. Houzeaux, R. Aubry, and M. Vázquez – Computers & Fluids, 44:297-313, 2011
[15] MIMD Lattice Computation (MILC) Collaboration – http://physics.indiana.edu/~sg/milc.html
[16] targetDP – https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README
[17] QUDA: A library for QCD on GPU – https://lattice.github.io/quda/
[18] QPhiX, QCD for Intel Xeon Phi and Xeon processors – http://jeffersonlab.github.io/qphix/
[19] KNC MaxFlops issue (both SP and DP) – https://github.com/vetter/shoc/issues/37
[20] KNC SpMV issue – https://github.com/vetter/shoc/issues/24, https://github.com/vetter/shoc/issues/23

List of Acronyms and Abbreviations
aisbl – Association International Sans But Lucratif (legal form of the PRACE-RI)
BCO – Benchmark Code Owner
CoE – Center of Excellence
CPU – Central Processing Unit
CUDA – Compute Unified Device Architecture (NVIDIA)
DARPA – Defense Advanced Research Projects Agency
DEISA – Distributed European Infrastructure for Supercomputing Applications; EU project by leading national HPC centres
DoA – Description of Action (formerly known as DoW)
EC – European Commission
EESI – European Exascale Software Initiative
EoI – Expression of Interest
ESFRI – European Strategy Forum on Research Infrastructures
GB – Giga (= 2^30 ~ 10^9) Bytes (= 8 bits), also GByte
Gb/s – Giga (= 10^9) bits per second, also Gbit/s
GB/s – Giga (= 10^9) Bytes (= 8 bits) per second, also GByte/s
GÉANT – Collaboration between National Research and Education Networks to build a multi-gigabit pan-European network. The current EC-funded project as of 2015 is GN4.
GFlop/s – Giga (= 10^9) Floating point operations (usually in 64-bit, i.e. DP) per second, also GF/s
GHz – Giga (= 10^9) Hertz, frequency = 10^9 periods or clock cycles per second
GPU – Graphic Processing Unit
HET – High Performance Computing in Europe Taskforce. Taskforce by representatives from the European HPC community to shape the European HPC Research Infrastructure. Produced the scientific case and valuable groundwork for the PRACE project.
HMM – Hidden Markov Model
HPC – High Performance Computing; computing at a high performance level at any given time; often used as a synonym for Supercomputing
HPL – High Performance LINPACK
ISC – International Supercomputing Conference; European equivalent to the US-based SCxx conference. Held annually in Germany.
KB – Kilo (= 2^10 ~ 10^3) Bytes (= 8 bits), also KByte
LINPACK – Software library for Linear Algebra
MB – Management Board (highest decision-making body of the project)
MB – Mega (= 2^20 ~ 10^6) Bytes (= 8 bits), also MByte
MB/s – Mega (= 10^6) Bytes (= 8 bits) per second, also MByte/s
MFlop/s – Mega (= 10^6) Floating point operations (usually in 64-bit, i.e. DP) per second, also MF/s
MooC – Massively open online Course
MoU – Memorandum of Understanding
MPI – Message Passing Interface
NDA – Non-Disclosure Agreement. Typically signed between vendors and customers working together on products prior to their general availability or announcement.
PA – Preparatory Access (to PRACE resources)
PATC – PRACE Advanced Training Centres
PRACE – Partnership for Advanced Computing in Europe; Project Acronym
PRACE 2 – The upcoming next phase of the PRACE Research Infrastructure following the initial five year period
PRIDE – Project Information and Dissemination Event
RI – Research Infrastructure
TB – Technical Board (group of Work Package leaders)
TB – Tera (= 2^40 ~ 10^12) Bytes (= 8 bits), also TByte
TCO – Total Cost of Ownership. Includes recurring costs (e.g. personnel, power, cooling, maintenance) in addition to the purchase cost.
TDP – Thermal Design Power
TFlop/s – Tera (= 10^12) Floating-point operations (usually in 64-bit, i.e. DP) per second, also TF/s
Tier-0 – Denotes the apex of a conceptual pyramid of HPC systems. In this context the Supercomputing Research Infrastructure would host the Tier-0 systems; national or topical HPC centres would constitute Tier-1.
UNICORE – Uniform Interface to Computing Resources. Grid software for seamless access to distributed resources.

List of Project Partner Acronyms
BADW-LRZ – Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Germany (3rd Party to GCS)
BILKENT – Bilkent University, Turkey (3rd Party to UYBHM)
BSC – Barcelona Supercomputing Center - Centro Nacional de Supercomputacion, Spain
CaSToRC – Computation-based Science and Technology Research Center, Cyprus
CCSAS – Computing Centre of the Slovak Academy of Sciences, Slovakia
CEA – Commissariat à l'Energie Atomique et aux Energies Alternatives, France (3rd Party to GENCI)
CESGA – Fundacion Publica Gallega Centro Tecnológico de Supercomputación de Galicia, Spain (3rd Party to BSC)
CINECA – CINECA Consorzio Interuniversitario, Italy
CINES – Centre Informatique National de l'Enseignement Supérieur, France (3rd Party to GENCI)
CNRS – Centre National de la Recherche Scientifique, France (3rd Party to GENCI)
CSC – CSC Scientific Computing Ltd., Finland
CSIC – Spanish Council for Scientific Research (3rd Party to BSC)
CYFRONET – Academic Computing Centre CYFRONET AGH, Poland (3rd Party to PSNC)
EPCC – EPCC at The University of Edinburgh, UK
ETH Zurich (CSCS) – Eidgenössische Technische Hochschule Zürich – CSCS, Switzerland
FIS – Faculty of Information Studies, Slovenia (3rd Party to ULFME)
GCS – Gauss Centre for Supercomputing e.V.
GENCI – Grand Equipement National de Calcul Intensif, France
GRNET – Greek Research and Technology Network, Greece
INRIA – Institut National de Recherche en Informatique et Automatique, France (3rd Party to GENCI)
IST – Instituto Superior Técnico, Portugal (3rd Party to UC-LCA)
IUCC – Inter University Computation Centre, Israel
JKU – Institut fuer Graphische und Parallele Datenverarbeitung der Johannes Kepler Universitaet Linz, Austria
JUELICH – Forschungszentrum Juelich GmbH, Germany
KTH – Royal Institute of Technology, Sweden (3rd Party to SNIC)
LiU – Linkoping University, Sweden (3rd Party to SNIC)
NCSA – National Centre for Supercomputing Applications, Bulgaria
NIIF – National Information Infrastructure Development Institute, Hungary
NTNU – The Norwegian University of Science and Technology, Norway (3rd Party to SIGMA)
NUI-Galway – National University of Ireland Galway, Ireland
PRACE – Partnership for Advanced Computing in Europe aisbl, Belgium
PSNC – Poznan Supercomputing and Networking Center, Poland
RISCSW – RISC Software GmbH
RZG – Max Planck Gesellschaft zur Förderung der Wissenschaften e.V., Germany (3rd Party to GCS)
SIGMA2 – UNINETT Sigma2 AS, Norway
SNIC – Swedish National Infrastructure for Computing (within the Swedish Science Council), Sweden
STFC – Science and Technology Facilities Council, UK (3rd Party to EPSRC)
SURFsara – Dutch national high-performance computing and e-Science support center, part of the SURF cooperative, Netherlands
UC-LCA – Universidade de Coimbra, Laboratório de Computação Avançada, Portugal
UCPH – Københavns Universitet, Denmark
UHEM – Istanbul Technical University, Ayazaga Campus, Turkey
UiO – University of Oslo, Norway (3rd Party to SIGMA)
ULFME – Univerza v Ljubljani, Slovenia
UmU – Umea University, Sweden (3rd Party to SNIC)
UnivEvora – Universidade de Évora, Portugal (3rd Party to UC-LCA)
UPC – Universitat Politècnica de Catalunya, Spain (3rd Party to BSC)
UPM/CeSViMa – Madrid Supercomputing and Visualization Center, Spain (3rd Party to BSC)
USTUTT-HLRS – Universitaet Stuttgart – HLRS, Germany (3rd Party to GCS)
VSB-TUO – Vysoka Skola Banska - Technicka Univerzita Ostrava, Czech Republic
WCNS – Politechnika Wroclawska, Poland (3rd Party to PSNC)

Executive Summary
This document describes an accelerator benchmark suite, a set of 11 codes comprising 1 synthetic benchmark and 10 commonly used applications. The key focus of this task has been exploiting accelerators or co-processors to improve the performance of real applications. It aims at providing a set of scalable, currently relevant and publicly available codes and datasets. This work has been undertaken by Task 7.2B "Accelerator Benchmarks" in the PRACE Fourth Implementation Phase (PRACE-4IP) project. Most of the selected applications are a subset of the Unified European Applications Benchmark Suite (UEABS) [2][3]; one application and a synthetic benchmark have been added. The selected codes are: Alya, Code_Saturne, CP2K, GROMACS, GPAW, NAMD, PFARM, QCD, Quantum Espresso, SHOC and SPECFEM3D. For each code, two or more test case datasets have been selected. These are described in this document, along with a brief introduction to the application codes themselves. For each code, some sample results are presented, from first runs on leading-edge systems and prototypes.

Introduction
The work produced within this task is an extension of the UEABS for accelerators. This document covers each code, presenting the code itself as well as the test cases defined for the benchmarks and the first results that have been recorded on various accelerator systems. Like the UEABS, this suite aims to present results for many scientific fields that can use accelerated HPC resources. It will therefore help the European scientific communities decide which infrastructures they could procure in the near future. We focus on Intel Xeon Phi co-processors and NVIDIA GPU cards for benchmarking, as they are the two most widespread accelerated resources available today. Section 2 presents both types of accelerator systems, Xeon Phi and GPU cards, along with architecture examples. Section 3 gives a description of each of the selected applications, together with the test case datasets, while Section 4 presents some sample results. Section 5 outlines further work on, and using, the suite.

Targeted architectures
This suite targets accelerator cards, more specifically the Intel Xeon Phi and NVIDIA GPU architectures. This section briefly describes them and presents the 4 machines on which the benchmarks were run.

Co-processor description
Scientific computing using co-processors has gained popularity in recent years.
The utility of GPU was first demonstrated and evaluated in several application domains [4]. As a response to NVIDIA's supremacy in this field, Intel designed the Xeon Phi cards. Architectures and programming models of co-processors may differ from those of CPU and vary among co-processor types. The main challenges are the high degree of parallelism required from the software and the fact that code may have to be offloaded to the accelerator card. Table 1 illustrates this:

| | Intel Xeon Phi 5110P (KNC) | Intel Xeon Phi 7250 (KNL) | NVIDIA K40m | NVIDIA P100 |
| public availability date | Nov-12 | Jun-16 | Jun-13 | May-16 |
| theoretical peak performance | 1,011 GF/s | 3,046 GF/s | 1,430 GF/s | 5,300 GF/s |
| offload required | possible | not possible | required | required |
| max number of threads / CUDA cores | 240 | 272 | 2,880 | 3,584 |

Table 1: Main co-processor specifications
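The peak figures in Table 1 follow from the clock rate and the number of double-precision operations each device can retire per cycle. The sketch below reproduces them under assumed nominal clocks and per-cycle throughputs (the 1.053 GHz, 0.745 GHz and 1.48 GHz clocks and the DP unit counts are not stated in the table and are used here purely for illustration).

```c
#include <stdio.h>

/* Peak DP GFlop/s ~ units * DP flops per unit per cycle * clock (GHz).
 * Clocks and per-cycle throughputs are assumptions for illustration,
 * not values taken from Table 1. */
struct peak { const char *name; double units, flops_per_cycle, ghz; };

int main(void) {
    struct peak devs[] = {
        { "Xeon Phi 5110P (KNC)",   60, 16, 1.053 }, /* 60 cores, 512-bit FMA   */
        { "Xeon Phi 7250 (KNL)",    68, 32, 1.400 }, /* two AVX-512 FMA units   */
        { "Tesla K40m",            960,  2, 0.745 }, /* DP CUDA cores, FMA      */
        { "Tesla P100",           1792,  2, 1.480 }, /* DP CUDA cores, FMA      */
    };
    for (int i = 0; i < 4; i++)
        printf("%-22s ~%6.0f GF/s\n", devs[i].name,
               devs[i].units * devs[i].flops_per_cycle * devs[i].ghz);
    return 0;  /* prints roughly 1011, 3046, 1430 and 5300 GF/s */
}
```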
Systems description
The benchmark suite has been officially granted access to 4 different machines hosted by PRACE partners. Most results presented in this document were obtained on these machines, but some of the simulations have been run on similar ones. This section covers the specifications of the 4 official systems, while the few other ones are presented along with the corresponding results.

As noted in the previous section, the leading-edge architectures have only become available quite recently and some codes could not yet be run on them. Results will be completed in the near future and delivered with an update of the benchmark suite. Still, the performance presented here is a good indicator of the potential efficiency of the codes on both Xeon Phi and NVIDIA GPU platforms.

As for the future, the PRACE-3IP PCP is in its third and last phase and will be a good candidate to provide access to bigger machines. The following suppliers have been awarded a contract: ATOS/Bull SAS (France), E4 Computer Engineering (Italy) and Maxeler Technologies (UK), providing pilots using Xeon Phi, OpenPOWER and FPGA technologies. During this final phase, which started in October 2016, the contractors have to deploy pilot systems with a compute capability of around 1 PFlop/s, to demonstrate the technology readiness of the proposed solutions and the progress in terms of energy efficiency, using high-frequency monitoring designed for this purpose. These results will be evaluated on a subset of applications from the UEABS (NEMO, SPECFEM3D, Quantum Espresso, BQCD). Access to these systems is foreseen to be open to PRACE partners, with particular interest for the 4IP-WP7 task on accelerator benchmarks.

Cartesius K40
The SURFsara institute in the Netherlands granted access to Cartesius, which has a GPU island (installed May 2014) with the following specifications [5]:
- 66 Bullx B515 GPU-accelerated nodes
- 2x 8-core 2.5 GHz Intel Xeon E5-2450 v2 (Ivy Bridge) CPU per node
- 2x NVIDIA Tesla K40m GPU per node
- 96 GB DDR3-1600 RAM per node
- Total theoretical peak performance (Ivy Bridge + K40m), 1,056 cores + 132 GPU: 210 TF/s
The interconnect has a fully non-blocking fat-tree topology. Every node has two ConnectX-3 InfiniBand FDR adapters: one per GPU.

MareNostrum KNC
The Barcelona Supercomputing Center (BSC) in Spain granted access to MareNostrum III, which features KNC nodes (upgrade June 2013). This partition is described as follows [6]:
- 42 hybrid nodes, each containing:
  - Sandy Bridge-EP E5-2670 host processors (2 x 8 cores)
  - 8x 8 GB DDR3-1600 DIMMs (4 GB/core), 64 GB per node in total
  - 2x Xeon Phi 5110P accelerators
- Interconnection networks:
  - Mellanox InfiniBand FDR10: high-bandwidth network used by parallel application communications (MPI)
  - Gigabit Ethernet: 10 Gbit Ethernet network used by the GPFS filesystem

Ouessant P100
GENCI granted access to the Ouessant prototype at IDRIS in France (installed September 2016). It is composed of 12 IBM Minsky compute nodes, each containing [7]:
- Compute node:
  - POWER8+ sockets with 10 cores each and 8 threads per core (160 threads per node)
  - 128 GB of DDR4 memory (bandwidth > 9 GB/s per core)
  - 4 new-generation NVIDIA Pascal P100 GPU, each with 16 GB of HBM2 memory
- Interconnect:
  - 4 NVLink interconnects (40 GB/s of bi-directional bandwidth per interconnect); each GPU card is connected to a CPU with 2 NVLink interconnects and to another GPU with the 2 remaining interconnects
  - A Mellanox EDR InfiniBand CAPI interconnect (1 per node)

Frioul KNL
GENCI also granted access to the Frioul prototype at CINES in France (installed December 2016). It is composed of 48 Intel KNL compute nodes, each containing:
- Compute node:
  - 7250 KNL, 68 cores, 4 threads per core
  - 192 GB of DDR4 memory
  - 16 GB of MCDRAM
- Interconnect:
  - A Mellanox EDR 4x InfiniBand

Benchmark suite description
This part covers each code, presenting its interest for the scientific community as well as the test cases defined for the benchmarks. As an extension of the UEABS, most codes presented in this suite are included in the latter. The exceptions are PFARM, which comes from PRACE-2IP [8], and SHOC [11], a synthetic benchmark suite.

Table 2: Codes and corresponding APIs available (in green)

Table 2 lists the codes that are presented in the next sections as well as the implementations available for each of them. It should be noted that OpenMP can be used with the Intel Xeon Phi architecture while CUDA is used for NVIDIA GPU cards. OpenCL has been considered as a third alternative that can be used on both architectures. It was available on the first generation of Xeon Phi (KNC) but has not been ported to the second one (KNL). SHOC is the only code that is impacted; this problem is addressed in Section 4.10.

Alya
Alya is a high performance computational mechanics code that can solve different coupled mechanics problems: incompressible/compressible flows, solid mechanics, chemistry, excitable media, heat transfer and Lagrangian particle transport. It is one single code. There are no particular parallel or individual platform versions. Modules, services and kernels can be compiled individually and used a la carte. The main discretisation technique employed in Alya is based on the variational multiscale finite element method to assemble the governing equations into algebraic systems. These systems can be solved using solvers like GMRES, Deflated Conjugate Gradient and pipelined CG, together with preconditioners like SSOR, Restricted Additive Schwarz, etc. The coupling between the physics solved in different computational domains (like fluid-structure interactions) is carried out in a multi-code way, using different instances of the same executable. Asynchronous coupling can be achieved in the same way in order to transport Lagrangian particles.

Code description
The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available: without and with a colouring strategy to avoid ATOMICs during the assembly step. A CUDA version is also available for the different solvers. Alya has also been compiled for MIC (Intel Xeon Phi). Alya is written in Fortran 1995 and the incompressible fluid module, present in the benchmark suite, is freely available. This module solves the Navier-Stokes equations using an Orthomin(1) method [14] for the pressure Schur complement. This method is an algebraic split strategy which converges to the monolithic solution. At each linearisation step, the momentum is solved twice and the continuity equation is solved once or twice depending on whether the momentum preserving or the continuity preserving algorithm is selected.
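The two OpenMP assembly strategies mentioned above can be illustrated with a schematic element-to-node scatter-add. The sketch below is a generic illustration, not Alya source: the function names, the flat connectivity layout and the pre-computed colour lists are assumptions. The first variant protects concurrent updates with an atomic; the second assumes the elements have been grouped into colours so that no two elements of the same colour touch the same node, removing the need for atomics.

```c
#include <omp.h>

/* Schematic finite-element scatter-add of element contributions into a
 * global right-hand side.  Illustration only; nelem, nnode_elem, conn,
 * elem_rhs and rhs are assumed to be set up by the caller. */
void assemble_atomic(int nelem, int nnode_elem, const int *conn,
                     const double *elem_rhs, double *rhs)
{
    #pragma omp parallel for
    for (int e = 0; e < nelem; e++)
        for (int a = 0; a < nnode_elem; a++) {
            int node = conn[e * nnode_elem + a];
            #pragma omp atomic
            rhs[node] += elem_rhs[e * nnode_elem + a];
        }
}

/* Colouring strategy: elements of one colour never share a node, so the
 * loop over a single colour needs no atomics. */
void assemble_coloured(int ncolour, const int *colour_start, const int *colour_elem,
                       int nnode_elem, const int *conn,
                       const double *elem_rhs, double *rhs)
{
    for (int c = 0; c < ncolour; c++) {
        #pragma omp parallel for
        for (int k = colour_start[c]; k < colour_start[c + 1]; k++) {
            int e = colour_elem[k];
            for (int a = 0; a < nnode_elem; a++) {
                int node = conn[e * nnode_elem + a];
                rhs[node] += elem_rhs[e * nnode_elem + a];
            }
        }
    }
}
```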
Test cases description

Cavity-hexahedra elements (10M elements)
This test is the classical lid-driven cavity. The problem geometry is a cube of dimensions 1x1x1. The fluid properties are density = 1.0 and viscosity = 0.01. Dirichlet boundary conditions are applied on all sides, with three no-slip walls and one moving wall with velocity equal to 1.0, which corresponds to a Reynolds number of 100. The Reynolds number is low, so the regime is laminar and turbulence modelling is not necessary. The domain is discretised into 9,800,344 hexahedra elements. The solvers are the GMRES method for the momentum equations and the Deflated Conjugate Gradient for the continuity equation. This test case can be run using pure MPI parallelisation or the hybrid MPI/OpenMP strategy.

Cavity-hexahedra elements (30M elements)
This is the same cavity test as before but with 30M elements. Note that a mesh multiplication strategy enables one to multiply the number of elements by powers of 8, by simply activating the corresponding option in the ker.dat file.

Cavity-hexahedra elements, GPU version (10M elements)
This is the same test as Test case 1, but using the pure MPI parallelisation strategy with acceleration of the algebraic solvers using GPU.

Code_Saturne
Code_Saturne is a CFD software package developed by EDF R&D since 1997 and open-source since 2007. The Navier-Stokes equations are discretised following a finite volume method approach. The code can handle any type of mesh built with any type of cell/grid structure. Incompressible and compressible flows can be simulated, with or without heat transfer, and a range of turbulence models is available. The code can also be coupled with itself or with other software to model some multi-physics problems (fluid-structure interaction or conjugate heat transfer, for instance).

Code description
Parallelism is handled by distributing the domain over the processors (several partitioning tools are available, either internally, i.e. SFC Hilbert and Morton, or through external libraries, i.e. METIS Serial, ParMETIS, Scotch Serial and PT-SCOTCH). Communications between subdomains are handled by MPI. Hybrid parallelism using MPI/OpenMP has recently been optimised for improved multicore performance. For incompressible simulations, most of the time is spent computing the pressure through Poisson equations, and the corresponding matrices are very sparse. PETSc has recently been linked to the code to offer alternatives to the internal solvers for computing the pressure. The developer's version of PETSc supports CUDA and is the one used in this benchmark suite. Code_Saturne is written in C, F95 and Python. It is freely available under the GPL license.
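As an indication of how the pressure solve can be delegated to PETSc, the fragment below sets up a Krylov solver with the conjugate gradient method and a Jacobi preconditioner, mirroring the PETSc runtime options quoted in the results section (-ksp_type cg, -pc_type jacobi). It is a minimal sketch, not an excerpt from Code_Saturne: the function name is hypothetical and the pressure matrix A and vectors b, x are assumed to have been assembled elsewhere as PETSc objects.

```c
#include <petscksp.h>

/* Minimal sketch: solve A x = b (the pressure Poisson system) with CG and
 * Jacobi preconditioning through PETSc.  A, b and x are assumed to be an
 * assembled Mat and Vecs provided by the calling code. */
PetscErrorCode solve_pressure(Mat A, Vec b, Vec x)
{
    KSP            ksp;
    PC             pc;
    PetscErrorCode ierr;

    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A);        CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPCG);            CHKERRQ(ierr);  /* -ksp_type cg    */
    ierr = KSPGetPC(ksp, &pc);                CHKERRQ(ierr);
    ierr = PCSetType(pc, PCJACOBI);           CHKERRQ(ierr);  /* -pc_type jacobi */
    ierr = KSPSetFromOptions(ksp);            CHKERRQ(ierr);  /* picks up -vec_type / -mat_type etc. */
    ierr = KSPSolve(ksp, b, x);               CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp);                  CHKERRQ(ierr);
    return 0;
}
```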
Test cases description
Two test cases are dealt with: the former with a mesh made of hexahedral cells and the latter with a mesh made of tetrahedral cells. Both configurations are meant for incompressible laminar flows.

The first test case is run on KNL in order to test the performance of the code, always completely filling up a node using 64 MPI tasks and then adding either 1, 2 or 4 OpenMP threads, or 1, 2 or 4 extra MPI tasks, to investigate the effect of hyper-threading. In this case, the pressure is computed using the code's native Algebraic Multigrid (AMG) algorithm as a solver. The second test case is run on KNL and GPU. In this configuration, the pressure equation is solved using the conjugate gradient (CG) algorithm from the PETSc library (the developer's version of PETSc, which supports GPU) and tests are run on KNL as well as on CPU+GPU. PETSc is built with the CUSP library and the CUSP format is used.

Note that, in Code_Saturne, computing the pressure using a CG algorithm has always been slower than using the native AMG algorithm. The second test is therefore meant to compare the current results obtained on KNL and GPU using CG only, and not to compare CG and AMG time to solution.

Flow in a 3-D lid-driven cavity (tetrahedral cells)
The geometry is very simple, i.e. a cube, but the mesh is built using tetrahedral cells only. The Reynolds number is set to 100, and symmetry boundary conditions are applied in the spanwise direction. The case is modular and the mesh size can easily be varied. The largest mesh has about 13 million cells and is used to get some first comparisons using Code_Saturne linked to the developer's PETSc library, in order to make use of the GPU.

3-D Taylor-Green vortex flow (hexahedral cells)
The Taylor-Green vortex flow is traditionally used to assess the accuracy of the numerical schemes of CFD codes. Periodicity is used in the 3 directions. The evolutions of the total kinetic energy (integral of the velocity) and of the enstrophy (integral of the vorticity) as functions of time are examined. Code_Saturne is set to 2nd order time and spatial schemes. The mesh size is 256^3 cells.
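For reference, the two monitored quantities can be written as volume integrals of the velocity u and of the vorticity over the periodic box Omega; the normalisation below is the conventional one and may differ from the exact definition used in the benchmark post-processing.

```latex
E_k(t) = \frac{1}{2\,|\Omega|} \int_{\Omega} \mathbf{u}\cdot\mathbf{u}\,\mathrm{d}V ,
\qquad
\zeta(t) = \frac{1}{2\,|\Omega|} \int_{\Omega} \boldsymbol{\omega}\cdot\boldsymbol{\omega}\,\mathrm{d}V ,
\qquad
\boldsymbol{\omega} = \nabla \times \mathbf{u} .
```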
CP2K
CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. It can perform molecular dynamics, metadynamics, Quantum Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or the dimer method. CP2K provides a general framework for different modelling methods such as density functional theory (DFT) using the mixed Gaussian and plane waves approaches (GPW) and Gaussian and Augmented Plane Waves (GAPW). Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, ...), and classical force fields (AMBER, CHARMM, ...).

Code description
Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI. Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi). CP2K is written in Fortran 2003 and freely available under the GPL license.

Test cases description

LiH-HFX
This is a single-point energy calculation for a particular configuration of a 216-atom Lithium Hydride crystal with 432 electrons in a 12.3 Å^3 (Angstroms cubed) cell. The calculation is performed using a DFT algorithm with GAPW under the hybrid Hartree-Fock exchange (HFX) approximation. These types of calculations are generally around one hundred times more computationally expensive than a standard local DFT calculation, although the cost of the latter can be reduced by using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here, as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node and thus enough memory is available to avoid recomputing any integrals on the fly, improving performance. This test case is expected to scale efficiently to 1000+ nodes.

H2O-DFT-LS
This is a single-point energy calculation for 2048 water molecules in a 39 Å^3 box using linear-scaling DFT. A local-density approximation (LDA) functional is used to compute the Exchange-Correlation energy, in combination with a DZVP MOLOPT basis set and a 300 Ry cutoff. For large systems, the linear-scaling approach for solving the Self-Consistent-Field equations should be much cheaper computationally than standard DFT, and should allow scaling up to 1 million atoms for simple systems. The linear-scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard DFT is avoided and the key operations are sparse matrix-matrix multiplications, which have a number of non-zero entries that scale linearly with system size. These are implemented efficiently in CP2K's DBCSR library. This test case is expected to scale efficiently to 4000+ nodes.

GPAW
GPAW is a DFT program for ab-initio electronic structure calculations using the projector augmented wave method. It uses a uniform real-space grid representation of the electronic wavefunctions that allows for excellent computational scalability and systematic convergence properties.

Code description
GPAW is written mostly in Python, but includes computational kernels written in C as well as leveraging external libraries such as NumPy, BLAS and ScaLAPACK. Parallelisation is based on message passing using MPI, with no threading. Development branches for GPU and MIC include support for offloading to accelerators using either CUDA or pyMIC, respectively. GPAW is freely available under the GPL license.

Test cases description

Carbon Nanotube
This test case is a ground state calculation for a carbon nanotube in vacuum. By default it uses a 6-6-10 nanotube with 240 atoms (freely adjustable) and serial LAPACK, with an option to use ScaLAPACK. This benchmark is aimed at smaller systems, with an intended scaling range of up to 10 nodes.

Copper Filament
This test case is a ground state calculation for a copper filament in vacuum. By default it uses a 2x2x3 FCC lattice with 71 atoms (freely adjustable) and ScaLAPACK for parallelisation. This benchmark is aimed at larger systems, with an intended scaling range of up to 100 nodes. A lower limit on the number of nodes may be imposed by the amount of memory required, which can be adjusted to some extent with the run parameters (e.g. lattice size or grid spacing).

GROMACS
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.
It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (which usually dominate simulations), many groups also use it for research on non-biological systems, e.g. polymers. GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation, and some additional features. It provides very high performance compared to other programs: many algorithmic optimisations have been introduced in the code; for instance, the calculation of the virial has been extracted from the innermost loops over pairwise interactions, and the inverse square root is computed with the package's own software routines. In GROMACS 4.6 and later, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms to SIMD machine instructions, to utilise the available instruction-level parallelism. These kernels are available in both single and double precision, and support all the different kinds of SIMD found in x86-family (and other) processors.

Code description
Parallelisation is achieved using combined OpenMP and MPI. Offloading for accelerators is implemented through CUDA for GPU and through OpenMP for MIC (Intel Xeon Phi). GROMACS is written in C/C++ and freely available under the GPL license.

Test cases description

GluCL Ion Channel
The ion channel system is the membrane protein GluCl, a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains 142k atoms and is quite a challenging parallelisation case due to its small size. However, it is likely one of the most wanted target sizes for biomolecular simulations due to the importance of these proteins for pharmaceutical applications. It is particularly challenging due to the highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition. This test case was used as the "Small" test case in the previous PRACE 2IP and 3IP phases. It is included in the version 5.0 benchmark cases of the package. It is reported to scale efficiently up to 1000+ cores on x86-based systems.

Lignocellulose
A model of cellulose and lignocellulosic biomass in an aqueous solution [9]. This system of 3.3 million atoms is inhomogeneous. It uses reaction-field electrostatics instead of PME and therefore scales well on x86. This test case was used as the "Large" test case in the previous PRACE 2IP and 3IP projects, where it was reported to scale efficiently up to 10000+ x86 cores.

NAMD
NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms. NAMD is developed by the "Theoretical and Computational Biophysics Group" at the University of Illinois at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability when utilising a large number of processors. The application can read a wide variety of different file formats, for example force fields and protein structures, which are commonly used in bio-molecular science. A NAMD license can be applied for on the developer's website free of charge.
Once the license has been obtained, binaries for a number of platforms and the source can be downloaded from the website. Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins, or between proteins and other chemical substances, is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins.

Code description
NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI, supporting both pure MPI and hybrid parallelisation [10]. Offloading for accelerators is implemented for both GPU and MIC (Intel Xeon Phi).

Test cases description
The datasets are based on the original "Satellite Tobacco Mosaic Virus (STMV)" dataset from the official NAMD site. The memory-optimised build of the package and data sets are used in benchmarking. Data are converted to the appropriate binary format used by the memory-optimised build.

STMV.1M
This is the original STMV dataset from the official NAMD site. The system contains roughly 1 million atoms. This data set scales efficiently up to 1000+ x86 Ivy Bridge cores.

STMV.8M
This is a 2x2x2 replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms. This data set scales efficiently up to 6000 x86 Ivy Bridge cores.

STMV.28M
This is a 3x3x3 replication of the original STMV dataset from the official NAMD site. The system contains roughly 28 million atoms. This data set also scales efficiently up to 6000 x86 Ivy Bridge cores.

PFARM
PFARM is part of a suite of programs based on the 'R-matrix' ab-initio approach to the variational solution of the many-electron Schrödinger equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical applications (such as the interstellar medium and planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel 'outer-region' calculations for molecular systems as well as atomic systems.

Code description
In order to enable efficient computation, the external region calculation takes place in two distinct stages, named EXDIG and EXAS, with intermediate files linking the two. EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. EXAS uses a combined functional/domain decomposition approach where good load-balancing is essential to maintain efficient parallel performance. Each of the main stages in the calculation is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and is designed to take advantage of highly optimised numerical library routines. Hybrid MPI/OpenMP parallelisation has also been introduced into the code via shared-memory-enabled numerical library kernels. Accelerator-based implementations have been developed for both EXDIG and EXAS. EXDIG uses offloading via MAGMA (or MKL) for sector Hamiltonian diagonalisations on Intel Xeon Phi and GPU accelerators. EXAS uses combined MPI and OpenMP to distribute the scattering energy calculations efficiently, both across and within Intel Xeon Phi co-processors.
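The sector Hamiltonian diagonalisations at the heart of EXDIG reduce to dense symmetric eigenproblems; the results section below notes that the divide-and-conquer LAPACK routine DSYEVD is used, linked from MKL on CPU/KNL or from MAGMA on GPU. The fragment below is a minimal host-side sketch of such a call through the LAPACKE interface; it is not PFARM source and the function name, storage layout and error handling are illustrative only.

```c
#include <stdio.h>
#include <lapacke.h>

/* Diagonalise a dense symmetric matrix H of order n with the
 * divide-and-conquer driver DSYEVD.  Illustrative only: PFARM links the
 * same routine from MKL on CPU/KNL or offloads via MAGMA on GPU. */
int diagonalise(int n, double *h /* n*n, column-major */, double *eigenvalues)
{
    /* 'V': compute eigenvectors (returned in h); 'U': upper triangle stored. */
    lapack_int info = LAPACKE_dsyevd(LAPACK_COL_MAJOR, 'V', 'U',
                                     (lapack_int)n, h, (lapack_int)n, eigenvalues);
    if (info != 0)
        fprintf(stderr, "DSYEVD failed, info = %d\n", (int)info);
    return (int)info;
}
```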
Test cases description
External region R-matrix propagations take place over the outer partition of configuration space, including the region where long-range potentials remain important. The radius of this region is determined from the user input, and the program decides upon the best strategy for dividing this space into multiple sub-regions (or sectors). Generally, a choice of larger sector lengths requires the application of larger numbers of basis functions (and therefore larger Hamiltonian matrices) in order to maintain accuracy across the sector, and vice-versa. Memory limits on the target hardware may determine the final preferred configuration for each test case.

Iron, FeIII
This is an electron-ion scattering case with 1181 channels. Hamiltonian assembly in the coarse region applies 10 Legendre functions, leading to Hamiltonian matrix diagonalisations of order 11810. In the 'fine energy region' up to 30 Legendre functions may be applied, leading to Hamiltonian matrices of up to order 35430. The number of sector calculations is likely to range from about 15 to over 30, depending on the user specifications. Several thousand scattering energies are used in the calculation.

Methane, CH4
The dataset is an electron-molecule calculation with 1361 channels. Hamiltonian dimensions are therefore estimated between 13610 and ~40000. A process in the code which splits the constituent channels according to spin can be used to approximately halve the Hamiltonian size (whilst doubling the overall number of Hamiltonian matrices). As eigensolvers generally require O(N^3) operations, spin splitting leads to a saving in both memory requirements and operation count. The final radius of the external region required is relatively long, leading to more numerous sector calculations (estimated at between 20 and 30). The calculation will require many thousands of scattering energies.

In the current model, parallelism in EXDIG is limited to the number of sector calculations, i.e. a maximum of around 30 accelerator nodes. Methane is a relatively new dataset which has not yet been calculated on novel technology platforms at very large scale, so this is somewhat a step into the unknown. We are also somewhat reliant on collaborative partners that are not associated with PRACE for continuing to develop and fine-tune the accelerator-based EXAS program for this proposed work. Access to suitable hardware with throughput suited to development cycles is also a necessity if suitable progress is to be ensured.
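The saving quoted above for spin splitting follows directly from the O(N^3) cost of dense eigensolvers: halving the matrix order while doubling the number of matrices cuts the operation count by roughly a factor of four and the total memory for the two matrices by a factor of two. The short check below works this through for the upper CH4 estimate of order ~40000, assuming double-precision storage of a full matrix; the exact savings in PFARM will depend on how evenly the channels split.

```c
#include <stdio.h>

/* Effect of spin splitting on a dense symmetric eigenproblem:
 * one matrix of order N versus two matrices of order N/2.
 * N = 40000 is the upper estimate quoted for the CH4 case. */
int main(void) {
    double n       = 40000.0;
    double bytes   = 8.0;                           /* double precision     */
    double mem_one = n * n * bytes;                 /* one full matrix      */
    double mem_two = 2.0 * (n / 2) * (n / 2) * bytes;
    double ops_one = n * n * n;                     /* ~O(N^3) eigensolver  */
    double ops_two = 2.0 * (n / 2) * (n / 2) * (n / 2);

    printf("memory:     %.1f GB -> %.1f GB (factor %.1f)\n",
           mem_one / 1e9, mem_two / 1e9, mem_one / mem_two);
    printf("operations: reduced by a factor of %.1f\n", ops_one / ops_two);
    return 0;   /* 12.8 GB -> 6.4 GB, operations reduced ~4x */
}
```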
QCD
Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of neutrons and protons, which comprise quarks bound together by gluons. The theory of how quarks and gluons interact to form nucleons and other elementary particles is called Quantum Chromo Dynamics (QCD). For most problems of interest it is not possible to solve QCD analytically, and instead numerical simulations must be performed. Such "Lattice QCD" calculations are very computationally intensive and occupy a significant percentage of all HPC resources worldwide.

Code description
The QCD benchmark comprises two different implementations, described below.

First implementation
The MILC code is a freely-available suite for performing Lattice QCD simulations, developed over many years by a collaboration of researchers [15]. The benchmark used here is derived from the MILC code (v6) and consists of a full conjugate gradient solution using Wilson fermions. The benchmark is consistent with "QCD kernel E" in the full UEABS, and has been adapted so that it can efficiently use accelerators as well as traditional CPU. The implementation for accelerators has been achieved using the "targetDP" programming model [16], a lightweight abstraction layer designed to allow the same application source code to target multiple architectures, e.g. NVIDIA GPU and multicore/manycore CPU, in a performance-portable manner. The targetDP syntax maps, at compile time, to either NVIDIA CUDA (for execution on GPU) or OpenMP+vectorisation (for multi/manycore CPU, including Intel Xeon Phi). The base language of the benchmark is C, and MPI is used for node-level parallelism.

Second implementation
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on the QUDA [12] and QPhiX [13] libraries. The QUDA library is based on CUDA and optimised for running on NVIDIA GPU [17]. The QPhiX library consists of routines which are optimised to use Intel intrinsic functions of multiple vector lengths, including optimised routines for KNC and KNL [18]. In both the QUDA and QPhiX cases, the benchmark kernel uses the conjugate gradient solvers implemented within the libraries.

Test cases description
Lattice QCD involves the discretisation of space-time into a lattice of points, where the extent of the lattice in each of the 3 spatial and 1 temporal dimensions can be chosen. This means that the benchmark is very flexible: the size of the lattice can be varied with the size of the computing system in use (weak scaling) or kept fixed (strong scaling). For testing on a single node, 64x64x32x8 is a reasonable size, since this fits on a single Intel Xeon Phi or a single GPU. For larger numbers of nodes, the lattice extents can be increased accordingly, keeping the geometric shape roughly similar. Test cases for the second implementation are given by a strong-scaling mode with lattice sizes of 32x32x32x96 and 64x64x64x128, and a weak-scaling mode with a local lattice size of 48x48x48x24.
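To make the strong/weak-scaling distinction concrete, the snippet below simply counts lattice sites for the configurations quoted above: the single-node 64x64x32x8 lattice, the two strong-scaling lattices (fixed global size divided over more ranks), and the weak-scaling case, where each node keeps a fixed 48x48x48x24 local volume and the global lattice grows with the node count. The loop over node counts is purely illustrative.

```c
#include <stdio.h>

static long sites(int x, int y, int z, int t) { return (long)x * y * z * t; }

int main(void) {
    /* Strong scaling: the global lattice is fixed and divided over ranks. */
    printf("single node      64x64x32x8   : %ld sites\n", sites(64, 64, 32, 8));
    printf("strong scaling   32x32x32x96  : %ld sites\n", sites(32, 32, 32, 96));
    printf("strong scaling   64x64x64x128 : %ld sites\n", sites(64, 64, 64, 128));

    /* Weak scaling: each node keeps a fixed local volume, so the global
     * lattice grows in proportion to the number of nodes. */
    long local = sites(48, 48, 48, 24);
    for (int nodes = 1; nodes <= 8; nodes *= 2)
        printf("weak scaling, %d node(s): local %ld sites, global %ld sites\n",
               nodes, local, local * nodes);
    return 0;
}
```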
Quantum Espresso
QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). QUANTUM ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimisation. It is freely available to researchers around the world under the terms of the GNU General Public License. QUANTUM ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms and applied in the last twenty years by some of the leading materials modelling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures and a great effort devoted to user friendliness. QUANTUM ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate by contributing their own codes or by implementing their own ideas into existing codes. QUANTUM ESPRESSO is written mostly in Fortran90, parallelised using MPI and OpenMP, and released under a GPL license.

Code description
During 2011 a GPU-enabled version of Quantum ESPRESSO was publicly released. The code is currently developed and maintained by Filippo Spiga at the High Performance Computing Service, University of Cambridge (United Kingdom) and Ivan Girotto at the International Centre for Theoretical Physics (Italy). The initial work was supported by the EC-funded PRACE project and by an SFI grant (Science Foundation Ireland, grant 08/HEC/I1450). At the time of writing, the project is self-sustained thanks to the dedication of the people involved and to NVIDIA support in providing hardware and expertise in GPU programming. The current public version of QE-GPU is 14.10.0, which is the last version maintained as a plug-in working on all QE 5.x versions. QE-GPU utilises phiGEMM (external) for CPU+GPU GEMM computation, MAGMA (external) to accelerate eigensolvers, and explicit CUDA kernels to accelerate compute-intensive routines. FFT capabilities on GPU are available only for serial computation, due to the hard challenges posed in managing accelerators in the parallel distributed 3D-FFT portion of the code, where communication is the dominant element limiting scalability beyond hundreds of MPI ranks. A version for Intel Xeon Phi (MIC) accelerators is not currently available.

Test cases description

PW-IRMOF_M11
Full SCF calculation of a Zn-based isoreticular metal-organic framework (130 atoms in total) over 1 k-point. Benchmarks run in 2012 demonstrated speedups due to GPU (NVIDIA K20s, with respect to non-accelerated nodes) in the range 1.37-1.87, according to node count (maximum number of accelerators = 8). Runs with current hardware technology and an updated version of the code are expected to exhibit higher speedups (probably 2-3x) and to scale up to a couple of hundred nodes.

PW-SiGe432
This is an SCF calculation of a Silicon-Germanium crystal with 430 atoms. Being a fairly large system, parallel scalability up to several hundred, perhaps a thousand nodes is expected, with accelerated speedups likely to be of 2-3x.

Synthetic benchmarks – SHOC
The Accelerator Benchmark Suite also includes a series of synthetic benchmarks. For this purpose we chose the Scalable HeterOgeneous Computing (SHOC) benchmark suite, augmented with a series of benchmark examples developed internally. SHOC is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing. Its initial focus was on systems containing GPU and multi-core processors and on the OpenCL programming standard, but CUDA and OpenACC versions were added. Moreover, a subset of the benchmarks is optimised for the Intel Xeon Phi coprocessor. SHOC can be used on clusters as well as on individual hosts. The SHOC benchmark suite currently contains benchmark programs categorised by complexity.
Some benchmarks measure low-level 'feeds and speeds' behaviour (Level 0), some measure the performance of a higher-level operation such as a Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2).

The SHOC benchmark suite has been selected to evaluate the performance of accelerators on synthetic benchmarks, mostly because SHOC provides CUDA/OpenCL/Offload/OpenACC variants of the benchmarks. This allowed us to evaluate NVIDIA GPU (with CUDA/OpenCL/OpenACC) and Intel Xeon Phi KNC (with both Offload and OpenCL), but also the Intel host CPU (with OpenCL/OpenACC). However, on the latest Xeon Phi processor (codenamed KNL) none of these 4 models is supported. Thus, these benchmarks cannot be run on the KNL architecture at this point, and there is no indication that Intel will support OpenCL on the KNL. However, there is work in progress on the PGI compiler to support the KNL as a target; this support is expected to be added during 2017 and will allow us to compile and run the OpenACC benchmarks on the KNL. Alternatively, the OpenACC benchmarks will be ported to OpenMP and executed on the KNL.

Code description
All benchmarks are MPI-enabled. Some report aggregate metrics over all MPI ranks, others only perform work on specific ranks. Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi). For selected benchmarks, OpenACC implementations are provided for GPU. Multi-node parallelisation is achieved using MPI. SHOC is written in C++ and is open-source and freely available.

Test cases description
The benchmarks contained in SHOC currently feature 4 different problem sizes for increasingly large systems. The size convention is as follows:
1. CPU / debugging
2. Mobile/integrated GPU
3. Discrete GPU (e.g. GeForce or Radeon series)
4. HPC-focused or large-memory GPU (e.g. Tesla or FireStream series)
In order to go to an even larger scale, we plan to add a 5th level for massive supercomputers.

SPECFEM3D
The software package SPECFEM3D simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). All SPECFEM3D_GLOBE software is written in Fortran90 with full portability in mind, and conforms strictly to the Fortran95 standard. It uses no obsolete or obsolescent features of Fortran77. The package uses parallel programming based upon the Message Passing Interface (MPI). The SEM was originally developed in computational fluid dynamics and has been successfully adapted to address problems in seismic wave propagation. It is a continuous Galerkin technique which can easily be made discontinuous; it is then close to a particular case of the discontinuous Galerkin technique, with optimised efficiency because of its tensorised basis functions. In particular, it can accurately handle very distorted mesh elements. It has very good accuracy and convergence properties. The spectral element approach admits spectral rates of convergence and allows exploiting hp-convergence schemes. It is also very well suited to parallel implementation on very large supercomputers as well as on clusters of GPU-accelerated graphics cards. Tensor products inside each element can be optimised to reach very high efficiency, and mesh point and element numbering can be optimised to reduce processor cache misses and improve cache reuse.
The SEM can also handle triangular (in 2D) or tetrahedral (in 3D) elements as well as mixed meshes, although with increased cost and reduced accuracy in these elements, as in the discontinuous Galerkin method. In many geological models in the context of seismic wave propagation studies (except, for instance, for fault dynamic rupture studies, in which very high frequencies of supershear rupture need to be modelled near the fault), a continuous formulation is sufficient because material property contrasts are not drastic and thus conforming mesh doubling bricks can efficiently handle mesh size variations. This is particularly true at the scale of the full Earth. Effects due to lateral variations in compressional-wave speed, shear-wave speed, density, a 3D crustal model, ellipticity, topography and bathymetry, the oceans, rotation, and self-gravitation are included. The package can accommodate full 21-parameter anisotropy as well as lateral variations in attenuation. Adjoint capabilities and finite-frequency kernel simulations are also included.

Test cases definition
Both test cases use the same input data. A 3D shear-wave speed model (S362ANI) is used to benchmark the code. The simulation parameters used to size the test cases are:
- NCHUNKS: number of faces of the cubed sphere included in the simulation (always 6 here)
- NPROC_XI: number of slices along one side of a chunk of the cubed sphere (this also determines the number of processors used for one chunk)
- NEX_XI: number of spectral elements along one side of a chunk
- RECORD_LENGTH_IN_MINUTES: length of the simulated seismograms; the simulation time should vary linearly with this parameter

Small test case
It runs with 24 MPI tasks and has the following mesh characteristics: NCHUNKS = 6, NPROC_XI = 2, NEX_XI = 80, RECORD_LENGTH_IN_MINUTES = 2.0.

Bigger test case
It runs with 150 MPI tasks and has the following mesh characteristics: NCHUNKS = 6, NPROC_XI = 5, NEX_XI = 80, RECORD_LENGTH_IN_MINUTES = 2.0.
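The MPI task counts quoted for the two runs follow from the cubed-sphere decomposition, assuming the usual SPECFEM3D_GLOBE relation in which each of the NCHUNKS faces is split into NPROC_XI x NPROC_ETA slices, with NPROC_ETA equal to NPROC_XI for these configurations. The short check below reproduces the 24 and 150 tasks.

```c
#include <stdio.h>

/* Number of MPI tasks = NCHUNKS * NPROC_XI * NPROC_ETA, with
 * NPROC_ETA = NPROC_XI for the two benchmark configurations. */
static int ntasks(int nchunks, int nproc_xi) { return nchunks * nproc_xi * nproc_xi; }

int main(void) {
    printf("small  test case: NCHUNKS=6, NPROC_XI=2 -> %3d MPI tasks\n", ntasks(6, 2));
    printf("bigger test case: NCHUNKS=6, NPROC_XI=5 -> %3d MPI tasks\n", ntasks(6, 5));
    return 0;   /* 24 and 150, matching the run configurations above */
}
```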
Applications performances
This section presents some sample results on the targeted machines.

Alya
Alya has been compiled and run using test case A on three different types of compute nodes:
- BSC MinoTauro Westmere partition (Intel E5649, 12 cores, 2.53 GHz, 24 GB RAM, InfiniBand)
- BSC MinoTauro Haswell + K80 partition (Intel Xeon E5-2630 v3, 16 cores, 2.4 GHz, 128 GB RAM, NVIDIA K80, InfiniBand)
- KNL 7250 (68 cores, 1.40 GHz, 16 GB MCDRAM, 96 GB DDR4 RAM, Ethernet)

Alya supports parallelism via different options, mainly MPI for problem decomposition, OpenMP within the matrix construction phase and CUDA parallelism for selected solvers. In general, the best distribution and performance can be achieved by using MPI. Running on KNL, it has proven optimal to use 4 OpenMP threads and 16 MPI processes, for a total of 64 processes, each on its own physical core. The Xeon Phi processor shows slightly better performance for Alya when configured in Quadrant/Cache rather than Quadrant/Flat, although the difference is negligible. The application is not optimised for the first-generation Xeon Phi (KNC) and does not support offloading.

Overall speedups have been compared to a one-node CPU run on the Haswell partition of MinoTauro. As the application is heavily optimised for traditional computation, the best and almost linear scaling is observed on the CPU-only runs. Some calculations benefit from the accelerators, with GPU yielding from 3.6x to 6.5x speedup for one to three nodes. The KNL runs are limited by the OpenMP scalability, and too many MPI tasks on these processors lead to suboptimal scaling. Speedups in this case range from 0.9x to 1.6x and can be further improved by introducing more threading parallelism. The communication overhead when running with many MPI tasks on KNL is noticeable and is further limited by the Ethernet connection on multi-node runs. High-performance fabrics such as Omni-Path or InfiniBand promise to provide a significant enhancement for these cases. The results are compared in Figure 3.

It can be seen that the best performance is obtained on the most recent standard Xeon CPU in conjunction with GPU. This is expected, as Alya has been heavily optimised for traditional HPC scalability using mainly MPI and makes good use of the available cores. The addition of GPU-enabled solvers provides a noticeable boost to the overall performance. To fully exploit the KNL, further optimisations are ongoing and additional OpenMP parallelism will need to be employed.

Figure 1: The matrix construction part of Alya is parallelised with OpenMP and benefits significantly from the many cores available on KNL.
Figure 2: Scalability of the code. As expected, Haswell cores with K80 GPU are high-performing, while the KNL port is currently being optimised further.
Figure 3: Best performance is achieved with GPU in combination with powerful CPU cores. Single-thread performance has a big impact on the speedup; both threading and vectorisation are employed for additional performance.

Code_Saturne
Description of the runtime architectures:
- KNL: ARCHER (model 7210). The environment used is ENV_6.0.3 and the Intel compiler version is 17.0.0.098.
- GPU: 2 POWER8 nodes, i.e. S822LC (2x P8 10-cores + 2x K80 (2 GK210 per K80)) and S824L (2x P8 12-cores + 2x K40 (1 GK180 per K40)). The compiler is at/8.0, the MPI distribution is openmpi/1.8.8 and the CUDA compiler version is 7.5.

3-D Taylor-Green vortex flow (hexahedral cells)
The first test case has been run on ARCHER KNL and the performance has been investigated for several configurations, each of them using 64 MPI tasks per node, with either 1, 2 or 4 hyper-threads (extra MPI tasks) or OpenMP threads added for testing. The results are compared to ARCHER CPU, in this case Ivy Bridge CPU. Up to 8 nodes are used for comparison.

Figure 4: Code_Saturne's performance on KNL. AMG is used as a solver in V4.2.2.

Figure 4 shows the CPU time per time step as a function of the number of threads/MPI tasks. For all the cases, the time to solution decreases when the number of threads increases. For the case using MPI only and no hyper-threading (green line), a simulation is also run on half a node to investigate the speedup going from half a node to a full node, which is about 2, as seen in the figure. The ellipses help compare the time to solution per node and, finally, a comparison is carried out with simulations run on ARCHER without KNL, using Ivy Bridge processors.
When using 8 nodes, the best configuration for Code_Saturne on KNL is 64 MPI tasks and 2 OpenMP threads per task (blue line in the figure), which is about 15 to 20% faster than running on the Ivy Bridge nodes, using the same number of nodes.

Flow in a 3-D lid-driven cavity (tetrahedral cells)
The following options are used for PETSc:
- CPU: -ksp_type = cg and -pc_type = jacobi
- GPU: -ksp_type = cg and -vec_type = cusp and -mat_type = aijcusp and -pc_type = jacobi

Table 3: Performance of Code_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK. The speedup is computed as the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks.

Table 4: Performance of Code_Saturne and PETSc on 1 node of KNL. PETSc is built on the MKL library.

Table 3 and Table 4 show the results obtained using POWER8 CPU and CPU/GPU, and KNL, respectively. Focusing first on the results on the POWER8 nodes, a speedup is observed on each POWER8 node when using the same number of MPI tasks and GPU. However, when the nodes are fully populated (20 and 24 MPI tasks, respectively), it is cheaper to run on the CPU only than on CPU/GPU. This could be explained by the fact that the same overall amount of data is transferred, but the system administration costs, latency costs and the asynchronicity of transfers in 20 (S822LC) or 24 (S824L) slices might become prohibitive.

CP2K
Times shown in the ARCHER KNL (model 7210, 1.30 GHz, 96 GB DDR memory) vs Ivy Bridge (E5-2697 v2, 2.7 GHz, 64 GB) plot are for the CP2K threading configurations that give the best performance in each case. The shorthand for naming threading configurations is:
- MPI: pure MPI
- X_TH: X OpenMP threads per MPI rank

Whilst single-threaded pure MPI or 2 OpenMP threads is often fastest on conventional processors, on the KNL multithreading is more likely to be beneficial, especially in problems such as the LiH-HFX benchmark, in which having fewer MPI ranks means more memory is available to each rank, allowing partial results to be stored in memory instead of being expensively recomputed on the fly. Hyperthreads were left disabled (equivalent to the aprun option -j 1), as no significant performance benefit was observed using hyperthreading.
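The threading shorthand above (pure MPI versus X_TH) simply reflects how many OpenMP threads each MPI rank spawns. A generic hybrid code can report its own configuration at start-up with a few lines such as the sketch below; this is an illustration of the MPI+OpenMP pattern, not CP2K source.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* Print the hybrid configuration, e.g. "64 MPI ranks x 4 OpenMP threads",
 * which in the shorthand used above would be written 4_TH. */
int main(int argc, char **argv)
{
    int rank, nranks, nthreads, provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    nthreads = omp_get_max_threads();

    if (rank == 0)
        printf("%d MPI ranks x %d OpenMP threads = %d cores/hyperthreads in use\n",
               nranks, nthreads, nranks * nthreads);

    MPI_Finalize();
    return 0;
}
```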
Figure 5: Test case 1 of CP2K on the ARCHER cluster

The node-based comparison (Figure 5) shows that the runtimes on KNL nodes are roughly 1.7 times slower than the runtimes on 2-socket Ivy Bridge nodes.

GPAW
The performance of GPAW on both benchmarks was measured with a range of parallel job sizes on several architectures, designated in the following tables, figures and text as:
- CPU: x86 Haswell CPU (Intel Xeon E5-2690v3) in a dual-socket node
- KNC: Knights Corner MIC (Intel Xeon Phi 7120P) with an x86 Haswell host CPU (Intel Xeon E5-2680v3) in a dual-socket node
- KNL: Knights Landing MIC (Intel Xeon Phi 7210) in a single-socket node
- K40: K40 GPU (NVIDIA Tesla K40) with an x86 Ivy Bridge host CPU (Intel Xeon E5-2620-v2) in a dual-socket node
- K80: K80 GPU (NVIDIA Tesla K80) with an x86 Haswell host CPU (Intel Xeon E5-2680v3) in a quad-socket node

Only the time spent in the main SCF cycle was used as the runtime in the comparison (Table 5 and Table 6), to exclude any differences in the initialisation overheads.

Table 5: GPAW runtimes (in seconds) for the smaller benchmark (Carbon Nanotube) measured on several architectures when using n sockets (i.e. processors or accelerators).

Table 6: GPAW runtimes (in seconds) for the larger benchmark (Copper Filament) measured on several architectures when using n sockets (i.e. processors or accelerators). *Due to memory limitations on the GPU, the grid spacing was increased from 0.22 to 0.28 to have a sparser grid. To account for this in the comparison, the K40 and K80 runtimes have been scaled up using a corresponding CPU runtime as a yardstick (scaling factor q=2.1132).

As can be seen from Table 5 and Table 6, in both benchmarks a single KNL or K40/K80 was faster than a single CPU. But when using multiple KNL, the performance does not seem to scale as well as for CPU. In the smaller benchmark (Carbon Nanotube), CPU outperform KNL when using more than 2 processors. In the larger benchmark (Copper Filament), KNL still outperform CPU with 8 processors, but it seems likely that the CPU will overtake KNL when using an even larger number of processors. In contrast to KNL, the older KNC are slower than Haswell CPU across the board. Nevertheless, as can be seen from Figure 6, the scaling of KNC is to some extent comparable to CPU, but with a lower scaling limit. It is therefore likely that, on systems with considerably slower host CPU than Haswell (e.g. Ivy Bridge), KNC may still give a performance boost over the host CPU.

Figure 6: Relative performance (t0 / t) of GPAW for parallel jobs using an increasing number of CPU (blue) or Xeon Phi KNC (red). The single-CPU SCF-cycle runtime (t0) was used as the baseline for the normalisation. Ideal scaling is shown as a linear dashed line for comparison. Case 1 (Carbon Nanotube) is shown with square markers and Case 2 (Copper Filament) with round markers.

GROMACS
GROMACS was successfully compiled and run on the following systems:
- GRNET ARIS: thin nodes (E5-2680v2) and GPU nodes (dual E5-2660v3 + dual K40m), all with FDR14 InfiniBand; single-node KNL 7210
- CINES Frioul: KNL 7230
- IDRIS Ouessant: IBM POWER8 + dual P100
On the KNL machines the runs were performed using the Quadrant processor mode and both the Cache and Flat memory configurations.
On GRNET's single-node KNL more configurations were tested. As expected, the Quadrant/Cache mode gives the best performance in all cases. The dependence of performance on the MPI tasks/OpenMP threads combination was also explored. In most cases 66 tasks per node using 2 or 4 threads per task gives the best performance on the KNL 7230.

In all accelerated runs a speed-up of 2-2.6x with respect to CPU-only runs was achieved with GPU. GROMACS does not support offload on KNC.

Figure 7: Scalability for GROMACS test case GluCL Ion Channel

Figure 8: Scalability for GROMACS test case Lignocellulose

NAMD

NAMD was successfully compiled and run on the following systems:
- GRNET ARIS: Thin nodes (E5-2680v2), GPU nodes (dual E5-2660v3 + dual K40m), KNC nodes (dual E5-2660v2 + dual KNC 7120P), all with FDR14 InfiniBand; single-node KNL 7210
- CINES Frioul: KNL 7230
- IDRIS Ouessant: IBM POWER8 + dual P100

On KNL machines the runs were performed using the Quadrant cluster mode and both the Cache and Flat memory configurations. On GRNET's single-node KNL more configurations were tested. As expected, the Quadrant/Cache mode gives the best performance in all cases. The dependence of performance on the MPI tasks/OpenMP threads combination was also explored. In most cases 66 tasks per node with 4 threads per task, or 4 tasks per node with 64 threads per task, gives the best performance on the KNL 7230.

In all accelerated runs a speed-up of 5-6x with respect to CPU-only runs was achieved with GPU. On KNC the speed-up with respect to CPU-only runs is in the range 2-3.5 in all cases.

Figure 9: Scalability for NAMD test case STMV.8M

Figure 10: Scalability for NAMD test case STMV.28M

PFARM

The code has been tested and timed on several architectures, designated in the following figures, tables and text as:
- CPU: node contains two 2.7 GHz, 12-core E5-2697 v2 (Ivy Bridge) series processors with 64 GB memory.
- KNL: node is a 64-core KNL processor (model 7210) running at 1.30 GHz with 96 GB of memory.
- GPU: node contains a dual-socket 16-core Haswell E5-2698 running at 2.3 GHz with 256 GB memory and 4 K40, 4 K80 or 4 P100 GPU.

Codes on all architectures are compiled with the Intel compiler (CPU v15, KNL & GPU v17). The divide-and-conquer eigensolver routine DSYEVD is used throughout the test runs. The routine is linked from the following numerical libraries:
- CPU: Intel MKL Version 11.2.2
- KNL: Intel MKL Version 2017 Initial Release
- GPU: MAGMA Version 2.2

Figure 11: Eigensolver performance on KNL and GPU

EXDIG calculations are dominated by the eigensolver operations required to diagonalize each sector Hamiltonian matrix. Figure 11 summarizes eigensolver performance, using DSYEVD, over a range of problem sizes for the Xeon (CPU), Intel Knights Landing (KNL) and a range of recent NVIDIA GPU architectures. The results are normalised to the single-node CPU performance using 24 OpenMP threads. The CPU runs use 24 OpenMP threads and the KNL runs use 64 OpenMP threads. Dense linear algebra calculations tend to be bound by memory bandwidth, so using hyperthreading on the KNL or CPU is not beneficial. MAGMA is able to parallelise the calculation automatically across multiple GPU on a compute node; these results are denoted by the x2 and x4 labels.
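For reference, the kernel timed in Figure 11 is a standard dense symmetric eigensolve. The minimal sketch below (not the PFARM driver itself) calls the divide-and-conquer routine DSYEVD through the LAPACKE C interface, as provided by MKL; MAGMA exposes an analogous magma_dsyevd entry point for GPU execution.

/* Minimal sketch (not PFARM code): diagonalise a small symmetric matrix
 * with the divide-and-conquer eigensolver DSYEVD via the LAPACKE C
 * interface. */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    /* 3x3 symmetric test matrix, row-major */
    double a[9] = { 4.0, 1.0, 2.0,
                    1.0, 3.0, 0.0,
                    2.0, 0.0, 5.0 };
    double w[3];   /* eigenvalues, returned in ascending order */

    lapack_int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', 3, a, 3, w);
    if (info != 0) {
        fprintf(stderr, "DSYEVD failed: info = %d\n", (int)info);
        return 1;
    }
    printf("eigenvalues: %f %f %f\n", w[0], w[1], w[2]);
    return 0;
}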
Figure 11 demonstrates that MAGMA performance relative to CPU performance increases as the problem size increases, because the relative overhead of data transfer, O(N^2), shrinks compared to the computational load, O(N^3), i.e. the transfer-to-compute ratio falls roughly as 1/N.

Test Case 1 - FeIII. Defining computational characteristics: 10 Fine Region Sector calculations involving Hamiltonian matrices of dimension 23620 and 10 Coarse Region Sector calculations involving Hamiltonian matrices of dimension 11810.

Test Case 2 - CH4. Defining computational characteristics: 10 'Spin 1' Coarse Sector calculations involving Hamiltonian matrices of dimension 5720 and 10 'Spin 2' Coarse Sector calculations involving Hamiltonian matrices of dimension 7890.

Table 7: Overall EXDIG runtime performance on various accelerators (runtime, secs)

Test case | CPU 24 threads | KNL 64 threads | K80 | K80x2 | K80x4 | P100 | P100x2 | P100x4
Test Case 1 ; Atomic ; FeIII | 4475 | 2610 | 1215 | 828 | 631 | 544 | 427 | 377
Test Case 2 ; Molecular ; CH4 | 466 | 346 | 180 | 150 | 134 | 119 | 107 | 111

Table 7 records the overall run time on a range of architectures for both test cases described. For the complete runs (including I/O), both KNL-based and GPU-based computations significantly outperform the CPU-based calculations. For Test Case 1, utilising a node with a single P100 GPU accelerator results in a runtime more than 8 times quicker than the CPU, and approximately 4 times quicker for Test Case 2. The smaller Hamiltonian matrices associated with Test Case 2 mean that data transfer costs, O(N^2), are relatively high compared with computation costs, O(N^3). Smaller matrices also result in poorer scaling as the number of GPU per node is increased for Test Case 2.

Table 8: Overall EXDIG runtime parallel performance using the MPI-GPU version

A relatively simple MPI harness can be used in EXDIG to farm out different sector Hamiltonian calculations to multiple CPU, KNL or GPU nodes. Table 8 shows that parallel scaling across nodes is very good for each test platform. This strategy is inherently scalable; however, the replicated-data approach requires significant amounts of memory per node. Test Case 1 is used as the dataset here, although the problem characteristics are slightly different to the setup used for Table 7, with 5 Fine Region sectors with Hamiltonian dimension 23620 and 20 Coarse Region sectors with Hamiltonian dimension 11810. With these characteristics, runs using 2 MPI tasks experience inferior load-balancing in the Fine Region calculation compared to runs using 5 MPI tasks.
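A minimal sketch of such a task farm is shown below, assuming a static round-robin assignment of sectors to ranks and a placeholder diagonalise_sector() routine; it illustrates the replicated-data strategy only and is not the EXDIG harness itself. With a static distribution of this kind, a few expensive Fine Region sectors are exactly what produces the load imbalance described above.

/* Illustration only (not the EXDIG source): a static MPI "task farm" that
 * distributes independent sector-Hamiltonian diagonalisations round-robin
 * over the available ranks. diagonalise_sector() is a stand-in for the
 * per-sector DSYEVD/MAGMA call. */
#include <stdio.h>
#include <mpi.h>

#define NSECTORS 25   /* e.g. 5 fine + 20 coarse region sectors */

static double diagonalise_sector(int s) { return (double)s; } /* placeholder */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, total = 0.0;
    for (int s = rank; s < NSECTORS; s += size)   /* round-robin distribution */
        local += diagonalise_sector(s);

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("processed %d sectors on %d ranks (checksum %g)\n",
               NSECTORS, size, total);

    MPI_Finalize();
    return 0;
}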
QCD

As stated in the description, the QCD benchmark has two implementations.

First implementation

Figure 12: Small test case results for QCD, first implementation

Figure 13: Large test case results for QCD, first implementation

Figure 12 and Figure 13 show the strong scaling on Titan and ARCHER for the small and large problem sizes, respectively. For ARCHER, both CPU are used per node. For Titan, we include results with and without GPU utilization.

On each node, Titan has one 16-core Interlagos CPU and one K20X GPU, whereas ARCHER has two 12-core Ivy Bridge CPU. In this section, we evaluate on a node-by-node basis. For Titan, a single MPI task per node, operating on the CPU, is used to drive the GPU on that node. We also include, for Titan, results just using the CPU on each node without any involvement from the GPU, for comparison. This means that, on a single node, our Titan results will be the same as the K20X and Interlagos results presented in the previous section (for the same test case). On ARCHER, however, we fully utilize both processors per node: to do this we use two MPI tasks per node, each with 12 OpenMP threads (via targetDP). So the single-node results for ARCHER are twice as fast as the Ivy Bridge single-processor results presented in the previous section.

Figure 14: Time taken by the full MILC 64x64x64x8 test cases on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures

In Figure 14 we present preliminary results on the latest-generation Intel Knights Landing (KNL) and NVIDIA Pascal architectures, which offer very high bandwidth stacked memory, together with the same traditional Intel Ivy Bridge CPU used in previous sections. Note that these results are not directly comparable with those presented earlier, since they are for a different test case size (larger, since we are no longer limited by the small memory size of the Knights Corner), and they are for a slightly updated version of the benchmark. The KNL is the 64-core 7210 model, available within a test and development platform provided as part of the ARCHER service. The Pascal is an NVIDIA P100 GPU provided as part of the "Ouessant" IBM service at IDRIS, where the host CPU is an IBM Power8+. It can be seen that the KNL is 7.5X faster than the Ivy Bridge, the Pascal is 13X faster than the Ivy Bridge, and the Pascal is 1.7X faster than the KNL.

Second implementation

GPU results

The GPU benchmark results of the second implementation were obtained on Piz Daint at CSCS in Switzerland and on the GPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs are performed using the provided bash scripts. Piz Daint is equipped with one P100 Pascal GPU per node. Two different test cases are depicted: the "strong-scaling" mode with a random lattice configuration of size 32x32x32x96 and of size 64x64x64x128. The GPU nodes of Cartesius have two Kepler K40m GPU per node, and the "strong-scaling" test is shown for one card per node and for two cards per node. The benchmark kernel uses a conjugate gradient solver for the linear system D * x = b, where D is the clover-improved Wilson Dirac operator, b a known right-hand side and x the unknown solution (a schematic of this kernel is sketched at the end of this subsection).

Figure 15: Result of second implementation of QCD on K40m GPU

Figure 15 shows strong scaling of the conjugate gradient solver on K40m GPU on Cartesius. The lattice size is 32x32x32x96, which corresponds to a moderate lattice size nowadays. The test is performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue). The runs are done with one GPU per node (filled) and two GPU per node (non-filled).

Figure 16: Result of second implementation of QCD on P100 GPU

Figure 16 shows strong scaling of the conjugate gradient solver on P100 GPU on Piz Daint. The lattice size is 32x32x32x96, as in the strong-scaling run on the K40m on Cartesius. The test is performed with mixed-precision CG in double-double mode (red) and half-double mode (blue).

Figure 17: Result of second implementation of QCD on P100 GPU on larger test case

Figure 17 shows strong scaling of the conjugate gradient solver on P100 GPU on Piz Daint. The lattice size is increased to 64x64x64x128, which is a large lattice nowadays. With the larger lattice, the scaling test shows that the conjugate gradient solver has very good strong scaling up to 64 GPU.
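For reference, the sketch below shows a textbook conjugate gradient iteration for a generic symmetric positive-definite operator applied matrix-free. In the benchmark the operator is built from the clover-improved Wilson Dirac operator and the solver runs in mixed precision on the accelerator, so this is only a schematic of the algorithm, not the QUDA/QPhiX implementation.

/* Schematic of the benchmarked kernel (illustration only): a plain conjugate
 * gradient solve x <- A^{-1} b for a symmetric positive-definite operator A,
 * applied matrix-free through the callback `apply_A`. */
#include <stdio.h>
#include <math.h>
#include <stddef.h>

typedef void (*apply_op)(const double *in, double *out, size_t n);

static double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Solves A x = b; x holds the initial guess on entry and the solution on
 * exit. r, p, Ap are caller-provided work vectors of length n.
 * Returns the iteration count, or -1 if not converged. */
int cg_solve(apply_op apply_A, const double *b, double *x,
             double *r, double *p, double *Ap,
             size_t n, double tol, int max_iter)
{
    apply_A(x, Ap, n);
    for (size_t i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(r, r, n);

    for (int it = 0; it < max_iter; it++) {
        if (sqrt(rr) < tol) return it;           /* converged */
        apply_A(p, Ap, n);
        double alpha = rr / dot(p, Ap, n);
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return -1;
}

/* Tiny usage example with A = diag(1, 2, 3, 4). */
static void apply_diag(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = (i + 1.0) * in[i];
}

int main(void)
{
    double b[4] = {1, 2, 3, 4}, x[4] = {0}, r[4], p[4], Ap[4];
    int iters = cg_solve(apply_diag, b, x, r, p, Ap, 4, 1e-12, 100);
    printf("converged in %d iterations; x = %g %g %g %g\n",
           iters, x[0], x[1], x[2], x[3]);
    return 0;
}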
Xeon Phi results

The benchmark results for the Xeon Phi benchmark suite were obtained on Frioul at CINES and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNC per node. The data on Frioul are generated using the bash scripts provided by the second implementation of QCD and cover the two "strong-scaling" test cases with lattice sizes of 32x32x32x96 and 64x64x64x128. For the data generated at MareNostrum, results for the "strong-scaling" mode on a 32x32x32x96 lattice are shown. The benchmark kernel uses a random gauge configuration and the conjugate gradient solver to solve a linear system involving the clover Wilson Dirac operator.

Figure 18: Result of second implementation of QCD on KNC

Figure 18 shows strong scaling of the conjugate gradient solver on KNC on the hybrid partition of MareNostrum III. The lattice size is 32x32x32x96, which corresponds to a moderate lattice size nowadays. The test is performed with a conjugate gradient solver in single precision, using the native mode and 60 OpenMP threads per MPI process. The runs are done with one KNC per node (filled) and two KNC per node (non-filled).

Figure 19: Result of second implementation of QCD on KNL

Figure 19 shows strong scaling results of the conjugate gradient solver on KNL on Frioul. The lattice size is 32x32x32x96, similar to the strong-scaling run on the KNC on MareNostrum III. The runs are performed in quadrant cache mode with 68 OpenMP threads per KNL. The test is performed with a conjugate gradient solver in single precision.

Quantum Espresso

Here are sample results for Quantum Espresso. This code has been run on Cartesius (see section 2.2.1) and on Marconi (1 node is 1 standalone KNL Xeon Phi 7250, 68 cores at 1.40 GHz, 16 GB MCDRAM, 96 GB DDR4 RAM; the interconnect is Intel Omni-Path).

Runs on GPU

Figure 20: Scalability of Quantum Espresso on GPU for test case 1

Figure 21: Scalability of Quantum Espresso on GPU for test case 2

The test cases (Figure 20 and Figure 21) show no appreciable speed-up with GPU. The inputs are probably too small; they should evolve in future versions of this benchmark suite.

Runs on KNL

Figure 22: Scalability of Quantum Espresso on KNL for test case 1

Figure 22 shows the usual pw.x with the small test case A (AUSURF), comparing Marconi Broadwell (36 cores/node) with KNL (68 cores/node); this test case is probably too small for testing on KNL.

Figure 23: Quantum Espresso - KNL vs BDW vs BGQ (at scale)

Figure 23 presents CNT10POR8, which is the large test case, even though it uses the cp.x executable (i.e. Car-Parrinello) rather than the usual pw.x (PW SCF calculation).

Synthetic benchmarks (SHOC)

The SHOC benchmark has been run on Cartesius, Ouessant and MareNostrum.
Table 9 presents the results:

Table 9: Synthetic benchmark results on GPU (K40, POWER8 + P100) and Xeon Phi (KNC), with Haswell OpenCL for comparison

Benchmark | K40 CUDA | K40 OpenCL | POWER8 + P100 CUDA | KNC Offload | KNC OpenCL | Haswell OpenCL
BusSpeedDownload (GB/s) | 10.5 | 10.56 | 32.23 | 6.6 | 6.8 | 12.4
BusSpeedReadback (GB/s) | 10.5 | 10.56 | 34.00 | 6.7 | 6.8 | 12.5
maxspflops (GFLOPS) | 3716 | 3658 | 10424 | 2158 (*) | 12314 (*) | 1647
maxdpflops (GFLOPS) | 1412 | 1411 | 5315 | 1601 (*) | 72318 (*) | 884
gmem_readbw (GB/s) | 177 | 179 | 575.16 | 170 | 49.7 | 20.2
gmem_readbw_strided (GB/s) | 18 | 20 | 99.15 | N/A | 35 | 156 (*)
gmem_writebw (GB/s) | 175 | 188 | 436 | 72 | 41 | 13.6
gmem_writebw_strided (GB/s) | 7 | 7 | 26.3 | N/A | 25 | 163 (*)
lmem_readbw (GB/s) | 1168 | 1156 | 4239 | N/A | 442 | 238
lmem_writebw (GB/s) | 1194 | 1162 | 5488 | N/A | 477 | 295
BFS (Edges/s) | 49,236,500 | 42,088,000 | 91,935,100 | N/A | 1,635,330 | 14,225,600
FFT_sp (GFLOPS) | 523 | 377 | 1472 | 135 | 71 | 80
FFT_dp (GFLOPS) | 262 | 61 | 733 | 69.5 | 31 | 55
SGEMM (GFLOPS) | 2900-2990 | 694/761 | 8604-8720 | 640/645 | 179/217 | 419-554
DGEMM (GFLOPS) | 1025-1083 | 411/433 | 3635-3785 | 179/190 | 76/100 | 189-196
MD (SP) (GFLOPS) | 185 | 91 | 483 | 28 | 33 | 114
MD5Hash (GH/s) | 3.38 | 3.36 | 15.77 | N/A | 1.7 | 1.29
Reduction (GB/s) | 137 | 150 | 271 | 99 | 10 | 91
Scan (GB/s) | 47 | 39 | 99.2 | 11 | 4.5 | 15
Sort (GB/s) | 3.08 | 0.54 | 12.54 | N/A | 0.11 | 0.35
Spmv (GFLOPS) | 4-23 | 3-17 | 23-65 | 1-17944 (*) | N/A | 1-10
Stencil2D (GFLOPS) | 123 | 135 | 465 | 89 | 8.95 | 34
Stencil2D_dp (GFLOPS) | 57 | 67 | 258 | 16 | 7.92 | 30
Triad (GB/s) | 13.5 | 9.9 | 43 | 5.76 | 5.57 | 8
S3D (level2) (GFLOPS) | 94 | 91 | 294 | 109 | 18 | 27

The measures marked with (*) are not relevant and should not be considered:
- KNC MaxFlops (both SP and DP): in this case the compiler optimizes away some of the computation (although it should not) [19].
- KNC SpMV: for these benchmarks there is a known bug that is currently being addressed [20].
- Haswell gmem_readbw_strided and gmem_writebw_strided: strided read/write benchmarks do not make much sense in the case of the CPU, as the data will be cached in the large L3 caches; this is why high numbers are seen only in the Haswell case.

SPECFEM3D

Tests have been carried out on Ouessant and Frioul. So far it has only been possible to run on one fixed core count for each test case, so scaling curves are not available. Test case A ran on 4 KNL and 4 P100. Test case B ran on 10 KNL and 4 P100.

Table 10: SPECFEM 3D GLOBE results (run time in seconds)

Test case | KNL | P100
Test case A | 66 | 105
Test case B | 21.4 | 68

Conclusion and future work

The work presented here stands as a first step towards application benchmarking on accelerators. Most codes have been selected from the main Unified European Application Benchmark Suite (UEABS). This document describes each of them, as well as their implementation, relevance to the European science community and test cases. We have presented results on leading-edge systems.

The suite will be publicly available on the PRACE web site [1], where links to download sources and test cases will be published along with compilation and run instructions.

Task 7.2B in PRACE 4IP started to design a benchmark suite for accelerators.
This work has been done aiming at integrating it to the main UEABS one so that both can be maintained and evolve together. As PCP (PRACE-3IP) machines will soon be available, it will be very interesting to run the benchmark suite on them. First because these machines will be larger, but also because they will feature energy consumption probes. \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet1.xlsx b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet1.xlsx deleted file mode 100644 index 4ecdd7dfbc523ee4b6a82f7e7924bce6384f09ef..0000000000000000000000000000000000000000 Binary files a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet1.xlsx and /dev/null differ diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet2.xlsx b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet2.xlsx deleted file mode 100644 index ffbb9f092420510ba266ed2ed9ff4ca336e69fa4..0000000000000000000000000000000000000000 Binary files a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet2.xlsx and /dev/null differ diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet3.xlsx b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet3.xlsx deleted file mode 100644 index d24f702c516ac065c67759972da4569f09d03a58..0000000000000000000000000000000000000000 Binary files a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet3.xlsx and /dev/null differ diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet4.xlsx b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet4.xlsx deleted file mode 100644 index 939246e4fad20ff3e65c167b3c5b05d33553e1a0..0000000000000000000000000000000000000000 Binary files a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/embeddings/Microsoft_Excel_Worksheet4.xlsx and /dev/null differ diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/endnotes.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/endnotes.xml deleted file mode 100644 index d474527eed33502485a53f0ea0b58043fdd0db49..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/endnotes.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/fontTable.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/fontTable.xml deleted file mode 100644 index c83e9b875bbcf4c0dfaf120635611b1de468e09d..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/fontTable.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer1.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer1.xml deleted file mode 100644 index 287ad23ac0192f569eb5142c06138469fc9d40c7..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer1.xml +++ /dev/null @@ -1,2 +0,0 @@ - -PAGE 12 \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer2.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer2.xml deleted file mode 100644 index a40ccb329fe05b482ec9efafcb45fbdd1ee97638..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer2.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git 
a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer3.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer3.xml deleted file mode 100644 index 849178df10aac75ce73f6662e8c676a87470a300..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer3.xml +++ /dev/null @@ -1,2 +0,0 @@ - -PAGE x REF Acronym \* MERGEFORMAT PRACE-4IP - REF ReferenceNo \* MERGEFORMAT EINFRA-653838 REF PrepDate \* MERGEFORMAT 01.03.2016 \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer4.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer4.xml deleted file mode 100644 index b9a0bdbe9e45caa77ab57b1553c996819d9d2d78..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footer4.xml +++ /dev/null @@ -1,2 +0,0 @@ - -PAGE 13 REF Acronym \* MERGEFORMAT PRACE-4IP - REF ReferenceNo \* MERGEFORMAT EINFRA-653838 REF PrepDate \* MERGEFORMAT 01.03.2016 \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footnotes.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footnotes.xml deleted file mode 100644 index 811e7997d5c9fe76c39bf50f34f31e02e43ea309..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/footnotes.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header1.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header1.xml deleted file mode 100644 index 01125631426b1d18b52078a2e94423de4e189a3c..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header1.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header2.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header2.xml deleted file mode 100644 index b35685ebccf3f7f74d1b46ff4018373bd526a601..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header2.xml +++ /dev/null @@ -1,2 +0,0 @@ - - REF DeliverableNumber \* MERGEFORMAT D7.5 REF DeliverableTitle \* MERGEFORMAT Application performance on accelerators \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header3.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header3.xml deleted file mode 100644 index d6288453da50ef82ecd830e5fcce3c1f2f824f54..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/header3.xml +++ /dev/null @@ -1,2 +0,0 @@ - - REF DeliverableNumber \* MERGEFORMAT D7.5 REF DeliverableTitle \* MERGEFORMAT Application performance on accelerators \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/numbering.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/numbering.xml deleted file mode 100644 index 6dccc0be051ebff84c649bf3a0be58e026970ad4..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/numbering.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/settings.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/settings.xml deleted file mode 100644 index 0440b778633fa78f7864d985d9f52e34bc1c9701..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/settings.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/styles.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/styles.xml deleted file mode 100644 index 
c688a5043568320606915d2a75a7d561b9b646e4..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/styles.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/theme/theme1.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/theme/theme1.xml deleted file mode 100644 index a9983a13488ef26fffef7c2a5e55e26a671f6f0f..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/theme/theme1.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/webSettings.xml b/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/webSettings.xml deleted file mode 100644 index a2385a2abd3d03208f2b6a316db92aadc6c119c5..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.docx.tmp/word/webSettings.xml +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/doc/docx2latex/d7.5_4IP_1.0.log b/doc/docx2latex/d7.5_4IP_1.0.log deleted file mode 100644 index b4af01fd992d3dc71440a7406aebf7b46ce0e197..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.log +++ /dev/null @@ -1,382 +0,0 @@ -This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (preloaded format=pdflatex 2017.4.26) 26 APR 2017 11:30 -entering extended mode - restricted \write18 enabled. - %&-line parsing enabled. -**d7.5_4IP_1.0.tex -(./d7.5_4IP_1.0.tex -LaTeX2e <2016/02/01> -Babel <3.9q> and hyphenation patterns for 3 language(s) loaded. -(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrbook.cls -Document Class: scrbook 2015/10/03 v3.19a KOMA-Script document class (book) -(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrkbase.sty -Package: scrkbase 2015/10/03 v3.19a KOMA-Script package (KOMA-Script-dependent -basics and keyval usage) - -(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrbase.sty -Package: scrbase 2015/10/03 v3.19a KOMA-Script package (KOMA-Script-independent - basics and keyval usage) - -(/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty -Package: keyval 2014/10/28 v1.15 key=value parser (DPC) -\KV@toks@=\toks14 -) -(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrlfile.sty -Package: scrlfile 2015/10/03 v3.19a KOMA-Script package (loading files) - -Package scrlfile, 2015/10/03 v3.19a KOMA-Script package (loading files) - Copyright (C) Markus Kohm - -))) (/usr/share/texlive/texmf-dist/tex/latex/koma-script/tocbasic.sty -Package: tocbasic 2015/10/03 v3.19a KOMA-Script package (handling toc-files) -) -Package tocbasic Info: omitting babel extension for `toc' -(tocbasic) because of feature `nobabel' available -(tocbasic) for `toc' on input line 125. -Package tocbasic Info: omitting babel extension for `lof' -(tocbasic) because of feature `nobabel' available -(tocbasic) for `lof' on input line 126. -Package tocbasic Info: omitting babel extension for `lot' -(tocbasic) because of feature `nobabel' available -(tocbasic) for `lot' on input line 127. -Package tocbasic Info: defining new hook before heading of `' on input line 158 -4. -Class scrbook Info: File `scrsize11pt.clo' used instead of -(scrbook) file `scrsize11.clo' to setup font sizes on input line 2251 -. 
- -(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrsize11pt.clo -File: scrsize11pt.clo 2015/10/03 v3.19a KOMA-Script font size class option (11p -t) -) -(/usr/share/texlive/texmf-dist/tex/latex/koma-script/typearea.sty -Package: typearea 2015/10/03 v3.19a KOMA-Script package (type area) - -Package typearea, 2015/10/03 v3.19a KOMA-Script package (type area) - Copyright (C) Frank Neukam, 1992-1994 - Copyright (C) Markus Kohm, 1994- - -\ta@bcor=\skip41 -\ta@div=\count79 -\ta@hblk=\skip42 -\ta@vblk=\skip43 -\ta@temp=\skip44 -\footheight=\skip45 -Package typearea Info: These are the values describing the layout: -(typearea) DIV = 10 -(typearea) BCOR = 0.0pt -(typearea) \paperwidth = 597.50793pt -(typearea) \textwidth = 418.25555pt -(typearea) DIV departure = -6% -(typearea) \evensidemargin = 47.2316pt -(typearea) \oddsidemargin = -12.5192pt -(typearea) \paperheight = 845.04694pt -(typearea) \textheight = 595.80026pt -(typearea) \topmargin = -25.16531pt -(typearea) \headheight = 17.0pt -(typearea) \headsep = 20.40001pt -(typearea) \topskip = 11.0pt -(typearea) \footskip = 47.6pt -(typearea) \baselineskip = 13.6pt -(typearea) on input line 1509. -) -\c@part=\count80 -\c@chapter=\count81 -\c@section=\count82 -\c@subsection=\count83 -\c@subsubsection=\count84 -\c@paragraph=\count85 -\c@subparagraph=\count86 -LaTeX Info: Redefining \textsubscript on input line 4654. -\abovecaptionskip=\skip46 -\belowcaptionskip=\skip47 -\c@pti@nb@sid@b@x=\box26 -\c@figure=\count87 -\c@table=\count88 -\bibindent=\dimen102 -) (/usr/share/texlive/texmf-dist/tex/latex/graphics/graphicx.sty -Package: graphicx 2014/10/28 v1.0g Enhanced LaTeX Graphics (DPC,SPQR) - -(/usr/share/texlive/texmf-dist/tex/latex/graphics/graphics.sty -Package: graphics 2016/01/03 v1.0q Standard LaTeX Graphics (DPC,SPQR) - -(/usr/share/texlive/texmf-dist/tex/latex/graphics/trig.sty -Package: trig 2016/01/03 v1.10 sin cos tan (DPC) -) -(/usr/share/texlive/texmf-dist/tex/latex/latexconfig/graphics.cfg -File: graphics.cfg 2010/04/23 v1.9 graphics configuration of TeX Live -) -Package graphics Info: Driver file: pdftex.def on input line 95. - -(/usr/share/texlive/texmf-dist/tex/latex/pdftex-def/pdftex.def -File: pdftex.def 2011/05/27 v0.06d Graphics/color for pdfTeX - -(/usr/share/texlive/texmf-dist/tex/generic/oberdiek/infwarerr.sty -Package: infwarerr 2010/04/08 v1.3 Providing info/warning/error messages (HO) -) -(/usr/share/texlive/texmf-dist/tex/generic/oberdiek/ltxcmds.sty -Package: ltxcmds 2011/11/09 v1.22 LaTeX kernel commands for general use (HO) -) -\Gread@gobject=\count89 -)) -\Gin@req@height=\dimen103 -\Gin@req@width=\dimen104 -) -(/usr/share/texlive/texmf-dist/tex/latex/hyperref/hyperref.sty -Package: hyperref 2012/11/06 v6.83m Hypertext links for LaTeX - -(/usr/share/texlive/texmf-dist/tex/generic/oberdiek/hobsub-hyperref.sty -Package: hobsub-hyperref 2012/05/28 v1.13 Bundle oberdiek, subset hyperref (HO) - - -(/usr/share/texlive/texmf-dist/tex/generic/oberdiek/hobsub-generic.sty -Package: hobsub-generic 2012/05/28 v1.13 Bundle oberdiek, subset generic (HO) -Package: hobsub 2012/05/28 v1.13 Construct package bundles (HO) -Package hobsub Info: Skipping package `infwarerr' (already loaded). -Package hobsub Info: Skipping package `ltxcmds' (already loaded). -Package: ifluatex 2010/03/01 v1.3 Provides the ifluatex switch (HO) -Package ifluatex Info: LuaTeX not detected. -Package: ifvtex 2010/03/01 v1.5 Detect VTeX and its facilities (HO) -Package ifvtex Info: VTeX not detected. 
-Package: intcalc 2007/09/27 v1.1 Expandable calculations with integers (HO) -Package: ifpdf 2011/01/30 v2.3 Provides the ifpdf switch (HO) -Package ifpdf Info: pdfTeX in PDF mode is detected. -Package: etexcmds 2011/02/16 v1.5 Avoid name clashes with e-TeX commands (HO) -Package etexcmds Info: Could not find \expanded. -(etexcmds) That can mean that you are not using pdfTeX 1.50 or -(etexcmds) that some package has redefined \expanded. -(etexcmds) In the latter case, load this package earlier. -Package: kvsetkeys 2012/04/25 v1.16 Key value parser (HO) -Package: kvdefinekeys 2011/04/07 v1.3 Define keys (HO) -Package: pdftexcmds 2011/11/29 v0.20 Utility functions of pdfTeX for LuaTeX (HO -) -Package pdftexcmds Info: LuaTeX not detected. -Package pdftexcmds Info: \pdf@primitive is available. -Package pdftexcmds Info: \pdf@ifprimitive is available. -Package pdftexcmds Info: \pdfdraftmode found. -Package: pdfescape 2011/11/25 v1.13 Implements pdfTeX's escape features (HO) -Package: bigintcalc 2012/04/08 v1.3 Expandable calculations on big integers (HO -) -Package: bitset 2011/01/30 v1.1 Handle bit-vector datatype (HO) -Package: uniquecounter 2011/01/30 v1.2 Provide unlimited unique counter (HO) -) -Package hobsub Info: Skipping package `hobsub' (already loaded). -Package: letltxmacro 2010/09/02 v1.4 Let assignment for LaTeX macros (HO) -Package: hopatch 2012/05/28 v1.2 Wrapper for package hooks (HO) -Package: xcolor-patch 2011/01/30 xcolor patch -Package: atveryend 2011/06/30 v1.8 Hooks at the very end of document (HO) -Package atveryend Info: \enddocument detected (standard20110627). -Package: atbegshi 2011/10/05 v1.16 At begin shipout hook (HO) -Package: refcount 2011/10/16 v3.4 Data extraction from label references (HO) -Package: hycolor 2011/01/30 v1.7 Color options for hyperref/bookmark (HO) -) -(/usr/share/texlive/texmf-dist/tex/generic/ifxetex/ifxetex.sty -Package: ifxetex 2010/09/12 v0.6 Provides ifxetex conditional -) -(/usr/share/texlive/texmf-dist/tex/latex/oberdiek/auxhook.sty -Package: auxhook 2011/03/04 v1.3 Hooks for auxiliary files (HO) -) -(/usr/share/texlive/texmf-dist/tex/latex/oberdiek/kvoptions.sty -Package: kvoptions 2011/06/30 v3.11 Key value format for package options (HO) -) -\@linkdim=\dimen105 -\Hy@linkcounter=\count90 -\Hy@pagecounter=\count91 - -(/usr/share/texlive/texmf-dist/tex/latex/hyperref/pd1enc.def -File: pd1enc.def 2012/11/06 v6.83m Hyperref: PDFDocEncoding definition (HO) -) -\Hy@SavedSpaceFactor=\count92 - -(/usr/share/texlive/texmf-dist/tex/latex/latexconfig/hyperref.cfg -File: hyperref.cfg 2002/06/06 v1.2 hyperref configuration of TeXLive -) -Package hyperref Info: Hyper figures OFF on input line 4443. -Package hyperref Info: Link nesting OFF on input line 4448. -Package hyperref Info: Hyper index ON on input line 4451. -Package hyperref Info: Plain pages OFF on input line 4458. -Package hyperref Info: Backreferencing OFF on input line 4463. -Package hyperref Info: Implicit mode ON; LaTeX internals redefined. -Package hyperref Info: Bookmarks ON on input line 4688. -\c@Hy@tempcnt=\count93 - -(/usr/share/texlive/texmf-dist/tex/latex/url/url.sty -\Urlmuskip=\muskip10 -Package: url 2013/09/16 ver 3.4 Verb mode for urls, etc. -) -LaTeX Info: Redefining \url on input line 5041. -\XeTeXLinkMargin=\dimen106 -\Fld@menulength=\count94 -\Field@Width=\dimen107 -\Fld@charsize=\dimen108 -Package hyperref Info: Hyper figures OFF on input line 6295. -Package hyperref Info: Link nesting OFF on input line 6300. -Package hyperref Info: Hyper index ON on input line 6303. 
-Package hyperref Info: backreferencing OFF on input line 6310. -Package hyperref Info: Link coloring OFF on input line 6315. -Package hyperref Info: Link coloring with OCG OFF on input line 6320. -Package hyperref Info: PDF/A mode OFF on input line 6325. -LaTeX Info: Redefining \ref on input line 6365. -LaTeX Info: Redefining \pageref on input line 6369. -\Hy@abspage=\count95 -\c@Item=\count96 -\c@Hfootnote=\count97 -) - -Package hyperref Message: Driver (autodetected): hpdftex. - -(/usr/share/texlive/texmf-dist/tex/latex/hyperref/hpdftex.def -File: hpdftex.def 2012/11/06 v6.83m Hyperref driver for pdfTeX -\Fld@listcount=\count98 -\c@bookmark@seq@number=\count99 - -(/usr/share/texlive/texmf-dist/tex/latex/oberdiek/rerunfilecheck.sty -Package: rerunfilecheck 2011/04/15 v1.7 Rerun checks for auxiliary files (HO) -Package uniquecounter Info: New unique counter `rerunfilecheck' on input line 2 -82. -) -\Hy@SectionHShift=\skip48 -) - -! LaTeX Error: File `multirow.sty' not found. - -Type X to quit or to proceed, -or enter new name. (Default extension: sty) - -Enter file name: -! Interruption. - - } -l.11 \usepackage - {tabularx}^^M -? qq -OK, entering \batchmode... - -(/usr/share/texlive/texmf-dist/tex/latex/tools/tabularx.sty -Package: tabularx 2014/10/28 v2.10 `tabularx' package (DPC) - -(/usr/share/texlive/texmf-dist/tex/latex/tools/array.sty -Package: array 2014/10/28 v2.4c Tabular extension package (FMi) -\col@sep=\dimen109 -\extrarowheight=\dimen110 -\NC@list=\toks15 -\extratabsurround=\skip49 -\backup@length=\skip50 -) -\TX@col@width=\dimen111 -\TX@old@table=\dimen112 -\TX@old@col=\dimen113 -\TX@target=\dimen114 -\TX@delta=\dimen115 -\TX@cols=\count100 -\TX@ftn=\toks16 -) -(/usr/share/texlive/texmf-dist/tex/latex/graphics/color.sty -Package: color 2016/01/03 v1.1b Standard LaTeX Color (DPC) - -(/usr/share/texlive/texmf-dist/tex/latex/latexconfig/color.cfg -File: color.cfg 2007/01/18 v1.5 color configuration of teTeX/TeXLive -) -Package color Info: Driver file: pdftex.def on input line 143. -) -(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty -Package: amsmath 2016/03/03 v2.15a AMS math features -\@mathmargin=\skip51 -For additional information on amsmath, use the `?' option. - -(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty -Package: amstext 2000/06/29 v2.01 AMS text - -(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty -File: amsgen.sty 1999/11/30 v2.0 generic functions -\@emptytoks=\toks17 -\ex@=\dimen116 -)) -(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty -Package: amsbsy 1999/11/29 v1.2d Bold Symbols -\pmbraise@=\dimen117 -) -(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty -Package: amsopn 1999/12/14 v2.01 operator names -) -\inf@bad=\count101 -LaTeX Info: Redefining \frac on input line 199. -\uproot@=\count102 -\leftroot@=\count103 -LaTeX Info: Redefining \overline on input line 297. -\classnum@=\count104 -\DOTSCASE@=\count105 -LaTeX Info: Redefining \ldots on input line 394. -LaTeX Info: Redefining \dots on input line 397. -LaTeX Info: Redefining \cdots on input line 518. -\Mathstrutbox@=\box27 -\strutbox@=\box28 -\big@size=\dimen118 -LaTeX Font Info: Redeclaring font encoding OML on input line 630. -LaTeX Font Info: Redeclaring font encoding OMS on input line 631. 
-\macc@depth=\count106 -\c@MaxMatrixCols=\count107 -\dotsspace@=\muskip11 -\c@parentequation=\count108 -\dspbrk@lvl=\count109 -\tag@help=\toks18 -\row@=\count110 -\column@=\count111 -\maxfields@=\count112 -\andhelp@=\toks19 -\eqnshift@=\dimen119 -\alignsep@=\dimen120 -\tagshift@=\dimen121 -\tagwidth@=\dimen122 -\totwidth@=\dimen123 -\lineht@=\dimen124 -\@envbody=\toks20 -\multlinegap=\skip52 -\multlinetaggap=\skip53 -\mathdisplay@stack=\toks21 -LaTeX Info: Redefining \[ on input line 2735. -LaTeX Info: Redefining \] on input line 2736. -) -(/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty -Package: amssymb 2013/01/14 v3.01 AMS font symbols - -(/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty -Package: amsfonts 2013/01/14 v3.01 Basic AMSFonts support -\symAMSa=\mathgroup4 -\symAMSb=\mathgroup5 -LaTeX Font Info: Overwriting math alphabet `\mathfrak' in version `bold' -(Font) U/euf/m/n --> U/euf/b/n on input line 106. -)) -(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsxtra.sty -Package: amsxtra 1999/11/15 v1.2c AMS extra commands -LaTeX Info: Redefining \nobreakspace on input line 54. -) -(/usr/share/texlive/texmf-dist/tex/latex/wasysym/wasysym.sty -Package: wasysym 2003/10/30 v2.0 Wasy-2 symbol support package -\symwasy=\mathgroup6 -LaTeX Font Info: Overwriting symbol font `wasy' in version `bold' -(Font) U/wasy/m/n --> U/wasy/b/n on input line 90. -) - -! LaTeX Error: File `isomath.sty' not found. - -Type X to quit or to proceed, -or enter new name. (Default extension: sty) - - Enter file name: -! Emergency stop. - - -l.19 \usepackage - {mathtools}^^M -*** (cannot \read from terminal in nonstop modes) - - -Here is how much of TeX's memory you used: - 7302 strings out of 494953 - 110129 string characters out of 6180977 - 257252 words of memory out of 5000000 - 10557 multiletter control sequences out of 15000+600000 - 3940 words of font info for 15 fonts, out of 8000000 for 9000 - 14 hyphenation exceptions out of 8191 - 41i,1n,66p,247b,47s stack positions out of 5000i,500n,10000p,200000b,80000s -! ==> Fatal error occurred, no output PDF file produced! 
diff --git a/doc/docx2latex/d7.5_4IP_1.0.tex b/doc/docx2latex/d7.5_4IP_1.0.tex deleted file mode 100644 index d1d90f74a45b2fca23a724dde83e676a4d48a6ef..0000000000000000000000000000000000000000 --- a/doc/docx2latex/d7.5_4IP_1.0.tex +++ /dev/null @@ -1,1590 +0,0 @@ -% docx2tex --- ``Garbage In, Garbage Out'' -% -% docx2tex is Open Source and -% you can download it on GitHub: -% https://github.com/transpect/docx2tex -% -\documentclass{scrbook} -\usepackage{graphicx} -\usepackage{hyperref} -\usepackage{multirow} -\usepackage{tabularx} -\usepackage{color} -\usepackage{amsmath} -\usepackage{amssymb} -\usepackage{amsfonts} -\usepackage{amsxtra} -\usepackage{wasysym} -\usepackage{isomath} -\usepackage{mathtools} -\usepackage{txfonts} -\usepackage{upgreek} -\usepackage{enumerate} -\usepackage{tensor} -\usepackage{pifont} - - - - - -\usepackage[ngerman]{babel} -\definecolor{color-1}{rgb}{0.91,0.9,0.9} -\definecolor{color-2}{rgb}{1,1,1} -\definecolor{color-3}{rgb}{0.85,0.85,0.85} -\definecolor{color-4}{rgb}{1,0,0} -\begin{document} -\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image1.jpeg} - -\textbf{E-Infrastructures} - -\textbf{H2020-EINFRA-2014-2015} - -\textbf{EINFRA-4-2014: Pan-European High Performance Computing} - -\textbf{Infrastructure and Services} - -\textbf{PRACE-4IP} - -\textbf{PRACE Fourth Implementation Phase Project}\label{ref-0001} - -\textbf{Grant Agreement Number: \label{ref-0002}EINFRA-653838} - -\textbf{D7.5}\label{ref-0003} - -\textbf{Application performance on accelerators}\label{ref-0004} - -\textbf{\textit{Final }} \label{ref-0005} - -Version: \label{ref-0006}1.0 - -Author(s): \label{ref-0007}Victor Cameo Ponz, CINES - -Date: 24.03.2016 - -\textbf{Project and Deliverable Information Sheet\label{ref-0008}} - -\begin{table} -\begin{tabularx}{\textwidth}{ -p{\dimexpr 0.23\linewidth-2\tabcolsep} -p{\dimexpr 0.29\linewidth-2\tabcolsep} -p{\dimexpr 0.48\linewidth-2\tabcolsep}} -\multirow{7}{*}{\textbf{PRACE Project}}& \multicolumn{2}{l}{\textbf{Project Ref. №:} \textbf{{\hyperref[ref-0002]{EINFRA-653838}}}} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Project Title:} \textbf{{\hyperref[ref-0001]{PRACE Fourth Implementation Phase Project}}}} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Project Web Site:} \href{http://www.prace-project.eu}{http://www.prace-project.eu}} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Deliverable ID:} {\textless} \textbf{{\hyperref[ref-0003]{D7.5}}}{\textgreater}} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Deliverable Nature:} {\textless}DOC\_TYPE: Report / Other{\textgreater}} \\ -\cline{2-2}\cline{3-3} & \multirow{1}{*}{\textbf{Dissemination Level:}\par PU}& \textbf{Contractual Date of Delivery:}\par $31 / 03 / 2017$ \\ -\cline{3-3} & & \textbf{Actual Date of Delivery:}\par DD / Month / YYYY \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{EC Project Officer:} \textbf{Leonardo Flores A\~{n}over}} \\ - -\end{tabularx} - -\end{table} - -* - The dissemination level are indicated as follows: \textbf{PU} -- Public, \textbf{CO} -- Confidential, only for members of the consortium (including the Commission Services) \textbf{CL} -- Classified, as referred to in Commission Decision 2991/844/EC. 
- -\textbf{Document Control Sheet\label{ref-0009}} - -\begin{table} -\begin{tabularx}{\textwidth}{ -p{\dimexpr 0.23\linewidth-2\tabcolsep} -p{\dimexpr 0.29\linewidth-2\tabcolsep} -p{\dimexpr 0.48\linewidth-2\tabcolsep}} -\multirow{5}{*}{\textbf{Document}}& \multicolumn{2}{l}{\textbf{Title:} \textbf{{\hyperref[ref-0004]{Application performance on accelerators}}}} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{ID:} \textbf{{\hyperref[ref-0003]{D7.5}}} } \\ -\cline{2-2}\cline{3-3} & \textbf{Version:} {\textless}{\hyperref[ref-0006]{1.0}}{\textgreater} & \textbf{Status:} \textbf{\textit{{\hyperref[ref-0005]{Final}}}} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Available at:} \href{http://www.prace-project.eu}{http://www.prace-project.eu}} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Software Tool:} Microsoft Word 2010} \\ -\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{File(s):} d7.5\_4IP\_1.0.docx} \\ -\multirow{3}{*}{\textbf{Authorship}}& \textbf{Written by:} & {\hyperref[ref-0007]{Victor Cameo Ponz,}}CINES \\ -\cline{2-2}\cline{3-3} & \textbf{Contributors:} & Adem Tekin, ITU\par Alan Grey, EPCC\par Andrew Emerson, CINECA\par Andrew Sunderland, STFC\par Arno Proeme, EPCC\par Charles Moulinec, STFC\par Dimitris Dellis, GRNET\par Fiona Reid, EPCC\par Gabriel Hautreux, INRIA\par Jacob Finkenrath, CyI\par James Clark, STFC\par Janko Strassburg, BSC\par Jorge Rodriguez, BSC\par Martti Louhivuori, CSC\par Philippe Segers, GENCI\par Valeriu Codreanu, SURFSARA \\ -\cline{2-2}\cline{3-3} & \textbf{Reviewed by:} & Filip Stanek, IT4I\par Thomas Eickermann, FZJ \\ -\cline{2-2}\cline{3-3} & \textbf{Approved by:} & MB/TB \\ - -\end{tabularx} - -\end{table} - -\textbf{Document Status Sheet\label{ref-0010}} - -\begin{table} -\begin{tabularx}{\textwidth}{ -p{\dimexpr 0.24\linewidth-2\tabcolsep} -p{\dimexpr 0.24\linewidth-2\tabcolsep} -p{\dimexpr 0.24\linewidth-2\tabcolsep} -p{\dimexpr 0.29\linewidth-2\tabcolsep}} -\textbf{Version} & \textbf{Date} & \textbf{Status} & \textbf{Comments} \\ -0.1 & $13/03/2017$ & Draft & First revision \\ -0.2 & $15/03/2017$ & Draft & Include remark of the first review + new figures \\ -1.0 & $24/03/2017$ & Final version & Improved the application performance section \\ - -\end{tabularx} - -\end{table} - -\textbf{Document Keywords \label{ref-0011}} - -\begin{table} -\begin{tabularx}{\textwidth}{ -p{\dimexpr 0.24\linewidth-2\tabcolsep} -p{\dimexpr 0.76\linewidth-2\tabcolsep}} -\textbf{Keywords:} & PRACE, HPC, Research Infrastructure, Accelerators, GPU, Xeon Phi, Benchmark suite \\ - -\end{tabularx} - -\end{table} - -\textbf{Disclaimer} - -This deliverable has been prepared by the responsible Work Package of the Project in accordance with the Consortium Agreement and the Grant Agreement n$^{\circ}$ {\hyperref[ref-0002]{EINFRA-653838}}. It solely reflects the opinion of the parties to such agreements on a collective basis in the context of the Project and to the extent foreseen in such agreements. Please note that even though all participants to the Project are members of PRACE AISBL, this deliverable has not been approved by the Council of PRACE AISBL and therefore does not emanate from it nor should it be considered to reflect PRACE AISBL's individual opinion. - -\begin{table} -\begin{tabularx}{\textwidth}{ -p{\dimexpr 1\linewidth-2\tabcolsep}} -\textbf{Copyright notices}\par {\textcopyright} 2016 PRACE Consortium Partners. All rights reserved. This document is a project document of the PRACE project. 
All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contract {\hyperref[ref-0002]{EINFRA-653838}} for reviewing and dissemination purposes. \par All trademarks and other rights on third party products mentioned in this document are acknowledged as own by the respective holders. \\ - -\end{tabularx} - -\end{table} - -\textbf{Table of Contents\label{ref-0012}} - -\textbf{Project and Deliverable Information Sheet \pageref{ref-0008}} - -\textbf{Document Control Sheet \pageref{ref-0009}} - -\textbf{Document Status Sheet \pageref{ref-0010}} - -\textbf{Document Keywords \pageref{ref-0011}} - -\textbf{Table of Contents \pageref{ref-0012}} - -\textbf{List of Figures \pageref{ref-0013}} - -\textbf{List of Tables \pageref{ref-0014}} - -\textbf{References and Applicable Documents \pageref{ref-0015}} - -\textbf{List of Acronyms and Abbreviations \pageref{ref-0036}} - -\textbf{List of Project Partner Acronyms \pageref{ref-0037}} - -\textbf{Executive Summary \pageref{ref-0038}} - -\textbf{1 Introduction \pageref{ref-0039}} - -\textbf{2 Targeted architectures \pageref{ref-0041}} - -\textbf{2.1 Co-processor description \pageref{ref-0042}} - -\textbf{2.2 Systems description \pageref{ref-0045}} - -\textit{2.2.1 Cartesius K40 \pageref{ref-0047}} - -\textit{2.2.2 MareNostrum KNC \pageref{ref-0048}} - -\textit{2.2.3 Ouessant P100 \pageref{ref-0049}} - -\textit{2.2.4 Frioul KNL \pageref{ref-0050}} - -\textbf{3 Benchmark suite description \pageref{ref-0052}} - -\textbf{3.1 Alya \pageref{ref-0055}} - -\textit{3.1.1 Code description \pageref{ref-0056}} - -\textit{3.1.2 Test cases description \pageref{ref-0057}} - -\textbf{3.2 Code\_Saturne \pageref{ref-0058}} - -\textit{3.2.1 Code description \pageref{ref-0059}} - -\textit{3.2.2 Test cases description \pageref{ref-0060}} - -\textbf{3.3 CP2K \pageref{ref-0061}} - -\textit{3.3.1 Code description \pageref{ref-0062}} - -\textit{3.3.2 Test cases description \pageref{ref-0063}} - -\textbf{3.4 GPAW \pageref{ref-0064}} - -\textit{3.4.1 Code description \pageref{ref-0065}} - -\textit{3.4.2 Test cases description \pageref{ref-0066}} - -\textbf{3.5 GROMACS \pageref{ref-0067}} - -\textit{3.5.1 Code description \pageref{ref-0068}} - -\textit{3.5.2 Test cases description \pageref{ref-0069}} - -\textbf{3.6 NAMD \pageref{ref-0070}} - -\textit{3.6.1 Code description \pageref{ref-0071}} - -\textit{3.6.2 Test cases description \pageref{ref-0072}} - -\textbf{3.7 PFARM \pageref{ref-0073}} - -\textit{3.7.1 Code description \pageref{ref-0074}} - -\textit{3.7.2 Test cases description \pageref{ref-0075}} - -\textbf{3.8 QCD \pageref{ref-0076}} - -\textit{3.8.1 Code description \pageref{ref-0077}} - -\textit{3.8.2 Test cases description \pageref{ref-0078}} - -\textbf{3.9 Quantum Espresso \pageref{ref-0079}} - -\textit{3.9.1 Code description \pageref{ref-0080}} - -\textit{3.9.2 Test cases description \pageref{ref-0081}} - -\textbf{3.10 Synthetic benchmarks -- SHOC \pageref{ref-0082}} - -\textit{3.10.1 Code description \pageref{ref-0083}} - -\textit{3.10.2 Test cases description \pageref{ref-0084}} - -\textbf{3.11 SPECFEM3D \pageref{ref-0085}} - -\textit{3.11.1 Test cases definition \pageref{ref-0086}} - -\textbf{4 Applications performances \pageref{ref-0088}} - -\textbf{4.1 Alya \pageref{ref-0089}} - -\textbf{4.2 Code\_Saturne \pageref{ref-0094}} - -\textbf{4.3 CP2K \pageref{ref-0101}} - -\textbf{4.4 GPAW \pageref{ref-0104}} - -\textbf{4.5 GROMACS \pageref{ref-0110}} - 
-\textbf{4.6 NAMD \pageref{ref-0113}} - -\textbf{4.7 PFARM \pageref{ref-0116}} - -\textbf{4.8 QCD \pageref{ref-0123}} - -\textit{4.8.1 First implementation \pageref{ref-0124}} - -\textit{4.8.2 Second implementation \pageref{ref-0131}} - -\textbf{4.9 Quantum Espresso \pageref{ref-0142}} - -\textbf{4.10 Synthetic benchmarks (SHOC) \pageref{ref-0152}} - -\textbf{4.11 SPECFEM3D \pageref{ref-0155}} - -\textbf{5 Conclusion and future work \pageref{ref-0158}} - -\textbf{List of Figures\label{ref-0013}} - -{\hyperref[ref-0090]{Figure 1 Shows the matrix construction part of Alya that is parallelised with OpenMP and benefits significantly from the many cores available on KNL.}} {\hyperref[ref-0090]{ }} - -{\hyperref[ref-0091]{Figure 2 Demonstrates the scalability of the code. As expected Haswell cores with K80 GPU are high-performing while the KNL port is currently being optimized further.}} {\hyperref[ref-0091]{ }} - -{\hyperref[ref-0093]{Figure 3 Best performance is achieved with GPU in combination with powerful CPU cores. Single thread performance has a big impact on the speedup, both threading and vectorization are employed for additional performance.}} {\hyperref[ref-0093]{ }} - -{\hyperref[ref-0096]{Figure 4 Code_Saturne's performance on KNL. AMG is used as a solver in V4.2.2.}} {\hyperref[ref-0096]{ }} - -{\hyperref[ref-0103]{Figure 5 Test case 1 of CP2K on the ARCHER cluster}} {\hyperref[ref-0103]{ }} - -{\hyperref[ref-0109]{Figure 6 Relative performance (to / t) of GPAW is shown for parallel jobs using an increasing number of CPU (blue) or Xeon Phi KNC (red). Single CPU SCF-cycle runtime (to) was used as the baseline for the normalisation. Ideal scaling is shown as a linear dashed line for comparison. Case 1 (Carbon Nanotube) is shown with square markers and Case 2 (Copper Filament) is shown with round markers.}} {\hyperref[ref-0109]{ }} - -{\hyperref[ref-0111]{Figure 7 Scalability for GROMACS test case GluCL Ion Channel}} {\hyperref[ref-0111]{ }} - -{\hyperref[ref-0112]{Figure 8 Scalability for GROMACS test case Lignocellulose}} {\hyperref[ref-0112]{ }} - -{\hyperref[ref-0114]{Figure 9 Scalability for NAMD test case STMV.8M}} {\hyperref[ref-0114]{ }} - -{\hyperref[ref-0115]{Figure 10 Scalability for NAMD test case STMV.28M}} {\hyperref[ref-0115]{ }} - -{\hyperref[ref-0118]{Figure 11 Eigensolver performance on KNL and GPU}} {\hyperref[ref-0118]{ }} - -{\hyperref[ref-0126]{Figure 12 Small test case results for QCD, first implementation}} {\hyperref[ref-0126]{ }} - -{\hyperref[ref-0128]{Figure 13 Large test case results for QCD, first implementation}} {\hyperref[ref-0128]{ }} - -{\hyperref[ref-0130]{Figure 14 shows the time taken by the full MILC 64x64x64x8 test cases on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures.}} {\hyperref[ref-0130]{ }} - -{\hyperref[ref-0133]{Figure 15 Result of second implementation of QCD on K40m GPU}} {\hyperref[ref-0133]{ }} - -{\hyperref[ref-0135]{Figure 16 Result of second implementation of QCD on P100 GPU}} {\hyperref[ref-0135]{ }} - -{\hyperref[ref-0137]{Figure 17 Result of second implementation of QCD on P100 GPU on larger test case}} {\hyperref[ref-0137]{ }} - -{\hyperref[ref-0139]{Figure 18 Result of second implementation of QCD on KNC}} {\hyperref[ref-0139]{ }} - -{\hyperref[ref-0141]{Figure 19 Result of second implementation of QCD on KNL}} {\hyperref[ref-0141]{ }} - -{\hyperref[ref-0144]{Figure 20 Scalability of Quantum Espresso on GPU for test case 1}} {\hyperref[ref-0144]{ }} - -{\hyperref[ref-0146]{Figure 
21 Scalability of Quantum Espresso on GPU for test case 2}} {\hyperref[ref-0146]{ }} - -{\hyperref[ref-0148]{Figure 22 Scalability of Quantum Espresso on KNL for test case 1}} {\hyperref[ref-0148]{ }} - -{\hyperref[ref-0150]{Figure 23 Quantum Espresso - KNL vs BDW vs BGQ (at scale)}} {\hyperref[ref-0150]{ }} - -\textbf{List of Tables\label{ref-0014}} - -Table 1 Main co-processors specifications \pageref{ref-0044} - -Table 2 Codes and corresponding APIs available (in green) \pageref{ref-0054} - -Table 3 Performance of Code\_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK. The speedup is computed at the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks. \pageref{ref-0098} - -Table 4 Performance of Code\_Saturne and PETSc on 1 node of KNL. PETSc is built on the MKL library \pageref{ref-0100} - -Table 5 GPAW runtimes (in seconds) for the smaller benchmark (Carbon Nanotube) measured on several architectures when using n sockets (i.e. processors or accelerators). \pageref{ref-0106} - -Table 6 GPAW runtimes (in seconds) for the larger benchmark (Copper Filament) measured on several architectures when using n sockets (i.e. processors or accelerators). *Due to memory limitations on the GPU the grid spacing was increased from 0.22 to 0.28 to have a sparser grid. To account for this in the comparison, the K40 and K80 runtimes have been scaled up using a corresponding CPU runtime as a yardstick (scaling factor q=2.1132). \pageref{ref-0108} - -Table 7 Overall EXDIG runtime performance on various accelerators (runtime, secs) \pageref{ref-0120} - -Table 8 Overall EXDIG runtime parallel performance using MPI-GPU version \pageref{ref-0122} - -Table 9 Synthetic benchmarks results on GPU and Xeon Phi \pageref{ref-0154} - -Table 10 SPECFEM 3D GLOBE results (run time in second) \pageref{ref-0156} - -\textbf{References and Applicable Documents\label{ref-0015}} - -\begin{enumerate}[1] - -\item \href{http://www.prace-ri.eu}{\label{ref-0016}http://www.prace-ri.eu} - -\item The Unified European Application Benchmark Suite -- \href{http://www.prace-ri.eu/ueabs/}{http://www.prace-ri.eu/ueabs/}\label{ref-0017} - -\item D7.4 Unified European Applications Benchmark Suite -- Mark Bull et al. -- 2013\label{ref-0018} - -\item \href{http://www.nvidia.com/object/quadro-design-and-manufacturing.html}{\label{ref-0019}http://www.nvidia.com/object/quadro-design-and-manufacturing.html} - -\item \href{https://userinfo.surfsara.nl/systems/cartesius/description}{https://userinfo.surfsara.nl/systems/cartesius/description}\label{ref-0020} - -\item MareNostrum III User's Guide Barcelona Supercomputing Center -- \href{https://www.bsc.es/support/MareNostrum3-ug.pdf}{https://www.bsc.es/support/MareNostrum3-ug.pdf}\label{ref-0021} - -\item \href{http://www.idris.fr/eng/ouessant/}{http://www.idris.fr/eng/ouessant/}\label{ref-0022} - -\item PFARM reference -- \href{https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm_long_lug.pdf}{https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm\_long\_lug.pdf}\label{ref-0023} - -\item Solvent-Driven Preferential Association of Lignin with Regions of Crystalline Cellulose in Molecular Dynamics Simulation -- Benjamin Lindner et al. 
-- Biomacromolecules, 2013\label{ref-0024} - -\item NAMD website -- \href{http://www.ks.uiuc.edu/Research/namd/}{http://www.ks.uiuc.edu/Research/namd/}\label{ref-0025} - -\item SHOC source repository -- \href{https://github.com/vetter/shoc}{https://github.com/vetter/shoc}\label{ref-0026} - -\item Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics -- R. Babbich, M. Clark and B. Joo -- SC 10 (Supercomputing 2010)\label{ref-0027} - -\item Lattice QCD on Intel Xeon Phi -- B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III -- International Supercomputing Conference (ISC'13), 2013\label{ref-0028} - -\item Extension of fractional step techniques for incompressible flows: The preconditioned Orthomin(1) for the pressure Schur complement -- G. Houzeaux, R. Aubry, and M. V\'{a}zquez -- Computers \& Fluids, 44:297-313, 2011\label{ref-0029} - -\item MIMD Lattice Computation (MILC) Collaboration -- \href{http://physics.indiana.edu/~sg/milc.html}{http://physics.indiana.edu/\textasciitilde{}sg/milc.html}\label{ref-0030} - -\item targetDP -- \href{https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README}{https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README}\label{ref-0031} - -\item QUDA: A library for QCD on GPU -- \href{https://lattice.github.io/quda/}{https://lattice.github.io/quda/}\label{ref-0032} - -\item QPhiX, QCD for Intel Xeon Phi and Xeon processors -- \href{http://jeffersonlab.github.io/qphix/}{http://jeffersonlab.github.io/qphix/}\label{ref-0033} - -\item KNC MaxFlops issue (both SP and DP) -- \href{https://github.com/vetter/shoc/issues/37}{https://github.com/vetter/shoc/issues/37}\label{ref-0034} - -\item \label{ref-0035}KNC SpMV issue -- https://github.com/vetter/shoc/issues/24, https://github.com/vetter/shoc/issues/23. - -\end{enumerate} - -\textbf{List of Acronyms and Abbreviations\label{ref-0036}} - -\begin{description} -\item[aisbl]Association International Sans But Lucratif \newline - (legal form of the PRACE-RI) - -\item[BCO]Benchmark Code Owner - -\end{description} - -CoE Center of Excellence - -\begin{description} -\item[CPU]Central Processing Unit - -\item[CUDA]Compute Unified Device Architecture (NVIDIA) - -\item[DARPA]Defense Advanced Research Projects Agency - -\item[DEISA]Distributed European Infrastructure for Supercomputing Applications EU project by leading national HPC centres - -\end{description} - -DoA Description of Action (formerly known as DoW) - -\begin{description} -\item[EC]European Commission - -\item[EESI]European Exascale Software Initiative - -\end{description} - -EoI Expression of Interest - -\begin{description} -\item[ ESFRI]European Strategy Forum on Research Infrastructures - -\item[GB]Giga (= 2$^{\mathrm{30}}$ \textasciitilde{} 10$^{\mathrm{9}}$) Bytes (= 8 bits), also GByte - -\end{description} - -Gb/s Giga (= 10$^{\mathrm{9}}$) bits per second, also Gbit/s - -GB/s Giga (= 10$^{\mathrm{9}}$) Bytes (= 8 bits) per second, also GByte/s - -\begin{description} -\item[ G\'{E}ANT]Collaboration between National Research and Education Networks to build a multi-gigabit pan-European network. The current EC-funded project as of 2015 is GN4. - -\item[GFlop/s]Giga (= 10$^{\mathrm{9}}$) Floating point operations (usually in 64-bit, i.e. 
DP) per second, also GF/s - -\end{description} - -GHz Giga (= 10$^{\mathrm{9}}$) Hertz, frequency =10$^{\mathrm{9}}$ periods or clock cycles per second - -\begin{description} -\item[GPU]Graphic Processing Unit - -\item[ HET]High Performance Computing in Europe Taskforce. Taskforce by representatives from European HPC community to shape the European HPC Research Infrastructure. Produced the scientific case and valuable groundwork for the PRACE project. - -\item[HMM]Hidden Markov Model - -\item[HPC]High Performance Computing; Computing at a high performance level at any given time; often used synonym with Supercomputing - -\item[HPL]High Performance LINPACK - -\item[ ISC]International Supercomputing Conference; European equivalent to the US based SCxx conference. Held annually in Germany. - -\item[KB]Kilo (= 2$^{\mathrm{10}}$ \textasciitilde{}10$^{\mathrm{3}}$) Bytes (= 8 bits), also KByte - -\item[LINPACK]Software library for Linear Algebra - -\item[MB]Management Board (highest decision making body of the project) - -\item[MB]Mega (= 2$^{\mathrm{20}}$ \textasciitilde{} 10$^{\mathrm{6}}$) Bytes (= 8 bits), also MByte - -\end{description} - -MB/s Mega (= 10$^{\mathrm{6}}$) Bytes (= 8 bits) per second, also MByte/s - -MFlop/sMega (= 10$^{\mathrm{6}}$) Floating point operations (usually in 64-bit, i.e. DP) per second, also MF/s - -MooC Massively open online Course - -MoU Memorandum of Understanding - -\begin{description} -\item[MPI]Message Passing Interface - -\item[ NDA]Non-Disclosure Agreement. Typically signed between vendors and customers working together on products prior to their general availability or announcement. - -\item[ PA]Preparatory Access (to PRACE resources) - -\item[ PATC]PRACE Advanced Training Centres - -\item[PRACE]Partnership for Advanced Computing in Europe; Project Acronym - -\item[PRACE 2]The upcoming next phase of the PRACE Research Infrastructure following the initial five year period. - -\item[PRIDE]Project Information and Dissemination Event - -\item[RI]Research Infrastructure - -\item[TB]Technical Board (group of Work Package leaders) - -\item[TB]Tera (= 240 \textasciitilde{} 1012) Bytes (= 8 bits), also TByte - -\item[TCO]Total Cost of Ownership. Includes recurring costs (e.g. personnel, power, cooling, maintenance) in addition to the purchase cost. - -\item[TDP]Thermal Design Power - -\item[TFlop/s]Tera (= 1012) Floating-point operations (usually in 64-bit, i.e. DP) per second, also TF/s - -\item[Tier-0]Denotes the apex of a conceptual pyramid of HPC systems. In this context the Supercomputing Research Infrastructure would host the Tier-0 systems; national or topical HPC centres would constitute Tier-1 - -\item[UNICORE]Uniform Interface to Computing Resources. Grid software for seamless access to distributed resources. 
- -\end{description} - -\textbf{List of Project Partner Acronyms\label{ref-0037}} - -\begin{description} -\item[ BADW-LRZ]Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Germany (3$^{\mathrm{rd}}$ Party to GCS) - -\item[ BILKENT]Bilkent University, Turkey (3$^{\mathrm{rd}}$ Party to UYBHM) - -\item[ BSC]Barcelona Supercomputing Center - Centro Nacional de Supercomputacion, Spain - -\item[ CaSToRC]Computation-based Science and Technology Research Center, Cyprus - -\item[ CCSAS]Computing Centre of the Slovak Academy of Sciences, Slovakia - -\item[ CEA]Commissariat \`{a} l'Energie Atomique et aux Energies Alternatives, France (3$^{\mathrm{ rd}}$ Party to GENCI) - -\item[ CESGA]Fundacion Publica Gallega Centro Tecnol\'{o}gico de Supercomputaci\'{o}n de Galicia, Spain, (3$^{\mathrm{rd}}$ Party to BSC) - -\item[ CINECA]CINECA Consorzio Interuniversitario, Italy - -\item[ CINES]Centre Informatique National de l'Enseignement Sup\'{e}rieur, France (3$^{\mathrm{ rd}}$ Party to GENCI) - -\item[ CNRS]Centre National de la Recherche Scientifique, France (3$^{\mathrm{ rd}}$ Party to GENCI) - -\item[ CSC]CSC Scientific Computing Ltd., Finland - -\item[ CSIC]Spanish Council for Scientific Research (3$^{\mathrm{rd}}$ Party to BSC) - -\item[ CYFRONET]Academic Computing Centre CYFRONET AGH, Poland (3rd party to PNSC) - -\item[ EPCC]EPCC at The University of Edinburgh, UK - -\item[ ETHZurich (CSCS)]Eidgen\"{o}ssische Technische Hochschule Z\"{u}rich -- CSCS, Switzerland - -\item[ FIS]FACULTY OF INFORMATION STUDIES, Slovenia (3$^{\mathrm{rd}}$ Party to ULFME) - -\item[ GCS]Gauss Centre for Supercomputing e.V. - -\item[ GENCI]Grand Equipement National de Calcul Intensiv, France - -\item[ GRNET]Greek Research and Technology Network, Greece - -\item[ INRIA]Institut National de Recherche en Informatique et Automatique, France (3$^{\mathrm{ rd}}$ Party to GENCI) - -\item[ IST]Instituto Superior T\'{e}cnico, Portugal (3rd Party to UC-LCA) - -\item[ IUCC]INTER UNIVERSITY COMPUTATION CENTRE, Israel - -\item[ JKU]Institut fuer Graphische und Parallele Datenverarbeitung der Johannes Kepler Universitaet Linz, Austria - -\item[ JUELICH]Forschungszentrum Juelich GmbH, Germany - -\item[ KTH]Royal Institute of Technology, Sweden (3$^{\mathrm{ rd}}$ Party to SNIC) - -\item[ LiU]Linkoping University, Sweden (3$^{\mathrm{ rd}}$ Party to SNIC) - -\item[ NCSA]NATIONAL CENTRE FOR SUPERCOMPUTING APPLICATIONS, Bulgaria - -\item[ NIIF]National Information Infrastructure Development Institute, Hungary - -\item[ NTNU]The Norwegian University of Science and Technology, Norway (3$^{\mathrm{rd}}$ Party to SIGMA) - -\item[ NUI-Galway]National University of Ireland Galway, Ireland - -\item[ PRACE]Partnership for Advanced Computing in Europe aisbl, Belgium - -\item[ PSNC]Poznan Supercomputing and Networking Center, Poland - -\item[ RISCSW]RISC Software GmbH - -\item[ RZG]Max Planck Gesellschaft zur F\"{o}rderung der Wissenschaften e.V., Germany (3$^{\mathrm{ rd}}$ Party to GCS) - -\item[ SIGMA2]UNINETT Sigma2 AS, Norway - -\item[ SNIC]Swedish National Infrastructure for Computing (within the Swedish Science Council), Sweden - -\item[ STFC]Science and Technology Facilities Council, UK (3$^{\mathrm{rd}}$ Party to EPSRC) - -\item[ SURFsara]Dutch national high-performance computing and e-Science support center, part of the SURF cooperative, Netherlands - -\item[ UC-LCA]Universidade de Coimbra, Labotat\'{o}rio de Computa\c{c}\~{a}o Avan\c{c}ada, Portugal - -\item[ UCPH]K\o{}benhavns Universitet, Denmark - -\item[ UHEM]Istanbul 
Technical University, Ayazaga Campus, Turkey - -\item[ UiO]University of Oslo, Norway (3$^{\mathrm{rd}}$ Party to SIGMA) - -\item[ ULFME]UNIVERZA V LJUBLJANI, Slovenia - -\item[ UmU]Umea University, Sweden (3$^{\mathrm{ rd}}$ Party to SNIC) - -\item[ UnivEvora]Universidade de \'{E}vora, Portugal (3rd Party to UC-LCA) - -\item[ UPC]Universitat Polit\`{e}cnica de Catalunya, Spain (3rd Party to BSC) - -\item[ UPM/CeSViMa]Madrid Supercomputing and Visualization Center, Spain (3$^{\mathrm{rd}}$ Party to BSC) - -\item[ USTUTT-HLRS]Universitaet Stuttgart -- HLRS, Germany (3rd Party to GCS) - -\item[ VSB-TUO]VYSOKA SKOLA BANSKA - TECHNICKA UNIVERZITA OSTRAVA, Czech Republic - -\item[ WCNS]Politechnika Wroclawska, Poland (3rd party to PSNC) - -\end{description} - -\textbf{Executive Summary\label{ref-0038}} - -This document describes an accelerator benchmark suite, a set of 11 codes that includes 1 synthetic benchmark and 10 commonly used applications. The key focus of this task has been exploiting accelerators or co-processors to improve the performance of real applications. It aims at providing a set of scalable, currently relevant and publicly available codes and datasets. - -This work has been undertaken by Task 7.2B ``Accelerator Benchmarks'' in the PRACE Fourth Implementation Phase (PRACE-4IP) project. - -Most of the selected applications are a subset of the Unified European Applications Benchmark Suite (UEABS) {\hyperref[ref-0017]{[2]}}{\hyperref[ref-0018]{[3]}}. One application and a synthetic benchmark have been added. - -As a result, the selected codes are: Alya, Code\_Saturne, CP2K, GROMACS, GPAW, NAMD, PFARM, QCD, Quantum Espresso, SHOC and SPECFEM3D. - -For each code, two or more test case datasets have been selected. These are described in this document, along with a brief introduction to the application codes themselves. For each code, some sample results are presented, from first runs on leading-edge systems and prototypes. - -\section{1 Introduction\label{ref-0039}} - -The work produced within this task is an extension of the UEABS for accelerators. This document covers each code, presenting the code itself as well as the test cases defined for the benchmarks and the first results that have been recorded on various accelerator systems. - -Like the UEABS, this suite aims to present results for the many scientific fields that can use accelerated HPC resources. Hence, it will help the European scientific communities decide which infrastructures they could procure in the near future. We focus on Intel Xeon Phi co-processors and NVIDIA GPU cards for benchmarking, as they are the two most widespread accelerator resources currently available. - -Section {\hyperref[ref-0040]{2}} presents both types of accelerator systems, Xeon Phi and GPU cards, along with example architectures. Section {\hyperref[ref-0051]{3}} gives a description of each of the selected applications, together with the test case datasets, while section {\hyperref[ref-0087]{4}} presents some sample results. Section {\hyperref[ref-0157]{5}} outlines further work on, and using, the suite. - -\section{2 Targeted architectures\label{ref-0040}\label{ref-0041}} - -This suite targets accelerator cards, more specifically the Intel Xeon Phi and NVIDIA GPU architectures. This section briefly describes them and presents the 4 machines the benchmarks were run on. - -\subsection{2.1 Co-processor description\label{ref-0042}} - -Scientific computing using co-processors has gained popularity in recent years. 
First, the utility of GPUs was demonstrated and evaluated in several application domains {\hyperref[ref-0019]{[4]}}. As a response to NVIDIA's supremacy in this field, Intel designed the Xeon Phi cards. - -Architectures and programming models of co-processors may differ from those of CPUs and vary among co-processor types. The main challenges are the high degree of parallelism required from the software and the fact that code may have to be offloaded to the accelerator card. - -{\hyperref[ref-0043]{Table 1}} illustrates this: - -\begin{table} -\begin{tabularx}{\textwidth}{ -p{\dimexpr 0.24\linewidth-2\tabcolsep} -p{\dimexpr 0.18\linewidth-2\tabcolsep} -p{\dimexpr 0.2\linewidth-2\tabcolsep} -p{\dimexpr 0.2\linewidth-2\tabcolsep} -p{\dimexpr 0.19\linewidth-2\tabcolsep}} - & \multicolumn{2}{l}{Intel Xeon Phi} & \multicolumn{2}{l}{NVIDIA GPU} \\ - & 5110P (KNC) & 7250 (KNL) & K40m & P100 \\ -public availability date & Nov-12 & Jun-16 & Jun-13 & May-16 \\ -theoretical peak performance & 1,011 GF/s & 3,046 GF/s & 1,430 GF/s & 5,300 GF/s \\ -offload & possible & not possible & required & required \\ -max number of threads / CUDA cores & 240 & 272 & 2880 & 3584 \\ - -\end{tabularx} - -\end{table} - -\textbf{Table 1 Main co-processors specifications\label{ref-0043}\label{ref-0044}} - -\subsection{2.2 Systems description\label{ref-0045}} - -The benchmark suite has been officially granted access to 4 different machines hosted by PRACE partners. Most results presented in this paper were obtained on these machines, but some of the simulations were run on similar ones. This section covers the specifications of the 4 official systems mentioned above, while the few other machines are presented along with the corresponding results. - -As can be seen from the previous section, these leading-edge architectures have only recently become available and some codes could not yet be run on them. The results will be completed in the near future and delivered with an update of the benchmark suite. Still, the performance figures presented here are a good indicator of the potential efficiency of the codes on both Xeon Phi and NVIDIA GPU platforms. - -As for the future, the PRACE-3IP PCP is in its third and last phase and will be a good candidate to provide access to bigger machines. The following suppliers have been awarded a contract: ATOS/Bull SAS (France), E4 Computer Engineering (Italy) and Maxeler Technologies (UK), providing pilots using Xeon Phi, OpenPOWER and FPGA technologies. During this final phase, which started in October 2016, the contractors have to deploy pilot systems with a compute capability of around 1 PFlop/s, to demonstrate the technology readiness of the proposed solutions and the progress in terms of energy efficiency, using high-frequency monitoring designed for this purpose. These results will be evaluated on a subset of applications from the UEABS (NEMO, SPECFEM3D, Quantum Espresso, BQCD). Access to these systems is foreseen to be open to PRACE partners, with a special interest for the 4IP-WP7 task on accelerated benchmarks. 
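Before describing the individual systems, the sketch below makes the offload distinction of {\hyperref[ref-0043]{Table 1}} more concrete. It is a minimal, hedged illustration in C using standard OpenMP \texttt{target} directives and is not taken from any of the benchmark codes: with an offload-capable compiler and runtime, the annotated loop is shipped to an attached co-processor (a KNC, or a GPU, the latter more commonly programmed with CUDA or OpenCL), whereas on a self-hosted KNL the same loop simply runs natively on the host cores.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

/* Minimal offload illustration (not part of the benchmark suite):
 * the loop body is identical whether it runs natively on the host or
 * a self-hosted KNL, or is offloaded to an attached co-processor.   */
int main(void)
{
    const int n = 1 << 20;
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);

    for (int i = 0; i < n; ++i) { a[i] = 0.0; b[i] = (double)i; }

    /* With offload support this region is executed on the device;
     * otherwise it simply executes on the host cores.              */
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[42] = %f\n", a[42]);
    free(a);
    free(b);
    return 0;
}
\end{verbatim}

The same portability concern is what the individual codes address with their own mechanisms (CUDA, OpenCL, OpenMP offload or abstraction layers such as targetDP, described in Section {\hyperref[ref-0051]{3}}).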
- -\subsubsection{\raisebox{-0pt}{2.2.1} Cartesius K40\label{ref-0046}\label{ref-0047}} - -The SURFsara institute in The Netherlands granted access to Cartesius, which has a GPU island (installed May 2014) with the following specifications {\hyperref[ref-0020]{[5]}}: - -\begin{itemize} -\item 66 Bullx B515 GPU-accelerated nodes, each with: - -\begin{itemize} - -\item 2x 8-core 2.5 GHz Intel Xeon E5-2450 v2 (Ivy Bridge) CPU/node - -\item 2x NVIDIA Tesla K40m GPU/node - -\item 96 GB/node, DDR3-1600 RAM - -\end{itemize} - -\item Total theoretical peak performance (Ivy Bridge + K40m) of 1,056 cores + 132 GPUs: 210 TF/s - -\end{itemize} -The interconnect has a fully non-blocking fat-tree topology. Every node has two ConnectX-3 InfiniBand FDR adapters: one per GPU. - -\subsubsection{\raisebox{-0pt}{2.2.2} MareNostrum KNC\label{ref-0048}} - -The Barcelona Supercomputing Center (BSC) in Spain granted access to MareNostrum III, which features KNC nodes (upgraded June 2013). This partition is described as follows {\hyperref[ref-0021]{[6]}}: - -\begin{itemize} -\item 42 hybrid nodes, each containing: - -\begin{itemize} - -\item 2x 8-core Sandy Bridge-EP E5-2670 host processors - -\item 8x 8 GB DDR3-1600 DIMMs (4 GB/core), total: 64 GB/node - -\item 2x Xeon Phi 5110P accelerators - -\end{itemize} - -\item Interconnection networks: - -\begin{itemize} - -\item InfiniBand Mellanox FDR10: high-bandwidth network used for parallel application communications (MPI) - -\item Gigabit Ethernet: 10 Gbit Ethernet network used by the GPFS filesystem. - -\end{itemize} - -\end{itemize} - -\subsubsection{\raisebox{-0pt}{2.2.3} Ouessant P100\label{ref-0049}} - -GENCI granted access to the Ouessant prototype at IDRIS in France (installed September 2016). It is composed of 12 IBM Minsky compute nodes, each containing {\hyperref[ref-0022]{[7]}}: - -\begin{itemize} -\item Compute node: - -\begin{itemize} - -\item 2 POWER8+ sockets, each with 10 cores and 8 threads per core (i.e. 160 threads per node) - -\item 128 GB of DDR4 memory (bandwidth {\textgreater} 9 GB/s per core) - -\item 4 NVIDIA new-generation Pascal P100 GPUs, each with 16 GB of HBM2 memory - -\end{itemize} - -\item Interconnect: - -\begin{itemize} - -\item 4 NVLink interconnects (40 GB/s of bi-directional bandwidth per interconnect); each GPU card is connected to a CPU with 2 NVLink interconnects and to another GPU with the 2 remaining interconnects - -\item A Mellanox EDR InfiniBand CAPI interconnect network (1 interconnect per node) - -\end{itemize} - -\end{itemize} - -\subsubsection{\raisebox{-0pt}{2.2.4} Frioul KNL\label{ref-0050}} - -GENCI also granted access to the Frioul prototype at CINES in France (installed December 2016). It is composed of 48 Intel KNL compute nodes, each containing: - -\begin{itemize} -\item Compute node: - -\begin{itemize} - -\item 1x Xeon Phi 7250 (KNL), 68 cores, 4 threads per core - -\item 192 GB of DDR4 memory - -\item 16 GB of MCDRAM - -\end{itemize} - -\item Interconnect: - -\begin{itemize} - -\item A Mellanox EDR 4x InfiniBand network - -\end{itemize} - -\end{itemize}
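The KNL nodes of Frioul combine a small amount of fast on-package MCDRAM with a much larger DDR4 memory. How the MCDRAM is used depends on the boot configuration: in cache mode it acts as a transparent last-level cache, while in flat mode it is exposed as a separate NUMA node that applications can target explicitly, for example with \texttt{numactl} or with the memkind library. The fragment below is a hedged, illustrative C sketch of the latter approach (it is not taken from any of the benchmark codes and assumes the memkind \texttt{hbwmalloc} interface is installed); the Quadrant/Cache versus Quadrant/Flat comparison reported for Alya in Section 4.1 refers to this memory configuration.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory interface */

/* Illustrative only: place one large array in MCDRAM (flat mode),
 * falling back to ordinary DDR4 if no high-bandwidth memory exists. */
int main(void)
{
    const size_t n = (size_t)1 << 24;
    int have_hbw = (hbw_check_available() == 0);
    double *field = have_hbw ? hbw_malloc(n * sizeof *field)
                             : malloc(n * sizeof *field);
    if (!field) return 1;

    for (size_t i = 0; i < n; ++i)
        field[i] = (double)i;

    printf("field[100] = %f (HBM used: %s)\n",
           field[100], have_hbw ? "yes" : "no");

    if (have_hbw) hbw_free(field); else free(field);
    return 0;
}
\end{verbatim}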
\section{3 Benchmark suite description\label{ref-0051}\label{ref-0052}} - -This part covers each code, presenting its interest for the scientific community as well as the test cases defined for the benchmarks. - -Most of the codes presented in this suite are included in the UEABS, which this suite extends. Exceptions are PFARM, which comes from PRACE-2IP {\hyperref[ref-0023]{[8]}}, and SHOC {\hyperref[ref-0026]{[11]}}, a synthetic benchmark suite. - -\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image2.emf} - -\textbf{Table 2 Codes and corresponding APIs available (in green)\label{ref-0053}\label{ref-0054}} - -{\hyperref[ref-0053]{Table 2}} lists the codes that are presented in the next sections as well as the implementations available for each of them. It should be noted that OpenMP can be used with the Intel Xeon Phi architecture, while CUDA is used for NVIDIA GPU cards. OpenCL has been considered as a third alternative that can be used on both architectures. It was available on the first generation of Xeon Phi (KNC) but has not been ported to the second one (KNL). SHOC is the only code that is impacted; this problem is addressed in section {\hyperref[ref-0151]{4.10}}. - -\subsection{3.1 Alya\label{ref-0055}} - -Alya is a high-performance computational mechanics code that can solve different coupled mechanics problems: incompressible/compressible flows, solid mechanics, chemistry, excitable media, heat transfer and Lagrangian particle transport. It is one single code; there are no particular parallel or individual platform versions. Modules, services and kernels can be compiled individually and used \`{a} la carte. The main discretisation technique employed in Alya is based on the variational multiscale finite element method to assemble the governing equations into algebraic systems. These systems can be solved using solvers like GMRES, Deflated Conjugate Gradient and pipelined CG, together with preconditioners like SSOR, Restricted Additive Schwarz, etc. The coupling between physics solved in different computational domains (like fluid-structure interactions) is carried out in a multi-code way, using different instances of the same executable. Asynchronous coupling can be achieved in the same way in order to transport Lagrangian particles. - -\subsubsection{\raisebox{-0pt}{3.1.1} Code description\label{ref-0056}} - -The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available: without and with a colouring strategy to avoid ATOMICs during the assembly step (a schematic sketch of the colouring approach is given at the end of this section). A CUDA version is also available for the different solvers. Alya has also been compiled for MIC (Intel Xeon Phi). - -Alya is written in Fortran 95 and the incompressible fluid module, present in the benchmark suite, is freely available. This module solves the Navier-Stokes equations using an Orthomin(1) {\hyperref[ref-0029]{[14]}} method for the pressure Schur complement. This method is an algebraic split strategy which converges to the monolithic solution. At each linearisation step, the momentum is solved twice and the continuity equation is solved once or twice, depending on whether the momentum-preserving or the continuity-preserving algorithm is selected.
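The colouring strategy mentioned above can be sketched as follows (an illustrative C/OpenMP fragment on a toy 1-D mesh, not Alya source code): elements are grouped into colours such that no two elements of the same colour share a node, so the elements of one colour can be assembled in parallel without atomic updates, at the cost of a synchronisation between colours. The alternative strategy keeps a single loop over all elements and protects each scatter-add with \texttt{\#pragma omp atomic}.

\begin{verbatim}
#include <stdio.h>

#define NELEM  8      /* toy mesh: 8 one-dimensional "elements"... */
#define NNODE  9      /* ...sharing 9 nodes                        */
#define NCOLOR 2

/* element -> its two nodes (neighbouring elements share a node) */
static const int enode[NELEM][2] = {
    {0,1},{1,2},{2,3},{3,4},{4,5},{5,6},{6,7},{7,8} };

/* odd/even colouring: elements of the same colour share no node */
static const int colour[NELEM] = { 0,1,0,1,0,1,0,1 };

int main(void)
{
    double rhs[NNODE] = { 0.0 };

    /* Assembly, colour by colour: within one colour the scatter-adds
     * below never touch the same entry twice, so no atomics are
     * needed; the implicit barrier at the end of the parallel loop
     * separates the colours. */
    for (int c = 0; c < NCOLOR; ++c) {
        #pragma omp parallel for
        for (int e = 0; e < NELEM; ++e) {
            if (colour[e] != c) continue;
            double elem_contrib = 1.0;  /* stand-in for the element integral */
            rhs[enode[e][0]] += 0.5 * elem_contrib;
            rhs[enode[e][1]] += 0.5 * elem_contrib;
        }
    }

    for (int i = 0; i < NNODE; ++i)
        printf("rhs[%d] = %g\n", i, rhs[i]);
    return 0;
}
\end{verbatim}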
\subsubsection{\raisebox{-0pt}{3.1.2} Test cases description\label{ref-0057}} - -\textit{Cavity-hexahedra elements (10M elements)} - -This test is the classical lid-driven cavity. The problem geometry is a cube of dimensions 1x1x1. The fluid properties are density=1.0 and viscosity=0.01. Dirichlet boundary conditions are applied on all sides, with three no-slip walls and one moving wall with velocity equal to 1.0, which corresponds to a Reynolds number of 100. The Reynolds number is low, so the regime is laminar and turbulence modelling is not necessary. The domain is discretised into 9,800,344 hexahedral elements. The solvers are the GMRES method for the momentum equations and the Deflated Conjugate Gradient to solve the continuity equation. This test case can be run using pure MPI parallelisation or the hybrid MPI/OpenMP strategy. - -\textit{Cavity-hexahedra elements (30M elements)} - -This is the same cavity test as before but with 30M elements. Note that a mesh multiplication strategy enables one to multiply the number of elements by powers of 8, by simply activating the corresponding option in the ker.dat file. - -\textit{Cavity-hexahedra elements, GPU version (10M elements)} - -This is the same test as Test case 1, but using the pure MPI parallelisation strategy with acceleration of the algebraic solvers using GPUs. - -\subsection{3.2 Code\_Saturne\label{ref-0058}} - -Code\_Saturne is a CFD software package developed by EDF R\&D since 1997 and open-source since 2007. The Navier-Stokes equations are discretised following a finite volume method approach. The code can handle any type of mesh built with any type of cell/grid structure. Incompressible and compressible flows can be simulated, with or without heat transfer, and a range of turbulence models is available. The code can also be coupled with itself or other software to model multi-physics problems (fluid-structure interaction or conjugate heat transfer, for instance). - -\subsubsection{\raisebox{-0pt}{3.2.1} Code description\label{ref-0059}} - -Parallelism is handled by distributing the domain over the processors (several partitioning tools are available, either internally, i.e. SFC Hilbert and Morton, or through external libraries, i.e. METIS Serial, ParMETIS, Scotch Serial and PT-SCOTCH). Communications between subdomains are handled by MPI. Hybrid parallelism using MPI/OpenMP has recently been optimised for improved multicore performance. - -For incompressible simulations, most of the time is spent computing the pressure through Poisson equations, whose matrices are very sparse. PETSc has recently been linked to the code to offer alternatives to the internal solvers for computing the pressure. The developer's version of PETSc supports CUDA and is used in this benchmark suite. - -Code\_Saturne is written in C, Fortran 95 and Python. It is freely available under the GPL license. - -\subsubsection{\raisebox{-0pt}{3.2.2} Test cases description\label{ref-0060}} - -Two test cases are dealt with: one with a mesh made of hexahedral cells (the 3-D Taylor-Green vortex flow) and one with a mesh made of tetrahedral cells (the 3-D lid-driven cavity). Both configurations are meant for incompressible laminar flows. The Taylor-Green case is run on KNL in order to test the performance of the code, always completely filling up a node using 64 MPI tasks and then either 1, 2 or 4 OpenMP threads, or 1, 2 or 4 extra MPI tasks, to investigate the effect of hyper-threading. In this case, the pressure is computed using the code's native Algebraic Multigrid (AMG) algorithm as a solver. The lid-driven cavity case is run on KNL and GPU. In this configuration, the pressure equation is solved using the conjugate gradient (CG) algorithm from the PETSc library (the developer's version of PETSc, which supports GPUs) and tests are run on KNL as well as on CPU+GPU. PETSc is built with the CUSP library and the CUSP format is used. - -Note that, in Code\_Saturne, computing the pressure using a CG algorithm has always been slower than using the native AMG algorithm. 
The lid-driven cavity test is therefore meant to compare the results obtained on KNL and GPU using CG only, and not to compare CG and AMG times to solution. - -\textit{Flow in a 3-D lid-driven cavity (tetrahedral cells)} - -The geometry is very simple, i.e. a cube, but the mesh is built using tetrahedral cells only. The Reynolds number is set to 100, and symmetry boundary conditions are applied in the spanwise direction. The case is modular and the mesh size can easily be varied. The largest mesh has about 13 million cells and is used to get some first comparisons using Code\_Saturne linked to the developer's PETSc library, in order to make use of the GPU. - -\textit{3-D Taylor-Green vortex flow (hexahedral cells)} - -The Taylor-Green vortex flow is traditionally used to assess the accuracy of CFD code numerical schemes. Periodicity is used in the 3 directions. The evolution of the total kinetic energy (integral of the squared velocity) and of the enstrophy (integral of the squared vorticity) as a function of time is examined. Code\_Saturne is set to use 2nd-order time and spatial schemes. The mesh size is 256$^{\mathrm{3}}$ cells. - -\subsection{3.3 CP2K\label{ref-0061}} - -CP2K is a quantum chemistry and solid-state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. It can perform molecular dynamics, metadynamics, Quantum Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or the dimer method. - -CP2K provides a general framework for different modelling methods such as density functional theory (DFT) using the mixed Gaussian and plane waves approach (GPW) and the Gaussian and augmented plane wave approach (GAPW). Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, {\ldots}), and classical force fields (AMBER, CHARMM, {\ldots}). - -\subsubsection{\raisebox{-0pt}{3.3.1} Code description\label{ref-0062}} - -Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI. - -Offloading for accelerators is implemented through CUDA and OpenCL for GPUs and through OpenMP for MIC (Intel Xeon Phi). - -CP2K is written in Fortran 2003 and freely available under the GPL license. - -\subsubsection{\raisebox{-0pt}{3.3.2} Test cases description\label{ref-0063}} - -\textit{LiH-HFX} - -This is a single-point energy calculation for a particular configuration of a 216-atom Lithium Hydride crystal with 432 electrons in a 12.3 \AA{}$^{\mathrm{3}}$ (Angstroms cubed) cell. The calculation is performed using a DFT algorithm with GAPW under the hybrid Hartree-Fock exchange (HFX) approximation. These types of calculations are generally around one hundred times the computational cost of a standard local DFT calculation, although the cost of the latter can be reduced by using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here, as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node and thus enough memory is available to avoid recomputing any integrals on the fly, improving performance. - -This test case is expected to scale efficiently to 1000+ nodes. - -\textit{H2O-DFT-LS} - -This is a single-point energy calculation for 2048 water molecules in a 39 \AA{}$^{\mathrm{3}}$ box using linear-scaling DFT. 
A local-density approximation (LDA) functional is used to compute the Exchange-Correlation energy in combination with a DZVP MOLOPT basis set and a 300 Ry cutoff. For large systems, the linear-scaling approach for solving Self-Consistent-Field equations should be much cheaper computationally than using standard DFT, and allow scaling up to 1 million atoms for simple systems. The linear scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard DFT is avoided and key operations are sparse matrix-matrix multiplications, which have a number of non-zero entries that scale linearly with system size. These are implemented efficiently in CP2K's DBCSR library. - -This test case is expected to scale efficiently to 4000+ nodes. - -\subsection{3.4 GPAW\label{ref-0064}} - -GPAW is a DFT program for ab-initio electronic structure calculations using the projector augmented wave method. It uses a uniform real-space grid representation of the electronic wavefunctions, that allows for excellent computational scalability and systematic converge properties. - -\subsubsection{\raisebox{-0pt}{3.4.1} Code description\label{ref-0065}} - -GPAW is written mostly in Python, but includes also computational kernels written in C as well as leveraging external libraries such as NumPy, BLAS and ScaLAPACK. Parallelisation is based on message-passing using MPI with no threading. Development branches for GPU and MICs include support for offloading to accelerators using either CUDA or pyMIC, respectively. GPAW is freely available under the GPL license. - -\subsubsection{\raisebox{-0pt}{3.4.2} Test cases description\label{ref-0066}} - -\textit{Carbon Nanotube} - -This test case is a ground state calculation for a carbon nanotube in vacuum. By default, it uses a $6-6-10$ nanotube with 240 atoms (freely adjustable) and serial LAPACK with an option to use ScaLAPACK. - -This benchmark is aimed at smaller systems, with an intended scaling range of up to 10 nodes. - -\textit{Copper Filament} - -This test case is a ground state calculation for a copper filament in vacuum. By default, it uses a 2x2x3 FCC lattice with 71 atoms (freely adjustable) and ScaLAPACK for parallelisation. - -This benchmark is aimed at larger systems, with an intended scaling range of up to 100 nodes. A lower limit on the number of nodes may be imposed by the amount of memory required, which can be adjusted to some extent with the run parameters (e.g. lattice size or grid spacing). - -\subsection{3.5 GROMACS\label{ref-0067}} - -GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. - -It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers. - -GROMACS supports all the usual algorithms you expect from a modern molecular dynamics implementation, and some additional features: - -GROMACS provides extremely high performance compared to all other programs. 
A lot of algorithmic optimisations have been introduced in the code; for instance, the calculation of the virial has been extracted from the innermost loops over pairwise interactions, and we use our own software routines to calculate the inverse square root. In GROMACS 4.6 and up, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms to SIMD machine instructions, to utilise the available instruction-level parallelism. These kernels are available in both single and double precision, and support all different kinds of SIMD support found in x86-family (and other) processors. - -\subsubsection{\raisebox{-0pt}{3.5.1} Code description\label{ref-0068}} - -Parallelisation is achieved using combined OpenMP and MPI. - -Offloading for accelerators is implemented through CUDA for GPU and through OpenMP for MIC (Intel Xeon Phi). - -GROMACS is written in C/C++ and freely available under the GPL license. - -\subsubsection{\raisebox{-0pt}{3.5.2} Test cases description\label{ref-0069}} - -\textit{GluCL Ion Channel} - -The ion channel system is the membrane protein GluCl, which is a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains 142k atoms, and is a quite challenging parallelisation case due to the small size. However, it is likely one of the most wanted target sizes for biomolecular simulations due to the importance of these proteins for pharmaceutical applications. It is particularly challenging due to a highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition. - -This test case was used as the ``Small'' test case in previous 2IP and 3IP PRACE phases. It is included in the package's version 5.0 benchmark cases. It is reported to scale efficiently up to 1000+ cores on x86 based systems. - -\textit{Lignocellulose} - -A model of cellulose and lignocellulosic biomass in an aqueous solution {\hyperref[ref-0024]{[9]}}. This system of 3.3 million atoms is inhomogeneous. This system uses reaction-field electrostatics instead of PME and therefore scales well on x86. This test case was used as the ``Large'' test case in previous PRACE 2IP and 3IP projects. It is reported in previous PRACE projects to scale efficiently up to 10000+ x86 cores. - -\subsection{3.6 NAMD\label{ref-0070}} - -NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms. NAMD is developed by the ``Theoretical and Computational Biophysics Group'' at the University of Illinois at Urbana Champaign. In the design of NAMD particular emphasis has been placed on scalability when utilising a large number of processors. The application can read a wide variety of different file formats, for example force fields, protein structures, which are commonly used in bio-molecular science. A NAMD license can be applied for on the developer's website free of charge. Once the license has been obtained, binaries for a number of platforms and the source can be downloaded from the website. Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins or between proteins and other chemical substances is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins. 
- -\subsubsection{\raisebox{-0pt}{3.6.1} Code description\label{ref-0071}} - -NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI, supporting both pure MPI and hybrid parallelisation {\hyperref[ref-0025]{[10]}}. - -Offloading for accelerators is implemented for both GPU and MIC (Intel Xeon Phi). - -\subsubsection{\raisebox{-0pt}{3.6.2} Test cases description\label{ref-0072}} - -The datasets are based on the original ``Satellite Tobacco Mosaic Virus (STMV)'' dataset from the official NAMD site. The memory-optimised build of the package and the corresponding data sets are used in the benchmarking. The data are converted to the appropriate binary format used by the memory-optimised build. - -\textit{STMV.1M} - -This is the original STMV dataset from the official NAMD site. The system contains roughly 1 million atoms. This data set scales efficiently up to 1000+ x86 Ivy Bridge cores. - -\textit{STMV.8M} - -This is a 2x2x2 replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms. This data set scales efficiently up to 6000 x86 Ivy Bridge cores. - -\textit{STMV.28M} - -This is a 3x3x3 replication of the original STMV dataset from the official NAMD site. The system contains roughly 28 million atoms. This data set also scales efficiently up to 6000 x86 Ivy Bridge cores. - -\subsection{3.7 PFARM\label{ref-0073}} - -PFARM is part of a suite of programs based on the `R-matrix' ab-initio approach to the variational solution of the many-electron Schr\"{o}dinger equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical applications (such as the interstellar medium and planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel `outer-region' calculations for molecular systems as well as atomic systems. - -\subsubsection{\raisebox{-0pt}{3.7.1} Code description\label{ref-0074}} - -In order to enable efficient computation, the external region calculation takes place in two distinct stages, named EXDIG and EXAS, with intermediate files linking the two. EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. EXAS uses a combined functional/domain decomposition approach where good load-balancing is essential to maintain efficient parallel performance. Each of the main stages in the calculation is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and is designed to take advantage of highly optimised numerical library routines. Hybrid MPI/OpenMP parallelisation has also been introduced into the code via shared-memory-enabled numerical library kernels. - -Accelerator-based implementations exist for both stages: EXDIG uses offloading via MAGMA (or MKL) for the sector Hamiltonian diagonalisations on Intel Xeon Phi and GPU accelerators, while EXAS uses combined MPI and OpenMP to distribute the scattering energy calculations efficiently both across and within Intel Xeon Phi co-processors. 
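As a rough illustration of the EXDIG workload (a hedged sketch, not PFARM source code; the sector count and matrix dimension below are invented), independent sector Hamiltonian diagonalisations can be distributed over MPI ranks, each rank calling a dense symmetric eigensolver for the sectors it owns. In the accelerated builds, the per-sector eigensolve is what gets handed to MAGMA or an offloaded MKL routine instead of the plain CPU LAPACK call used here.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <lapacke.h>

#define NSECTOR 20            /* invented number of sectors            */
#define N       1000          /* invented sector Hamiltonian dimension */

/* Stand-in for the real assembly: any symmetric matrix will do here. */
static void build_sector_hamiltonian(int sector, double *h)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j <= i; ++j)
            h[i * N + j] = h[j * N + i] = 1.0 / (1.0 + i + j + sector);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *h    = malloc((size_t)N * N * sizeof *h);
    double *eval = malloc(N * sizeof *eval);

    /* Round-robin distribution of the independent sector eigensolves. */
    for (int s = rank; s < NSECTOR; s += size) {
        build_sector_hamiltonian(s, h);
        /* CPU eigensolver; an accelerated build would call a MAGMA or
         * offloaded MKL equivalent at this point instead.             */
        LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', N, h, N, eval);
        printf("rank %d: sector %d, lowest eigenvalue %g\n",
               rank, s, eval[0]);
    }

    free(h); free(eval);
    MPI_Finalize();
    return 0;
}
\end{verbatim}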
- -\subsubsection{\raisebox{-0pt}{3.7.2} Test cases description\label{ref-0075}} - -External region R-matrix propagations take place over the outer partition of configuration space, including the region where long-range potentials remain important. The radius of this region is determined from the user input and the program decides upon the best strategy for dividing this space into multiple sub-regions (or sectors). Generally, a choice of larger sector lengths requires the application of larger numbers of basis functions (and therefore larger Hamiltonian matrices) in order to maintain accuracy across the sector, and vice-versa. Memory limits on the target hardware may determine the final preferred configuration for each test case. - -\textit{Iron, FeIII} - -This is an electron-ion scattering case with 1181 channels. Hamiltonian assembly in the coarse region applies 10 Legendre functions, leading to Hamiltonian matrix diagonalisations of order 11810. In the `fine energy region' up to 30 Legendre functions may be applied, leading to Hamiltonian matrices of up to order 35430. The number of sector calculations is likely to range from about 15 to over 30, depending on the user specifications. Several thousand scattering energies are used in the calculation. - -\textit{Methane, CH$_{4}$} - -The dataset is an electron-molecule calculation with 1361 channels. Hamiltonian dimensions are therefore estimated between 13610 and \textasciitilde{}40000. A process in the code which splits the constituent channels according to spin can be used to approximately halve the Hamiltonian size (whilst doubling the overall number of Hamiltonian matrices). As eigensolvers generally require O(N$^{\mathrm{3}}$) operations, spin splitting leads to a saving in both memory requirements and operation count (see the short estimate at the end of this section). The final radius of the external region required is relatively long, leading to more numerous sector calculations (estimated at between 20 and 30). The calculation will require many thousands of scattering energies. - -In the current model, parallelism in EXDIG is limited to the number of sector calculations, i.e. a maximum of around 30 accelerator nodes. - -Methane is a relatively new dataset which has not been calculated on novel technology platforms at very large scale to date, so this is somewhat of a step into the unknown. We are also somewhat reliant on collaborative partners that are not associated with PRACE for continuing to develop and fine-tune the accelerator-based EXAS program for this proposed work. Access to suitable hardware with throughput suited to development cycles is also a necessity if suitable progress is to be ensured. 
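To make the effect of spin splitting explicit (a back-of-the-envelope estimate based on the scalings quoted above, not a measured figure): if a Hamiltonian of dimension $N$ is replaced by two independent matrices of dimension $N/2$, the $O(N^{3})$ eigensolver work and the $O(N^{2})$ matrix storage become
\[
2\left(\frac{N}{2}\right)^{3} = \frac{N^{3}}{4}
\qquad\text{and}\qquad
2\left(\frac{N}{2}\right)^{2} = \frac{N^{2}}{2},
\]
i.e. roughly a factor of 4 fewer floating-point operations and a factor of 2 less matrix storage, consistent with the savings described for the Methane case.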
\subsection{3.8 QCD\label{ref-0076}} - -Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of neutrons and protons, which are composed of quarks bound together by gluons. - -The theory of how quarks and gluons interact to form nucleons and other elementary particles is called Quantum Chromo Dynamics (QCD). For most problems of interest, it is not possible to solve QCD analytically, and instead numerical simulations must be performed. Such ``Lattice QCD'' calculations are very computationally intensive, and occupy a significant percentage of all HPC resources worldwide. - -\subsubsection{\raisebox{-0pt}{3.8.1} Code description\label{ref-0077}} - -The QCD benchmark comprises two different implementations, described below. - -\textit{First implementation} - -The MILC code is a freely available suite for performing Lattice QCD simulations, developed over many years by a collaboration of researchers {\hyperref[ref-0030]{[15]}}. - -The benchmark used here is derived from the MILC code (v6), and consists of a full conjugate gradient solution using Wilson fermions. The benchmark is consistent with ``QCD kernel E'' in the full UEABS, and has been adapted so that it can efficiently use accelerators as well as traditional CPUs. - -The implementation for accelerators has been achieved using the ``targetDP'' programming model {\hyperref[ref-0031]{[16]}}, a lightweight abstraction layer designed to allow the same application source code to target multiple architectures, e.g. NVIDIA GPUs and multicore/manycore CPUs, in a performance-portable manner. The targetDP syntax maps, at compile time, to either NVIDIA CUDA (for execution on GPU) or OpenMP+vectorisation (for execution on multi/manycore CPUs, including the Intel Xeon Phi). The base language of the benchmark is C, and MPI is used for node-level parallelism. - -\textit{Second implementation} - -The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on the QUDA {\hyperref[ref-0027]{[12]}} and QPhix {\hyperref[ref-0028]{[13]}} libraries. The QUDA library is based on CUDA and optimised for running on NVIDIA GPUs {\hyperref[ref-0032]{[17]}}. The QPhix library consists of routines which are optimised to use Intel intrinsic functions for multiple vector lengths, including optimised routines for KNC and KNL {\hyperref[ref-0033]{[18]}}. In both QUDA and QPhix, the benchmark kernel uses the conjugate gradient solvers implemented within the libraries. - -\subsubsection{\raisebox{-0pt}{3.8.2} Test cases description\label{ref-0078}} - -Lattice QCD involves the discretisation of space-time into a lattice of points, where the extent of the lattice in each of the 3 spatial and 1 temporal dimensions can be chosen. This means that the benchmark is very flexible: the size of the lattice can be varied with the size of the computing system in use (weak scaling) or can be fixed (strong scaling). For testing on a single node, 64x64x32x8 is a reasonable size, since this fits on a single Intel Xeon Phi or a single GPU. For larger numbers of nodes, the lattice extents can be increased accordingly, keeping the geometric shape roughly similar. Test cases for the second implementation are given by a strong-scaling mode with lattice sizes of 32x32x32x96 and 64x64x64x128, and a weak-scaling mode with a local lattice size of 48x48x48x24. - -\subsection{3.9 Quantum Espresso\label{ref-0079}} - -QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). QUANTUM ESPRESSO stands for \textit{opEn Source Package for Research in Electronic Structure, Simulation, and Optimisation}. It is freely available to researchers around the world under the terms of the GNU General Public License. QUANTUM ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms and applied in the last twenty years by some of the leading materials modelling groups worldwide. 
Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures, and a great effort being devoted to user friendliness. QUANTUM ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate in the project by contributing their own codes or by implementing their own ideas into existing codes. - -QUANTUM ESPRESSO is written mostly in Fortran90, and parallelised using MPI and OpenMP and is released under a GPL license. - -\subsubsection{\raisebox{-0pt}{3.9.1} Code description\label{ref-0080}} - -During 2011 a GPU-enabled version of Quantum ESPRESSO was publicly released. The code is currently developed and maintained by Filippo Spiga at the High Performance Computing Service - University of Cambridge (United Kingdom) and Ivan Girotto at the International Centre for Theoretical Physics (Italy). The initial work has been supported by the EC-funded PRACE and a SFI (Science Foundation Ireland, grant 08/HEC/I1450). At the time of writing, the project is self-sustained thanks to the dedication of the people involved and thanks to NVIDIA support in providing hardware and expertise in GPU programming. - -The current public version of QE-GPU is 14.10.0 as it is the last version maintained as plug-in working on all QE 5.x versions. QE-GPU utilised phiGEMM (external) for CPU+GPU GEMM computation, MAGMA (external) to accelerate eigen-solvers and explicit CUDA kernel to accelerate compute-intensive routines. FFT capabilities on GPU are available only for serial computation due to the hard challenges posed in managing accelerators in the parallel distributed 3D-FFT portion of the code where communication is the dominant element that limits excellent scalability beyond hundreds of MPI ranks. - -A version for Intel Xeon Phi (MIC) accelerators is not currently available. - -\subsubsection{\raisebox{-0pt}{3.9.2} Test cases description\label{ref-0081}} - -\textit{PW-IRMOF\_M11} - -Full SCF calculation of a Zn-based isoreticular metal--organic framework (total 130 atoms) over 1 K point. Benchmarks run in 2012 demonstrated speedups due to GPU (NVIDIA K20s, with respect to non-accelerated nodes) in the range 1.37 -- 1.87, according to node count (maximum number of accelerators=8). Runs with current hardware technology and an updated version of the code are expected to exhibit higher speedups (probably 2-3x) and scale up to a couple hundred nodes. - -\textit{PW-SiGe432} - -This is a SCF calculation of a Silicon-Germanium crystal with 430 atoms. Being a fairly large system, parallel scalability up to several hundred, perhaps a 1000 nodes is expected, with accelerated speed-ups likely to be of 2-3x. - -\subsection{3.10 Synthetic benchmarks -- SHOC\label{ref-0082}} - -The Accelerator Benchmark Suite will also include a series of synthetic benchmarks. For this purpose, we choose the Scalable HeterOgeneous Computing (SHOC) benchmark suite, augmented with a series of benchmark examples developed internally. SHOC is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing. Its initial focus is on systems containing GPU and multi-core processors, and on the OpenCL programming standard, but CUDA and OpenACC versions were added. 
Moreover, a subset of the benchmarks is optimised for the Intel Xeon Phi coprocessor. SHOC can be used on clusters as well as individual hosts. - -The SHOC benchmark suite currently contains benchmark programs categorised by complexity. Some measure low-level 'feeds and speeds' behaviour (Level 0), some measure the performance of a higher-level operation such as a Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2). - -The SHOC benchmark suite has been selected to evaluate the performance of accelerators on synthetic benchmarks, mostly because SHOC provides CUDA/OpenCL/Offload/OpenACC variants of the benchmarks. This allowed us to evaluate NVIDIA GPU (with CUDA/OpenCL/OpenACC), Intel Xeon Phi KNC (with both Offload and OpenCL), but also Intel host CPU (with OpenCL/OpenACC). However, on the latest Xeon Phi processor (codenamed KNL) none of these 4 models is supported. Thus, benchmarks on the KNL architecture can not be run at this point, and there aren't any news of Intel supporting OpenCL on the KNL. However, there is work in progress on the PGI compiler to support the KNL as a target. This support will be added during 2017. This will allow us to compile and run the OpenACC benchmarks for the KNL. Alternatively, the OpenACC benchmarks will be ported to OpenMP and executed on the KNL. - -\subsubsection{\raisebox{-0pt}{3.10.1} Code description\label{ref-0083}} - -All benchmarks are MPI-enabled. Some will report aggregate metrics over all MPI ranks, others will only perform work for specific ranks. - -Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi). For selected benchmarks OpenACC implementations are provided for GPU. Multi-node parallelisation is achieved using MPI. - -SHOC is written in C++ and is open-source and freely available. - -\subsubsection{\raisebox{-0pt}{3.10.2} Test cases description\label{ref-0084}} - -The benchmarks contained in SHOC currently feature 4 different sizes for increasingly large systems. The size convention is as follows: - -\begin{enumerate}[1] - -\item CPU / debugging - -\item Mobile/integrated GPU - -\item Discrete GPU (e.g. GeForce or Radeon series) - -\item HPC-focused or large memory GPU (e.g. Tesla or Firestream Series) - -\end{enumerate} - -In order to go even larger scale, we plan to add a 5th level for massive supercomputers. - -\subsection{3.11 SPECFEM3D\label{ref-0085}} - -The software package SPECFEM3D simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). All SPECFEM3D\_GLOBE software is written in Fortran90 with full portability in mind, and conforms strictly to the Fortran95 standard. It uses no obsolete or obsolescent features of Fortran77. The package uses parallel programming based upon the Message Passing Interface (MPI). - -The SEM was originally developed in computational fluid dynamics and has been successfully adapted to address problems in seismic wave propagation. It is a continuous Galerkin technique, which can easily be made discontinuous; it is then close to a particular case of the discontinuous Galerkin technique, with optimised efficiency because of its tensorised basis functions. In particular, it can accurately handle very distorted mesh elements. It has very good accuracy and convergence properties. The spectral element approach admits spectral rates of convergence and allows exploiting hp-convergence schemes. 
It is also very well suited to parallel implementation on very large supercomputers, as well as on clusters of GPU-accelerated nodes. Tensor products inside each element can be optimised to reach very high efficiency, and mesh point and element numbering can be optimised to reduce processor cache misses and improve cache reuse. The SEM can also handle triangular (in 2D) or tetrahedral (in 3D) elements, as well as mixed meshes, although with increased cost and reduced accuracy in these elements, as in the discontinuous Galerkin method. - -In many geological models in the context of seismic wave propagation studies (except, for instance, for fault dynamic rupture studies, in which very high frequencies of supershear rupture need to be modelled near the fault), a continuous formulation is sufficient because material property contrasts are not drastic and thus conforming mesh doubling bricks can efficiently handle mesh size variations. This is particularly true at the scale of the full Earth. Effects due to lateral variations in compressional-wave speed, shear-wave speed, density, a 3D crustal model, ellipticity, topography and bathymetry, the oceans, rotation, and self-gravitation are included. The package can accommodate full 21-parameter anisotropy as well as lateral variations in attenuation. Adjoint capabilities and finite-frequency kernel simulations are also included. - -\subsubsection{\raisebox{-0pt}{3.11.1} Test cases definition\label{ref-0086}} - -Both test cases use the same input data. A 3D shear-wave speed model (S362ANI) is used to benchmark the code. - -The simulation parameters used to size the test cases are the following: - -\begin{itemize} -\item \textit{NCHUNKS}, the number of faces of the cubed sphere included in the simulation (always 6 here); - -\item \textit{NPROC\_XI}, the number of slices along one side of a chunk of the cubed sphere (it also determines the number of MPI processes used for one chunk; see the consistency check at the end of this section); - -\item \textit{NEX\_XI}, the number of spectral elements along one side of a chunk; - -\item \textit{RECORD\_LENGTH\_IN\_MINUTES}, the length of the simulated seismograms. The simulation time should vary linearly with this parameter. - -\end{itemize} -\textit{Small test case} - -It runs with 24 MPI tasks and has the following mesh characteristics: - -\begin{itemize} -\item NCHUNKS=6 - -\item NPROC\_XI=2 - -\item NEX\_XI=80 - -\item RECORD\_LENGTH\_IN\_MINUTES=2.0 - -\end{itemize} -\textit{Bigger test case} - -It runs with 150 MPI tasks and has the following mesh characteristics: - -\begin{itemize} -\item NCHUNKS=6 - -\item NPROC\_XI=5 - -\item NEX\_XI=80 - -\item RECORD\_LENGTH\_IN\_MINUTES=2.0 - -\end{itemize}
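As a consistency check on the MPI task counts quoted above (assuming, as is usual for SPECFEM3D\_GLOBE runs of this kind, the same number of slices in both horizontal directions of a chunk), the total number of MPI tasks is
\[
N_{\mathrm{MPI}} = \mathit{NCHUNKS} \times \mathit{NPROC\_XI}^{2},
\]
which gives $6 \times 2^{2} = 24$ tasks for the small test case and $6 \times 5^{2} = 150$ tasks for the bigger one, matching the values given above.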
\subsection{4.1 Alya\label{ref-0089}}

Alya has been compiled and run using test case A on three different types of compute nodes:

\begin{itemize}
\item BSC MinoTauro Westmere partition (Intel Xeon E5649, 12 cores, 2.53 GHz, 24 GB RAM, InfiniBand)

\item BSC MinoTauro Haswell + K80 partition (Intel Xeon E5-2630 v3, 16 cores, 2.4 GHz, 128 GB RAM, NVIDIA K80, InfiniBand)

\item KNL 7250 (68 cores, 1.40 GHz, 16 GB MCDRAM, 96 GB DDR4 RAM, Ethernet)

\end{itemize}
Alya supports parallelism via different options, mainly MPI for problem decomposition, OpenMP within the matrix construction phase and CUDA parallelism for selected solvers. In general, the best distribution and performance can be achieved by using MPI. On the KNL it has proven optimal to use 16 MPI processes with 4 OpenMP threads each, for a total of 64 threads, each on its own physical core. The Xeon Phi processor shows slightly better performance for Alya when configured in Quadrant/Cache mode compared to Quadrant/Flat, although the difference is negligible. The application is not optimised for the first-generation Xeon Phi KNC and does not support offloading.

Overall speedups have been compared to a one-node CPU run on the Haswell partition of MinoTauro. As the application is heavily optimised for traditional computation, the best and almost linear scaling is observed for the CPU-only runs. Some calculations benefit from the accelerators, with GPU yielding from 3.6x to 6.5x speedup for one to three nodes. The KNL runs are limited by the OpenMP scalability, and too many MPI tasks on these processors lead to suboptimal scaling. Speedups in this case range from 0.9x to 1.6x and can be further improved by introducing more threading parallelism. The communication overhead when running with many MPI tasks on KNL is noticeable and is further limited by the Ethernet connection on multi-node runs. High-performance fabrics such as Omni-Path or InfiniBand promise to provide significant enhancement for these cases. The results are compared in {\hyperref[ref-0092]{Figure 3}}.

It can be seen that the best performance is obtained on the most recent standard Xeon CPU in conjunction with GPU. This is expected, as Alya has been heavily optimised for traditional HPC scalability using mainly MPI and makes good use of the available cores. The addition of GPU-enabled solvers provides a noticeable boost to the overall performance. To fully exploit the KNL, further optimisations are ongoing and additional OpenMP parallelism will need to be employed.

\begin{figure}
\caption{Figure 1 The matrix construction part of Alya, which is parallelised with OpenMP and benefits significantly from the many cores available on KNL.\label{ref-0090}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image3.png}
\label{fig:1}
\end{figure}

\begin{figure}
\caption{Figure 2 Scalability of the code. As expected, Haswell cores with K80 GPU are high-performing, while the KNL port is currently being optimised further.\label{ref-0091}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image4.png}
\label{fig:2}
\end{figure}

\begin{figure}
\caption{Figure 3 Best performance is achieved with GPU in combination with powerful CPU cores.
Single-thread performance has a big impact on the speedup; both threading and vectorisation are employed for additional performance.\label{ref-0092}\label{ref-0093}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image5.png}
\label{fig:3}
\end{figure}
\subsection{4.2 Code\_Saturne\label{ref-0094}}

\textit{Description of the runtime architectures:}

\begin{itemize}
\item KNL: ARCHER (model 7210). The ENV\_6.0.3 environment is used and the Intel compiler version is 17.0.0.098.

\item GPU: 2 POWER8 nodes, i.e. S822LC (2x P8 10-cores + 2x K80 (2 G210 per K80)) and S824L (2x P8 12-cores + 2x K40 (1 G180 per K40)). The compiler is at/8.0, the MPI distribution is openmpi/1.8.8 and the CUDA compiler version is 7.5.

\end{itemize}
\textit{3-D Taylor-Green vortex flow (hexahedral cells)}

The first test case has been run on ARCHER KNL and the performance has been investigated for several configurations, each of them using 64 MPI tasks per node, to which either 1, 2 or 4 hyper-threads (as extra MPI tasks) or OpenMP threads are added for testing. The results are compared to the ARCHER CPU nodes, which use Ivy Bridge processors. Up to 8 nodes are used for the comparison.

\begin{figure}
\caption{Figure 4 Code\_Saturne's performance on KNL. AMG is used as a solver in V4.2.2.\label{ref-0095}\label{ref-0096}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image6.png}
\label{fig:4}
\end{figure}
{\hyperref[ref-0095]{Figure 4}} shows the CPU time per time step as a function of the number of threads/MPI tasks. For all the cases, the time to solution decreases when the number of threads increases. For the case using MPI only and no hyper-threading (green line), a simulation is also run on half a node to investigate the speedup going from half a node to a full node, which is about 2, as seen in the figure. The ellipses help compare the time to solution per node and, finally, a comparison is carried out with simulations run on the standard ARCHER Ivy Bridge nodes. When using 8 nodes, the best configuration for Code\_Saturne on KNL is 64 MPI tasks and 2 OpenMP threads per task (blue line in the figure), which is about 15 to 20\% faster than running on the Ivy Bridge nodes, using the same number of nodes.

\textit{Flow in a 3-D lid-driven cavity (tetrahedral cells)}

The following options are used for PETSc (a minimal sketch of how such runtime options are consumed by a PETSc-based solve is given below):


\begin{itemize}
\item CPU: -ksp\_type cg and -pc\_type jacobi

\item GPU: -ksp\_type cg, -vec\_type cusp, -mat\_type aijcusp and -pc\_type jacobi

\end{itemize}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image7.emf}

\textbf{Table 3 Performance of Code\_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK. The speedup is computed as the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks.\label{ref-0097}\label{ref-0098}}

\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image8.emf}

\textbf{Table 4 Performance of Code\_Saturne and PETSc on 1 node of KNL. PETSc is built on the MKL library\label{ref-0099}\label{ref-0100}}

{\hyperref[ref-0097]{Table 3}} and {\hyperref[ref-0099]{Table 4}} show the results obtained using POWER8 CPU and CPU/GPU, and KNL, respectively.
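The following minimal C driver is a standalone illustration (it is not extracted from Code\_Saturne) of how such runtime options are consumed by a PETSc solve: the Krylov method, preconditioner and, for the GPU build, the vector and matrix types are selected from the command line via the \texttt{*SetFromOptions} calls rather than being hard-coded. The matrix used here is an assumed, simple symmetric positive-definite tridiagonal stand-in for the pressure system; error checking is omitted for brevity.

\begin{verbatim}
/* Minimal PETSc sketch: solve A x = b with solver options taken from
 * the command line, e.g.
 *   CPU: -ksp_type cg -pc_type jacobi
 *   GPU: -ksp_type cg -pc_type jacobi -vec_type cusp -mat_type aijcusp
 */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, n = 100, Istart, Iend, its;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Illustrative SPD tridiagonal matrix (stand-in for the pressure system) */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);                    /* honours -mat_type aijcusp */
  MatSetUp(A);
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreate(PETSC_COMM_WORLD, &x);
  VecSetSizes(x, PETSC_DECIDE, n);
  VecSetFromOptions(x);                    /* honours -vec_type cusp */
  VecDuplicate(x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);                  /* honours -ksp_type and -pc_type */
  KSPSolve(ksp, b, x);
  KSPGetIterationNumber(ksp, &its);
  PetscPrintf(PETSC_COMM_WORLD, "converged in %d iterations\n", (int)its);

  KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
  PetscFinalize();
  return 0;
}
\end{verbatim}

With such a driver the CPU and GPU runs differ only in the options passed at launch, which mirrors how the options listed above switch the Code\_Saturne + PETSc pressure solve between the CPU path and the CUSP-based GPU path.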
Focusing first on the results on the POWER8 nodes, a speedup is observed on each of the POWER8 nodes when the CPU/GPU runs are compared to CPU-only runs with the same number of MPI tasks. However, when the nodes are fully populated (20 and 24 MPI tasks, respectively), it is faster to run on the CPU only than on CPU/GPU. A possible explanation is that the same overall amount of data has to be transferred, but splitting it into 20 (S822LC) or 24 (S824L) slices makes the associated costs (system administration, latency and the asynchronicity of the transfers) prohibitive.

\subsection{4.3 CP2K\label{ref-0101}}

Times shown in the ARCHER KNL (model 7210, 1.30 GHz, 96 GB DDR4 memory) vs Ivy Bridge (E5-2697 v2, 2.7 GHz, 64 GB) plot are for those CP2K threading configurations that give the best performance in each case. The shorthand for naming threading configurations is:

\begin{itemize}
\item MPI: pure MPI

\item X\_TH: X OpenMP threads per MPI rank

\end{itemize}
Whilst pure MPI or 2 OpenMP threads per rank is often fastest on conventional processors, on the KNL multithreading is more likely to be beneficial, especially in problems such as the LiH-HFX benchmark, in which having fewer MPI ranks means more memory is available to each rank, allowing partial results to be stored in memory instead of being expensively recomputed on the fly.

Hyper-threads were left disabled (equivalent to the aprun option -j 1), as no significant performance benefit was observed when using hyper-threading.

\begin{figure}
\caption{Figure 5 Test case 1 of CP2K on the ARCHER cluster\label{ref-0102}\label{ref-0103}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image9.png}
\label{fig:5}
\end{figure}
The node-based comparison ({\hyperref[ref-0102]{Figure 5}}) shows that runtimes on KNL nodes are roughly 1.7 times longer than on 2-socket Ivy Bridge nodes.

\subsection{4.4 GPAW\label{ref-0104}}

The performance of GPAW using both benchmarks was measured with a range of parallel job sizes on several architectures, designated in the following tables, figures, and text as:

\begin{itemize}
\item CPU: x86 Haswell CPU (Intel Xeon E5-2690v3) in a dual-socket node

\item KNC: Knights Corner MIC (Intel Xeon Phi 7120P) with an x86 Haswell host CPU (Intel Xeon E5-2680v3) in a dual-socket node

\item KNL: Knights Landing MIC (Intel Xeon Phi 7210) in a single-socket node

\item K40: K40 GPU (NVIDIA Tesla K40) with an x86 Ivy Bridge host CPU (Intel Xeon E5-2620 v2) in a dual-socket node

\item K80: K80 GPU (NVIDIA Tesla K80) with an x86 Haswell host CPU (Intel Xeon E5-2680v3) in a quad-socket node

\end{itemize}
Only the time spent in the main SCF-cycle was used as the runtime in the comparison ({\hyperref[ref-0105]{Table 5}} and {\hyperref[ref-0107]{Table 6}}) to exclude any differences in the initialisation overheads.

\includegraphics[width=0.5\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image10.emf}\includegraphics[width=0.5\textwidth]{embeddings/Microsoft_Excel_Worksheet2.xlsx}

\textbf{Table 5 GPAW runtimes (in seconds) for the smaller benchmark (Carbon Nanotube) measured on several architectures when using n sockets (i.e. processors or accelerators).\label{ref-0105}\label{ref-0106}}

\includegraphics[width=0.5\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image11.emf}\includegraphics[width=0.5\textwidth]{embeddings/Microsoft_Excel_Worksheet3.xlsx}

\textbf{Table 6 GPAW runtimes (in seconds) for the larger benchmark (Copper Filament) measured on several architectures when using n sockets (i.e.
processors or accelerators). *Due to memory limitations on the GPU, the grid spacing was increased from 0.22 to 0.28 to obtain a sparser grid. To account for this in the comparison, the K40 and K80 runtimes have been scaled up using a corresponding CPU runtime as a yardstick (scaling factor q=2.1132).\label{ref-0107}\label{ref-0108}}

As can be seen from {\hyperref[ref-0105]{Table 5}} and {\hyperref[ref-0107]{Table 6}}, in both benchmarks a single KNL or K40/K80 was faster than a single CPU. But when using multiple KNL, the performance does not seem to scale as well as for CPU. In the smaller benchmark (Carbon Nanotube), CPU outperform KNL when using more than 2 processors. In the larger benchmark (Copper Filament), KNL still outperform CPU with 8 processors, but it seems likely that the CPU will overtake the KNL when using an even larger number of processors.

In contrast to KNL, the older KNC are slower than Haswell CPU across the board. Nevertheless, as can be seen from {\hyperref[ref-0109]{Figure 6}}, the scaling of KNC is to some extent comparable to CPU but with a lower scaling limit. It is therefore likely that, on systems with considerably slower host CPU than Haswells (e.g. Ivy Bridges), KNC may also give a performance boost over the host CPU.

\begin{figure}
\caption{Figure 6 Relative performance ($t_0/t$) of GPAW is shown for parallel jobs using an increasing number of CPU (blue) or Xeon Phi KNC (red). The single-CPU SCF-cycle runtime ($t_0$) was used as the baseline for the normalisation. Ideal scaling is shown as a linear dashed line for comparison. Case 1 (Carbon Nanotube) is shown with square markers and Case 2 (Copper Filament) is shown with round markers.\label{ref-0109}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image12.png}
\label{fig:6}
\end{figure}
\subsection{4.5 GROMACS\label{ref-0110}}

GROMACS was successfully compiled and run on the following systems:

\begin{itemize}
\item GRNET ARIS: Thin nodes (E5-2680v2), GPU nodes (dual E5-2660v3 + dual K40m), all with FDR14 InfiniBand; single-node KNL 7210.

\item CINES Frioul: KNL 7230

\item IDRIS Ouessant: IBM POWER8 + dual P100

\end{itemize}
On the KNL machines the runs were performed using the Quadrant processor mode with both the Cache and Flat memory configurations. On GRNET's single-node KNL more configurations were tested.

As expected, the Quadrant/Cache mode gives the best performance in all cases. The dependence of performance on the MPI tasks/OpenMP threads combination was also explored. In most cases 66 tasks per node using 2 or 4 threads per task gives the best performance on the KNL 7230.

In all accelerated runs a speedup of 2-2.6x with respect to CPU-only runs was achieved with GPU. GROMACS does not support offload on KNC.

\begin{figure}
\caption{Figure 7 Scalability for GROMACS test case GluCL Ion Channel\label{ref-0111}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image13.png}
\label{fig:7}
\end{figure}

\begin{figure}
\caption{Figure 8 Scalability for GROMACS test case Lignocellulose\label{ref-0112}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image14.png}
\label{fig:8}
\end{figure}
\subsection{4.6 NAMD\label{ref-0113}}

NAMD was successfully compiled and run on the following systems:

\begin{itemize}
\item GRNET ARIS: Thin nodes (E5-2680v2), GPU nodes (dual E5-2660v3 + dual K40m), KNC nodes (dual E5-2660v2 + dual KNC 7120P), all with FDR14 InfiniBand; single-node KNL 7210.
\item CINES Frioul: KNL 7230

\item IDRIS Ouessant: IBM POWER8 + dual P100

\end{itemize}
On the KNL machines the runs were performed using the Quadrant processor mode with both the Cache and Flat memory configurations. On GRNET's single-node KNL more configurations were tested.

As expected, the Quadrant/Cache mode gives the best performance in all cases. The dependence of performance on the MPI tasks/OpenMP threads combination was also explored.

In most cases 66 tasks per node with 4 threads per task, or 4 tasks per node with 64 threads per task, gives the best performance on the KNL 7230.

In all accelerated runs a speedup of 5-6x with respect to CPU-only runs was achieved with GPU.

On KNC the speedup with respect to CPU-only runs is in the range 2-3.5x in all cases.

\begin{figure}
\caption{Figure 9 Scalability for NAMD test case STMV.8M\label{ref-0114}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image15.png}
\label{fig:9}
\end{figure}

\begin{figure}
\caption{Figure 10 Scalability for NAMD test case STMV.28M\label{ref-0115}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image16.png}
\label{fig:10}
\end{figure}
\subsection{4.7 PFARM\label{ref-0116}}

The code has been tested and timed on several architectures, designated in the following figures, tables and text as:

\begin{itemize}
\item CPU: node contains two 2.7 GHz, 12-core E5-2697 v2 (Ivy Bridge) series processors with 64 GB memory.

\item KNL: node is a 64-core KNL processor (model 7210) running at 1.30 GHz with 96 GB of memory.

\item GPU: node contains a dual-socket 16-core Haswell E5-2698 running at 2.3 GHz with 256 GB memory and 4 K40, 4 K80 or 4 P100 GPU.

\end{itemize}
Codes on all architectures are compiled with the Intel compiler (CPU v15, KNL \& GPU v17).

The divide-and-conquer eigensolver routine DSYEVD is used throughout the test runs. The routine is linked from the following numerical libraries:

\begin{itemize}
\item CPU: Intel MKL Version 11.2.2

\item KNL: Intel MKL Version 2017 Initial Release

\item GPU: MAGMA Version 2.2

\end{itemize}

\begin{figure}
\caption{Figure 11 Eigensolver performance on KNL and GPU\label{ref-0117}\label{ref-0118}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image17.png}
\label{fig:11}
\end{figure}
EXDIG calculations are dominated by the eigensolver operations required to diagonalise each sector Hamiltonian matrix. {\hyperref[ref-0117]{Figure 11}} summarises eigensolver performance, using DSYEVD, over a range of problem sizes for the Xeon (CPU), Intel Knights Landing (KNL) and a range of recent NVIDIA GPU architectures. The results are normalised to the single-node CPU performance using 24 OpenMP threads. The CPU runs use 24 OpenMP threads and the KNL runs use 64 OpenMP threads. Dense linear algebra calculations tend to be bound by memory bandwidth, so using hyper-threading on the KNL or CPU is not beneficial. MAGMA is able to parallelise the calculation automatically across multiple GPU on a compute node; these results are denoted by the x2 and x4 labels. {\hyperref[ref-0117]{Figure 11}} demonstrates that MAGMA performance relative to CPU performance increases as the problem size increases, because the relative overhead of data transfer, which scales as $O(N^2)$, decreases compared to the computational load, which scales as $O(N^3)$.
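This scaling argument can be stated explicitly. For a sector Hamiltonian matrix of dimension $N$, the ratio of the host-to-device transfer cost to the computational work of the eigensolve behaves as
\[
\frac{T_{\mathrm{transfer}}}{T_{\mathrm{compute}}} \sim \frac{O(N^2)}{O(N^3)} = O\!\left(\frac{1}{N}\right),
\]
so doubling the matrix dimension roughly halves the relative weight of the data transfers, which is consistent with the larger relative gains observed for the bigger Hamiltonians in {\hyperref[ref-0117]{Figure 11}}.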
\textit{Test Case 1 -- FeIII}

Defining computational characteristics: 10 Fine Region Sector calculations involving Hamiltonian matrices of dimension 23620 and 10 Coarse Region Sector calculations involving Hamiltonian matrices of dimension 11810.

\textit{Test Case 2 -- CH4}

Defining computational characteristics: 10 `Spin 1' Coarse Sector calculations involving Hamiltonian matrices of dimension 5720 and 10 `Spin 2' Coarse Sector calculations involving Hamiltonian matrices of dimension 7890.

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.19\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}
p{\dimexpr 0.12\linewidth-2\tabcolsep}
p{\dimexpr 0.08\linewidth-2\tabcolsep}
p{\dimexpr 0.09\linewidth-2\tabcolsep}
p{\dimexpr 0.09\linewidth-2\tabcolsep}
p{\dimexpr 0.08\linewidth-2\tabcolsep}
p{\dimexpr 0.1\linewidth-2\tabcolsep}
p{\dimexpr 0.1\linewidth-2\tabcolsep}}
 & CPU 24 threads & KNL 64 threads & K80 & K80x2 & K80x4 & P100 & P100x2 & P100x4 \\
Test Case 1; Atomic; FeIII & 4475 & 2610 & 1215 & 828 & 631 & 544 & 427 & 377 \\
Test Case 2; Molecular; CH4 & 466 & 346 & 180 & 150 & 134 & 119 & 107 & 111 \\

\end{tabularx}

\end{table}

\textbf{Table 7 Overall EXDIG runtime performance on various accelerators (runtime in seconds)\label{ref-0119}\label{ref-0120}}

{\hyperref[ref-0119]{Table 7}} records the overall run time on a range of architectures for both test cases described above. For the complete runs (including I/O), both the KNL-based and the GPU-based computations significantly outperform the CPU-based calculations. For Test Case 1, utilising a node with a single P100 GPU accelerator results in a runtime more than 8 times quicker than the CPU; for Test Case 2 the corresponding factor is approximately 4. The smaller Hamiltonian matrices associated with Test Case 2 mean that the $O(N^2)$ data transfer costs are relatively high compared to the $O(N^3)$ computation costs. Smaller matrices also result in poorer scaling as the number of GPU per node is increased for Test Case 2.

\textbf{\includegraphics[width=0.5\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image18.emf}\includegraphics[width=0.5\textwidth]{embeddings/Microsoft_Excel_Worksheet4.xlsx}}

\textbf{Table 8 Overall EXDIG runtime parallel performance using the MPI-GPU version\label{ref-0121}\label{ref-0122}}

A relatively simple MPI harness can be used in EXDIG to farm out different sector Hamiltonian calculations to multiple CPU, KNL or GPU nodes. {\hyperref[ref-0121]{Table 8}} shows that parallel scaling across nodes is very good for each test platform. This strategy is inherently scalable; however, the replicated-data approach requires significant amounts of memory per node. Test Case 1 is used as the dataset here, although the problem characteristics are slightly different to the setup used for {\hyperref[ref-0119]{Table 7}}, with 5 Fine Region sectors with a Hamiltonian dimension of 23620 and 20 Coarse Region sectors with a Hamiltonian dimension of 11810. With these characteristics, runs using 2 MPI tasks experience inferior load balancing in the Fine Region calculation compared to runs using 5 MPI tasks: the 5 Fine Region sectors cannot be distributed evenly over 2 tasks, whereas with 5 tasks each task handles exactly one of them.

\subsection{4.8 QCD\label{ref-0123}}

As stated in the description, the QCD benchmark has two implementations.
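The kernel timed by the second implementation (Section 4.8.2 below) is a conjugate gradient solve of the linear system $D\,x = b$, where $D$ is the clover-improved Wilson Dirac operator. Purely as an illustration of the structure of such a kernel, the following self-contained C sketch implements an unpreconditioned conjugate gradient iteration for a generic symmetric positive-definite operator; the simple tridiagonal operator used here is an assumed stand-in for $D$, and the code is not taken from either benchmark implementation.

\begin{verbatim}
/* Illustrative sketch only: unpreconditioned conjugate gradient for an
 * SPD operator, standing in for the D * x = b solve timed by the QCD
 * benchmark kernels.  The real codes apply the Wilson Dirac operator on
 * a 4-D lattice and use mixed precision; here a 1-D tridiagonal
 * operator keeps the example self-contained. */
#include <stdio.h>
#include <math.h>

#define N 512

/* Stand-in operator: SPD tridiagonal matrix (Laplacian plus mass term). */
static void apply_operator(const double *x, double *y)
{
    for (int i = 0; i < N; i++) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < N - 1) ? x[i + 1] : 0.0;
        y[i] = 2.1 * x[i] - left - right;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

int main(void)
{
    double b[N], x[N], r[N], p[N], Ap[N];

    for (int i = 0; i < N; i++) { b[i] = 1.0; x[i] = 0.0; }
    /* Initial residual r = b - A x = b, since x = 0; search direction p = r. */
    for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int iter = 0; iter < 1000 && sqrt(rr) > 1e-10; iter++) {
        apply_operator(p, Ap);            /* dominant cost per iteration */
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    printf("final residual norm: %e\n", sqrt(rr));
    return 0;
}
\end{verbatim}

In practice the dominant cost per iteration is the application of the operator $D$, so accelerator versions of such solvers concentrate on offloading and vectorising exactly that operation.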
\subsubsection{4.8.1 First implementation\label{ref-0124}}

\begin{figure}
\caption{Figure 12 Small test case results for QCD, first implementation\label{ref-0125}\label{ref-0126}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image19.png}
\label{fig:12}
\end{figure}

\begin{figure}
\caption{Figure 13 Large test case results for QCD, first implementation\label{ref-0127}\label{ref-0128}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image20.png}
\label{fig:13}
\end{figure}
{\hyperref[ref-0125]{Figure 12}} and {\hyperref[ref-0127]{Figure 13}} show the strong scaling on Titan and ARCHER for the small and large problem sizes, respectively. For ARCHER, both CPU per node are used. For Titan, we include results with and without GPU utilisation.

On each node, Titan has one 16-core Interlagos CPU and one K20X GPU, whereas ARCHER has two 12-core Ivy Bridge CPU. In this section, we evaluate on a node-by-node basis. For Titan, a single MPI task per node, operating on the CPU, is used to drive the GPU on that node. We also include, for Titan, results using only the CPU on each node, without any involvement from the GPU, for comparison. This means that, on a single node, our Titan results will be the same as the K20X and Interlagos results presented in the previous section (for the same test case). On ARCHER, however, we fully utilise both processors per node: to do this we use two MPI tasks per node, each with 12 OpenMP threads (via targetDP). So the single-node results for ARCHER are twice as fast as the Ivy Bridge single-processor results presented in the previous section.

\begin{figure}
\caption{Figure 14 Time taken by the full MILC 64x64x64x8 test cases on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures.\label{ref-0129}\label{ref-0130}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image21.emf}
\label{fig:14}
\end{figure}
In {\hyperref[ref-0129]{Figure 14}} we present preliminary results on the latest-generation Intel Knights Landing (KNL) and NVIDIA Pascal architectures, which offer very high bandwidth stacked memory, together with the same traditional Intel Ivy Bridge CPU used in the previous sections. Note that these results are not directly comparable with those presented earlier, since they are for a different test case size (larger, since we are no longer limited by the small memory size of the Knights Corner), and they are for a slightly updated version of the benchmark. The KNL is the 64-core 7210 model, available within a test and development platform provided as part of the ARCHER service. The Pascal is an NVIDIA P100 GPU provided as part of the ``Ouessant'' IBM service at IDRIS, where the host CPU is an IBM POWER8+.

It can be seen that the KNL is 7.5x faster than the Ivy Bridge; the Pascal is 13x faster than the Ivy Bridge; and the Pascal is 1.7x faster than the KNL.

\subsubsection{4.8.2 Second implementation\label{ref-0131}}

\textit{GPU results}

The GPU benchmark results of the second implementation were obtained on Piz Daint, located at CSCS in Switzerland, and on the GPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs are performed using the provided bash scripts. Piz Daint is equipped with one P100 Pascal GPU per node. Two different test cases are depicted: the ``strong-scaling'' mode with random lattice configurations of size 32x32x32x96 and 64x64x64x128.
The GPU nodes of Cartesius have two Kepler K40m GPU per node, and the ``strong-scaling'' test is shown for one card per node and for two cards per node. The benchmark kernel uses a conjugate gradient solver to solve the linear system $D\,x = b$ for the unknown solution $x$, where $D$ is the clover-improved Wilson Dirac operator and $b$ a known right-hand side.

\begin{figure}
\caption{Figure 15 Result of the second implementation of QCD on K40m GPU\label{ref-0132}\label{ref-0133}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image22.png}
\label{fig:15}
\end{figure}
{\hyperref[ref-0132]{Figure 15}} shows the strong scaling of the conjugate gradient solver on K40m GPU on Cartesius. The lattice size is 32x32x32x96, which corresponds to a moderate lattice size nowadays. The test is performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue). The run is done on one GPU per node (filled) and two GPU per node (non-filled).

\begin{figure}
\caption{Figure 16 Result of the second implementation of QCD on P100 GPU\label{ref-0134}\label{ref-0135}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image23.png}
\label{fig:16}
\end{figure}
{\hyperref[ref-0134]{Figure 16}} shows the strong scaling of the conjugate gradient solver on P100 GPU on Piz Daint. The lattice size is 32x32x32x96, the same as for the strong scaling run on the K40m on Cartesius. The test is performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue).

\begin{figure}
\caption{Figure 17 Result of the second implementation of QCD on P100 GPU for the larger test case\label{ref-0136}\label{ref-0137}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image24.png}
\label{fig:17}
\end{figure}
{\hyperref[ref-0136]{Figure 17}} shows the strong scaling of the conjugate gradient solver on P100 GPU on Piz Daint. The lattice size is increased to 64x64x64x128, which is a large lattice nowadays. With this larger lattice, the scaling test shows that the conjugate gradient solver has very good strong scaling up to 64 GPU.

\textit{Xeon Phi results}

The Xeon Phi benchmark results are obtained on Frioul at CINES and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNC per node. The data on Frioul are generated using the bash scripts provided with the second implementation of QCD, for the two ``strong-scaling'' test cases with lattice sizes of 32x32x32x96 and 64x64x64x128. For the data generated on MareNostrum, the ``strong-scaling'' mode on a 32x32x32x96 lattice is shown. The benchmark kernel uses a random gauge configuration and the conjugate gradient solver to solve a linear system involving the clover-improved Wilson Dirac operator.

\begin{figure}
\caption{Figure 18 Result of the second implementation of QCD on KNC\label{ref-0138}\label{ref-0139}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image25.png}
\label{fig:18}
\end{figure}
{\hyperref[ref-0138]{Figure 18}} shows the strong scaling of the conjugate gradient solver on KNC on the hybrid partition of MareNostrum III. The lattice size is 32x32x32x96, which corresponds to a moderate lattice size nowadays. The test is performed with a conjugate gradient solver in single precision, using the native mode and 60 OpenMP threads per MPI process.
The run is done on one KNC per node (filled) and two KNC per node (non-filled).

\begin{figure}
\caption{Figure 19 Result of the second implementation of QCD on KNL\label{ref-0140}\label{ref-0141}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image26.png}
\label{fig:19}
\end{figure}
{\hyperref[ref-0140]{Figure 19}} shows the strong scaling results of the conjugate gradient solver on KNL on Frioul. The lattice size is 32x32x32x96, the same as for the strong scaling run on the KNC on MareNostrum III. The run is performed in Quadrant/Cache mode with 68 OpenMP threads per KNL. The test is performed with a conjugate gradient solver in single precision.

\subsection{4.9 Quantum Espresso\label{ref-0142}}

Sample results for Quantum Espresso are presented here. The code has been run on Cartesius (see section {\hyperref[ref-0046]{2.2.1}}) and on Marconi (each node is a standalone KNL Xeon Phi 7250, 68 cores at 1.40 GHz, 16 GB MCDRAM, 96 GB DDR4 RAM; the interconnect is Intel Omni-Path).

\textit{Runs on GPU}

\begin{figure}
\caption{Figure 20 Scalability of Quantum Espresso on GPU for test case 1\label{ref-0143}\label{ref-0144}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image27.png}
\label{fig:20}
\end{figure}

\begin{figure}
\caption{Figure 21 Scalability of Quantum Espresso on GPU for test case 2\label{ref-0145}\label{ref-0146}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image28.png}
\label{fig:21}
\end{figure}
Both test cases ({\hyperref[ref-0143]{Figure 20}} and {\hyperref[ref-0145]{Figure 21}}) show no appreciable speed-up with GPU. The inputs are probably too small; they should evolve in a future version of this benchmark suite.

\textit{Runs on KNL}

\begin{figure}
\caption{Figure 22 Scalability of Quantum Espresso on KNL for test case 1\label{ref-0147}\label{ref-0148}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image29.png}
\label{fig:22}
\end{figure}
{\hyperref[ref-0147]{Figure 22}} shows the usual pw.x executable with the small test case A (AUSURF), comparing Marconi Broadwell (36 cores/node) with KNL (68 cores/node); this test case is probably too small for testing on KNL.

\begin{figure}
\caption{Figure 23 Quantum Espresso - KNL vs BDW vs BGQ (at scale)\label{ref-0149}\label{ref-0150}}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image30.png}
\label{fig:23}
\end{figure}
{\hyperref[ref-0149]{Figure 23}} presents CNT10POR8, the large test case, even though it uses the cp.x executable (i.e. Car-Parrinello) rather than the usual pw.x (plane-wave SCF calculation).

\subsection{4.10 Synthetic benchmarks (SHOC)\label{ref-0151}\label{ref-0152}}

The SHOC benchmark has been run on Cartesius, Ouessant and MareNostrum.
{\hyperref[ref-0153]{Table 9}} presents the results:

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.25\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}
p{\dimexpr 0.1\linewidth-2\tabcolsep}
p{\dimexpr 0.12\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}}
 & \multicolumn{3}{l}{NVIDIA GPU} & \multicolumn{2}{l}{Intel Xeon Phi} & Host CPU \\
 & K40 CUDA & K40 OpenCL & Power 8 + P100 CUDA & KNC Offload & KNC OpenCL & Haswell OpenCL \\
BusSpeedDownload & 10.5 GB/s & 10.56 GB/s & 32.23 GB/s & 6.6 GB/s & 6.8 GB/s & 12.4 GB/s \\
BusSpeedReadback & 10.5 GB/s & 10.56 GB/s & 34.00 GB/s & 6.7 GB/s & 6.8 GB/s & 12.5 GB/s \\
maxspflops & 3716 GFLOPS & 3658 GFLOPS & 10424 GFLOPS & \textcolor{color-4}{21581} & \textcolor{color-4}{2314 GFLOPS} & 1647 GFLOPS \\
maxdpflops & 1412 GFLOPS & 1411 GFLOPS & 5315 GFLOPS & \textcolor{color-4}{16017} & \textcolor{color-4}{2318 GFLOPS} & 884 GFLOPS \\
gmem\_readbw & 177 GB/s & 179 GB/s & 575.16 GB/s & 170 GB/s & 49.7 GB/s & 20.2 GB/s \\
gmem\_readbw\_strided & 18 GB/s & 20 GB/s & 99.15 GB/s & N/A & 35 GB/s & \textcolor{color-4}{156 GB/s} \\
gmem\_writebw & 175 GB/s & 188 GB/s & 436 GB/s & 72 GB/s & 41 GB/s & 13.6 GB/s \\
gmem\_writebw\_strided & 7 GB/s & 7 GB/s & 26.3 GB/s & N/A & 25 GB/s & \textcolor{color-4}{163 GB/s} \\
lmem\_readbw & 1168 GB/s & 1156 GB/s & 4239 GB/s & N/A & 442 GB/s & 238 GB/s \\
lmem\_writebw & 1194 GB/s & 1162 GB/s & 5488 GB/s & N/A & 477 GB/s & 295 GB/s \\
BFS & 49,236,500 Edges/s & 42,088,000 Edges/s & 91,935,100 Edges/s & N/A & 1,635,330 Edges/s & 14,225,600 Edges/s \\
FFT\_sp & 523 GFLOPS & 377 GFLOPS & 1472 GFLOPS & 135 GFLOPS & 71 GFLOPS & 80 GFLOPS \\
FFT\_dp & 262 GFLOPS & 61 GFLOPS & 733 GFLOPS & 69.5 GFLOPS & 31 GFLOPS & 55 GFLOPS \\
SGEMM & 2900-2990 GFLOPS & 694/761 GFLOPS & 8604-8720 GFLOPS & 640/645 GFLOPS & 179/217 GFLOPS & 419-554 GFLOPS \\
DGEMM & 1025-1083 GFLOPS & 411/433 GFLOPS & 3635-3785 GFLOPS & 179/190 GFLOPS & 76/100 GFLOPS & 189-196 GFLOPS \\
MD (SP) & 185 GFLOPS & 91 GFLOPS & 483 GFLOPS & 28 GFLOPS & 33 GFLOPS & 114 GFLOPS \\
MD5Hash & 3.38 GH/s & 3.36 GH/s & 15.77 GH/s & N/A & 1.7 GH/s & 1.29 GH/s \\
Reduction & 137 GB/s & 150 GB/s & 271 GB/s & 99 GB/s & 10 GB/s & 91 GB/s \\
Scan & 47 GB/s & 39 GB/s & 99.2 GB/s & 11 GB/s & 4.5 GB/s & 15 GB/s \\
Sort & 3.08 GB/s & 0.54 GB/s & 12.54 GB/s & N/A & 0.11 GB/s & 0.35 GB/s \\
Spmv & 4-23 GFLOPS & 3-17 GFLOPS & 23-65 GFLOPS & \textcolor{color-4}{1-17944 GFLOPS} & N/A & 1-10 GFLOPS \\
Stencil2D & 123 GFLOPS & 135 GFLOPS & 465 GFLOPS & 89 GFLOPS & 8.95 GFLOPS & 34 GFLOPS \\
Stencil2D\_dp & 57 GFLOPS & 67 GFLOPS & 258 GFLOPS & 16 GFLOPS & 7.92 GFLOPS & 30 GFLOPS \\
Triad & 13.5 GB/s & 9.9 GB/s & 43 GB/s & 5.76 GB/s & 5.57 GB/s & 8 GB/s \\
S3D (level2) & 94 GFLOPS & 91 GFLOPS & 294 GFLOPS & 109 GFLOPS & 18 GFLOPS & 27 GFLOPS \\

\end{tabularx}

\end{table}

\textbf{Table 9 Synthetic benchmark results on GPU and Xeon Phi\label{ref-0153}\label{ref-0154}}

Measurements marked in red are not relevant and should not be considered:

\begin{itemize}
\item KNC MaxFlops (both SP and DP): in this case the compiler optimises away some of the computation (although it should not) {\hyperref[ref-0034]{[19]}}.

\item KNC SpMV: for these benchmarks this is a known bug currently being addressed {\hyperref[ref-0035]{[20]}}.
\item Haswell gmem\_readbw\_strided and gmem\_writebw\_strided: the strided read/write benchmarks do not make much sense for the CPU, as the data will be cached in the large L3 cache. This is why high numbers are seen only in the Haswell case.

\end{itemize}
\subsection{4.11 SPECFEM3D\label{ref-0155}}

Tests have been carried out on Ouessant and Frioul.

So far it has only been possible to run on one fixed core count for each test case, so scaling curves are not available. Test case A ran on 4 KNL and 4 P100. Test case B ran on 10 KNL and 4 P100.

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.4\linewidth-2\tabcolsep}
p{\dimexpr 0.27\linewidth-2\tabcolsep}
p{\dimexpr 0.33\linewidth-2\tabcolsep}}
 & KNL & P100 \\
Test case A & 66 & 105 \\
Test case B & 21.4 & 68 \\

\end{tabularx}

\end{table}

\textbf{Table 10 SPECFEM3D\_GLOBE results (run time in seconds)\label{ref-0156}}

\section{5 Conclusion and future work\label{ref-0157}\label{ref-0158}}

The work presented here stands as a first step towards application benchmarking on accelerators. Most codes have been selected from the main Unified European Application Benchmark Suite (UEABS). This document describes each of them, covering their implementation, their relevance to the European science community and the test cases, and presents results on leading-edge systems.

The suite will be publicly available on the PRACE web site {\hyperref[ref-0016]{[1]}}, where links to download sources and test cases will be published along with compilation and run instructions.

Task 7.2B in PRACE-4IP started to design a benchmark suite for accelerators. This work has been carried out with the aim of integrating it into the main UEABS so that both suites can be maintained and evolve together. As the PCP (PRACE-3IP) machines will soon be available, it will be very interesting to run the benchmark suite on them: first because these machines will be larger, but also because they will feature energy consumption probes.

\end{document}
References and Applicable Documents

[1] http://www.prace-ri.eu
[2] The Unified European Application Benchmark Suite – http://www.prace-ri.eu/ueabs/
[3] D7.4 Unified European Applications Benchmark Suite – Mark Bull et al. – 2013
[4] http://www.nvidia.com/object/quadro-design-and-manufacturing.html
[5] Cartesius system description – https://userinfo.surfsara.nl/systems/cartesius/description
[6] MareNostrum III User's Guide, Barcelona Supercomputing Center – https://www.bsc.es/support/MareNostrum3-ug.pdf
[7] http://www.idris.fr/eng/ouessant/
[8] PFARM reference – https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm_long_lug.pdf
[9] Solvent-Driven Preferential Association of Lignin with Regions of Crystalline Cellulose in Molecular Dynamics Simulation – Benjamin Lindner et al. – Biomacromolecules, 2013
[10] NAMD website – http://www.ks.uiuc.edu/Research/namd/
[11] SHOC source repository – https://github.com/vetter/shoc
[12] Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics – R. Babbich, M. Clark and B. Joo – SC 10 (Supercomputing 2010)
[13] Lattice QCD on Intel Xeon Phi – B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III – International Supercomputing Conference (ISC'13), 2013
[14] Extension of fractional step techniques for incompressible flows: The preconditioned Orthomin(1) for the pressure Schur complement – G. Houzeaux, R. Aubry, and M. Vázquez – Computers & Fluids, 44:297-313, 2011
[15] MIMD Lattice Computation (MILC) Collaboration – http://physics.indiana.edu/~sg/milc.html
[16] targetDP – https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README
[17] QUDA: A library for QCD on GPU – https://lattice.github.io/quda/
[18] QPhiX, QCD for Intel Xeon Phi and Xeon processors – http://jeffersonlab.github.io/qphix/
[19] KNC MaxFlops issue (both SP and DP) – https://github.com/vetter/shoc/issues/37
[20] KNC SpMV issue – https://github.com/vetter/shoc/issues/24, https://github.com/vetter/shoc/issues/23
Grid software for seamless access to distributed resources.List of Project Partner AcronymsBADW-LRZLeibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Germany (3rd Party to GCS)BILKENTBilkent University, Turkey (3rd Party to UYBHM)BSCBarcelona Supercomputing Center - Centro Nacional de Supercomputacion, Spain CaSToRCComputation-based Science and Technology Research Center, CyprusCCSASComputing Centre of the Slovak Academy of Sciences, SlovakiaCEACommissariat à l’Energie Atomique et aux Energies Alternatives, France (3 rd Party to GENCI)CESGAFundacion Publica Gallega Centro Tecnológico de Supercomputación de Galicia, Spain, (3rd Party to BSC)CINECACINECA Consorzio Interuniversitario, ItalyCINESCentre Informatique National de l’Enseignement Supérieur, France (3 rd Party to GENCI)CNRSCentre National de la Recherche Scientifique, France (3 rd Party to GENCI)CSCCSC Scientific Computing Ltd., FinlandCSICSpanish Council for Scientific Research (3rd Party to BSC)CYFRONETAcademic Computing Centre CYFRONET AGH, Poland (3rd party to PNSC)EPCCEPCC at The University of Edinburgh, UK ETHZurich (CSCS)Eidgenössische Technische Hochschule Zürich – CSCS, SwitzerlandFISFACULTY OF INFORMATION STUDIES, Slovenia (3rd Party to ULFME)GCSGauss Centre for Supercomputing e.V.GENCIGrand Equipement National de Calcul Intensiv, FranceGRNETGreek Research and Technology Network, GreeceINRIAInstitut National de Recherche en Informatique et Automatique, France (3 rd Party to GENCI)ISTInstituto Superior Técnico, Portugal (3rd Party to UC-LCA)IUCCINTER UNIVERSITY COMPUTATION CENTRE, IsraelJKUInstitut fuer Graphische und Parallele Datenverarbeitung der Johannes Kepler Universitaet Linz, AustriaJUELICHForschungszentrum Juelich GmbH, GermanyKTHRoyal Institute of Technology, Sweden (3 rd Party to SNIC)LiULinkoping University, Sweden (3 rd Party to SNIC)NCSANATIONAL CENTRE FOR SUPERCOMPUTING APPLICATIONS, BulgariaNIIFNational Information Infrastructure Development Institute, HungaryNTNUThe Norwegian University of Science and Technology, Norway (3rd Party to SIGMA)NUI-GalwayNational University of Ireland Galway, IrelandPRACEPartnership for Advanced Computing in Europe aisbl, BelgiumPSNCPoznan Supercomputing and Networking Center, PolandRISCSWRISC Software GmbHRZGMax Planck Gesellschaft zur Förderung der Wissenschaften e.V., Germany (3 rd Party to GCS)SIGMA2UNINETT Sigma2 AS, NorwaySNICSwedish National Infrastructure for Computing (within the Swedish Science Council), SwedenSTFCScience and Technology Facilities Council, UK (3rd Party to EPSRC)SURFsaraDutch national high-performance computing and e-Science support center, part of the SURF cooperative, NetherlandsUC-LCAUniversidade de Coimbra, Labotatório de Computação Avançada, PortugalUCPHKøbenhavns Universitet, DenmarkUHEMIstanbul Technical University, Ayazaga Campus, TurkeyUiOUniversity of Oslo, Norway (3rd Party to SIGMA)ULFMEUNIVERZA V LJUBLJANI, SloveniaUmUUmea University, Sweden (3 rd Party to SNIC)UnivEvoraUniversidade de Évora, Portugal (3rd Party to UC-LCA)UPCUniversitat Politècnica de Catalunya, Spain (3rd Party to BSC)UPM/CeSViMaMadrid Supercomputing and Visualization Center, Spain (3rd Party to BSC)USTUTT-HLRSUniversitaet Stuttgart – HLRS, Germany (3rd Party to GCS)VSB-TUOVYSOKA SKOLA BANSKA - TECHNICKA UNIVERZITA OSTRAVA, Czech RepublicWCNSPolitechnika Wroclawska, Poland (3rd party to PNSC)Executive SummaryThis document describes an accelerator benchmark suite, a set of 11 codes that includes 1 synthetic benchmark and 10 commonly used applications. 
The key focus of this task has been exploiting accelerators or co-processors to improve the performance of real applications. It aims at providing a set of scalable, currently relevant and publically available codes and datasets.This work has been undertaken by Task7.2B "Accelerator Benchmarks" in the PRACE Fourth Implementation Phase (PRACE-4IP) project.Most of the selected application are a subset of the Unified European Applications Benchmark Suite (UEABS) . One application and a synthetic benchmark have been added.As a result, selected codes are: Alya, Code_Saturne, CP2K, GROMACS, GPAW, NAMD, PFARM, QCD, Quantum Espresso, SHOC and SPECFEM3D.For each code either two or more test case datasets have been selected. These are described in this document, along with a brief introduction to the application codes themselves. For each code, some sample results are presented, from first run on leading edge systems and prototypes.1IntroductionThe work produced within this task is an extension of the UEABS for accelerators. This document will cover each code, presenting the code as well as the test cases defined for the benchmarks and the first results that have been recorded on various accelerator systems.As the UEABS, this suite aims to present results for many scientific fields that can use HPC accelerated resources. Hence, it will help the European scientific communities to decide in terms of infrastructures they could buy in a near future. We focus on Intel Xeon Phi coprocessors and NVIDIA GPU cards for benchmarking as they are the two most wide-spread accelerated resources available now.Section will present both types of accelerator systems, Xeon Phi and GPU card along with architecture examples. Section gives a description of each of the selected applications, together with the test case datasets while section presents some sample results. Section outlines further work on, and using, the suite.2Targeted architecturesThis suite is targeting accelerator cards, more specifically the Intel Xeon Phi and NVIDIA GPU architecture. This section will quickly describe them and will present the 4 machines, the benchmarks ran on.2.1Co-processor descriptionScientific computing using co-processors has gained popularity in recent years. First the utility of GPU has been demonstrated and evaluated in several application domains . As a response to NVIDIA’s supremacy in this field, Intel designed Xeon Phi cards.Architectures and programming models of co-processors may differ from CPU and vary among different co-processor types. The main challenges are the high-level parallelism ability required from software and the fact that code may have to be offloaded to the accelerator card.The enlightens this fact:Intel Xeon PhiNVIDIA GPU 5110P (KNC)7250 (KNL)K40mP100public availability date Nov-12Jun-16Jun-13May-16theoretical peak perf1,011 GF/s3,046 GF/s1,430 GF/s5,300 GF/soffload requiredpossiblenot possiblerequiredrequiredmax number of thread/cuda cores24027228803584Table 1 Main co-processors specifications2.2Systems descriptionThe benchmark suite has been officially granted access to 4 different machines hosted by PRACE partners. Most results presented in this paper were obtained on these machines but some of the simulation has run on similar ones. 
This section will cover specifications of the sub mentioned 4 official systems while the few other ones will be presented along with concerned results.As it can be noticed on the previous section, leading edge architectures have been available quite recently and some code couldn't run on it yet. Results will be completed in a near future and will be delivered with an update of the benchmark suite. Still, presented performances are a good indicator about potential efficiency of codes on both Xeon Phi and NVIDIA GPU platforms.As for the future, the PRACE-3IP PCP is in its third and last phase and will be a good candidate to provide access to bigger machines. The following suppliers had been awarded with a contract: ATOS/Bull SAS (France), E4 Computer Engineering (Italy) and Maxeler Technologies (UK), providing pilots using Xeon Phi, OPENPower and FPGA technologies. During this final phase, which started in October 2016, the contractors will have to deploy pilot system with a compute capability of around 1 PFlop/s, to demonstrate technology readiness of the proposed solution and the progress in terms of energy efficiency, using high frequency monitoring designed for this purpose. These results will be evaluated on a subset of applications from UEABS (NEMO, SPECFEM3D, QuantumEspresso, BQCD). The access to these systems is foreseen to be open to PRACE partners, with a special interest for the 4IP-WP7 task on accelerated Benchmarks.2.2.1Cartesius K40The SURFsara institute in The Netherlands granted access to Cartesius which has a GPU island (installed May 2014) with following specifications :66 Bullx B515 GPU accelerated nodes2x 8-core 2.5 GHz Intel Xeon E5-2450 v2 (Ivy Bridge) CPU/node2x NVIDIA Tesla K40m GPU/node96 GB/node, DDR3-1600 RAMTotal theoretical peak performance (Ivy Bridge + K40m) 1,056 cores + 132 GPU: 210 TF/sThe interconnect has a fully non-blocking fat-tree topology. Every node has two ConnectX-3 InfiniBand FDR adapters: one per GPU.2.2.2MareNostrum KNCThe Barcelona Supercomputing Center (BSC) in Spain granted access to MareNostrum III which features KNC nodes (upgrade June 2013). Here's the description of this partition :42 hybrid nodes containing:1x Sandy-Bridge-EP (2 x 8 cores) host processors E5-2670 8x 8G DDR3–1600 DIMMs (4GB/core), total: 64GB/node2x Xeon Phi 5110P acceleratorsInterconnection networks:Infiniband Mellanox FDR10: High bandwidth network used by parallel applications communications (MPI)Gigabit Ethernet: 10GbitEthernet network used by the GPFS Filesystem.2.2.3Ouessant P100GENCI granted access to the Ouessant prototype at IDRIS in France (installed September 2016). It is composed of 12 IBM Minsky compute nodes with each containing :Compute nodesPOWER8+ sockets, 10 cores, 8 threads per core (or 160 threads per node)128 GB of DDR4 memory (bandwidth > 9 GB/s per core)4 NVIDIA’s new generation Pascal P100 GPU, 16 GB of HBM2 memoryInterconnect4 NVLink interconnects (40GB/s of bi-directional bandwidth per interconnect); each GPU card is connected to a CPU with 2 NVLink interconnects and another GPU with 2 interconnects remainingA Mellanox EDR InfiniBand CAPI interconnect network (1 interconnect per node)2.2.4Frioul KNLGENCI also granted access to the Frioul prototype at CINES in France (installed December 2016). 
It is composed of 48 Intel KNL compute nodes, each containing:
- Compute node:
  - 7250 KNL, 68 cores, 4 threads per core
  - 192 GB of DDR4 memory
  - 16 GB of MCDRAM
- Interconnect:
  - A Mellanox EDR 4x InfiniBand

3 Benchmark suite description

This part covers each code, presenting its interest for the scientific community as well as the test cases defined for the benchmarks. As an extension of the UEABS, most codes presented in this suite are included in the latter. The exceptions are PFARM, which comes from PRACE-2IP, and SHOC, a synthetic benchmark suite.

Table 2 (codes and corresponding available APIs) lists the codes that are presented in the next sections as well as their available implementations. It should be noted that OpenMP can be used with the Intel Xeon Phi architecture while CUDA is used for NVIDIA GPU cards. OpenCL has been considered as a third alternative that can be used on both architectures. It has been available on the first generation of Xeon Phi (KNC) but has not been ported to the second one (KNL). SHOC is the only code that is impacted; this problem is addressed in section 3.10.

3.1 Alya

Alya is a high performance computational mechanics code that can solve different coupled mechanics problems: incompressible/compressible flows, solid mechanics, chemistry, excitable media, heat transfer and Lagrangian particle transport. It is one single code; there are no particular parallel or individual platform versions. Modules, services and kernels can be compiled individually and used à la carte. The main discretisation technique employed in Alya is based on the variational multiscale finite element method to assemble the governing equations into algebraic systems. These systems can be solved using solvers like GMRES, Deflated Conjugate Gradient and pipelined CG, together with preconditioners like SSOR, Restricted Additive Schwarz, etc. The coupling between physics solved in different computational domains (like fluid-structure interactions) is carried out in a multi-code way, using different instances of the same executable. Asynchronous coupling can be achieved in the same way in order to transport Lagrangian particles.

3.1.1 Code description

The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available: without and with a colouring strategy to avoid ATOMICs during the assembly step. A CUDA version is also available for the different solvers. Alya has also been compiled for MIC (Intel Xeon Phi).

Alya is written in Fortran 95 and the incompressible fluid module, present in the benchmark suite, is freely available. This module solves the Navier-Stokes equations using an Orthomin(1) method for the pressure Schur complement. This method is an algebraic split strategy which converges to the monolithic solution. At each linearisation step, the momentum is solved twice and the continuity equation is solved once or twice depending on whether the momentum-preserving or the continuity-preserving algorithm is selected.

3.1.2 Test cases description

Cavity-hexahedra elements (10M elements)

This test is the classical lid-driven cavity. The problem geometry is a cube of dimensions 1x1x1. The fluid properties are density=1.0 and viscosity=0.01. Dirichlet boundary conditions are applied on all sides, with three no-slip walls and one moving wall with velocity equal to 1.0, which corresponds to a Reynolds number of 100. The Reynolds number is low, so the regime is laminar and turbulence modelling is not necessary. The domain is discretised into 9,800,344 hexahedra elements.
The solvers used are the GMRES method for the momentum equations and the Deflated Conjugate Gradient for the continuity equation. This test case can be run using pure MPI parallelisation or the hybrid MPI/OpenMP strategy.

Cavity-hexahedra elements (30M elements)

This is the same cavity test as before but with 30M elements. Note that a mesh multiplication strategy enables one to multiply the number of elements by powers of 8, by simply activating the corresponding option in the ker.dat file.

Cavity-hexahedra elements-GPU version (10M elements)

This is the same test as test case 1, but using the pure MPI parallelisation strategy with acceleration of the algebraic solvers on GPU.

3.2 Code_Saturne

Code_Saturne is a CFD software package developed by EDF R&D since 1997 and open-source since 2007. The Navier-Stokes equations are discretised following a finite volume method approach. The code can handle any type of mesh built with any type of cell/grid structure. Incompressible and compressible flows can be simulated, with or without heat transfer, and a range of turbulence models is available. The code can also be coupled with itself or with other software to model some multi-physics problems (fluid-structure interaction or conjugate heat transfer, for instance).

3.2.1 Code description

Parallelism is handled by distributing the domain over the processors (several partitioning tools are available, either internal, i.e. SFC Hilbert and Morton, or through external libraries, i.e. serial METIS, ParMETIS, serial Scotch and PT-SCOTCH). Communications between subdomains are handled by MPI. Hybrid parallelism using MPI/OpenMP has recently been optimised for improved multicore performance.

For incompressible simulations, most of the time is spent in the computation of the pressure through Poisson equations, and the matrices are very sparse. PETSc has recently been linked to the code to offer alternatives to the internal solvers to compute the pressure. The developer's version of PETSc supports CUDA and is used in this benchmark suite.

Code_Saturne is written in C, Fortran 95 and Python. It is freely available under the GPL license.

3.2.2 Test cases description

Two test cases are dealt with, the former with a mesh made of hexahedral cells and the latter with a mesh made of tetrahedral cells. Both configurations are meant for incompressible laminar flows. The first test case is run on KNL in order to test the performance of the code always completely filling up a node, using 64 MPI tasks and then either 1, 2 or 4 OpenMP threads, or 1, 2 or 4 extra MPI tasks, to investigate the effect of hyper-threading. In this case, the pressure is computed using the code's native Algebraic Multigrid (AMG) algorithm as a solver. The second test case is run on KNL and GPU. In this configuration, the pressure equation is solved using the conjugate gradient (CG) algorithm from the PETSc library (the developer's version of PETSc, which supports GPU) and tests are run on KNL as well as on CPU+GPU. PETSc is built with the CUSP library and the CUSP format is used.

Note that computing the pressure using a CG algorithm has always been slower than using the native AMG algorithm in Code_Saturne. The second test is therefore meant to compare the current results obtained on KNL and GPU using CG only, and not to compare CG and AMG time to solution.

Flow in a 3-D lid-driven cavity (tetrahedral cells)

The geometry is very simple, i.e. a cube, but the mesh is built using tetrahedral cells only.
The Reynolds number is set to 100, and symmetry boundary conditions are applied in the spanwise direction. The case is modular and the mesh size can easily be varied. The largest mesh has about 13 million cells and is used to get some first comparisons using Code_Saturne linked to the developer's PETSc library, in order to make use of the GPU.

3-D Taylor-Green vortex flow (hexahedral cells)

The Taylor-Green vortex flow is traditionally used to assess the accuracy of CFD code numerical schemes. Periodicity is used in the 3 directions. The evolutions of the total kinetic energy (integral of the velocity) and of the enstrophy (integral of the vorticity) as a function of time are monitored. Code_Saturne is set for 2nd-order time and spatial schemes. The mesh size is 256³ cells.

3.3 CP2K

CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. It can perform molecular dynamics, metadynamics, Quantum Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or the dimer method.

CP2K provides a general framework for different modelling methods such as density functional theory (DFT) using the mixed Gaussian and plane waves (GPW) and Gaussian and augmented plane waves (GAPW) approaches. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, ...), and classical force fields (AMBER, CHARMM, ...).

3.3.1 Code description

Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI. Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi).

CP2K is written in Fortran 2003 and freely available under the GPL license.

3.3.2 Test cases description

LiH-HFX

This is a single-point energy calculation for a particular configuration of a 216-atom lithium hydride crystal with 432 electrons in a 12.3 Å³ (Angstroms cubed) cell. The calculation is performed using a DFT algorithm with GAPW under the hybrid Hartree-Fock exchange (HFX) approximation. These types of calculations are generally around one hundred times the computational cost of a standard local DFT calculation, although the cost of the latter can be reduced by using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here, as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node and thus enough memory is available to avoid recomputing any integrals on the fly, improving performance. This test case is expected to scale efficiently to 1000+ nodes.

H2O-DFT-LS

This is a single-point energy calculation for 2048 water molecules in a 39 Å³ box using linear-scaling DFT. A local-density approximation (LDA) functional is used to compute the exchange-correlation energy in combination with a DZVP MOLOPT basis set and a 300 Ry cutoff. For large systems, the linear-scaling approach for solving the Self-Consistent-Field equations should be much cheaper computationally than standard DFT, and allows scaling up to 1 million atoms for simple systems. The linear-scaling cost results from the fact that the algorithm is based on an iteration on the density matrix.
The cubically-scaling orthogonalisation step of standard DFT is avoided and the key operations are sparse matrix-matrix multiplications, whose number of non-zero entries scales linearly with system size. These are implemented efficiently in CP2K's DBCSR library. This test case is expected to scale efficiently to 4000+ nodes.

3.4 GPAW

GPAW is a DFT program for ab-initio electronic structure calculations using the projector augmented wave method. It uses a uniform real-space grid representation of the electronic wavefunctions that allows for excellent computational scalability and systematic convergence properties.

3.4.1 Code description

GPAW is written mostly in Python, but also includes computational kernels written in C, as well as leveraging external libraries such as NumPy, BLAS and ScaLAPACK. Parallelisation is based on message passing using MPI, with no threading. Development branches for GPU and MIC include support for offloading to accelerators using either CUDA or pyMIC, respectively. GPAW is freely available under the GPL license.

3.4.2 Test cases description

Carbon Nanotube

This test case is a ground state calculation for a carbon nanotube in vacuum. By default, it uses a 6-6-10 nanotube with 240 atoms (freely adjustable) and serial LAPACK, with an option to use ScaLAPACK. This benchmark is aimed at smaller systems, with an intended scaling range of up to 10 nodes.

Copper Filament

This test case is a ground state calculation for a copper filament in vacuum. By default, it uses a 2x2x3 FCC lattice with 71 atoms (freely adjustable) and ScaLAPACK for parallelisation. This benchmark is aimed at larger systems, with an intended scaling range of up to 100 nodes. A lower limit on the number of nodes may be imposed by the amount of memory required, which can be adjusted to some extent with the run parameters (e.g. lattice size or grid spacing).

3.5 GROMACS

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the non-bonded interactions (which usually dominate simulations) many groups also use it for research on non-biological systems, e.g. polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation, and some additional features. It provides extremely high performance compared to all other programs. Many algorithmic optimisations have been introduced in the code; for instance, the calculation of the virial has been extracted from the innermost loops over pairwise interactions, and dedicated software routines are used to calculate the inverse square root. In GROMACS 4.6 and up, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms into SIMD machine instructions, to utilise the available instruction-level parallelism; a schematic (non-GROMACS) example of such an intrinsics-based inner loop is sketched below.
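The following is a minimal illustration, in C with SSE intrinsics, of the style of inner loop described above: it computes inverse square roots using the hardware reciprocal-square-root estimate plus one Newton-Raphson refinement step. It is not taken from GROMACS; the array layout, vector width and accuracy target are simplifying assumptions made purely for illustration.

/* Illustration only: a minimal SSE inner loop computing 1/sqrt(x) for an
 * array of squared distances, refined by one Newton-Raphson step.
 * NOT GROMACS source code, merely a sketch of the intrinsics-based style. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: _mm_rsqrt_ps, _mm_mul_ps, ... */

static void inv_sqrt_sse(const float *r2, float *rinv, int n)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    for (int i = 0; i < n; i += 4) {          /* n assumed to be a multiple of 4 */
        __m128 x   = _mm_loadu_ps(r2 + i);
        __m128 est = _mm_rsqrt_ps(x);         /* ~12-bit accurate estimate */
        /* One Newton-Raphson iteration: est = 0.5 * est * (3 - x*est*est) */
        __m128 tmp = _mm_mul_ps(_mm_mul_ps(x, est), est);
        est = _mm_mul_ps(_mm_mul_ps(half, est), _mm_sub_ps(three, tmp));
        _mm_storeu_ps(rinv + i, est);
    }
}

int main(void)
{
    float r2[8]   = {1.0f, 4.0f, 9.0f, 16.0f, 25.0f, 36.0f, 49.0f, 64.0f};
    float rinv[8];
    inv_sqrt_sse(r2, rinv, 8);
    for (int i = 0; i < 8; ++i)
        printf("1/sqrt(%.0f) ~ %f\n", r2[i], rinv[i]);
    return 0;
}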
These kernels are available in both single and double precision, and support all the different kinds of SIMD found in x86-family (and other) processors.

3.5.1 Code description

Parallelisation is achieved using combined OpenMP and MPI. Offloading for accelerators is implemented through CUDA for GPU and through OpenMP for MIC (Intel Xeon Phi).

GROMACS is written in C/C++ and freely available under the GPL license.

3.5.2 Test cases description

GluCl Ion Channel

The ion channel system is the membrane protein GluCl, a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains 142k atoms and is quite a challenging parallelisation case due to its small size. However, it is likely one of the most wanted target sizes for biomolecular simulations due to the importance of these proteins for pharmaceutical applications. It is particularly challenging due to the highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition. This test case was used as the "Small" test case in previous PRACE 2IP and 3IP phases. It is included in the package's version 5.0 benchmark cases. It is reported to scale efficiently up to 1000+ cores on x86-based systems.

Lignocellulose

A model of cellulose and lignocellulosic biomass in an aqueous solution. This inhomogeneous system of 3.3 million atoms uses reaction-field electrostatics instead of PME and therefore scales well on x86. It was used as the "Large" test case in previous PRACE 2IP and 3IP projects, where it was reported to scale efficiently up to 10000+ x86 cores.

3.6 NAMD

NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms. NAMD is developed by the "Theoretical and Computational Biophysics Group" at the University of Illinois at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability when utilising a large number of processors. The application can read a wide variety of different file formats, for example force fields and protein structures, which are commonly used in bio-molecular science. A NAMD license can be applied for on the developer's website free of charge. Once the license has been obtained, binaries for a number of platforms and the source can be downloaded from the website. Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins or between proteins and other chemical substances is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins.

3.6.1 Code description

NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI, supporting both pure MPI and hybrid parallelisation. Offloading for accelerators is implemented for both GPU and MIC (Intel Xeon Phi).

3.6.2 Test cases description

The datasets are based on the original "Satellite Tobacco Mosaic Virus (STMV)" dataset from the official NAMD site. The memory-optimised build of the package and data sets are used in benchmarking. Data are converted to the appropriate binary format used by the memory-optimised build.

STMV.1M

This is the original STMV dataset from the official NAMD site. The system contains roughly 1 million atoms.
This data set scales efficiently up to 1000+ x86 Ivy Bridge cores.

STMV.8M

This is a 2x2x2 replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms. This data set scales efficiently up to 6000 x86 Ivy Bridge cores.

STMV.28M

This is a 3x3x3 replication of the original STMV dataset from the official NAMD site. The system contains roughly 28 million atoms. This data set also scales efficiently up to 6000 x86 Ivy Bridge cores.

3.7 PFARM

PFARM is part of a suite of programs based on the 'R-matrix' ab-initio approach to the variational solution of the many-electron Schrödinger equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical applications (such as the interstellar medium and planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel 'outer-region' calculations for molecular systems as well as atomic systems.

3.7.1 Code description

In order to enable efficient computation, the external region calculation takes place in two distinct stages, named EXDIG and EXAS, with intermediate files linking the two. EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. EXAS uses a combined functional/domain decomposition approach where good load-balancing is essential to maintain efficient parallel performance. Each of the main stages in the calculation is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and is designed to take advantage of highly optimised numerical library routines. Hybrid MPI/OpenMP parallelisation has also been introduced into the code via shared-memory-enabled numerical library kernels.

Accelerator-based implementations exist for both EXDIG and EXAS. EXDIG uses offloading via MAGMA (or MKL) for sector Hamiltonian diagonalisations on Intel Xeon Phi and GPU accelerators. EXAS uses combined MPI and OpenMP to distribute the scattering energy calculations efficiently both across and within Intel Xeon Phi co-processors.

3.7.2 Test cases description

External region R-matrix propagations take place over the outer partition of configuration space, including the region where long-range potentials remain important. The radius of this region is determined from the user input and the program decides upon the best strategy for dividing this space into multiple sub-regions (or sectors). Generally, a choice of larger sector lengths requires the application of larger numbers of basis functions (and therefore larger Hamiltonian matrices) in order to maintain accuracy across the sector, and vice-versa. Memory limits on the target hardware may determine the final preferred configuration for each test case.

Iron, FeIII

This is an electron-ion scattering case with 1181 channels. Hamiltonian assembly in the coarse region applies 10 Legendre functions, leading to Hamiltonian matrix diagonalisations of order 11810. In the 'fine energy region' up to 30 Legendre functions may be applied, leading to Hamiltonian matrices of up to order 35430. The number of sector calculations is likely to range from about 15 to over 30 depending on the user specifications.
Several thousand scattering energies are used in the calculation.

Methane, CH4

The dataset is an electron-molecule calculation with 1361 channels. Hamiltonian dimensions are therefore estimated between 13610 and ~40000. A process in the code which splits the constituent channels according to spin can be used to approximately halve the Hamiltonian size (whilst doubling the overall number of Hamiltonian matrices). As eigensolvers generally require O(N³) operations, spin splitting leads to a saving in both memory requirements and operation count. The final radius of the external region required is relatively long, leading to more numerous sector calculations (estimated at between 20 and 30). The calculation will require many thousands of scattering energies.

In the current model, parallelism in EXDIG is limited to the number of sector calculations, i.e. a maximum of around 30 accelerator nodes. Methane is a relatively new dataset which has not been calculated on novel technology platforms at very large scale to date, so this is somewhat a step into the unknown. We are also somewhat reliant on collaborative partners that are not associated with PRACE for continuing to develop and fine-tune the accelerator-based EXAS program for this proposed work. Access to suitable hardware with throughput suited to development cycles is also a necessity if suitable progress is to be ensured.

3.8 QCD

Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of neutrons and protons, which comprise quarks bound together by gluons. The theory of how quarks and gluons interact to form nucleons and other elementary particles is called Quantum Chromodynamics (QCD). For most problems of interest, it is not possible to solve QCD analytically, and instead numerical simulations must be performed. Such "Lattice QCD" calculations are very computationally intensive and occupy a significant percentage of all HPC resources worldwide.

3.8.1 Code description

The QCD benchmark benefits from two different implementations, described below.

First implementation

The MILC code is a freely available suite for performing Lattice QCD simulations, developed over many years by a collaboration of researchers. The benchmark used here is derived from the MILC code (v6) and consists of a full conjugate gradient solution using Wilson fermions. The benchmark is consistent with "QCD kernel E" in the full UEABS, and has been adapted so that it can efficiently use accelerators as well as traditional CPU. The implementation for accelerators has been achieved using the "targetDP" programming model, a lightweight abstraction layer designed to allow the same application source code to target multiple architectures, e.g. NVIDIA GPU and multicore/manycore CPU, in a performance-portable manner. The targetDP syntax maps, at compile time, to either NVIDIA CUDA (for execution on GPU) or OpenMP+vectorisation (for implementation on multi/manycore CPU, including Intel Xeon Phi). The base language of the benchmark is C and MPI is used for node-level parallelism.

Second implementation

The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on the QUDA and QPhiX libraries. QUDA is based on CUDA and optimised for running on NVIDIA GPU. The QPhiX library consists of routines which are optimised to use Intel intrinsic functions of multiple vector lengths, including optimised routines for KNC and KNL.
In both QUDA and QPhiX, the benchmark kernel uses the conjugate gradient solvers implemented within the libraries.

3.8.2 Test cases description

Lattice QCD involves the discretisation of space-time into a lattice of points, where the extent of the lattice in each of the 3 spatial and 1 temporal dimensions can be chosen. This means that the benchmark is very flexible: the size of the lattice can be varied with the size of the computing system in use (weak scaling) or can be fixed (strong scaling). For testing on a single node, 64x64x32x8 is a reasonable size, since this fits on a single Intel Xeon Phi or a single GPU. For larger numbers of nodes, the lattice extents can be increased accordingly, keeping the geometric shape roughly similar. Test cases for the second implementation are given by a strong-scaling mode with lattice sizes of 32x32x32x96 and 64x64x64x128, and a weak-scaling mode with a local lattice size of 48x48x48x24.

3.9 Quantum Espresso

QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). QUANTUM ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimisation. It is freely available to researchers around the world under the terms of the GNU General Public License. QUANTUM ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms and applied in the last twenty years by some of the leading materials modelling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures, and a great effort being devoted to user friendliness. QUANTUM ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate in the project by contributing their own codes or by implementing their own ideas into existing codes.

QUANTUM ESPRESSO is written mostly in Fortran 90, parallelised using MPI and OpenMP, and released under a GPL license.

3.9.1 Code description

During 2011 a GPU-enabled version of Quantum ESPRESSO was publicly released. The code is currently developed and maintained by Filippo Spiga at the High Performance Computing Service, University of Cambridge (United Kingdom) and Ivan Girotto at the International Centre for Theoretical Physics (Italy). The initial work was supported by the EC-funded PRACE project and an SFI grant (Science Foundation Ireland, grant 08/HEC/I1450). At the time of writing, the project is self-sustained thanks to the dedication of the people involved and to NVIDIA support in providing hardware and expertise in GPU programming.

The current public version of QE-GPU is 14.10.0, which is the last version maintained as a plug-in working on all QE 5.x versions. QE-GPU utilises phiGEMM (external) for CPU+GPU GEMM computation, MAGMA (external) to accelerate eigensolvers, and explicit CUDA kernels to accelerate compute-intensive routines.
FFT capabilities on GPU are available only for serial computation, due to the hard challenges posed by managing accelerators in the parallel distributed 3D-FFT portion of the code, where communication is the dominant element limiting scalability beyond hundreds of MPI ranks. A version for Intel Xeon Phi (MIC) accelerators is not currently available.

3.9.2 Test cases description

PW-IRMOF_M11

Full SCF calculation of a Zn-based isoreticular metal-organic framework (130 atoms in total) over 1 k-point. Benchmarks run in 2012 demonstrated speedups due to GPU (NVIDIA K20s, with respect to non-accelerated nodes) in the range 1.37-1.87, according to node count (maximum number of accelerators: 8). Runs with current hardware technology and an updated version of the code are expected to exhibit higher speedups (probably 2-3x) and to scale up to a couple of hundred nodes.

PW-SiGe432

This is an SCF calculation of a silicon-germanium crystal with 430 atoms. Being a fairly large system, parallel scalability up to several hundred, perhaps a thousand nodes is expected, with accelerated speedups likely to be of 2-3x.

3.10 Synthetic benchmarks - SHOC

The Accelerator Benchmark Suite also includes a series of synthetic benchmarks. For this purpose, we chose the Scalable HeterOgeneous Computing (SHOC) benchmark suite, augmented with a series of benchmark examples developed internally. SHOC is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing. Its initial focus is on systems containing GPU and multi-core processors, and on the OpenCL programming standard, but CUDA and OpenACC versions were added. Moreover, a subset of the benchmarks is optimised for the Intel Xeon Phi co-processor. SHOC can be used on clusters as well as on individual hosts.

The SHOC benchmark suite currently contains benchmark programs categorised by complexity. Some measure low-level 'feeds and speeds' behaviour (Level 0), some measure the performance of a higher-level operation such as a Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2).

The SHOC benchmark suite has been selected to evaluate the performance of accelerators on synthetic benchmarks, mostly because SHOC provides CUDA/OpenCL/Offload/OpenACC variants of the benchmarks. This allowed us to evaluate NVIDIA GPU (with CUDA/OpenCL/OpenACC), Intel Xeon Phi KNC (with both Offload and OpenCL), but also Intel host CPU (with OpenCL/OpenACC). However, on the latest Xeon Phi processor (codenamed KNL) none of these 4 models is supported. Thus, benchmarks cannot be run on the KNL architecture at this point, and there is no news of Intel supporting OpenCL on the KNL. However, there is work in progress on the PGI compiler to support the KNL as a target. This support will be added during 2017 and will allow us to compile and run the OpenACC benchmarks for the KNL. Alternatively, the OpenACC benchmarks will be ported to OpenMP and executed on the KNL.

3.10.1 Code description

All benchmarks are MPI-enabled. Some report aggregate metrics over all MPI ranks, others only perform work on specific ranks. Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi). For selected benchmarks, OpenACC implementations are provided for GPU. As an indication of the kind of low-level 'feeds and speeds' measurement involved, a schematic (non-SHOC) bandwidth kernel is sketched below.
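The sketch below is a minimal, STREAM-triad-like host kernel written in C with OpenMP, similar in spirit to the Level-0 "Triad" style of measurement, but not taken from SHOC. The array size and repetition count are arbitrary choices made for the illustration.

/* Illustration only: a triad-style bandwidth measurement on the host.
 * NOT SHOC source code; array size and repetition count are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 25)     /* ~33.5M doubles per array (~256 MB per array) */
#define REPS 20

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
        #pragma omp parallel for
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + scalar * c[i];      /* triad: 2 loads + 1 store */
    }
    double t = omp_get_wtime() - t0;

    /* 3 arrays of N doubles move through memory per repetition */
    double gbytes = 3.0 * N * sizeof(double) * REPS / 1e9;
    printf("Triad bandwidth: %.1f GB/s\n", gbytes / t);

    free(a); free(b); free(c);
    return 0;
}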
Multi-node parallelisation is achieved using MPI. SHOC is written in C++ and is open-source and freely available.

3.10.2 Test cases description

The benchmarks contained in SHOC currently feature 4 different problem sizes for increasingly large systems. The size convention is as follows:
1. CPU / debugging
2. Mobile/integrated GPU
3. Discrete GPU (e.g. GeForce or Radeon series)
4. HPC-focused or large-memory GPU (e.g. Tesla or FireStream series)

In order to go to even larger scale, we plan to add a 5th level for massive supercomputers.

3.11 SPECFEM3D

The software package SPECFEM3D simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). All SPECFEM3D_GLOBE software is written in Fortran 90 with full portability in mind, and conforms strictly to the Fortran 95 standard. It uses no obsolete or obsolescent features of Fortran 77. The package uses parallel programming based upon the Message Passing Interface (MPI).

The SEM was originally developed in computational fluid dynamics and has been successfully adapted to address problems in seismic wave propagation. It is a continuous Galerkin technique which can easily be made discontinuous; it is then close to a particular case of the discontinuous Galerkin technique, with optimised efficiency because of its tensorised basis functions. In particular, it can accurately handle very distorted mesh elements. It has very good accuracy and convergence properties. The spectral element approach admits spectral rates of convergence and allows the exploitation of hp-convergence schemes. It is also very well suited to parallel implementation on very large supercomputers as well as on clusters of GPU accelerator cards. Tensor products inside each element can be optimised to reach very high efficiency, and mesh point and element numbering can be optimised to reduce processor cache misses and improve cache reuse. The SEM can also handle triangular (in 2D) or tetrahedral (3D) elements, as well as mixed meshes, although with increased cost and reduced accuracy in these elements, as in the discontinuous Galerkin method.

In many geological models in the context of seismic wave propagation studies (except, for instance, for fault dynamic rupture studies, in which very high frequencies of supershear rupture need to be modelled near the fault), a continuous formulation is sufficient because material property contrasts are not drastic, and thus conforming mesh doubling bricks can efficiently handle mesh size variations. This is particularly true at the scale of the full Earth. Effects due to lateral variations in compressional-wave speed, shear-wave speed, density, a 3D crustal model, ellipticity, topography and bathymetry, the oceans, rotation, and self-gravitation are included. The package can accommodate full 21-parameter anisotropy as well as lateral variations in attenuation. Adjoint capabilities and finite-frequency kernel simulations are also included.

3.11.1 Test cases definition

Both test cases use the same input data. A 3D shear-wave speed model (S362ANI) is used to benchmark the code. The simulation parameters used to size the test cases are:
- NCHUNKS, the number of faces of the cubed sphere included in the simulation (always 6);
- NPROC_XI, the number of slices along one side of a chunk of the cubed sphere (this also determines the number of processors used for one chunk);
- NEX_XI, the number of spectral elements along one side of a chunk;
- RECORD_LENGTH_IN_MINUTES, the length of the simulated seismograms.
The simulation time should vary roughly linearly with this last parameter.

Small test case

It runs with 24 MPI tasks (NCHUNKS x NPROC_XI x NPROC_XI = 6 x 2 x 2 = 24) and has the following mesh characteristics:
- NCHUNKS = 6
- NPROC_XI = 2
- NEX_XI = 80
- RECORD_LENGTH_IN_MINUTES = 2.0

Bigger test case

It runs with 150 MPI tasks (6 x 5 x 5 = 150) and has the following mesh characteristics:
- NCHUNKS = 6
- NPROC_XI = 5
- NEX_XI = 80
- RECORD_LENGTH_IN_MINUTES = 2.0

4 Applications performance

This section presents some sample results on the targeted machines.

4.1 Alya

Alya has been compiled and run using test case A on three different types of compute nodes:
- BSC MinoTauro Westmere partition (Intel E5649, 12 cores, 2.53 GHz, 24 GB RAM, InfiniBand)
- BSC MinoTauro Haswell + K80 partition (Intel Xeon E5-2630 v3, 16 cores, 2.4 GHz, 128 GB RAM, NVIDIA K80, InfiniBand)
- KNL 7250 (68 cores, 1.40 GHz, 16 GB MCDRAM, 96 GB DDR4 RAM, Ethernet)

Alya supports parallelism via different options, mainly MPI for problem decomposition, OpenMP within the matrix construction phase, and CUDA parallelism for selected solvers. In general, the best distribution and performance can be achieved by using MPI. Running on KNL, it has proven optimal to use 16 MPI processes with 4 OpenMP threads each, for a total of 64 threads, each on its own physical core. The Xeon Phi processor shows slightly better performance for Alya when configured in Quadrant/Cache rather than Quadrant/Flat mode, although the difference is negligible. The application is not optimised for the first-generation Xeon Phi (KNC) and does not support offloading.

Overall speedups have been compared to a one-node CPU run on the Haswell partition of MinoTauro. As the application is heavily optimised for traditional computation, the best and almost linear scaling is observed on the CPU-only runs. Some calculations benefit from the accelerators, the GPU yielding from 3.6x to 6.5x speedup for one to three nodes. The KNL runs are limited by OpenMP scalability, and too many MPI tasks on these processors lead to suboptimal scaling. Speedups in this case range from 0.9x to 1.6x and can be improved further by introducing more threading parallelism. The communication overhead when running with many MPI tasks on KNL is noticeable and is further limited by the Ethernet connection on multi-node runs. High-performance fabrics such as Omni-Path or InfiniBand promise to provide significant improvement for these cases. The results are compared in the figures below.

It can be seen that the best performance is obtained on the most recent standard Xeon CPU in conjunction with GPU. This is expected, as Alya has been heavily optimised for traditional HPC scalability using mainly MPI and makes good use of the available cores. The addition of GPU-enabled solvers provides a noticeable boost to the overall performance. To fully exploit the KNL, further optimisations are ongoing and additional OpenMP parallelism will need to be employed.
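To make the two OpenMP assembly strategies mentioned in section 3.1.1 (and behind the matrix-construction results shown in Figure 1 below) more concrete, here is a minimal sketch in C. It is not Alya source code; the element structure and the colouring pre-processing are assumptions made purely for illustration.

/* Illustration only (not Alya source): two OpenMP strategies for a
 * finite-element style assembly loop. Elements scatter contributions to
 * shared nodes, so concurrent updates must either be protected (ATOMIC)
 * or avoided altogether by colouring the elements. */
#include <omp.h>

typedef struct {
    int    nodes[8];      /* hypothetical hexahedron: 8 nodes per element */
    double contrib[8];    /* element contribution to each of its nodes    */
} Element;

/* Strategy 1: protect every scatter with an atomic update. */
void assemble_atomic(const Element *elem, int nelem, double *rhs)
{
    #pragma omp parallel for
    for (int e = 0; e < nelem; ++e)
        for (int k = 0; k < 8; ++k) {
            #pragma omp atomic
            rhs[elem[e].nodes[k]] += elem[e].contrib[k];
        }
}

/* Strategy 2: elements are pre-sorted into colours such that no two
 * elements of the same colour share a node; each colour can then be
 * assembled in parallel without any atomics. colour_start has
 * ncolour+1 entries delimiting the element ranges of each colour. */
void assemble_coloured(const Element *elem, const int *colour_start,
                       int ncolour, double *rhs)
{
    for (int c = 0; c < ncolour; ++c) {
        #pragma omp parallel for
        for (int e = colour_start[c]; e < colour_start[c + 1]; ++e)
            for (int k = 0; k < 8; ++k)
                rhs[elem[e].nodes[k]] += elem[e].contrib[k];
    }
}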
Figure 1 Shows the matrix construction part of Alya that is parallelised with OpenMP and benefits significantly from the many cores available on KNL.
Figure 2 Demonstrates the scalability of the code. As expected, Haswell cores with K80 GPU are high-performing while the KNL port is currently being optimised further.
Figure 3 Best performance is achieved with GPU in combination with powerful CPU cores. Single-thread performance has a big impact on the speedup; both threading and vectorisation are employed for additional performance.
4.2 Code_Saturne

Description of the runtime architectures:
- KNL: ARCHER (model 7210). The following environment is used: ENV_6.0.3. The Intel compiler version is 17.0.0.098.
- GPU: 2 POWER8 nodes, i.e. S822LC (2x P8 10-core + 2x K80 (2 GK210 per K80)) and S824L (2x P8 12-core + 2x K40 (1 GK180 per K40)). The compiler is at/8.0, the MPI distribution is openmpi/1.8.8 and the CUDA compiler version is 7.5.

3-D Taylor-Green vortex flow (hexahedral cells)

The first test case has been run on ARCHER KNL and the performance has been investigated for several configurations, each of them using 64 MPI tasks per node, to which either 1, 2 or 4 hyper-threads (extra MPI tasks) or OpenMP threads are added for testing. The results are compared to ARCHER CPU nodes, in this case Ivy Bridge CPU. Up to 8 nodes are used for comparison.
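For reference, the classical Taylor-Green initialisation and the two quantities monitored in this test case (section 3.2.2) can be written as follows; the exact constants and normalisation used in the benchmark set-up may differ.

\begin{aligned}
u(x,y,z,0) &=  U_0 \sin(x/L)\,\cos(y/L)\,\cos(z/L),\\
v(x,y,z,0) &= -U_0 \cos(x/L)\,\sin(y/L)\,\cos(z/L),\\
w(x,y,z,0) &= 0,
\end{aligned}

with the volume-averaged kinetic energy and enstrophy monitored over time,

E(t) = \frac{1}{2|\Omega|}\int_\Omega \lVert\mathbf{u}\rVert^2 \,\mathrm{d}V,
\qquad
\zeta(t) = \frac{1}{2|\Omega|}\int_\Omega \lVert\nabla\times\mathbf{u}\rVert^2 \,\mathrm{d}V.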
Figure 4 Code_Saturne's performance on KNL. AMG is used as a solver in V4.2.2.
Figure 4 shows the CPU time per time step as a function of the number of threads/MPI tasks. For all the cases, the time to solution decreases when the number of threads increases. For the case using MPI only and no hyper-threading (green line), a simulation is also run on half a node to investigate the speedup going from half a node to a full node, which is about 2, as seen on the figure. The ellipses help compare the time to solution per node and, finally, a comparison is carried out with simulations run on ARCHER without KNL, using Ivy Bridge processors. When using 8 nodes, the best configuration for Code_Saturne on KNL is 64 MPI tasks and 2 OpenMP threads per task (blue line on the figure), which is about 15 to 20% faster than running on the Ivy Bridge nodes, using the same number of nodes.

Flow in a 3-D lid-driven cavity (tetrahedral cells)

The following options are used for PETSc:
- CPU: -ksp_type = cg and -pc_type = jacobi
- GPU: -ksp_type = cg, -vec_type = cusp, -mat_type = aijcusp and -pc_type = jacobi

Table 3 Performance of Code_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK. The speedup is computed as the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks.

Table 4 Performance of Code_Saturne + PETSc on 1 node of KNL. PETSc is built on the MKL library.

Tables 3 and 4 show the results obtained using POWER8 CPU and CPU/GPU, and KNL, respectively. Focusing on the results on the POWER8 nodes first, a speedup is observed on each POWER8 node when using the same number of MPI tasks and of GPU. However, when the nodes are fully populated (20 and 24 MPI tasks, respectively), it is cheaper to run on the CPU only than on CPU/GPU. This could be explained by the fact that the same overall amount of data is transferred, but the system administration costs, latency costs and asynchronicity of transfer in 20 (S822LC) or 24 (S824L) slices might be prohibitive.

4.3 CP2K

Times shown in the ARCHER KNL (model 7210, 1.30 GHz, 96 GB DDR memory) vs Ivy Bridge (E5-2697 v2, 2.7 GHz, 64 GB) plot are for those CP2K threading configurations that give the best performance in each case. The shorthand for naming threading configurations is:
- MPI: pure MPI
- X_TH: X OpenMP threads per MPI rank

Whilst single-threaded pure MPI or 2 OpenMP threads is often fastest on conventional processors, on the KNL multi-threading is more likely to be beneficial, especially in problems such as the LiH-HFX benchmark in which having fewer MPI ranks means more memory is available to each rank, allowing partial results to be stored in memory instead of being expensively recomputed on the fly. Hyper-threads were left disabled (equivalent to the aprun option -j 1), as no significant performance benefit was observed using hyper-threading.
Figure 5 Test case 1 of CP2K on the ARCHER cluster.
The node-based comparison in Figure 5 shows that the runtimes on KNL nodes are roughly 1.7 times slower than the runtimes on 2-socket Ivy Bridge nodes.

4.4 GPAW

The performance of GPAW on both benchmarks was measured with a range of parallel job sizes on several architectures, designated in the following tables, figures and text as:
- CPU: x86 Haswell CPU (Intel Xeon E5-2690v3) in a dual-socket node
- KNC: Knights Corner MIC (Intel Xeon Phi 7120P) with an x86 Haswell host CPU (Intel Xeon E5-2680v3) in a dual-socket node
- KNL: Knights Landing MIC (Intel Xeon Phi 7210) in a single-socket node
- K40: K40 GPU (NVIDIA Tesla K40) with an x86 Ivy Bridge host CPU (Intel Xeon E5-2620-v2) in a dual-socket node
- K80: K80 GPU (NVIDIA Tesla K80) with an x86 Haswell host CPU (Intel Xeon E5-2680v3) in a quad-socket node

Only the time spent in the main SCF-cycle was used as the runtime in the comparison (Tables 5 and 6) to exclude any differences in the initialisation overheads.

Table 5 GPAW runtimes (in seconds) for the smaller benchmark (Carbon Nanotube) measured on several architectures when using n sockets (i.e. processors or accelerators).

Table 6 GPAW runtimes (in seconds) for the larger benchmark (Copper Filament) measured on several architectures when using n sockets (i.e. processors or accelerators). *Due to memory limitations on the GPU, the grid spacing was increased from 0.22 to 0.28 to have a sparser grid. To account for this in the comparison, the K40 and K80 runtimes have been scaled up using a corresponding CPU runtime as a yardstick (scaling factor q=2.1132).

As can be seen from Tables 5 and 6, in both benchmarks a single KNL or K40/K80 was faster than a single CPU. But when using multiple KNL, the performance does not seem to scale as well as for CPU. In the smaller benchmark (Carbon Nanotube), CPU outperform KNL when using more than 2 processors. In the larger benchmark (Copper Filament), KNL still outperform CPU with 8 processors, but it seems likely that the CPU will overtake KNL when using an even larger number of processors.

In contrast to KNL, the older KNC are slower than Haswell CPU across the board. Nevertheless, as can be seen from Figure 6, the scaling of KNC is to some extent comparable to CPU, but with a lower scaling limit. It is therefore likely that, on systems with considerably slower host CPU than Haswell (e.g. Ivy Bridge), KNC may also give a performance boost over the host CPU.
Figure 6 Relative performance (t0 / t) of GPAW is shown for parallel jobs using an increasing number of CPU (blue) or Xeon Phi KNC (red). The single-CPU SCF-cycle runtime (t0) was used as the baseline for the normalisation. Ideal scaling is shown as a linear dashed line for comparison. Case 1 (Carbon Nanotube) is shown with square markers and Case 2 (Copper Filament) is shown with round markers.
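Spelled out, the normalisation used in Figure 6 and the grid-spacing correction quoted in the caption of Table 6 are simply:

\text{relative performance}(n) = \frac{t_0}{t(n)} \quad (\text{ideal value: } n),
\qquad
t_{\mathrm{K40/K80,\,scaled}} = q \, t_{\mathrm{K40/K80,\,measured}}, \quad q = 2.1132 .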
4.5 GROMACS

GROMACS was successfully compiled and run on the following systems:
- GRNET ARIS: thin nodes (E5-2680v2), GPU nodes (dual E5-2660v3 + dual K40m), all with FDR-14 InfiniBand; single-node KNL 7210
- CINES Frioul: KNL 7230
- IDRIS Ouessant: IBM POWER8 + dual P100

On KNL machines the runs were performed using the Quadrant processor mode and both the Cache and Flat memory configurations. On GRNET's single-node KNL more configurations were tested. As expected, the Quadrant/Cache mode gives the best performance in all cases. The dependence of performance on the MPI tasks/OpenMP threads combination was also explored. In most cases, 66 tasks per node using 2 or 4 threads per task gives the best performance on KNL 7230.

In all accelerated runs, a speedup of 2-2.6x with respect to CPU-only runs was achieved with GPU. GROMACS does not support offloading to KNC.
Figure 7 Scalability for GROMACS test case GluCL Ion Channel.
Figure 8 Scalability for GROMACS test case Lignocellulose.
4.6 NAMD

NAMD was successfully compiled and run on the following systems:
- GRNET ARIS: thin nodes (E5-2680v2), GPU nodes (dual E5-2660v3 + dual K40m), KNC nodes (dual E5-2660v2 + dual KNC 7120P), all with FDR-14 InfiniBand; single-node KNL 7210
- CINES Frioul: KNL 7230
- IDRIS Ouessant: IBM POWER8 + dual P100

On KNL machines the runs were performed using the Quadrant processor mode and both the Cache and Flat memory configurations. On GRNET's single-node KNL more configurations were tested. As expected, the Quadrant/Cache mode gives the best performance in all cases. The dependence of performance on the MPI tasks/OpenMP threads combination was also explored. In most cases, 66 tasks per node with 4 threads per task, or 4 tasks per node with 64 threads per task, gives the best performance on KNL 7230.

In all accelerated runs, a speedup of 5-6x with respect to CPU-only runs was achieved with GPU. On KNC, the speedup with respect to CPU-only runs is in the range 2-3.5x in all cases.
Figure 9 Scalability for NAMD test case STMV.8M.
Figure 10 Scalability for NAMD test case STMV.28M.
4.7 PFARM

The code has been tested and timed on several architectures, designated in the following figures, tables and text as:
- CPU: node contains two 2.7 GHz, 12-core E5-2697 v2 (Ivy Bridge) processors with 64 GB memory
- KNL: node is a 64-core KNL processor (model 7210) running at 1.30 GHz with 96 GB of memory
- GPU: node contains a dual-socket 16-core Haswell E5-2698 running at 2.3 GHz with 256 GB memory and 4 K40, 4 K80 or 4 P100 GPU

Codes on all architectures are compiled with the Intel compiler (CPU v15, KNL & GPU v17).

The divide-and-conquer eigensolver routine DSYEVD is used throughout the test runs (a minimal illustration of this call is sketched below). The routine is linked from the following numerical libraries:
- CPU: Intel MKL Version 11.2.2
- KNL: Intel MKL Version 2017 Initial Release
- GPU: MAGMA Version 2.2
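The sketch below shows, via the LAPACKE C interface, the kind of dense symmetric eigensolve (DSYEVD) that dominates EXDIG. It is an illustration only: the benchmark itself calls the threaded MKL or MAGMA implementations listed above, on matrices of order ~10^4 (section 3.7.2); the tiny 4x4 matrix here is just a placeholder.

/* Illustration only: the dense symmetric eigensolve at the heart of EXDIG,
 * expressed through the LAPACKE C interface to DSYEVD. The real benchmark
 * links MKL or MAGMA and operates on much larger sector Hamiltonians. */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    /* Symmetric "Hamiltonian" stored row-major (upper triangle is used). */
    double a[4 * 4] = {
        4.0, 1.0, 0.0, 0.0,
        1.0, 3.0, 1.0, 0.0,
        0.0, 1.0, 2.0, 1.0,
        0.0, 0.0, 1.0, 1.0
    };
    double w[4];                      /* eigenvalues, returned in ascending order */
    lapack_int n = 4, lda = 4;

    /* 'V': compute eigenvectors too (returned in a); 'U': use upper triangle. */
    lapack_int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', n, a, lda, w);
    if (info != 0) {
        fprintf(stderr, "DSYEVD failed, info = %d\n", (int)info);
        return 1;
    }
    for (int i = 0; i < n; ++i)
        printf("lambda[%d] = %f\n", i, w[i]);
    return 0;
}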
Figure 11 Eigensolver performance on KNL and GPU.
EXDIG calculations are dominated by the eigensolver operations required to diagonalise each sector Hamiltonian matrix. Figure 11 summarises eigensolver performance, using DSYEVD, over a range of problem sizes for the Xeon (CPU), Intel Knights Landing (KNL) and a range of recent NVIDIA GPU architectures. The results are normalised to the single-node CPU performance using 24 OpenMP threads. The CPU runs use 24 OpenMP threads and the KNL runs use 64 OpenMP threads. Dense linear algebra calculations tend to be bound by memory bandwidth, so using hyper-threading on the KNL or CPU is not beneficial. MAGMA is able to parallelise the calculation automatically across multiple GPU on a compute node; these results are denoted by the x2 and x4 labels. Figure 11 demonstrates that MAGMA performance relative to CPU performance increases as the problem size increases, due to the relative overhead of data transfer, O(N²), reducing compared to the computational load, O(N³).

Test Case 1 - FeIII

Defining computational characteristics: 10 Fine Region sector calculations involving Hamiltonian matrices of dimension 23620 and 10 Coarse Region sector calculations involving Hamiltonian matrices of dimension 11810.

Test Case 2 - CH4

Defining computational characteristics: 10 'Spin 1' Coarse sector calculations involving Hamiltonian matrices of dimension 5720 and 10 'Spin 2' Coarse sector calculations involving Hamiltonian matrices of dimension 7890.

                                | CPU (24 threads) | KNL (64 threads) | K80  | K80x2 | K80x4 | P100 | P100x2 | P100x4
Test Case 1 ; Atomic ; FeIII    | 4475             | 2610             | 1215 | 828   | 631   | 544  | 427    | 377
Test Case 2 ; Molecular ; CH4   | 466              | 346              | 180  | 150   | 134   | 119  | 107    | 111

Table 7 Overall EXDIG runtime performance on various accelerators (runtime, secs)

Table 7 records the overall run time on a range of architectures for both test cases described. For the complete runs (including I/O), both KNL-based and GPU-based computations significantly outperform the CPU-based calculations. For Test Case 1, utilising a node with a single P100 GPU accelerator results in a runtime more than 8 times quicker than the CPU, and correspondingly approximately 4 times quicker for Test Case 2. The smaller Hamiltonian matrices associated with Test Case 2 mean that data transfer costs, O(N²), are relatively high compared to computation costs, O(N³). Smaller matrices also result in poorer scaling as the number of GPU per node is increased for Test Case 2.

Table 8 Overall EXDIG runtime parallel performance using the MPI-GPU version

A relatively simple MPI harness can be used in EXDIG to farm out different sector Hamiltonian calculations to multiple CPU, KNL or GPU nodes. Table 8 shows that parallel scaling across nodes is very good for each test platform. This strategy is inherently scalable; however, the replicated-data approach requires significant amounts of memory per node. Test Case 1 is used as the dataset here, although the problem characteristics are slightly different from the setup used for Table 7, with 5 Fine Region sectors with Hamiltonian dimension 23620 and 20 Coarse Region sectors with Hamiltonian dimension 11810. With these characteristics, runs using 2 MPI tasks experience inferior load-balancing in the Fine Region calculation compared to runs using 5 MPI tasks.

4.8 QCD

As stated in the description, the QCD benchmark has two implementations.

4.8.1 First implementation
Figure 12 Small test case results for QCD, first implementation.
Figure 13 Large test case results for QCD, first implementation.
Figures 12 and 13 show the strong scaling on Titan and ARCHER for the small and large problem sizes, respectively. For ARCHER, both CPU per node are used. For Titan, we include results with and without GPU utilisation.

On each node, Titan has one 16-core Interlagos CPU and one K20X GPU, whereas ARCHER has two 12-core Ivy Bridge CPU. In this section, we evaluate on a node-by-node basis. For Titan, a single MPI task per node, operating on the CPU, is used to drive the GPU on that node. We also include, for Titan, results using just the CPU on each node without any involvement from the GPU, for comparison. This means that, on a single node, our Titan results will be the same as the K20X and Interlagos results presented in the previous section (for the same test case). On ARCHER, however, we fully utilise both processors per node: to do this we use two MPI tasks per node, each with 12 OpenMP threads (via targetDP). So the single-node results for ARCHER are twice as fast as the Ivy Bridge single-processor results presented in the previous section.
Figure 14 shows the time taken by the full MILC 64x64x64x8 test cases on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures.
In Figure 14 we present preliminary results on the latest generation Intel Knights Landing (KNL) and NVIDIA Pascal architectures, which offer very high bandwidth stacked memory, together with the same traditional Intel Ivy Bridge CPU used in the previous sections. Note that these results are not directly comparable with those presented earlier, since they are for a different test case size (larger, since we are no longer limited by the small memory size of the Knights Corner), and they are for a slightly updated version of the benchmark. The KNL is the 64-core 7210 model, available from within a test and development platform provided as part of the ARCHER service. The Pascal is an NVIDIA P100 GPU provided as part of the "Ouessant" IBM service at IDRIS, where the host CPU is an IBM POWER8+. It can be seen that the KNL is 7.5x faster than the Ivy Bridge, the Pascal is 13x faster than the Ivy Bridge, and the Pascal is 1.7x faster than the KNL.

4.8.2 Second implementation

GPU results

The GPU benchmark results of the second implementation were obtained on Piz Daint, located at CSCS in Switzerland, and on the GPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs are performed using the provided bash scripts. Piz Daint is equipped with one P100 Pascal GPU per node. Two different test cases are depicted, the "strong-scaling" mode with a random lattice configuration of size 32x32x32x96 and of size 64x64x64x128. The GPU nodes of Cartesius have two Kepler K40m GPU per node and the "strong-scaling" test is shown for one card per node and for two cards per node. The benchmark kernel uses the conjugate gradient solver, which solves a linear system D * x = b for the unknown solution x, based on the clover-improved Wilson Dirac operator D and a known right-hand side b.
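To make explicit what the benchmarked "conjugate gradient solver" kernel computes, here is a textbook CG iteration in C for a generic symmetric positive-definite operator apply. It is an illustration only: the QUDA and QPhiX implementations realise the same algorithm for the clover-improved Wilson Dirac operator, with mixed precision and architecture-specific optimisations; a simple 1-D Laplacian stands in for D here.

/* Illustration only: a textbook conjugate gradient loop for D * x = b.
 * A 1-D Laplacian stands in for the Wilson-clover Dirac operator. */
#include <math.h>
#include <stdio.h>
#include <string.h>

#define N 64

/* Placeholder operator: y = D x (1-D Laplacian with Dirichlet ends, SPD). */
static void apply_D(const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < N - 1) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

static double dot(const double *u, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += u[i] * v[i];
    return s;
}

int main(void)
{
    double b[N], x[N] = {0}, r[N], p[N], Dp[N];
    for (int i = 0; i < N; ++i) b[i] = 1.0;

    memcpy(r, b, sizeof r);            /* r = b - D*x, with x = 0 initially */
    memcpy(p, r, sizeof p);
    double rr = dot(r, r);

    for (int it = 0; it < 1000 && sqrt(rr) > 1e-10; ++it) {
        apply_D(p, Dp);
        double alpha = rr / dot(p, Dp);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Dp[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    printf("final residual norm: %e\n", sqrt(rr));
    return 0;
}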
Figure 15 Result of second implementation of QCD on K40m GPU.
Figure 15 shows the strong scaling of the conjugate gradient solver on K40m GPU on Cartesius. The lattice size is 32x32x32x96, which corresponds to a moderate lattice size nowadays. The test is performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue). The runs are done with one GPU per node (filled symbols) and two GPU per node (non-filled symbols).
Figure 16 Result of second implementation of QCD on P100 GPU.
Figure 16 shows the strong scaling of the conjugate gradient solver on P100 GPU on Piz Daint. The lattice size is 32x32x32x96, as for the strong scaling run on the K40m on Cartesius. The test is performed with mixed-precision CG in double-double mode (red) and half-double mode (blue).
Figure 17 Result of second implementation of QCD on P100 GPU on the larger test case.
Figure 17 shows the strong scaling of the conjugate gradient solver on P100 GPU on Piz Daint. The lattice size is increased to 64x64x64x128, which is a large lattice nowadays. With this larger lattice, the scaling test shows that the conjugate gradient solver exhibits very good strong scaling up to 64 GPU.

Xeon Phi results

The benchmark results for the Xeon Phi benchmark suite were obtained on Frioul at CINES and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNC per node. The data on Frioul are generated using the bash scripts provided with the second implementation of QCD, for the two "strong-scaling" test cases with lattice sizes of 32x32x32x96 and 64x64x64x128. For the data generated on MareNostrum, results for the "strong-scaling" mode on a 32x32x32x96 lattice are shown. The benchmark kernel uses a random gauge configuration and the conjugate gradient solver to solve a linear system involving the clover Wilson Dirac operator.
Figure 18 Result of second implementation of QCD on KNC<anchor role="start" xml:id="_Ref478368691"><?latex \label{ref-0138}?></anchor><anchor role="start" xml:id="_Toc478379032"><?latex \label{ref-0139}?></anchor>
Figure 18 shows the strong scaling of the conjugate gradient solver on the KNCs of the hybrid partition of MareNostrum III. The lattice size is 32x32x32x96, which corresponds to a moderate lattice size nowadays. The test is performed with a conjugate gradient solver in single precision, using native mode and 60 OpenMP threads per MPI process. The runs use one KNC per node (filled symbols) and two KNCs per node (open symbols).
Figure 19 Result of second implementation of QCD on KNL<anchor role="start" xml:id="_Ref478368762"><?latex \label{ref-0140}?></anchor><anchor role="start" xml:id="_Toc478379033"><?latex \label{ref-0141}?></anchor>
Figure 19 shows strong-scaling results of the conjugate gradient solver on the KNLs of Frioul. The lattice size is 32x32x32x96, the same as for the strong-scaling run on the KNCs of MareNostrum III. The run is performed in quadrant/cache mode with 68 OpenMP threads per KNL. The test is performed with a conjugate gradient solver in single precision.

4.9 Quantum Espresso

Here are sample results for Quantum Espresso. This code has been run on Cartesius (see section ) and on Marconi (one node is one standalone KNL Xeon Phi 7250: 68 cores at 1.40 GHz, 16 GB MCDRAM, 96 GB DDR4 RAM; the interconnect is Intel OmniPath).

Runs on GPU
Figure 20 Scalability of Quantum Espresso on GPU for test case 1<anchor role="start" xml:id="_Ref477769024"><?latex \label{ref-0143}?></anchor><anchor role="start" xml:id="_Toc478379034"><?latex \label{ref-0144}?></anchor>
Figure 21 Scalability of Quantum Espresso on GPU for test case 2<anchor role="start" xml:id="_Ref477769025"><?latex \label{ref-0145}?></anchor><anchor role="start" xml:id="_Toc478379035"><?latex \label{ref-0146}?></anchor>
Both test cases (Figure 20 and Figure 21) show no appreciable speed-up with GPU. The inputs are probably too small; they should evolve in future versions of this benchmark suite.

Runs on KNL
Figure 22 Scalability of Quantum Espresso on KNL for test case 1<anchor role="start" xml:id="_Ref477769092"><?latex \label{ref-0147}?></anchor><anchor role="start" xml:id="_Toc478379036"><?latex \label{ref-0148}?></anchor>
Figure 22 shows the usual pw.x executable with the small test case A (AUSURF), comparing Marconi Broadwell nodes (36 cores/node) with KNL nodes (68 cores/node); this test case is probably too small for testing on KNL.
Figure 23 Quantum Espresso - KNL vs BDW vs BGQ (at scale)<anchor role="start" xml:id="_Ref477998355"><?latex \label{ref-0149}?></anchor><anchor role="start" xml:id="_Toc478379037"><?latex \label{ref-0150}?></anchor>
Figure 23 presents CNT10POR8, which is the large test case, even though it uses the cp.x executable (i.e. Car-Parrinello) rather than the usual pw.x (PW SCF calculation).

4.10 Synthetic benchmarks (SHOC)

The SHOC benchmarks have been run on Cartesius, Ouessant and MareNostrum. Table 9 presents the results for the NVIDIA GPUs (K40, P100), the Intel Xeon Phi (KNC) and a Haswell host CPU:

| Benchmark | K40 CUDA | K40 OpenCL | Power 8 + P100 CUDA | KNC Offload | KNC OpenCL | Haswell OpenCL |
|---|---|---|---|---|---|---|
| BusSpeedDownload (GB/s) | 10.5 | 10.56 | 32.23 | 6.6 | 6.8 | 12.4 |
| BusSpeedReadback (GB/s) | 10.5 | 10.56 | 34.00 | 6.7 | 6.8 | 12.5 |
| maxspflops (GFLOPS) | 3716 | 3658 | 10424 | 2158* | 12314* | 1647 |
| maxdpflops (GFLOPS) | 1412 | 1411 | 5315 | 1601* | 72318* | 884 |
| gmem_readbw (GB/s) | 177 | 179 | 575.16 | 170 | 49.7 | 20.2 |
| gmem_readbw_strided (GB/s) | 18 | 20 | 99.15 | N/A | 35 | 156* |
| gmem_writebw (GB/s) | 175 | 188 | 436 | 72 | 41 | 13.6 |
| gmem_writebw_strided (GB/s) | 7 | 7 | 26.3 | N/A | 25 | 163* |
| lmem_readbw (GB/s) | 1168 | 1156 | 4239 | N/A | 442 | 238 |
| lmem_writebw (GB/s) | 1194 | 1162 | 5488 | N/A | 477 | 295 |
| BFS (Edges/s) | 49,236,500 | 42,088,000 | 91,935,100 | N/A | 1,635,330 | 14,225,600 |
| FFT_sp (GFLOPS) | 523 | 377 | 1472 | 135 | 71 | 80 |
| FFT_dp (GFLOPS) | 262 | 61 | 733 | 69.5 | 31 | 55 |
| SGEMM (GFLOPS) | 2900-2990 | 694/761 | 8604-8720 | 640/645 | 179/217 | 419-554 |
| DGEMM (GFLOPS) | 1025-1083 | 411/433 | 3635-3785 | 179/190 | 76/100 | 189-196 |
| MD, SP (GFLOPS) | 185 | 91 | 483 | 28 | 33 | 114 |
| MD5Hash (GH/s) | 3.38 | 3.36 | 15.77 | N/A | 1.7 | 1.29 |
| Reduction (GB/s) | 137 | 150 | 271 | 99 | 10 | 91 |
| Scan (GB/s) | 47 | 39 | 99.2 | 11 | 4.5 | 15 |
| Sort (GB/s) | 3.08 | 0.54 | 12.54 | N/A | 0.11 | 0.35 |
| Spmv (GFLOPS) | 4-23 | 3-17 | 23-65 | 1-17944* | N/A | 1-10 |
| Stencil2D (GFLOPS) | 123 | 135 | 465 | 89 | 8.95 | 34 |
| Stencil2D_dp (GFLOPS) | 57 | 67 | 258 | 16 | 7.92 | 30 |
| Triad (GB/s) | 13.5 | 9.9 | 43 | 5.76 | 5.57 | 8 |
| S3D level 2 (GFLOPS) | 94 | 91 | 294 | 109 | 18 | 27 |

Table 9 Synthetic benchmark results on GPU and Xeon Phi

Measures marked with an asterisk (*) are not relevant and should not be considered:

- KNC MaxFlops (both SP and DP): the compiler optimizes away some of the computation (although it should not).
- KNC SpMV: a known bug in these benchmarks, currently being addressed.
- Haswell gmem_readbw_strided and gmem_writebw_strided: strided read/write benchmarks do not make much sense for a CPU, since the data will be cached in the large L3 cache; this is why high numbers appear only in the Haswell case.

4.11 SPECFEM3D

Tests have been carried out on Ouessant and Frioul. So far it has only been possible to run on one fixed core count for each test case, so scaling curves are not available. Test case A ran on 4 KNL and 4 P100; test case B ran on 10 KNL and 4 P100.

| | KNL | P100 |
|---|---|---|
| Test case A | 66 | 105 |
| Test case B | 21.4 | 68 |

Table 10 SPECFEM3D_GLOBE results (run time in seconds)

5 Conclusion and future work

The work presented here provides a first look at application benchmarking on accelerators. Most codes have been selected from the main Unified European Applications Benchmark Suite (UEABS). This paper describes each of them, as well as their implementations, their relevance to the European science community and their test cases.
We have presented results on leading edge systems. The suite will be publicly available on the PRACE web site, where links to download the sources and test cases will be published along with compilation and run instructions. Task 7.2B in PRACE-4IP started to design a benchmark suite for accelerators. This work has been done with the aim of integrating it into the main UEABS, so that both can be maintained and evolve together. As the PCP (PRACE-3IP) machines will soon be available, it will be very interesting to run the benchmark suite on them: first because these machines will be larger, but also because they will feature energy consumption probes.
\ No newline at end of file diff --git a/doc/iwoph17/aliascnt.sty b/doc/iwoph17/aliascnt.sty deleted file mode 100644 index 452aa0e5de83e18e2a7a412986d40391d1a3ea54..0000000000000000000000000000000000000000 --- a/doc/iwoph17/aliascnt.sty +++ /dev/null @@ -1,88 +0,0 @@ -%% -%% This is file `aliascnt.sty', -%% generated with the docstrip utility. -%% -%% The original source files were: -%% -%% aliascnt.dtx (with options: `package') -%% -%% This is a generated file. -%% -%% Project: aliascnt -%% Version: 2009/09/08 v1.3 -%% -%% Copyright (C) 2006, 2009 by -%% Heiko Oberdiek -%% -%% This work may be distributed and/or modified under the -%% conditions of the LaTeX Project Public License, either -%% version 1.3c of this license or (at your option) any later -%% version. This version of this license is in -%% http://www.latex-project.org/lppl/lppl-1-3c.txt -%% and the latest version of this license is in -%% http://www.latex-project.org/lppl.txt -%% and version 1.3 or later is part of all distributions of -%% LaTeX version 2005/12/01 or later. -%% -%% This work has the LPPL maintenance status "maintained". -%% -%% This Current Maintainer of this work is Heiko Oberdiek. -%% -%% This work consists of the main source file aliascnt.dtx -%% and the derived files -%% aliascnt.sty, aliascnt.pdf, aliascnt.ins, aliascnt.drv. -%% -\NeedsTeXFormat{LaTeX2e} -\ProvidesPackage{aliascnt}% - [2009/09/08 v1.3 Alias counter (HO)]% -\newcommand*{\newaliascnt}[2]{% - \begingroup - \def\AC@glet##1{% - \global\expandafter\let\csname##1#1\expandafter\endcsname - \csname##1#2\endcsname - }% - \@ifundefined{c@#2}{% - \@nocounterr{#2}% - }{% - \expandafter\@ifdefinable\csname c@#1\endcsname{% - \AC@glet{c@}% - \AC@glet{the}% - \AC@glet{theH}% - \AC@glet{p@}% - \expandafter\gdef\csname AC@cnt@#1\endcsname{#2}% - \expandafter\gdef\csname cl@#1\expandafter\endcsname - \expandafter{\csname cl@#2\endcsname}% - }% - }% - \endgroup -} -\newcommand*{\aliascntresetthe}[1]{% - \@ifundefined{AC@cnt@#1}{% - \PackageError{aliascnt}{% - `#1' is not an alias counter% - }\@ehc - }{% - \expandafter\let\csname the#1\expandafter\endcsname - \csname the\csname AC@cnt@#1\endcsname\endcsname - }% -} -\newcommand*{\AC@findrootcnt}[1]{% - \@ifundefined{AC@cnt@#1}{% - #1% - }{% - \expandafter\AC@findrootcnt\csname AC@cnt@#1\endcsname - }% -} -\def\AC@patch#1{% - \expandafter\let\csname AC@org@#1reset\expandafter\endcsname - \csname @#1reset\endcsname - \expandafter\def\csname @#1reset\endcsname##1##2{% - \csname AC@org@#1reset\endcsname{##1}{\AC@findrootcnt{##2}}% - }% -} -\RequirePackage{remreset} -\AC@patch{addto} -\AC@patch{removefrom} -\endinput -%% -%% End of file `aliascnt.sty'. diff --git a/doc/iwoph17/app.tex b/doc/iwoph17/app.tex deleted file mode 100644 index 7840f53ca9a72f8788a61c00b9b5abda107869a2..0000000000000000000000000000000000000000 --- a/doc/iwoph17/app.tex +++ /dev/null @@ -1,332 +0,0 @@ -\section{Benchmark Suite Description\label{sec:codes}} - -This part will cover each code, presenting the interest for the scientific community as well as the test cases defined for the benchmarks. - -As an extension to the EUABS, most codes presented in this suite are included in the latter. Exceptions are PFARM which comes from PRACE-2IP \cite{ref-0023} and SHOC \cite{ref-0026} a synthetic benchmark suite. 
- -\begin{table}[] -\centering -\caption{Codes and corresponding APIs available (in green)} -\label{table:avail-api} -\begin{tabular}{l|l|l|l|} -\cline{2-4} - & OpenMP & OpenCL & CUDA \\ \hline -\multicolumn{1}{|l|}{Alya} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{Code\_Saturne} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{CP2K} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{GPAW} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{GROMACS} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{NAMD} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{PFARM} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{QCD} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{QUANTUM ESPRESSO} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{SHOC} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} \\ \hline -\multicolumn{1}{|l|}{SPECFEM3D} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} \\ \hline -\end{tabular} -\end{table} - -Table \ref{table:avail-api} lists the codes that will be presented in the next sections as well as their implementations available. It should be noted that OpenMP can be used with the Intel Xeon Phi architecture while CUDA is used for NVidia GPU cards. OpenCL has been considered as a third alternative that can be used on both architectures. It has been available on the first generation of Xeon Phi (KNC) but has not been ported to the second one (KNL). SHOC is the only code that is impacted, this problem is addressed in Sect. \ref{sec:results:shoc}. - -\subsection{Alya} - -Alya is a high performance computational mechanics code that can solve different coupled mechanics problems: incompressible/compressible flows, solid mechanics, chemistry, excitable media, heat transfer and Lagrangian particle transport. It is one single code. There are no particular parallel or individual platform versions. Modules, services and kernels can be compiled individually and used a la carte. The main discretisation technique employed in Alya is based on the variational multiscale finite element method to assemble the governing equations into Algebraic systems. These systems can be solved using solvers like GMRES, Deflated Conjugate Gradient, pipelined CG together with preconditioners like SSOR, Restricted Additive Schwarz, etc. The coupling between physics solved in different computational domains (like fluid-structure interactions) is carried out in a multi-code way, using different instances of the same executable. Asynchronous coupling can be achieved in the same way in order to transport Lagrangian particles. - -\subsubsection{Code Description.} - -The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available, without and with a colouring strategy to avoid ATOMICs during the assembly step. A CUDA version is also available for the different solvers. Alya has been also compiled for MIC (Intel Xeon Phi). 
- -Alya is written in Fortran 1995 and the incompressible fluid module, present in the benchmark suite, is freely available. This module solves the Navier-Stokes equations using an Orthomin(1) \cite{ref-0029} method for the pressure Schur complement. This method is an algebraic split strategy which converges to the monolithic solution. At each linearisation step, the momentum is solved twice and the continuity equation is solved once or twice depending whether the momentum preserving or the continuity preserving algorithm is selected. - -\subsubsection{Test Cases Description.} - -\paragraph{Cavity-hexaedra elements (10M Elements).} - -This test is the classical lid-driven cavity. The problem geometry is a cube of dimensions $1\times1\times1$. The fluid properties are $\mbox{density}=1.0$ and $\mbox{viscosity}=0.01$. Dirichlet boundary conditions are applied on all sides, with three no-slip walls and one moving wall with velocity equal to $1.0$, which corresponds to a Reynolds number of $100$. The Reynolds number is low so the regime is laminar and turbulence modelling is not necessary. The domain is discretised into $9800344$ hexaedra elements. The solvers are the GMRES method for the momentum equations and the Deflated Conjugate Gradient to solve the continuity equation. This test case can be run using pure MPI parallelisation or the hybrid MPI/OpenMP strategy. - -\paragraph{Cavity-hexaedra Elements (30M Elements).} - -This is the same cavity test as before but with 30M of elements. Note that a mesh multiplication strategy enables one to multiply the number of elements by powers of 8, by simply activating the corresponding option in the ker.dat file. - -\paragraph{Cavity-hexaedra Elements-GPU Version (10M Elements).} - -This is the same test as Test case 1, but using the pure MPI parallelisation strategy with acceleration of the algebraic solvers using GPU. - -\subsection{Code\_Saturne\label{ref-0058}} - -Code\_Saturne is a CFD software package developed by EDF R\&D since 1997 and open-source since 2007. The Navier-Stokes equations are discretised following a finite volume method approach. The code can handle any type of mesh built with any type of cell/grid structure. Incompressible and compressible flows can be simulated, with or without heat transfer, and a range of turbulence models is available. The code can also be coupled with itself or other software to model some multi-physics problems (fluid-structure, fluid-conjugate heat transfer, for instance). - -\subsubsection{Code Description.\label{ref-0059}} - -Parallelism is handled by distributing the domain over the processors (several partitioning tools are available, either internally, i.e. SFC Hilbert and Morton, or through external libraries, i.e. METIS Serial, ParMETIS, Scotch Serial, PT-SCOTCH. Communications between subdomains are handled by MPI. Hybrid parallelism using MPI/OpenMP has recently been optimised for improved multicore performance. - -For incompressible simulations, most of the time is spent during the computation of the pressure through Poisson equations. The matrices are very sparse. PETSc has recently been linked to the code to offer alternatives to the internal solvers to compute the pressure. The developer's version of PETSc supports CUDA and is used in this benchmark suite. - -Code\_Saturne is written in C, F95 and Python. It is freely available under the GPL license. 
- -\subsubsection{Test Cases Description.\label{ref-0060}} - -Two test cases are dealt with, the former with a mesh made of hexahedral cells and the latter with a mesh made of tetrahedral cells. Both configurations are meant for incompressible laminar flows. The first test case is run on KNL in order to test the performance of the code always completely filling up a node using 64 MPI tasks and then either 1, 2, 4 OpenMP threads, or 1, 2, 4 extra MPI tasks to investigate the effect of hyper-threading. In this case, the pressure is computed using the code's native Algebraic Multigrid (AMG) algorithm as a solver. The second test case is run on KNL and GPU. In this configuration, the pressure equation is solved using the conjugate gradient (CG) algorithm from the PETSc library (the version of PETSc is the developer's version which supports GPU) and tests are run on KNL as well as on CPU+GPU. PETSc is built with the CUSP library and the CUSP format is used. - -Note that computing the pressure using a CG algorithm has always been slower than using the native AMG algorithm, when using Code\_Saturne. The second test is then meant to compare the current results obtained on KNL and GPU using CG only, and not to compare CG and AMG time to solution. - -\paragraph{Flow in a 3-D Lid-driven Cavity (Tetrahedral Cells).} - -The geometry is very simple, i.e. a cube, but the mesh is built using tetrahedral cells only. The Reynolds number is set to 100, and symmetry boundary conditions are applied in the spanwise direction. The case is modular and the mesh size can easily been varied. The largest mesh has about 13 million cells and is used to get some first comparisons using Code\_Saturne linked to the developer's PETSc library, in order to get use of the GPU. - -\paragraph{3-D Taylor-Green Vortex Flow (Hexahedral Cells).} - -The Taylor-Green vortex flow is traditionally used to assess the accuracy of CFD code numerical schemes. Periodicity is used in the 3 directions. The total kinetic energy (integral of the velocity) and enstrophy (integral of the vorticity) evolutions as a function of the time are looked at. Code\_Saturne is set for 2nd order time and spatial schemes. The mesh size is 2563 cells. - -\subsection{CP2K\label{ref-0061}} - -CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. It can perform molecular dynamics, metadynamics, Quantum Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or dimer method. - -CP2K provides a general framework for different modelling methods such as density functional theory (DFT) using the mixed Gaussian and plane waves approaches (GPW) and Gaussian and Augmented Plane (GAPW). Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, {\ldots}), and classical force fields (AMBER, CHARMM, {\ldots}). - -\subsubsection{Code Description.\label{ref-0062}} - -Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI. - -Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi). - -CP2K is written in Fortran 2003 and freely available under the GPL license. 
- -\subsubsection{Test Cases Description.\label{ref-0063}} - -\paragraph{LiH-HFX.} - -This is a single-point energy calculation for a particular configuration of a 216 atom Lithium Hydride crystal with 432 electrons in a $12.3 \mbox{\AA{}}^3$ (Angstroms cubed) cell. The calculation is performed using a DFT algorithm with GAPW under the hybrid Hartree-Fock exchange (HFX) approximation. These types of calculations are generally around one hundred times the computational cost of a standard local DFT calculation, although the cost of the latter can be reduced by using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node and thus enough memory is available to avoid recomputing any integrals on-the-fly, improving performance - -This test case is expected to scale efficiently to 1000+ nodes. - -\paragraph{H2O-DFT-LS.} - -This is a single-point energy calculation for 2048 water molecules in a 39 \AA{}$^{\mathrm{3}}$ box using linear-scaling DFT. A local-density approximation (LDA) functional is used to compute the Exchange-Correlation energy in combination with a DZVP MOLOPT basis set and a 300 Ry cutoff. For large systems, the linear-scaling approach for solving Self-Consistent-Field equations should be much cheaper computationally than using standard DFT, and allow scaling up to 1 million atoms for simple systems. The linear scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard DFT is avoided and key operations are sparse matrix-matrix multiplications, which have a number of non-zero entries that scale linearly with system size. These are implemented efficiently in CP2K's DBCSR library. - -This test case is expected to scale efficiently to 4000+ nodes. - -\subsection{GPAW\label{ref-0064}} - -GPAW is a DFT program for ab-initio electronic structure calculations using the projector augmented wave method. It uses a uniform real-space grid representation of the electronic wavefunctions, that allows for excellent computational scalability and systematic converge properties. - -\subsubsection{Code Description.\label{ref-0065}} - -GPAW is written mostly in Python, but includes also computational kernels written in C as well as leveraging external libraries such as NumPy, BLAS and ScaLAPACK. Parallelisation is based on message-passing using MPI with no threading. Development branches for GPU and MICs include support for offloading to accelerators using either CUDA or pyMIC, respectively. GPAW is freely available under the GPL license. - -\subsubsection{Test Cases Description.\label{ref-0066}} - -\paragraph{Carbon Nanotube.} - -This test case is a ground state calculation for a carbon nanotube in vacuum. By default, it uses a $6-6-10$ nanotube with 240 atoms (freely adjustable) and serial LAPACK with an option to use ScaLAPACK. - -This benchmark is aimed at smaller systems, with an intended scaling range of up to 10 nodes. - -\paragraph{Copper Filament.} - -This test case is a ground state calculation for a copper filament in vacuum. By default, it uses a $2\times2\times3$ FCC lattice with 71 atoms (freely adjustable) and ScaLAPACK for parallelisation. - -This benchmark is aimed at larger systems, with an intended scaling range of up to 100 nodes. 
A lower limit on the number of nodes may be imposed by the amount of memory required, which can be adjusted to some extent with the run parameters (e.g. lattice size or grid spacing). - -\subsection{GROMACS\label{ref-0067}} - -GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. - -It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers. - -GROMACS supports all the usual algorithms you expect from a modern molecular dynamics implementation, and some additional features: - -GROMACS provides extremely high performance compared to all other programs. A lot of algorithmic optimisations have been introduced in the code; for instance, the calculation of the virial has been extracted from the innermost loops over pairwise interactions, and we use our own software routines to calculate the inverse square root. In GROMACS 4.6 and up, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms to SIMD machine instructions, to utilise the available instruction-level parallelism. These kernels are available in both single and double precision, and support all different kinds of SIMD support found in x86-family (and other) processors. - -\subsubsection{Code Description.\label{ref-0068}} - -Parallelisation is achieved using combined OpenMP and MPI. - -Offloading for accelerators is implemented through CUDA for GPU and through OpenMP for MIC (Intel Xeon Phi). - -GROMACS is written in C/C++ and freely available under the GPL license. - -\subsubsection{Test Cases Description.\label{ref-0069}} - -\paragraph{GluCL Ion Channel.} - -The ion channel system is the membrane protein GluCl, which is a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains $142\times10^3$ atoms, and is a quite challenging parallelisation case due to the small size. However, it is likely one of the most wanted target sizes for biomolecular simulations due to the importance of these proteins for pharmaceutical applications. It is particularly challenging due to a highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition. - -This test case was used as the ``Small'' test case in previous 2IP and 3IP PRACE phases. It is included in the package's version 5.0 benchmark cases. It is reported to scale efficiently up to 1000+ cores on x86 based systems. - -\paragraph{Lignocellulose.} - -A model of cellulose and lignocellulosic biomass in an aqueous solution {\cite{ref-0024}. This system of 3.3 million atoms is inhomogeneous. This system uses reaction-field electrostatics instead of PME and therefore scales well on x86. This test case was used as the ``Large'' test case in previous PRACE 2IP and 3IP projects. It is reported in previous PRACE projects to scale efficiently up to 10000+ x86 cores. - -\subsection{NAMD\label{ref-0070}} - -NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms. 
NAMD is developed by the ``Theoretical and Computational Biophysics Group'' at the University of Illinois at Urbana Champaign. In the design of NAMD particular emphasis has been placed on scalability when utilising a large number of processors. The application can read a wide variety of different file formats, for example force fields, protein structures, which are commonly used in bio-molecular science. A NAMD license can be applied for on the developer's website free of charge. Once the license has been obtained, binaries for a number of platforms and the source can be downloaded from the website. Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins or between proteins and other chemical substances is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins. - -\subsubsection{Code Description.\label{ref-0071}} - -NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI, supporting both pure MPI and hybrid parallelisation \cite{ref-0025}. - -Offloading for accelerators is implemented for both GPU and MIC (Intel Xeon Phi). - -\subsubsection{Test Cases Description.\label{ref-0072}} - -The datasets are based on the original "Satellite Tobacco Mosaic Virus (STMV)" dataset from the official NAMD site. The memory optimised build of the package and data sets are used in benchmarking. Data are converted to the appropriate binary format used by the memory optimised build. - -\paragraph{STMV.1M.} - -This is the original STMV dataset from the official NAMD site. The system contains roughly 1 million atoms. This data set scales efficiently up to 1000+ x86 Ivy Bridge cores. - -\paragraph{STMV.8M.} - -This is a $2\times2\times2$ replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms. This data set scales efficiently up to 6000 x86 Ivy Bridge cores. - -\paragraph{STMV.28M.} - -This is a $3\times3\times3$ replication of the original STMV dataset from the official NAMD site. The system contains roughly 28 million atoms. This data set also scales efficiently up to 6000 x86 Ivy Bridge cores. - -\subsection{PFARM\label{ref-0073}} - -PFARM is part of a suite of programs based on the ``R-matrix'' ab-initio approach to the varitional solution of the many-electron Schr\"{o}dinger equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical applications (such as: the interstellar medium, planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions thus enabling large-scale parallel `outer-region' calculations for molecular systems as well as atomic systems. - -\subsubsection{Code Description.\label{ref-0074}} - -In order to enable efficient computation, the external region calculation takes place in two distinct stages, named EXDIG and EXAS, with intermediate files linking the two. EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. 
EXAS uses a combined functional/domain decomposition approach where good load-balancing is essential to maintain efficient parallel performance. Each of the main stages in the calculation is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and is designed to take advantage of highly optimised, numerical library routines. Hybrid MPI / OpenMP parallelisation has also been introduced into the code via shared memory enabled numerical library kernels. - -Accelerator-based implementations have been implemented for both EXDIG and EXAS. EXAS uses offloading via MAGMA (or MKL) for sector Hamiltonian diagonalisations on Intel Xeon Phi and GPU accelerators. EXDIG uses combined MPI and OpenMP to distribute the scattering energy calculations on CPU efficiently both across and within Intel Xeon Phi co-processors. - -\subsubsection{Test Cases Description.\label{ref-0075}} - -External region R-matrix propagations take place over the outer partition of configuration space, including the region where long-range potentials remain important. The radius of this region is determined from the user input and the program decides upon the best strategy for dividing this space into multiple sub-regions (or sectors). Generally, a choice of larger sector lengths requires the application of larger numbers of basis functions (and therefore larger Hamiltonian matrices) in order to maintain accuracy across the sector and vice-versa. Memory limits on the target hardware may determine the final preferred configuration for each test case. - -\paragraph{Iron, $\mathrm{FeIII}$.} - -This is an electron-ion scattering case with 1181 channels. Hamiltonian assembly in the coarse region applies 10 Legendre functions leading to Hamiltonian matrix diagonalisations of order 11810. In the `fine energy region' up to 30 Legendre functions may be applied leading to Hamiltonian matrices of up to order 35430. The number of sector calculations is likely to range from about 15 to over 30 depending on the user specifications. Several thousand scattering energies are used in the calculation. - -\paragraph{Methane, $\mathrm{CH}_4$.} - -The dataset is an electron-molecule calculation with 1361 channels. Hamiltonian dimensions are therefore estimated between 13610 and $\sim40000$. A process in the code which splits the constituent channels according to spin can be used to approximately halve the Hamiltonian size (whilst doubling the overall number of Hamiltonian matrices). As eigensolvers generally require $O(N^3)$ operations, spin splitting leads to a saving in both memory requirements and operation count. The final radius of the external region required is relatively long, leading to more numerous sectors calculations (estimated to between 20 and 30). The calculation will require many thousands of scattering energies. - -In the current model, parallelism in EXDIG is limited to the number of sector calculations, i.e a maximum of around 30 accelerator nodes. - -Methane is a relatively new dataset which has not been calculated on novel technology platforms at the very large-scale to date, so this is somewhat a step into the unknown. We are also somewhat reliant on collaborative partners that are not associated with PRACE for continuing to develop and fine tune the accelerator-based EXAS program for this proposed work. Access to suitable hardware with throughput suited to development cycles is also a necessity if suitable progress is to be ensured. 
- -\subsection{QCD\label{sec:codes-qcd}} - -Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of neutrons and protons, which comprise quarks bound together by gluons. - -The theory of how quarks and gluons interact to form nucleons and other elementary particles is called Quantum Chromo Dynamics (QCD). For most problems of interest, it is not possible to solve QCD analytically, and instead numerical simulations must be performed. Such ``Lattice QCD'' calculations are very computationally intensive, and occupy a significant percentage of all HPC resources worldwide. - -\subsubsection{Code Description.\label{ref-0077}} - -The QCD benchmark benefits of two different implementations described below. - -\paragraph{First Implementation.} - -The MILC code is a freely-available suite for performing Lattice QCD simulations, developed over many years by a collaboration of researchers \cite{ref-0030}. - -The benchmark used here is derived from the MILC code (v6), and consists of a full conjugate gradient solution using Wilson fermions. The benchmark is consistent with ``QCD kernel E'' in the full UAEBS, and has been adapted so that it can efficiently use accelerators as well as traditional CPU. - -The implementation for accelerators has been achieved using the ``targetDP'' programming model \cite{ref-0031}, a lightweight abstraction layer designed to allow the same application source code to be able to target multiple architectures, e.g. NVIDIA GPU and multicore/manycore CPU, in a performance portable manner. The targetDP syntax maps, at compile time, to either NVIDIA CUDA (for execution on GPU) or OpenMP+vectorisation (for implementation on multi/manycore CPU including Intel Xeon Phi). The base language of the benchmark is C and MPI is used for node-level parallelism. - -\paragraph{Second Implementation.} - -The QCD Accelerator Benchmark suite Part 2 consists of two kernels, the QUDA \cite{ref-0027} and the QPhix \cite{ref-0028} library. The library QUDA is based on CUDA and optimize for running on NVIDIA GPU \cite{ref-0032}. The QPhix library consists of routines which are optimize to use INTEL intrinsic functions of multiple vector length, including optimized routines for KNC and KNL's \cite{ref-0033}. In both QUDA and QPhix, the benchmark kernel uses the conjugate gradient solvers implemented within the libraries. - -\subsubsection{Test Cases Description.\label{ref-0078}} - -Lattice QCD involves discretisation of space-time into a lattice of points, where the extent of the lattice in each of the 3 spatial and 1 temporal dimensions can be chosen. This means that the benchmark is very flexible, where the size of the lattice can be varied with the size of the computing system in use (weak scaling) or can be fixed (strong scaling). For testing on a single node, then $64\times64\times32\times8$ is a reasonable size, since this fits on a single Intel Xeon Phi or a single GPU. For larger numbers of nodes, the lattice extents can be increased accordingly, keeping the geometric shape roughly similar. Test cases for the second implementation are given by a strong-scaling mode with a lattice size of $32\times32\times32\times96$ and $64\times64\times64\times128$ and a weak scaling mode with a local lattice size of $48\times48\times48\times24$. 
- -\subsection{QUANTUM ESPRESSO\label{ref-0079}} - -QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). QUANTUM ESPRESSO stands for \textit{opEn Source Package for Research in Electronic Structure, Simulation, and Optimisation}. It is freely available to researchers around the world under the terms of the GNU General Public License. QUANTUM ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms and applied in the last twenty years by some of the leading materials modelling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures, and a great effort being devoted to user friendliness. QUANTUM ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate in the project by contributing their own codes or by implementing their own ideas into existing codes. - -QUANTUM ESPRESSO is written mostly in Fortran90, and parallelised using MPI and OpenMP and is released under a GPL license. - -\subsubsection{Code Description.\label{ref-0080}} - -During 2011 a GPU-enabled version of Quantum ESPRESSO was publicly released. The code is currently developed and maintained by Filippo Spiga at the High Performance Computing Service - University of Cambridge (United Kingdom) and Ivan Girotto at the International Centre for Theoretical Physics (Italy). The initial work has been supported by the EC-funded PRACE and a SFI (Science Foundation Ireland, grant 08/HEC/I1450). At the time of writing, the project is self-sustained thanks to the dedication of the people involved and thanks to NVIDIA support in providing hardware and expertise in GPU programming. - -The current public version of QE-GPU is 14.10.0 as it is the last version maintained as plug-in working on all QE 5.x versions. QE-GPU utilised phiGEMM (external) for CPU+GPU GEMM computation, MAGMA (external) to accelerate eigen-solvers and explicit CUDA kernel to accelerate compute-intensive routines. FFT capabilities on GPU are available only for serial computation due to the hard challenges posed in managing accelerators in the parallel distributed 3D-FFT portion of the code where communication is the dominant element that limits excellent scalability beyond hundreds of MPI ranks. - -A version for Intel Xeon Phi (MIC) accelerators is not currently available. Standart x86 version have been used on KNL. - -\subsubsection{Test Cases Description.\label{ref-0081}} - -\paragraph{PW-IRMOF\_M11.} - -Full SCF calculation of a Zn-based isoreticular metal--organic framework (total 130 atoms) over 1 K point. Benchmarks run in 2012 demonstrated speedups due to GPU (NVIDIA K20s, with respect to non-accelerated nodes) in the range 1.37 -- 1.87, according to node count (maximum number of accelerators=8). Runs with current hardware technology and an updated version of the code are expected to exhibit higher speedups (probably 2-3x) and scale up to a couple hundred nodes. - -\paragraph{PW-SiGe432.} - -This is a SCF calculation of a Silicon-Germanium crystal with 430 atoms. 
Being a fairly large system, parallel scalability up to several hundred, perhaps a 1000 nodes is expected, with accelerated speed-ups likely to be of 2-3x. - -\subsection{Synthetic benchmarks -- SHOC\label{ref-0082}} - -The Accelerator Benchmark Suite will also include a series of synthetic benchmarks. For this purpose, we choose the Scalable HeterOgeneous Computing (SHOC) benchmark suite, augmented with a series of benchmark examples developed internally. SHOC is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing. Its initial focus is on systems containing GPU and multi-core processors, and on the OpenCL programming standard, but CUDA and OpenACC versions were added. Moreover, a subset of the benchmarks is optimised for the Intel Xeon Phi coprocessor. SHOC can be used on clusters as well as individual hosts. - -The SHOC benchmark suite currently contains benchmark programs categorised by complexity. Some measure low-level 'feeds and speeds' behaviour (Level 0), some measure the performance of a higher-level operation such as a Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2). - -The SHOC benchmark suite has been selected to evaluate the performance of accelerators on synthetic benchmarks, mostly because SHOC provides CUDA / OpenCL / Offload / OpenACC variants of the benchmarks. This allowed us to evaluate NVIDIA GPU (with CUDA / OpenCL / OpenACC), Intel Xeon Phi KNC (with both Offload and OpenCL), but also Intel host CPU (with OpenCL/OpenACC). However, on the latest Xeon Phi processor (codenamed KNL) none of these 4 models is supported. Thus, benchmarks on the KNL architecture can not be run at this point, and there aren't any news of Intel supporting OpenCL on the KNL. However, there is work in progress on the PGI compiler to support the KNL as a target. This support will be added during 2017. This will allow us to compile and run the OpenACC benchmarks for the KNL. Alternatively, the OpenACC benchmarks will be ported to OpenMP and executed on the KNL. - -\subsubsection{Code Description.\label{ref-0083}} - -All benchmarks are MPI-enabled. Some will report aggregate metrics over all MPI ranks, others will only perform work for specific ranks. - -Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi). For selected benchmarks OpenACC implementations are provided for GPU. Multi-node parallelisation is achieved using MPI. - -SHOC is written in C++ and is open-source and freely available. - -\subsubsection{Test Cases Description.\label{ref-0084}} - -The benchmarks contained in SHOC currently feature 4 different sizes for increasingly large systems. The size convention is as follows: - -\begin{itemize} -\item CPU / debugging -\item Mobile/integrated GPU -\item Discrete GPU (e.g. GeForce or Radeon series) -\item HPC-focused or large memory GPU (e.g. Tesla or Firestream Series) -\end{itemize} - -In order to go even larger scale, we plan to add a 5th level for massive supercomputers. - -\subsection{SPECFEM3D\_GLOBE\label{ref-0085}} - -The software package SPECFEM3D\_GLOBE simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). All SPECFEM3D\_GLOBE software is written in Fortran90 with full portability in mind, and conforms strictly to the Fortran95 standard. 
It uses no obsolete or obsolescent features of Fortran77. The package uses parallel programming based upon the Message Passing Interface (MPI). - -The SEM was originally developed in computational fluid dynamics and has been successfully adapted to address problems in seismic wave propagation. It is a continuous Galerkin technique, which can easily be made discontinuous; it is then close to a particular case of the discontinuous Galerkin technique, with optimised efficiency because of its tensorised basis functions. In particular, it can accurately handle very distorted mesh elements. It has very good accuracy and convergence properties. The spectral element approach admits spectral rates of convergence and allows exploiting hp-convergence schemes. It is also very well suited to parallel implementation on very large supercomputers as well as on clusters of GPU accelerating graphics cards. Tensor products inside each element can be optimised to reach very high efficiency, and mesh point and element numbering can be optimised to reduce processor cache misses and improve cache reuse. The SEM can also handle triangular (in 2D) or tetrahedral (3D) elements as well as mixed meshes, although with increased cost and reduced accuracy in these elements, as in the discontinuous Galerkin method. - -In many geological models in the context of seismic wave propagation studies (except for instance for fault dynamic rupture studies, in which very high frequencies of supershear rupture need to be modelled near the fault) a continuous formulation is sufficient because material property contrasts are not drastic and thus conforming mesh doubling bricks can efficiently handle mesh size variations. This is particularly true at the scale of the full earth. Effects due to lateral variations in compressional-wave speed, shear-wave speed, density, a 3D crustal model, ellipticity, topography and bathyletry, the oceans, rotation, and self-gravitation are included. The package can accommodate full 21-parameter anisotropy as well as lateral variations in attenuation. Adjoint capabilities and finite-frequency kernel simulations are also included. - -\subsubsection{Test cases definition.\label{ref-0086}} - -Both test cases will use the same input data. A 3D shear-wave speed model (S362ANI) will be used to benchmark the code. - -Here is an explanation of the simulation parameters that will be used to size the test case: - -\begin{itemize} -\item \verb}NCHUNKS,} number of face of the cubed sphere included in the simulation (will be always 6) -\item \verb}NPROC_XI}, number of slice along one chunk of the cubed sphere (will represents also the number of processors used for 1 chunk -\item \verb}NEX_XI}, number of spectral elements along one side of a chunk. -\item \verb}RECORD_LENGHT_IN_MINUTES}, length of the simulated seismograms. The time of the simulation should vary linearly with this parameter. 
-\end{itemize} - -\paragraph{Small test case.} - -It runs with 24 MPI tasks and has the following mesh characteristics: - -\begin{itemize} -\item \verb+NCHUCKS = 6+ -\item \verb+NPROC_XI = 2+ -\item \verb+NEX_XI = 80+ -\item \verb+RECORD_LENGHT_IN_MINUTES = 2.0+ -\end{itemize} - -\paragraph{Bigger test case.} - -It runs with 150 MPI tasks and has the following mesh characteristics: - -\begin{itemize} -\item \verb+NCHUCKS = 6+ -\item \verb+NPROC_XI = 5+ -\item \verb+NEX_XI = 80+ -\item \verb+RECORD_LENGHT_IN_MINUTES = 2.0+ -\end{itemize} diff --git a/doc/iwoph17/content.tex b/doc/iwoph17/content.tex deleted file mode 100644 index dff334214093c2097ae0204e1718d9566bbd2e9e..0000000000000000000000000000000000000000 --- a/doc/iwoph17/content.tex +++ /dev/null @@ -1,350 +0,0 @@ -\part*{Executive Summary} -\addcontentsline{toc}{part}{Executive Summary} - -This document describes an accelerator benchmark suite, a set of 11 codes that includes 1 synthetic benchmarks and 10 comonly used applications. The key focus of this task has been exploiting accelerators or co-processors to improve the performance of real applications. It aims at providing a set of scalable, currently relevant and publically available codes and datasets. - -This work has been undertaken be Task7.2B "Accelerator Benchmarks" in the PRACE Forth Implementation Phase (PRACE-4IP) project. -Most of the selected application are a subset of the Unified European Applications Benchmark Suite (UEABS) \ref{}. One application and a synthetic benchmark have been added. - -As a result, selected codes are: ALYA, Code\_Saturne, CP2K, GROMACS, -GPAW, NAMD, PFARM, QCD, Quantum Espresso, SHOC and SPECFEM3D. - -For each code either two or more test case datasets have been selected. These are described in -this document, along with a brief introduction to the application codes themselves. For each -code some sample results are presented, from first run on leading egde systems and prototypes. - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -% Description of action/Grant Agreement -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -% Task 7.2 will look further ahead to future PRACE exascale systems. A key focus of this task will be exploiting accelerators or co-processors to improve the performance of real applications. There are currently a variety of different classes of accelerators and we will compare these to demonstrate their relative strengths and weaknesses. As part of this activity, we will develop an accelerator benchmark suite that includes both synthetic benchmarks and a subset of the applications from the Unified European Applications Benchmark Suite (UEABS); this will help researchers interested in knowing how to use accelerators effectively for their application - -\part{Introduction} -% Provide a brief description of the following (if appropriate): -% Objectives of the work related to the project as a whole; -% Purpose of the document; -The work produced within this task is an extension of the UEABS for accelerators. This document will cover each code, presenting the code in itself as well as the test cases defined for the benchmarks and the first results that have been recorded on various accelerator systems. - -% Intended audience. -As the UEABS, this suite aims to present results for many scientific fields that can use HPC accelerated resources. Hence, it will help the European scientific communities to make a decision in terms of infrastructures they could buy in a near future. 
We focus on Intel Xeon Phi coprocessors and NVidia GPU cards for benchmarking as they are the two most important accelerated resources available at the moment. - -% Structure of the document (what is in the different sections/chapters); -Section \ref{hardware} will present both type accelerator systems Xeon Phi and GPU card along with architcture examples. Section \ref{applications} gives a description of each of the selected applications, together with the test case datasets, and presents some sample results. Section \ref{conclusion} outlines further work on, and using, the suite. - -\part{Targeted architechture} - -%% https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4240026/ -% Scientific computing using co-processors (accelerators) has gained popularity in recent years. The utility of graphics processing units (GPUs) has been demonstrated and evaluated in several application domains [http://www.nvidia.com/object/quadro-design-and-manufacturing.html]. As a result, hybrid systems that combine multi-core CPUs with one or more co-processors of the same or different types are being more widely employed to speed up expensive computations. The architectures and programming models of co-processors may differ from CPUs and vary among different co-processor types. This heterogeneity leads to challenging problems in implementing application operations and obtaining the best performance. The performance of an application operation will depend on its data access and processing patterns and may vary widely from one co-processor to another. Understanding the performance characteristics of classes of operations can help in designing more efficient applications, choosing the appropriate co-processor for an application, and developing more effective task scheduling and mapping strategies. - - -%% https://goparallel.sourceforge.net/xeon-phi-vs-gpu-programming-better/ -% When you’re planning to build a high-performance software system, you need to decide what hardware to target. One option that many people might not be familiar with is the graphics card. There’s a technology called CUDA, which stands for Compute Unified Device Architecture, and it was created by NVIDIA, who makes, among other things, graphics cards. The processors on the graphics cards are called Graphics Processing Units, or GPUs. The GPUs made by NVIDIA have lots of cores, and you can actually compile code, offload it from your computer to the graphics card, and run the code on the GPU instead of your computer’s own CPU. By taking advantage of the high number of cores in the GPU, you can accomplish parallel programming much like we discuss here at Go Parallel. - -This suite is targeting accelerator cards, more specifically the Intel Xeon Phi and Nvidia GPU architecture. This section will quickly describe them and will present the four machine, the benchmarks ran on. - -\section{Co-processor description} -Scientific computing using co-processors has gained popularity in recent years. First the utility of GPUs has been demonstrated and evaluated in several application domains [http://www.nvidia.com/object/quadro-design-and-manufacturing.html]. As a response to NVIDA supprematy on this field, Intel designed Xeon Phi cards. - -Architectures and programming models of co-processors may differ from CPUs and vary among different co-processor types. The main challenges are the high level parallelism ability required from the softwares and the fact that code may have to be offloaded to the accelerator card. 
- -Here are raw specifications of some main co-processors: - -table - - Intel Xeon Phi NVIDIA gpus - 5110P(knc) 7250(knl) k40m p100 -public availability date Nov 2012 June 2016 June 2013 May 2016 -theoretical peak perf** 1,011GF/s* 3,046GF/s* 1,430GF/s 5,300GF/s -offload required yes no yes yes -max number of thread/cuda cores 240 272 2880 3584 - -* différence puissance annoncé/reelle knl (demander à gab) -** double precision - -\section{Systems description} - -The benchmark suite has been officially granted access to 4 different machines hosted by PRACE partners. A majority of the results presented in this paper were obtained on this machines but some of the simulation has run on similar ones. This section will cover specifications of the sub mentioned 4 official systems while the few exotic ones will be presented along with concerned results. - -As it can be noticed on the previous section, leading edge architectures have been available quite recently and some code couldn't run on it yet. Results will be completed in a near future and will be delivered with an update of the benchmark suite. Still, presented performances are a good indicator about potential efficiency of codes on both Xeon Phi and NVIDIA GPU platforms. - -\subsection{K40 cluster} -The SURFsara institute in Nederland granted access to Cartesius which has a GPU island (installed May 2014) with following specifications [https://userinfo.surfsara.nl/systems/cartesius/description]: - -66 bullx B515 GPGPU accelerated nodes - - 2 x 8-core 2.5 GHz Intel Xeon E5-2450 v2 (Ivy Bridge) CPUs/node - - 2 x NVIDIA Tesla K40m GPGPUs/node -96 GB/node -Total theoretical peak performance (Ivy Bridge + K40m) 1,056 cores + 132 GPUs: 210 Tflop/s -The interconnect has a fully non-blocking fat-tree topology. Every bullx node has a Mellanox ConnectX-3 or Connect-IB (Haswell thin nodes) InfiniBand adapter providing 4 x FDR (Fourteen Data Rate) resulting in 56 Gbit/s inter-node bandwidth, The GPGPU nodes have two ConnectX-3 InfiniBand adapters: one per GPGPU. - -\subsection{Xeon Phi 5110P cluster} -The Barcelona Supercomputing Center (BSC) in Spain granted access to MareNostrum III which features KNC nodes. Here's the description of this partition [https://www.bsc.es/support/MareNostrum3-ug.pdf, MareNostrum III User’s Guide Barcelona Supercomputing Center]: - -42 heterogenenous nodes contain: - - 8x 8G DDR3–1600 DIMMs (4GB/core) Total: 64GB/node - - 2x Xeon Phi 5110P accelerators -Interconnection Networks -– Infiniband Mellanox FDR10: High bandwidth network used by parallel applications communications -(MPI) -– Gigabit Ethernet: 10GbitEthernet network used by the GPFS Filesystem. - - -\subsection{P100 cluster} -GENCI granted access to the Ouessant prototype at IDRIS in France (installed September 2016). It is composed of 12 IBM Minsky compute nodes with each containing [http://www.idris.fr/eng/ouessant/]: - -2 POWER8+ sockets, 10 cores, 8 threads per core (or 160 threads par node) -128 GB of DDR4 memory (bandwidth > 9 GB/s per core) -4 Nvidia new generation Pascal P100 GPUs, 16 GB of HBM2 memory -4 NVLink interconnects (40GB/s of bi-directional bandwidth per interconnect); each GPU card is connected to a CPU with 2 NVLink interconnects and another GPU with 2 interconnects remaining (see figure below) -A Mellanox EDR IB CAPI interconnexion network (1 interconnect per node) - - -\subsection{Xeon Phi 7250 cluster} -GENCI granted also access to the Frioul prototype at CINES in France (installed December 2016). 
\subsection{Xeon Phi 7250 cluster}
GENCI also granted access to the Frioul prototype at CINES in France (installed December 2016). It is composed of 48 Intel KNL compute nodes:

\begin{itemize}
\item Peak performance of 146 Tflop/s
\item Interconnect: InfiniBand 4x FDR
\item File system: Lustre, more than 5 PB usable and a maximum bandwidth of 105 GB/s
\end{itemize}

\part{Benchmark suite description}
% Description of the codes / test cases / interest for the scientific community
% TODO: or alternatively "as stated in WP212 brief code desc and test cases listing"
This part covers each code, presenting its interest for the scientific community as well as the test cases defined for the benchmarks.
As an extension of the UEABS~\ref{}, most of the codes presented in this suite are also included in the latter. The exceptions are PFARM, which comes from PRACE-2IP~\ref{}, and SHOC, a synthetic benchmark suite.

\section{Alya}
Alya is a high-performance computational mechanics code that can solve different coupled mechanics problems: incompressible/compressible flows, solid mechanics, chemistry, excitable media, heat transfer and Lagrangian particle transport. It is one single code: there are no particular parallel or individual platform versions. Modules, services and kernels can be compiled individually and used \`a la carte. The main discretisation technique employed in Alya is based on the variational multiscale finite element method to assemble the governing equations into algebraic systems. These systems can be solved using solvers like GMRES, Deflated Conjugate Gradient and pipelined CG, together with preconditioners like SSOR, Restricted Additive Schwarz, etc. The coupling between physics solved in different computational domains (like fluid-structure interactions) is carried out in a multi-code way, using different instances of the same executable. Asynchronous coupling can be achieved in the same way in order to transport Lagrangian particles.

\subsection{Code description}
The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available, without and with a colouring strategy to avoid ATOMICs during the assembly step. A CUDA version is also available for the different solvers. Alya has also been compiled for MIC (Intel Xeon Phi).

Alya is written in Fortran 95 and the incompressible fluid module, present in the benchmark suite, is freely available. This module solves the Navier-Stokes equations using an Orthomin~\ref{} method for the pressure Schur complement. This method is an algebraic split strategy which converges to the monolithic solution. At each linearisation step, the momentum is solved twice and the continuity equation is solved once or twice, depending on whether the momentum-preserving or the continuity-preserving algorithm is selected.

\subsection{Test cases description}
\subsubsection{Cavity-hexahedra elements (10M elements)}
This test is the classical lid-driven cavity. The problem geometry is a cube of dimensions 1x1x1. The fluid properties are density = 1.0 and viscosity = 0.01. Dirichlet boundary conditions are applied on all sides, with three no-slip walls and one moving wall with velocity equal to 1.0, which corresponds to a Reynolds number of 100. The Reynolds number is low, so the regime is laminar and turbulence modelling is not necessary. The domain is discretised into 9,800,344 hexahedral elements. The solvers are the GMRES method for the momentum equations and the Deflated Conjugate Gradient for the continuity equation. This test case can be run using pure MPI parallelisation or the hybrid MPI/OpenMP strategy.
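
For reference, the quoted Reynolds number follows directly from the problem parameters given above (lid velocity $U$, cavity edge length $L$, density $\rho$ and dynamic viscosity $\mu$):

\[
Re = \frac{\rho U L}{\mu} = \frac{1.0 \times 1.0 \times 1.0}{0.01} = 100 .
\]
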
\subsubsection{Cavity-hexahedra elements (30M elements)}
This is the same cavity test as before but with 30M elements. Note that a mesh multiplication strategy enables one to multiply the number of elements by powers of 8, by simply activating the corresponding option in the ker.dat file.

\subsubsection{Cavity-hexahedra elements -- GPU version (10M elements)}
This is the same test as Test case 1, but using the pure MPI parallelisation strategy with acceleration of the algebraic solvers using GPUs.

\section{Code Saturne}
Code Saturne is an open-source CFD software package developed by EDF R\&D since 1997 and open-source since 2007. The Navier-Stokes equations are discretised following a finite volume approach. The code can handle any type of mesh built with any type of cell/grid structure. Incompressible and compressible flows can be simulated, with or without heat transfer, and a range of turbulence models is available. The code can also be coupled with itself or other software to model some multiphysics problems (fluid-structure interaction or conjugate heat transfer, for instance).

\subsection{Code description}
Parallelism is handled by distributing the domain over the processors (several partitioning tools are available, either internally, i.e. the SFC Hilbert and Morton curves, or through external libraries, i.e. serial METIS, ParMETIS, serial Scotch and PT-SCOTCH). Communications between subdomains are performed through MPI. Hybrid parallelism using OpenMP has recently been optimised for improved multicore performance.

For incompressible simulations, most of the time is spent in the computation of the pressure through Poisson equations. PETSc and HYPRE have recently been linked to the code to offer alternatives to the internal solvers to compute the pressure. The developer's version of PETSc supports CUDA and will be used in this benchmark suite.

Code Saturne is written in C, Fortran 95 and Python. It is freely available under the GPL license.

\subsection{Test cases description}
Two test cases are dealt with, the former with a mesh made of tetrahedral cells and the latter with a mesh made of hexahedral cells. Both configurations are meant for incompressible laminar flows. Note that both configurations will also be used in the regular UEABS.

\subsubsection{Flow in a 3-D lid-driven cavity (tetrahedral cells)}
The geometry is very simple, i.e. a cube, but the mesh is built using tetrahedral cells. The Reynolds number is set to 400, and symmetry boundary conditions are applied in the spanwise direction. The case is modular and the mesh size can easily be varied. The largest mesh has about 13 million cells.

This test case is expected to scale efficiently to 1000+ nodes.

\subsubsection{3-D Taylor-Green vortex flow (hexahedral cells)}
The Taylor-Green vortex flow is traditionally used to assess the accuracy of the numerical schemes of CFD codes. Periodicity is used in the 3 directions. The evolutions of the total kinetic energy (integral of the velocity) and of the enstrophy (integral of the vorticity) as a function of time are examined. Code Saturne is set to second-order time and spatial schemes, and three meshes are considered, containing $128^3$, $256^3$ and $512^3$ cells, respectively.

This test case is expected to scale efficiently to 4000+ nodes for the largest mesh.
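
For reference, the classical Taylor-Green vortex is initialised on a triply periodic box of side $2\pi$ with a velocity field of the form

\[
u = U_0 \sin x \, \cos y \, \cos z, \qquad
v = -U_0 \cos x \, \sin y \, \cos z, \qquad
w = 0 ,
\]

where $U_0$ is a reference velocity; the exact normalisation and Reynolds number used in the benchmark are defined in the test case input files and may differ from this textbook form.
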
\section{CP2K}
CP2K is a quantum chemistry and solid-state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. It can perform molecular dynamics, metadynamics, Quantum Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or the dimer method.

CP2K provides a general framework for different modelling methods such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, ...), and classical force fields (AMBER, CHARMM, ...).

\subsection{Code description}
Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI.

Offloading for accelerators is implemented through CUDA and OpenCL for GPGPUs and through OpenMP for MIC (Intel Xeon Phi).

CP2K is written in Fortran 2003 and freely available under the GPL license.

\subsection{Test cases description}
\subsubsection{LiH-HFX}
This is a single-point energy calculation for a particular configuration of a 216-atom Lithium Hydride crystal with 432 electrons in a 12.3~\AA$^3$ (Angstroms cubed) cell. The calculation is performed using a density functional theory (DFT) algorithm with Gaussian and Augmented Plane Waves (GAPW) under the hybrid Hartree-Fock exchange (HFX) approximation. These types of calculations are generally around one hundred times the computational cost of a standard local DFT calculation, although the cost of the latter can be reduced by using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here, as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node and thus enough memory is available to avoid recomputing any integrals on the fly, improving performance.

This test case is expected to scale efficiently to 1000+ nodes.

\subsubsection{H2O-DFT-LS}
This is a single-point energy calculation for 2048 water molecules in a 39~\AA$^3$ box using linear-scaling DFT. A local-density approximation (LDA) functional is used to compute the exchange-correlation energy, in combination with a DZVP MOLOPT basis set and a 300~Ry cutoff. For large systems, the linear-scaling approach for solving the Self-Consistent-Field equations should be much cheaper computationally than standard DFT, and should allow scaling up to 1 million atoms for simple systems. The linear-scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard DFT is avoided and the key operation is sparse matrix-matrix multiplication, whose number of non-zero entries scales linearly with system size. These are implemented efficiently in CP2K's DBCSR library.

This test case is expected to scale efficiently to 4000+ nodes.

\section{GPAW}
GPAW is a software package for ab initio electronic structure calculations using the projector augmented wave (PAW) method. Using a uniform real-space grid representation of the electronic wavefunctions, as implemented in GPAW, allows for excellent computational scalability and systematic convergence properties in density functional theory calculations.

\subsection{Code description}
GPAW is written mostly in Python, but also includes computational kernels written in C, as well as leveraging external libraries such as NumPy, BLAS and ScaLAPACK.
Parallelisation is based on message passing using MPI, with no support for multithreading. Development branches for GPGPUs and MICs include support for offloading to accelerators using either CUDA or pyMIC/libxstream, respectively. GPAW is freely available under the GPL license.

\subsection{Test cases description}
\subsubsection{Carbon Nanotube}
This is a single-point energy calculation for a (6,6) carbon nanotube with a freely adjustable length (240 atoms by default). The calculation is performed using the residual minimisation method with the RMM-DIIS eigensolver and a multigrid Jacobian method as a Poisson solver.

This test case is expected to be suitable for smaller systems with up to 10 nodes.

\subsubsection{Carbon Fullerenes on a Lead Sheet}
This is a single-point energy calculation for a system consisting of two C60 fullerenes next to a Pb112 bulk sheet. The system consists of 232 atoms in a 14.2 x 14.2 x 40.0 Å unit cell. The calculation is performed using the residual minimisation method with the RMM-DIIS eigensolver and the Perdew-Burke-Ernzerhof exchange-correlation functional.

This test case is expected to be suitable for larger systems with up to 100 nodes. Smaller systems may be limited by the memory requirement, which can nevertheless be adjusted to some extent with the run parameters for Brillouin-zone sampling and grid spacing.

\section{GROMACS}
GROMACS is a versatile package to perform molecular dynamics, i.e. to simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (which usually dominate simulations), many groups also use it for research on non-biological systems, e.g. polymers. GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation, and some additional features.

GROMACS provides extremely high performance compared to all other programs. A lot of algorithmic optimisations have been introduced in the code; we have, for instance, extracted the calculation of the virial from the innermost loops over pairwise interactions, and we use our own software routines to calculate the inverse square root. In GROMACS 4.6, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms to SIMD machine instructions, to utilise the available instruction-level parallelism. These kernels are available in both single and double precision, and support all the different kinds of SIMD instructions found in x86-family processors available in January 2013.
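
As an illustration of the kind of intrinsic-based kernel referred to above, the snippet below shows the generic technique of refining a hardware reciprocal square root estimate with one Newton-Raphson step, written here with SSE intrinsics. This is a minimal sketch for illustration only and is not taken from the GROMACS sources; the function and variable names are ours.

\begin{verbatim}
#include <xmmintrin.h>   /* SSE intrinsics */

/* Approximate 1/sqrt(x) for 4 packed floats, refined by one
 * Newton-Raphson iteration: y <- 0.5 * y * (3 - x*y*y).        */
static inline __m128 rsqrt_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y = _mm_rsqrt_ps(x);              /* ~12-bit estimate */
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three,
                                 _mm_mul_ps(x, _mm_mul_ps(y, y))));
}
\end{verbatim}

The same pattern carries over to wider instruction sets (AVX, AVX-512): a slow hardware division or square root is traded for a fast estimate plus a cheap correction, applied to several interaction pairs at once.
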
\subsection{Code description}
Parallelisation is achieved using combined OpenMP and MPI.
Offloading for accelerators is implemented through CUDA for GPGPUs and through OpenMP for MIC (Intel Xeon Phi).

GROMACS is written in C/C++ and freely available under the GPL license.

\subsection{Test cases description}

\subsubsection{GluCl Ion Channel}
The ion channel system is the membrane protein GluCl, which is a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains 142k atoms and is a quite challenging parallelisation case due to its small size. However, it is likely one of the most wanted target sizes for biomolecular simulations due to the importance of these proteins for pharmaceutical applications. It is particularly challenging due to the highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition.
This test case was used as the “Small” test case in the previous PRACE-2IP and PRACE-3IP projects. It is included in the package's version 5.0 benchmark cases. It is reported to scale efficiently up to 1000+ cores on x86-based systems.

\subsubsection{Lignocellulose}
A model of cellulose and lignocellulosic biomass in an aqueous solution [http://pubs.acs.org/doi/abs/10.1021/bm400442n]. This system of 3.3 million atoms is inhomogeneous. It uses reaction-field electrostatics instead of PME and therefore scales well on x86. This test case was used as the “Large” test case in the previous PRACE-2IP and -3IP projects. It is reported in previous PRACE projects to scale efficiently up to 10000+ x86 cores.

\section{NAMD}

NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms. NAMD is developed by the “Theoretical and Computational Biophysics Group” at the University of Illinois at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability when utilising a large number of processors. The application can read a wide variety of different file formats, for example force fields and protein structures, which are commonly used in bio-molecular science. A NAMD license can be applied for on the developer's website free of charge. Once the license has been obtained, binaries for a number of platforms and the source can be downloaded from the website. Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins or between proteins and other chemical substances is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins.

\subsection{Code description}
NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI, supporting both pure MPI and hybrid parallelisation.
See the Web site: http://www.ks.uiuc.edu/Research/namd/

Offloading for accelerators is implemented for both GPGPUs and MIC (Intel Xeon Phi).

\subsection{Test cases description}
The datasets are based on the original “Satellite Tobacco Mosaic Virus (STMV)” dataset from the official NAMD site. The memory-optimised build of the package and its data sets are used in benchmarking. The data are converted to the appropriate binary format used by the memory-optimised build.

\subsubsection{STMV.1M}
This is the original STMV dataset from the official NAMD site. The system contains roughly 1 million atoms. This data set scales efficiently up to 1000+ x86 Ivy Bridge cores.

\subsubsection{STMV.8M}
This is a 2x2x2 replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms. This data set scales efficiently up to 6000 x86 Ivy Bridge cores.

\section{PFARM}
PFARM is part of a suite of programs based on the ‘R-matrix’ ab-initio approach to the variational solution of the many-electron Schrödinger equation for electron-atom and electron-ion scattering.
The package has been used to calculate electron collision data for astrophysical applications (such as the interstellar medium and planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel ‘outer-region’ calculations for molecular systems as well as atomic systems.

\subsection{Code description}
In order to enable efficient computation, the external region calculation takes place in two distinct stages, named EXDIG and EXAS, with intermediate files linking the two. EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. EXAS uses a combined functional/domain decomposition approach where good load balancing is essential to maintain efficient parallel performance. Each of the main stages in the calculation is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and is designed to take advantage of highly optimised numerical library routines. Hybrid MPI/OpenMP parallelisation has also been introduced into the code via shared-memory-enabled numerical library kernels.

Accelerator support has been implemented for both EXDIG and EXAS. EXDIG uses offloading via MAGMA (or MKL) for the sector Hamiltonian diagonalisations on Intel Xeon Phi and GPGPU accelerators. EXAS uses combined MPI and OpenMP to distribute the scattering energy calculations efficiently, both across and within Intel Xeon Phi accelerators.

\subsection{Test cases description}
External region R-matrix propagations take place over the outer partition of configuration space, including the region where long-range potentials remain important. The radius of this region is determined from the user input, and the program decides upon the best strategy for dividing this space into multiple sub-regions (or sectors). Generally, a choice of larger sector lengths requires the application of larger numbers of basis functions (and therefore larger Hamiltonian matrices) in order to maintain accuracy across the sector, and vice versa. Memory limits on the target hardware may determine the final preferred configuration for each test case.

\subsubsection{FeIII}
This is an electron-ion scattering case with 1181 channels. Hamiltonian assembly in the coarse region applies 10 Legendre functions, leading to Hamiltonian matrix diagonalisations of order 11810. In the fine region, up to 30 Legendre functions may be applied, leading to Hamiltonian matrices of order 35430. The number of sector calculations is likely to range from about 15 to over 30, depending on the user specifications. Several thousand scattering energies will be used in the calculation.

In the current model, parallelism in EXDIG is limited to the number of sector calculations, i.e. around 30 accelerator nodes. Parallelism in EXAS is limited by the number of scattering energies, so we would expect this to reach into the hundreds of nodes.
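
The Hamiltonian orders quoted in these test cases simply follow from the product of the number of channels and the number of Legendre basis functions applied per sector:

\[
N_{H} = n_{\mathrm{channels}} \times n_{\mathrm{Legendre}},
\qquad 1181 \times 10 = 11810, \quad 1181 \times 30 = 35430 .
\]
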
\subsubsection{Methane}
The dataset is an electron-molecule calculation with 1361 channels. Hamiltonian dimensions are therefore estimated at between 13610 and ~40000. The length of the external region required is relatively long, leading to more numerous sector calculations (estimated at between 25 and 50). The calculation will require many thousands of scattering energies.

EXDIG is expected to scale up to 50 accelerator nodes, and EXAS on hundreds to low thousands of nodes.

\section{QCD}

Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of neutrons and protons, which comprise quarks bound together by gluons.

The theory of how quarks and gluons interact to form nucleons and other elementary particles is called Quantum Chromodynamics (QCD). For most problems of interest, it is not possible to solve QCD analytically, and instead numerical simulations must be performed. Such “Lattice QCD” calculations are very computationally intensive and occupy a significant percentage of all HPC resources worldwide.

The MILC code is a freely available suite for performing Lattice QCD simulations, developed over many years by a collaboration of researchers (physics.indiana.edu/~sg/milc.html).

The benchmark used here is derived from the MILC code (v6), and consists of a full conjugate gradient solution using Wilson fermions. The benchmark is consistent with “QCD kernel E” in the full UEABS, and has been adapted so that it can efficiently use accelerators as well as traditional CPUs.

\subsection{Code description}
The implementation for accelerators has been achieved using the “targetDP” programming model [http://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README], a lightweight abstraction layer designed to allow the same application source code to target multiple architectures, e.g. NVidia GPUs and multicore/manycore CPUs, in a performance-portable manner. The targetDP syntax maps, at compile time, to either NVidia CUDA (for execution on GPUs) or OpenMP plus vectorisation (for execution on multi/manycore CPUs, including the Intel Xeon Phi). The base language of the benchmark is C, and MPI is used for node-level parallelism.

\subsection{Test cases description}
Lattice QCD involves the discretisation of space-time into a lattice of points, where the extent of the lattice in each of the 3 spatial dimensions and the 1 temporal dimension can be chosen. This makes the benchmark very flexible: the size of the lattice can be varied with the size of the computing system in use (weak scaling) or kept fixed (strong scaling). For testing on a single node, a 64x64x32x8 lattice is a reasonable size, since it fits on a single Intel Xeon Phi or a single GPU. For larger numbers of nodes, the lattice extents can be increased accordingly, keeping the geometric shape roughly similar.
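
Since the computational core of this benchmark is a conjugate gradient solve, the sketch below recalls the structure of the plain CG iteration for a symmetric positive-definite system $Ax=b$. It is written as a tiny dense example in plain C for illustration only; the actual benchmark applies the sparse Wilson-Dirac operator in place of the matrix-vector product and distributes the lattice with MPI.

\begin{verbatim}
#include <stdio.h>
#include <math.h>

#define N 4                     /* toy problem size */

/* y = A*x; the benchmark applies the Wilson-Dirac operator here. */
static void apply_A(const double A[N][N], const double *x, double *y)
{
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i] * b[i];
    return s;
}

/* Plain conjugate gradient; returns the number of iterations used. */
static int cg(const double A[N][N], const double *b, double *x, double tol)
{
    double r[N], p[N], Ap[N];
    apply_A(A, x, Ap);
    for (int i = 0; i < N; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(r, r);
    for (int it = 0; it < 100 * N; it++) {
        if (sqrt(rr) < tol) return it;
        apply_A(A, p, Ap);
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        for (int i = 0; i < N; i++) p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
    }
    return -1;                  /* no convergence */
}

int main(void)
{
    /* Small symmetric positive-definite test matrix. */
    const double A[N][N] = { {4,1,0,0}, {1,4,1,0}, {0,1,4,1}, {0,0,1,4} };
    const double b[N] = {1, 2, 3, 4};
    double x[N] = {0, 0, 0, 0};
    int iters = cg(A, b, x, 1e-10);
    printf("converged in %d iterations, x[0] = %f\n", iters, x[0]);
    return 0;
}
\end{verbatim}
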
\section{Quantum Espresso}

QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). QUANTUM ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimisation. It is freely available to researchers around the world under the terms of the GNU General Public License. QUANTUM ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms, and applied in the last twenty years by some of the leading materials modelling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures, and a great effort being devoted to user friendliness. QUANTUM ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate by contributing their own codes or by implementing their own ideas into existing codes.

QUANTUM ESPRESSO is written mostly in Fortran 90, is parallelised using MPI and OpenMP, and is released under the GPL license.

\subsection{Code description}

During 2011, a GPU-enabled version of Quantum ESPRESSO was publicly released. The code is currently developed and maintained by Filippo Spiga at the High Performance Computing Service, University of Cambridge (United Kingdom), and Ivan Girotto at the International Centre for Theoretical Physics (Italy). The initial work was supported by the EC-funded PRACE project and an SFI grant (Science Foundation Ireland, grant 08/HEC/I1450). At the time of writing, the project is self-sustained thanks to the dedication of the people involved and to NVidia's support in providing hardware and expertise in GPU programming.

The current public version of QE-GPU is 14.10.0, the last version maintained as a plug-in working with all QE 5.x versions. QE-GPU uses the external phiGEMM library for CPU+GPU GEMM computation, the external MAGMA library to accelerate eigensolvers, and explicit CUDA kernels to accelerate compute-intensive routines. FFT capabilities on GPU are available only for serial computation, due to the hard challenges posed in managing accelerators in the parallel distributed 3D-FFT portion of the code, where communication is the dominant element that limits scalability beyond hundreds of MPI ranks.

A version for Intel Xeon Phi (MIC) accelerators is not currently available.

\subsection{Test cases description}

\subsubsection{PW-IRMOF\_M11}
Full SCF calculation of a Zn-based isoreticular metal–organic framework (130 atoms in total) over 1 k-point. Benchmarks run in 2012 demonstrated speedups due to GPUs (NVidia K20s), with respect to non-accelerated nodes, in the range 1.37--1.87, according to node count (maximum number of accelerators = 8). Runs with current hardware technology and an updated version of the code are expected to exhibit higher speedups (probably 2-3x) and to scale up to a couple of hundred nodes.

\subsubsection{PW-SiGe432}
This is an SCF calculation of a Silicon-Germanium crystal with 430 atoms. Being a fairly large system, parallel scalability up to several hundred, perhaps a thousand, nodes is expected, with accelerator speed-ups likely to be 2-3x.

\section{Synthetic benchmarks -- SHOC}
The Accelerator Benchmark Suite will also include a series of synthetic benchmarks. For this purpose, we chose the Scalable HeterOgeneous Computing (SHOC) benchmark suite, augmented with a series of benchmark examples developed internally. SHOC is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general-purpose computing. Its initial focus is on systems containing GPUs and multi-core processors, and on the OpenCL programming standard, but CUDA and OpenACC versions have been added. Moreover, a subset of the benchmarks is optimised for the Intel Xeon Phi coprocessor. SHOC can be used on clusters as well as on individual hosts.

The SHOC benchmark suite currently contains benchmark programs categorised by complexity.
Some measure low-level ‘feeds and speeds’ behaviour (Level 0), some measure the performance of a higher-level operation such as a Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2).

\subsection{Code description}

All benchmarks are MPI-enabled. Some report aggregate metrics over all MPI ranks, while others only perform work on a specific rank.

Offloading for accelerators is implemented through CUDA and OpenCL for GPGPUs and through OpenMP for MIC (Intel Xeon Phi). For selected benchmarks, OpenACC implementations are provided for GPGPUs. Multi-node parallelisation is achieved using MPI.

SHOC is written in C++ and is open-source and freely available: https://github.com/vetter/shoc .

\subsection{Test cases description}
The benchmarks contained in SHOC currently feature 4 different sizes for increasingly large systems. The size convention is as follows:

\begin{itemize}
\item 1 -- CPUs / debugging
\item 2 -- mobile/integrated GPUs
\item 3 -- discrete GPUs (e.g. GeForce or Radeon series)
\item 4 -- HPC-focused or large-memory GPUs (e.g. Tesla or FireStream series)
\end{itemize}

In order to go to even larger scale, we plan to add a fifth size for massive supercomputers.

\section{SPECFEM3D}
The software package SPECFEM3D simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). All SPECFEM3D\_GLOBE software is written in Fortran 90 with full portability in mind, and conforms strictly to the Fortran 95 standard. It uses no obsolete or obsolescent features of Fortran 77. The package uses parallel programming based upon the Message Passing Interface (MPI).

The SEM was originally developed in computational fluid dynamics and has been successfully adapted to address problems in seismic wave propagation. It is a continuous Galerkin technique, which can easily be made discontinuous; it is then close to a particular case of the discontinuous Galerkin technique, with optimised efficiency because of its tensorised basis functions. In particular, it can accurately handle very distorted mesh elements. It has very good accuracy and convergence properties. The spectral-element approach admits spectral rates of convergence and allows exploiting hp-convergence schemes. It is also very well suited to parallel implementation on very large supercomputers as well as on clusters of GPU-accelerated graphics cards. Tensor products inside each element can be optimised to reach very high efficiency, and mesh point and element numbering can be optimised to reduce processor cache misses and improve cache reuse. The SEM can also handle triangular (in 2D) or tetrahedral (in 3D) elements, as well as mixed meshes, although with increased cost and reduced accuracy in these elements, as in the discontinuous Galerkin method.

In many geological models in the context of seismic wave propagation studies (except, for instance, for fault dynamic rupture studies, in which very high frequencies of supershear rupture need to be modelled near the fault), a continuous formulation is sufficient because material property contrasts are not drastic, and thus conforming mesh-doubling bricks can efficiently handle mesh size variations. This is particularly true at the scale of the full Earth. Effects due to lateral variations in compressional-wave speed, shear-wave speed, density, a 3D crustal model, ellipticity, topography and bathymetry, the oceans, rotation, and self-gravitation are included.
The package can accommodate full 21-parameter anisotropy as well as lateral variations in attenuation. Adjoint capabilities and finite-frequency kernel simulations are also included.

\subsection{Test cases description}
The small test case runs with 16 MPI tasks; the large one runs with 7776 MPI tasks.

\part{Applications performance}
Presentation of the results

\part{Conclusion and future work}
The work presented here stands as a first step towards application benchmarking on accelerators. Most codes have been selected from the main Unified European Applications Benchmark Suite (UEABS). This paper describes each of them, as well as their implementations, their relevance to the European scientific community, and their test cases. We have presented results on leading-edge systems.
%that are meant to be updated when larger clusters become available.

The suite will be publicly available on the PRACE web site, where links to download the sources and test cases will be published along with compilation and run instructions.

Task 7.2B in PRACE-4IP started to design a benchmark suite for accelerators. This work has been done with the aim of integrating it into the main UEABS, so that both can be maintained and evolved together. As the PCP (PRACE-3IP) machines will soon be available, it will be very interesting to run the benchmark suite on them: first because these machines will be larger, but also because they will feature energy consumption probes.

% comparative overview of the two architectures
% opening towards merging the PRACE benchmark suites
% a word on energy; the PRACE PCP machines are also larger
% discussion of the different compute modes available on the new machines; administration difficulties?

diff --git a/doc/iwoph17/history.txt b/doc/iwoph17/history.txt deleted file mode 100644 index 3a5c7bad81f080f2aa901e6646de59b5909e833b..0000000000000000000000000000000000000000 --- a/doc/iwoph17/history.txt +++ /dev/null @@ -1,132 +0,0 @@ -Version history for the LLNCS LaTeX2e class - - date filename version action/reason/acknowledgements ----------------------------------------------------------------------------- - 29.5.96 letter.txt beta naming problems (subject index file) - thanks to Dr.
Martin Held, Salzburg, AT - - subjindx.ind renamed to subjidx.ind as required - by llncs.dem - - history.txt introducing this file - - 30.5.96 llncs.cls incompatibility with new article.cls of - 1995/12/20 v1.3q Standard LaTeX document class, - \if@openbib is no longer defined, - reported by Ralf Heckmann and Graham Gough - solution by David Carlisle - - 10.6.96 llncs.cls problems with fragile commands in \author field - reported by Michael Gschwind, TU Wien - - 25.7.96 llncs.cls revision a corrects: - wrong size of text area, floats not \small, - some LaTeX generated texts - reported by Michael Sperber, Uni Tuebingen - - 16.4.97 all files 2.1 leaving beta state, - raising version counter to 2.1 - - 8.6.97 llncs.cls 2.1a revision a corrects: - unbreakable citation lists, reported by - Sergio Antoy of Portland State University - -11.12.97 llncs.cls 2.2 "general" headings centered; two new elements - for the article header: \email and \homedir; - complete revision of special environments: - \newtheorem replaced with \spnewtheorem, - introduced the theopargself environment; - two column parts made with multicol package; - add ons to work with the hyperref package - -07.01.98 llncs.cls 2.2 changed \email to simply switch to \tt - -25.03.98 llncs.cls 2.3 new class option "oribibl" to suppress - changes to the thebibliograpy environment - and retain pure LaTeX codes - useful - for most BibTeX applications - -16.04.98 llncs.cls 2.3 if option "oribibl" is given, extend the - thebibliograpy hook with "\small", suggested - by Clemens Ballarin, University of Cambridge - -20.11.98 llncs.cls 2.4 pagestyle "titlepage" - useful for - compilation of whole LNCS volumes - -12.01.99 llncs.cls 2.5 counters of orthogonal numbered special - environments are reset each new contribution - -27.04.99 llncs.cls 2.6 new command \thisbottomragged for the - actual page; indention of the footnote - made variable with \fnindent (default 1em); - new command \url that copys its argument - - 2.03.00 llncs.cls 2.7 \figurename and \tablename made compatible - to babel, suggested by Jo Hereth, TU Darmstadt; - definition of \url moved \AtBeginDocument - (allows for url package of Donald Arseneau), - suggested by Manfred Hauswirth, TU of Vienna; - \large for part entries in the TOC - -16.04.00 llncs.cls 2.8 new option "orivec" to preserve the original - vector definition, read "arrow" accent - -17.01.01 llncs.cls 2.9 hardwired texts made polyglot, - available languages: english (default), - french, german - all are "babel-proof" - -20.06.01 splncs.bst public release of a BibTeX style for LNCS, - nobly provided by Jason Noble - -14.08.01 llncs.cls 2.10 TOC: authors flushleft, - entries without hyphenation; suggested - by Wiro Niessen, Imaging Center - Utrecht - -23.01.02 llncs.cls 2.11 fixed footnote number confusion with - \thanks, numbered institutes, and normal - footnote entries; error reported by - Saverio Cittadini, Istituto Tecnico - Industriale "Tito Sarrocchi" - Siena - -28.01.02 llncs.cls 2.12 fixed footnote fix ; error reported by - Chris Mesterharm, CS Dept. 
Rutgers - NJ - -28.01.02 llncs.cls 2.13 fixed the fix (programmer needs vacation) - -17.08.04 llncs.cls 2.14 TOC: authors indented, smart \and handling - for the TOC suggested by Thomas Gabel - University of Osnabrueck - -07.03.06 splncs.bst fix for BibTeX entries without year; patch - provided by Jerry James, Utah State University - -14.06.06 splncs_srt.bst a sorting BibTeX style for LNCS, feature - provided by Tobias Heindel, FMI Uni-Stuttgart - -16.10.06 llncs.dem 2.3 removed affiliations from \tocauthor demo - -11.12.07 llncs.doc note on online visibility of given e-mail address - -15.06.09 splncs03.bst new BibTeX style compliant with the current - requirements, provided by Maurizio "Titto" - Patrignani of Universita' Roma Tre - -30.03.10 llncs.cls 2.15 fixed broken hyperref interoperability; - patch provided by Sven Koehler, - Hamburg University of Technology - -15.04.10 llncs.cls 2.16 fixed hyperref warning for informatory TOC entries; - introduced \keywords command - finally; - blank removed from \keywordname, flaw reported - by Armin B. Wagner, IGW TU Vienna - -15.04.10 llncs.cls 2.17 fixed missing switch "openright" used by \backmatter; - flaw reported by Tobias Pape, University of Potsdam - -27.09.13 llncs.cls 2.18 fixed "ngerman" incompatibility; solution provided - by Bastian Pfleging, University of Stuttgart - -31.03.14 llncs.cls 2.19 removed spurious blanks from "babel" texts, - problem reported by Piotr Stera, Silesian University - of Technology, Gliwice, Poland - diff --git a/doc/iwoph17/llncs.cls b/doc/iwoph17/llncs.cls deleted file mode 100644 index 9768d03f5afc22c40bfa76a9ad104992831acd48..0000000000000000000000000000000000000000 --- a/doc/iwoph17/llncs.cls +++ /dev/null @@ -1,1208 +0,0 @@ -% LLNCS DOCUMENT CLASS -- version 2.20 (24-JUN-2015) -% Springer Verlag LaTeX2e support for Lecture Notes in Computer Science -% -%% -%% \CharacterTable -%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z -%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z -%% Digits \0\1\2\3\4\5\6\7\8\9 -%% Exclamation \! Double quote \" Hash (number) \# -%% Dollar \$ Percent \% Ampersand \& -%% Acute accent \' Left paren \( Right paren \) -%% Asterisk \* Plus \+ Comma \, -%% Minus \- Point \. Solidus \/ -%% Colon \: Semicolon \; Less than \< -%% Equals \= Greater than \> Question mark \? 
-%% Commercial at \@ Left bracket \[ Backslash \\ -%% Right bracket \] Circumflex \^ Underscore \_ -%% Grave accent \` Left brace \{ Vertical bar \| -%% Right brace \} Tilde \~} -%% -\NeedsTeXFormat{LaTeX2e}[1995/12/01] -\ProvidesClass{llncs}[2015/06/24 v2.20 -^^J LaTeX document class for Lecture Notes in Computer Science] -% Options -\let\if@envcntreset\iffalse -\DeclareOption{envcountreset}{\let\if@envcntreset\iftrue} -\DeclareOption{citeauthoryear}{\let\citeauthoryear=Y} -\DeclareOption{oribibl}{\let\oribibl=Y} -\let\if@custvec\iftrue -\DeclareOption{orivec}{\let\if@custvec\iffalse} -\let\if@envcntsame\iffalse -\DeclareOption{envcountsame}{\let\if@envcntsame\iftrue} -\let\if@envcntsect\iffalse -\DeclareOption{envcountsect}{\let\if@envcntsect\iftrue} -\let\if@runhead\iffalse -\DeclareOption{runningheads}{\let\if@runhead\iftrue} - -\let\if@openright\iftrue -\let\if@openbib\iffalse -\DeclareOption{openbib}{\let\if@openbib\iftrue} - -% languages -\let\switcht@@therlang\relax -\def\ds@deutsch{\def\switcht@@therlang{\switcht@deutsch}} -\def\ds@francais{\def\switcht@@therlang{\switcht@francais}} - -\DeclareOption*{\PassOptionsToClass{\CurrentOption}{article}} - -\ProcessOptions - -\LoadClass[twoside]{article} -\RequirePackage{multicol} % needed for the list of participants, index -\RequirePackage{aliascnt} - -\setlength{\textwidth}{12.2cm} -\setlength{\textheight}{19.3cm} -\renewcommand\@pnumwidth{2em} -\renewcommand\@tocrmarg{3.5em} -% -\def\@dottedtocline#1#2#3#4#5{% - \ifnum #1>\c@tocdepth \else - \vskip \z@ \@plus.2\p@ - {\leftskip #2\relax \rightskip \@tocrmarg \advance\rightskip by 0pt plus 2cm - \parfillskip -\rightskip \pretolerance=10000 - \parindent #2\relax\@afterindenttrue - \interlinepenalty\@M - \leavevmode - \@tempdima #3\relax - \advance\leftskip \@tempdima \null\nobreak\hskip -\leftskip - {#4}\nobreak - \leaders\hbox{$\m@th - \mkern \@dotsep mu\hbox{.}\mkern \@dotsep - mu$}\hfill - \nobreak - \hb@xt@\@pnumwidth{\hfil\normalfont \normalcolor #5}% - \par}% - \fi} -% -\def\switcht@albion{% -\def\abstractname{Abstract.}% -\def\ackname{Acknowledgement.}% -\def\andname{and}% -\def\lastandname{\unskip, and}% -\def\appendixname{Appendix}% -\def\chaptername{Chapter}% -\def\claimname{Claim}% -\def\conjecturename{Conjecture}% -\def\contentsname{Table of Contents}% -\def\corollaryname{Corollary}% -\def\definitionname{Definition}% -\def\examplename{Example}% -\def\exercisename{Exercise}% -\def\figurename{Fig.}% -\def\keywordname{{\bf Keywords:}}% -\def\indexname{Index}% -\def\lemmaname{Lemma}% -\def\contriblistname{List of Contributors}% -\def\listfigurename{List of Figures}% -\def\listtablename{List of Tables}% -\def\mailname{{\it Correspondence to\/}:}% -\def\noteaddname{Note added in proof}% -\def\notename{Note}% -\def\partname{Part}% -\def\problemname{Problem}% -\def\proofname{Proof}% -\def\propertyname{Property}% -\def\propositionname{Proposition}% -\def\questionname{Question}% -\def\remarkname{Remark}% -\def\seename{see}% -\def\solutionname{Solution}% -\def\subclassname{{\it Subject Classifications\/}:}% -\def\tablename{Table}% -\def\theoremname{Theorem}} -\switcht@albion -% Names of theorem like environments are already defined -% but must be translated if another language is chosen -% -% French section -\def\switcht@francais{%\typeout{On parle francais.}% - \def\abstractname{R\'esum\'e.}% - \def\ackname{Remerciements.}% - \def\andname{et}% - \def\lastandname{ et}% - \def\appendixname{Appendice}% - \def\chaptername{Chapitre}% - \def\claimname{Pr\'etention}% - 
\def\conjecturename{Hypoth\`ese}% - \def\contentsname{Table des mati\`eres}% - \def\corollaryname{Corollaire}% - \def\definitionname{D\'efinition}% - \def\examplename{Exemple}% - \def\exercisename{Exercice}% - \def\figurename{Fig.}% - \def\keywordname{{\bf Mots-cl\'e:}}% - \def\indexname{Index}% - \def\lemmaname{Lemme}% - \def\contriblistname{Liste des contributeurs}% - \def\listfigurename{Liste des figures}% - \def\listtablename{Liste des tables}% - \def\mailname{{\it Correspondence to\/}:}% - \def\noteaddname{Note ajout\'ee \`a l'\'epreuve}% - \def\notename{Remarque}% - \def\partname{Partie}% - \def\problemname{Probl\`eme}% - \def\proofname{Preuve}% - \def\propertyname{Caract\'eristique}% -%\def\propositionname{Proposition}% - \def\questionname{Question}% - \def\remarkname{Remarque}% - \def\seename{voir}% - \def\solutionname{Solution}% - \def\subclassname{{\it Subject Classifications\/}:}% - \def\tablename{Tableau}% - \def\theoremname{Th\'eor\`eme}% -} -% -% German section -\def\switcht@deutsch{%\typeout{Man spricht deutsch.}% - \def\abstractname{Zusammenfassung.}% - \def\ackname{Danksagung.}% - \def\andname{und}% - \def\lastandname{ und}% - \def\appendixname{Anhang}% - \def\chaptername{Kapitel}% - \def\claimname{Behauptung}% - \def\conjecturename{Hypothese}% - \def\contentsname{Inhaltsverzeichnis}% - \def\corollaryname{Korollar}% -%\def\definitionname{Definition}% - \def\examplename{Beispiel}% - \def\exercisename{\"Ubung}% - \def\figurename{Abb.}% - \def\keywordname{{\bf Schl\"usselw\"orter:}}% - \def\indexname{Index}% -%\def\lemmaname{Lemma}% - \def\contriblistname{Mitarbeiter}% - \def\listfigurename{Abbildungsverzeichnis}% - \def\listtablename{Tabellenverzeichnis}% - \def\mailname{{\it Correspondence to\/}:}% - \def\noteaddname{Nachtrag}% - \def\notename{Anmerkung}% - \def\partname{Teil}% -%\def\problemname{Problem}% - \def\proofname{Beweis}% - \def\propertyname{Eigenschaft}% -%\def\propositionname{Proposition}% - \def\questionname{Frage}% - \def\remarkname{Anmerkung}% - \def\seename{siehe}% - \def\solutionname{L\"osung}% - \def\subclassname{{\it Subject Classifications\/}:}% - \def\tablename{Tabelle}% -%\def\theoremname{Theorem}% -} - -% Ragged bottom for the actual page -\def\thisbottomragged{\def\@textbottom{\vskip\z@ plus.0001fil -\global\let\@textbottom\relax}} - -\renewcommand\small{% - \@setfontsize\small\@ixpt{11}% - \abovedisplayskip 8.5\p@ \@plus3\p@ \@minus4\p@ - \abovedisplayshortskip \z@ \@plus2\p@ - \belowdisplayshortskip 4\p@ \@plus2\p@ \@minus2\p@ - \def\@listi{\leftmargin\leftmargini - \parsep 0\p@ \@plus1\p@ \@minus\p@ - \topsep 8\p@ \@plus2\p@ \@minus4\p@ - \itemsep0\p@}% - \belowdisplayskip \abovedisplayskip -} - -\frenchspacing -\widowpenalty=10000 -\clubpenalty=10000 - -\setlength\oddsidemargin {63\p@} -\setlength\evensidemargin {63\p@} -\setlength\marginparwidth {90\p@} - -\setlength\headsep {16\p@} - -\setlength\footnotesep{7.7\p@} -\setlength\textfloatsep{8mm\@plus 2\p@ \@minus 4\p@} -\setlength\intextsep {8mm\@plus 2\p@ \@minus 2\p@} - -\setcounter{secnumdepth}{2} - -\newcounter {chapter} -\renewcommand\thechapter {\@arabic\c@chapter} - -\newif\if@mainmatter \@mainmattertrue -\newcommand\frontmatter{\cleardoublepage - \@mainmatterfalse\pagenumbering{Roman}} -\newcommand\mainmatter{\cleardoublepage - \@mainmattertrue\pagenumbering{arabic}} -\newcommand\backmatter{\if@openright\cleardoublepage\else\clearpage\fi - \@mainmatterfalse} - -\renewcommand\part{\cleardoublepage - \thispagestyle{empty}% - \if@twocolumn - \onecolumn - \@tempswatrue - \else - 
\@tempswafalse - \fi - \null\vfil - \secdef\@part\@spart} - -\def\@part[#1]#2{% - \ifnum \c@secnumdepth >-2\relax - \refstepcounter{part}% - \addcontentsline{toc}{part}{\thepart\hspace{1em}#1}% - \else - \addcontentsline{toc}{part}{#1}% - \fi - \markboth{}{}% - {\centering - \interlinepenalty \@M - \normalfont - \ifnum \c@secnumdepth >-2\relax - \huge\bfseries \partname~\thepart - \par - \vskip 20\p@ - \fi - \Huge \bfseries #2\par}% - \@endpart} -\def\@spart#1{% - {\centering - \interlinepenalty \@M - \normalfont - \Huge \bfseries #1\par}% - \@endpart} -\def\@endpart{\vfil\newpage - \if@twoside - \null - \thispagestyle{empty}% - \newpage - \fi - \if@tempswa - \twocolumn - \fi} - -\newcommand\chapter{\clearpage - \thispagestyle{empty}% - \global\@topnum\z@ - \@afterindentfalse - \secdef\@chapter\@schapter} -\def\@chapter[#1]#2{\ifnum \c@secnumdepth >\m@ne - \if@mainmatter - \refstepcounter{chapter}% - \typeout{\@chapapp\space\thechapter.}% - \addcontentsline{toc}{chapter}% - {\protect\numberline{\thechapter}#1}% - \else - \addcontentsline{toc}{chapter}{#1}% - \fi - \else - \addcontentsline{toc}{chapter}{#1}% - \fi - \chaptermark{#1}% - \addtocontents{lof}{\protect\addvspace{10\p@}}% - \addtocontents{lot}{\protect\addvspace{10\p@}}% - \if@twocolumn - \@topnewpage[\@makechapterhead{#2}]% - \else - \@makechapterhead{#2}% - \@afterheading - \fi} -\def\@makechapterhead#1{% -% \vspace*{50\p@}% - {\centering - \ifnum \c@secnumdepth >\m@ne - \if@mainmatter - \large\bfseries \@chapapp{} \thechapter - \par\nobreak - \vskip 20\p@ - \fi - \fi - \interlinepenalty\@M - \Large \bfseries #1\par\nobreak - \vskip 40\p@ - }} -\def\@schapter#1{\if@twocolumn - \@topnewpage[\@makeschapterhead{#1}]% - \else - \@makeschapterhead{#1}% - \@afterheading - \fi} -\def\@makeschapterhead#1{% -% \vspace*{50\p@}% - {\centering - \normalfont - \interlinepenalty\@M - \Large \bfseries #1\par\nobreak - \vskip 40\p@ - }} - -\renewcommand\section{\@startsection{section}{1}{\z@}% - {-18\p@ \@plus -4\p@ \@minus -4\p@}% - {12\p@ \@plus 4\p@ \@minus 4\p@}% - {\normalfont\large\bfseries\boldmath - \rightskip=\z@ \@plus 8em\pretolerance=10000 }} -\renewcommand\subsection{\@startsection{subsection}{2}{\z@}% - {-18\p@ \@plus -4\p@ \@minus -4\p@}% - {8\p@ \@plus 4\p@ \@minus 4\p@}% - {\normalfont\normalsize\bfseries\boldmath - \rightskip=\z@ \@plus 8em\pretolerance=10000 }} -\renewcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}% - {-18\p@ \@plus -4\p@ \@minus -4\p@}% - {-0.5em \@plus -0.22em \@minus -0.1em}% - {\normalfont\normalsize\bfseries\boldmath}} -\renewcommand\paragraph{\@startsection{paragraph}{4}{\z@}% - {-12\p@ \@plus -4\p@ \@minus -4\p@}% - {-0.5em \@plus -0.22em \@minus -0.1em}% - {\normalfont\normalsize\itshape}} -\renewcommand\subparagraph[1]{\typeout{LLNCS warning: You should not use - \string\subparagraph\space with this class}\vskip0.5cm -You should not use \verb|\subparagraph| with this class.\vskip0.5cm} - -\DeclareMathSymbol{\Gamma}{\mathalpha}{letters}{"00} -\DeclareMathSymbol{\Delta}{\mathalpha}{letters}{"01} -\DeclareMathSymbol{\Theta}{\mathalpha}{letters}{"02} -\DeclareMathSymbol{\Lambda}{\mathalpha}{letters}{"03} -\DeclareMathSymbol{\Xi}{\mathalpha}{letters}{"04} -\DeclareMathSymbol{\Pi}{\mathalpha}{letters}{"05} -\DeclareMathSymbol{\Sigma}{\mathalpha}{letters}{"06} -\DeclareMathSymbol{\Upsilon}{\mathalpha}{letters}{"07} -\DeclareMathSymbol{\Phi}{\mathalpha}{letters}{"08} -\DeclareMathSymbol{\Psi}{\mathalpha}{letters}{"09} -\DeclareMathSymbol{\Omega}{\mathalpha}{letters}{"0A} - 
-\let\footnotesize\small - -\if@custvec -\def\vec#1{\mathchoice{\mbox{\boldmath$\displaystyle#1$}} -{\mbox{\boldmath$\textstyle#1$}} -{\mbox{\boldmath$\scriptstyle#1$}} -{\mbox{\boldmath$\scriptscriptstyle#1$}}} -\fi - -\def\squareforqed{\hbox{\rlap{$\sqcap$}$\sqcup$}} -\def\qed{\ifmmode\squareforqed\else{\unskip\nobreak\hfil -\penalty50\hskip1em\null\nobreak\hfil\squareforqed -\parfillskip=0pt\finalhyphendemerits=0\endgraf}\fi} - -\def\getsto{\mathrel{\mathchoice {\vcenter{\offinterlineskip -\halign{\hfil -$\displaystyle##$\hfil\cr\gets\cr\to\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\textstyle##$\hfil\cr\gets -\cr\to\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptstyle##$\hfil\cr\gets -\cr\to\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptscriptstyle##$\hfil\cr -\gets\cr\to\cr}}}}} -\def\lid{\mathrel{\mathchoice {\vcenter{\offinterlineskip\halign{\hfil -$\displaystyle##$\hfil\cr<\cr\noalign{\vskip1.2pt}=\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\textstyle##$\hfil\cr<\cr -\noalign{\vskip1.2pt}=\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptstyle##$\hfil\cr<\cr -\noalign{\vskip1pt}=\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptscriptstyle##$\hfil\cr -<\cr -\noalign{\vskip0.9pt}=\cr}}}}} -\def\gid{\mathrel{\mathchoice {\vcenter{\offinterlineskip\halign{\hfil -$\displaystyle##$\hfil\cr>\cr\noalign{\vskip1.2pt}=\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\textstyle##$\hfil\cr>\cr -\noalign{\vskip1.2pt}=\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptstyle##$\hfil\cr>\cr -\noalign{\vskip1pt}=\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptscriptstyle##$\hfil\cr ->\cr -\noalign{\vskip0.9pt}=\cr}}}}} -\def\grole{\mathrel{\mathchoice {\vcenter{\offinterlineskip -\halign{\hfil -$\displaystyle##$\hfil\cr>\cr\noalign{\vskip-1pt}<\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\textstyle##$\hfil\cr ->\cr\noalign{\vskip-1pt}<\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptstyle##$\hfil\cr ->\cr\noalign{\vskip-0.8pt}<\cr}}} -{\vcenter{\offinterlineskip\halign{\hfil$\scriptscriptstyle##$\hfil\cr ->\cr\noalign{\vskip-0.3pt}<\cr}}}}} -\def\bbbr{{\rm I\!R}} %reelle Zahlen -\def\bbbm{{\rm I\!M}} -\def\bbbn{{\rm I\!N}} %natuerliche Zahlen -\def\bbbf{{\rm I\!F}} -\def\bbbh{{\rm I\!H}} -\def\bbbk{{\rm I\!K}} -\def\bbbp{{\rm I\!P}} -\def\bbbone{{\mathchoice {\rm 1\mskip-4mu l} {\rm 1\mskip-4mu l} -{\rm 1\mskip-4.5mu l} {\rm 1\mskip-5mu l}}} -\def\bbbc{{\mathchoice {\setbox0=\hbox{$\displaystyle\rm C$}\hbox{\hbox -to0pt{\kern0.4\wd0\vrule height0.9\ht0\hss}\box0}} -{\setbox0=\hbox{$\textstyle\rm C$}\hbox{\hbox -to0pt{\kern0.4\wd0\vrule height0.9\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptstyle\rm C$}\hbox{\hbox -to0pt{\kern0.4\wd0\vrule height0.9\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptscriptstyle\rm C$}\hbox{\hbox -to0pt{\kern0.4\wd0\vrule height0.9\ht0\hss}\box0}}}} -\def\bbbq{{\mathchoice {\setbox0=\hbox{$\displaystyle\rm -Q$}\hbox{\raise -0.15\ht0\hbox to0pt{\kern0.4\wd0\vrule height0.8\ht0\hss}\box0}} -{\setbox0=\hbox{$\textstyle\rm Q$}\hbox{\raise -0.15\ht0\hbox to0pt{\kern0.4\wd0\vrule height0.8\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptstyle\rm Q$}\hbox{\raise -0.15\ht0\hbox to0pt{\kern0.4\wd0\vrule height0.7\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptscriptstyle\rm Q$}\hbox{\raise -0.15\ht0\hbox to0pt{\kern0.4\wd0\vrule height0.7\ht0\hss}\box0}}}} -\def\bbbt{{\mathchoice {\setbox0=\hbox{$\displaystyle\rm -T$}\hbox{\hbox to0pt{\kern0.3\wd0\vrule height0.9\ht0\hss}\box0}} -{\setbox0=\hbox{$\textstyle\rm T$}\hbox{\hbox 
-to0pt{\kern0.3\wd0\vrule height0.9\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptstyle\rm T$}\hbox{\hbox -to0pt{\kern0.3\wd0\vrule height0.9\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptscriptstyle\rm T$}\hbox{\hbox -to0pt{\kern0.3\wd0\vrule height0.9\ht0\hss}\box0}}}} -\def\bbbs{{\mathchoice -{\setbox0=\hbox{$\displaystyle \rm S$}\hbox{\raise0.5\ht0\hbox -to0pt{\kern0.35\wd0\vrule height0.45\ht0\hss}\hbox -to0pt{\kern0.55\wd0\vrule height0.5\ht0\hss}\box0}} -{\setbox0=\hbox{$\textstyle \rm S$}\hbox{\raise0.5\ht0\hbox -to0pt{\kern0.35\wd0\vrule height0.45\ht0\hss}\hbox -to0pt{\kern0.55\wd0\vrule height0.5\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptstyle \rm S$}\hbox{\raise0.5\ht0\hbox -to0pt{\kern0.35\wd0\vrule height0.45\ht0\hss}\raise0.05\ht0\hbox -to0pt{\kern0.5\wd0\vrule height0.45\ht0\hss}\box0}} -{\setbox0=\hbox{$\scriptscriptstyle\rm S$}\hbox{\raise0.5\ht0\hbox -to0pt{\kern0.4\wd0\vrule height0.45\ht0\hss}\raise0.05\ht0\hbox -to0pt{\kern0.55\wd0\vrule height0.45\ht0\hss}\box0}}}} -\def\bbbz{{\mathchoice {\hbox{$\mathsf\textstyle Z\kern-0.4em Z$}} -{\hbox{$\mathsf\textstyle Z\kern-0.4em Z$}} -{\hbox{$\mathsf\scriptstyle Z\kern-0.3em Z$}} -{\hbox{$\mathsf\scriptscriptstyle Z\kern-0.2em Z$}}}} - -\let\ts\, - -\setlength\leftmargini {17\p@} -\setlength\leftmargin {\leftmargini} -\setlength\leftmarginii {\leftmargini} -\setlength\leftmarginiii {\leftmargini} -\setlength\leftmarginiv {\leftmargini} -\setlength \labelsep {.5em} -\setlength \labelwidth{\leftmargini} -\addtolength\labelwidth{-\labelsep} - -\def\@listI{\leftmargin\leftmargini - \parsep 0\p@ \@plus1\p@ \@minus\p@ - \topsep 8\p@ \@plus2\p@ \@minus4\p@ - \itemsep0\p@} -\let\@listi\@listI -\@listi -\def\@listii {\leftmargin\leftmarginii - \labelwidth\leftmarginii - \advance\labelwidth-\labelsep - \topsep 0\p@ \@plus2\p@ \@minus\p@} -\def\@listiii{\leftmargin\leftmarginiii - \labelwidth\leftmarginiii - \advance\labelwidth-\labelsep - \topsep 0\p@ \@plus\p@\@minus\p@ - \parsep \z@ - \partopsep \p@ \@plus\z@ \@minus\p@} - -\renewcommand\labelitemi{\normalfont\bfseries --} -\renewcommand\labelitemii{$\m@th\bullet$} - -\setlength\arraycolsep{1.4\p@} -\setlength\tabcolsep{1.4\p@} - -\def\tableofcontents{\chapter*{\contentsname\@mkboth{{\contentsname}}% - {{\contentsname}}} - \def\authcount##1{\setcounter{auco}{##1}\setcounter{@auth}{1}} - \def\lastand{\ifnum\value{auco}=2\relax - \unskip{} \andname\ - \else - \unskip \lastandname\ - \fi}% - \def\and{\stepcounter{@auth}\relax - \ifnum\value{@auth}=\value{auco}% - \lastand - \else - \unskip, - \fi}% - \@starttoc{toc}\if@restonecol\twocolumn\fi} - -\def\l@part#1#2{\addpenalty{\@secpenalty}% - \addvspace{2em plus\p@}% % space above part line - \begingroup - \parindent \z@ - \rightskip \z@ plus 5em - \hrule\vskip5pt - \large % same size as for a contribution heading - \bfseries\boldmath % set line in boldface - \leavevmode % TeX command to enter horizontal mode. 
- #1\par - \vskip5pt - \hrule - \vskip1pt - \nobreak % Never break after part entry - \endgroup} - -\def\@dotsep{2} - -\let\phantomsection=\relax - -\def\hyperhrefextend{\ifx\hyper@anchor\@undefined\else -{}\fi} - -\def\addnumcontentsmark#1#2#3{% -\addtocontents{#1}{\protect\contentsline{#2}{\protect\numberline - {\thechapter}#3}{\thepage}\hyperhrefextend}}% -\def\addcontentsmark#1#2#3{% -\addtocontents{#1}{\protect\contentsline{#2}{#3}{\thepage}\hyperhrefextend}}% -\def\addcontentsmarkwop#1#2#3{% -\addtocontents{#1}{\protect\contentsline{#2}{#3}{0}\hyperhrefextend}}% - -\def\@adcmk[#1]{\ifcase #1 \or -\def\@gtempa{\addnumcontentsmark}% - \or \def\@gtempa{\addcontentsmark}% - \or \def\@gtempa{\addcontentsmarkwop}% - \fi\@gtempa{toc}{chapter}% -} -\def\addtocmark{% -\phantomsection -\@ifnextchar[{\@adcmk}{\@adcmk[3]}% -} - -\def\l@chapter#1#2{\addpenalty{-\@highpenalty} - \vskip 1.0em plus 1pt \@tempdima 1.5em \begingroup - \parindent \z@ \rightskip \@tocrmarg - \advance\rightskip by 0pt plus 2cm - \parfillskip -\rightskip \pretolerance=10000 - \leavevmode \advance\leftskip\@tempdima \hskip -\leftskip - {\large\bfseries\boldmath#1}\ifx0#2\hfil\null - \else - \nobreak - \leaders\hbox{$\m@th \mkern \@dotsep mu.\mkern - \@dotsep mu$}\hfill - \nobreak\hbox to\@pnumwidth{\hss #2}% - \fi\par - \penalty\@highpenalty \endgroup} - -\def\l@title#1#2{\addpenalty{-\@highpenalty} - \addvspace{8pt plus 1pt} - \@tempdima \z@ - \begingroup - \parindent \z@ \rightskip \@tocrmarg - \advance\rightskip by 0pt plus 2cm - \parfillskip -\rightskip \pretolerance=10000 - \leavevmode \advance\leftskip\@tempdima \hskip -\leftskip - #1\nobreak - \leaders\hbox{$\m@th \mkern \@dotsep mu.\mkern - \@dotsep mu$}\hfill - \nobreak\hbox to\@pnumwidth{\hss #2}\par - \penalty\@highpenalty \endgroup} - -\def\l@author#1#2{\addpenalty{\@highpenalty} - \@tempdima=15\p@ %\z@ - \begingroup - \parindent \z@ \rightskip \@tocrmarg - \advance\rightskip by 0pt plus 2cm - \pretolerance=10000 - \leavevmode \advance\leftskip\@tempdima %\hskip -\leftskip - \textit{#1}\par - \penalty\@highpenalty \endgroup} - -\setcounter{tocdepth}{0} -\newdimen\tocchpnum -\newdimen\tocsecnum -\newdimen\tocsectotal -\newdimen\tocsubsecnum -\newdimen\tocsubsectotal -\newdimen\tocsubsubsecnum -\newdimen\tocsubsubsectotal -\newdimen\tocparanum -\newdimen\tocparatotal -\newdimen\tocsubparanum -\tocchpnum=\z@ % no chapter numbers -\tocsecnum=15\p@ % section 88. 
plus 2.222pt -\tocsubsecnum=23\p@ % subsection 88.8 plus 2.222pt -\tocsubsubsecnum=27\p@ % subsubsection 88.8.8 plus 1.444pt -\tocparanum=35\p@ % paragraph 88.8.8.8 plus 1.666pt -\tocsubparanum=43\p@ % subparagraph 88.8.8.8.8 plus 1.888pt -\def\calctocindent{% -\tocsectotal=\tocchpnum -\advance\tocsectotal by\tocsecnum -\tocsubsectotal=\tocsectotal -\advance\tocsubsectotal by\tocsubsecnum -\tocsubsubsectotal=\tocsubsectotal -\advance\tocsubsubsectotal by\tocsubsubsecnum -\tocparatotal=\tocsubsubsectotal -\advance\tocparatotal by\tocparanum} -\calctocindent - -\def\l@section{\@dottedtocline{1}{\tocchpnum}{\tocsecnum}} -\def\l@subsection{\@dottedtocline{2}{\tocsectotal}{\tocsubsecnum}} -\def\l@subsubsection{\@dottedtocline{3}{\tocsubsectotal}{\tocsubsubsecnum}} -\def\l@paragraph{\@dottedtocline{4}{\tocsubsubsectotal}{\tocparanum}} -\def\l@subparagraph{\@dottedtocline{5}{\tocparatotal}{\tocsubparanum}} - -\def\listoffigures{\@restonecolfalse\if@twocolumn\@restonecoltrue\onecolumn - \fi\section*{\listfigurename\@mkboth{{\listfigurename}}{{\listfigurename}}} - \@starttoc{lof}\if@restonecol\twocolumn\fi} -\def\l@figure{\@dottedtocline{1}{0em}{1.5em}} - -\def\listoftables{\@restonecolfalse\if@twocolumn\@restonecoltrue\onecolumn - \fi\section*{\listtablename\@mkboth{{\listtablename}}{{\listtablename}}} - \@starttoc{lot}\if@restonecol\twocolumn\fi} -\let\l@table\l@figure - -\renewcommand\listoffigures{% - \section*{\listfigurename - \@mkboth{\listfigurename}{\listfigurename}}% - \@starttoc{lof}% - } - -\renewcommand\listoftables{% - \section*{\listtablename - \@mkboth{\listtablename}{\listtablename}}% - \@starttoc{lot}% - } - -\ifx\oribibl\undefined -\ifx\citeauthoryear\undefined -\renewenvironment{thebibliography}[1] - {\section*{\refname} - \def\@biblabel##1{##1.} - \small - \list{\@biblabel{\@arabic\c@enumiv}}% - {\settowidth\labelwidth{\@biblabel{#1}}% - \leftmargin\labelwidth - \advance\leftmargin\labelsep - \if@openbib - \advance\leftmargin\bibindent - \itemindent -\bibindent - \listparindent \itemindent - \parsep \z@ - \fi - \usecounter{enumiv}% - \let\p@enumiv\@empty - \renewcommand\theenumiv{\@arabic\c@enumiv}}% - \if@openbib - \renewcommand\newblock{\par}% - \else - \renewcommand\newblock{\hskip .11em \@plus.33em \@minus.07em}% - \fi - \sloppy\clubpenalty4000\widowpenalty4000% - \sfcode`\.=\@m} - {\def\@noitemerr - {\@latex@warning{Empty `thebibliography' environment}}% - \endlist} -\def\@lbibitem[#1]#2{\item[{[#1]}\hfill]\if@filesw - {\let\protect\noexpand\immediate - \write\@auxout{\string\bibcite{#2}{#1}}}\fi\ignorespaces} -\newcount\@tempcntc -\def\@citex[#1]#2{\if@filesw\immediate\write\@auxout{\string\citation{#2}}\fi - \@tempcnta\z@\@tempcntb\m@ne\def\@citea{}\@cite{\@for\@citeb:=#2\do - {\@ifundefined - {b@\@citeb}{\@citeo\@tempcntb\m@ne\@citea\def\@citea{,}{\bfseries - ?}\@warning - {Citation `\@citeb' on page \thepage \space undefined}}% - {\setbox\z@\hbox{\global\@tempcntc0\csname b@\@citeb\endcsname\relax}% - \ifnum\@tempcntc=\z@ \@citeo\@tempcntb\m@ne - \@citea\def\@citea{,}\hbox{\csname b@\@citeb\endcsname}% - \else - \advance\@tempcntb\@ne - \ifnum\@tempcntb=\@tempcntc - \else\advance\@tempcntb\m@ne\@citeo - \@tempcnta\@tempcntc\@tempcntb\@tempcntc\fi\fi}}\@citeo}{#1}} -\def\@citeo{\ifnum\@tempcnta>\@tempcntb\else - \@citea\def\@citea{,\,\hskip\z@skip}% - \ifnum\@tempcnta=\@tempcntb\the\@tempcnta\else - {\advance\@tempcnta\@ne\ifnum\@tempcnta=\@tempcntb \else - \def\@citea{--}\fi - \advance\@tempcnta\m@ne\the\@tempcnta\@citea\the\@tempcntb}\fi\fi} -\else 
-\renewenvironment{thebibliography}[1] - {\section*{\refname} - \small - \list{}% - {\settowidth\labelwidth{}% - \leftmargin\parindent - \itemindent=-\parindent - \labelsep=\z@ - \if@openbib - \advance\leftmargin\bibindent - \itemindent -\bibindent - \listparindent \itemindent - \parsep \z@ - \fi - \usecounter{enumiv}% - \let\p@enumiv\@empty - \renewcommand\theenumiv{}}% - \if@openbib - \renewcommand\newblock{\par}% - \else - \renewcommand\newblock{\hskip .11em \@plus.33em \@minus.07em}% - \fi - \sloppy\clubpenalty4000\widowpenalty4000% - \sfcode`\.=\@m} - {\def\@noitemerr - {\@latex@warning{Empty `thebibliography' environment}}% - \endlist} - \def\@cite#1{#1}% - \def\@lbibitem[#1]#2{\item[]\if@filesw - {\def\protect##1{\string ##1\space}\immediate - \write\@auxout{\string\bibcite{#2}{#1}}}\fi\ignorespaces} - \fi -\else -\@cons\@openbib@code{\noexpand\small} -\fi - -\def\idxquad{\hskip 10\p@}% space that divides entry from number - -\def\@idxitem{\par\hangindent 10\p@} - -\def\subitem{\par\setbox0=\hbox{--\enspace}% second order - \noindent\hangindent\wd0\box0}% index entry - -\def\subsubitem{\par\setbox0=\hbox{--\,--\enspace}% third - \noindent\hangindent\wd0\box0}% order index entry - -\def\indexspace{\par \vskip 10\p@ plus5\p@ minus3\p@\relax} - -\renewenvironment{theindex} - {\@mkboth{\indexname}{\indexname}% - \thispagestyle{empty}\parindent\z@ - \parskip\z@ \@plus .3\p@\relax - \let\item\par - \def\,{\relax\ifmmode\mskip\thinmuskip - \else\hskip0.2em\ignorespaces\fi}% - \normalfont\small - \begin{multicols}{2}[\@makeschapterhead{\indexname}]% - } - {\end{multicols}} - -\renewcommand\footnoterule{% - \kern-3\p@ - \hrule\@width 2truecm - \kern2.6\p@} - \newdimen\fnindent - \fnindent1em -\long\def\@makefntext#1{% - \parindent \fnindent% - \leftskip \fnindent% - \noindent - \llap{\hb@xt@1em{\hss\@makefnmark\ }}\ignorespaces#1} - -\long\def\@makecaption#1#2{% - \small - \vskip\abovecaptionskip - \sbox\@tempboxa{{\bfseries #1.} #2}% - \ifdim \wd\@tempboxa >\hsize - {\bfseries #1.} #2\par - \else - \global \@minipagefalse - \hb@xt@\hsize{\hfil\box\@tempboxa\hfil}% - \fi - \vskip\belowcaptionskip} - -\def\fps@figure{htbp} -\def\fnum@figure{\figurename\thinspace\thefigure} -\def \@floatboxreset {% - \reset@font - \small - \@setnobreak - \@setminipage -} -\def\fps@table{htbp} -\def\fnum@table{\tablename~\thetable} -\renewenvironment{table} - {\setlength\abovecaptionskip{0\p@}% - \setlength\belowcaptionskip{10\p@}% - \@float{table}} - {\end@float} -\renewenvironment{table*} - {\setlength\abovecaptionskip{0\p@}% - \setlength\belowcaptionskip{10\p@}% - \@dblfloat{table}} - {\end@dblfloat} - -\long\def\@caption#1[#2]#3{\par\addcontentsline{\csname - ext@#1\endcsname}{#1}{\protect\numberline{\csname - the#1\endcsname}{\ignorespaces #2}}\begingroup - \@parboxrestore - \@makecaption{\csname fnum@#1\endcsname}{\ignorespaces #3}\par - \endgroup} - -% LaTeX does not provide a command to enter the authors institute -% addresses. The \institute command is defined here. 
- -\newcounter{@inst} -\newcounter{@auth} -\newcounter{auco} -\newdimen\instindent -\newbox\authrun -\newtoks\authorrunning -\newtoks\tocauthor -\newbox\titrun -\newtoks\titlerunning -\newtoks\toctitle - -\def\clearheadinfo{\gdef\@author{No Author Given}% - \gdef\@title{No Title Given}% - \gdef\@subtitle{}% - \gdef\@institute{No Institute Given}% - \gdef\@thanks{}% - \global\titlerunning={}\global\authorrunning={}% - \global\toctitle={}\global\tocauthor={}} - -\def\institute#1{\gdef\@institute{#1}} - -\def\institutename{\par - \begingroup - \parskip=\z@ - \parindent=\z@ - \setcounter{@inst}{1}% - \def\and{\par\stepcounter{@inst}% - \noindent$^{\the@inst}$\enspace\ignorespaces}% - \setbox0=\vbox{\def\thanks##1{}\@institute}% - \ifnum\c@@inst=1\relax - \gdef\fnnstart{0}% - \else - \xdef\fnnstart{\c@@inst}% - \setcounter{@inst}{1}% - \noindent$^{\the@inst}$\enspace - \fi - \ignorespaces - \@institute\par - \endgroup} - -\def\@fnsymbol#1{\ensuremath{\ifcase#1\or\star\or{\star\star}\or - {\star\star\star}\or \dagger\or \ddagger\or - \mathchar "278\or \mathchar "27B\or \|\or **\or \dagger\dagger - \or \ddagger\ddagger \else\@ctrerr\fi}} - -\def\inst#1{\unskip$^{#1}$} -\def\fnmsep{\unskip$^,$} -\def\email#1{{\tt#1}} -\AtBeginDocument{\@ifundefined{url}{\def\url#1{#1}}{}% -\@ifpackageloaded{babel}{% -\@ifundefined{extrasenglish}{}{\addto\extrasenglish{\switcht@albion}}% -\@ifundefined{extrasfrenchb}{}{\addto\extrasfrenchb{\switcht@francais}}% -\@ifundefined{extrasgerman}{}{\addto\extrasgerman{\switcht@deutsch}}% -\@ifundefined{extrasngerman}{}{\addto\extrasngerman{\switcht@deutsch}}% -}{\switcht@@therlang}% -\providecommand{\keywords}[1]{\par\addvspace\baselineskip -\noindent\keywordname\enspace\ignorespaces#1}% -} -\def\homedir{\~{ }} - -\def\subtitle#1{\gdef\@subtitle{#1}} -\clearheadinfo -% -%%% to avoid hyperref warnings -\providecommand*{\toclevel@author}{999} -%%% to make title-entry parent of section-entries -\providecommand*{\toclevel@title}{0} -% -\renewcommand\maketitle{\newpage -\phantomsection - \refstepcounter{chapter}% - \stepcounter{section}% - \setcounter{section}{0}% - \setcounter{subsection}{0}% - \setcounter{figure}{0} - \setcounter{table}{0} - \setcounter{equation}{0} - \setcounter{footnote}{0}% - \begingroup - \parindent=\z@ - \renewcommand\thefootnote{\@fnsymbol\c@footnote}% - \if@twocolumn - \ifnum \col@number=\@ne - \@maketitle - \else - \twocolumn[\@maketitle]% - \fi - \else - \newpage - \global\@topnum\z@ % Prevents figures from going at top of page. - \@maketitle - \fi - \thispagestyle{empty}\@thanks -% - \def\\{\unskip\ \ignorespaces}\def\inst##1{\unskip{}}% - \def\thanks##1{\unskip{}}\def\fnmsep{\unskip}% - \instindent=\hsize - \advance\instindent by-\headlineindent - \if!\the\toctitle!\addcontentsline{toc}{title}{\@title}\else - \addcontentsline{toc}{title}{\the\toctitle}\fi - \if@runhead - \if!\the\titlerunning!\else - \edef\@title{\the\titlerunning}% - \fi - \global\setbox\titrun=\hbox{\small\rm\unboldmath\ignorespaces\@title}% - \ifdim\wd\titrun>\instindent - \typeout{Title too long for running head. 
Please supply}% - \typeout{a shorter form with \string\titlerunning\space prior to - \string\maketitle}% - \global\setbox\titrun=\hbox{\small\rm - Title Suppressed Due to Excessive Length}% - \fi - \xdef\@title{\copy\titrun}% - \fi -% - \if!\the\tocauthor!\relax - {\def\and{\noexpand\protect\noexpand\and}% - \protected@xdef\toc@uthor{\@author}}% - \else - \def\\{\noexpand\protect\noexpand\newline}% - \protected@xdef\scratch{\the\tocauthor}% - \protected@xdef\toc@uthor{\scratch}% - \fi - \addtocontents{toc}{\noexpand\protect\noexpand\authcount{\the\c@auco}}% - \addcontentsline{toc}{author}{\toc@uthor}% - \if@runhead - \if!\the\authorrunning! - \value{@inst}=\value{@auth}% - \setcounter{@auth}{1}% - \else - \edef\@author{\the\authorrunning}% - \fi - \global\setbox\authrun=\hbox{\small\unboldmath\@author\unskip}% - \ifdim\wd\authrun>\instindent - \typeout{Names of authors too long for running head. Please supply}% - \typeout{a shorter form with \string\authorrunning\space prior to - \string\maketitle}% - \global\setbox\authrun=\hbox{\small\rm - Authors Suppressed Due to Excessive Length}% - \fi - \xdef\@author{\copy\authrun}% - \markboth{\@author}{\@title}% - \fi - \endgroup - \setcounter{footnote}{\fnnstart}% - \clearheadinfo} -% -\def\@maketitle{\newpage - \markboth{}{}% - \def\lastand{\ifnum\value{@inst}=2\relax - \unskip{} \andname\ - \else - \unskip \lastandname\ - \fi}% - \def\and{\stepcounter{@auth}\relax - \ifnum\value{@auth}=\value{@inst}% - \lastand - \else - \unskip, - \fi}% - \begin{center}% - \let\newline\\ - {\Large \bfseries\boldmath - \pretolerance=10000 - \@title \par}\vskip .8cm -\if!\@subtitle!\else {\large \bfseries\boldmath - \vskip -.65cm - \pretolerance=10000 - \@subtitle \par}\vskip .8cm\fi - \setbox0=\vbox{\setcounter{@auth}{1}\def\and{\stepcounter{@auth}}% - \def\thanks##1{}\@author}% - \global\value{@inst}=\value{@auth}% - \global\value{auco}=\value{@auth}% - \setcounter{@auth}{1}% -{\lineskip .5em -\noindent\ignorespaces -\@author\vskip.35cm} - {\small\institutename} - \end{center}% - } - -% definition of the "\spnewtheorem" command. -% -% Usage: -% -% \spnewtheorem{env_nam}{caption}[within]{cap_font}{body_font} -% or \spnewtheorem{env_nam}[numbered_like]{caption}{cap_font}{body_font} -% or \spnewtheorem*{env_nam}{caption}{cap_font}{body_font} -% -% New is "cap_font" and "body_font". It stands for -% fontdefinition of the caption and the text itself. -% -% "\spnewtheorem*" gives a theorem without number. -% -% A defined spnewthoerem environment is used as described -% by Lamport. 
-% -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\def\@thmcountersep{} -\def\@thmcounterend{.} - -\def\spnewtheorem{\@ifstar{\@sthm}{\@Sthm}} - -% definition of \spnewtheorem with number - -\def\@spnthm#1#2{% - \@ifnextchar[{\@spxnthm{#1}{#2}}{\@spynthm{#1}{#2}}} -\def\@Sthm#1{\@ifnextchar[{\@spothm{#1}}{\@spnthm{#1}}} - -\def\@spxnthm#1#2[#3]#4#5{\expandafter\@ifdefinable\csname #1\endcsname - {\@definecounter{#1}\@addtoreset{#1}{#3}% - \expandafter\xdef\csname the#1\endcsname{\expandafter\noexpand - \csname the#3\endcsname \noexpand\@thmcountersep \@thmcounter{#1}}% - \expandafter\xdef\csname #1name\endcsname{#2}% - \global\@namedef{#1}{\@spthm{#1}{\csname #1name\endcsname}{#4}{#5}}% - \global\@namedef{end#1}{\@endtheorem}}} - -\def\@spynthm#1#2#3#4{\expandafter\@ifdefinable\csname #1\endcsname - {\@definecounter{#1}% - \expandafter\xdef\csname the#1\endcsname{\@thmcounter{#1}}% - \expandafter\xdef\csname #1name\endcsname{#2}% - \global\@namedef{#1}{\@spthm{#1}{\csname #1name\endcsname}{#3}{#4}}% - \global\@namedef{end#1}{\@endtheorem}}} - -\def\@spothm#1[#2]#3#4#5{% - \@ifundefined{c@#2}{\@latexerr{No theorem environment `#2' defined}\@eha}% - {\expandafter\@ifdefinable\csname #1\endcsname - {\newaliascnt{#1}{#2}% - \expandafter\xdef\csname #1name\endcsname{#3}% - \global\@namedef{#1}{\@spthm{#1}{\csname #1name\endcsname}{#4}{#5}}% - \global\@namedef{end#1}{\@endtheorem}}}} - -\def\@spthm#1#2#3#4{\topsep 7\p@ \@plus2\p@ \@minus4\p@ -\refstepcounter{#1}\phantomsection -\@ifnextchar[{\@spythm{#1}{#2}{#3}{#4}}{\@spxthm{#1}{#2}{#3}{#4}}} - -\def\@spxthm#1#2#3#4{\@spbegintheorem{#2}{\csname the#1\endcsname}{#3}{#4}% - \ignorespaces} - -\def\@spythm#1#2#3#4[#5]{\@spopargbegintheorem{#2}{\csname - the#1\endcsname}{#5}{#3}{#4}\ignorespaces} - -\def\@spbegintheorem#1#2#3#4{\trivlist - \item[\hskip\labelsep{#3#1\ #2\@thmcounterend}]#4} - -\def\@spopargbegintheorem#1#2#3#4#5{\trivlist - \item[\hskip\labelsep{#4#1\ #2}]{#4(#3)\@thmcounterend\ }#5} - -% definition of \spnewtheorem* without number - -\def\@sthm#1#2{\@Ynthm{#1}{#2}} - -\def\@Ynthm#1#2#3#4{\expandafter\@ifdefinable\csname #1\endcsname - {\global\@namedef{#1}{\@Thm{\csname #1name\endcsname}{#3}{#4}}% - \expandafter\xdef\csname #1name\endcsname{#2}% - \global\@namedef{end#1}{\@endtheorem}}} - -\def\@Thm#1#2#3{\topsep 7\p@ \@plus2\p@ \@minus4\p@ -\@ifnextchar[{\@Ythm{#1}{#2}{#3}}{\@Xthm{#1}{#2}{#3}}} - -\def\@Xthm#1#2#3{\@Begintheorem{#1}{#2}{#3}\ignorespaces} - -\def\@Ythm#1#2#3[#4]{\@Opargbegintheorem{#1} - {#4}{#2}{#3}\ignorespaces} - -\def\@Begintheorem#1#2#3{#3\trivlist - \item[\hskip\labelsep{#2#1\@thmcounterend}]} - -\def\@Opargbegintheorem#1#2#3#4{#4\trivlist - \item[\hskip\labelsep{#3#1}]{#3(#2)\@thmcounterend\ }} - -\if@envcntsect - \def\@thmcountersep{.} - \spnewtheorem{theorem}{Theorem}[section]{\bfseries}{\itshape} -\else - \spnewtheorem{theorem}{Theorem}{\bfseries}{\itshape} - \if@envcntreset - \@addtoreset{theorem}{section} - \else - \@addtoreset{theorem}{chapter} - \fi -\fi - -%definition of divers theorem environments -\spnewtheorem*{claim}{Claim}{\itshape}{\rmfamily} -\spnewtheorem*{proof}{Proof}{\itshape}{\rmfamily} -\if@envcntsame % alle Umgebungen wie Theorem. 
- \def\spn@wtheorem#1#2#3#4{\@spothm{#1}[theorem]{#2}{#3}{#4}} -\else % alle Umgebungen mit eigenem Zaehler - \if@envcntsect % mit section numeriert - \def\spn@wtheorem#1#2#3#4{\@spxnthm{#1}{#2}[section]{#3}{#4}} - \else % nicht mit section numeriert - \if@envcntreset - \def\spn@wtheorem#1#2#3#4{\@spynthm{#1}{#2}{#3}{#4} - \@addtoreset{#1}{section}} - \else - \def\spn@wtheorem#1#2#3#4{\@spynthm{#1}{#2}{#3}{#4} - \@addtoreset{#1}{chapter}}% - \fi - \fi -\fi -\spn@wtheorem{case}{Case}{\itshape}{\rmfamily} -\spn@wtheorem{conjecture}{Conjecture}{\itshape}{\rmfamily} -\spn@wtheorem{corollary}{Corollary}{\bfseries}{\itshape} -\spn@wtheorem{definition}{Definition}{\bfseries}{\itshape} -\spn@wtheorem{example}{Example}{\itshape}{\rmfamily} -\spn@wtheorem{exercise}{Exercise}{\itshape}{\rmfamily} -\spn@wtheorem{lemma}{Lemma}{\bfseries}{\itshape} -\spn@wtheorem{note}{Note}{\itshape}{\rmfamily} -\spn@wtheorem{problem}{Problem}{\itshape}{\rmfamily} -\spn@wtheorem{property}{Property}{\itshape}{\rmfamily} -\spn@wtheorem{proposition}{Proposition}{\bfseries}{\itshape} -\spn@wtheorem{question}{Question}{\itshape}{\rmfamily} -\spn@wtheorem{solution}{Solution}{\itshape}{\rmfamily} -\spn@wtheorem{remark}{Remark}{\itshape}{\rmfamily} - -\def\@takefromreset#1#2{% - \def\@tempa{#1}% - \let\@tempd\@elt - \def\@elt##1{% - \def\@tempb{##1}% - \ifx\@tempa\@tempb\else - \@addtoreset{##1}{#2}% - \fi}% - \expandafter\expandafter\let\expandafter\@tempc\csname cl@#2\endcsname - \expandafter\def\csname cl@#2\endcsname{}% - \@tempc - \let\@elt\@tempd} - -\def\theopargself{\def\@spopargbegintheorem##1##2##3##4##5{\trivlist - \item[\hskip\labelsep{##4##1\ ##2}]{##4##3\@thmcounterend\ }##5} - \def\@Opargbegintheorem##1##2##3##4{##4\trivlist - \item[\hskip\labelsep{##3##1}]{##3##2\@thmcounterend\ }} - } - -\renewenvironment{abstract}{% - \list{}{\advance\topsep by0.35cm\relax\small - \leftmargin=1cm - \labelwidth=\z@ - \listparindent=\z@ - \itemindent\listparindent - \rightmargin\leftmargin}\item[\hskip\labelsep - \bfseries\abstractname]} - {\endlist} - -\newdimen\headlineindent % dimension for space between -\headlineindent=1.166cm % number and text of headings. 
- -\def\ps@headings{\let\@mkboth\@gobbletwo - \let\@oddfoot\@empty\let\@evenfoot\@empty - \def\@evenhead{\normalfont\small\rlap{\thepage}\hspace{\headlineindent}% - \leftmark\hfil} - \def\@oddhead{\normalfont\small\hfil\rightmark\hspace{\headlineindent}% - \llap{\thepage}} - \def\chaptermark##1{}% - \def\sectionmark##1{}% - \def\subsectionmark##1{}} - -\def\ps@titlepage{\let\@mkboth\@gobbletwo - \let\@oddfoot\@empty\let\@evenfoot\@empty - \def\@evenhead{\normalfont\small\rlap{\thepage}\hspace{\headlineindent}% - \hfil} - \def\@oddhead{\normalfont\small\hfil\hspace{\headlineindent}% - \llap{\thepage}} - \def\chaptermark##1{}% - \def\sectionmark##1{}% - \def\subsectionmark##1{}} - -\if@runhead\ps@headings\else -\ps@empty\fi - -\setlength\arraycolsep{1.4\p@} -\setlength\tabcolsep{1.4\p@} - -\endinput -%end of file llncs.cls diff --git a/doc/iwoph17/llncsdoc.pdf b/doc/iwoph17/llncsdoc.pdf deleted file mode 100644 index 5b68e0f18b015dba4224f7068acdf61200d19397..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/llncsdoc.pdf and /dev/null differ diff --git a/doc/iwoph17/llncsdoc.sty b/doc/iwoph17/llncsdoc.sty deleted file mode 100644 index 5843cba8e3f77e43f407f4af8df49905a11c3369..0000000000000000000000000000000000000000 --- a/doc/iwoph17/llncsdoc.sty +++ /dev/null @@ -1,42 +0,0 @@ -% This is LLNCSDOC.STY the modification of the -% LLNCS class file for the documentation of -% the class itself. -% -\def\AmS{{\protect\usefont{OMS}{cmsy}{m}{n}% - A\kern-.1667em\lower.5ex\hbox{M}\kern-.125emS}} -\def\AmSTeX{{\protect\AmS-\protect\TeX}} -% -\def\ps@myheadings{\let\@mkboth\@gobbletwo -\def\@oddhead{\hbox{}\hfil\small\rm\rightmark -\qquad\thepage}% -\def\@oddfoot{}\def\@evenhead{\small\rm\thepage\qquad -\leftmark\hfil}% -\def\@evenfoot{}\def\sectionmark##1{}\def\subsectionmark##1{}} -\ps@myheadings -% -\setcounter{tocdepth}{2} -% -\renewcommand{\labelitemi}{--} -\newenvironment{alpherate}% -{\renewcommand{\labelenumi}{\alph{enumi})}\begin{enumerate}}% -{\end{enumerate}\renewcommand{\labelenumi}{enumi}} -% -\def\bibauthoryear{\begingroup -\def\thebibliography##1{\section*{References}% - \small\list{}{\settowidth\labelwidth{}\leftmargin\parindent - \itemindent=-\parindent - \labelsep=\z@ - \usecounter{enumi}}% - \def\newblock{\hskip .11em plus .33em minus -.07em}% - \sloppy - \sfcode`\.=1000\relax}% - \def\@cite##1{##1}% - \def\@lbibitem[##1]##2{\item[]\if@filesw - {\def\protect####1{\string ####1\space}\immediate - \write\@auxout{\string\bibcite{##2}{##1}}}\fi\ignorespaces}% -\begin{thebibliography}{} -\bibitem[1982]{clar:eke3} Clarke, F., Ekeland, I.: Nonlinear -oscillations and boundary-value problems for Hamiltonian systems. -Arch. Rat. Mech. Anal. 
78, 315--333 (1982) -\end{thebibliography} -\endgroup} diff --git a/doc/iwoph17/media/image1.jpeg b/doc/iwoph17/media/image1.jpeg deleted file mode 100644 index c8e782b947f4e6aafc95462514d555f584f2b3b6..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image1.jpeg and /dev/null differ diff --git a/doc/iwoph17/media/image10.pdf b/doc/iwoph17/media/image10.pdf deleted file mode 100644 index 3406c3322748c7b6b1b05fd92d3b57af552638d7..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image10.pdf and /dev/null differ diff --git a/doc/iwoph17/media/image11.pdf b/doc/iwoph17/media/image11.pdf deleted file mode 100644 index 6c684ed6e8015cd060c5c77c5bf87c802748dd0a..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image11.pdf and /dev/null differ diff --git a/doc/iwoph17/media/image12.png b/doc/iwoph17/media/image12.png deleted file mode 100644 index 1a92ccb6cfabff3b8cd1af068987834b672031cc..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image12.png and /dev/null differ diff --git a/doc/iwoph17/media/image13.png b/doc/iwoph17/media/image13.png deleted file mode 100644 index 9e552dfc42c89fa9fbea3e60cd461c2013d16db6..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image13.png and /dev/null differ diff --git a/doc/iwoph17/media/image14.png b/doc/iwoph17/media/image14.png deleted file mode 100644 index 96a21d72c868307643b467ca5affb4686c56a47a..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image14.png and /dev/null differ diff --git a/doc/iwoph17/media/image15.png b/doc/iwoph17/media/image15.png deleted file mode 100644 index ed77a66e71a6d68797d2f4b49f5ff88f4864f9c2..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image15.png and /dev/null differ diff --git a/doc/iwoph17/media/image16.png b/doc/iwoph17/media/image16.png deleted file mode 100644 index 960c6d9c3e5851ea29c3abd1111f58158eae111a..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image16.png and /dev/null differ diff --git a/doc/iwoph17/media/image17.png b/doc/iwoph17/media/image17.png deleted file mode 100644 index 769998f3d3bac919a1cbc43eb5656b7bd6caadc2..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image17.png and /dev/null differ diff --git a/doc/iwoph17/media/image18.pdf b/doc/iwoph17/media/image18.pdf deleted file mode 100644 index 6c8fa538b1a007ff7589eb32578c765496eb4288..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image18.pdf and /dev/null differ diff --git a/doc/iwoph17/media/image19.png b/doc/iwoph17/media/image19.png deleted file mode 100644 index bcfd535a22c4627df249f493f463e608bb81f565..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image19.png and /dev/null differ diff --git a/doc/iwoph17/media/image2.pdf b/doc/iwoph17/media/image2.pdf deleted file mode 100644 index 9cb86227c8ba69e600ec83c112e0f9ed1cf5106b..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image2.pdf and /dev/null differ diff --git a/doc/iwoph17/media/image20.png b/doc/iwoph17/media/image20.png deleted file mode 100644 index d1e91cb565d6cd9d3834350652953a312a999e05..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image20.png and /dev/null differ diff --git a/doc/iwoph17/media/image21.pdf b/doc/iwoph17/media/image21.pdf deleted file mode 100644 index 
002d8acf8f09608cf3f484915d309662998838dd..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image21.pdf and /dev/null differ diff --git a/doc/iwoph17/media/image22.png b/doc/iwoph17/media/image22.png deleted file mode 100644 index 0b1d30768e595cf7abab22ea8fa08668eb27d6d9..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image22.png and /dev/null differ diff --git a/doc/iwoph17/media/image23.png b/doc/iwoph17/media/image23.png deleted file mode 100644 index a6cb4f8ca51306866ae6f4ff022de7773e1aa4f3..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image23.png and /dev/null differ diff --git a/doc/iwoph17/media/image24.png b/doc/iwoph17/media/image24.png deleted file mode 100644 index f225969609f41301f5241e5b563c39ad4f765c51..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image24.png and /dev/null differ diff --git a/doc/iwoph17/media/image25.png b/doc/iwoph17/media/image25.png deleted file mode 100644 index 39fab0dd1458649d1e852fdc5093b679613a1718..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image25.png and /dev/null differ diff --git a/doc/iwoph17/media/image26.png b/doc/iwoph17/media/image26.png deleted file mode 100644 index af2401c37074fd25664fe65a24c319ed5ad1b349..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image26.png and /dev/null differ diff --git a/doc/iwoph17/media/image27.png b/doc/iwoph17/media/image27.png deleted file mode 100644 index ede234a3ac0bc8e429cf7cb829eb8ecaea846d70..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image27.png and /dev/null differ diff --git a/doc/iwoph17/media/image28.png b/doc/iwoph17/media/image28.png deleted file mode 100644 index 61cc56813d6b51ed161cd02f68298ab6587aed28..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image28.png and /dev/null differ diff --git a/doc/iwoph17/media/image29.png b/doc/iwoph17/media/image29.png deleted file mode 100644 index 74e5eebfe672bc77a9467f1c8e325d449131f4fc..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image29.png and /dev/null differ diff --git a/doc/iwoph17/media/image3.png b/doc/iwoph17/media/image3.png deleted file mode 100644 index d40b273a8a657eb3ce515125d3d639785962c480..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image3.png and /dev/null differ diff --git a/doc/iwoph17/media/image30.png b/doc/iwoph17/media/image30.png deleted file mode 100644 index a76e3dd79e50ed9294b4f7721276e01bef097709..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image30.png and /dev/null differ diff --git a/doc/iwoph17/media/image4.png b/doc/iwoph17/media/image4.png deleted file mode 100644 index 1e516c6859830a07f690baef06dcd3ff30d6cb65..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image4.png and /dev/null differ diff --git a/doc/iwoph17/media/image5.png b/doc/iwoph17/media/image5.png deleted file mode 100644 index acd1a61c1e6b3a4a2839545a24969719e5cefe1a..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image5.png and /dev/null differ diff --git a/doc/iwoph17/media/image6.png b/doc/iwoph17/media/image6.png deleted file mode 100644 index 9e7ad1f1750669b780458597440a6f9bb783f8fe..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image6.png and /dev/null differ diff --git a/doc/iwoph17/media/image7.pdf b/doc/iwoph17/media/image7.pdf deleted file mode 100644 
index 5d332462e1171db3eb09a4c909a3db663accb37c..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image7.pdf and /dev/null differ diff --git a/doc/iwoph17/media/image8.pdf b/doc/iwoph17/media/image8.pdf deleted file mode 100644 index c8bbcfdc642c7540cd072026367b56b1395df3e3..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image8.pdf and /dev/null differ diff --git a/doc/iwoph17/media/image9.png b/doc/iwoph17/media/image9.png deleted file mode 100644 index 819eb1a3577216fa02659d06b975b03c3d9fa101..0000000000000000000000000000000000000000 Binary files a/doc/iwoph17/media/image9.png and /dev/null differ diff --git a/doc/iwoph17/remreset.sty b/doc/iwoph17/remreset.sty deleted file mode 100644 index b53de583573b737887d2fa486b87a8cde34ca452..0000000000000000000000000000000000000000 --- a/doc/iwoph17/remreset.sty +++ /dev/null @@ -1,39 +0,0 @@ - -% remreset package -%%%%%%%%%%%%%%%%%% - -% Copyright 1997 David carlisle -% This file may be distributed under the terms of the LPPL. -% See 00readme.txt for details. - -% 1997/09/28 David Carlisle - -% LaTeX includes a command \@addtoreset that is used to declare that -% a counter should be reset every time a second counter is incremented. - -% For example the book class has a line -% \@addtoreset{footnote}{chapter} -% So that the footnote counter is reset each chapter. - -% If you wish to bas a new class on book, but without this counter -% being reset, then standard LaTeX gives no simple mechanism to do -% this. - -% This package defines |\@removefromreset| which just undoes the effect -% of \@addtorest. So for example a class file may be defined by - -% \LoadClass{book} -% \@removefromreset{footnote}{chapter} - - -\def\@removefromreset#1#2{{% - \expandafter\let\csname c@#1\endcsname\@removefromreset - \def\@elt##1{% - \expandafter\ifx\csname c@##1\endcsname\@removefromreset - \else - \noexpand\@elt{##1}% - \fi}% - \expandafter\xdef\csname cl@#2\endcsname{% - \csname cl@#2\endcsname}}} - - diff --git a/doc/iwoph17/results.tex b/doc/iwoph17/results.tex deleted file mode 100644 index 63c1d47beb1f22c10f0781f04538c45e43de3882..0000000000000000000000000000000000000000 --- a/doc/iwoph17/results.tex +++ /dev/null @@ -1,141 +0,0 @@ -\section{Applications Performances\label{sec:results}} - -This section presents sample results on OpenPOWER + GPU systems. It should be noted that some of the codes presented in Sect. \ref{sec:codes} -- namely Alya, CP2K, GPAW, PFARM, QUANTUM ESPRESSO -- have been run on x86 + GPU platforms, so their results are not shown here. Nevertheless, instructions to compile and run those codes are available and can be used to target OpenPOWER-based systems. - -\subsection{Code\_Saturne\label{ref-0094}} - -\subsubsection{Description of the Runtime Architecture.} - -Code\_Saturne ran on 2 POWER8 nodes, i.e. S822LC (2x P8 10-cores + 2x K80 (2 G210 per K80)) and S824L (2x P8 12-cores + 2x K40 (1 G180 per K40)). The compiler is at/8.0, the MPI distribution is openmpi/1.8.8 and the CUDA compiler version is 7.5. - -\subsubsection{Flow in a 3-D Lid-driven Cavity (Tetrahedral Cells)} - -The following options are used for PETSc: \verb+-ksp_type = cg+, \verb+-vec_type = cusp+, \verb+-mat_type = aijcusp+ and \verb+-pc_type = jacobi+. - -\begin{table} -\caption{Performance of Code\_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK.
The speedup is computed as the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks.\label{table:cs-results}} -\includegraphics[width=1\textwidth]{media/image7.pdf} -\end{table} - -Table \ref{table:cs-results} shows the results obtained using the POWER8 CPU only and using CPU/GPU. Focusing first on the POWER8 results, a speedup is observed on each POWER8 node when the same number of MPI tasks is used together with the GPUs. However, when the nodes are fully populated (20 and 24 MPI tasks, respectively), it is cheaper to run on the CPU only than on the CPU/GPU. This could be explained by the fact that the same overall amount of data is transferred, but the system administration costs, latency costs and the asynchronicity of transferring it in 20 (S822LC) or 24 (S824L) slices might be prohibitive. - -\subsection{GROMACS\label{ref-0110}} - -GROMACS was successfully compiled and run on IDRIS Ouessant, the IBM POWER8 + dual P100 system presented in Sect. \ref{sec:hardware}. - -In all accelerated runs, a speedup of 2-2.6x with respect to CPU-only runs was achieved with the GPUs. -\begin{figure} -\caption{Scalability for GROMACS test case GluCL Ion Channel\label{ref-0111}} -\includegraphics[width=1\textwidth]{media/image13.png} -\label{fig:7} -\end{figure} -\begin{figure} -\caption{Scalability for GROMACS test case Lignocellulose\label{ref-0112}} -\includegraphics[width=1\textwidth]{media/image14.png} -\label{fig:8} -\end{figure} - -\subsection{NAMD\label{ref-0113}} - -NAMD was successfully compiled and run on IDRIS Ouessant, the IBM POWER8 + dual P100 system presented in Sect. \ref{sec:hardware}. - -In all accelerated runs, a speedup of 5-6x with respect to CPU-only runs was achieved with the GPUs. - -\begin{figure} -\caption{Scalability for NAMD test case STMV.8M\label{ref-0114}} -\includegraphics[width=1\textwidth]{media/image15.png} -\label{fig:9} -\end{figure} - -\begin{figure} -\caption{Scalability for NAMD test case STMV.28M\label{ref-0115}} -\includegraphics[width=1\textwidth]{media/image16.png} -\label{fig:10} -\end{figure} - - -\subsection{QCD\label{ref-0123}} - -As stated in Sect. \ref{sec:codes-qcd}, the QCD benchmark has two implementations. - -\subsubsection{First Implementation.\label{ref-0124}} - -\begin{figure} -\caption{Time taken by the full MILC $64\times64\times64\times8$ test cases on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures.\label{ref-0129}\label{ref-0130}} -\includegraphics[width=1\textwidth]{media/image21.pdf} -\label{fig:14} -\end{figure} -In Fig. \ref{fig:14} we present preliminary results for the latest generation Intel Knights Landing (KNL) and NVIDIA Pascal architectures, which offer very high bandwidth stacked memory, together with the same traditional Intel Ivy-bridge CPU used in previous sections. Note that these results are not directly comparable with those presented earlier, since they are for a different test case size (larger, since we are no longer limited by the small memory size of the Knights Corner), and they are for a slightly updated version of the benchmark. The KNL is the 64-core 7210 model, available from within a test and development platform provided as part of the ARCHER service. The Pascal is an NVIDIA P100 GPU provided as part of the ``Ouessant'' IBM service at IDRIS, where the host CPU is an IBM Power8+.
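% Editorial illustration (not part of the original results.tex): the speed-up
% figures quoted in this section are ratios of times to solution, with the
% reference (non-accelerated) time in the numerator, as defined in the caption
% of Table \ref{table:cs-results}. Writing $T_{\mathrm{ref}}$ and
% $T_{\mathrm{acc}}$ for the reference and accelerated times to solution of the
% same test case (hypothetical symbols introduced here for clarity only):
\[
S = \frac{T_{\mathrm{ref}}}{T_{\mathrm{acc}}}
\]
% For instance, the Pascal-versus-KNL figure quoted below is consistent with
% the two Ivy-bridge-relative figures, since $13/7.5 \approx 1.7$.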
- -It can be seen that the KNL is 7.5X faster than the Ivy-bridge; the Pascal is 13X faster than the Ivy-bridge; and the OpenPOWER + Pascal is 1.7X faster than the KNL. - -\subsection{Synthetic Benchmarks (SHOC)\label{sec:results:shoc}} - -The SHOC benchmark has been run on Cartesius, Ouessant and MareNostrum. Table \ref{table:result:shoc} presents the results. Results on POWER8 are the ones shown in the P100 CUDA column. - -\begin{table} -\caption{Synthetic benchmark results on GPU, Xeon Phi and CPU\label{table:result:shoc}} -\begin{tabularx}{\textwidth}{ -p{\dimexpr 0.25\linewidth-2\tabcolsep} -p{\dimexpr 0.13\linewidth-2\tabcolsep} -p{\dimexpr 0.13\linewidth-2\tabcolsep} -p{\dimexpr 0.13\linewidth-2\tabcolsep} -p{\dimexpr 0.1\linewidth-2\tabcolsep} -p{\dimexpr 0.12\linewidth-2\tabcolsep} -p{\dimexpr 0.13\linewidth-2\tabcolsep}} - & \multicolumn{3}{l}{NVIDIA GPU} & \multicolumn{2}{l}{Intel Xeon Phi} & Intel CPU \\ - & K40 CUDA & K40 OpenCL & P100 CUDA & KNC Offload & KNC OpenCL & Haswell OpenCL \\ -\hline -BusSpeedDownload & 10.5 GB/s & 10.56 GB/s & 32.23 GB/s & 6.6 GB/s & 6.8 GB/s & 12.4 GB/s \\ -BusSpeedReadback & 10.5 GB/s & 10.56 GB/s & 34.00 GB/s & 6.7 GB/s & 6.8 GB/s & 12.5 GB/s \\ -maxspflops & 3716 GFLOPS & 3658 GFLOPS & 10424 GFLOPS & \textcolor{color-4}{21581} & \textcolor{color-4}{2314 GFLOPS} & 1647 GFLOPS \\ -maxdpflops & 1412 GFLOPS & 1411 GFLOPS & 5315 GFLOPS & \textcolor{color-4}{16017} & \textcolor{color-4}{2318 GFLOPS} & 884 GFLOPS \\ -gmem\_readbw & 177 GB/s & 179 GB/s & 575.16 GB/s & 170 GB/s & 49.7 GB/s & 20.2 GB/s \\ -gmem\_readbw\_strided & 18 GB/s & 20 GB/s & 99.15 GB/s & N/A & 35 GB/s & \textcolor{color-4}{156 GB/s} \\ -gmem\_writebw & 175 GB/s & 188 GB/s & 436 GB/s & 72 GB/s & 41 GB/s & 13.6 GB/s \\ -gmem\_writebw\_strided & 7 GB/s & 7 GB/s & 26.3 GB/s & N/A & 25 GB/s & \textcolor{color-4}{163 GB/s} \\ -lmem\_readbw & 1168 GB/s & 1156 GB/s & 4239 GB/s & N/A & 442 GB/s & 238 GB/s \\ -lmem\_writebw & 1194 GB/s & 1162 GB/s & 5488 GB/s & N/A & 477 GB/s & 295 GB/s \\ -BFS & 49,236,500 Edges/s & 42,088,000 Edges/s & 91,935,100 Edges/s & N/A & 1,635,330 Edges/s & 14,225,600 Edges/s \\ -FFT\_sp & 523 GFLOPS & 377 GFLOPS & 1472 GFLOPS & 135 GFLOPS & 71 GFLOPS & 80 GFLOPS \\ -FFT\_dp & 262 GFLOPS & 61 GFLOPS & 733 GFLOPS & 69.5 GFLOPS & 31 GFLOPS & 55 GFLOPS \\ -SGEMM & 2900-2990 GFLOPS & 694/761 GFLOPS & 8604-8720 GFLOPS & 640/645 GFLOPS & 179/217 GFLOPS & 419-554 GFLOPS \\ -DGEMM & 1025-1083 GFLOPS & 411/433 GFLOPS & 3635-3785 GFLOPS & 179/190 GFLOPS & 76/100 GFLOPS & 189-196 GFLOPS \\ -MD (SP) & 185 GFLOPS & 91 GFLOPS & 483 GFLOPS & 28 GFLOPS & 33 GFLOPS & 114 GFLOPS \\ -MD5Hash & 3.38 GH/s & 3.36 GH/s & 15.77 GH/s & N/A & 1.7 GH/s & 1.29 GH/s \\ -Reduction & 137 GB/s & 150 GB/s & 271 GB/s & 99 GB/s & 10 GB/s & 91 GB/s \\ -Scan & 47 GB/s & 39 GB/s & 99.2 GB/s & 11 GB/s & 4.5 GB/s & 15 GB/s \\ -Sort & 3.08 GB/s & 0.54 GB/s & 12.54 GB/s & N/A & 0.11 GB/s & 0.35 GB/s \\ -Spmv & 4-23 GFLOPS & 3-17 GFLOPS & 23-65 GFLOPS & \textcolor{color-4}{1-17944 GFLOPS} & N/A & 1-10 GFLOPS \\ -Stencil2D & 123 GFLOPS & 135 GFLOPS & 465 GFLOPS & 89 GFLOPS & 8.95 GFLOPS & 34 GFLOPS \\ -Stencil2D\_dp & 57 GFLOPS & 67 GFLOPS & 258 GFLOPS & 16 GFLOPS & 7.92 GFLOPS & 30 GFLOPS \\ -Triad & 13.5 GB/s & 9.9 GB/s & 43 GB/s & 5.76 GB/s & 5.57 GB/s & 8 GB/s \\ -S3D (level2) & 94 GFLOPS & 91 GFLOPS & 294 GFLOPS & 109 GFLOPS & 18 GFLOPS & 27 GFLOPS \\ -\end{tabularx} -\end{table} - -In Table \ref{table:result:shoc}, the measurements marked in red are not relevant and should not be considered: - -\begin{itemize} -\item KNC
MaxFlops (both SP and DP): In this case, the compiler optimizes away some of the computation (although it should not) \cite{ref-0034}. -\item KNC SpMV: For these benchmarks, the issue is a known bug that is currently being addressed \cite{ref-0035}. -\item Haswell \verb+gmem_readbw_strided+ and \verb+gmem_writebw_strided+: the strided read/write benchmarks do not make much sense in the case of a CPU, as the data will be cached in the large L3 cache. This is why such high numbers are seen only in the Haswell case. -\end{itemize} - -\subsection{SPECFEM3D\_GLOBE\label{ref-0155}} - -Tests have been carried out on Ouessant (see Sect. \ref{sec:hardware}). - -So far it has only been possible to run on one fixed core count for each test case, so scaling curves are not available. Test case A ran on 4 KNL and 4 P100. Test case B ran on 10 KNL and 4 P100. Results are shown in Table \ref{table:results:specfem}. - -\begin{table} -\centering -\caption{SPECFEM 3D GLOBE results (run time in seconds)} -\begin{tabular}{ccc} - & KNL & POWER + P100 \\ -\hline\noalign{\smallskip} -Test case A & 66 & 105 \\ -Test case B & 21.4 & 68 \\ -\end{tabular} -\label{table:results:specfem} -\end{table} diff --git a/doc/iwoph17/splncs03.bst b/doc/iwoph17/splncs03.bst deleted file mode 100644 index 3279169171fce20468c990cdbff52f0e582ccc77..0000000000000000000000000000000000000000 --- a/doc/iwoph17/splncs03.bst +++ /dev/null @@ -1,1519 +0,0 @@ -%% BibTeX bibliography style `splncs03' -%% -%% BibTeX bibliography style for use with numbered references in -%% Springer Verlag's "Lecture Notes in Computer Science" series. -%% (See Springer's documentation for llncs.cls for -%% more details of the suggested reference format.) Note that this -%% file will not work for author-year style citations. -%% -%% Use \documentclass{llncs} and \bibliographystyle{splncs03}, and cite -%% a reference with (e.g.) \cite{smith77} to get a "[1]" in the text. -%% -%% This file comes to you courtesy of Maurizio "Titto" Patrignani of -%% Dipartimento di Informatica e Automazione Universita' Roma Tre -%% -%% ================================================================================================ -%% This was file `titto-lncs-02.bst' produced on Wed Apr 1, 2009 -%% Edited by hand by titto based on `titto-lncs-01.bst' (see below) -%% -%% CHANGES (with respect to titto-lncs-01.bst): -%% - Removed the call to \urlprefix (thus no "URL" string is added to the output) -%% ================================================================================================ -%% This was file `titto-lncs-01.bst' produced on Fri Aug 22, 2008 -%% Edited by hand by titto based on `titto.bst' (see below) -%% -%% CHANGES (with respect to titto.bst): -%% - Removed the "capitalize" command for editors string "(eds.)" and "(ed.)" -%% - Introduced the functions titto.bbl.pages and titto.bbl.page for journal pages (without "pp.") -%% - Added a new.sentence command to separate with a dot booktitle and series in the inproceedings -%% - Commented all new.block commands before urls and notes (to separate them with a comma) -%% - Introduced the functions titto.bbl.volume for handling journal volumes (without "vol." label) -%% - Used for editors the same name conventions used for authors (see function format.in.ed.booktitle) -%% - Removed a \newblock to avoid long spaces between title and "In: ..." -%% - Added function titto.space.prefix to add a space instead of "~" after the (removed) "vol."
label -%% ================================================================================================ -%% This was file `titto.bst', -%% generated with the docstrip utility. -%% -%% The original source files were: -%% -%% merlin.mbs (with options: `vonx,nm-rvvc,yr-par,jttl-rm,volp-com,jwdpg,jwdvol,numser,ser-vol,jnm-x,btit-rm,bt-rm,edparxc,bkedcap,au-col,in-col,fin-bare,pp,ed,abr,mth-bare,xedn,jabr,and-com,and-com-ed,xand,url,url-blk,em-x,nfss,') -%% ---------------------------------------- -%% *** Tentative .bst file for Springer LNCS *** -%% -%% Copyright 1994-2007 Patrick W Daly - % =============================================================== - % IMPORTANT NOTICE: - % This bibliographic style (bst) file has been generated from one or - % more master bibliographic style (mbs) files, listed above. - % - % This generated file can be redistributed and/or modified under the terms - % of the LaTeX Project Public License Distributed from CTAN - % archives in directory macros/latex/base/lppl.txt; either - % version 1 of the License, or any later version. - % =============================================================== - % Name and version information of the main mbs file: - % \ProvidesFile{merlin.mbs}[2007/04/24 4.20 (PWD, AO, DPC)] - % For use with BibTeX version 0.99a or later - %------------------------------------------------------------------- - % This bibliography style file is intended for texts in ENGLISH - % This is a numerical citation style, and as such is standard LaTeX. - % It requires no extra package to interface to the main text. - % The form of the \bibitem entries is - % \bibitem{key}... - % Usage of \cite is as follows: - % \cite{key} ==>> [#] - % \cite[chap. 2]{key} ==>> [#, chap. 2] - % where # is a number determined by the ordering in the reference list. - % The order in the reference list is alphabetical by authors. - %--------------------------------------------------------------------- - -ENTRY - { address - author - booktitle - chapter - edition - editor - eid - howpublished - institution - journal - key - month - note - number - organization - pages - publisher - school - series - title - type - url - volume - year - } - {} - { label } -INTEGERS { output.state before.all mid.sentence after.sentence after.block } -FUNCTION {init.state.consts} -{ #0 'before.all := - #1 'mid.sentence := - #2 'after.sentence := - #3 'after.block := -} -STRINGS { s t} -FUNCTION {output.nonnull} -{ 's := - output.state mid.sentence = - { ", " * write$ } - { output.state after.block = - { add.period$ write$ -% newline$ -% "\newblock " write$ % removed for titto-lncs-01 - " " write$ % to avoid long spaces between title and "In: ..." 
- } - { output.state before.all = - 'write$ - { add.period$ " " * write$ } - if$ - } - if$ - mid.sentence 'output.state := - } - if$ - s -} -FUNCTION {output} -{ duplicate$ empty$ - 'pop$ - 'output.nonnull - if$ -} -FUNCTION {output.check} -{ 't := - duplicate$ empty$ - { pop$ "empty " t * " in " * cite$ * warning$ } - 'output.nonnull - if$ -} -FUNCTION {fin.entry} -{ duplicate$ empty$ - 'pop$ - 'write$ - if$ - newline$ -} - -FUNCTION {new.block} -{ output.state before.all = - 'skip$ - { after.block 'output.state := } - if$ -} -FUNCTION {new.sentence} -{ output.state after.block = - 'skip$ - { output.state before.all = - 'skip$ - { after.sentence 'output.state := } - if$ - } - if$ -} -FUNCTION {add.blank} -{ " " * before.all 'output.state := -} - - -FUNCTION {add.colon} -{ duplicate$ empty$ - 'skip$ - { ":" * add.blank } - if$ -} - -FUNCTION {date.block} -{ - new.block -} - -FUNCTION {not} -{ { #0 } - { #1 } - if$ -} -FUNCTION {and} -{ 'skip$ - { pop$ #0 } - if$ -} -FUNCTION {or} -{ { pop$ #1 } - 'skip$ - if$ -} -STRINGS {z} -FUNCTION {remove.dots} -{ 'z := - "" - { z empty$ not } - { z #1 #1 substring$ - z #2 global.max$ substring$ 'z := - duplicate$ "." = 'pop$ - { * } - if$ - } - while$ -} -FUNCTION {new.block.checka} -{ empty$ - 'skip$ - 'new.block - if$ -} -FUNCTION {new.block.checkb} -{ empty$ - swap$ empty$ - and - 'skip$ - 'new.block - if$ -} -FUNCTION {new.sentence.checka} -{ empty$ - 'skip$ - 'new.sentence - if$ -} -FUNCTION {new.sentence.checkb} -{ empty$ - swap$ empty$ - and - 'skip$ - 'new.sentence - if$ -} -FUNCTION {field.or.null} -{ duplicate$ empty$ - { pop$ "" } - 'skip$ - if$ -} -FUNCTION {emphasize} -{ skip$ } -FUNCTION {tie.or.space.prefix} -{ duplicate$ text.length$ #3 < - { "~" } - { " " } - if$ - swap$ -} -FUNCTION {titto.space.prefix} % always introduce a space -{ duplicate$ text.length$ #3 < - { " " } - { " " } - if$ - swap$ -} - - -FUNCTION {capitalize} -{ "u" change.case$ "t" change.case$ } - -FUNCTION {space.word} -{ " " swap$ * " " * } - % Here are the language-specific definitions for explicit words. - % Each function has a name bbl.xxx where xxx is the English word. - % The language selected here is ENGLISH -FUNCTION {bbl.and} -{ "and"} - -FUNCTION {bbl.etal} -{ "et~al." } - -FUNCTION {bbl.editors} -{ "eds." } - -FUNCTION {bbl.editor} -{ "ed." } - -FUNCTION {bbl.edby} -{ "edited by" } - -FUNCTION {bbl.edition} -{ "edn." } - -FUNCTION {bbl.volume} -{ "vol." } - -FUNCTION {titto.bbl.volume} % for handling journals -{ "" } - -FUNCTION {bbl.of} -{ "of" } - -FUNCTION {bbl.number} -{ "no." } - -FUNCTION {bbl.nr} -{ "no." } - -FUNCTION {bbl.in} -{ "in" } - -FUNCTION {bbl.pages} -{ "pp." } - -FUNCTION {bbl.page} -{ "p." } - -FUNCTION {titto.bbl.pages} % for journals -{ "" } - -FUNCTION {titto.bbl.page} % for journals -{ "" } - -FUNCTION {bbl.chapter} -{ "chap." } - -FUNCTION {bbl.techrep} -{ "Tech. Rep." } - -FUNCTION {bbl.mthesis} -{ "Master's thesis" } - -FUNCTION {bbl.phdthesis} -{ "Ph.D. thesis" } - -MACRO {jan} {"Jan."} - -MACRO {feb} {"Feb."} - -MACRO {mar} {"Mar."} - -MACRO {apr} {"Apr."} - -MACRO {may} {"May"} - -MACRO {jun} {"Jun."} - -MACRO {jul} {"Jul."} - -MACRO {aug} {"Aug."} - -MACRO {sep} {"Sep."} - -MACRO {oct} {"Oct."} - -MACRO {nov} {"Nov."} - -MACRO {dec} {"Dec."} - -MACRO {acmcs} {"ACM Comput. Surv."} - -MACRO {acta} {"Acta Inf."} - -MACRO {cacm} {"Commun. ACM"} - -MACRO {ibmjrd} {"IBM J. Res. Dev."} - -MACRO {ibmsj} {"IBM Syst.~J."} - -MACRO {ieeese} {"IEEE Trans. Software Eng."} - -MACRO {ieeetc} {"IEEE Trans. 
Comput."} - -MACRO {ieeetcad} - {"IEEE Trans. Comput. Aid. Des."} - -MACRO {ipl} {"Inf. Process. Lett."} - -MACRO {jacm} {"J.~ACM"} - -MACRO {jcss} {"J.~Comput. Syst. Sci."} - -MACRO {scp} {"Sci. Comput. Program."} - -MACRO {sicomp} {"SIAM J. Comput."} - -MACRO {tocs} {"ACM Trans. Comput. Syst."} - -MACRO {tods} {"ACM Trans. Database Syst."} - -MACRO {tog} {"ACM Trans. Graphic."} - -MACRO {toms} {"ACM Trans. Math. Software"} - -MACRO {toois} {"ACM Trans. Office Inf. Syst."} - -MACRO {toplas} {"ACM Trans. Progr. Lang. Syst."} - -MACRO {tcs} {"Theor. Comput. Sci."} - -FUNCTION {bibinfo.check} -{ swap$ - duplicate$ missing$ - { - pop$ pop$ - "" - } - { duplicate$ empty$ - { - swap$ pop$ - } - { swap$ - pop$ - } - if$ - } - if$ -} -FUNCTION {bibinfo.warn} -{ swap$ - duplicate$ missing$ - { - swap$ "missing " swap$ * " in " * cite$ * warning$ pop$ - "" - } - { duplicate$ empty$ - { - swap$ "empty " swap$ * " in " * cite$ * warning$ - } - { swap$ - pop$ - } - if$ - } - if$ -} -FUNCTION {format.url} -{ url empty$ - { "" } -% { "\urlprefix\url{" url * "}" * } - { "\url{" url * "}" * } % changed in titto-lncs-02.bst - if$ -} - -INTEGERS { nameptr namesleft numnames } - - -STRINGS { bibinfo} - -FUNCTION {format.names} -{ 'bibinfo := - duplicate$ empty$ 'skip$ { - 's := - "" 't := - #1 'nameptr := - s num.names$ 'numnames := - numnames 'namesleft := - { namesleft #0 > } - { s nameptr - "{vv~}{ll}{, jj}{, f{.}.}" - format.name$ - bibinfo bibinfo.check - 't := - nameptr #1 > - { - namesleft #1 > - { ", " * t * } - { - s nameptr "{ll}" format.name$ duplicate$ "others" = - { 't := } - { pop$ } - if$ - "," * - t "others" = - { - " " * bbl.etal * - } - { " " * t * } - if$ - } - if$ - } - 't - if$ - nameptr #1 + 'nameptr := - namesleft #1 - 'namesleft := - } - while$ - } if$ -} -FUNCTION {format.names.ed} -{ - 'bibinfo := - duplicate$ empty$ 'skip$ { - 's := - "" 't := - #1 'nameptr := - s num.names$ 'numnames := - numnames 'namesleft := - { namesleft #0 > } - { s nameptr - "{f{.}.~}{vv~}{ll}{ jj}" - format.name$ - bibinfo bibinfo.check - 't := - nameptr #1 > - { - namesleft #1 > - { ", " * t * } - { - s nameptr "{ll}" format.name$ duplicate$ "others" = - { 't := } - { pop$ } - if$ - "," * - t "others" = - { - - " " * bbl.etal * - } - { " " * t * } - if$ - } - if$ - } - 't - if$ - nameptr #1 + 'nameptr := - namesleft #1 - 'namesleft := - } - while$ - } if$ -} -FUNCTION {format.authors} -{ author "author" format.names -} -FUNCTION {get.bbl.editor} -{ editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ } - -FUNCTION {format.editors} -{ editor "editor" format.names duplicate$ empty$ 'skip$ - { - " " * - get.bbl.editor -% capitalize - "(" swap$ * ")" * - * - } - if$ -} -FUNCTION {format.note} -{ - note empty$ - { "" } - { note #1 #1 substring$ - duplicate$ "{" = - 'skip$ - { output.state mid.sentence = - { "l" } - { "u" } - if$ - change.case$ - } - if$ - note #2 global.max$ substring$ * "note" bibinfo.check - } - if$ -} - -FUNCTION {format.title} -{ title - duplicate$ empty$ 'skip$ - { "t" change.case$ } - if$ - "title" bibinfo.check -} -FUNCTION {output.bibitem} -{ newline$ - "\bibitem{" write$ - cite$ write$ - "}" write$ - newline$ - "" - before.all 'output.state := -} - -FUNCTION {n.dashify} -{ - 't := - "" - { t empty$ not } - { t #1 #1 substring$ "-" = - { t #1 #2 substring$ "--" = not - { "--" * - t #2 global.max$ substring$ 't := - } - { { t #1 #1 substring$ "-" = } - { "-" * - t #2 global.max$ substring$ 't := - } - while$ - } - if$ - } - { t #1 #1 substring$ * - t #2 global.max$ substring$ 't := - } - 
if$ - } - while$ -} - -FUNCTION {word.in} -{ bbl.in capitalize - ":" * - " " * } - -FUNCTION {format.date} -{ - month "month" bibinfo.check - duplicate$ empty$ - year "year" bibinfo.check duplicate$ empty$ - { swap$ 'skip$ - { "there's a month but no year in " cite$ * warning$ } - if$ - * - } - { swap$ 'skip$ - { - swap$ - " " * swap$ - } - if$ - * - remove.dots - } - if$ - duplicate$ empty$ - 'skip$ - { - before.all 'output.state := - " (" swap$ * ")" * - } - if$ -} -FUNCTION {format.btitle} -{ title "title" bibinfo.check - duplicate$ empty$ 'skip$ - { - } - if$ -} -FUNCTION {either.or.check} -{ empty$ - 'pop$ - { "can't use both " swap$ * " fields in " * cite$ * warning$ } - if$ -} -FUNCTION {format.bvolume} -{ volume empty$ - { "" } - { bbl.volume volume tie.or.space.prefix - "volume" bibinfo.check * * - series "series" bibinfo.check - duplicate$ empty$ 'pop$ - { emphasize ", " * swap$ * } - if$ - "volume and number" number either.or.check - } - if$ -} -FUNCTION {format.number.series} -{ volume empty$ - { number empty$ - { series field.or.null } - { output.state mid.sentence = - { bbl.number } - { bbl.number capitalize } - if$ - number tie.or.space.prefix "number" bibinfo.check * * - series empty$ - { "there's a number but no series in " cite$ * warning$ } - { bbl.in space.word * - series "series" bibinfo.check * - } - if$ - } - if$ - } - { "" } - if$ -} - -FUNCTION {format.edition} -{ edition duplicate$ empty$ 'skip$ - { - output.state mid.sentence = - { "l" } - { "t" } - if$ change.case$ - "edition" bibinfo.check - " " * bbl.edition * - } - if$ -} -INTEGERS { multiresult } -FUNCTION {multi.page.check} -{ 't := - #0 'multiresult := - { multiresult not - t empty$ not - and - } - { t #1 #1 substring$ - duplicate$ "-" = - swap$ duplicate$ "," = - swap$ "+" = - or or - { #1 'multiresult := } - { t #2 global.max$ substring$ 't := } - if$ - } - while$ - multiresult -} -FUNCTION {format.pages} -{ pages duplicate$ empty$ 'skip$ - { duplicate$ multi.page.check - { - bbl.pages swap$ - n.dashify - } - { - bbl.page swap$ - } - if$ - tie.or.space.prefix - "pages" bibinfo.check - * * - } - if$ -} -FUNCTION {format.journal.pages} -{ pages duplicate$ empty$ 'pop$ - { swap$ duplicate$ empty$ - { pop$ pop$ format.pages } - { - ", " * - swap$ - n.dashify - pages multi.page.check - 'titto.bbl.pages - 'titto.bbl.page - if$ - swap$ tie.or.space.prefix - "pages" bibinfo.check - * * - * - } - if$ - } - if$ -} -FUNCTION {format.journal.eid} -{ eid "eid" bibinfo.check - duplicate$ empty$ 'pop$ - { swap$ duplicate$ empty$ 'skip$ - { - ", " * - } - if$ - swap$ * - } - if$ -} -FUNCTION {format.vol.num.pages} % this function is used only for journal entries -{ volume field.or.null - duplicate$ empty$ 'skip$ - { -% bbl.volume swap$ tie.or.space.prefix - titto.bbl.volume swap$ titto.space.prefix -% rationale for the change above: for journals you don't want "vol." 
label -% hence it does not make sense to attach the journal number to the label when -% it is short - "volume" bibinfo.check - * * - } - if$ - number "number" bibinfo.check duplicate$ empty$ 'skip$ - { - swap$ duplicate$ empty$ - { "there's a number but no volume in " cite$ * warning$ } - 'skip$ - if$ - swap$ - "(" swap$ * ")" * - } - if$ * - eid empty$ - { format.journal.pages } - { format.journal.eid } - if$ -} - -FUNCTION {format.chapter.pages} -{ chapter empty$ - 'format.pages - { type empty$ - { bbl.chapter } - { type "l" change.case$ - "type" bibinfo.check - } - if$ - chapter tie.or.space.prefix - "chapter" bibinfo.check - * * - pages empty$ - 'skip$ - { ", " * format.pages * } - if$ - } - if$ -} - -FUNCTION {format.booktitle} -{ - booktitle "booktitle" bibinfo.check -} -FUNCTION {format.in.ed.booktitle} -{ format.booktitle duplicate$ empty$ 'skip$ - { -% editor "editor" format.names.ed duplicate$ empty$ 'pop$ % changed by titto - editor "editor" format.names duplicate$ empty$ 'pop$ - { - " " * - get.bbl.editor -% capitalize - "(" swap$ * ") " * - * swap$ - * } - if$ - word.in swap$ * - } - if$ -} -FUNCTION {empty.misc.check} -{ author empty$ title empty$ howpublished empty$ - month empty$ year empty$ note empty$ - and and and and and - key empty$ not and - { "all relevant fields are empty in " cite$ * warning$ } - 'skip$ - if$ -} -FUNCTION {format.thesis.type} -{ type duplicate$ empty$ - 'pop$ - { swap$ pop$ - "t" change.case$ "type" bibinfo.check - } - if$ -} -FUNCTION {format.tr.number} -{ number "number" bibinfo.check - type duplicate$ empty$ - { pop$ bbl.techrep } - 'skip$ - if$ - "type" bibinfo.check - swap$ duplicate$ empty$ - { pop$ "t" change.case$ } - { tie.or.space.prefix * * } - if$ -} -FUNCTION {format.article.crossref} -{ - key duplicate$ empty$ - { pop$ - journal duplicate$ empty$ - { "need key or journal for " cite$ * " to crossref " * crossref * warning$ } - { "journal" bibinfo.check emphasize word.in swap$ * } - if$ - } - { word.in swap$ * " " *} - if$ - " \cite{" * crossref * "}" * -} -FUNCTION {format.crossref.editor} -{ editor #1 "{vv~}{ll}" format.name$ - "editor" bibinfo.check - editor num.names$ duplicate$ - #2 > - { pop$ - "editor" bibinfo.check - " " * bbl.etal - * - } - { #2 < - 'skip$ - { editor #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" = - { - "editor" bibinfo.check - " " * bbl.etal - * - } - { - bbl.and space.word - * editor #2 "{vv~}{ll}" format.name$ - "editor" bibinfo.check - * - } - if$ - } - if$ - } - if$ -} -FUNCTION {format.book.crossref} -{ volume duplicate$ empty$ - { "empty volume in " cite$ * "'s crossref of " * crossref * warning$ - pop$ word.in - } - { bbl.volume - capitalize - swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word * - } - if$ - editor empty$ - editor field.or.null author field.or.null = - or - { key empty$ - { series empty$ - { "need editor, key, or series for " cite$ * " to crossref " * - crossref * warning$ - "" * - } - { series emphasize * } - if$ - } - { key * } - if$ - } - { format.crossref.editor * } - if$ - " \cite{" * crossref * "}" * -} -FUNCTION {format.incoll.inproc.crossref} -{ - editor empty$ - editor field.or.null author field.or.null = - or - { key empty$ - { format.booktitle duplicate$ empty$ - { "need editor, key, or booktitle for " cite$ * " to crossref " * - crossref * warning$ - } - { word.in swap$ * } - if$ - } - { word.in key * " " *} - if$ - } - { word.in format.crossref.editor * " " *} - if$ - " \cite{" * crossref * "}" * -} -FUNCTION {format.org.or.pub} -{ 't := - "" - address 
empty$ t empty$ and - 'skip$ - { - t empty$ - { address "address" bibinfo.check * - } - { t * - address empty$ - 'skip$ - { ", " * address "address" bibinfo.check * } - if$ - } - if$ - } - if$ -} -FUNCTION {format.publisher.address} -{ publisher "publisher" bibinfo.warn format.org.or.pub -} - -FUNCTION {format.organization.address} -{ organization "organization" bibinfo.check format.org.or.pub -} - -FUNCTION {article} -{ output.bibitem - format.authors "author" output.check - add.colon - new.block - format.title "title" output.check - new.block - crossref missing$ - { - journal - "journal" bibinfo.check - "journal" output.check - add.blank - format.vol.num.pages output - format.date "year" output.check - } - { format.article.crossref output.nonnull - format.pages output - } - if$ -% new.block - format.url output -% new.block - format.note output - fin.entry -} -FUNCTION {book} -{ output.bibitem - author empty$ - { format.editors "author and editor" output.check - add.colon - } - { format.authors output.nonnull - add.colon - crossref missing$ - { "author and editor" editor either.or.check } - 'skip$ - if$ - } - if$ - new.block - format.btitle "title" output.check - crossref missing$ - { format.bvolume output - new.block - new.sentence - format.number.series output - format.publisher.address output - } - { - new.block - format.book.crossref output.nonnull - } - if$ - format.edition output - format.date "year" output.check -% new.block - format.url output -% new.block - format.note output - fin.entry -} -FUNCTION {booklet} -{ output.bibitem - format.authors output - add.colon - new.block - format.title "title" output.check - new.block - howpublished "howpublished" bibinfo.check output - address "address" bibinfo.check output - format.date output -% new.block - format.url output -% new.block - format.note output - fin.entry -} - -FUNCTION {inbook} -{ output.bibitem - author empty$ - { format.editors "author and editor" output.check - add.colon - } - { format.authors output.nonnull - add.colon - crossref missing$ - { "author and editor" editor either.or.check } - 'skip$ - if$ - } - if$ - new.block - format.btitle "title" output.check - crossref missing$ - { - format.bvolume output - format.chapter.pages "chapter and pages" output.check - new.block - new.sentence - format.number.series output - format.publisher.address output - } - { - format.chapter.pages "chapter and pages" output.check - new.block - format.book.crossref output.nonnull - } - if$ - format.edition output - format.date "year" output.check -% new.block - format.url output -% new.block - format.note output - fin.entry -} - -FUNCTION {incollection} -{ output.bibitem - format.authors "author" output.check - add.colon - new.block - format.title "title" output.check - new.block - crossref missing$ - { format.in.ed.booktitle "booktitle" output.check - format.bvolume output - format.chapter.pages output - new.sentence - format.number.series output - format.publisher.address output - format.edition output - format.date "year" output.check - } - { format.incoll.inproc.crossref output.nonnull - format.chapter.pages output - } - if$ -% new.block - format.url output -% new.block - format.note output - fin.entry -} -FUNCTION {inproceedings} -{ output.bibitem - format.authors "author" output.check - add.colon - new.block - format.title "title" output.check - new.block - crossref missing$ - { format.in.ed.booktitle "booktitle" output.check - new.sentence % added by titto - format.bvolume output - format.pages output - new.sentence - 
format.number.series output - publisher empty$ - { format.organization.address output } - { organization "organization" bibinfo.check output - format.publisher.address output - } - if$ - format.date "year" output.check - } - { format.incoll.inproc.crossref output.nonnull - format.pages output - } - if$ -% new.block - format.url output -% new.block - format.note output - fin.entry -} -FUNCTION {conference} { inproceedings } -FUNCTION {manual} -{ output.bibitem - author empty$ - { organization "organization" bibinfo.check - duplicate$ empty$ 'pop$ - { output - address "address" bibinfo.check output - } - if$ - } - { format.authors output.nonnull } - if$ - add.colon - new.block - format.btitle "title" output.check - author empty$ - { organization empty$ - { - address new.block.checka - address "address" bibinfo.check output - } - 'skip$ - if$ - } - { - organization address new.block.checkb - organization "organization" bibinfo.check output - address "address" bibinfo.check output - } - if$ - format.edition output - format.date output -% new.block - format.url output -% new.block - format.note output - fin.entry -} - -FUNCTION {mastersthesis} -{ output.bibitem - format.authors "author" output.check - add.colon - new.block - format.btitle - "title" output.check - new.block - bbl.mthesis format.thesis.type output.nonnull - school "school" bibinfo.warn output - address "address" bibinfo.check output - format.date "year" output.check -% new.block - format.url output -% new.block - format.note output - fin.entry -} - -FUNCTION {misc} -{ output.bibitem - format.authors output - add.colon - title howpublished new.block.checkb - format.title output - howpublished new.block.checka - howpublished "howpublished" bibinfo.check output - format.date output -% new.block - format.url output -% new.block - format.note output - fin.entry - empty.misc.check -} -FUNCTION {phdthesis} -{ output.bibitem - format.authors "author" output.check - add.colon - new.block - format.btitle - "title" output.check - new.block - bbl.phdthesis format.thesis.type output.nonnull - school "school" bibinfo.warn output - address "address" bibinfo.check output - format.date "year" output.check -% new.block - format.url output -% new.block - format.note output - fin.entry -} - -FUNCTION {proceedings} -{ output.bibitem - editor empty$ - { organization "organization" bibinfo.check output - } - { format.editors output.nonnull } - if$ - add.colon - new.block - format.btitle "title" output.check - format.bvolume output - editor empty$ - { publisher empty$ - { format.number.series output } - { - new.sentence - format.number.series output - format.publisher.address output - } - if$ - } - { publisher empty$ - { - new.sentence - format.number.series output - format.organization.address output } - { - new.sentence - format.number.series output - organization "organization" bibinfo.check output - format.publisher.address output - } - if$ - } - if$ - format.date "year" output.check -% new.block - format.url output -% new.block - format.note output - fin.entry -} - -FUNCTION {techreport} -{ output.bibitem - format.authors "author" output.check - add.colon - new.block - format.title - "title" output.check - new.block - format.tr.number output.nonnull - institution "institution" bibinfo.warn output - address "address" bibinfo.check output - format.date "year" output.check -% new.block - format.url output -% new.block - format.note output - fin.entry -} - -FUNCTION {unpublished} -{ output.bibitem - format.authors "author" output.check - add.colon - 
new.block - format.title "title" output.check - format.date output -% new.block - format.url output -% new.block - format.note "note" output.check - fin.entry -} - -FUNCTION {default.type} { misc } -READ -FUNCTION {sortify} -{ purify$ - "l" change.case$ -} -INTEGERS { len } -FUNCTION {chop.word} -{ 's := - 'len := - s #1 len substring$ = - { s len #1 + global.max$ substring$ } - 's - if$ -} -FUNCTION {sort.format.names} -{ 's := - #1 'nameptr := - "" - s num.names$ 'numnames := - numnames 'namesleft := - { namesleft #0 > } - { s nameptr - "{ll{ }}{ ff{ }}{ jj{ }}" - format.name$ 't := - nameptr #1 > - { - " " * - namesleft #1 = t "others" = and - { "zzzzz" * } - { t sortify * } - if$ - } - { t sortify * } - if$ - nameptr #1 + 'nameptr := - namesleft #1 - 'namesleft := - } - while$ -} - -FUNCTION {sort.format.title} -{ 't := - "A " #2 - "An " #3 - "The " #4 t chop.word - chop.word - chop.word - sortify - #1 global.max$ substring$ -} -FUNCTION {author.sort} -{ author empty$ - { key empty$ - { "to sort, need author or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { author sort.format.names } - if$ -} -FUNCTION {author.editor.sort} -{ author empty$ - { editor empty$ - { key empty$ - { "to sort, need author, editor, or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { editor sort.format.names } - if$ - } - { author sort.format.names } - if$ -} -FUNCTION {author.organization.sort} -{ author empty$ - { organization empty$ - { key empty$ - { "to sort, need author, organization, or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { "The " #4 organization chop.word sortify } - if$ - } - { author sort.format.names } - if$ -} -FUNCTION {editor.organization.sort} -{ editor empty$ - { organization empty$ - { key empty$ - { "to sort, need editor, organization, or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { "The " #4 organization chop.word sortify } - if$ - } - { editor sort.format.names } - if$ -} -FUNCTION {presort} -{ type$ "book" = - type$ "inbook" = - or - 'author.editor.sort - { type$ "proceedings" = - 'editor.organization.sort - { type$ "manual" = - 'author.organization.sort - 'author.sort - if$ - } - if$ - } - if$ - " " - * - year field.or.null sortify - * - " " - * - title field.or.null - sort.format.title - * - #1 entry.max$ substring$ - 'sort.key$ := -} -ITERATE {presort} -SORT -STRINGS { longest.label } -INTEGERS { number.label longest.label.width } -FUNCTION {initialize.longest.label} -{ "" 'longest.label := - #1 'number.label := - #0 'longest.label.width := -} -FUNCTION {longest.label.pass} -{ number.label int.to.str$ 'label := - number.label #1 + 'number.label := - label width$ longest.label.width > - { label 'longest.label := - label width$ 'longest.label.width := - } - 'skip$ - if$ -} -EXECUTE {initialize.longest.label} -ITERATE {longest.label.pass} -FUNCTION {begin.bib} -{ preamble$ empty$ - 'skip$ - { preamble$ write$ newline$ } - if$ - "\begin{thebibliography}{" longest.label * "}" * - write$ newline$ - "\providecommand{\url}[1]{\texttt{#1}}" - write$ newline$ - "\providecommand{\urlprefix}{URL }" - write$ newline$ -} -EXECUTE {begin.bib} -EXECUTE {init.state.consts} -ITERATE {call.type$} -FUNCTION {end.bib} -{ newline$ - "\end{thebibliography}" write$ newline$ -} -EXECUTE {end.bib} -%% End of customized bst file -%% -%% End of file `titto.bst'. 
-
-
diff --git a/doc/iwoph17/sprmindx.sty b/doc/iwoph17/sprmindx.sty
deleted file mode 100644
index 8f17772e10b7a3cb7e32740d71c8ec711b0c1638..0000000000000000000000000000000000000000
--- a/doc/iwoph17/sprmindx.sty
+++ /dev/null
@@ -1,4 +0,0 @@
-delim_0 "\\idxquad "
-delim_1 "\\idxquad "
-delim_2 "\\idxquad "
-delim_n ",\\,"
diff --git a/doc/iwoph17/t72b.tex b/doc/iwoph17/t72b.tex
deleted file mode 100644
index e18852dd52171e8de668a5ec2067510e85fd6e70..0000000000000000000000000000000000000000
--- a/doc/iwoph17/t72b.tex
+++ /dev/null
@@ -1,140 +0,0 @@
-\documentclass{llncs}
-\usepackage[table,xcdraw]{xcolor}
-\usepackage{hyperref}
-\usepackage{graphicx}
-\usepackage{tabularx}
-\definecolor{color-4}{rgb}{1,0,0}
-
-\begin{document}
-\title{PRACE European Benchmark Suite: Application Performances on Accelerators}
-\author{Victor Cameo Ponz\inst{1}
-\and Adem Tekin\inst{2}
-\and Alan Gray\inst{3}
-\and Andrew Emerson\inst{4}
-\and Andrew Sunderland\inst{5}
-\and Arno Proeme\inst{3}
-\and Charles Moulinec\inst{5}
-\and Dimitris Dellis\inst{6}
-% \and Fiona Reid\inst{3}
-\and Jacob Finkenrath\inst{7}
-% \and James Clark\inst{5}
-\and Janko Strassburg\inst{8}
-\and Jorge Rodriguez\inst{8}
-\and Martti Louhivuori\inst{9}
-\and Valeriu Codreanu\inst{10}}
-\institute{Centre Informatique National de l'Enseignement Supérieur (CINES), Montpellier, France% 1
-\and Istanbul Technical University (ITU), Istanbul, Turkey% 2
-\and Edinburgh Parallel Computing Centre (EPCC), Edinburgh, United Kingdom% 3
-\and Cineca, Bologna, Italy% 4
-\and Science and Technology Facilities Council (STFC) Daresbury Laboratory, Daresbury, United Kingdom% 5
-\and Greek Research and Technology Network (GRNET), Athens, Greece% 6
-\and Cyprus Institute (CyI), Nicosia, Cyprus% 7
-\and Barcelona Supercomputing Center (BSC), Barcelona, Spain% 8
-\and CSC – IT Center for Science Ltd, Helsinki, Finland% 9
-\and SurfSARA, Amsterdam, Netherlands} % 10
-\maketitle
-
-\begin{abstract}
-Increasing interest is being shown in the use of graphics and many-core processors in order to reach HPC exascale machines.
-Radically different hardware implies that codes --sometimes antique-- will have to evolve to remain efficient.
-This porting effort can be huge and obviously varies from one software package to another.
-People using HPC systems will have to choose among the software packages adapted to the hardware of their new cluster and may have to change their habits completely.
-Also, people buying machines need a good insight into the efficiency of each platform for their target community.
-
-This leads to the need for a performance overview of various software packages on different hardware stacks.
-It is a common need in the HPC field and benchmark suites have existed for a long time. The objective of PRACE with this document is to target leading-edge technologies such as OpenPOWER coupled with GPUs.
-It describes an accelerator benchmark suite, a set of 11 codes that includes 1 synthetic benchmark and 10 commonly used applications. It aims at providing a set of scalable, currently relevant and publicly available codes and datasets.
-For each code, two or more test case datasets have been selected. These are described in this document, along with a brief introduction to the application codes themselves. For each code, sample results on OpenPOWER+GPU are presented.
-
-\end{abstract}
-
-\section{Introduction\label{sec:intro}}
-
-The work presented here has been carried out as an extension of the Unified European Application Benchmark Suite (UEABS) \cite{ref-0017,ref-0018} for accelerators. This document covers each code, presenting the application as well as the test cases defined for the benchmarks and the results that have been recorded on the targeted systems.
-
-Like the UEABS, this suite aims to present results for the many scientific fields that can use accelerated HPC resources. Hence, it will help the European scientific communities decide which infrastructures to buy in the near future. We focus on Intel Xeon Phi coprocessors and NVIDIA GPU cards for benchmarking as they are the two most widespread accelerated resources available now.
-
-Section \ref{sec:hardware} presents an example architecture on which the codes ran. Section \ref{sec:codes} gives a description of each of the selected applications, together with the test case datasets, while section \ref{sec:results} presents the results. Section \ref{sec:conclusion} outlines further work on, and using, the suite.
-
-\section{Hardware Platform Available\label{sec:hardware}}
-
-This suite targets accelerator cards, more specifically the Intel Xeon Phi and NVIDIA GPU architectures. This section briefly describes the OpenPOWER system the benchmarks were run on.
-
-GENCI (Grand Equipement National de Calcul Intensif) granted access to the \emph{Ouessant} prototype at IDRIS in France (installed September 2016). It is composed of 12 IBM OpenPOWER Minsky compute nodes, each containing \cite{ref-0022}:
-
-\begin{itemize}
-\item Compute nodes
-\begin{itemize}
-\item 2 POWER8+ sockets, 10 cores per socket, 8 threads per core (or 160 threads per node)
-\item 128 GB of DDR4 memory (bandwidth 9 GB/s per core)
-\item 4 new-generation NVIDIA Pascal P100 GPUs, each with 16 GB of HBM2 memory
-\end{itemize}
-\item Interconnect
-\begin{itemize}
-\item 4 NVLink interconnects (40 GB/s of bi-directional bandwidth per interconnect); each GPU card is connected to a CPU with 2 NVLink interconnects and to another GPU with the 2 remaining interconnects
-\item A Mellanox EDR InfiniBand CAPI interconnect network (1 interconnect per node)
-\end{itemize}
-\end{itemize}
-
-\input{app.tex}
-\input{results.tex}
-
-\section{Conclusion and future work\label{sec:conclusion}}
-
-The work presented here stands as a first overview of application benchmarking on accelerators. Most codes have been selected from the main Unified European Application Benchmark Suite. This paper describes each of them as well as their implementation, their relevance to the European science community and their test cases. We have presented the results available on OpenPOWER systems.
-
-The suite will be publicly available on the PRACE CodeVault git repository \cite{ref-0036}, where links to download sources and test cases will be published along with compilation and run instructions.
-
-This task in PRACE 4IP started the design of a benchmark suite for accelerators. This work has been done with the aim of integrating it into the main UEABS so that both can be maintained and evolve together. As the PCP (PRACE-3IP) machines will soon be available, it will be very interesting to run the benchmark suite on them: first because these machines will be larger, but also because they will feature energy consumption probes.
-
-\section{Acknowledgements}
-This work was financially supported by the PRACE project funded in part by the EU's Horizon 2020 research
-and innovation programme (2014-2020) under grant agreement 653838.
-
-\begin{thebibliography}{} % (do not forget {})
-
-% \bibitem{ref-0016} PRACE web site -- \url{http://www.prace-ri.eu}
-
-\bibitem{ref-0017} The Unified European Application Benchmark Suite -- \url{http://www.prace-ri.eu/ueabs/}
-
-\bibitem{ref-0018} D7.4 Unified European Applications Benchmark Suite -- Mark Bull et al. (2013)
-
-% \bibitem{ref-0019} Accelerate productivity with the power of NVIDIA -- \url{http://www.nvidia.com/object/quadro-design-and-manufacturing.html}
-
-% \bibitem{ref-0020} Description of the Cartesius system -- \url{https://userinfo.surfsara.nl/systems/cartesius/description}
-
-% \bibitem{ref-0021} MareNostrum III User's Guide Barcelona Supercomputing Center -- \url{https://www.bsc.es/support/MareNostrum3-ug.pdf}
-
-\bibitem{ref-0022} Installation of an OpenPOWER prototype at IDRIS -- \url{http://www.idris.fr/eng/ouessant/}
-
-\bibitem{ref-0023} PFARM reference -- \url{https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm_long_lug.pdf}
-
-\bibitem{ref-0024} Solvent-Driven Preferential Association of Lignin with Regions of Crystalline Cellulose in Molecular Dynamics Simulation -- Benjamin Lindner et al. -- Biomacromolecules, 2013
-
-\bibitem{ref-0025} NAMD website -- \url{http://www.ks.uiuc.edu/Research/namd/}
-
-\bibitem{ref-0026} SHOC source repository -- \url{https://github.com/vetter/shoc}
-
-\bibitem{ref-0027} Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics -- R. Babich, M. Clark and B. Joo -- Supercomputing 2010 (SC 10)
-
-\bibitem{ref-0028} Lattice QCD on Intel Xeon Phi -- B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III -- International Supercomputing Conference (ISC'13), 2013
-
-\bibitem{ref-0029} Extension of fractional step techniques for incompressible flows: The preconditioned Orthomin(1) for the pressure Schur complement -- G. Houzeaux, R. Aubry, and M. V\'{a}zquez -- Computers \& Fluids, 44:297-313, 2011
-
-\bibitem{ref-0030} MIMD Lattice Computation (MILC) Collaboration -- \url{http://physics.indiana.edu/~sg/milc.html}
-
-\bibitem{ref-0031} targetDP -- \url{https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README}
-
-\bibitem{ref-0032} QUDA: A library for QCD on GPU -- \url{https://lattice.github.io/quda/}
-
-\bibitem{ref-0033} QPhiX, QCD for Intel Xeon Phi and Xeon processors -- \url{http://jeffersonlab.github.io/qphix/}
-
-\bibitem{ref-0034} KNC MaxFlops issue (both SP and DP) -- \url{https://github.com/vetter/shoc/issues/37}
-
-\bibitem{ref-0035} KNC SpMV issue -- \url{https://github.com/vetter/shoc/issues/24}, \url{https://github.com/vetter/shoc/issues/23}.
-
-\bibitem{ref-0036} PRACE Code Vault repository -- \url{https://gitlab.com/PRACE-4IP/CodeVault}.
-
-
-\end{thebibliography}
-\end{document}
-
diff --git a/doc/sphinx/4ip_extension.rst b/doc/sphinx/4ip_extension.rst
deleted file mode 100644
index af0089acdf8b13db2daa845e5ce52efe42d87448..0000000000000000000000000000000000000000
--- a/doc/sphinx/4ip_extension.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-HTTP 301: Page Moved Permanently
-================================
-
-This page no longer exists.
Either you were looking for :ref:`ms33` or for the -general :ref:`4ip_toc` diff --git a/doc/sphinx/Makefile b/doc/sphinx/Makefile deleted file mode 100644 index bb3d6aec128ebd52763f67ee377f9dd5791beba8..0000000000000000000000000000000000000000 --- a/doc/sphinx/Makefile +++ /dev/null @@ -1,20 +0,0 @@ -# Minimal makefile for Sphinx documentation -# - -# You can set these variables from the command line. -SPHINXOPTS = -SPHINXBUILD = python -msphinx -SPHINXPROJ = test -SOURCEDIR = . -BUILDDIR = ../build/sphinx/ - -# Put it first so that "make" without argument is like "make help". -help: - @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) - -.PHONY: help Makefile - -# Catch-all target: route all unknown targets to Sphinx using the new -# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). -%: Makefile - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/doc/sphinx/_static/.gitignore b/doc/sphinx/_static/.gitignore deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/doc/sphinx/conf.py b/doc/sphinx/conf.py deleted file mode 100644 index f4dd9c4fca1a87adbedaad6e59100d5cfb6a0e08..0000000000000000000000000000000000000000 --- a/doc/sphinx/conf.py +++ /dev/null @@ -1,161 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# -# UEABS for accelerators documentation build configuration file, created by -# sphinx-quickstart on Wed Jun 7 19:01:00 2017. -# -# This file is execfile()d with the current directory set to its -# containing dir. -# -# Note that not all possible configuration values are present in this -# autogenerated file. -# -# All configuration values have a default; values that are commented out -# serve to show the default. - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# -# import os -# import sys -# sys.path.insert(0, os.path.abspath('.')) - - -# -- General configuration ------------------------------------------------ - -# If your documentation needs a minimal Sphinx version, state it here. -# -# needs_sphinx = '1.0' - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [] - -# Add any paths that contain templates here, relative to this directory. -templates_path = ['_templates'] - -# The suffix(es) of source filenames. -# You can specify multiple suffix as a list of string: -# -# source_suffix = ['.rst', '.md'] -source_suffix = '.rst' - -# The master toctree document. -master_doc = 'index' - -# General information about the project. -project = 'UEABS for accelerators' -copyright = '2017, Victor Cameo Ponz' -author = 'Victor Cameo Ponz' - -# The version info for the project you're documenting, acts as replacement for -# |version| and |release|, also used in various other places throughout the -# built documents. -# -# The short X.Y version. -version = '' -# The full version, including alpha/beta/rc tags. -release = '' - -# The language for content autogenerated by Sphinx. Refer to documentation -# for a list of supported languages. -# -# This is also used if you do content translation via gettext catalogs. -# Usually you set "language" from the command line for these cases. 
-language = None - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -# This patterns also effect to html_static_path and html_extra_path -exclude_patterns = [] - -# The name of the Pygments (syntax highlighting) style to use. -pygments_style = 'sphinx' - -# If true, `todo` and `todoList` produce output, else they produce nothing. -todo_include_todos = False - - -# -- Options for HTML output ---------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -# -html_theme = 'theme' # use the theme in subdir 'theme' -html_theme_path = ['.'] # make sphinx search for themes in current dir - -# Theme options are theme-specific and customize the look and feel of a theme -# further. For a list of options available for each theme, see the -# documentation. -# -# html_theme_options = {} - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". -html_static_path = ['_static'] - - -# -- Options for HTMLHelp output ------------------------------------------ - -# Output file base name for HTML help builder. -htmlhelp_basename = 'UEABSforacceleratorsdoc' - -html_sidebars = { '**': ['localtoc.html', 'globaltoc.html', 'relations.html', 'sourcelink.html', 'searchbox.html'], } - - - -# -- Options for LaTeX output --------------------------------------------- - -latex_elements = { - # The paper size ('letterpaper' or 'a4paper'). - # - # 'papersize': 'letterpaper', - - # The font size ('10pt', '11pt' or '12pt'). - # - # 'pointsize': '10pt', - - # Additional stuff for the LaTeX preamble. - # - # 'preamble': '', - - # Latex figure (float) alignment - # - # 'figure_align': 'htbp', -} - -# Grouping the document tree into LaTeX files. List of tuples -# (source start file, target name, title, -# author, documentclass [howto, manual, or own class]). -latex_documents = [ - (master_doc, 'UEABSforaccelerators.tex', 'UEABS for accelerators Documentation', - 'Victor Cameo Ponz', 'manual'), -] - - -# -- Options for manual page output --------------------------------------- - -# One entry per manual page. List of tuples -# (source start file, name, description, authors, manual section). -man_pages = [ - (master_doc, 'ueabsforaccelerators', 'UEABS for accelerators Documentation', - [author], 1) -] - - -# -- Options for Texinfo output ------------------------------------------- - -# Grouping the document tree into Texinfo files. List of tuples -# (source start file, target name, title, author, -# dir menu entry, description, category) -texinfo_documents = [ - (master_doc, 'UEABSforaccelerators', 'UEABS for accelerators Documentation', - author, 'UEABSforaccelerators', 'One line description of project.', - 'Miscellaneous'), -] - - - diff --git a/doc/sphinx/d77.rst b/doc/sphinx/d77.rst deleted file mode 100644 index 100fe00f63f675c5dbf325b74bea6f49c5ff1355..0000000000000000000000000000000000000000 --- a/doc/sphinx/d77.rst +++ /dev/null @@ -1,278 +0,0 @@ -.. _d77: - -Deliverable 7.7: Performance and energy metrics on PCP systems -============================================================== - -Executive Summary -***************** - -This document describes efforts deployed in order to exploit PRACE Pre-Commercial Procurement (PCP) machines. 
It aims at giving an overview of what can be done in terms of performance and energy analysis on these prototypes. The key focus has been a general study using the PRACE Unified European Application Benchmark Suite (UEABS) and a more detailed case study porting a solver stack using cutting-edge tools.
-
-This work has been undertaken by the 4IP-extension task "Performance and energy metrics on PCP systems", which is a follow-up of Task 7.2B "Accelerators benchmarks" in the PRACE Fourth Implementation Phase (4IP).
-
-It also heads in the direction of Task 7.3 in 5IP, which aims to merge the PRACE accelerated and standard benchmark suites, as codes of the latter have been run on accelerators in this task.
-
-As a result, ALYA, Code_Saturne, CP2K, GPAW, GROMACS, NAMD, PFARM, QCD, Quantum Espresso, SHOC and Specfem3D_Globe (already ported to accelerators) and GADGET and NEMO (newly ported) have been selected to run on Intel KNL and NVIDIA GPU to give an overview of performance and energy measurements.
-
-Also, the HORSE+MaPHyS+PaStiX solver stack has been selected to be ported to Intel KNL. Focus here has been given to performing an energy profiling of these codes and to studying the influence of several parameters driving the accuracy and numerical efficiency of the underlying simulations.
-
-Introduction
-************
-
-The work produced within this task is driven by the delivery of the PRACE PCP machines. It aims at giving manufacturer-independent performance and energy metrics for future exascale systems. It is also an opportunity to explore and test the cutting-edge energy hardware stack and tools developed within the scope of PCP.
-
-As stated in Milestone 33, this document presents metrics for selected codes from the UEABS. This shows results covering many fields relevant to European scientific communities. It also goes deeper into the porting and energy profiling activities, using the HORSE+MaPHyS+PaStiX solver stack as an example.
-
-Section :ref:`d77_cluster_specs` details the hardware and software specifications on which the metrics have been collected. In section :ref:`d77_ueabs_metrics` the metrics for the UEABS are brought together. The work on porting and energy profiling is presented in section :ref:`d77_port_profile`. Section :ref:`d77_conclusion` concludes and outlines further work on the PCP prototypes.
-
-.. _d77_cluster_specs:
-
-Cluster specifications and access
-**********************************
-
-The PRACE PCP project includes three different prototypes, using respectively Xeon Phi, GPU and FPGA. The first two architectures are becoming more and more common in HPC infrastructures, making the energy stack the main innovation. In contrast, the last architecture is brand new in this field, making it harder to get familiar with.
-
-As explained in section :ref:`d77_machine_access`, tight deadlines did not leave enough time to produce relevant metrics on the FPGA cluster. This is why only the GPU and KNL prototypes are presented here.
-
-.. _d77_machine_access:
-
-Access to machines
-^^^^^^^^^^^^^^^^^^
-
-Working with prototypes can be painful in terms of project management and meeting deadlines. This section gives feedback on accessing the hardware and software stack.
-
-The timeline_ outlines the initial tight deadlines for this project. It also shows that access to the machines was only possible quite late in the phase planned for running the codes.
-
-.. _timeline:
-
-.. figure:: /deliverable_d7.7/timeline.png
-
-   4IP-extension project timeline. On top of the figure are printed the period names and, at the bottom, key dates. Periods in grey stand for task preparation, periods in blue stand for document writing and periods in green stand for technical work.
-
-The table :ref:`table-pcp-systems-access` shows the precise timeline. On top of these delays, some technical interruptions occurred right at the end of the running phase, which did not help with the writing of this document:
-
-**PCP-KNL:**
-
- - closed from 22 November to 4 December
- - login node down from the 5th to the 7th of December
- - energy metrics tools down from the 5th to the 12th of December
-
-**DAVIDE-GPU:**
-
- - Slurm not working from the 6th to the 11th of December
- - energy metrics tools *randomly* not working during the beginning of December
-
-.. _table-pcp-systems-access:
-.. table:: PCP Systems access dates
-   :widths: auto
-
-   +-----------------------+------------------+-----------------+------------------+
-   |                       | KNL              | GPU             | FPGA             |
-   +=======================+==================+=================+==================+
-   | Envisioned            | June 2017        | July 2017       | August 2017      |
-   +-----------------------+------------------+-----------------+------------------+
-   | Actual access         | 1 September 2017 | 16 October 2017 | 2 November 2017  |
-   +-----------------------+------------------+-----------------+------------------+
-   | Access to energy stack| 6 October 2017   | 8 November 2017 | /                |
-   +-----------------------+------------------+-----------------+------------------+
-
-
-.. include:: /pcp_systems/e4_gpu.rst
-
-.. include:: /pcp_systems/atos_knl.rst
-
-
-.. _d77_ueabs_metrics:
-
-Performances and energy metrics of UEABS on PCP systems
-*******************************************************
-
-This section presents results of the UEABS on both the GPU and KNL systems. This benchmark suite is made of two sets of codes that complement each other: the former is meant to be run on standard CPUs and the latter has been ported to accelerators. The accelerated suite is described in the PRACE 4IP Deliverable 7.5 and the standard suite is described on the PRACE UEABS official webpage.
-
-The metrics exhibited systematically will be time to solution and energy to solution. This choice ensures that the exact same computation is measured. Indeed, some codes feature specific performance metrics, e.g. not considering warm-up and teardown phases. Such metrics are not biased, and small benchmark test cases can then give more information about hypothetical production runs. Unfortunately, no equivalent mechanism is available yet for energy, and these metrics will be shown as *side metrics*.
-
-In order to be comparable between machines, the :code:`Cumulative (all nodes) Total energy (J)` counter has been selected for the GPU machine, and :code:`nodes.energy` has been selected for the KNL prototype. Both measure full-node consumption in Joules.
-
-Each code will be presented along with the full set of metrics. The section ends with a recap chart with one line of metrics picked for its relevance.
-
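-As an illustration, the quantities compared in the recap chart (speedup normalised to a reference run, energy to solution and average power) can be derived from the raw walltime and full-node energy as in the sketch below; the run names and numbers are placeholders, not measured values:
-
-.. code-block:: python
-
-   # Illustrative sketch only: derive speedup, energy to solution and average
-   # power from raw time to solution (s) and full-node energy (J).
-   # All values below are placeholders, not measurements.
-   runs = {
-       "KNL, 16 nodes": {"walltime_s": 1200.0, "energy_j": 4.2e6},
-       "GPU, 4 nodes": {"walltime_s": 900.0, "energy_j": 3.6e6},
-   }
-   reference = "KNL, 16 nodes"  # speedup is normalised to this run
-
-   for name, run in runs.items():
-       speedup = runs[reference]["walltime_s"] / run["walltime_s"]
-       avg_power_w = run["energy_j"] / run["walltime_s"]
-       print(f"{name}: speedup = {speedup:.2f}, "
-             f"energy to solution = {run['energy_j'] / 1e6:.2f} MJ, "
-             f"average power = {avg_power_w / 1e3:.2f} kW")
-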
-
-ALYA
-^^^^
-
-Alya is a high performance computational mechanics code that can solve different coupled mechanics problems.
-
-The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available, without and with a colouring strategy to avoid ATOMICs during the assembly step. A CUDA version is also available for the different solvers.
-
-
-Code_Saturne
-^^^^^^^^^^^^
-Code_Saturne is a CFD software package developed by EDF R&D since 1997 and open-source since 2007.
-
-Parallelism is handled by distributing the domain over the processors. Communications between subdomains are handled by MPI. Hybrid parallelism using MPI/OpenMP has recently been optimised for improved multicore performance. PETSc has recently been linked to the code to offer alternatives to the internal solvers for computing the pressure, and it supports CUDA.
-
-CP2K
-^^^^
-
-CP2K is a quantum chemistry and solid state physics software package.
-
-Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI. Offloading for accelerators is implemented through CUDA.
-
-
-GADGET
-^^^^^^
-
-GPAW
-^^^^
-
-GPAW is a DFT program for ab-initio electronic structure calculations using the projector augmented wave method.
-
-GPAW is written mostly in Python, but it also includes computational kernels written in C and leverages external libraries such as NumPy, BLAS and ScaLAPACK. Offloading to accelerators is supported through either CUDA (for GPUs) or pyMIC (for Intel Xeon Phi).
-
-
-GROMACS
-^^^^^^^
-
-GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.
-
-Parallelisation is achieved using combined OpenMP and MPI. Offloading for accelerators is implemented through CUDA for GPU and through OpenMP for MIC (Intel Xeon Phi).
-
-NAMD
-^^^^
-
-NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms.
-
-It is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI.
-
-NEMO
-^^^^
-
-PFARM
-^^^^^
-
-PFARM is part of a suite of programs based on the ‘R-matrix’ ab-initio approach to the variational solution of the many-electron Schrödinger equation for electron-atom and electron-ion scattering.
-
-It is parallelised using hybrid MPI/OpenMP and CUDA offloading to GPU.
-
-QCD
-^^^
-
-Quantum Espresso
-^^^^^^^^^^^^^^^^
-
-QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials.
-
-It is implemented using MPI and CUDA offloading to GPU.
-
-SHOC
-^^^^
-
-The Accelerator Benchmark Suite will also include a series of synthetic benchmarks.
-
-SHOC is written in C++ and is MPI-based. Offloading for accelerators is implemented through CUDA and OpenCL for GPU.
-
-
-Specfem3D_Globe
-^^^^^^^^^^^^^^^
-
-The software package SPECFEM3D_Globe simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method.
-
-It is written in Fortran and uses MPI combined with OpenMP to achieve parallelisation.
-
-
-Wrap-up table
-^^^^^^^^^^^^^
-
-
-Here is the envisioned run table from Milestone 33:
-
-.. _table-wrapup-result:
-.. 
table:: Code definition - :widths: auto - - +-------------------+------+-----------------------------+------------------+-------------------------------+ - | | Test | Power8 + GPU | Xeon Phi | | - | Code name | case +-----+-----------+-----------+---------+--------+ 4IP-extension BCO + - | | # | N # | | | | | | - +===================+======+=====+===========+===========+=========+========+===============================+ - | | 1 | | ✓ | | ✓ | | Ricard Borrell (BSC) | - + ALYA +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Ricard Borrell (BSC) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | Charles Moulinec (STFC) | - + Code_Saturne +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Charles Moulinec (STFC) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | Arno Proeme (EPCC) | - + CP2K +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Arno Proeme (EPCC) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✗ | | ✓ | | Volker Weinberg (LRZ) | - + GADGET +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✗ | | ✓ | | Volker Weinberg (LRZ) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✗ | | ✓ | | Martti Louhivuori (CSC) | - + GPAW +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✗ | | ✓ | | Martti Louhivuori (CSC) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | Dimitris Dellis (GRNET) | - + GROMACS +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Dimitris Dellis (GRNET) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | Dimitris Dellis (GRNET) | - + NAMD +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Dimitris Dellis (GRNET) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✗ | | ✓ | | Arno Proeme (EPCC) | - + NEMO +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✗ | | ✓ | | Arno Proeme (EPCC) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | Mariusz Uchronski (WCNS/PSNC) | - + PFARM +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Mariusz Uchronski (WCNS/PSNC) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | Jacob Finkenrath (CyI) | - + QCD +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Jacob Finkenrath (CyI) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | 
Andrew Emerson (CINECA) | - + Quantum Espresso +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Andrew Emerson (CINECA) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✗ | | Valeriu Codreanu (SurfSARA) | - + SHOC +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✗ | | Valeriu Codreanu (SurfSARA) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 1 | | ✓ | | ✓ | | Victor Cameo Ponz (CINES) | - + Specfem3D_Globe +------+-----+-----------+-----------+---------+--------+-------------------------------+ - | | 2 | | ✓ | | ✓ | | Victor Cameo Ponz (CINES) | - +-------------------+------+-----+-----------+-----------+---------+--------+-------------------------------+ - - - -.. _d77_port_profile: - -Energetic Analysis of a Solver Stack for Frequency-Domain Electromagnetics -************************************************************************** - -Numerical approach -^^^^^^^^^^^^^^^^^^ - -Simulation software -^^^^^^^^^^^^^^^^^^^ - -MaPHyS algebraic solver -^^^^^^^^^^^^^^^^^^^^^^^ - -Numerical and performance results -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -MaPHyS used in standalone mode -"""""""""""""""""""""""""""""" - -Scattering of a plane wave by a PEC sphere -"""""""""""""""""""""""""""""""""""""""""" - -.. _d77_conclusion: - -Conclusion -********** diff --git a/doc/sphinx/deliverable_d7.7/D77_0.2.docx b/doc/sphinx/deliverable_d7.7/D77_0.2.docx deleted file mode 100644 index 0e4a4d7c5004969f049bc26b8f93c0d2db51b97f..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/D77_0.2.docx and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/D77_1.0.docx b/doc/sphinx/deliverable_d7.7/D77_1.0.docx deleted file mode 100644 index 37958267b56476d1f829263891dfbd6558ec47d5..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/D77_1.0.docx and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/Tables.numbers b/doc/sphinx/deliverable_d7.7/Tables.numbers deleted file mode 100644 index 26976bae6b60e3be3307e688fa3879d73afa66ed..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/Tables.numbers and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/WP7_PCP_Inria.docx b/doc/sphinx/deliverable_d7.7/WP7_PCP_Inria.docx deleted file mode 100644 index 9d4feae6c9dac206bb4e77b1263870b55439c920..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/WP7_PCP_Inria.docx and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/d77_0.1.docx b/doc/sphinx/deliverable_d7.7/d77_0.1.docx deleted file mode 100644 index 4a8afdf6baf12f96d627419136d8f3d0235fec82..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/d77_0.1.docx and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/deliverable_uabs_metrics.ods b/doc/sphinx/deliverable_d7.7/deliverable_uabs_metrics.ods deleted file mode 100644 index 4d3589b38e1570255649a6d19b7e5fe8be83e126..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/deliverable_uabs_metrics.ods and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/deliverable_uabs_metrics.xls b/doc/sphinx/deliverable_d7.7/deliverable_uabs_metrics.xls deleted file mode 100644 index 
e567fb895606e656b8ce12f6bd6deecb2b211a10..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/deliverable_uabs_metrics.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/mistake_in_template.txt b/doc/sphinx/deliverable_d7.7/mistake_in_template.txt deleted file mode 100644 index e5fc3bbc3d006f124a70b92f3b82ac72c9720656..0000000000000000000000000000000000000000 --- a/doc/sphinx/deliverable_d7.7/mistake_in_template.txt +++ /dev/null @@ -1,3 +0,0 @@ -mistake in template: -TFlop/s Tera (= 1012) -> 10^12 -TB Tera (= 240 ~ 1012) -> 2^40 & 10^12 diff --git a/doc/sphinx/deliverable_d7.7/results/.~lock.spefem_knl_gpu_testcase1_2.xls# b/doc/sphinx/deliverable_d7.7/results/.~lock.spefem_knl_gpu_testcase1_2.xls# deleted file mode 100644 index 928ebada56ae77b6d98923dbb1663c4cc2c39aaa..0000000000000000000000000000000000000000 --- a/doc/sphinx/deliverable_d7.7/results/.~lock.spefem_knl_gpu_testcase1_2.xls# +++ /dev/null @@ -1 +0,0 @@ -,cameo,pc193,13.12.2017 11:26,file:///home/cameo/.config/libreoffice/4; \ No newline at end of file diff --git a/doc/sphinx/deliverable_d7.7/results/alya_gpu_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/alya_gpu_testcase1_2.xls deleted file mode 100644 index 8161b25f2b0f5d71ad870c32c40a521cde70d967..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/alya_gpu_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/alya_knl_gpu_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/alya_knl_gpu_testcase1_2.xls deleted file mode 100644 index ee3b1b2712813084f02bbade4f545755980de449..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/alya_knl_gpu_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/code_saturn_knl_testcase2.xls b/doc/sphinx/deliverable_d7.7/results/code_saturn_knl_testcase2.xls deleted file mode 100644 index 8d30095efae6b43997056c6fa59428093e222246..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/code_saturn_knl_testcase2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/code_saturn_metrics_DAVIDE_PCP_testcase1.xls b/doc/sphinx/deliverable_d7.7/results/code_saturn_metrics_DAVIDE_PCP_testcase1.xls deleted file mode 100644 index d46289a0dfc4b9cac34d69170a71b2942806924a..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/code_saturn_metrics_DAVIDE_PCP_testcase1.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/cp2k_knl_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/cp2k_knl_testcase1_2.xls deleted file mode 100644 index d51e07082099baa4ffafe19a173ac4f8c297f7e5..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/cp2k_knl_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/gadget_knl_testcase1.xls b/doc/sphinx/deliverable_d7.7/results/gadget_knl_testcase1.xls deleted file mode 100644 index 09fe5b3a9defcaf51b913fb8f231078beff74c7f..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/gadget_knl_testcase1.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/gpaw_knl_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/gpaw_knl_testcase1_2.xls deleted file mode 100644 index 78a179a6e46543ab25773520f3c3268df50cba88..0000000000000000000000000000000000000000 Binary files 
a/doc/sphinx/deliverable_d7.7/results/gpaw_knl_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/gromacs_gpu_knl_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/gromacs_gpu_knl_testcase1_2.xls deleted file mode 100644 index 17c56262fe1b833bca7c0f36ad16b1ed936af06e..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/gromacs_gpu_knl_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/namd_gpu_knl_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/namd_gpu_knl_testcase1_2.xls deleted file mode 100644 index b87357b939aba567ce7907b6a7fbda1f9616cd20..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/namd_gpu_knl_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/pfarm_knl_gpu_testcase1.xls b/doc/sphinx/deliverable_d7.7/results/pfarm_knl_gpu_testcase1.xls deleted file mode 100644 index 81ce4ba3a64c2948b87062f6ba98e79d5d375639..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/pfarm_knl_gpu_testcase1.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/qcd1_2_gpu_knl_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/qcd1_2_gpu_knl_testcase1_2.xls deleted file mode 100644 index 00356840aa42b6f19827118cbe69e0ce34d785f6..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/qcd1_2_gpu_knl_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/qe_knl_gpu_testcase_1_2.xls b/doc/sphinx/deliverable_d7.7/results/qe_knl_gpu_testcase_1_2.xls deleted file mode 100644 index 3f761c4194e7fc00ac846a7cbd9dba1b939b7e46..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/qe_knl_gpu_testcase_1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/shoc_gpu_full_res.xlsx b/doc/sphinx/deliverable_d7.7/results/shoc_gpu_full_res.xlsx deleted file mode 100644 index dc97890b1ffe224e8726c44feb8d8bf8ff9e3366..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/shoc_gpu_full_res.xlsx and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/shoc_gpu_testcase_1_2.xls b/doc/sphinx/deliverable_d7.7/results/shoc_gpu_testcase_1_2.xls deleted file mode 100644 index bf414b262b0eb068d3f65f1502f7b8f602a4f158..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/shoc_gpu_testcase_1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/spefem_knl_gpu_testcase1_2.xls b/doc/sphinx/deliverable_d7.7/results/spefem_knl_gpu_testcase1_2.xls deleted file mode 100644 index b5177fdb947cca672b5c8b130a93af95286c7657..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/spefem_knl_gpu_testcase1_2.xls and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/results/wrapuptable.xlsx b/doc/sphinx/deliverable_d7.7/results/wrapuptable.xlsx deleted file mode 100644 index 85519bb0cfdcc44180272c6fa40d2fec26beb4d9..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/results/wrapuptable.xlsx and /dev/null differ diff --git a/doc/sphinx/deliverable_d7.7/template.docx b/doc/sphinx/deliverable_d7.7/template.docx deleted file mode 100644 index 4f3a04c4ee453cedf55f0502dc7b51b1224775f5..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/deliverable_d7.7/template.docx and 
/dev/null differ
diff --git a/doc/sphinx/deliverable_d7.7/timeline.png b/doc/sphinx/deliverable_d7.7/timeline.png
deleted file mode 100644
index 2d4bb3fa00677f8ffb61e6b26dfaf00f441a9d8b..0000000000000000000000000000000000000000
Binary files a/doc/sphinx/deliverable_d7.7/timeline.png and /dev/null differ
diff --git a/doc/sphinx/f2f_cyprus/4ip-extention-overview.key b/doc/sphinx/f2f_cyprus/4ip-extention-overview.key
deleted file mode 100644
index b3fd67fa9c0f16efff619259af3da6cd6484ac2f..0000000000000000000000000000000000000000
Binary files a/doc/sphinx/f2f_cyprus/4ip-extention-overview.key and /dev/null differ
diff --git a/doc/sphinx/f2f_cyprus/4ip-extention-session-wrapup.key b/doc/sphinx/f2f_cyprus/4ip-extention-session-wrapup.key
deleted file mode 100644
index 8e24b11bb80e6d173a7a20e598c10e7f7a1a8f1d..0000000000000000000000000000000000000000
Binary files a/doc/sphinx/f2f_cyprus/4ip-extention-session-wrapup.key and /dev/null differ
diff --git a/doc/sphinx/f2f_cyprus/4ip-extention-session.key b/doc/sphinx/f2f_cyprus/4ip-extention-session.key
deleted file mode 100644
index c0270bbf549994571848aa631bcd08677c406d47..0000000000000000000000000000000000000000
Binary files a/doc/sphinx/f2f_cyprus/4ip-extention-session.key and /dev/null differ
diff --git a/doc/sphinx/f2f_cyprus/4ip-extention-specfem3D_globe.key b/doc/sphinx/f2f_cyprus/4ip-extention-specfem3D_globe.key
deleted file mode 100644
index c3ed8b4d201ae975bc3269b4142bd188f9f4b34e..0000000000000000000000000000000000000000
Binary files a/doc/sphinx/f2f_cyprus/4ip-extention-specfem3D_globe.key and /dev/null differ
diff --git a/doc/sphinx/f2f_cyprus/draft.rst b/doc/sphinx/f2f_cyprus/draft.rst
deleted file mode 100644
index ad915cc6e1289767bf51bca055b78ce64f2c4cee..0000000000000000000000000000000000000000
--- a/doc/sphinx/f2f_cyprus/draft.rst
+++ /dev/null
@@ -1,129 +0,0 @@
-F2F in Cyprus draft presentations
-=================================
-
-This is a WIP presentation intended to be given at the Cyprus F2F.
-
-UEABS all-hands presentation
-****************************
-4IP project, what's inside?:
- - Performances and energy metrics on PCP systems
- - UEABS on PCP machines
- - Porting to KNL + perf and energy analysis of the HORSE+MaPHyS+PaStiX stack
-
-A project with tight deadlines:
- - timeline
- - wrap-up: machine access dates, planned vs actual
-
-UEABS on PCP machines:
- - plan: codes/machines/BCOs
- - MISSING: check the table!!!
-PCP-GPU
-
-PCP-KNL
-
-PCP-FPGA
-
-Session at ???
- - 1. deliverable quick talk and brainstorming
- - 2. current work overview with the following presentations (cf. attendance table)
-
-
-UEABS session presentation
-**************************
-
-deliverable: comments/questions
- - 4IP-extension project intro
- - Machine specs
 - - UEABS
 - - intro: ref to previous D7.5
 - - each code
 - - what's new
 - - few comments on perf/energy/scalability
 - - wrap-up table
- - HORSE+MaPHyS+PaStiX stack
- - [...] 
- - Conclusion
-
-UEABS part:
- - results tables: comments/questions
- - screenshot of the table
- - each code presentation
- - keep clear what's new from the UEABS point of view, not what's new just because you're new to the code
-
-Code presentations:
- - Specfem
- - the others
-
-
-UEABS Specfem3D_Globe presentation
-***********************************
-
-Quick prez
- - Fortran
- - X lines of code
- - seismic code
-Code version change on KNL
- - versioning is quite difficult
- - previously used an Intel-modified version based on version 7.0
- - Intel didn't release the code publicly, so it's still not officially available
- - decided to move back to the official version, git rev = ??
-KNL results
- - lost ~ x5 compared to the Intel version, which is pretty big
- - perf & energy table
- - WIP: looking at Atos results with the UEABS test case
-GPU
- - WIP: compilation
- - no version change
-
-
-UEABS Wrapup Session
-********************
-
-missing KNL runs: Specfem3D_Globe (some further runs needed), Code_Saturne, Ricard (ALYA), Jacob (QCD), Martti (GPAW), Arno (CP2K & NEMO), Mariusz (PFARM)
-possibility to run on Frioul (almost the same environment, home filesystem shared, account activation from the PCP one should be really easy)
-
-more and updated information / MoM of telcons on the GitLab website
-next step is the deliverable -> worked on deadlines for contributions
-Talked quite a bit about that. Lots about which figures should be included and the way they are presented.
-
-BCO presentations took up most of the UEABS session
-good presentations, thank you to all speakers. Please post your presentation to the BSCW space:
-thanks to everyone who participated
-
-
-
-
-
-
-UEABS session MoM
-*****************
-
-correct slide EUABS/UEABS
-
-Deliverable
-time & energy to solution + one column on "performance"
-update which metric should be taken from the energy tool on the website
-Showing total energy only is a pity because the PCP systems were designed to offer more than that. In any case, benchmarking is about showing simple figures; it is hard to analyse metrics from multiple machines/energy stacks.
-walltime vs node consumption not good!! GFLOPs/energy, max energy envelope
-walltime not good because differences are not visible -> use speedup normalised to 1.
-update machine specs
-
-Residency not inside the cost figure
-
-QE:
- - not the 2nd UEABS test case
-
-Speedup vs cost: pretty interesting plots
-
-
-PFARM
-metrics shown are for the same test case, right? just the number of nodes changing? you should present speedup too
-
-QCD
- KNL NUMA binding -> cache mode has a big impact.
-
-
-Best Practice Guide session
-===========================
-Looking for volunteers
-latest guide updates
diff --git a/doc/sphinx/index.rst b/doc/sphinx/index.rst
deleted file mode 100644
index 5117ae85905f53b99a0ab41b363967b67129d130..0000000000000000000000000000000000000000
--- a/doc/sphinx/index.rst
+++ /dev/null
@@ -1,38 +0,0 @@
-.. UEABS for accelerators documentation master file, created by
-   sphinx-quickstart on Wed Jun 7 19:01:00 2017.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
-
-.. _4ip_toc:
-
-Welcome to 4IP-extension documentation!
-=======================================
-
-The organisation of the 4IP-extension is publicly available to anyone interested in the project and relies on the following tools:
- * a Slack channel for chat purposes: `PRACE PCP slack channel`_, ask `Victor Cameo Ponz`_ for the registration link.
 * a mailing list: prace-4ip-wp7.extension@fz-juelich.de, subscribe here: `mailing list registration page`_. 
- * this documentation. - -The following timeline will be followed to lead this task: - #. white paper indicating which applications will be run on which prototypes (meeting the PRACE milestone **MS33 due by August 2017/M4**) - #. grant access to machines (cut-off September 2017) - #. run codes (cut-off October/November) - #. gather results and report *Applications Performance and Energy Usage* (this addresses the PRACE deliverable **D7.7 due by December 2017/M8**) - - -.. toctree:: - :maxdepth: 2 - :caption: General Table of Contents: - :glob: - - ms33 - pcp_systems - d77 - mom_telcon/index - - -.. _Victor Cameo Ponz: cameo+4ip-extension@cines.fr -.. _PRACE PCP slack channel: https://prace-pcp.slack.com -.. _mailing list registration page: https://lists.fz-juelich.de/mailman/listinfo/prace-4ip-wp7.extension - -
diff --git a/doc/sphinx/mom_telcon/2017-09-12.rst b/doc/sphinx/mom_telcon/2017-09-12.rst deleted file mode 100644 index 4a6df60cfbc31ff9527a6dd4dd085c32492569c0..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/2017-09-12.rst +++ /dev/null @@ -1,115 +0,0 @@ -TelCon 12 of September -====================== - -Apologies: ---------- - - Jacob Finkenrath (CyI) - - Ricard Borrell (BSC) - - Volker Weinberg (LRZ) - - Dimitris Dellis (GRNET) - - Valeriu Codreanu (SurfSARA) - - Charles Moulinec (STFC) - -Present: -------- - - Andrew Emerson (CINECA) - - Arno Proeme (EPCC) - - Dimitris Dellis (GRNET) - - Luigi Iapichino (LRZ) - - Mariusz Uchronski (WCNS/PSNC) - - Martti Louhivuori (CSC) - - Sebastian Lührs (JUELICH) - - Victor Cameo Ponz (CINES) - - -Minutes of meeting: ------------------- -Update on machine availability -****************************** - - Atos/KNL available, energy software stack not ready yet - - E4/GPU should be available by mid-September. No news since the beginning of September. - - Maxeler/FPGA: a mistake in the last update: the machine should be available by mid-October, NOT mid-September. This seriously compromises running on this machine and confirms that we should focus on the two others - -Login procedures -**************** - - Atos/KNL: WIP, almost all BCOs have sent their applications and accounts are opened (about a 48h delay) - - E4/GPU: I'm waiting for input from Carlo Cavazzoni. - - Maxeler/FPGA: BCOs will have to register through an online procedure and then send the public part of an SSH key to Dirk Pleiter - - -Basics for energy measurement on the BULL/Atos KNL machine -*********************************************************** - -There are two kinds of energy metrics that can be collected on the Atos machine. - -1. Bull Energy Optimiser (BEO) is an admin-oriented tool that provides energy metrics at switch and node level. This tool won't be operated by final users, and metrics will be available through a wrapper not yet defined (Slurm accounting, or a terminal command). It will give an overview metric of a job. - -2. HDEEVIZ framework: provides metrics at a finer level, i.e. DRAM, CPU and IO. To use it we'll just have to load a module corresponding to the MPI library used to compile, and add a command at the beginning of the mpirun line. Access to the metrics is then done through a Grafana web interface. - -Figures that will be included in the final deliverable (performances & energy related) -************************************************************************************** - - - Performance relative to the peak performance allocated to the run, 
ie theorical FLOPS for the allocated - numbers of GPU and or CPUs - - overall Power consumed by each testcases at high level ie at node and swith level. This is the best metric - we can imagine to have in common on all the clusters. - -Questions, concerns and report from BCOs -**************************************** -Everything about login opening, deadlines, expected work during the next period, others: - - +------------------------+-------------------------------+ - | ALYA | KNL form sended, waiting for | - | | effort confirmation | - +------------------------+-------------------------------+ - | Code_Saturne | Connection should be OK | - +------------------------+-------------------------------+ - | CP2K | KNL access OK, WIP building | - | | and running | - +------------------------+-------------------------------+ - | GADGET | waiting for knl access | - +------------------------+-------------------------------+ - | GPAW | KNL access OK | - +------------------------+-------------------------------+ - | GROMACS | KNL acces OK, compiled, | - | | begining runs | - +------------------------+-------------------------------+ - | NAMD | KNL acces OK, compiled, | - | | begining runs | - +------------------------+-------------------------------+ - | NEMO | Focusing on CP2K for now | - +------------------------+-------------------------------+ - | PFARM | connection KNL broken | - +------------------------+-------------------------------+ - | QCD | Made a first test which QCD | - | | part 1 and I could compiled | - | | and run it on KNL. | - +------------------------+-------------------------------+ - | Quantum Espresso | access GPU OK: compiled with | - | | Cuda Fortran. KNL access OK | - | | and started compilation | - +------------------------+-------------------------------+ - | SHOC | Waiting for GPU access. | - | | Planning FPGA/KNL port of 2/3 | - | | important kernels if time | - | | allows | - +------------------------+-------------------------------+ - | Specfem3D_Globe | Access to KNL OK. 
Focusing on | - | | lead | - +------------------------+-------------------------------+ - -**General questions** - - interconnect KNL (it is the same on PCP and Frioul): Infiniband - EDR 4x - - -AoB - Date of next meeting -************************** - - Date of the next tecon will be end of october diff --git a/doc/sphinx/mom_telcon/2017-09-28.rst b/doc/sphinx/mom_telcon/2017-09-28.rst deleted file mode 100644 index 3877f920236dfe8ee4b6a938068770a52fc63658..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/2017-09-28.rst +++ /dev/null @@ -1,105 +0,0 @@ -TelCon 28 of September -====================== - -Apologies: ----------- - - Andrew Emerson (CINECA) - - Arno Proeme(EPCC) - - Charles Moulinec (STFC) - - Luigi Iapichino (LRZ) - - Volker Weinberg (LRZ) - - Philippe Segers (GENCI) - -Present: --------- - - Martti Louhivuori (CSC) - - Mariusz Uchronski (WCNS/PSNC) - - Jacob Finkenrath (CyI) - - Valeriu Codreanu (SurfSARA) - - Dimitris Dellis (GRNET) - - Victor Cameo Ponz (CINES) - - - -Minutes of meeting: -------------------- - -Update on machine availability -****************************** - - - E4/GPU Rack are physicaly inside CINECA, openning is planned around 13 October - - No change for now for FPGA cluster - available mid October - -Login procedures -**************** - - - E4/GPU: Still no new from Carlo Cavazzoni but Victor Cameo Ponz should see - him in person next week - - -Basics for energy mesurement on BULL/Atos KNL machine -***************************************************** - -Update for BEO, that is now available. - - Generate an rsa key and send the public part to svp@cines.fr. - - You will be able to call BEO through SSH and get basic job energy reports. - - Please consult a more detailed documentation on the machine itself: - ``/opt/software/frioul/documentation/beo_usage.txt`` - - You will also find complete user guides for HDEEVIZ (vizualisation part not available yet) and - BEO in the same directory - - -Figures that will be included in the final deliverable (performances & energy related) -************************************************************************************** - - - it could be interesting, if possible and meaningfull, to run the same simulation on - different set of nodes to get some kind of scalability on power figures. - -Questions, concerns and report from BCOs -**************************************** -Everything about login opening, deadlines, expected work during the next period, others: - - +------------------------+-----------------------------------------------------------+ - | | | - | Code name + 4IP-extension BCO + - | | | - +========================+===========================================================+ - | Code_Saturne | Ran and exploit results on KNL/Frioul, that is not lost | - | | since files are shared. Should run directly on PCP since | - | | it should be no differences. Runs will begin on PCP by | - | | 10 days. 
| - +------------------------+-----------------------------------------------------------+ - | CP2K | KNL: built essential CP2K libraries, now building | - | | CP2K itself | - +------------------------+-----------------------------------------------------------+ - | GADGET | no updates | - +------------------------+-----------------------------------------------------------+ - | GPAW | Login ok, ready to start | - +------------------------+-----------------------------------------------------------+ - | GROMACS | First runs has been done, since 13 of September could | - +------------------------+ not submit, but this is now over and machine is + - | NAMD | available again | - +------------------------+-----------------------------------------------------------+ - | NEMO | KNL Building netcdf and xios before building NEMO itself | - +------------------------+-----------------------------------------------------------+ - | PFARM | WIP on code compilation | - +------------------------+-----------------------------------------------------------+ - | QCD | part 1 WIP | - +------------------------+-----------------------------------------------------------+ - | Quantum Espresso | nothing recent | - +------------------------+-----------------------------------------------------------+ - | SHOC | will detail soon kernels to be ported | - +------------------------+-----------------------------------------------------------+ - | Specfem3D_Globe | WIP with compilation, previously used sources modified by | - | | intel but since they didn't released it publicly it wont | - | | be used anymore (or just as point of comparison) | - +------------------------+-----------------------------------------------------------+ - -**General questions** - - - -AoB - Date of next meeting -**************************** - - 12 of October, 11:00 CEST, primary PRACE number - diff --git a/doc/sphinx/mom_telcon/2017-10-12.rst b/doc/sphinx/mom_telcon/2017-10-12.rst deleted file mode 100644 index b05b65d81335a8497e3847e8d754e29a790063aa..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/2017-10-12.rst +++ /dev/null @@ -1,119 +0,0 @@ -TelCon 12 of October -==================== - -Apologies: ----------- - - Charles Moulinec (STFC) - - Dimitris Dellis (GRNET) - - Mariusz Uchronski (WCNS/PSNC) - -Present: --------- - - Victor Cameo Ponz (CINES) - - Luigi Iapichino (LRZ) - - Remi Lacroix (IDRIS) - - Andrew Emerson (CINECA) - - Philippe Segers (GENCI) - - Jacob Finkenrath (CyI) - -Minutes of meeting: -------------------- - -Update on machine availability & login procedures -************************************************* - - - GPU prototype is up and running in Cineca. There is still some imprecision - concerning access for 4IP-extension: it might delay again for 10 days, but - we are currently negociating that point so it's faster. The login procedure - will not go through the userdb but A. Emerson will gather access requests - (name, email address and affiliation required). Action for VCP: send the BCOs - list to Andrew. Namely: - - - EMERSON Andrew - CINECA - - PROEME Arno - EPCC - - DE LA CRUZ Raul - BSC - - FINKENRATH Jacob - CyI - - UCHRONSKI Mariusz - WCNS/PSNC - - DELLIS Dimitris - - GRNET - - MOULINEC Charles - STFC - - CODREANU Valeriu - SURFsara - - CAMEO PONZ Victor - CINES - - Since GADGET and GPAW are not supposed to be run on GPU, BCOs have not been - included. - - No news from FPGA prototype staff. 
- - -Basics for energy mesurement on BULL/Atos KNL machine -***************************************************** - -BEO is fully functionnal, and now HDEEVIZ is starting to work if you want to try -it. There were a workshop at CINES last week where trainees used them successfully. -More documentation on ``/opt/software/frioul/documentation/``. Contact the support -team at CINES if you have any requests (svp@cines.fr). - - -Figures that will be included in the final deliverable (performances & energy related) -************************************************************************************** - -I'm working on a template that all of you will fill with performances and energy -metrics. Comparison will be base on full nodes so that we can compare results regarding -theoretical peak performances, memory characteristics and so on. At least on set of -metrics will be required by test case by machine, and if you have more (ie if you have -scalability) it's welcome obviously. -Also some comments on performance will be required. - - -General 4IP-extension information -********************************* - -I just updated the milestone so it mention the project of INRIA on the KNL machines. -This is a side project focusing on one PRACE code and it's energetic optimisation. - - -Questions, concerns and report from BCOs -**************************************** -Everything about login opening, deadlines, expected work during the next period, others: - - +------------------------+-----------------------------------------------------------+ - | | | - | Code name + 4IP-extension BCO + - | | | - +========================+===========================================================+ - | ALYA | No feedback | - +------------------------+-----------------------------------------------------------+ - | Code_Saturne | Lot of work lately, will begin 4IP-extension related work | - | | not before 1 week from now | - +------------------------+-----------------------------------------------------------+ - | CP2K | No feedback | - +------------------------+-----------------------------------------------------------+ - | GADGET | Concerns about UEABS version vs current one: contact | - | | Walter Lieon on this subjet. Currently compilation some | - | | problem, more updated version of the code might help | - +------------------------+-----------------------------------------------------------+ - | GPAW | No feedback | - +------------------------+-----------------------------------------------------------+ - | GROMACS | Runs almost finished. Some problem at the end of the jobs | - +------------------------+ might make it difficult to get the energy back. BEO usage + - | NAMD | WIP, send the SSH key but no feedback yet | - +------------------------+-----------------------------------------------------------+ - | NEMO | No feedback | - +------------------------+-----------------------------------------------------------+ - | PFARM | compiled and executed PFARM code on KNL machine. | - +------------------------+-----------------------------------------------------------+ - | QCD | No updates, waiting to get some time to work on QCD | - +------------------------+-----------------------------------------------------------+ - | Quantum Espresso | Compil OK knl. Few runs. Work fine. Currently running | - | | flat mode. 
Would certainly had some perfomance gain with | - | | cache mode | - +------------------------+-----------------------------------------------------------+ - | SHOC | No feedback | - +------------------------+-----------------------------------------------------------+ - | Specfem3D_Globe | Compiled and began to run. But facing I/O issues so | - | | no metrics for now. Action: contact France Boillod at | - | | Atos and Bertrand Cirou at CINES | - +------------------------+-----------------------------------------------------------+ - -AoB - Date of next meeting -**************************** - - 2 of November, 11:00 CEST, primary PRACE number diff --git a/doc/sphinx/mom_telcon/2017-11-02.rst b/doc/sphinx/mom_telcon/2017-11-02.rst deleted file mode 100644 index 115869ca2b027478755abdca213af6058e9d28ea..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/2017-11-02.rst +++ /dev/null @@ -1,115 +0,0 @@ -TelCon 2 of November -==================== - -Apologies: ----------- - - Arno Proeme (EPCC) - - Charles Moulinec (STFC) - - Jacob Finkenrath (CyI) - - Mariusz Uchronski (WCNS/PSNC) - -Present: --------- - - Andrew Emerson (CINECA) - - Hayk Shoukourian (LRZ) - - Luigi Iapichino (LRZ) - - Ricard Borrell (BSC) - - Victor Cameo Ponz (CINES) - - Volker Weinberg (LRZ) - - -Minutes of meeting: -------------------- - -.. note:: The PRACE telcon line, ther service is very poor lately and a backup solution should be setup. This could be online base solution (Skype/Google hangout/some other online services). Anyone might have problem in getting access to microphone on an online connected machine ? - -.. _deliverable_inputs: - -General consideration about the deliverable -******************************************* - -I'll need your inputs by the **30th of november** - - - **INRIA team**: you should include what have been discussed for the milestone - - **UEABS BCO**: Few information about the code and comments about the scalability are required. You should fill at **leat one line of energy and performance metrics by machines by test case in this table**: :download:`OpenOffice<../deliverable_d7.7/deliverable_uabs_metrics.ods>` or :download:`Excel<../deliverable_d7.7/deliverable_uabs_metrics.xls>` - -F2F Cyprus -********** - -Meeting full day 23 of november, do not forget to register. -You will present the current state of the runs and results for the code you're in charge of during the 4IP/UEABS session. If not present, please ask one of your collegue to do the presentation for you. Ultimately I can present your work send me the presentation before the meeting. - - -PCP-KNL news -************ - - PSU full liquid cooled is being installed on half the machine today. That's the purpose of the hardware maintenance for the machine. - - there will be another maintance beg/mid november that should stop 2 day the machine (probably around 13 of november). - - cluster is and has always been in flat mode and might (or not...) move to cache mode. You will be notified if this happens. You can check the current state whith: - -.. code-block:: shell - - sinfo -o "%30N %20b %f" - -(can be added to job script to keep some trace) - - -DAVIDE news -*********** - - everyone supposed to run on this machine should have access now - - Note the support contact: support@e4company.com - - Andrew E. should be able to get the energy software stack documentation soon. It'll be spread as soon as possible. - -PCP-FPGA news -************* -The few codes that requested access began the registration process. 
Still no clear opening date. - - -Questions, concerns and report from BCOs -**************************************** -Everything about login opening, deadlines, expected work during the next period, others: - - +------------------------+-----------------------------------------------------------+ - | | | - | Code name + 4IP-extension BCO + - | | | - +========================+===========================================================+ - | ALYA | not started testing yet. test in a similar machine | - +------------------------+-----------------------------------------------------------+ - | Code_Saturne | Out of office until the 7, not able to make progress on | - | | code run | - +------------------------+-----------------------------------------------------------+ - | CP2K | KNL: starting to get energy measurements, GPU built | - | | libraries and CP2K itself | - +------------------------+-----------------------------------------------------------+ - | GADGET | uaebs code compiled ok. Initial condition pb with slurm. | - | | Ticket on CINES svp | - +------------------------+-----------------------------------------------------------+ - | GPAW | No feedback | - +------------------------+-----------------------------------------------------------+ - | GROMACS | KNL runs finished, all metrics gathered. Just have to | - +------------------------+ check scalability consistancy. GPU runs still WIP but | - | NAMD | Pregress lately. Exanged woth E4 support about that | - +------------------------+-----------------------------------------------------------+ - | NEMO | knl: built and ready for runs, GPU built netcdf & xios, | - | | starting to build NEMO | - +------------------------+-----------------------------------------------------------+ - | PFARM | Trouble finding the RMX_MAGMA_GPU source code. Asked | - | | Andrew Sunderland | - +------------------------+-----------------------------------------------------------+ - | QCD | successfully compiled QCD-part 2 on Davide, planning to | - | | have results before the F2F meeting | - +------------------------+-----------------------------------------------------------+ - | Quantum Espresso | Issues with KNL machine, intelMPI bug on bcast with F95. | - | | not exactly the same results as on Marconi (cache vs flat | - | | ?). Starting running with GPU | - +------------------------+-----------------------------------------------------------+ - | SHOC | No feedback | - +------------------------+-----------------------------------------------------------+ - | Specfem3D_Globe | No update since last teclon since I was out of office. | - | | I had new from atos that made great improve in running | - | | the code. 
| - +------------------------+-----------------------------------------------------------+ - -AoB - Date of next meeting -**************************** - - 15 of November, 11:00 CEST, primary PRACE number diff --git a/doc/sphinx/mom_telcon/2017-11-15.rst b/doc/sphinx/mom_telcon/2017-11-15.rst deleted file mode 100644 index 95818921b4f8f6aae539ff0beb542ffe1e43f330..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/2017-11-15.rst +++ /dev/null @@ -1,146 +0,0 @@ -TelCon 15 of November -===================== - -Apologies: ----------- - - Arno Proeme (EPCC) - - Charles Moulinec (STFC) - - Ricard Borrell (BSC) - - Valeriu Codreanu (SurfSARA) - - Volker Weinberg (LRZ) - -Attending: ----------- - - Andrew Emerson (CINECA) - - Dimitris Dellis (GRNET) - - Jacob Finkenrath (CyI) - - Mariusz Uchronski (WCNS/PSNC) - - Martti Louhivuori (CSC) - - Stéphane Lanteri (INRIA) - - Victor Cameo Ponz (CINES) - - -Minutes of meeting: -------------------- - - -Deliverable update -****************** - -PMOs gave few extra days to submit the deliverable. It is now due by December the 13 instead of the 10. So partners will have few extra day to send me your contribution which is now due by **December the 3**. - -.. note:: You can find table to be filled by BCO on this sestion: :ref:`deliverable_inputs`. - -F2F Cyprus -********** - -Meeting full day 23 of november, do not forget to register. -No more than 5 minutes talk presenting -Wether someone will present your code at in Cyprus and his name/contact - - +------------------------+-------------------------------+ - | | | - | Code name | Attending the F2F + - | | | - +========================+===============================+ - | ALYA | Ricard Borrell (BSC) | - +------------------------+-------------------------------+ - | Code_Saturne | ✗ | - +------------------------+-------------------------------+ - | CP2K | Alan Simpson (EPCC) | - +------------------------+-------------------------------+ - | GADGET | Volker Weinberg (LRZ) | - +------------------------+-------------------------------+ - | GPAW | Martti Louhivuori (CSC) | - +------------------------+-------------------------------+ - | GROMACS | Dimitris Dellis (GRNET) | - +------------------------+-------------------------------+ - | NAMD | Dimitris Dellis (GRNET) | - +------------------------+-------------------------------+ - | NEMO | Alan Simpson (EPCC) | - +------------------------+-------------------------------+ - | PFARM | Mariusz Uchronski (WCNS/PSNC) | - +------------------------+-------------------------------+ - | QCD | Jacob Finkenrath (CyI) | - +------------------------+-------------------------------+ - | Quantum Espresso | Pietro Bonfa' (CINECA) | - +------------------------+-------------------------------+ - | SHOC | Damian Podareanu (SurfSARA) | - +------------------------+-------------------------------+ - | Specfem3D_Globe | Victor Cameo Ponz (CINES) | - +------------------------+-------------------------------+ - -PCP-KNL news -************ -Lots of you have asked for BEO access lately and sometimes it had been long to treat them. svp@cines.fr have been notified so they pay more attention to your demands. - -Some issues in thread pinning. This can impact performances a lot. -Feedback needed for cache/flat mode on PCP ? - -DAVIDE news -*********** -The command to get some energy informations on DAVIDE is :code:`get_job_energy `. It returns lots of informations and this is still uncertain of wich one exactly match our needs. 
However "IPMI Measures/Cumulative (all nodes)/Total energy" should the metric that we are looking for. Since don't have detailed documentation I wish you save the whole output of the command for the jobs so we can reprocess the result afterwards. - -A. Emerson asked for his contact in Bologne to have more informations. - -.. note:: Jobs ran one week ago should have energy metrics reccorded. However, before that the energy stack were not up so you should re-run your jobs. - - -PCP-FPGA news -************* -Machine has been opened. The 3 people that asked access, now have it. - - -Questions, concerns and report from BCOs -**************************************** -Everything about login opening, deadlines, expected work during the next period, others: - - +------------------------+-----------------------------------------------------------+ - | | | - | Code name + 4IP-extension BCO + - | | | - +========================+===========================================================+ - | ALYA | starting tests with Alya on PCP KNL, code compiled and | - | | tests ready to run | - +------------------------+-----------------------------------------------------------+ - | Code_Saturne | Installed on KNL using version 4.2.2. First test case | - | | (Taylor-Green vortex) has been successfully ran on 4 | - | | nodes using 68 cores per node. Waiting for the support to | - | | get BEO working. The second test (Code_Saturne with | - | | PETSc) will be carried out next on KNL and GPU. | - | | Priliminary results on both machines expectedby the F2F | - | | meeting | - +------------------------+-----------------------------------------------------------+ - | CP2K | Some issues running one of the CP2K test cases on KNL, | - | | WIP. WIP also on GPU | - +------------------------+-----------------------------------------------------------+ - | GADGET | No feedback | - +------------------------+-----------------------------------------------------------+ - | GPAW | compiled run knl + got energy results | - +------------------------+-----------------------------------------------------------+ - | GROMACS | KNL, performances and energy metrics OK | - +------------------------+ GPU performances OK, will have to rerun for energy | - | NAMD | | - +------------------------+-----------------------------------------------------------+ - | NEMO | WIP | - +------------------------+-----------------------------------------------------------+ - | PFARM | No more trouble with the GPU version | - +------------------------+-----------------------------------------------------------+ - | QCD | Stuck on KNL with perf 20% -> look at process pinning | - | | GPU WIP, should work soon | - +------------------------+-----------------------------------------------------------+ - | Quantum Espresso | KNL problems stoped with intel/intelmpi 18.0. MPI task | - | | pinning wird on PCP. Got results and energy. Will share | - | | pinning experience. GPU good results + energy. | - +------------------------+-----------------------------------------------------------+ - | SHOC | Access OK to KNL and GPU machines. Expect results for GPU | - | | by F2F meeting. FPGA experiments end of november begining | - | | of december | - +------------------------+-----------------------------------------------------------+ - | Specfem3D_Globe | I've had tests cases running on KNL. I'm begining to look | - | | at the GPU machine. 
Timing too short to run on FPGA | - +------------------------+-----------------------------------------------------------+ - -Date of next meeting -******************** - - 4 of December, 11:00 CET, primary PRACE number
diff --git a/doc/sphinx/mom_telcon/2017-11-16.rst b/doc/sphinx/mom_telcon/2017-11-16.rst deleted file mode 100644 index 7ef46c6244542131bd57c4c3dae9a5515cade4d9..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/2017-11-16.rst +++ /dev/null @@ -1,35 +0,0 @@ -TelCon 16 of November -===================== - -Attending: ---------- - - Gilles Marait (INRIA) - - Luc Giraud (INRIA) - - Stéphane Lanteri (INRIA) - - Victor Cameo Ponz (CINES) - - -Minutes of meeting: ------------------- -Reporting from INRIA -******************** -MaPHyS has been successfully ported to KNL with quite good scaling. The next step is having the full HORSE software ported. - -Energy metrics for MaPHyS have been collected. - - -Deliverable -*********** -The INRIA deadline to submit their contribution is December the 8th at noon. It will be integrated as part of the deliverable D7.7. - -It should include a technical report of the activity as well as a performance and energy analysis and discussion, as mentioned in Milestone 33: :ref:`inria_plan`. Cite as much as possible; volume is not the goal (10 pages max). Write down what is necessary to understand what has been done and the resulting analysis. - -Actions -******* -@INRIA: Share the Google document used to draft the deliverable
diff --git a/doc/sphinx/mom_telcon/2017-12-06.rst b/doc/sphinx/mom_telcon/2017-12-06.rst deleted file mode 100644 index 60f8cb677c45c2f76c2eecb3317222e57cdd2693..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/2017-12-06.rst +++ /dev/null @@ -1,103 +0,0 @@ -TelCon 6 of December -==================== - -Apologies: ---------- -Attending: ---------- - - - - - Arno Proeme (EPCC) - - Charles Moulinec (STFC) - - Ricard Borrell (BSC) - - Valeriu Codreanu (SurfSARA) - - Volker Weinberg (LRZ) - - Andrew Emerson (CINECA) - - Dimitris Dellis (GRNET) - - Jacob Finkenrath (CyI) - - Mariusz Uchronski (WCNS/PSNC) - - Martti Louhivuori (CSC) - - Stéphane Lanteri (INRIA) - - Victor Cameo Ponz (CINES) - - -Minutes of meeting: ------------------- - -Deliverable update -****************** -I'm currently working on it. I'll give you access to the draft version by the end of the week. So far I have received updated metrics, as requested, only for PFARM (from Mariusz). - -PCP-KNL news -************ -You can submit batch jobs again, so feel free! - -DAVIDE news -*********** -There was some trouble getting energy metrics at the end of last week. -Also, no SMT activation is available. -No feedback from support on my side yet. - -PCP-FPGA news -************* -No news. -Any questions? - -Questions, concerns and report from BCOs -**************************************** -Everything about login opening, deadlines, expected work during the next period, others: - - +------------------------+-------+-------+-----------------------------------------------------------+ - | | Results | | - | Code name +-------+-------+ 4IP-extension BCO + - | | KNL | GPU | | - +========================+=======+=======+===========================================================+ - | ALYA | | | starting tests with Alya on PCP KNL, code compiled and | - | | | | tests ready to run | - +------------------------+-------+-------+-----------------------------------------------------------+ - | Code_Saturne | | | Installed on KNL using version 4.2.2. 
First test case | - | | | | (Taylor-Green vortex) has been successfully ran on 4 | - | | | | nodes using 68 cores per node. Waiting for the support to | - | | | | get BEO working. The second test (Code_Saturne with | - | | | | PETSc) will be carried out next on KNL and GPU. | - | | | | Priliminary results on both machines expectedby the F2F | - | | | | meeting | - +------------------------+-------+-------+-----------------------------------------------------------+ - | CP2K | | | Some issues running one of the CP2K test cases on KNL, | - | | | | WIP. WIP also on GPU | - +------------------------+-------+-------+-----------------------------------------------------------+ - | GADGET | | | No feedback | - +------------------------+-------+-------+-----------------------------------------------------------+ - | GPAW | | | compiled run knl + got energy results | - +------------------------+-------+-------+-----------------------------------------------------------+ - | GROMACS | | | KNL, performances and energy metrics OK | - +------------------------+-------+-------+ GPU performances OK, will have to rerun for energy | - | NAMD | | | | - +------------------------+-------+-------+-----------------------------------------------------------+ - | NEMO | | | WIP | - +------------------------+-------+-------+-----------------------------------------------------------+ - | PFARM | | | No more trouble with the GPU version | - +------------------------+-------+-------+-----------------------------------------------------------+ - | QCD | | | Stuck on KNL with perf 20% -> look at process pinning | - | | | | GPU WIP, should work soon | - +------------------------+-------+-------+-----------------------------------------------------------+ - | Quantum Espresso | | | KNL problems stoped with intel/intelmpi 18.0. MPI task | - | | | | pinning wird on PCP. Got results and energy. Will share | - | | | | pinning experience. GPU good results + energy. | - +------------------------+-------+-------+-----------------------------------------------------------+ - | SHOC | | | Access OK to KNL and GPU machines. Expect results for GPU | - | | | | by F2F meeting. FPGA experiments end of november begining | - | | | | of december | - +------------------------+-------+-------+-----------------------------------------------------------+ - | Specfem3D_Globe | | | I've had tests cases running on KNL. I'm begining to look | - | | | | at the GPU machine. Timing too short to run on FPGA | - +------------------------+-------+-------+-----------------------------------------------------------+ - -Date of next meeting -******************** - - 4 of december, 11:00 CET, primary PRACE number diff --git a/doc/sphinx/mom_telcon/index.rst b/doc/sphinx/mom_telcon/index.rst deleted file mode 100644 index efab39e25a46e16c925619f97b5c8f9fd8c92b07..0000000000000000000000000000000000000000 --- a/doc/sphinx/mom_telcon/index.rst +++ /dev/null @@ -1,13 +0,0 @@ -.. UEABS for accelerators documentation master file, created by - sphinx-quickstart on Wed Jun 7 19:01:00 2017. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Minutes of meeting for 4IP extension TelCon -=========================================== - -.. 
toctree:: - :maxdepth: 1 - :glob: - - * diff --git a/doc/sphinx/ms33.rst b/doc/sphinx/ms33.rst deleted file mode 100644 index e489b7ce483d9c0ec9abcd61974f4389b7e71853..0000000000000000000000000000000000000000 --- a/doc/sphinx/ms33.rst +++ /dev/null @@ -1,123 +0,0 @@ -.. _ms33: - -Milestone 33: workplan definition -================================= - -This document will describe the work done under the PRACE-4IP extension. This -task is dedicated to provide useful information on application *performance and -energy usage* on next generation systems on the path towards exacsale. It will -be caried out running the accelerated UEABS on PCP systems to obtain energy metrics -on *OpenPower+GPU, Xeon Phi and FPGA*. - -PCP systems availables ----------------------- - -This section describes the systems where codes owners have been granted access. -The table :ref:`table-pcp-systems` sums up systems and availability: - -.. _table-pcp-systems: -.. table:: PCP Systems - :widths: auto - - +--------------+--------------+----------------------------+---------------+-------------------------------+ - | Technology | Theoretical | Manufacturer | Host | Availability | - | | peak perf | | | | - +==============+==============+============================+===============+===============================+ - | Power8 + GPU | 877 TFlop/s | `E4 computer engineering`_ | CINECA_ (It) | June/July 2017 | - | | | | | **shifted to mid-October** | - +--------------+--------------+----------------------------+---------------+-------------------------------+ - | Xeon Phi | 512 TFlop/s | `Atos/Bull`_ | CINES_ (Fr) | June 2017 (now available) | - +--------------+--------------+----------------------------+---------------+-------------------------------+ - | FPGA | N/A | MAXELER_ | JSC_ (De) | August 2017 | - | | | | | **shifted to mid-October** | - +--------------+--------------+----------------------------+---------------+-------------------------------+ - -.. note:: More detailed information can be found for :ref:`e4_gpu`, :ref:`atos_knl` - and :ref:`maxeler_fpga` systems. It includes, hardware description, - registration procedures, and energy hardware and tool information. - - - -Code definition ---------------- - -Two sets of codes will be run. One will focus on giving metrics on multiple machines -for UEABS codes while the other will focus on porting specific kernels to the KNL -machine. - -UEABS -^^^^^ - -The table :ref:`table-code-definition` shows all codes available with UEABS -(regular and accelerated). It states for each codes, tageted architechures and BCOs. -Note that due to tight deadlines, efforts to port codes to new architechures will -have to be minimal. - -.. _table-code-definition: -.. 
table:: Code definition - :widths: auto - - +------------------------+--------------------------------+-------------------------------+ - | | Will run on | | - | Code name +--------------+----------+------+ 4IP-extension BCO + - | | Power8 + GPU | Xeon Phi | FPGA | | - +========================+==============+==========+======+===============================+ - | ALYA | ✓ | ✓ | ✗ | Ricard Borrell (BSC) | - +------------------------+--------------+----------+------+-------------------------------+ - | Code_Saturne | ✓ | ✓ | ✗ | Charles Moulinec (STFC) | - +------------------------+--------------+----------+------+-------------------------------+ - | CP2K | ✓ | ✓ | ✗ | Arno Proeme (EPCC) | - +------------------------+--------------+----------+------+-------------------------------+ - | GADGET | ✗ | ✓ | ✗ | Volker Weinberg (LRZ) | - +------------------------+--------------+----------+------+-------------------------------+ - | GENE | ✗ | ✗ | ✗ | ✗ | - +------------------------+--------------+----------+------+-------------------------------+ - | GPAW | ✗ | ✓ | ✗ | Martti Louhivuori (CSC) | - +------------------------+--------------+----------+------+-------------------------------+ - | GROMACS | ✓ | ✓ | ✗ | Dimitris Dellis (GRNET) | - +------------------------+--------------+----------+------+-------------------------------+ - | NAMD | ✓ | ✓ | ✗ | Dimitris Dellis (GRNET) | - +------------------------+--------------+----------+------+-------------------------------+ - | NEMO | ✗ | ✓ | ✗ | Arno Proeme (EPCC) | - +------------------------+--------------+----------+------+-------------------------------+ - | PFARM | ✓ | ✓ | ✗ | Mariusz Uchronski (WCNS/PSNC) | - +------------------------+--------------+----------+------+-------------------------------+ - | QCD | ✓ | ✓ | ✗ | Jacob Finkenrath (CyI) | - +------------------------+--------------+----------+------+-------------------------------+ - | Quantum Espresso | ✓ | ✓ | ✓ | Andrew Emerson (CINECA) | - +------------------------+--------------+----------+------+-------------------------------+ - | SHOC | ✓ | ✗ | ✓ | Valeriu Codreanu (SurfSARA) | - +------------------------+--------------+----------+------+-------------------------------+ - | Specfem3D_Globe | ✓ | ✓ | ✓ | Victor Cameo Ponz (CINES) | - +------------------------+--------------+----------+------+-------------------------------+ - -.. note:: Code descriptions are available on the `Description of the initial - accelerator benchmark suite` and on the `UEABS description web page`_. - -.. _inria_plan: - -Energy profiling of the HORSE+MaPHyS+PaStiX stack -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The work will aim at porting the HORSE+MaPHyS+PaStiX solver stack on the KNL-based system. -It will consists in performing an energetic profiling of theses codes and studying the influence of several parameters -driving the accuracy and numerical efficiency of the underlying simulations. -A parametric study for minimizing the energy consumption will be performed. -The deliverable will include details of this parametric study and a discussion of its main results. - -.. note:: More can be found on the `HORSE software page`_. - - -.. _Description of the initial accelerator benchmark suite: http://www.prace-ri.eu/IMG/pdf/WP212.pdf -.. _UEABS description web page: http://www.prace-ri.eu/ueabs/ -.. _HORSE software page: http://www-sop.inria.fr/nachos/index.php/Software/HORSE - - -.. _MAXELER: http://maxeler.com/ -.. _JSC: http://www.fz-juelich.de/ias/jsc/EN/Home/home_node.html -.. 
_E4 computer engineering: https://www.e4company.com -.. _CINECA: http://hpc.cineca.it/ -.. _Atos/Bull: https://bull.com/ -.. _CINES: https://www.cines.fr/ - -.. _Slurm: https://slurm.schedmd.com/
diff --git a/doc/sphinx/pcp_systems.rst b/doc/sphinx/pcp_systems.rst deleted file mode 100644 index 2697a74608baa0c1ef1c90033161bb3a3a5c7ebd..0000000000000000000000000000000000000000 --- a/doc/sphinx/pcp_systems.rst +++ /dev/null @@ -1,12 +0,0 @@ -PCP systems -*********** - -.. _e4_gpu: -.. include:: /pcp_systems/e4_gpu.rst - -.. _atos_knl: -.. include:: /pcp_systems/atos_knl.rst - -.. _maxeler_fpga: -.. include:: /pcp_systems/maxeler_fpga.rst -
diff --git a/doc/sphinx/pcp_systems/atos_knl.rst b/doc/sphinx/pcp_systems/atos_knl.rst deleted file mode 100644 index 11bdaffa2d7083e84be2771f76986994cdede792..0000000000000000000000000000000000000000 --- a/doc/sphinx/pcp_systems/atos_knl.rst +++ /dev/null @@ -1,85 +0,0 @@ -Xeon Phi -^^^^^^^^ - -This machine has been designed by `Atos/Bull`_ and is hosted at CINES_ in Montpellier, France. It is made of 76 Bull Sequana X1210 blades, each including 3 Xeon Phi KNL nodes. It totals a theoretical peak performance of 465 Tflop/s with an estimated consumption of 42 kW. - -.. note:: - - In order to access the machine, BCOs should fill in the `GENCI login opening form`_. - Use the following information to fill in the project-related fields: - - - project outside DARI - - name of the person in charge of the project: Victor Cameo Ponz - - phone number: +33 (0)4 67 14 14 03 - - project code: praceknl - - scientific machine requested: PCP KNL cluster - - Then send it back to `Victor Cameo Ponz`_. - -Compute technology -"""""""""""""""""" -The hardware features the following nodes: - * 168 nodes with - - * 1x Intel Xeon Phi 7250 processor (KNL), 68 cores clocked at 1.4 GHz with SMT 4 - * 96GB memory, 6x 16GB DDR4 DIMMs - * internode communications integrated using InfiniBand EDR - * 100% hot-water-cooled nodes - * Half of the configuration features liquid-cooled Power Supply Units (PSUs), making this part of the machine 100% liquid cooled. - * MooseFS I/O - -Each compute node has a theoretical peak performance of 2.765 TFlop/s (double precision) and a power consumption of less than 250 W. - -Energy sampling technology -"""""""""""""""""""""""""" - -Power measurements at node level occur at a sampling rate of 1 kHz at the converters and 100 Hz at CPU/DRAM level. They are provided through an HDEEM FPGA on each node. - -`Atos/Bull`_ allows energy access through two frameworks, namely HDEEM VIZualization (HDEEVIZ) and Bull Energy Optimizer (BEO). - -.. note:: - - Specific setup documentation and instructions are available on the machine: :code:`ls /opt/software/frioul/documentation/`. - - -HDEEVIZ ------- -Components: - - SLURM synchronisation + initialisation - - HDEEM writing results to local storage - - Grafana: graphical user interface - -Here is an example of usage in a submission script: - -.. code-block:: shell - -#SBATCH -N 2 -#SBATCH --time=00:30:00 -#SBATCH -J Specfem3D_Globe -#SBATCH -n 89 - -module load intel/17.2 intelmpi/2018.0.061 -module load hdeeviz/hdeeviz_intelmpi_2018.0.061 - -hdeeviz mpirun -n 89 $PWD/bin/xspecfem3D - -Access to the generated data is made through the Grafana web interface: - -.. image:: /pcp_systems/graphana.png - -BEO --- -BEO is a system-administrator-oriented tool that provides energy metrics at switch and node level. At user level, the main feature of interest is the :code:`beo report energy slurm<jobid>` report illustrated below. It produces the following output: - -.. literalinclude:: /pcp_systems/output_beo_report_energy - :emphasize-lines: 1
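For the deliverable metrics tables it is useful to keep the raw report for each job. As a minimal sketch only (the job ID and output file name are illustrative, and the exact invocation, direct or through SSH, should be checked against the BEO user guide in ``/opt/software/frioul/documentation/``):

.. code-block:: shell

   # Illustrative sketch: save the raw BEO energy report of a finished Slurm job
   # so the figures can be reprocessed later for the deliverable metrics tables.
   JOBID=8170
   beo report energy slurm${JOBID} > beo_energy_slurm${JOBID}.txt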
-.. _GENCI login opening form: https://www-dcc.extra.cea.fr/CCFR/ -.. _cines-login-form-odt: https://www.cines.fr/wp-content/uploads/2014/01/opening_renewal_login_2017.odt -.. _cines-login-form-rtf: https://www.cines.fr/wp-content/uploads/2014/01/opening_renewal_login_2017.rtf -.. _Atos/Bull: https://bull.com/ -.. _CINES: https://www.cines.fr/ -.. _Victor Cameo Ponz: cameo+4ip-extension@cines.fr
diff --git a/doc/sphinx/pcp_systems/e4_gpu.rst b/doc/sphinx/pcp_systems/e4_gpu.rst deleted file mode 100644 index f7be270a94eabc743edea8b595e8e26863e503df..0000000000000000000000000000000000000000 --- a/doc/sphinx/pcp_systems/e4_gpu.rst +++ /dev/null @@ -1,39 +0,0 @@ -Power8 + GPU -^^^^^^^^^^^^ - -D.A.V.I.D.E has been designed by `E4 computer engineering`_ and is hosted at CINECA_ in Bologna, Italy. It totals a theoretical peak performance of 990 TFlop/s (double precision). A more detailed description can be found on the `E4 dedicated webpage`_. - -.. note:: In order to access the machine, BCOs should send an email to `Victor Cameo Ponz`_ so that the access request (name, email address and affiliation) can be forwarded. - -Compute technology -"""""""""""""""""" - -The hardware features fat nodes with the following design: - * 45 nodes with - - * 2x IBM POWER8+ processors, i.e. 2x 8 cores with Simultaneous Multi-Threading (SMT) 8 - * 4x NVIDIA P100 GPUs with 16GB High Bandwidth Memory 2 (HBM2) - * intranode communications integrated using NVLink - * extranode communications integrated using an InfiniBand EDR interconnect in a fat-tree topology with no oversubscription - * CPU and GPU direct hot-water (~27°C) cooling, removing 75-80% of the total heat - * the remaining 20-25% of the heat is air-cooled - -Each compute node has a theoretical peak performance of 22 Tflop/s (double precision) and a power consumption of less than 2 kW. - -Energy sampling technology -"""""""""""""""""""""""""" - -Information is collected from processors, memory, GPUs and fans using the Analog-to-Digital Converter in the embedded SoC. It provides sampling at up to 800 kHz, lowered to 50 kHz on the power-measuring sensor outputs. - -The technology has been developed in collaboration with the University of Bologna, which developed the :code:`get_job_energy <jobid>` program. Usage is straightforward and gives the following verbose output: - -.. literalinclude:: /pcp_systems/output_get_job_energy - :emphasize-lines: 1 - - -.. _E4 computer engineering: https://www.e4company.com -.. _E4 dedicated webpage: https://www.e4company.com/en/?id=press&section=1&page=&new=davide_supercomputer -.. _CINECA: http://hpc.cineca.it/ -.. _Victor Cameo Ponz: cameo+4ip-extension@cines.fr -
diff --git a/doc/sphinx/pcp_systems/graphana.png b/doc/sphinx/pcp_systems/graphana.png deleted file mode 100644 index 81b236bedb70f2f4132f9297d99f247b4bae61ec..0000000000000000000000000000000000000000 Binary files a/doc/sphinx/pcp_systems/graphana.png and /dev/null differ
diff --git a/doc/sphinx/pcp_systems/maxeler_fpga.rst b/doc/sphinx/pcp_systems/maxeler_fpga.rst deleted file mode 100644 index 19ed3a87f74268d9dd6650451f40fa445a306ded..0000000000000000000000000000000000000000 --- a/doc/sphinx/pcp_systems/maxeler_fpga.rst +++ /dev/null @@ -1,17 +0,0 @@ -FPGA -^^^^ - -This machine has been designed by MAXELER_ and is hosted at JSC_ in Jülich, Germany. - -Compute technology -"""""""""""""""""" - -This small pilot system features: - - 4 MPC-H servers including 2x MAX5 DFE and 2x Intel Xeon processors - -Energy sampling technology -"""""""""""""""""""""""""" - - -.. _MAXELER: http://maxeler.com/ -.. 
_JSC: http://www.fz-juelich.de/ias/jsc/EN/Home/home_node.html diff --git a/doc/sphinx/pcp_systems/output_beo_report_energy b/doc/sphinx/pcp_systems/output_beo_report_energy deleted file mode 100644 index 7e7ba8e2d4a0ef7a9fa2403cd28fee68f6b4355c..0000000000000000000000000000000000000000 --- a/doc/sphinx/pcp_systems/output_beo_report_energy +++ /dev/null @@ -1,4 +0,0 @@ -$ beo report energy slurm8170 -| job | state | nodes.energy(slurm) | nodes.energy | switches.energy | disk_arrays.energy | job.energy | job.cost | -============================================================================================================================= -| slurm8170 | COMPLETED | | 618.4 kJ | 56.3 kJ | 0.0 J | 674.7 kJ | 0.0219 € | diff --git a/doc/sphinx/pcp_systems/output_get_job_energy b/doc/sphinx/pcp_systems/output_get_job_energy deleted file mode 100644 index 59fd852d1ae3aadd102ffd1f8b5ca336ebe56cfd..0000000000000000000000000000000000000000 --- a/doc/sphinx/pcp_systems/output_get_job_energy +++ /dev/null @@ -1,67 +0,0 @@ -$ get_job_energy 12389 -Job 12389 - - Duration (seconds): 421.0 - - Used Node(s): davide20 - - Requested CPUs: 16 - - Start time: 2017-12-05 17:33:47; End time: 2017-12-05 17:40:48 -(Negative values indicate problems in the job info collection - check back in half an hour) -<===============================================================> - Total nodes power consumption "at the plug". Integral of the - power consumed by each node sampled at 800KHz. BBB Measures - Cumulative (all nodes) - - Mean power (W): 536.402900943 - - Total energy (J): 225825.621297 -<---------------------------------------------------------------> - Node Average - - Mean node power (W): 536.402900943 - - Total node energy (J): 225825.621297 -<===============================================================> - AMESTER Power Measures of main components. 
Integral of the - power consumed by each component sampled at 4KHz : - Cumulative (all nodes) - - Mean power (W): 513.785714286 - - Total energy (J): 216303.785714 - - Mean FANs power (W): 27.0 - - Total FANs energy (J): 11367.0 - - Mean GPUs power (W): 107.047619048 - - Total GPUs energy (J): 45067.0476192 - - Mean CPU_0 processors power (W): 78.9761904762 - - Total CPU_0 processors energy (J): 33248.9761905 - - Mean CPU_1 processors power (W): 118.023809524 - - Total CPU_1 processors energy (J): 49688.0238096 - - Mean CPU_0 memories power (W): 137.0 - - Total CPU_0 memories energy (J): 57677.0 - - Mean CPU_1 memories power (W): 137.023809524 - - Total CPU_1 memories energy (J): 57687.0238096 - - Mean CPU_0 VCS0s VR power (W): 65.2380952381 - - Total CPU_0 VCS0s VR energy (J): 27465.2380952 - - Mean CPU_1 VCS0s VR power (W): 62.6666666667 - - Total CPU_1 VCS0s VR energy (J): 26382.6666667 - - Mean CPU_0 VDD0s VR power (W): 13.5952380952 - - Total CPU_0 VDD0s VR energy (J): 5723.59523808 - - Mean CPU_1 VDD0s VR power (W): 55.3333333333 - - Total CPU_1 VDD0s VR energy (J): 23295.3333333 -<---------------------------------------------------------------> - Node Average - - Mean node power (W): 513.785714286 - - Total node energy (J): 216303.785714 - - Mean FAN power (W): 27.0 - - Total FAN energy (J): 11367.0 - - Mean GPU power (W): 107.047619048 - - Total GPU energy (J): 45067.0476192 - - Mean CPU_0 processors power (W): 78.9761904762 - - Total CPU_0 processors energy (J): 33248.9761905 - - Mean CPU_1 processors power (W): 118.023809524 - - Total CPU_1 processors energy (J): 49688.0238096 - - Mean CPU_0 memories power (W): 137.0 - - Total CPU_0 memories energy (J): 57677.0 - - Mean CPU_1 memories power (W): 137.023809524 - - Total CPU_1 memories energy (J): 57687.0238096 - - Mean CPU_0 VCS0 VR power (W): 65.2380952381 - - Total CPU_0 VCS0 VR energy (J): 27465.2380952 - - Mean CPU_1 VCS0 VR power (W): 62.6666666667 - - Total CPU_1 VCS0 VR energy (J): 26382.6666667 - - Mean CPU_0 VDD0 VR power (W): 13.5952380952 - - Total CPU_0 VDD0 VR energy (J): 5723.59523808 - - Mean CPU_1 VDD0 VR power (W): 55.3333333333 - - Total CPU_1 VDD0 VR energy (J): 23295.3333333 diff --git a/doc/sphinx/periodic_report_2018-01.rst b/doc/sphinx/periodic_report_2018-01.rst deleted file mode 100644 index 96452de1cec7747b19657da683b1fec31e073fca..0000000000000000000000000000000000000000 --- a/doc/sphinx/periodic_report_2018-01.rst +++ /dev/null @@ -1,26 +0,0 @@ -During the PRACE-4IP extension, the main activity have been running the accelerated UEABS codes on the PRACE PCP prototypes to provide useful information on performance and energy usage for future Exascale systems. The project timeline have been divided in two main phases over six months: - - Code preparation (2 months): investigating the PCP prototypes documentation and deciding what applications will be run and what prototypes will be involved. The list of applications and prototypes have been be published in the PRACE 4IP milestone MS33 available on the web[1] at the end of this phase. - - Code run and analisys (4 months): measuring their performance, and monitoring their energy usage. These metrics and the corresponding analisys have been compiled in the PRACE 4IP delivereable D7.7. The final version of this document is currently under MB/TB approval phase. - -In practice deadline setup for this project have been very tight. 
The major concern has been the availability of the PCP systems: not only the opening dates, but also access to the energy tool stacks and practical knowledge of them. In order to illustrate this fact, Figure 1 highlights the general timeline, while Table 1 gives precise dates for each PCP machine. Due to late access to the FPGA machine, this branch of execution has been abandoned so that the risk of losing time on it without results has been avoided. The extra 15 days for submitting the deliverable allowed us to run almost all envisioned benchmarks (see Figure 2 for more details). - -Fig 1: timeline blabla - -Table 1: PRACE 4IP-extension access to PCP machines, detailed dates - -Here is the work done so far during the PRACE 4IP-extension: - - Split off the UEABS git repository from the CodeVault repository - - Compilation of MS33 - - Investigation of the KNL and GPU prototypes and their energy software stacks - - Run of the UEABS codes on the above prototypes (see details in Figure 2) - - Porting and detailed performance and energy parametric study of the HORSE & MaPHyS stack on KNL - - Compilation of the results in D7.7 - -Fig 2: Codes run on the PCP systems vs envisioned - -Two major risks could not be avoided during this project: - - Machine availability: the FPGA schedule shifted too much to allow significant runs. Unfortunately, hard deadlines do not comply well with schedule shifts. Identifying this type of risk early and trying to plan "shortcuts" in the roadmap is the best we can do. - - Running on prototypes is challenging in itself. Unexpected maintenance and tool stack availability did not help with covering the roadmap. Having a clearer view of the PCP technical challenges would have helped. However, most of the roadmap has been covered so far. - - -[1] PRACE 4IP MS33: https://misterfruits.gitlab.io/ueabs/ms33.html
diff --git a/doc/sphinx/theme/static/style.css b/doc/sphinx/theme/static/style.css deleted file mode 100644 index c8980213bf77c58e6df0c9484902e5a22d40c1e0..0000000000000000000000000000000000000000 --- a/doc/sphinx/theme/static/style.css +++ /dev/null @@ -1,5 +0,0 @@ -@import url("alabaster.css"); /* make sure to sync this with the base theme's css filename */ - -.section { - text-align:justify; -}
diff --git a/doc/sphinx/theme/theme.conf b/doc/sphinx/theme/theme.conf deleted file mode 100644 index e7c85b637c33b9e2cedb85de74fdace3d544df24..0000000000000000000000000000000000000000 --- a/doc/sphinx/theme/theme.conf +++ /dev/null @@ -1,5 +0,0 @@ -[theme] -inherit = alabaster -stylesheet = style.css -pygments_style = pygments.css -
diff --git a/doc/timeline/.gitignore b/doc/timeline/.gitignore deleted file mode 100644 index 3eec47da1211f648b0c8623af52105b1f8d67bd3..0000000000000000000000000000000000000000 --- a/doc/timeline/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -*.aux -*.log -*.pdf
diff --git a/doc/timeline/4ip_72b.tex b/doc/timeline/4ip_72b.tex deleted file mode 100644 index 8dacf737da5aad493664965ef3011e5437c7a128..0000000000000000000000000000000000000000 --- a/doc/timeline/4ip_72b.tex +++ /dev/null @@ -1,54 +0,0 @@ -\documentclass[border=10pt]{standalone} - -\usepackage{tikz} -\usetikzlibrary{timeline} - -\begin{document} - -\begin{tikzpicture}[timespan={}] -% timespan={Day} -> now we have days as reference -% timespan={} -> no label is displayed for the timespan -% default timespan is 'Week' - -\timeline[custom interval=true]{2016, 2017} -% \timeline[custom interval=true]{3,...,9} -> i.e., from Day 3 to Day 9 -% \timeline{8} -> i.e., from Week 1 to Week 8 - -% put here the phases -\begin{phases}
-\initialphase{involvement degree=3cm,phase color=black} % code first idea: code call & UEABS 0 -\phase{between week=0 and 1 in 0.8,involvement degree=2.25cm, phase color=black} % bench suite definition 1 -\phase{between week=1 and 2 in 0.2,involvement degree=4cm, phase color=green} % first run 2 -\phase{between week=1 and 2 in 0.8,involvement degree=2.5cm, phase color=green} % Xcompile & Xexecution 3 -\phase{between week=1 and 2 in 1,involvement degree=1.75cm, phase color=green} % run on latest Xeon Phi and GPUs available 4 -\phase{between week=1 and 2 in 1.2,phase color=blue!80!cyan, involvement degree=3cm} % write D7.5 & publish BM suite on prace.ri 5 -\phase{between week=1 and 2 in 1.4,phase color=blue!80!cyan, involvement degree=2cm} % review of D7.5 6 -\phase{between week=1 and 2 in 0.6,involvement degree=2cm, phase color=green} % Write Guide 7 - -\end{phases} - -% put here the milestones -\addmilestone{at=phase-1.270,direction=270:0.5cm,text={MS27/M11},text options={below}} % dec 2015 -\addmilestone{at=phase-6.330,direction=300:1.5cm,text={MS32/M26},text options={below}} % Apr 2017 -\addmilestone{at=phase-6.330,direction=315:1cm,text={D7.5/M26},text options={below}} % Apr 2017 - -% unofficial milestones -\addmilestone{at=phase-0.30,direction=90:1cm,text={Codes defined},text options={above}} % mid 2015 -\addmilestone{at=phase-1.30,direction=90:1cm,text={Test cases defined},text options={above}} % dec 2015 -\addmilestone{at=phase-2.30,direction=90:1.5cm,text={First run passed},text options={above}} % june 2016 -\addmilestone{at=phase-7.330,direction=225:1.5cm,text={Guides written},text options={below}} % june 2016 -\addmilestone{at=phase-3.30,direction=110:1.5cm,text={Cross compilation successful},text options={above}} % june 2016 -\addmilestone{at=phase-4.30,direction=70:1.3cm,text={Latest arch run passed},text options={above}} % feb 2017 -\addmilestone{at=phase-5.330,direction=255:2cm,text={Deliveries submitted to review},text options={below}} % May 2015 - - - -% informative dates -\addmilestone{at=phase-0.120,direction=105:1cm,text={Kickoff telcon},text options={above}} % May 2015 -%\addmilestone{at=phase-2.330,direction=270:1.5cm,text={F2F@Trondheim},text options={below}} % june 2016 -%\addmilestone{at=phase-3.300,direction=270:0.5cm,text={F2F@Sofia},text options={below}} % june 2016 -%\addmilestone{at=phase-4.330,direction=270:1.5cm,text={AH@Athens},text options={below}} % feb 2017 -%\addmilestone{at=phase-6.90,direction=70:1.9cm,text={PCP machines available -- Q1 or Q2},text options={above}} % Q2 2017 -\end{tikzpicture} - -\end{document} diff --git a/doc/timeline/4ip_extension.tex b/doc/timeline/4ip_extension.tex deleted file mode 100644 index f2f29507be9dde601aab6ba2524a140982959470..0000000000000000000000000000000000000000 --- a/doc/timeline/4ip_extension.tex +++ /dev/null @@ -1,57 +0,0 @@ -\documentclass[border=10pt]{standalone} - -\usepackage{tikz} -\usetikzlibrary{timeline} - -\begin{document} - -\begin{tikzpicture}[timespan={}] -% timespan={Day} -> now we have days as reference -% timespan={} -> no label is displayed for the timespan -% default timespan is 'Week' - -\timeline[custom interval=true]{July, Aug, Sep, Oct, Nov, Dec} -% \timeline[custom interval=true]{3,...,9} -> i.e., from Day 3 to Day 9 -% \timeline{8} -> i.e., from Week 1 to Week 8 - -% phases -\begin{phases} -\initialphase{involvement degree=4cm,phase color=black} % 4ip extension setup 0 -\phase{between week=1 and 2 in 1.0,involvement degree=5.25cm, phase color=black} % BCO search 1 -\phase{between week=2 
and 3 in 1.1,phase color=blue!80!cyan, involvement degree=2.5cm} % write MS33 2 -\phase{between week=4 and 5 in 0.5,involvement degree=7cm, phase color=green} % Run codes 3 -\phase{between week=5 and 6 in 1.03,phase color=blue!80!cyan, involvement degree=1.8cm} % drafting of D7.7 4 -\phase{between week=5 and 6 in 1.7,phase color=blue!80!cyan, involvement degree=3cm} % review D7.7 5 -\phase{between week=0 and 1 in 0.66,phase color=blue!80!cyan, involvement degree=0.5cm} % F2F Juelich 6 -\phase{between week=5 and 6 in 0.8,phase color=blue!80!cyan, involvement degree=0.5cm} % F2F Cyprus 7 -\end{phases} - -% milestones -\addmilestone{at=phase-2.260,direction=270:0.7cm,text={MS33 issued},text options={below}} % 1 sept 2017 -\addmilestone{at=phase-2.335,direction=270:0.5cm,text={MS33 updated},text options={below}} % 15 sept 2017 -\addmilestone{at=phase-4.330,direction=270:2.5cm,text={D7.7 submitted},text options={below}} % 13 dec 2017 -\addmilestone{at=phase-5.0,direction=270:1.5cm,text={D7.7 delivered},text options={below}} % Jan 2018 - -% Machine opening -\addmilestone{at=phase-3.215,direction=270:0.5cm,text={KNL available},text options={below}} % Sept 2017 -\addmilestone{at=phase-3.300,direction=270:0.2cm,text={GPU available},text options={below}} % 7 Nov 2017 -\addmilestone{at=phase-3.315,direction=270:1.2cm,text={FPGA available},text options={below}} % 15 Nov 2017 - -% F2F Dates -\addmilestone{at=phase-6.270,direction=270:2cm,text={F2F in Juelich},text options={below}} % 20 June 2017 -\addmilestone{at=phase-7.270,direction=270:1cm,text={F2F in Cyprus},text options={below}} % 20 nov 2017 - - - - - -% Phase name -\addmilestone{at=phase-0.120,direction=105:1cm,text={4IP-Extension setup},text options={above}} % May 2015 -\addmilestone{at=phase-1.120,direction=105:1cm,text={Partners applying for BCO},text options={above}} % May 2015 -\addmilestone{at=phase-2.120,direction=105:1cm,text={MS33 compilation},text options={above}} % May 2015 -\addmilestone{at=phase-3.120,direction=105:0.5cm,text={Run on PCPs},text options={above}} % May 2015 -\addmilestone{at=phase-4.120,direction=105:2cm,text={D7.7 compilation},text options={above}} % May 2015 -\addmilestone{at=phase-5.120,direction=105:1cm,text={D7.7 review},text options={above}} % May 2015 -\end{tikzpicture} - -\end{document} diff --git a/doc/timeline/tikzlibrarytimeline.code.tex b/doc/timeline/tikzlibrarytimeline.code.tex deleted file mode 100644 index 7dcad20d7591a51e8e7fecbfbcfd9a04b8476519..0000000000000000000000000000000000000000 --- a/doc/timeline/tikzlibrarytimeline.code.tex +++ /dev/null @@ -1,138 +0,0 @@ -% * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * -% COPYRIGHT 2014 - Claudio Fiandrino -% Released under the LaTeX Project Public License v1.3c or later -% -% email: -% -% Timeline TikZ library version 0.3a - 19/07/2014 - -\usetikzlibrary{backgrounds,calc} - -\pgfkeys{/tikz/.cd, - timespan/.store in=\timespan, - timespan=Week, - timeline width/.store in=\timelinewidth, - timeline width=20, - timeline height/.store in=\timelineheight, - timeline height=0.5, - timeline offset/.store in=\timelineoffset, - timeline offset=0.15, - initial week/.store in=\initialweek, - initial week=1, - end week/.store in=\endweek, - end week=2, - time point/.store in=\timepoint, - time point=0.5, - between day/.style args={#1 and #2 in #3}{% auxiliary style for days - initial week=#1, - end week=#2, - time point=#3, - }, - between week/.style args={#1 and #2 in #3}{% style for weeks - initial week=#1, - end week=#2, - time point=#3, - }, - 
between month/.style args={#1 and #2 in #3}{% auxiliary style for months - initial week=#1, - end week=#2, - time point=#3, - }, - between year/.style args={#1 and #2 in #3}{% auxiliary style for years - initial week=#1, - end week=#2, - time point=#3, - }, - involvement degree/.store in=\involvdegree, - involvement degree=2cm, - phase color/.store in=\phasecol, - phase color=red!50!orange, - phase appearance/.style={ - circle, - opacity=0.3, - minimum size=\involvdegree, - fill=\phasecol - }, -} -% settings to customize aspect of timeline -\newif\ifcustominterval -\pgfkeys{/tikz/timeline/.cd, - custom interval/.is if=custominterval, - custom interval=false, -} - -% settings to deploy milestones -\pgfkeys{/tikz/milestone/.cd, - at/.store in=\msstartpoint, - at=phase-1.north, - circle radius/.store in=\milestonecircleradius, - circle radius=0.1cm, - direction/.store in=\msdirection, - direction=90:2cm, - text/.store in=\mstext, - text={}, - text options/.code={\tikzset{#1}}, -} - -\newcommand{\reftimespan}{\MakeLowerCase{\timespan}} - -\newcommand{\timeline}[2][]{ - \pgfkeys{/tikz/timeline/.cd,#1} - \draw[fill,opacity=0.8] (0,0) rectangle (\timelinewidth,\timelineheight); - \shade[top color=black, bottom color=white,middle color=black!20] - (0,0) rectangle (\timelinewidth,-\timelineoffset); - \shade[top color=white, bottom color=black,middle color=black!20] - (0,\timelineheight) rectangle (\timelinewidth,\timelineheight+\timelineoffset); - - \ifcustominterval% - \foreach \smitem [count=\xi] in {#2} {\global\let\maxsmitem\xi}% - \else% - \foreach \smitem [count=\xi] in {1,...,#2} {\global\let\maxsmitem\xi}% - \fi% - - \pgfmathsetmacro\position{\timelinewidth/(\maxsmitem+1)} - \node at (0,0.5*\timelineheight)(\timespan-0){\phantom{Week 0}}; - - \ifcustominterval% - \foreach \x[count=\xi] in {#2}{% - \node[text=white,text depth=0pt]at +(\xi*\position,0.5*\timelineheight) (\timespan-\xi) {\timespan\ \x};% - }% - \else% - \foreach \x[count=\xi] in {1,...,#2}{% - \node[text=white, text depth=0pt]at +(\xi*\position,0.5*\timelineheight) (\timespan-\xi) {\timespan\ \x};% - }% - \fi% -} - -\newcounter{involv} -\setcounter{involv}{0} - -\newcommand{\phase}[1]{ -\stepcounter{involv} -\node[phase appearance,#1] - (phase-\theinvolv) - at ($(\timespan-\initialweek)!\timepoint!(\timespan-\endweek)$){}; -} - -\newcommand{\initialphase}[1]{ -\node[phase appearance,#1,anchor=west,between week=0 and 1 in 0,] - (phase-\theinvolv) - at ($(\timespan-0)!0!(\timespan-1)$){}; -\setcounter{involv}{0} -} - -\newenvironment{phases}{\begin{pgfonlayer}{background}}{\end{pgfonlayer}} - -\newcommand{\addmilestone}[1]{ -\pgfkeys{/tikz/milestone/.cd,#1} -\draw[double,fill] (\msstartpoint) circle [radius=\milestonecircleradius]; -\draw(\msstartpoint)--++(\msdirection)node[/tikz/milestone/text options]{\mstext}; -} - -% HISTORY -% 0.1 -> initial release -% 0.2 -> customizable timespan label -% 0.3 -> \timeline command with custom intervals -% styles ``between x'' -% removed unnecessary call to xstring -% 0.3a -> text depth for timeline labels \ No newline at end of file diff --git a/gene/GENE_Run_README.txt b/gene/GENE_Run_README.txt deleted file mode 100644 index 06731f58b4019239315d8271916c53c9bf48d2fb..0000000000000000000000000000000000000000 --- a/gene/GENE_Run_README.txt +++ /dev/null @@ -1,99 +0,0 @@ -This is the README file for the GENE application benchmark, -distributed with the Unified European Application Benchmark Suite. - ------------ -GENE readme ------------ - -Contents --------- - -1. General description -2. 
Code structure -3. Parallelization -4. Building -5. Running the code -6. Data - -1. General description -====================== - -The gyrokinetic plasma turbulence code GENE (the acronym stands for -Gyrokinetic Electromagnetic Numerical Experiment) is a software package -dedicated to solving the nonlinear gyrokinetic integro-differential system -of equations in either a flux-tube domain or a radially nonlocal domain. -GENE has been developed by a team of people (the GENE Development Team, -led by F. Jenko, Max Planck Institute for Plasma Physics) over the last -several years. - -For further documentation of the code see: http://www.ipp.mpg.de/~fsj/gene/ - -2. Code structure -================== - -Each particle species is described by a time-dependent distribution function -in a five-dimensional phase space. -Together with the species index, this results in 6-dimensional arrays, which have the following coordinates: -x y z three space coordinates -v parallel velocity -w perpendicular velocity -spec species of particles - -GENE is written completely in Fortran 90, with some language constructs -from the Fortran 2003 standard. It also contains preprocessing directives. - -3. Parallelization -================== - -Parallelization is done by domain decomposition of all 6 coordinates using MPI: -x, y, z 3 space coordinates -v parallel velocity -w perpendicular velocity -spec species of particles - - -4. Building -=========== - -The source code (Fortran 90) resides in the directory src. -The compilation of GENE is handled by JuBE. -Compilation is done automatically whenever a new executable for the -benchmark runs is needed. - -5. Running the code -==================== -A very brief description of the datasets: - -parameters_small - A small data set for test purposes. Needs only 8 cores to run. -parameters_tier1 - Global simulation of ion-scale turbulence in ASDEX Upgrade, - needs 200-500 GB total memory, runs on 256 to 4096 cores -parameters_tier0 - Global simulation of ion-scale turbulence in JET, - needs 3.5-7 TB total memory, runs on 4096 to 16384 cores - -To run the GENE benchmark, please follow the instructions for -using JuBE. -JuBE creates a run directory for each benchmark run, generates the input file 'parameters' from -a template input file, and stores it in the -run directory. -A job submission script is created and submitted as well. - -6. Data -======= - -The only input file is 'parameters'. It has the format of an f90 namelist (a short, purely illustrative sketch of this format is given after this README). - -The following output files are stored in the run directory: - -nrg.dat The content of this file is used to verify the correctness - of the benchmark run. - -stdout is redirected by JuBE. - It contains logging information, - especially the result of the time measurement. - ---------------------------------------------------------------------------
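The 'parameters' file mentioned in section 6 of the GENE README above uses the Fortran 90 namelist format. The sketch below is purely illustrative and only shows what that format looks like in general: the group and variable names are placeholders invented for this example, not GENE's actual input keywords, which are defined by the GENE documentation and generated from the JuBE template.

    &example_parallelization
      n_tasks_space    = 8     ! placeholder: MPI tasks across the x, y, z directions
      n_tasks_velocity = 4     ! placeholder: MPI tasks across the v and w directions
    /
    &example_box
      grid_points_x = 64       ! placeholder: grid resolution in x
      grid_points_v = 32       ! placeholder: grid resolution in v
    /

Each '&group ... /' block collects related settings and can be read directly by a Fortran namelist READ statement (e.g. read(unit, nml=example_box)), so no custom parser is needed; GENE's real 'parameters' file follows the same syntax with its own groups and variables.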