# Quantum Espresso in the Accelerated Benchmark Suite

## Document Author: A. Emerson (a.emerson@cineca.it), Cineca.

## Contents
1. Introduction
2. Requirements
3. Downloading the software
4. Compiling the application
5. Running the program
6. Examples
7. References

## 1. Introduction

The GPU-enabled version of Quantum Espresso (known as QE-GPU) provides GPU acceleration for the Plane-Wave Self-Consistent Field (PWscf) code and for the calculation of energy barriers and reaction pathways with the Nudged Elastic Band (NEB) package. QE-GPU is developed as a plugin to the main QE program branch and is usually based on code one or two versions behind the main program version. Note that in the accelerated benchmark suite *version 5.4* has been used for QE-GPU, whereas the latest release of the main package is 6.0.

QE-GPU is developed by Filippo Spiga; the download and build instructions for the package are given here [1] if the package is not already available on your system.

## 2. Requirements

Essential:

* Quantum ESPRESSO 5.4
* Kepler GPU: CUDA SDK 6.5 (minimum)
* Pascal GPU: CUDA SDK 8.0 (minimum)

Optional:

* A parallel linear algebra library such as ScaLAPACK or Intel MKL. If none is available on your system, the installation can use a version supplied with the distribution.

## 3. Downloading the software

### QE distribution

Many packages are available from the download page, but since only the main base package is needed for the benchmark suite, the `espresso-5.4.0.tar.gz` file will be sufficient. This can be downloaded from:

[http://www.quantum-espresso.org/download](http://www.quantum-espresso.org/download)

### GPU plugin

The GPU source code can be conveniently downloaded from this link:

[https://github.com/QEF/qe-gpu-plugin](https://github.com/QEF/qe-gpu-plugin)

## 4. Compiling the application

The QE-GPU documentation [1] gives more details, but for the benchmark suite we followed this general procedure:

1. Uncompress the main QE distribution and copy the GPU source distribution inside it:

   ```shell
   tar zxvf espresso-5.4.0.tar.gz
   cp 5.4.0.tar.gz espresso-5.4.0
   ```

2. Uncompress the GPU source inside the main distribution and create a symbolic link:

   ```shell
   cd espresso-5.4.0
   tar zxvf 5.4.0.tar.gz
   ln -s QE-GPU-5.4.0 GPU
   ```

3. Run the QE-GPU configure and make:

   ```shell
   cd GPU
   ./configure --enable-parallel --enable-openmp --with-scalapack=intel \
               --enable-cuda --with-gpu-arch=Kepler \
               --with-cuda-dir=/usr/local/cuda/7.0.1 \
               --without-magma --with-phigemm
   cd ..
   make -f Makefile.gpu pw-gpu
   ```

In this example we are compiling with the Intel Fortran compiler, so we can use the Intel MKL version of ScaLAPACK. Note also that in the above it is assumed that the CUDA library has been installed in the directory `/usr/local/cuda/7.0.1`. The QE-GPU executable will appear in the directory `GPU/PW` and is called `pw-gpu.x`.

## 5. Running the program

Before you can run any calculations you need some input. The input files are of two types:

1. A control file, usually called `pw.in`.
2. One or more pseudopotential files with extension `.UPF`.

The pseudopotential files are placed in a directory specified in the control file with the tag `pseudo_dir`. Thus if we have

```shell
pseudo_dir=./
```

then QE-GPU will look for the pseudopotential files in the current directory.
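The control file itself consists of Fortran namelists followed by input cards describing the system. As a rough illustration only, a minimal self-consistent field input for a two-atom silicon cell might look like the sketch below; the prefix, the `Si.pz-vbc.UPF` pseudopotential name and all numerical parameters are placeholders and are not taken from the benchmark suite:

```shell
# Sketch only: write a minimal PWscf control file for a 2-atom silicon cell.
# All values are illustrative; the benchmark test cases provide their own inputs.
cat > pw.in << 'EOF'
&CONTROL
  calculation = 'scf'
  prefix      = 'silicon'
  outdir      = './tmp'
  pseudo_dir  = './'      ! .UPF files are searched for in this directory
/
&SYSTEM
  ibrav = 2, celldm(1) = 10.2, nat = 2, ntyp = 1,
  ecutwfc = 30.0
/
&ELECTRONS
  conv_thr = 1.0d-8
/
ATOMIC_SPECIES
 Si  28.086  Si.pz-vbc.UPF
ATOMIC_POSITIONS alat
 Si 0.00 0.00 0.00
 Si 0.25 0.25 0.25
K_POINTS automatic
 4 4 4 0 0 0
EOF
```

When running the benchmark suite you should not need to write such a file by hand, since the test cases should already contain suitable control and pseudopotential files.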
If using the PRACE benchmark suite, the data files can be downloaded from the QE website or the PRACE repository. For example:

```shell
wget http://www.prace-ri.eu/UEABS/Quantum_Espresso/QuantumEspresso_TestCaseA.tar.gz
```

Once uncompressed you can then run the program like this (e.g. using MPI over 16 cores):

```shell
mpirun -n 16 pw-gpu.x -input pw.in
```

but check your system documentation, since `mpirun` may be replaced by `mpiexec`, `runjob`, `aprun`, `srun`, etc. Note also that normally you are not allowed to run MPI programs interactively; you must instead use the batch system. A couple of examples for PRACE systems are given in the next section.

## 6. Examples

We now give one build example and two run examples.

### Computer System: Cartesius GPU partition, SURFsara

#### Build

```shell
wget http://www.qe-forge.org/gf/download/frsrelease/204/912/espresso-5.4.0.tar.gz
tar zxvf espresso-5.4.0.tar.gz
cd espresso-5.4.0
wget https://github.com/fspiga/QE-GPU/archive/5.4.0.tar.gz
tar zxvf 5.4.0.tar.gz
ln -s QE-GPU-5.4.0 GPU
cd GPU
module load mpi
module load mkl
module load cuda
./configure --enable-parallel --enable-openmp --with-scalapack=intel \
            --enable-cuda --with-gpu-arch=sm_35 \
            --with-cuda-dir=$CUDA_HOME \
            --without-magma --with-phigemm
cd ..
make -f Makefile.gpu pw-gpu
```

#### Running

Cartesius uses the SLURM scheduler. An example batch script is given below:

```shell
#!/bin/bash
#SBATCH -N 6 --ntasks-per-node=16
#SBATCH -p gpu
#SBATCH -t 01:00:00

module load fortran mkl mpi/impi cuda
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${SURFSARA_MKL_LIB}

srun pw-gpu.x -input pw.in > job.out
```

You should create a file containing the above commands (e.g. `myjob.sub`) and then submit it to the batch system, e.g.

```shell
sbatch myjob.sub
```

Please check the SURFsara documentation for more information on how to use the batch system.

### Computer System: Marconi KNL partition (A2), Cineca

#### Running

Quantum Espresso has already been installed for the KNL nodes of Marconi and can be accessed via a specific module:

```shell
module load profile/knl
module load autoload qe/6.0_knl
```

On Marconi the default is to use the MCDRAM as cache and to have the clustering mode set to quadrant. Other settings of the KNLs on Marconi (e.g. flat mode) have not been substantially tested for Quantum Espresso, but significant differences in performance are not expected for most inputs.

An example PBS batch script for the A2 partition is given below:

```shell
#!/bin/bash
#PBS -l walltime=06:00:00
#PBS -l select=2:mpiprocs=34:ncpus=68:mem=93gb
#PBS -A <account>
#PBS -N jobname

module purge
module load profile/knl
module load autoload qe/6.0_knl

cd ${PBS_O_WORKDIR}

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=${OMP_NUM_THREADS}

mpirun pw.x -npool 4 -input file.in > file.out
```

With the above PBS directives we have asked for 2 KNL nodes (each with 68 cores) in cache/quadrant mode and 93 GB of main memory each. We are running QE in hybrid mode with 34 MPI processes per node, each with 4 OpenMP threads, and distributing the k-points in 4 pools; the Intel MKL library will also use 4 OpenMP threads per process.

Note that this script needs to be submitted using the KNL scheduler as follows:

```shell
module load env-knl
qsub myjob
```

Please check the Cineca documentation for information on using the Marconi KNL partition.

## 7. References

1. QE-GPU build and download instructions, [https://github.com/QEF/qe-gpu-plugin](https://github.com/QEF/qe-gpu-plugin).

Last updated: 7-April-2017