# Quantum Espresso in the United European Applications Benchmark Suite (UEABS)

## Document Author: A. Emerson (a.emerson@cineca.it), Cineca.

## Introduction

Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials. Full documentation is available from the project website [QuantumEspresso](https://www.quantum-espresso.org/). In this README we give information relevant for its use in the UEABS.

### Standard CPU version

For the UEABS activity we have used mainly version v6.5, but later versions are now available.

### GPU version

The GPU port of Quantum Espresso is a version of the program which has been completely re-written in CUDA FORTRAN. The program version used in these experiments is v6.5a1, although later versions may be available.

## Installation and requirements

### Standard

The Quantum Espresso source can be downloaded from the project's GitHub repository, [QE](https://github.com/QEF/q-e/tags). Requirements can be found on the website, but you will need a good FORTRAN and C compiler with an MPI library and, optionally (but highly recommended), an optimised linear algebra library.

### GPU version

For complete build requirements and information see the following GitHub site: [QE-GPU](https://gitlab.com/QEF/q-e-gpu/-/releases)

A short summary is given below:

Essential

* The PGI compiler version 17.4 or above.
* You need NVIDIA TESLA GPUs such as Kepler (K20, K40, K80), Pascal (P100) or Volta (V100). No other cards are supported. NVIDIA TESLA P100 and V100 are strongly recommended for their on-board memory capacity and double precision performance.

Optional

* A parallel linear algebra library such as ScaLAPACK, Intel MKL or IBM ESSL. If none is available on your system then the installation can use a version supplied with the distribution.

## Downloading the software

### Standard

From the website, for example:

```bash
wget https://github.com/QEF/q-e/releases/download/qe-6.5/qe-6.5.tar.gz
```

### GPU

Available from the web site given above. You can use, for example, ```wget``` to download the software:

```bash
wget https://gitlab.com/QEF/q-e-gpu/-/archive/qe-gpu-6.5a1/q-e-gpu-qe-gpu-6.5a1.tar.gz
```

## Compiling and installing the application

### Standard installation

Installation is achieved by the usual ```configure, make, make install``` procedure. However, it is recommended that the user checks the __make.inc__ file created by this procedure before performing the make. For example, using the Intel compilers:

```bash
module load intel intelmpi
CC=icc FC=ifort MPIF90=mpiifort ./configure --enable-openmp --with-scalapack=intel
```

Assuming the __make.inc__ file is acceptable, the user can then do:

```bash
make; make install
```

### GPU

The GPU version is configured similarly to the CPU version, the only exception being that the configure script will check for the presence of PGI and CUDA libraries. A typical configure might be:

```bash
./configure --with-cuda=XX --with-cuda-runtime=YY --with-cuda-cc=ZZ --enable-openmp [ --with-scalapack=no ]
```

where `XX` is the location of the CUDA Toolkit (in HPC environments this is generally `$CUDA_HOME`), `YY` is the version of the CUDA Toolkit and `ZZ` is the compute capability of the card. For example:

```bash
./configure --with-cuda=$CUDA_HOME --with-cuda-cc=60 --with-cuda-runtime=9.2
```

The __dev-tools/get_device_props.py__ script is available if you don't know these values.
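Alternatively, these values can usually be read from the standard NVIDIA tools. A minimal sketch, assuming the CUDA Toolkit is on the path and a driver recent enough to support the `compute_cap` query field:

```bash
# CUDA Toolkit / runtime version (the YY value above), e.g. "release 10.2"
nvcc --version | grep -i release

# Compute capability of the installed card (the ZZ value above),
# e.g. "7.0" corresponds to --with-cuda-cc=70
# (the compute_cap field requires a reasonably recent NVIDIA driver)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```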
Compilation is then performed as normal by:

```bash
make pw
```

#### Example compilation of Quantum Espresso for GPU-based machines

```bash
module load pgi cuda
./configure --with-cuda=$CUDA_HOME --with-cuda-cc=70 --with-cuda-runtime=10.2
make -j8 pw
```

## Running the program - general procedure

Of course you need some input before you can run calculations. The input files are of two types:

1. A control file usually called `pw.in`
2. One or more pseudopotential files with extension `.UPF`

The pseudopotential files are placed in a directory specified in the control file with the tag `pseudo_dir`. Thus if we have

```bash
pseudo_dir=./
```

then QE-GPU will look for the pseudopotential files in the current directory.

If using the PRACE benchmark suite, the data files can be downloaded from the PRACE repository. For example:

```bash
wget https://repository.prace-ri.eu/ueabs/Quantum_Espresso/QuantumEspresso_TestCaseA.tar.gz
```

Once uncompressed you can then run the program like this (e.g. using MPI over 16 cores):

```bash
mpirun -n 16 pw-gpu.x -input pw.in
```

but check your system documentation since `mpirun` may be replaced by `mpiexec`, `runjob`, `aprun`, `srun`, etc. Note also that normally you are not allowed to run MPI programs interactively without using the batch system.

### Running on GPUs

The procedure is identical to running on non-accelerator-based hardware. If GPUs are being used then the following will appear in the program output:

```
GPU acceleration is ACTIVE.
```

GPU acceleration can be switched off by setting the following environment variable:

```bash
$ export USEGPU=no
```

### Parallelisation options

Quantum Espresso uses various levels of parallelisation, the most important being MPI parallelisation over the *k-points* available in the input system. This is achieved with the ```-npool``` program option. Thus for the AUSURF input, which has 2 k-points, we can run:

```bash
srun -n 64 pw.x -npool 2 -input pw.in
```

which would allocate 32 MPI tasks per k-point. The number of MPI tasks must be a multiple of the number of k-points. For the TA2O5 input, which has 26 k-points, we could try:

```bash
srun -n 52 pw.x -npool 26 -input pw.in
```

but we may wish to use fewer pools but with more tasks per pool:

```bash
srun -n 52 pw.x -npool 13 -input pw.in
```

It is also possible to control the number of MPI tasks used in the diagonalisation of the subspace Hamiltonian. This is done with the ```-ndiag``` parameter, which must be a square number. For example, with the AUSURF input we can assign 4 processes to the Hamiltonian diagonalisation:

```bash
srun -n 64 pw.x -npool 2 -ndiag 4 -input pw.in
```
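Since the number of MPI tasks must be compatible with the chosen number of pools, it can be useful to confirm how many k-points an input actually generates before deciding on ```-npool```. A minimal sketch, assuming a (possibly partial) run has already written the standard pw.x output to a file, here called `pw.out` as an example:

```bash
# pw.x reports the k-point count in its run summary,
# e.g. "number of k points=     2"; choose -npool as a divisor of this value.
grep "number of k points" pw.out
```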
### Hints for running the GPU version

The GPU port of Quantum Espresso runs almost entirely in the GPU memory. This means that jobs are restricted by the memory of the GPU device, normally 16-32 GB, regardless of the main node memory. Thus, unless many nodes are used, the user is likely to see job failures due to lack of memory, even for small datasets. For example, on the CSCS Piz Daint supercomputer each node has only 1 NVIDIA Tesla P100 (16 GB), which means that you will need at least 4 nodes to run even the smallest dataset (AUSURF in the UEABS).

## Execution

In the UEABS repository you will find a directory for each computer system tested, together with installation instructions and job scripts. In the following we describe in detail the execution procedure for the Cineca Galileo computer system.

### Execution on the Cineca Galileo (x86) system

Quantum Espresso has already been installed on the cluster and can be accessed via a specific module:

```bash
module load profile/phys
module load autoload qe/6.5
```

An example SLURM batch script is given below:

```bash
#!/bin/bash
#SBATCH --time=06:00:00        # Walltime in hh:mm:ss
#SBATCH --nodes=4              # Number of nodes
#SBATCH --ntasks-per-node=18   # Number of MPI ranks per node
#SBATCH --cpus-per-task=2      # Number of OpenMP threads for each MPI process/rank
#SBATCH --mem=118000           # Per-node memory request (MB)
#SBATCH --account=
#SBATCH --job-name=jobname
#SBATCH --partition=gll_usr_prod

module purge
module load profile/phys
module load autoload qe/6.5

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=${OMP_NUM_THREADS}

srun pw.x -npool 4 -input file.in > file.out
```

With the above SLURM directives we have asked for 4 nodes, 18 MPI tasks per node and 2 OpenMP threads per task. Note that this script needs to be submitted using the SLURM scheduler as follows:

```bash
sbatch myjob
```

## UEABS test cases

| UEABS name | QE name | Description | k-points | Notes |
|------------|---------|-------------|----------|-------|
| Small test case | AUSURF | 112 atoms | 2 | < 4-8 nodes on most systems |
| Large test case | GRIR443 | 432 atoms | 4 | Medium scaling, often 20 nodes |
| Very Large test case | CNT | Carbon nanotube | | Large scaling runs only. Memory and time requirements very high |

__Last updated: 22-October-2020__