# GROMACS


## Summary Version
1.0

## Purpose of Benchmark
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.


## Characteristics of Benchmark
GROMACS is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, 
but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for 
research on non-biological systems, e.g. polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation
(check the online reference or manual for details), but there are also quite a few features that make it stand out from the competition:

GROMACS provides very high performance compared to other programs. A lot of algorithmic optimizations have been introduced in the code;
for instance, the calculation of the virial has been extracted from the innermost loops over pairwise interactions, and GROMACS uses its own software routines to calculate the inverse square root. In GROMACS version 4.6 and up, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms to SIMD machine instructions, to utilize the available instruction-level parallelism. These kernels are available in both single and double precision, and support all the different kinds of SIMD instruction sets found in x86-family (and other) processors.

It is parallelized using OpenMP and/or MPI and supports CUDA-based GPU acceleration on Nvidia GPUs.

## Mechanics of Building Benchmark
Complete build instructions can be found at [https://manual.gromacs.org/documentation/current/install-guide/index.html](https://manual.gromacs.org/documentation/current/install-guide/index.html).
GROMACS requires a compiler with full C++17 support and a recent `CMake >= 3.13`. It also depends on an FFT library, typically FFTW.
If an optimized FFTW installation is available it can be used; if not, a CMake option can be set to download and build FFTW automatically.
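
As an illustration, a minimal sketch of the two FFTW configurations (the external FFTW prefix path below is an assumption and is site-specific):

```
# let GROMACS download and build its own FFTW (simplest, portable choice)
cmake .. -DGMX_FFT_LIBRARY=fftw3 -DGMX_BUILD_OWN_FFTW=ON

# use an optimized FFTW already installed on the system
# (the prefix path is a hypothetical, site-specific location)
cmake .. -DGMX_FFT_LIBRARY=fftw3 -DGMX_BUILD_OWN_FFTW=OFF \
         -DCMAKE_PREFIX_PATH=/opt/fftw/3.3.10
```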

### Download the source code
Latest Releases can be downloaded from [https://manual.gromacs.org/documentation/#latest-releases](https://manual.gromacs.org/documentation/#latest-releases)
```
wget https://ftp.gromacs.org/gromacs/gromacs-VERSION.tar.gz
tar -zxf gromacs-VERSION.tar.gz
cd gromacs-VERSION
mkdir build
cd build
```

### Build the Executable
GROMACS supports several types of builds:

* `Single Node, OpenMP` build 

* `Single/Multi-node`, pure MPI build

* `Hybrid OpenMP/MPI` build

* `Hybrid OpenMP/MPI/CUDA` build with support for NVIDIA GPUs.


Regardless of the build type, SIMD optimizations are applied according to the SIMD support autodetected at build time on the compile machine. If the executable is run on a different machine, it may fail to run if that machine does not support the detected SIMD instruction set, or it may run below its maximum speed if that machine supports a higher SIMD instruction set. In such cases the SIMD instruction set to build for can be specified explicitly at configure time.
There are further options to fine-tune the build, for example the GPU compute capability, which are described in the complete build instructions; a short sketch of these options is given after the example below.

A typical build procedure that covers the above build types looks like:

```
cmake \
        -DCMAKE_INSTALL_PREFIX=$HOME/Packages/gromacs/2020.5 \
        -DBUILD_SHARED_LIBS=off \
        -DBUILD_TESTING=off \
        -DCMAKE_VERBOSE_MAKEFILE=on  \
        -DREGRESSIONTEST_DOWNLOAD=OFF \
        -DCMAKE_C_COMPILER=`which mpicc` \
        -DCMAKE_CXX_COMPILER=`which mpicxx` \
        -DGMX_BUILD_OWN_FFTW=on \
        -DGMX_SIMD=AVX2_256 \
        -DGMX_DOUBLE=off \
        -DGMX_EXTERNAL_BLAS=off \
        -DGMX_EXTERNAL_LAPACK=off \
        -DGMX_FFT_LIBRARY=fftw3 \
        -DGMX_MPI=on \
        -DGMX_OPENMP=on \
        -DGMX_X11=off \
        -DGMX_GPU=on \
        ..

make 
make install
```
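
If the compute nodes differ from the build node, or a GPU build is required, the corresponding options can be set explicitly. A minimal sketch, assuming hypothetical target hardware (the SIMD level and CUDA compute capabilities shown must be matched to the machine; note that `GMX_GPU=CUDA` is the spelling used by GROMACS 2021 and newer, while 2020 uses `GMX_GPU=on` as above):

```
# build for the SIMD level of the compute nodes rather than the build machine
cmake .. -DGMX_SIMD=AVX_512

# CUDA build for NVIDIA GPUs; 70 (V100) and 80 (A100) are example compute capabilities
cmake .. -DGMX_GPU=CUDA -DGMX_CUDA_TARGET_SM="70;80"
```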
        
## Mechanics of Running Benchmark
The general way to run the benchmarks with the hybrid parallel executable, assuming the SLURM resource/batch manager, is:

```
...
#SBATCH --cpus-per-task=X
#SBATCH --ntasks-per-node=Y
#SBATCH --nodes=Z
...
# load the necessary environment modules (compilers, libraries, etc.)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
parallel_launcher launcher_options path_to_/gmx_mpi mdrun \
    -s TESTCASE.tpr \
    -deffnm md.TESTCASE.Nodes.$SLURM_NNODES.TasksPerNode.$SLURM_TASKS_PER_NODE.ThreadsPerTask.$OMP_NUM_THREADS.JobID.$SLURM_JOBID \
    -cpt 1000 \
    -maxh 1.0 \
    -nsteps 50000 \
    -ntomp $OMP_NUM_THREADS
```
Where:

* The environment variable for the number of threads `OMP_NUM_THREADS` must be set from `SLURM_CPUS_PER_TASK` in case of SLURM, before calling the executable.
* The `parallel_launcher` may be `srun`, `mpirun`, `mpiexec`, `mpiexec.hydra` or some variant such as `aprun` on Cray systems.
* `launcher_options` specifies the parallel placement in terms of total number of nodes, MPI ranks/tasks, tasks per node, and OpenMP threads per task (which should be equal to the value given to `OMP_NUM_THREADS`). These options are not necessary if the launcher picks up the parallel runtime configuration from the job environment.
* You can try almost any combination of tasks per node and OpenMP threads per task to investigate absolute performance and scaling on the machine of interest, as long as the product `tasks_per_node x OMP_NUM_THREADS` equals the total number of threads available on each node. Which combination gives the highest performance depends on the test case and on the machine configuration (memory layout, available cores, whether hyperthreading is enabled, etc.). Typically `OMP_NUM_THREADS` should be low (1-16) and chosen so that each task, together with its threads, fits on a single socket. A filled-in example is given after this list.
* The input file has the extension `.tpr` and should be specified unless it has the default name `topol.tpr`.
  This file is the output of the `gmx grompp` command, which preprocesses the ASCII input files into the binary `.tpr` file.
  In general, newer GROMACS versions are backwards compatible with `.tpr` files produced by older versions.
* The `-maxh` option instructs GROMACS to terminate smoothly after 0.99 times the specified time (in hours).
* The `-nsteps` option gives the number of time steps used to integrate the equations of motion. A value of `50000` is large enough to obtain
  reproducible timings. Depending on the size and details of the simulated system, GROMACS performs dynamic load balancing and tunes some internal
  parameters during the first few hundred to few thousand steps; with `50000` steps the contribution of this tuning phase to the total run time remains small.
* The `-cpt` option sets the checkpoint period (in minutes); the large value used here avoids checkpointing during benchmark runs.
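
For illustration, here is a filled-in version of the generic script above for Test Case A on a hypothetical machine with 128 cores per node; the node counts, module names, the `srun` launcher, and the `.tpr` filename are assumptions to be adapted to the target system:

```
#!/bin/bash
#SBATCH --job-name=gmx_testA
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# site-specific environment; module names are assumptions
module load gcc openmpi fftw

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# the .tpr filename is a placeholder; use the input file shipped with Test Case A
srun $HOME/Packages/gromacs/2020.5/bin/gmx_mpi mdrun \
    -s TestCaseA.tpr \
    -deffnm md.TestCaseA.Nodes.$SLURM_NNODES.JobID.$SLURM_JOBID \
    -cpt 1000 \
    -maxh 1.0 \
    -nsteps 50000 \
    -ntomp $OMP_NUM_THREADS
```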



### UEABS Benchmarks

**A) `GluCl Ion Channel`**

The ion channel system is the membrane protein GluCl, which is a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains 142k atoms, and is a quite challenging parallelisation case due to the small size. However, it is likely one of the most wanted target sizes for biomolecular simulations due to the importance of these proteins for pharmaceutical applications. It is particularly challenging due to a highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition. This test case was used as the “Small” test case in previous PRACE-2IP-5IP projects. It is reported to scale efficiently up to 300 - 1000 cores on recent x86 based systems.

Download test Case A [https://repository.prace-ri.eu/ueabs/GROMACS/2.2/GROMACS_TestCaseA.tar.xz](https://repository.prace-ri.eu/ueabs/GROMACS/2.2/GROMACS_TestCaseA.tar.xz)

**B) `Lignocellulose`**

A model of cellulose and lignocellulosic biomass in an aqueous solution
[http://pubs.acs.org/doi/abs/10.1021/bm400442n](http://pubs.acs.org/doi/abs/10.1021/bm400442n).
This system of 3.3 million atoms is inhomogeneous. Reaction-field electrostatics are used instead of PME, so this case scales well. It was used as the “Large” test case in previous PRACE-2IP-5IP projects and is reported to scale efficiently on 10000+ recent x86 cores.

Download test Case B [https://repository.prace-ri.eu/ueabs/GROMACS/2.2/GROMACS_TestCaseB.tar.xz](https://repository.prace-ri.eu/ueabs/GROMACS/2.2/GROMACS_TestCaseB.tar.xz)

**C) `STMV 8M`**

This is a `2 x 2 x 2` replica of the STMV (Satellite Tobacco Mosaic Virus) system, converted to GROMACS from the corresponding NAMD benchmark. It contains 8.5 million atoms and uses PME for electrostatics. It is reported to scale efficiently on more than 10000 recent x86 cores.

Download test Case C [https://repository.prace-ri.eu/ueabs/GROMACS/2.2/GROMACS_TestCaseC.tar.xz](https://repository.prace-ri.eu/ueabs/GROMACS/2.2/GROMACS_TestCaseC.tar.xz)
   
## Performance 
GROMACS reports both the execution time and the performance in its log file; they can be extracted as follows:

* `Performance` in `ns/day` units: `grep Performance logfile | awk -F ' ' '{print $2}'`
* `Execution Time` in seconds: `grep Time: logfile | awk -F ' ' '{print $3}'`
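
To tabulate these two metrics over a set of benchmark runs, a small shell loop can be used. A minimal sketch, assuming the log files follow the `md.*.log` naming produced by the `-deffnm` option used above:

```
# print performance (ns/day) and wall time (s) for every mdrun log in the directory
printf "%-60s %12s %12s\n" "logfile" "ns/day" "wall(s)"
for log in md.*.log; do
    perf=$(grep Performance "$log" | awk '{print $2}')
    wall=$(grep 'Time:' "$log" | awk '{print $3}')
    printf "%-60s %12s %12s\n" "$log" "$perf" "$wall"
done
```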