# NAMD

## Summary Version

1.0

## Purpose of Benchmark

NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms.

## Characteristics of Benchmark

NAMD is developed by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability to large numbers of processors. The application can read a wide variety of file formats commonly used in bio-molecular science, for example force fields and protein structures. A NAMD license can be applied for free of charge on the developers' website; once the license has been obtained, binaries for a number of platforms as well as the source code can be downloaded.

Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins, or between proteins and other chemical substances, is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins.

NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI or InfiniBand verbs, supporting both pure MPI and hybrid parallelisation. Offloading to accelerators is implemented for both GPUs and MIC (Intel Xeon Phi KNC).

## Mechanics of Building Benchmark

NAMD supports various build types. In order to run the current benchmarks, the memopt build with SMP and Tcl support is mandatory: NAMD must be compiled in memory-optimised mode, which uses a compressed version of the molecular structure and force field and supports parallel I/O. In addition to reducing per-node memory requirements, the compressed data files reduce startup times compared to reading ASCII PDB and PSF files. In order to build this version using MPI, the MPI implementation must support `MPI_THREAD_FUNNELED`.

* Uncompress/extract the source tar archive. Typically the source is in `NAMD_VERSION_Source`.
* `cd NAMD_VERSION_Source`
* Untar the bundled `charm-VERSION.tar`
* `cd charm-VERSION`
* Configure and compile Charm++. This step is system dependent. In most cases the `mpi-linux-x86_64` architecture works; on systems with GPUs and InfiniBand the `verbs-linux-x86_64` arch should be used. The available Charm++ arch files can be explored in the `charm-VERSION/src/arch` directory.

  `./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE`

  or, for CUDA-enabled NAMD,

  `./build charm++ verbs-linux-[x86_64|ppc64le] smp [gcc|xlc64] --with-production -DCMK_OPTIMIZE`

  For the x86_64 architecture one may use the Intel compilers by specifying `icc` instead of `gcc` in the Charm++ configuration. Issue `./build --help` for additional options.

The build script configures and compiles Charm++ and places its files in a directory inside the charm-VERSION tree whose name combines the arch, the compiler and the build options. List the contents of the charm-VERSION directory and note the name of this extra directory; it is needed when configuring NAMD. A consolidated sketch of these steps is given below.
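The following sketch collects the extraction and Charm++ build steps into a single shell sequence, assuming a generic Linux cluster with an MPI compiler wrapper `mpicxx` in the `PATH`. The version numbers, archive names and the resulting build-directory name are illustrative only and differ between NAMD releases.

```
# Illustrative sketch only: file names and version numbers depend on the NAMD release.
tar xzf NAMD_2.14_Source.tar.gz        # extract the NAMD source tree
cd NAMD_2.14_Source
tar xf charm-6.10.2.tar                # Charm++ is bundled with the NAMD source
cd charm-6.10.2

# SMP build of Charm++ on top of MPI (see above for the GPU/verbs variant)
./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE

ls -d mpi-linux-x86_64*                # note the resulting build directory,
                                       # e.g. mpi-linux-x86_64-smp-mpicxx
cd ..
```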
Change back to the NAMD source directory (`cd ..`) and configure NAMD. There are arch files with settings for various types of systems; check the `arch` directory for the possibilities. The `config` tool in the NAMD_Source-VERSION directory is used to configure the build, and some options must be specified:

```
# The last two options are needed only for GPU-enabled builds.
./config Linux-x86_64-g++ \
  --charm-base ./charm-VERSION \
  --charm-arch charm-ARCH \
  --with-fftw3 --with-fftw-prefix PATH_TO_FFTW_INSTALLATION \
  --with-tcl --tcl-prefix PATH_TO_TCL \
  --with-memopt \
  --cc-opts '-O3 -march=native -mtune=native' \
  --cxx-opts '-O3 -march=native -mtune=native' \
  --with-cuda \
  --cuda-prefix PATH_TO_CUDA_INSTALLATION
```

It should be noted that for GPU support one has to use `verbs-linux-x86_64` instead of `mpi-linux-x86_64`. You can issue `./config --help` to see all available options. The following are absolutely necessary:

* `--with-memopt`
* an SMP build of Charm++
* `--with-tcl`

You need to specify the FFTW3 installation directory. On systems that use environment modules, load the existing FFTW3 module and use the environment variables it provides. If the FFTW3 libraries are not installed on your system, download and install fftw-3.3.9.tar.gz from [http://www.fftw.org/](http://www.fftw.org/).

When `config` finishes, it prompts you to change to a directory named ARCH-Compiler and run `make`. If everything is ok, you will find an executable named `namd2` in this directory. The same directory also contains a binary or shell script (depending on the architecture) called `charmrun`, which is necessary for GPU runs on some systems.

### Download the source code

The official site to download NAMD is: [http://www.ks.uiuc.edu/Research/namd/](https://www.ks.uiuc.edu/Research/namd/)

You need to register (free of charge) to obtain a copy of NAMD from: [https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD](https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD)

### Mechanics of Running Benchmark

The general way to run the benchmarks with the hybrid parallel executable, assuming the SLURM resource/batch manager, is:

```
...
#SBATCH --cpus-per-task=X
#SBATCH --ntasks-per-node=Y
#SBATCH --nodes=Z
...
# load the necessary environment modules (compilers, libraries, etc.)

PPN=`expr $SLURM_CPUS_PER_TASK - 1`

parallel_launcher launcher_options path_to/namd2 ++ppn $PPN \
  TESTCASE.namd > \
  TESTCASE.Nodes.$SLURM_NNODES.TasksPerNode.$SLURM_TASKS_PER_NODE.ThreadsPerTask.$SLURM_CPUS_PER_TASK.JobID.$SLURM_JOBID
```

Where:

* The `parallel_launcher` may be `srun`, `mpirun`, `mpiexec`, `mpiexec.hydra` or some variant such as `aprun` on Cray systems.
* `launcher_options` specifies the parallel placement in terms of the total number of nodes, MPI ranks/tasks, tasks per node and OpenMP threads per task.
* The variable `PPN` is the number of threads per task minus one. This is necessary because NAMD uses one thread per task for communication.
* You can try almost any combination of tasks per node and threads per task to investigate absolute performance and scaling on the machine of interest, as long as the product `tasks_per_node x threads_per_task` equals the total number of threads available on each node. Increasing the number of tasks per node increases the number of communication threads. Usually the best performance is obtained with the number of tasks per node equal to the number of sockets per node, but which configuration performs best depends on the test case and the machine configuration (memory layout, available cores, whether hyper-threading is enabled, etc.).

A filled-in example job script is given below.
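For concreteness, a minimal filled-in SLURM script for a pure-MPI SMP build might look as follows. The module names, node geometry (two sockets with 64 cores each) and file paths are assumptions and must be adapted to the target system and test case.

```
#!/bin/bash
#SBATCH --job-name=namd_ueabs
#SBATCH --nodes=4                    # Z nodes (assumed)
#SBATCH --ntasks-per-node=2          # one task per socket (assumed two-socket nodes)
#SBATCH --cpus-per-task=64           # assumed 64 cores per socket
#SBATCH --time=01:00:00

# Module names are assumptions; load whatever provides the compiler, MPI and FFTW.
module load gcc openmpi fftw

# NAMD reserves one thread per task for communication.
PPN=$(( SLURM_CPUS_PER_TASK - 1 ))

srun ./namd2 ++ppn $PPN TESTCASE.namd > \
  TESTCASE.Nodes.${SLURM_NNODES}.TasksPerNode.${SLURM_NTASKS_PER_NODE}.ThreadsPerTask.${SLURM_CPUS_PER_TASK}.JobID.${SLURM_JOBID}
```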
On machines with GPUs:

* The verbs-ARCH architecture should be used instead of the MPI one.
* The number of tasks per node should be equal to the number of GPUs per node.
* Typically, when GPUs are allocated, the batch system sets the environment variable `CUDA_VISIBLE_DEVICES`, and NAMD takes the list of devices to use from this variable. If for any reason no GPUs are reported by NAMD, add the flag `+devices $CUDA_VISIBLE_DEVICES` to the NAMD flags.
* In some cases, due to the features enabled in the parallel launcher, one has to use the `charmrun` shipped with NAMD together with a (typically hydra-based) parallel launcher that uses ssh to spawn processes. In this case passwordless ssh must be enabled between the compute nodes, and the corresponding script should look like:

```
PPN=`expr $SLURM_CPUS_PER_TASK - 1`
P="$(($PPN * $SLURM_NTASKS_PER_NODE * $SLURM_NNODES))"
PROCSPERNODE="$(($SLURM_CPUS_PER_TASK * $SLURM_NTASKS_PER_NODE))"

# Build a Charm++ nodelist file with one "host" line per allocated node.
for n in $(scontrol show hostnames $SLURM_NODELIST); do
  echo "host $n ++cpus $PROCSPERNODE" >> nodelist.$SLURM_JOB_ID
done

PATH_TO_charmrun ++mpiexec +p $P PATH_TO_namd2 ++ppn $PPN \
  ++nodelist ./nodelist.$SLURM_JOB_ID +devices $CUDA_VISIBLE_DEVICES \
  TESTCASE.namd > TESTCASE.Nodes.$SLURM_NNODES.TasksPerNode.$SLURM_NTASKS_PER_NODE.ThreadsPerTask.$SLURM_CPUS_PER_TASK.log
```

### UEABS Benchmarks

The data sets are based on the original Satellite Tobacco Mosaic Virus (STMV) data set from the official NAMD site. The memory-optimised build of the package and the corresponding data sets are used in benchmarking; the data are converted to the binary format used by the memory-optimised build.

**A) Test Case A: STMV.8M**

This is a 2×2×2 replication of the original STMV data set from the official NAMD site. The system contains roughly 8 million atoms.

Download Test Case A: [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseA.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseA.tar.gz)

**B) Test Case B: STMV.28M**

This is a 3×3×3 replication of the original STMV data set from the official NAMD site, created during the PRACE-2IP project. The system contains roughly 28 million atoms and is expected to scale efficiently up to a few tens of thousands of x86 cores.

Download Test Case B: [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseB.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseB.tar.gz)

**C) Test Case C: STMV.210M**

This is a 5×6×7 replication of the original STMV data set from the official NAMD site. The system contains roughly 210 million atoms and is expected to scale efficiently to more than one hundred thousand recent x86 cores.

Download Test Case C: [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseC.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseC.tar.gz)

## Performance

NAMD reports some timings in its log file. At the end of the log file, the wall clock time, the CPU time and the memory used are reported in a line that looks like:

`WallClock: 629.896729 CPUTime: 629.896729 Memory: 2490.726562 MB`

One may obtain the execution time with: `grep WallClock logfile | awk -F ' ' '{print $2}'`.

Since the input data have a size of the order of a few GB, and NAMD writes a similar amount of data at the end of the run, it is common (depending on the file system load) to observe higher, and usually not reproducible, startup and close times. One should check that the startup and close times are no more than 1-2 seconds using:

`grep "Info: Finished startup at" logfile | awk -F ' ' '{print $5}'`

`grep "file I/O" logfile | awk -F ' ' '{print $7}'`

If the reported startup and close times are significant, they should be subtracted from the reported WallClock in order to obtain the real run performance. A small helper sketch that automates this check is given below.
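The sketch below simply reuses the grep/awk commands given in this section to compute the adjusted time; the script name and the assumption that each pattern matches a single log line are hypothetical.

```
#!/bin/bash
# Hypothetical helper: report the effective benchmark time from a NAMD log file,
# subtracting the startup and file-I/O (close) times as described above.
# Usage: ./namd_time.sh logfile
log="$1"

wallclock=$(grep "WallClock:" "$log" | awk '{print $2}')
startup=$(grep "Info: Finished startup at" "$log" | awk '{print $5}')
fileio=$(grep "file I/O" "$log" | awk '{print $7}')

echo "WallClock : $wallclock s"
echo "Startup   : $startup s"
echo "File I/O  : $fileio s"
awk -v w="$wallclock" -v s="$startup" -v f="$fileio" \
    'BEGIN { printf "Adjusted  : %.2f s\n", w - s - f }'
```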