# NAMD

## Summary Version

1.0

## Purpose of Benchmark

NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms.

## Characteristics of Benchmark

NAMD is developed by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability to large numbers of processors. The application can read a wide variety of file formats commonly used in bio-molecular science, for example force fields and protein structures. A NAMD license can be applied for free of charge on the developers' website; once the license has been obtained, binaries for a number of platforms as well as the source code can be downloaded.

Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins, or between proteins and other chemical substances, is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins.

NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI or InfiniBand verbs, supporting both pure MPI and hybrid parallelisation. Offloading to accelerators is implemented for both GPUs and MIC (Intel Xeon Phi KNC).

## Mechanics of Building Benchmark

NAMD supports various build types. In order to run the current benchmarks, the memopt build with SMP and Tcl support is mandatory: NAMD must be compiled in memory-optimised mode, which uses a compressed version of the molecular structure and force field and supports parallel I/O. In addition to reducing per-node memory requirements, the compressed data files reduce startup times compared to reading ASCII PDB and PSF files. In order to build this version using MPI, the MPI implementation must support `MPI_THREAD_FUNNELED`.

* Uncompress/extract the source tar archive. Typically the source is in `NAMD_VERSION_Source`.
* `cd NAMD_VERSION_Source`
* Untar the bundled `charm-VERSION.tar`
* `cd charm-VERSION`
* Configure and compile Charm++. This step is system dependent. In most cases the `mpi-linux-x86_64` architecture works; on systems with GPUs and InfiniBand the `verbs-linux-x86_64` arch should be used. The available Charm++ arch files can be explored in the `charm-VERSION/src/arch` directory.

  `./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE`

  or, for CUDA-enabled NAMD,

  `./build charm++ verbs-linux-[x86_64|ppc64le] smp [gcc|xlc64] --with-production -DCMK_OPTIMIZE`

  For the x86_64 architecture one may use the Intel compilers by specifying `icc` instead of `gcc` in the Charm++ configuration. Issue `./build --help` for additional options.

The build script configures and compiles Charm++ and places its files in a directory inside the charm-VERSION tree whose name combines the arch, the compiler and the build options. List the contents of the charm-VERSION directory and note the name of this extra directory; it is needed when configuring NAMD. A consolidated sketch of these steps is given below.
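The following sketch collects the extraction and Charm++ build steps into a single shell sequence, assuming a generic Linux cluster with an MPI compiler wrapper `mpicxx` in the `PATH`. The version numbers, archive names and the resulting build-directory name are illustrative only and differ between NAMD releases.

```
# Illustrative sketch only: file names and version numbers depend on the NAMD release.
tar xzf NAMD_2.14_Source.tar.gz        # extract the NAMD source tree
cd NAMD_2.14_Source
tar xf charm-6.10.2.tar                # Charm++ is bundled with the NAMD source
cd charm-6.10.2

# SMP build of Charm++ on top of MPI (see above for the GPU/verbs variant)
./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE

ls -d mpi-linux-x86_64*                # note the resulting build directory,
                                       # e.g. mpi-linux-x86_64-smp-mpicxx
cd ..
```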
Change back to the NAMD source directory (`cd ..`) and configure NAMD. There are arch files with settings for various types of systems; check the `arch` directory for the possibilities. The `config` tool in the NAMD_Source-VERSION directory is used to configure the build, and some options must be specified:

```
# The last two options are needed only for GPU-enabled builds.
./config Linux-x86_64-g++ \
  --charm-base ./charm-VERSION \
  --charm-arch charm-ARCH \
  --with-fftw3 --with-fftw-prefix PATH_TO_FFTW_INSTALLATION \
  --with-tcl --tcl-prefix PATH_TO_TCL \
  --with-memopt \
  --cc-opts '-O3 -march=native -mtune=native' \
  --cxx-opts '-O3 -march=native -mtune=native' \
  --with-cuda \
  --cuda-prefix PATH_TO_CUDA_INSTALLATION
```

It should be noted that for GPU support one has to use `verbs-linux-x86_64` instead of `mpi-linux-x86_64`. You can issue `./config --help` to see all available options. The following are absolutely necessary:

* `--with-memopt`
* an SMP build of Charm++
* `--with-tcl`

You need to specify the FFTW3 installation directory. On systems that use environment modules, load the existing FFTW3 module and use the environment variables it provides. If the FFTW3 libraries are not installed on your system, download and install fftw-3.3.9.tar.gz from [http://www.fftw.org/](http://www.fftw.org/).

When `config` finishes, it prompts you to change to a directory named ARCH-Compiler and run `make`. If everything is ok, you will find an executable named `namd2` in this directory. The same directory also contains a binary or shell script (depending on the architecture) called `charmrun`, which is necessary for GPU runs on some systems.

### Download the source code

The official site to download NAMD is: [http://www.ks.uiuc.edu/Research/namd/](https://www.ks.uiuc.edu/Research/namd/)

You need to register (free of charge) to obtain a copy of NAMD from: [https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD](https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD)

### Mechanics of Running Benchmark

The general way to run the benchmarks with the hybrid parallel executable, assuming the SLURM resource/batch manager, is:

```
...
#SBATCH --cpus-per-task=X
#SBATCH --ntasks-per-node=Y
#SBATCH --nodes=Z
...
# load the necessary environment modules (compilers, libraries, etc.)

PPN=`expr $SLURM_CPUS_PER_TASK - 1`

parallel_launcher launcher_options path_to/namd2 ++ppn $PPN \
  TESTCASE.namd > \
  TESTCASE.Nodes.$SLURM_NNODES.TasksPerNode.$SLURM_TASKS_PER_NODE.ThreadsPerTask.$SLURM_CPUS_PER_TASK.JobID.$SLURM_JOBID
```

Where:

* The `parallel_launcher` may be `srun`, `mpirun`, `mpiexec`, `mpiexec.hydra` or some variant such as `aprun` on Cray systems.
* `launcher_options` specifies the parallel placement in terms of the total number of nodes, MPI ranks/tasks, tasks per node and OpenMP threads per task.
* The variable `PPN` is the number of threads per task minus one. This is necessary because NAMD uses one thread per task for communication.
* You can try almost any combination of tasks per node and threads per task to investigate absolute performance and scaling on the machine of interest, as long as the product `tasks_per_node x threads_per_task` equals the total number of threads available on each node. Increasing the number of tasks per node increases the number of communication threads. Usually the best performance is obtained with the number of tasks per node equal to the number of sockets per node, but which configuration performs best depends on the test case and the machine configuration (memory layout, available cores, whether hyper-threading is enabled, etc.).

A filled-in example job script is given below.
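For concreteness, a minimal filled-in SLURM script for a pure-MPI SMP build might look as follows. The module names, node geometry (two sockets with 64 cores each) and file paths are assumptions and must be adapted to the target system and test case.

```
#!/bin/bash
#SBATCH --job-name=namd_ueabs
#SBATCH --nodes=4                    # Z nodes (assumed)
#SBATCH --ntasks-per-node=2          # one task per socket (assumed two-socket nodes)
#SBATCH --cpus-per-task=64           # assumed 64 cores per socket
#SBATCH --time=01:00:00

# Module names are assumptions; load whatever provides the compiler, MPI and FFTW.
module load gcc openmpi fftw

# NAMD reserves one thread per task for communication.
PPN=$(( SLURM_CPUS_PER_TASK - 1 ))

srun ./namd2 ++ppn $PPN TESTCASE.namd > \
  TESTCASE.Nodes.${SLURM_NNODES}.TasksPerNode.${SLURM_NTASKS_PER_NODE}.ThreadsPerTask.${SLURM_CPUS_PER_TASK}.JobID.${SLURM_JOBID}
```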
On machines with GPUs:

* The verbs-ARCH architecture should be used instead of the MPI one.
* The number of tasks per node should be equal to the number of GPUs per node.
* Typically, when GPUs are allocated, the batch system sets the environment variable `CUDA_VISIBLE_DEVICES`, and NAMD takes the list of devices to use from this variable. If for any reason no GPUs are reported by NAMD, add the flag `+devices $CUDA_VISIBLE_DEVICES` to the NAMD flags.
* In some cases, due to the features enabled in the parallel launcher, one has to use the `charmrun` shipped with NAMD together with a (typically hydra-based) parallel launcher that uses ssh to spawn processes. In this case passwordless ssh must be enabled between the compute nodes, and the corresponding script should look like:

```
PPN=`expr $SLURM_CPUS_PER_TASK - 1`
P="$(($PPN * $SLURM_NTASKS_PER_NODE * $SLURM_NNODES))"
PROCSPERNODE="$(($SLURM_CPUS_PER_TASK * $SLURM_NTASKS_PER_NODE))"

# Build a Charm++ nodelist file with one "host" line per allocated node.
for n in $(scontrol show hostnames $SLURM_NODELIST); do
  echo "host $n ++cpus $PROCSPERNODE" >> nodelist.$SLURM_JOB_ID
done

PATH_TO_charmrun ++mpiexec +p $P PATH_TO_namd2 ++ppn $PPN \
  ++nodelist ./nodelist.$SLURM_JOB_ID +devices $CUDA_VISIBLE_DEVICES \
  TESTCASE.namd > TESTCASE.Nodes.$SLURM_NNODES.TasksPerNode.$SLURM_NTASKS_PER_NODE.ThreadsPerTask.$SLURM_CPUS_PER_TASK.log
```

### UEABS Benchmarks

The data sets are based on the original Satellite Tobacco Mosaic Virus (STMV) data set from the official NAMD site. The memory-optimised build of the package and the corresponding data sets are used in benchmarking; the data are converted to the binary format used by the memory-optimised build.

**A) Test Case A: STMV.8M**

This is a 2×2×2 replication of the original STMV data set from the official NAMD site. The system contains roughly 8 million atoms.

Download Test Case A: [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseA.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseA.tar.gz)

**B) Test Case B: STMV.28M**

This is a 3×3×3 replication of the original STMV data set from the official NAMD site, created during the PRACE-2IP project. The system contains roughly 28 million atoms and is expected to scale efficiently up to a few tens of thousands of x86 cores.

Download Test Case B: [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseB.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseB.tar.gz)

**C) Test Case C: STMV.210M**

This is a 5×6×7 replication of the original STMV data set from the official NAMD site. The system contains roughly 210 million atoms and is expected to scale efficiently to more than one hundred thousand recent x86 cores.

Download Test Case C: [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseC.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseC.tar.gz)

## Performance

NAMD reports some timings in its log file. At the end of the log file, the wall clock time, the CPU time and the memory used are reported in a line that looks like:

`WallClock: 629.896729 CPUTime: 629.896729 Memory: 2490.726562 MB`

One may obtain the execution time with: `grep WallClock logfile | awk -F ' ' '{print $2}'`.

Since the input data have a size of the order of a few GB, and NAMD writes a similar amount of data at the end of the run, it is common (depending on the file system load) to observe higher, and usually not reproducible, startup and close times. One should check that the startup and close times are no more than 1-2 seconds using:

`grep "Info: Finished startup at" logfile | awk -F ' ' '{print $5}'`

`grep "file I/O" logfile | awk -F ' ' '{print $7}'`

If the reported startup and close times are significant, they should be subtracted from the reported WallClock in order to obtain the real run performance. A small helper sketch that automates this check is given below.
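The sketch below simply reuses the grep/awk commands given in this section to compute the adjusted time; the script name and the assumption that each pattern matches a single log line are hypothetical.

```
#!/bin/bash
# Hypothetical helper: report the effective benchmark time from a NAMD log file,
# subtracting the startup and file-I/O (close) times as described above.
# Usage: ./namd_time.sh logfile
log="$1"

wallclock=$(grep "WallClock:" "$log" | awk '{print $2}')
startup=$(grep "Info: Finished startup at" "$log" | awk '{print $5}')
fileio=$(grep "file I/O" "$log" | awk '{print $7}')

echo "WallClock : $wallclock s"
echo "Startup   : $startup s"
echo "File I/O  : $fileio s"
awk -v w="$wallclock" -v s="$startup" -v f="$fileio" \
    'BEGIN { printf "Adjusted  : %.2f s\n", w - s - f }'
```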