# NAMD
## Summary Version
1.0
## Purpose of Benchmark
NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms.
## Characteristics of Benchmark
NAMD is developed by the “Theoretical and Computational Biophysics Group” at the University of Illinois
at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability when utilising a
large number of processors. The application can read a wide variety of file formats commonly used in
bio-molecular science, for example force fields and protein structures.
A NAMD license can be applied for on the developer’s website free of charge.
Once the license has been obtained, binaries for a number of platforms and the source can be downloaded from the website.
Deployment areas of NAMD include pharmaceutical research by academic and industrial users.
NAMD is particularly suitable when the interaction between a number of proteins or between proteins and other
chemical substances is of interest.
Typical examples are vaccine research and transport processes through cell membrane proteins.
NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI or InfiniBand verbs,
supporting both pure MPI and hybrid parallelisation.
Offloading to accelerators is implemented for both GPU and MIC (Intel Xeon Phi KNC).
## Mechanics of Building Benchmark
NAMD supports various build types.
In order to run the current benchmarks, the memory-optimised (memopt) build with SMP and Tcl support is mandatory.
NAMD should be compiled in memory-optimised mode, which uses a compressed version of the molecular
structure and force field and supports parallel I/O.
In addition to reducing per-node memory requirements, the compressed data files reduce startup times
compared to reading ASCII PDB and PSF files.
In order to build this version using MPI, the MPI implementation must support `MPI_THREAD_FUNNELED`.
* Uncompress/extract the tar source archive.
* Untar the `charm-VERSION.tar`
* `cd charm-VERSION`
* Configure and compile charm++. This step is system dependent.
In most cases, the `mpi-linux-x86_64` architecture works.
On systems with GPUs and Infiniband, the `verbs-linux-x86_64` arch should be used.
One may explore the available `charm++` arch files in `charm-VERSION/src/arch` directory.
`./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE`
or, for CUDA-enabled NAMD:
`./build charm++ verbs-linux-[x86_64|ppc64le] smp [gcc|xlc64] --with-production -DCMK_OPTIMIZE`
For x86_64 architecture one may use Intel Compilers specifying icc instead of gcc in charm++ configuration.
Issue `./build --help` to see additional options.
The build script will configure and compile charm++. Its files are placed in a directory
inside the charm-VERSION tree whose name is a combination of the architecture, compiler, and options.
List the contents of the charm-VERSION directory and note the name of this new directory.
* `cd ..`
* Configure NAMD
There are arch files with settings for various types of systems; check the `arch` directory for the possibilities.
The `config` tool in the NAMD_Source-VERSION directory is used to configure the build.
Some options must be specified:
```
--charm-base ./charm-VERSION \
--charm-arch charm-ARCH \
--with-fftw3 --with-fftw-prefix PATH_TO_FFTW_INSTALLATION \
--with-tcl --tcl-prefix PATH_TO_TCL \
--with-memopt \
--cc-opts '-O3 -march=native -mtune=native' \
--cxx-opts '-O3 -march=native -mtune=native' \
# The next two options are needed ONLY for GPU-enabled builds
--with-cuda \
--cuda-prefix PATH_TO_CUDA_INSTALLATION
```
It should be noted that for GPU support one has to use `verbs-linux-x86_64` instead of `mpi-linux-x86_64`.
You can issue `./config --help` to see all available options.
The following are absolutely necessary:
* `--with-memopt`
* an SMP build of charm++
* `--with-tcl`
You need to specify the FFTW3 installation directory. On systems that use environment modules you need
to load the existing FFTW3 module and probably use the provided environment variables.
If the FFTW3 libraries are not installed on your system, download and install fftw-3.3.9.tar.gz from [http://www.fftw.org/](http://www.fftw.org/).
When `config` finishes, it prompts you to change to a directory named ARCH-Compiler and run `make`.
If everything is ok, you will find an executable named `namd2` in this directory.
In this directory there will also be a binary or shell script (depending on the architecture) called `charmrun`.
This is necessary for GPU runs on some systems.
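Putting the steps above together, a complete build might look like the sketch below. The archive names, the `Linux-x86_64-g++` arch, the charm arch string, and the `FFTW_DIR`/`TCL_DIR` variables are assumptions that must be adapted to the actual source tree and system:

```shell
#!/bin/bash
# Hedged sketch of a full MPI/SMP memopt build; all version strings and
# paths below are placeholders, not the definitive procedure.
set -e

tar xf NAMD_Source-VERSION.tar.gz && cd NAMD_Source-VERSION
tar xf charm-VERSION.tar && cd charm-VERSION

# System-dependent charm++ build (MPI flavour shown; use verbs-linux-* for GPUs)
./build charm++ mpi-linux-x86_64 smp mpicxx --with-production -DCMK_OPTIMIZE
cd ..

# FFTW_DIR and TCL_DIR are assumed to point at existing installations
./config Linux-x86_64-g++ \
    --charm-base ./charm-VERSION \
    --charm-arch mpi-linux-x86_64-smp-mpicxx \
    --with-fftw3 --with-fftw-prefix "$FFTW_DIR" \
    --with-tcl --tcl-prefix "$TCL_DIR" \
    --with-memopt \
    --cc-opts '-O3 -march=native -mtune=native' \
    --cxx-opts '-O3 -march=native -mtune=native'

cd Linux-x86_64-g++ && make   # produces namd2 (and charmrun)
```

On a system with environment modules, the compiler, MPI, FFTW3, and Tcl modules would be loaded before running this script.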
### Download the source code
The official site to download NAMD is: [https://www.ks.uiuc.edu/Research/namd/](https://www.ks.uiuc.edu/Research/namd/)
You need to register (free of charge) to get a copy of NAMD from:
[https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD](https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD)
## Mechanics of Running Benchmark
The general way to run the benchmarks with hybrid parallel executable, assuming SLURM Resource/Batch manager is:
```
...
#SBATCH --cpus-per-task=X
#SBATCH --ntasks-per-node=Y
#SBATCH --nodes=Z
...
# load the necessary environment modules: compilers, libraries, etc.
PPN=`expr $SLURM_CPUS_PER_TASK - 1`
parallel_launcher launcher_options path_to_/namd2 ++ppn $PPN \
TESTCASE.namd > \
TESTCASE.Nodes.$SLURM_NNODES.TasksPerNode.$SLURM_TASKS_PER_NODE.ThreadsPerTask.$SLURM_CPUS_PER_TASK.JobID.$SLURM_JOBID
```
Where:
* The `parallel_launcher` may be `srun`, `mpirun`, `mpiexec`, `mpiexec.hydra` or some variant such as `aprun` on Cray systems.
* `launcher_options` specifies the parallel placement in terms of total number of nodes, MPI ranks/tasks, tasks per node, and OpenMP threads per task.
* The variable `PPN` is the number of threads per task minus one. This is necessary since NAMD uses one thread per task for communication.
* You can try almost any combination of tasks per node and threads per task to investigate absolute performance and scaling on the machine of interest,
as long as the product `tasks_per_node x threads_per_task` equals the total number of threads available on each node.
Increasing the number of tasks per node increases the number of communication
threads. Usually, the best performance is obtained when the number of tasks per node equals the number of sockets per node.
Which configuration gives the highest performance depends on the test case and the machine configuration (memory slots, available cores, whether hyperthreading is enabled, etc.).
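The valid placements can be enumerated with a short script. The script below assumes a hypothetical node with 48 hardware threads; set `CORES` to the real value for the target machine:

```shell
#!/bin/bash
# List valid tasks-per-node x threads-per-task combinations for one node.
# CORES=48 is a hypothetical example value, not a recommendation.
CORES=48
for tasks in $(seq 1 "$CORES"); do
  if [ $((CORES % tasks)) -eq 0 ]; then
    threads=$((CORES / tasks))
    # NAMD uses one thread per task for communication, hence ppn = threads - 1
    echo "tasks_per_node=$tasks threads_per_task=$threads ppn=$((threads - 1))"
  fi
done
```

For a hypothetical two-socket, 48-thread node, the sockets-per-node rule of thumb above would suggest starting from `tasks_per_node=2`, i.e. `threads_per_task=24` and `++ppn 23`.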
On machines with GPUs:
* The verbs-ARCH architecture should be used instead of MPI.
* The number of tasks per node should be equal to the number of GPUs per node.
* Typically, when one allocates GPUs, the batch system sets the environment
variable `CUDA_VISIBLE_DEVICES`. NAMD gets the list of devices to use from this variable. If for any reason no GPUs are reported by NAMD, one should add
the flags `+devices $CUDA_VISIBLE_DEVICES` to the NAMD flags.
* In some cases, due to the enabled features of the parallel launcher, one
has to use the `charmrun` shipped with NAMD, typically together with the hydra parallel launcher, which uses ssh to spawn processes. In this case, passwordless ssh must be enabled between the compute nodes,
and the corresponding script should look like:
```
PPN=`expr $SLURM_CPUS_PER_TASK - 1`
P="$(($PPN * $SLURM_NTASKS_PER_NODE * $SLURM_NNODES))"
PROCSPERNODE="$(($SLURM_CPUS_PER_TASK * $SLURM_NTASKS_PER_NODE))"
for n in `scontrol show hostnames $SLURM_NODELIST`; do \
echo "host $n ++cpus $PROCSPERNODE" >> nodelist.$SLURM_JOB_ID
done;
PATH_TO_charmrun ++mpiexec +p $P PATH_TO_namd2 ++ppn $PPN \
++nodelist ./nodelist.$SLURM_JOB_ID +devices $CUDA_VISIBLE_DEVICES \
TESTCASE.namd > TESTCASE.Nodes.$SLURM_NNODES.TasksPerNode.$SLURM_NTASKS_PER_NODE.ThreadsPerTask.$SLURM_CPUS_PER_TASK.log
```
## UEABS Benchmarks
The datasets are based on the original `Satellite Tobacco Mosaic Virus (STMV)`
dataset from the official NAMD site. The memory optimised build of the
package and data sets are used in benchmarking. Data are converted to
the appropriate binary format used by the memory optimised build.
**A) Test Case A: STMV.8M**
This is a 2×2×2 replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms.
Download test Case A [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseA.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseA.tar.gz)
**B) Test Case B: STMV.28M**
This is a 3×3×3 replication of the original STMV dataset from the official NAMD site, created during the PRACE-2IP project. The system contains roughly 28 million atoms and is expected to scale efficiently up to a few tens of thousands of x86 cores.
Download test Case B [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseB.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseB.tar.gz)
**C) Test Case C: STMV.210M**
This is a 5×6×7 replication of the original STMV dataset from the official NAMD site. The system contains roughly 210 million atoms and is expected to scale efficiently to more than one hundred thousand recent x86 cores.
Download test Case C [https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseC.tar.gz](https://repository.prace-ri.eu/ueabs/NAMD/2.2/NAMD_TestCaseC.tar.gz)
## Performance
NAMD reports some timings in the log file. At the end of the log file,
the wall clock time, CPU time, and memory used are reported in a line that looks like:
`WallClock: 629.896729 CPUTime: 629.896729 Memory: 2490.726562 MB`
One may obtain the execution time with:
`grep WallClock logfile | awk -F ' ' '{print $2}'`.
Since the input data have a size of the order of a few GB, and NAMD writes a similar amount of data at the end, it is common, depending on the filesystem load, to see high and usually non-reproducible startup and close times.
One has to check that the startup and close times are no more than 1–2 seconds, using:
`grep "Info: Finished startup at" logfile | awk -F ' ' '{print $5}'`
`grep "file I/O" logfile | awk -F ' ' '{print $7}'`
If the reported startup and close times are significant, they should be subtracted from the reported WallClock time in order to obtain the real run performance.
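This bookkeeping can be sketched as below. The `logfile` here is a mock created for illustration, with a `WallClock` line copied from the example above and an assumed startup line; any reported file I/O time would be subtracted the same way:

```shell
#!/bin/bash
# Subtract the startup time from the reported WallClock to estimate the
# real run time. The 'logfile' written here is a mock, illustrative log.
cat > logfile <<'EOF'
Info: Finished startup at 12.5 s, 2480.2 MB of memory in use
WallClock: 629.896729  CPUTime: 629.896729  Memory: 2490.726562 MB
EOF

wall=$(grep WallClock logfile | awk '{print $2}')
startup=$(grep "Finished startup at" logfile | awk '{print $5}')
# any reported file I/O time should be subtracted here as well
net=$(awk -v w="$wall" -v s="$startup" 'BEGIN { printf "%.2f", w - s }')
echo "net run time: $net s"
```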