We performed scalability tests on 512 and 1024 cores for test case A, and on 4096, 8192, and 16384 cores for test case B.
Both test cases give a good understanding of node performance and interconnect behavior. We switch off the generation of mesh files by setting the flag `nn_mesh = 0` in the `namelist_ref` file, and `using_server = false` is set in the `io_server` file.
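As a rough illustration of the first of these settings, the fragment below shows how the flag would appear in Fortran namelist syntax. The group name `&nam_sketch` is purely hypothetical and does not correspond to an actual NEMO namelist group; in practice the flag is changed inside whichever group of `namelist_ref` already contains it, and the `using_server = false` setting is made analogously in the separate `io_server` file.

```fortran
! Illustrative only: the group name below is hypothetical, not a real NEMO group.
&nam_sketch
   nn_mesh = 0        ! 0 => do not write mesh files during the run
/
```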
We report performance as the step time, i.e. the total computational time averaged over the number of time steps for each test case. This allows systems to be compared in a standard manner across all combinations of system architectures; reporting the time per computational time step also makes the results more reproducible and comparable.
Since NEMO supports both weak and strong scalability, test case A and test case B can both be scaled down to run on a smaller number of processors while keeping the memory per processor constant, yielding similar step-time results. To measure the step time, we inserted a patch that calls `MPI_wtime()` for each step in the `nemogcn.f90` file and accumulates the step times up to the second-to-last step. We then divide the total cumulative time by the number of time steps to average out any overhead.
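The timing scheme can be summarised with the following self-contained sketch. It is not the actual patch applied to NEMO; the program structure, variable names, and number of steps are illustrative, and the model time step is replaced by a placeholder comment.

```fortran
! Hedged sketch of the per-step timing scheme; not the actual NEMO patch.
PROGRAM step_timer_sketch
   USE mpi
   IMPLICIT NONE
   INTEGER            :: ierr, rank, istep
   INTEGER, PARAMETER :: nsteps = 100              ! illustrative number of time steps
   DOUBLE PRECISION   :: t_start, t_cumul, t_step_avg

   CALL MPI_Init(ierr)
   CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   t_cumul = 0.0d0
   DO istep = 1, nsteps - 1                        ! accumulate up to the second-to-last step
      t_start = MPI_Wtime()
      ! ... one model time step would be executed here ...
      t_cumul = t_cumul + (MPI_Wtime() - t_start)
   END DO

   ! Average the cumulative time over the number of measured steps
   t_step_avg = t_cumul / DBLE(nsteps - 1)
   IF (rank == 0) PRINT '(A, ES12.4, A)', 'average step time: ', t_step_avg, ' s'

   CALL MPI_Finalize(ierr)
END PROGRAM step_timer_sketch
```

Dividing by `nsteps - 1` here corresponds to averaging over the measured steps only; exactly which step is excluded is a detail of the actual patch.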