
## Summary Version
4.0 (2021)

## Purpose of Benchmark
Provide the astrophysics community with information on the performance and scalability (weak and strong scaling) of the GADGET-4 code for three test cases on PRACE Tier-0 supercomputers (JUWELS, MareNostrum4, IRENE-SKL, and IRENE-KNL) and on the ARM64 Mont-Blanc 3 prototype (Dibona).
GADGET-4 supports collisionless simulations and smoothed particle hydrodynamics on massively parallel computers. All communication between concurrent execution processes is done either explicitly by means of the message passing interface (MPI), or implicitly through shared-memory accesses on processes on multi-core nodes. The code is mostly written in ISO C++ (assuming the C++11 standard), and should run on all parallel platforms that support at least MPI-3.

## Characteristics of Benchmark

GADGET-4 was compiled with optimisation level `-O3`. For a proper scalability analysis, the tests were carried out with one MPI task per core and 16 tasks per CPU, giving a total of 32 tasks (cores) dedicated to the calculations per compute node. One extra core per compute node was used to handle the MPI communications. The tests were carried out in two modes: with Intel-compiled and with GCC-compiled MPI and libraries (HDF5, GSL, FFTW3).

## Mechanics of Building Benchmark

Building the GADGET code requires a compiler with full C++11 support, MPI (e.g., MPICH, OpenMPI, IntelMPI), FFTW3, GSL, and HDF5. Hence, the corresponding environment modules must be loaded, e.g.,

module load OpenMPI/4.0.3 HDF5/1.10.6 FFTW/3.3.8 GSL/2.6
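
Before building, it can help to confirm that the loaded modules actually put the expected tools on the `PATH`. The following is a minimal sketch, not part of GADGET-4: `require_tools` is a hypothetical helper, and the tool names `mpicxx`, `h5cc`, and `gsl-config` are assumptions that may differ on your system (for example, Intel MPI ships differently named compiler wrappers).

```bash
#!/bin/bash
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 if all commands are found, 1 otherwise.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Assumed tool names: mpicxx (MPI C++ wrapper), h5cc (HDF5), gsl-config (GSL).
# Adjust them to whatever your modules provide.
if require_tools mpicxx h5cc gsl-config; then
    echo "toolchain looks complete"
else
    echo "load the missing modules before running make"
fi
```

The check only verifies that the commands exist; it does not verify versions or that the libraries were built with the same compiler.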

### Download the source code

Latest Release can be downloaded from [](
or get a cloned repository of the code by using 
git clone
Source code used in the benchmarks (version of June 22, 2021): [./gadget/4.0/gadget4-benchmarks.tar.gz](./gadget/4.0/gadget4-benchmarks.tar.gz)

### Build the Executable

##### General Building of the Executable
There are two files to obtain from the repository: `gadget4.tar.gz` and `example_ics.tar.gz`.

* `gadget4.tar.gz` contains the `src`, `examples`, `buildsystem`, and `documentation` folders, together with **Makefile** and **Makefile.systype** (or a template).
* `example_ics.tar.gz` contains the initial conditions needed by each of the examples. Untarring it creates a folder named `ExampleICs`. The example initial conditions can be downloaded from [./gadget/4.0/example_ics.tar.gz](./gadget/4.0/example_ics.tar.gz).

1. After decompressing gadget4.tar.gz, go to the master folder named `gadget4`. Two files may need modification: **Makefile.systype** and **Makefile**.
   * a) In **Makefile.systype**, select one of the system types by uncommenting the corresponding line, or add a line for your system, e.g., `SYSTYPE="XXX-BBB"`, where XXX = system name and BBB = whatever you may want to include here, e.g., impi, openmpi, etc.
   * b) If you uncommented a line corresponding to your system in **Makefile.systype**, nothing needs to change in the **Makefile**.
   * c) If you added a line, say `SYSTYPE="XXX-BBB"`, to **Makefile.systype**, you must also add the following lines to the **Makefile** under 'define available Systems':

         ifeq ($(SYSTYPE),"XXX-BBB")
         include buildsystem/Makefile.comp.XXX-BBB
         include buildsystem/Makefile.path.XXX-BBB
         endif

2. In the folder `buildsystem`, make sure that **Makefile.comp.XXX** and **Makefile.path.XXX** (XXX = cluster name) are set with the proper compilation options and paths, respectively. Either choose existing files or create new ones that reflect your system's compiler and paths.
3. The folder examples has several subfolders of test cases. From one of these subfolders, e.g., `CollidingGalaxiesSFR`, copy **** to the master folder.
4. In the master folder, compile the code with `make EXEC=gadget4-exe`, where EXEC is the name of the executable.
5. Create a folder named `Run_CollidingGalaxies`. Copy **gadget4-exe**, and the files **param.txt** and **TREECOOL** existing in the subfolder `CollidingGalaxiesSFR` to `Run_CollidingGalaxies`.
6. In the folder `Run_CollidingGalaxies`, modify **param.txt** so that it contains the correct path to the initial conditions file **ics_collision_g4.dat**, located in the folder `ExampleICs`, and set the memory per core to that of the system you are using.
7. Run the code using mpirun or submit a SLURM script.
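
The edits to **param.txt** described in step 6 can also be scripted. The sketch below assumes that the parameter file consists of whitespace-separated `Key value` lines and that the relevant keys are named `InitCondFile` and `MaxMemSize` (with the memory given in MByte). Both names are assumptions here and should be checked against your own **param.txt**. `set_param` is a hypothetical helper, not part of GADGET-4.

```bash
#!/bin/bash
# Hypothetical helper: replace the value of one "Key   value" line in a
# parameter file. Fails (and leaves the file untouched) if the key is absent.
set_param() {
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp) || return 1
    if awk -v k="$key" -v v="$value" '
            $1 == k { printf "%-24s %s\n", k, v; found = 1; next }
            { print }
            END { exit (found ? 0 : 2) }
        ' "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # overwrite in place, keeping file permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        echo "parameter not found: $key" >&2
        return 1
    fi
}

# Example usage (key names, path, and value are assumptions; adapt them):
# set_param param.txt InitCondFile ../ExampleICs/ics_collision_g4.dat
# set_param param.txt MaxMemSize   1800
```

Refusing to append unknown keys is a deliberate choice: a misspelled parameter name then produces an error instead of a silently ignored setting.
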
##### Building a Test Case Executable | Case A
1. Download and untar a test case tarball, e.g., `gadget4-case-A.tar.gz` (see below), and the source code used in the benchmarks, `gadget4-benchmarks.tar.gz`. The folder `gadget4-case-A` contains the files ics_collision_g4.dat, param.txt, and TREECOOL. The param.txt file contains the path to the initial conditions and was adapted for a system with 2.0 GB RAM per core (1.8 GB in effect).
2. Change to the folder named `gadget4-benchmarks` and adapt the files Makefile.systype and Makefile to your needs. Follow instructions 1a), 1b) or 1c) in Section "General Building of the Executable".
3. Compile the code using the file in gadget4-case-A
make CONFIG=../gadget4-case-A/ EXEC=../gadget4-case-A/gadget4-exe
4. Change to folder gadget4-case-A and make sure that the file param.txt has the correct memory size per core for the system you are using. 
5. Run the code directly with mpirun or submit a SLURM script.

## Mechanics of Running Benchmark
The general way to run the benchmarks, assuming the SLURM resource/batch manager, is:

1. Set the environment modules (see Build the Executable section)
2. In the folder of the test case, e.g., `gadget4-case-A`, adapt the SLURM script and submit it. The script has the form (for a run with 1024 cores):
#!/bin/bash -l 
#SBATCH --time=01:00:00
#SBATCH --job-name=collgal-01024
#SBATCH --output=g_collgal_%j.out
#SBATCH --error=g_collgal_%j.error
#SBATCH --nodes=32
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-socket=17
#SBATCH --ntasks-per-node=33
#SBATCH --exclusive
#SBATCH --partition=batch

echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running on $SLURM_NPROCS processors."
echo "Current working directory is `pwd`"

srun ./gadget4-exe param.txt
* gadget4-exe is the executable.
* param.txt is the input parameter file. 

##### NOTE 

GADGET-4 uses one core per compute node to handle communications, so an extra core must be taken into account when allocating compute nodes. To run the code with 16 MPI tasks per socket (32 per node), we must allocate 33 cores per compute node. For a run with 1024 cores on 32 nodes, we therefore allocate 32 × 33 = 1056 cores.
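
The allocation arithmetic in this note can be sketched as a small shell function. `gadget_alloc` is a hypothetical helper, not part of GADGET-4; it only reproduces the rule stated above (one extra rank per node for communication) and assumes that the number of compute ranks is a multiple of the compute ranks per node.

```bash
#!/bin/bash
# Hypothetical helper: derive the SLURM allocation for a given number of
# compute ranks, adding one communication rank per compute node.
gadget_alloc() {
    local compute_tasks=$1      # MPI ranks doing the calculations, e.g. 1024
    local compute_per_node=$2   # compute ranks per node, e.g. 32 (16 per socket)
    if (( compute_per_node <= 0 || compute_tasks % compute_per_node != 0 )); then
        echo "compute_tasks must be a multiple of compute_per_node" >&2
        return 1
    fi
    local nodes=$(( compute_tasks / compute_per_node ))
    local tasks_per_node=$(( compute_per_node + 1 ))   # +1 communication rank
    local total=$(( nodes * tasks_per_node ))
    echo "nodes=${nodes} ntasks-per-node=${tasks_per_node} total=${total}"
}

gadget_alloc 1024 32   # prints: nodes=32 ntasks-per-node=33 total=1056
```

The three printed values correspond to `--nodes`, `--ntasks-per-node`, and the total number of MPI ranks requested from SLURM.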

##### OUTPUT of a run with 1024 cores

Running on hosts: jwc03n[082-097,169-184]
Running on 32 nodes.
Running on 1056 processors.
Current working directory is Test-Case-A

Shared memory islands host a minimum of 33 and a maximum of 33 MPI ranks.
We shall use 32 MPI ranks in total for assisting one-sided communication (1 per shared memory node).

  ___    __    ____    ___  ____  ____       __
 / __)  /__\  (  _ \  / __)( ___)(_  _)___  /. |
( (_-. /(__)\  )(_) )( (_-. )__)   )( (___)(_  _)
 \___/(__)(__)(____/  \___/(____) (__)       (_)

This is Gadget, version 4.0.
Git commit 8ee7f358cf43a37955018f64404db191798a32a3, Tue Jun 15 15:10:36 2021 +0200

Code was compiled with the following compiler and flags:

Code was compiled with the following settings:

Running on 1024 MPI tasks.

## UEABS Benchmarks

**A) `Colliding galaxies with star formation`**

This simulation, with its setup in the folder `CollidingGalaxiesSFR`, considers the collision of two compound galaxies, each made up of a dark matter halo, a stellar disk and bulge, and cold gas in the disk that undergoes star formation. Radiative cooling due to helium and hydrogen is included. Star formation and feedback are modelled with a simple subgrid treatment.

[Download test Case A](./gadget/4.0/gadget4-caseA.tar.gz)

**B) `Cosmological DM-only simulation with IC creation`**

The setup in DM-L50-N128 simulates a small box of comoving side-length 50 Mpc/h using 128^3 dark matter particles. The initial conditions are created on the fly upon start-up of the code, using second order Lagrangian perturbation theory with a starting redshift of z=63. The LEAN option and 32-bit arithmetic are enabled to minimize memory consumption of the code.

Gravity is computed with the TreePM algorithm at expansion order p=3. Three output times are defined; for each of them, FOF group finding is enabled and a power spectrum is computed for the snapshot produced.
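
To get a feeling for the size of this test case, the particle count and the mean inter-particle spacing follow directly from the box size and the particle load quoted above. The snippet below is only a worked-arithmetic sketch; `box_stats` is a hypothetical helper and not part of GADGET-4.

```bash
#!/bin/bash
# Hypothetical helper: particle count and mean inter-particle spacing for an
# N^3 particle load in a periodic box of side length L (here in Mpc/h).
box_stats() {
    local box=$1 n=$2
    awk -v L="$box" -v n="$n" 'BEGIN {
        printf "particles=%d spacing=%.6f\n", n * n * n, L / n
    }'
}

box_stats 50 128   # prints: particles=2097152 spacing=0.390625
```

So Case B evolves about 2.1 million dark matter particles with a mean spacing of roughly 0.39 Mpc/h.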

[Download test Case B](./gadget/4.0/gadget4-caseB.tar.gz)

**C) `Adiabatic collapse of a gas sphere`**

This simulation in G2-gassphere considers the gravitational collapse of a self-gravitating sphere of gas which initially has a 1/r density profile and a very low temperature. The gas falls under its own weight to the centre, where it bounces back and a strong shock wave that moves outwards develops. The simulation uses Newtonian physics in a natural system of units (G=1).

[Download test Case C](./gadget/4.0/gadget4-caseC.tar.gz)

## Performance 
GADGET reports both the execution time and the performance in the log file.

**`Performance` in `ns/day` units:** `grep Performance logfile | awk -F ' ' '{print $2}'`

**`Execution Time` in `seconds`:** `grep Time: logfile | awk -F ' ' '{print $3}'`
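
Since the purpose of the benchmark is a scalability analysis, the execution times of two runs of the same test case can be turned into a strong-scaling speedup and parallel efficiency. The sketch below uses the usual definitions, speedup = T_ref / T and efficiency = speedup × N_ref / N. `scaling_efficiency` is a hypothetical helper, and the numbers in the usage line are synthetic, not measurements.

```bash
#!/bin/bash
# Hypothetical helper: strong-scaling speedup and parallel efficiency of a run
# on `cores` cores taking `time` seconds, relative to a reference run.
scaling_efficiency() {
    local ref_cores=$1 ref_time=$2 cores=$3 time=$4
    awk -v c0="$ref_cores" -v t0="$ref_time" -v c="$cores" -v t="$time" 'BEGIN {
        if (t <= 0 || c <= 0) exit 1          # reject non-positive inputs
        speedup = t0 / t
        efficiency = speedup * c0 / c
        printf "speedup=%.2f efficiency=%.2f\n", speedup, efficiency
    }' || { echo "cores and time must be positive" >&2; return 1; }
}

# Synthetic example: halving the time when doubling the cores is ideal scaling.
scaling_efficiency 32 100 64 50   # prints: speedup=2.00 efficiency=1.00
```

An efficiency close to 1 indicates near-ideal strong scaling; values well below 1 show where adding cores stops paying off.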