# GADGET

## Summary Version

4.0 (2021)

## Purpose of Benchmark

Provide the astrophysical community with information on the performance and scalability (weak and strong scaling) of the GADGET-4 code for three test cases on PRACE Tier-0 supercomputers (JUWELS, MareNostrum4, and IRENE-SKL).

## Characteristics of Benchmark

GADGET-4 was compiled as C++ with optimisation level O3, against MPI (e.g., OpenMPI, Intel MPI) and the libraries HDF5, GSL, and FFTW3. The tests were carried out in two modes: with the MPI API and libraries compiled with Intel compilers, and with GCC.

In order to study the scalability of the software, two approaches were considered:

1. A core-based performance analysis with 1 MPI task per core and 16 cores per socket, that is 16 MPI tasks per socket, plus 1 extra core per compute node to handle communications when multiple compute nodes were used. For runs on a single node (that is, with the number of cores varying between 1 and 32) no extra core was used.
2. A node-based performance analysis with 1 MPI task per core and all cores in the socket, that is 24 MPI tasks per socket, again including an extra core for MPI communications when multiple nodes are used. For runs on a single node there is no need for an extra core for communications.

In both setups the compute nodes were used exclusively. These approaches allow us to identify which setup provides the better performance for the GADGET-4 code; the sketch below illustrates the two node layouts.
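As a concrete illustration, on a two-socket node the two layouts translate into SLURM resource requests roughly as follows. This is a minimal sketch of the two alternative script headers (not one script); the figures follow from the task counts above, assuming two sockets per node, while account and partition details are system-specific and omitted.

```
# Core-based layout (multi-node runs): 2 x 16 MPI tasks + 1 communication core
#SBATCH --nodes=4
#SBATCH --ntasks=132             # 4 nodes x (32 + 1) cores
#SBATCH --tasks-per-node=33
#SBATCH --cpus-per-task=1
#SBATCH --exclusive

# Node-based layout (multi-node runs): 2 x 24 MPI tasks + 1 communication core
#SBATCH --nodes=4
#SBATCH --ntasks=196             # 4 nodes x (48 + 1) cores
#SBATCH --tasks-per-node=49
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
```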
## Mechanics of Building Benchmark

Building the GADGET-4 code requires a compiler with full C++11 support, MPI (e.g., MPICH, OpenMPI, Intel MPI), HDF5, GSL, and FFTW3. Hence, the corresponding environment modules must be loaded, e.g.,

```
module load OpenMPI/4.0.3 HDF5/1.10.6 FFTW/3.3.8 GSL/2.6
```

### Source Code and Initial Conditions

##### Source Code Release

The latest release of the code can be downloaded from [https://gitlab.mpcdf.mpg.de/vrs/gadget4](https://gitlab.mpcdf.mpg.de/vrs/gadget4), or clone the repository with

```
git clone http://gitlab.mpcdf.mpg.de/vrs/gadget4
```

##### In this UEABS repository you can find:

- The cloned version used in the benchmarks (version of June 22, 2021): [gadget4-benchmarks.tar.gz](./gadget/gadget4-benchmarks.tar.gz). This tarball includes the `src`, `examples`, `buildsystem`, and `documentation` folders. It also includes the **Makefile** and **Makefile.systype** (or a template) files.
- Example initial conditions: [example_ics.tar.gz](./gadget/example_ics.tar.gz). It includes the initial conditions for each of the examples. When untarred it generates a folder named `ExampleICs`.
- Test cases A and B: [gadget4-case-A.tar.gz](./gadget/gadget4-case-A.tar.gz) and [gadget4-case-B.tar.gz](./gadget/gadget4-case-B.tar.gz)

### Build the Executable

#### General Building of the Executable

1. Two files are needed from the repository: [gadget4.tar.gz](./gadget/gadget4.tar.gz) and [example_ics.tar.gz](./gadget/example_ics.tar.gz)

2. After decompressing gadget4.tar.gz, go to the master folder named `gadget4`. There are two files that need modification: **Makefile.systype** and **Makefile**.

   a) In **Makefile.systype** select one of the system types by uncommenting the corresponding line, or add a line for your system, e.g.,

   ```
   #SYSTYPE="XXX-BBB"
   ```

   where XXX = system name and BBB = whatever you may want to include here, e.g., impi, openmpi, etc.

   b) In case you uncommented a line corresponding to your system in **Makefile.systype**, there is nothing to do in the **Makefile**.

   c) In case you added a line, say SYSTYPE="XXX-BBB", to **Makefile.systype**, then you must modify the **Makefile** by adding the following lines in the 'define available Systems' section:

   ```
   ifeq ($(SYSTYPE),"XXX-BBB")
   include buildsystem/Makefile.comp.XXX-BBB
   include buildsystem/Makefile.path.XXX-BBB
   endif
   ```

3. In the folder `buildsystem` make sure you have **Makefile.comp.XXX-BBB** and **Makefile.path.XXX-BBB** set with the proper compilation options and paths, respectively (a sketch of both files is given after this list). Either choose existing files or create new ones that reflect your system's compiler and paths.

4. The folder `examples` has several subfolders with test cases. From one of these subfolders, e.g., `CollidingGalaxiesSFR`, copy **Config.sh** to the master folder.

5. In the master folder compile the code

   ```
   make CONFIG=Config.sh EXEC=gadget4-exe
   ```

   where EXEC is the name of the executable.

6. Create a folder named `Run_CollidingGalaxies`. Copy **gadget4-exe** and the files **param.txt** and **TREECOOL** from the subfolder `CollidingGalaxiesSFR` to `Run_CollidingGalaxies`.

7. In the folder `Run_CollidingGalaxies` modify **param.txt** to include the proper path to the initial conditions file **ics_collision_g4.dat** located in the folder `ExampleICs`, and set the memory per core to that of the system you are using.

8. Run the code using mpirun or submit a SLURM script.
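For step 3, the two buildsystem files might look as follows. This is a minimal sketch for a hypothetical GCC/OpenMPI setup: the variable names mirror the templates shipped in the `buildsystem` folder, but the flags and install paths shown here are placeholders, so check them against the existing files and replace them with your system's values.

```
# buildsystem/Makefile.comp.XXX-BBB -- compiler and compilation options
CC       = mpicc
CPP      = mpicxx
OPTIMIZE = -std=c++11 -O3 -Wall

# buildsystem/Makefile.path.XXX-BBB -- paths to the required libraries
GSL_INCL  = -I/opt/gsl/2.6/include
GSL_LIBS  = -L/opt/gsl/2.6/lib
FFTW_INCL = -I/opt/fftw/3.3.8/include
FFTW_LIBS = -L/opt/fftw/3.3.8/lib
HDF5_INCL = -I/opt/hdf5/1.10.6/include
HDF5_LIBS = -L/opt/hdf5/1.10.6/lib
```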
#### Building a Test Case Executable | Case A

1. Download and untar a test case tarball, e.g., [gadget4-case-A.tar.gz](./gadget/gadget4-case-A.tar.gz) (see below), and the source code used in the benchmarks, [gadget4-benchmarks.tar.gz](./gadget/gadget4-benchmarks.tar.gz). The folder `gadget4-case-A` has the **Config.sh**, **param.txt**, and **slurm_script.sh** files. The **param.txt** file has the path for the initial conditions and was adapted for a system with 2.0 GB RAM per core (in effect 1.8 GB).

2. Change to the folder named `gadget4-benchmarks` and adapt the files **Makefile.systype** and **Makefile** to your needs. Follow instructions 2a), 2b), or 2c) in section "General Building of the Executable".

3. Compile the code using the **Config.sh** file in `gadget4-case-A`

   ```
   make CONFIG=../gadget4-case-A/Config.sh EXEC=../gadget4-case-A/gadget4-exe
   ```

4. Change to the folder `gadget4-case-A` and make sure that the file **param.txt** has the correct memory size per core for the system you are using.

5. Run the code directly with mpirun or submit a SLURM script.

#### Building a Test Case Executable | Case B

1. Download and untar a test case tarball, e.g., [gadget4-case-B.tar.gz](./gadget/gadget4-case-B.tar.gz) (see below), and the source code used in the benchmarks, [gadget4-benchmarks.tar.gz](./gadget/gadget4-benchmarks.tar.gz). The folder `gadget4-case-B` has the **Config.sh**, **ics-blob-10m**, **param.txt**, and **slurm_script.sh** files. The **param.txt** file has the path for the initial conditions and was adapted for a system with 2.0 GB RAM per core (in effect 1.8 GB).

2. Change to the folder named `gadget4-benchmarks` and adapt the files **Makefile.systype** and **Makefile** to your needs. Follow instructions 2a), 2b), or 2c) in section "General Building of the Executable".

3. Compile the code using the **Config.sh** file in `gadget4-case-B`

   ```
   make CONFIG=../gadget4-case-B/Config.sh EXEC=../gadget4-case-B/gadget4-exe
   ```

4. Change to the folder `gadget4-case-B` and make sure that the file **param.txt** has the correct memory size per core for the system you are using.

5. Run the code directly with mpirun or submit a SLURM script.

### Mechanics of Running Benchmark

The general way to run the benchmarks, assuming the SLURM resource/batch manager, is:

1. Set the environment modules (see section "Build the Executable").

2. In the folder of the test case, e.g., `gadget4-case-A`, adapt the SLURM script and submit it

   ```
   sbatch slurm_script.sh
   ```

where slurm_script.sh has the form (for a run with 1024 cores):

```
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --account=XXXXX
#SBATCH --job-name=DM_L50-N512
#SBATCH --output=g_%j.out
#SBATCH --error=g_%j.error
#SBATCH --nodes=32
#SBATCH --ntasks=1056
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-socket=17
#SBATCH --tasks-per-node=33
#SBATCH --exclusive
#SBATCH --qos=prace
#SBATCH --workdir=.

echo
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running on $SLURM_NPROCS processors."
echo "Current working directory is `pwd`"
echo

srun ./gadget4-exe param.txt
```

where:

* gadget4-exe is the executable.
* param.txt is the input parameter file.

##### NOTE

GADGET-4 uses one core per compute node to handle communications. Hence, when allocating compute nodes we must take this extra core into account. So, if we want to run the code with 16 MPI tasks per socket, we must allocate 33 cores per compute node. For a run with 1024 cores on 32 nodes we therefore allocate 1056 cores. (A scripted version of this arithmetic is sketched after the sample output below.)

##### OUTPUT of a run with 1024 cores

```
Running on 32 nodes.
Running on 1056 processors.
Current working directory is XXXXXXXXX

Shared memory islands host a minimum of 33 and a maximum of 33 MPI ranks.
We shall use 32 MPI ranks in total for assisting one-sided communication (1 per shared memory node).

  ___    __    ____   ___   ____  ____        __
 / __)  /__\  (  _ \ / __) ( ___)(_  _) ___  /. |
( (_-. /(__)\  )(_) ( (_-.  )__)   )(  (___)(_  _)
 \___/(__)(__)(____/  \___/ (____) (__)       (_)

This is Gadget, version 4.0.
Git commit 8ee7f358cf43a37955018f64404db191798a32a3, Tue Jun 15 15:10:36 2021 +0200

...

Code was compiled with the following settings:
    ASMTH=2.0
    CREATE_GRID
    DOUBLEPRECISION=2
    FOF
    IDS_32BIT
    LEAN
    NGENIC=512
    NGENIC_2LPT
    NSOFTCLASSES=1
    NTYPES=2
    PERIODIC
    PMGRID=768
    POSITIONS_IN_32BIT
    POWERSPEC_ON_OUTPUT
    RANDOMIZE_DOMAINCENTER
    SELFGRAVITY
    TREEPM_NOTIMESPLIT

Running on 1024 MPI tasks.
```
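The allocation arithmetic in the note above can be scripted when preparing runs at different scales. The following is a minimal sketch, not part of the benchmark distribution: the helper name is hypothetical, and a fixed layout of 32 compute tasks per node (16 per socket) is assumed.

```
#!/bin/bash
# gadget4-alloc.sh (hypothetical helper) -- print the SLURM allocation for a
# desired number of GADGET-4 compute tasks, assuming 32 MPI tasks per node
# plus 1 extra communication core per node on multi-node runs.
TASKS=${1:?usage: gadget4-alloc.sh <number of compute tasks>}
PER_NODE=32

NODES=$(( (TASKS + PER_NODE - 1) / PER_NODE ))   # round up to whole nodes
if [ "$NODES" -gt 1 ]; then
    NTASKS=$(( TASKS + NODES ))                  # +1 communication core/node
    ALLOC_PER_NODE=$(( PER_NODE + 1 ))
else
    NTASKS=$TASKS                                # single node: no extra core
    ALLOC_PER_NODE=$PER_NODE
fi

echo "#SBATCH --nodes=$NODES"
echo "#SBATCH --ntasks=$NTASKS"
echo "#SBATCH --tasks-per-node=$ALLOC_PER_NODE"
```

For example, `./gadget4-alloc.sh 1024` reproduces the allocation of the script above: 32 nodes, 1056 tasks, and 33 tasks per node.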
### UEABS Benchmarks

**B) `Cosmological Dark Matter-only Simulation`**

This test case is a three-dimensional simulation of structure formation in the universe, in a small box of linear length 50 Mpc/h in each direction (pc denotes a parsec = 3.086×10^{16} m; Mpc = 10^6 pc; h denotes the Hubble constant), using 512^3 dark matter particles. The initial conditions are created on the fly after start-up of the simulation at redshift Z = 63, and the simulation evolves until redshift Z = 50. In order to minimise memory consumption, 32-bit arithmetic is used. Gravity is computed with the TreePM algorithm at expansion order p = 3. Three output times are defined, for which FOF group finding is enabled, and the code computes a power spectrum for each snapshot that is produced.

[Download test Case A](./gadget/gadget4-case-A.tar.gz)

**C) `Blob Test`**

The blob test simulates a spherical cloud (blob) placed in a wind tunnel in pressure equilibrium with the surrounding medium. The cloud has a temperature 10 times lower, and a density 10 times higher, than the surrounding medium. This setup allows hydrodynamical instabilities, e.g., Kelvin-Helmholtz and Rayleigh-Taylor, to develop at the cloud surface, leading to the break-up of the cloud over time. The cloud is set up with 1 million smoothed particle hydrodynamics (SPH) particles. A more sizeable test is done with 10 million particles.

[Download test Case B](./gadget/gadget4-case-B.tar.gz)

## Performance

GADGET reports both execution time and performance in the log file. They can be extracted with, e.g.:

**`Performance`: `grep Performance logfile | awk -F ' ' '{print $2}'`**

**`Execution Time` in `seconds`: `grep Time: logfile | awk -F ' ' '{print $3}'`**
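Since the purpose of the benchmark is a weak- and strong-scaling study, the extraction commands above are typically combined with a submission loop over core counts. The following is a minimal illustrative sketch, not part of the benchmark distribution; the chosen core counts, the 32-compute-tasks-per-node layout, and the log-file naming (g_<jobid>.out, as in the SLURM script above) are assumptions.

```
#!/bin/bash
# Strong-scaling sweep: submit the same test case at several core counts.
# sbatch command-line options override the #SBATCH directives in the script.
for CORES in 128 256 512 1024; do
    NODES=$(( CORES / 32 ))          # 32 compute tasks per node
    NTASKS=$(( CORES + NODES ))      # +1 communication core per node
    sbatch --nodes="$NODES" --ntasks="$NTASKS" \
           --ntasks-per-node=33 --job-name="gadget4-$CORES" slurm_script.sh
done

# After the jobs finish, collect the execution times, e.g.:
# for f in g_*.out; do echo "$f: $(grep Time: "$f" | awk '{print $3}')"; done
```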