# SPECFEM3D_GLOBE -- Benchmark README


## General description
The software package SPECFEM3D simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). All SPECFEM3D_GLOBE software is written in Fortran90 with full portability in mind, and conforms strictly to the Fortran95 standard. It uses no obsolete or obsolescent features of Fortran77. The package uses parallel programming based upon the Message Passing Interface (MPI).
The SEM was originally developed in computational fluid dynamics and has been successfully adapted to address problems in seismic wave propagation. It is a continuous Galerkin technique, which can easily be made discontinuous; it is then close to a particular case of the discontinuous Galerkin technique, with optimized efficiency because of its tensorized basis functions. In particular, it can accurately handle very distorted mesh elements. It has very good accuracy and convergence properties. The spectral element approach admits spectral rates of convergence and allows exploiting hp-convergence schemes. It is also very well suited to parallel implementation on very large supercomputers as well as on clusters of GPU-accelerated graphics cards. Tensor products inside each element can be optimized to reach very high efficiency, and mesh point and element numbering can be optimized to reduce processor cache misses and improve cache reuse. The SEM can also handle triangular (in 2D) or tetrahedral (in 3D) elements as well as mixed meshes, although with increased cost and reduced accuracy in these elements, as in the discontinuous Galerkin method.
In many geological models in the context of seismic wave propagation studies (except, for instance, in fault dynamic rupture studies, in which very high frequencies of supershear rupture need to be modeled near the fault), a continuous formulation is sufficient because material property contrasts are not drastic and thus conforming mesh doubling bricks can efficiently handle mesh size variations. This is particularly true at the scale of the full Earth. Effects due to lateral variations in compressional-wave speed, shear-wave speed, density, a 3D crustal model, ellipticity, topography and bathymetry, the oceans, rotation, and self-gravitation are included. The package can accommodate full 21-parameter anisotropy as well as lateral variations in attenuation. Adjoint capabilities and finite-frequency kernel simulations are also included.

* [Web site](http://geodynamics.org/cig/software/specfem3d_globe/)
* User manual: 
    - [PDF manual](https://geodynamics.org/cig/software/specfem3d_globe/gitbranch/devel/doc/USER_MANUAL/manual_SPECFEM3D_GLOBE.pdf)
    - [HTML manual](https://specfem3d-globe.readthedocs.io/en/latest/)
* [Code github](https://github.com/geodynamics/specfem3d_globe.git)
* [Getting started](https://specfem3d-globe.readthedocs.io/en/latest/02_getting_started/)
* Test Cases
    * Small validation test case: [small_benchmark_run_to_test_more_complex_Earth](https://github.com/geodynamics/specfem3d_globe/tree/master/EXAMPLES/small_benchmark_run_to_test_more_complex_Earth)
    * [Test Case A](https://repository.prace-ri.eu/git/UEABS/ueabs/tree/master/specfem3d/test_cases/SPECFEM3D_TestCaseA): up to around 1,000 x86 cores, or equivalent
    * [Test Case B](https://repository.prace-ri.eu/git/UEABS/ueabs/tree/master/specfem3d/test_cases/SPECFEM3D_TestCaseB): up to around 10,000 x86 cores, or equivalent 

## Purpose of Benchmark
The software package SPECFEM3D_GLOBE simulates three-dimensional global and regional seismic wave propagation and performs full waveform imaging (FWI) or adjoint tomography based upon the spectral-element method (SEM).
The test cases simulate the earthquake of June 1994 in Northern Bolivia at a global scale with the global shear-wave speed model named s362ani.
Test Case A is designed to run on systems with up to about 1,000 x86 cores (or equivalent), and Test Case B is designed to run on systems with up to about 10,000 x86 cores (or equivalent). Finally, the small validation test case called "small_benchmark_run_to_test_more_complex_Earth" is a native specfem3D_globe benchmark, used to validate the behavior of the code, that is designed to run on 24 MPI processes (1 node).

## Mechanics of Building Benchmark

Clone the specfem3d_globe repository:
```shell
git clone https://github.com/geodynamics/specfem3d_globe.git
```
**Use a stable and fixed version of specfem3D_globe**: instabilities have been observed on the master branch and on versions after the [October 31, 2017 commit](https://github.com/geodynamics/specfem3d_globe/commit/b1d6ba966496f269611eff8c2cf1f22bcdac2bd9).
```shell
git checkout b1d6ba966496f269611eff8c2cf1f22bcdac2bd9
```
If you have not already done so, clone the ueabs repository:
```shell
git clone https://repository.prace-ri.eu/git/UEABS/ueabs.git
```
In the specfem3d folder of this repository, you will find the test cases in the test_cases folder, as well as environment and submission script templates for several machines.

### Define the environment

**a.** You will need a Fortran compiler, a C compiler and an MPI library, and it is recommended that you explicitly specify the appropriate command names for your compilers in your .bashrc or .cshrc file (or directly in your submission file). To be exhaustive, here are the relevant variables to compile the code:

 - `LANG=C`
 - `FC`
 - `MPIFC`
 - `CC`
 - `MPICC`

**b.** To be able to run on GPUs, you must define the CUDA environment by setting the following two variables: 
 - `CUDA_LIB`
 - `CUDA_INC`

An example (compiling for GPUs) on the Ouessant cluster at IDRIS, France:

```shell
LANG=C

module purge
module load pgi cuda ompi

export FC=`which pgfortran`
export MPIFC=`which mpif90`
export CC=`which pgcc`
export MPICC=`which mpicc`
export CUDA_LIB="$CUDAROOT/lib64"
export CUDA_INC="$CUDAROOT/include"
```
You will find in the specfem3d folder of this repository a folder named env, with files named env_x that give examples of the environments used on several supercomputers during the last benchmark campaign.
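For example, to load one of these environments before compiling (the file name below is hypothetical; pick the env_x file matching your machine):

```shell
# hypothetical file name: one env_<machine> file per system lives in ueabs/specfem3d/env
source ueabs/specfem3d/env/env_occigen
```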

**c.** To define the optimizations specific to the target architecture, you will need the environment variables FCFLAGS and CFLAGS. 
Here is an example of optimization options for efficient compilation on an architecture 
with 512-bit Advanced Vector Extensions (AVX-512) SIMD instructions:
```shell
FCFLAGS="-O3 -qopenmp -xhost -qopt-zmm-usage=high -ipo -fp-model fast=2 -mcmodel=large -DUSE_FP32 -DOPT_STREAMS "
CFLAGS="-O3 -qopenmp -xhost -qopt-zmm-usage=high -ipo "
```

### Configuration step
To configure specfem3d_globe, use the configure script. This script assumes that 
you will compile the code on the same kind of hardware as the machine on which 
you will run it. **As arrays are statically declared, you will need to compile specfem 
once for each test case with the right `Par_file`**, which is the parameter file of specfem3D.
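As a minimal sketch of that per-test-case cycle (run from inside the specfem3d_globe folder, with the ueabs clone alongside it; the configure options discussed below can be appended):

```shell
# copy the parameter file of the chosen test case (here: Test Case A) into DATA/
cp ../ueabs/specfem3d/test_cases/SPECFEM3D_TestCaseA/Par_file DATA/Par_file
# reconfigure and rebuild so the static array sizes match this Par_file
./configure --prefix=$PWD
make clean && make all
```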

To use the **shared memory parallel programming** model of specfem3D, we will 
specify the `--enable-openmp` configure option.

**On GPU platforms** you will need to add the following arguments to configure, 
`--build=ppc64 --with-cuda=cuda5`, and you will need to set `GPU_MODE = .true.` 
in the parameter file `Par_file`.
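A hedged sketch of this GPU setup (the sed pattern assumes GPU_MODE appears at the start of a line in your `Par_file`; check and adjust):

```shell
# CUDA_LIB and CUDA_INC are assumed to be exported (see the environment section above)
./configure --prefix=$PWD --build=ppc64 --with-cuda=cuda5
# assumption: Par_file contains a line starting with "GPU_MODE"
sed -i "s/^GPU_MODE.*/GPU_MODE = .true./" DATA/Par_file
```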

In some environments, depending on the MPI configuration, you will need to replace 
the `use mpi` statement with `include 'mpif.h'`; to do so, use the script and procedure commented in the block below.
Victor's avatar
Victor committed

```shell
### replace `use mpi` if needed ###
# cd utils
# perl replace_use_mpi_with_include_mpif_dot_h.pl
# cd ..
####################################

./configure --prefix=$PWD
```

**On Xeon Phi**, since support is recent, you should replace the values of the following 
variables in the generated Makefile:

```Makefile
FCFLAGS = -g -O3 -qopenmp -xMIC-AVX512 -DUSE_FP32 -DOPT_STREAMS -align array64byte  -fp-model fast=2 -traceback -mcmodel=large
FCFLAGS_f90 = -mod ./obj -I./obj -I.  -I. -I${SETUP} -xMIC-AVX512
CPPFLAGS = -I${SETUP}  -DFORCE_VECTORIZATION  -xMIC-AVX512
```
Note: be careful, on most machines the login node does not have the same instruction set as the compute nodes, so in order to compile with the right instruction set you will have to compile on a compute node or cross-compile.
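Under Slurm, one possible way to get a shell on a compute node for the build is sketched below (partition names and site policies vary):

```shell
# request an interactive shell on one compute node, then build there
srun --nodes=1 --time=00:30:00 --pty bash
make clean && make all
```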

Finally compile with make:
```shell
make clean
make all
```

**-> You will find in the specfem3d folder of the ueabs repository the file `compile.sh`, which is a compilation script template for several machines (different architectures: KNL, SKL, Haswell and GPU).**

## Mechanics of running
Input for the mesher (and the solver) is provided through the parameter file Par_file, which resides in the subdirectory DATA. Before running the mesher, a number of parameters need to be set in the Par_file. 
The solver calculates seismograms for 129 stations. Depending on the architecture and compilation options, the simulations run in a few minutes for Test Case A, in about ten minutes for Test Case B, and in about an hour for the small validation test case.

The different test cases correspond to different meshes of the Earth. The size of the mesh is determined by a combination of the following variables: NCHUNKS, the number of chunks in the cubed sphere (6 for global simulations); NPROC_XI, the number of processors or slices along one chunk of the cubed sphere; and NEX_XI, the number of spectral elements along one side of a chunk in the cubed sphere. These three variables give us the number of degrees of freedom of the mesh and determine the amount of memory needed per core. The SPECFEM3D **solver must be recompiled each time we change the mesh size**, because the solver uses static loop sizes: the compilers know the size of all loops only at compilation time and can therefore optimize them efficiently. The MPI task counts in the list below follow directly from these variables (see the sanity check after the list).
 - "small_benchmark_run_to_test_more_complex_Earth" runs with `24 MPI` tasks using only MPI parallelization and has the following mesh characteristics:  NCHUNKS=6, NPROC_XI=2 and NEX_XI=80.
 - Test case A runs with `96 MPI` tasks using hybrid parallelization (MPI+OpenMP or MPI+OpenMP+Cuda depending on the system tested) and has the following mesh characteristics: NCHUNKS=6, NPROC_XI=4 and NEX_XI=384.
 - Test Case B runs with `1536 MPI` tasks using hybrid parallelization and has the following mesh characteristics: NCHUNKS=6, NPROC_XI=16 and NEX_XI=384. 
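As a sanity check, these MPI task counts follow directly from the mesh parameters (assuming NPROC_ETA = NPROC_XI, which matches the counts above):

```shell
# total MPI tasks = NCHUNKS x NPROC_XI x NPROC_ETA, with NPROC_ETA = NPROC_XI:
#   small_benchmark: 6 x 2  x 2  =   24
#   Test Case A:     6 x 4  x 4  =   96
#   Test Case B:     6 x 16 x 16 = 1536
```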

### small_benchmark_run_to_test_more_complex_Earth
For the small validation test case called "small_benchmark_run_to_test_more_complex_Earth", just go to the EXAMPLES/small_benchmark_run_to_test_more_complex_Earth folder of specfem3d_globe and edit the run_this_example.sh file by adding the configure phase at line 38 (compilation is included in the script), then launch this script on a compute node (or encapsulate it in a submission script, cf. https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/specfem3d/job_script/job_occigen_small_benchmark_run_to_test_more_complex_Earth.slurm):
```shell
ssh $compute_node_name
cd $HOME/specfem3d_globe/EXAMPLES/small_benchmark_run_to_test_more_complex_Earth/
sed -i "38a ./configure " run_this_example.sh
time ./run_this_example.sh
```

### Test cases A and B
Once the parameter file is correctly defined, to run the test cases, copy the `Par_file`, 
`STATIONS` and `CMTSOLUTION` files defining one of the two test cases (A or B, cf. https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/specfem3d/test_cases) 
into the SPECFEM3D_GLOBE/DATA directory. Then use the xmeshfem3D binary (located 
in the bin directory) to mesh the domain, and xspecfem3D to solve the problem, using 
the appropriate command to run parallel jobs (srun, ccc_mprun, mpirun…).

```shell
srun bin/xmeshfem3D
srun bin/xspecfem3D
```

You can use, or be inspired by, the submission script templates in the job_script folder, using the appropriate job submission command (a minimal Slurm sketch follows the list below):
- qsub for PBS jobs,
- sbatch for Slurm jobs,
- ccc_msub for Irene jobs (wrapper),
- llsubmit for LoadLeveler jobs.
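As a minimal Slurm sketch (the task count assumes Test Case A with pure-MPI placement; directives, partition and module loads will differ per site):

```shell
#!/bin/bash
#SBATCH --job-name=specfem_testA
#SBATCH --ntasks=96              # Test Case A runs on 96 MPI tasks
#SBATCH --time=01:00:00

# load your compiler/MPI environment, e.g. one of the env_<machine> files
# source ueabs/specfem3d/env/env_<machine>

time srun bin/xmeshfem3D
time srun bin/xspecfem3D
```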

To validate the quality of the compilation and the functioning of the code on your machine, we will use the small validation test case (warning, **Python 2.7 and numpy are required**). 
You must first run the small validation test case and have results in the OUTPUT_FILES folder (i.e. the 387 seismograms produced).
```shell
ls specfem3d_globe/EXAMPLES/small_benchmark_run_to_test_more_complex_Earth/OUTPUT_FILES/*.ascii | wc -l
387
```

Then uncompress the reference seismograms in the directory OUTPUT_FILES_reference_OK using bunzip2:
```shell
cd specfem3d_globe/EXAMPLES/small_benchmark_run_to_test_more_complex_Earth/OUTPUT_FILES_reference_OK/
bunzip2 *.bz2
```
Then we will use a specfem3d_globe Python utility named `compare_seismogram_correlations.py`, located in the utils folder; this utility takes as arguments the path of the output folder (OUTPUT_FILES) and the path of the reference output folder (OUTPUT_FILES_reference_OK).

```shell
module load intel/17.0 python/2.7.13
# compares seismograms by plotting correlations
./utils/compare_seismogram_correlations.py EXAMPLES/small_benchmark_run_to_test_more_complex_Earth/OUTPUT_FILES/ EXAMPLES/small_benchmark_run_to_test_more_complex_Earth/OUTPUT_FILES_reference_OK/
```

If you get the messages "no poor correlations found", "no poor matches found" and "no significant time shifts found" at the end of the output, 
you have successfully validated that the code works correctly on your machine:

```shell
$ grep -e "no poor"  -e "no significant time shifts" *.out
              no poor correlations found
              no poor matches found
              no significant time shifts found
```

## Gather the results
The relevant metric for this benchmark is the time taken by the solver. 
It is recommended to put the `time` command in front of the solver's MPI launcher to harmonize the results. 
Under Slurm the metric is easy to gather, as each `mpirun` or `srun` is interpreted as a job step that is already 
timed, so the command line `sacct -j <job_id>` allows you to retrieve it. 
The output of the mesher (output_mesher.txt) and of the solver (output_solver.txt) 
can be found in the OUTPUT_FILES directory. These files contain physical values 
and timing values that are more accurate than those collected by Slurm.
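A hedged sketch of both approaches (the grep pattern is an assumption; check the exact wording in your output_solver.txt):

```shell
# per-step elapsed times recorded by Slurm (each srun launch is a job step)
sacct -j <job_id> --format=JobID,JobName,Elapsed
# timing reported by the solver itself
grep -i "elapsed time" OUTPUT_FILES/output_solver.txt
```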


## Reference results
Below are some benchmark results from the PRACE project implementation phase 5, which are displayed here to give an idea of performance on different systems.

|Application|Problem    |Size: number|Size: unit|Performance: metric|Performance: unit|Hazel Hen|Irene - KNL|Irene - SKL|JUWELS|Marconi - KNL|MareNostrum 4|Piz Daint - P100|DAVIDE|Frioul|SDV|Dibona|
|-----------|-----------|------------|----------|-------------------|-----------------|---------|-----------|-----------|------|-------------|-------------|----------------|------|------|---|------|
|           |           |            |          |                   |                 |24 cores |68 cores   |48 cores   |48 cores|68 cores   |48 cores     |68 cores        |240 cores|68 cores|64 cores|64 cores|
|SPECFEM3D  |Test Case A|24          |nodes     |time               |s                |2389.00  |1639.00    |734.00     |658.00|1653.00      |744.00       |195.00          |      |1963.50|   |3921.22|
|SPECFEM3D  |Test Case B|384         |nodes     |time               |s                |         |330.00     |169.00     |193.00|1211.00      |156.00       |50.00           |      |       |   |       |