# README - Halo Exchange Example


## Description

Domain decomposition is a widely used work distribution strategy for many HPC applications (such as CFD codes, cellular automata, etc.). In many cases, the individual sub-domains logically overlap along their boundaries, and mesh cells in these overlapping regions (called *halo cells* or *ghost cells*) need to be updated iteratively with data from neighboring processes - a communication pattern which we refer to as **halo-exchange**.

This code sample demonstrates how to implement halo-exchange for structured or unstructured grids using advanced MPI features, in particular:

 * How to use MPI-3's **distributed graph topology** and **neighborhood collectives** (a.k.a. sparse collectives), i.e. `MPI_Dist_graph_create_adjacent` and `MPI_Neighbor_alltoallw` (see the sketch below).
 * How to use MPI-Datatypes to **send and receive non-contiguous data** directly, avoiding send and receive buffer packing and unpacking.
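
The following minimal sketch shows how these two features fit together; it is illustrative only and not taken from the sample - the function names, the buffer layout and the origin of the neighbor list are assumptions. A graph communicator is created from the neighbor ranks determined by the decomposition, and each rank then exchanges one integer per neighbor with a single neighborhood collective call:

    /* Minimal sketch (not the sample's actual code): create a distributed
     * graph communicator and exchange one int per neighbor. */
    #include <mpi.h>
    #include <stdlib.h>

    /* 'nneighbors' and 'neighbors' are assumed to come from the mesh
     * decomposition (hypothetical inputs). */
    MPI_Comm create_graph_comm(MPI_Comm comm, int nneighbors, const int *neighbors)
    {
        MPI_Comm graph_comm;
        /* Sources and destinations are identical for a symmetric halo exchange. */
        MPI_Dist_graph_create_adjacent(comm,
            nneighbors, neighbors, MPI_UNWEIGHTED,   /* incoming edges */
            nneighbors, neighbors, MPI_UNWEIGHTED,   /* outgoing edges */
            MPI_INFO_NULL, 1 /* allow reordering */, &graph_comm);
        return graph_comm;
    }

    void exchange_one_int_per_neighbor(MPI_Comm graph_comm, int nneighbors,
                                       const int *sendbuf, int *recvbuf)
    {
        int *counts = malloc(nneighbors * sizeof(int));
        MPI_Aint *displs = malloc(nneighbors * sizeof(MPI_Aint));
        MPI_Datatype *types = malloc(nneighbors * sizeof(MPI_Datatype));
        for (int i = 0; i < nneighbors; i++) {
            counts[i] = 1;                            /* one int per neighbor */
            displs[i] = (MPI_Aint)(i * sizeof(int));  /* byte displacements   */
            types[i]  = MPI_INT;
        }
        /* The real sample passes derived datatypes here so that non-contiguous
         * halo cells are sent and received without packing. */
        MPI_Neighbor_alltoallw(sendbuf, counts, displs, types,
                               recvbuf, counts, displs, types, graph_comm);
        free(counts); free(displs); free(types);
    }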

For the sake of simplicity, this code sample does not deal with loading and managing any actual mesh data structure. Instead, it attempts to mimic the typical communication characteristics (i.e. the neighborhood relationships and message size variations between neighbors) of halo-exchange on a 3D unstructured mesh. For this purpose, a simple cube with edge length *E* is used as the global "mesh" domain, consisting of *E*³ regular hexahedral cells. A randomized, iterative algorithm is used to decompose this cube into irregularly aligned, box-shaped sub-domains.
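
The box handling can be pictured roughly as in the following sketch (hypothetical names, not the sample's actual `box.h` interface): growing a sub-domain box by the halo width and intersecting it with a neighboring sub-domain yields the overlap region whose cells have to be exchanged.

    /* Hypothetical illustration (not the sample's actual box.h): an axis-aligned
     * box and the intersection used to determine halo overlap regions. */
    typedef struct {
        int min[3];  /* inclusive lower corner (x, y, z) */
        int max[3];  /* exclusive upper corner (x, y, z) */
    } Box;

    /* Intersect two boxes; returns 1 and writes the overlap to 'out' if they
     * overlap, 0 otherwise. */
    static int box_intersect(const Box *a, const Box *b, Box *out)
    {
        for (int d = 0; d < 3; d++) {
            out->min[d] = a->min[d] > b->min[d] ? a->min[d] : b->min[d];
            out->max[d] = a->max[d] < b->max[d] ? a->max[d] : b->max[d];
            if (out->min[d] >= out->max[d])
                return 0;  /* empty in this dimension -> no overlap */
        }
        return 1;
    }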

No actual computation is performed; only halo-exchange is carried out a configurable number of times (`-i [N]`), and the exchanged information is validated once at the end of the program.

The code sample is structured as follows:

 * `box.c`, `box.h`: Simple data structure for box-shaped "mesh" sub-domains and functions for decomposition, intersection, etc.
 * `configuration.c`, `configuration.h`: Command-line parsing and basic logging facilities.
 * `field.c`, `field.h`: Data structure for mesh-associated data; merely an array of integers in this sample.
 * `main.c`: The main program.
 * `mesh.c`, `mesh.h`: Stub implementation of a mesh data-structure; provides neighborhood topology information for communication.
 * `mpicomm.c`, `mpicomm.h`: **Probably the most interesting part**, implementing the core message-passing functionality:
   * `mpi_create_graph_communicator`: Creates an MPI Graph topology communicator with `MPI_Dist_graph_create_adjacent`.
   * `mpi_halo_exchange_int_sparse_collective`: Halo exchange with `MPI_Neighbor_alltoallw`.
   * `mpi_halo_exchange_int_collective`: Halo exchange with `MPI_Alltoallw`.
   * `mpi_halo_exchange_int_p2p_default`: Halo exchange with "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
   * `mpi_halo_exchange_int_p2p_synchronous`: Halo exchange with *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
   * `mpi_halo_exchange_int_p2p_ready`: Halo exchange with *ready send*, i.e. `MPI_Irecv` / `MPI_Barrier` / `MPI_Irsend`.
 * `mpitypes.c`, `mpitypes.h`: Code for initialization of custom MPI-Datatypes.
   * `mpitype_indexed_int`: Creates an MPI-Datatype for the transfer of non-contiguous halo data (see the sketch below).
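
A minimal sketch of how such an indexed datatype might be built (illustrative only; the names and the exact construction in `mpitypes.c` may differ):

    /* Minimal sketch (not the sample's actual mpitypes.c): build an indexed
     * datatype that addresses scattered halo cells of an int field directly,
     * so no packing into a contiguous buffer is needed. */
    #include <mpi.h>
    #include <stdlib.h>

    /* 'cell_indices' would come from the box overlap with one neighbor
     * (hypothetical input). */
    MPI_Datatype make_halo_type(int ncells, const int *cell_indices)
    {
        MPI_Datatype halo_type;
        int *blocklengths = malloc(ncells * sizeof(int));
        for (int i = 0; i < ncells; i++)
            blocklengths[i] = 1;  /* one int per halo cell */
        MPI_Type_indexed(ncells, blocklengths, cell_indices, MPI_INT, &halo_type);
        MPI_Type_commit(&halo_type);
        free(blocklengths);
        return halo_type;
    }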


## Release Date

2016-01-18


## Version History

 * 2016-01-18: Initial Release on PRACE CodeVault repository


## Contributors

 * Thomas Ponweiser - [thomas.ponweiser@risc-software.at](mailto:thomas.ponweiser@risc-software.at)


## Copyright

This code is available under Apache License, Version 2.0 - see also the license file in the CodeVault root directory.


## Languages

This sample is entirely written in C.


## Parallelisation

This sample uses MPI-3 for parallelisation.


## Level of the code sample complexity

Intermediate / Advanced


## Compiling

Follow the compilation instructions given in the main directory of the kernel samples (`/hpc_kernel_samples/README.md`).


## Running

To run the program, use something similar to

    mpirun -n [nprocs] ./8_unstructured_haloex

either on the command line or in your batch script.


### Command line arguments

 * `-v [0-3]`: Specify the output verbosity level - 0: OFF; 1: INFO (Default); 2: DEBUG; 3: TRACE.
 * `-g [rank]`: Debug the MPI process with the specified rank. Enables debug output for this rank (otherwise only output of rank 0 is written) and, if compiled with `CFLAGS="-g -DDEBUG_ATTACH"`, enables a waiting loop for this rank which allows a debugger to be attached.
 * `-n [ncells-per-proc]`: Approximate average number of mesh cells per processor; Default: 16k (= 16 * 1024).
 * `-N [ncells-total]`: Approximate total number of mesh cells (a nearby cubic number will be chosen).
 * `-e [edge-length]`: Edge length of cube (mesh domain).
 * `-w [halo-width]`: Halo width (in number of cells); Default: 1.
 * `-i [iterations]`: Number of iterations for halo-exchange; Default: 100.
 * Selecting halo-exchange mode:
   * `--graph` (Default): Use MPI Graph topology and neighborhood collectives. If supported, this allows MPI to reorder the processes in order to choose a good embedding of the virtual topology onto the physical machine.
   * `--collective`: Use `MPI_Alltoallw`.
   * `--p2p`: Use "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
   * `--p2p-sync`: Use *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
   * `--p2p-ready`: Use *ready send*, i.e. `MPI_Irecv` / `MPI_Barrier` / `MPI_Irsend` (see the sketch below).
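
As an illustration of the less common ready-send mode, a minimal sketch might look as follows (not the sample's actual `mpicomm.c`; the names and argument layout are assumptions). The barrier guarantees that every matching receive has been posted before any `MPI_Irsend` is issued:

    /* Minimal sketch of a ready-send halo exchange (hypothetical names):
     * all receives are posted first, the barrier ensures this globally,
     * and only then are the ready sends started. */
    #include <mpi.h>
    #include <stdlib.h>

    void halo_exchange_ready(MPI_Comm comm, int nneighbors, const int *neighbors,
                             int *field, const MPI_Datatype *recv_types,
                             const MPI_Datatype *send_types)
    {
        MPI_Request *reqs = malloc(2 * nneighbors * sizeof(MPI_Request));

        /* 1. Post all receives; the derived datatypes place incoming data
         *    directly into the (non-contiguous) halo cells of 'field'. */
        for (int i = 0; i < nneighbors; i++)
            MPI_Irecv(field, 1, recv_types[i], neighbors[i], 0, comm, &reqs[i]);

        /* 2. Make sure every rank has posted its receives. */
        MPI_Barrier(comm);

        /* 3. Ready sends are only legal once the matching receive is posted. */
        for (int i = 0; i < nneighbors; i++)
            MPI_Irsend(field, 1, send_types[i], neighbors[i], 0, comm,
                       &reqs[nneighbors + i]);

        MPI_Waitall(2 * nneighbors, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }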

For large numbers as arguments to the options `-i`, `-n` or `-N`, the suffixes 'k' or 'M' may be used. For example, `-n 16k` specifies approximately 16 * 1024 mesh cells per processor; `-N 1M` specifies approximately 1024 * 1024 (~1 million) mesh cells in total.
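
The suffixes could be interpreted along the lines of the following sketch (illustrative only; the actual parsing lives in `configuration.c` and may differ):

    /* Hypothetical sketch of the 'k'/'M' suffix handling (the real code in
     * configuration.c may differ). */
    #include <stdlib.h>

    long parse_count(const char *arg)
    {
        char *end;
        long value = strtol(arg, &end, 10);
        if (*end == 'k')
            value *= 1024;
        else if (*end == 'M')
            value *= 1024 * 1024;
        return value;
    }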


### Example

If you run

    mpirun -n 16 ./8_unstructured_haloex -v 2

the output should look similar to

    Configuration:
     * Verbosity level:         DEBUG (2)
     * Mesh domain (cube):      x: [    0,   64); y: [    0,   64); z: [    0,   64); cells: 262144
     * Halo transfer mode:      Sparse collective - MPI_Neighbor_alltoallw
     * Number of iterations:    100

    Cells per processor (min-max): 10912 - 22599

    Examining neighborhood topology for MPI Graph communicator creation...

    Found 4 neighbors for rank 0:
     1 4 5 6

    Creating MPI Graph communicator...

    INFO: MPI reordered ranks: NO

    Setting up index mappings and MPI types...

    Adjacency matrix:
        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0     X        X  X  X
     1  X     X  X  X  X  X  X        X  X  X        X
     2     X        X  X           X     X
     3     X                          X  X  X
     4  X  X  X        X
     5  X  X  X     X     X        X     X           X
     6  X  X           X     X        X        X  X  X
     7     X              X           X        X
     8                             X     X  X     X  X
     9        X        X        X        X           X
    10     X     X        X  X              X  X  X  X
    11     X  X  X     X        X  X        X        X
    12     X     X              X     X  X        X  X
    13                    X  X        X           X  X
    14                    X     X     X     X  X     X
    15     X           X  X     X  X  X  X  X  X  X

    Number of adjacent cells per neighbor (min-max): 31 - 1122

    Exchanging halo information (100 iterations)...

    Validating...

    Validation successful.

## Benchmarks
### Communication mode comparison on SuperMUC Phase 1 (Sandy Bridge)

Comparison of the different communication modes on SuperMUC with approximately 1M mesh cells per process (`-n 1M`) and varying numbers of cores.

Hardware: SuperMUC Phase 1 thin nodes; Intel Xeon E5-2680 (Sandy Bridge-EP), 8 cores per socket @ 2.7 GHz, 16 cores per node.

Command line:

    mpiexec -n {ranks} ./8_unstructured_haloex {flags} -v1 -n 1M -i1000
where `{ranks}` is the number of cores and `{flags}` is one of `--collective`, `--p2p`, `--p2p-sync`, `--p2p-ready`.
As SuperMUC Phase 1 thin nodes are dual-socket systems with 8 cores per socket, 16 tasks per node are used.

![chart](hpc_kernel_samples/unstructured_grids/halo_exchange/benchmarks/SupermucChart.PNG)


## Known issues

### OpenMPI issue #1304

There is a [known issue for OpenMPI](https://github.com/open-mpi/ompi/issues/1304) when an MPI Datatype is marked for deallocation (with `MPI_Type_free`) while still in use by non-blocking collective operations. If you are using OpenMPI and get a segmentation fault in `MPI_Bcast`, try to recompile with:

    make CFLAGS="-DOMPI_BUG_1304" clean all

This just disables two critical calls to `MPI_Type_free`.
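
The effect of the flag can be pictured roughly as follows (illustrative only; the function and variable names are assumptions, the actual guarded calls are in the sample's source):

    /* Illustrative only (hypothetical names): with -DOMPI_BUG_1304 the early
     * MPI_Type_free calls are skipped, so the datatypes remain valid while
     * non-blocking collectives may still reference them. */
    #include <mpi.h>

    static void release_halo_types(int nneighbors, MPI_Datatype *types)
    {
    #ifndef OMPI_BUG_1304
        for (int i = 0; i < nneighbors; i++)
            MPI_Type_free(&types[i]);
    #else
        (void)nneighbors; (void)types;  /* work around OpenMPI issue #1304 */
    #endif
    }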