Newer
Older
Thomas Ponweiser
committed
# README - Halo Exchange Example
## Description
Domain decomposition is a widely used work distribution strategy for many HPC applications (such as CFD codes, cellular automata, etc.). In many cases, the individual sub-domains logically overlap along their boundaries and mesh cells in these overlapping regions (called *halo cells* or *ghost cells*) need to be iteratively updated with data from neighboring processes; a communication pattern which we refer to as **halo-exchange**.
This code sample demonstrates how to implement halo-exchange for structured or unstructured grids using advanced MPI features, in particular:
* How to use MPI-3's **distributed graph topology** with and **neighborhood collectives** (a.k.a. sparse collectives), i.e. `MPI_Dist_graph_create_adjacent` and `MPI_Neighbor_alltoallw`
* How to use MPI-Datatypes to **send and receive non-contiguous data** directly, avoiding send and receive buffer packing and unpacking.
Thomas Ponweiser
committed
For the sake of simplicity, this code sample does not deal with loading and managing any actual mesh data structure. It rather attempts to mimic the typical communication characteristics (i.e. the neighborhood relationships and message size variations between neighbors) for halo-exchange on a 3D unstructured mesh. For this purpose, a simple cube with edge-length *E* is used as global "mesh" domain, consisting of *E*³ regular hexahedral cells. A randomized, iterative algorithm is used for decomposing this cube into irregularly aligned box-shaped sub-domains.
Thomas Ponweiser
committed
Thomas Ponweiser
committed
Moreover, no actual computation is performed. Only halo-exchange takes place for a configurable number of times (`-i [N]`) and the exchanged information is validated once at the end of the program.
Thomas Ponweiser
committed
The code sample is structured as follows:
Thomas Ponweiser
committed
* `box.c`, `box.h`: Simple data structure for box-shaped "mesh" sub-domains and functions for decomposition, intersection, etc.
Thomas Ponweiser
committed
* `configuration.c`, `configuration.h`: Command-line parsing and basic logging facilities.
* `field.c`, `field.h`: Data structure for mesh-associated data; merely an array of integers in this sample.
* `main.c`: The main program.
* `mesh.c`, `mesh.h`: Stub implementation of a mesh data-structure; provides neighborhood topology information for communication.
* `mpicomm.c`, `mpicomm.h`: **Probably the most interesting part**, implementing the core message-passing functionality:
* `mpi_create_graph_communicator`: Creates an MPI Graph topology communicator with `MPI_Dist_graph_create_adjacent`.
* `mpi_halo_exchange_int_sparse_collective`: Halo exchange with `MPI_Neighbor_alltoallw`.
* `mpi_halo_exchange_int_collective`: Halo exchange with `MPI_Alltoallw`.
* `mpi_halo_exchange_int_p2p_default`: Halo exchange with "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
* `mpi_halo_exchange_int_p2p_synchronous`: Halo exchange with *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
* `mpi_halo_exchange_int_p2p_ready`: Halo exchange with *ready send*, i.e. `MPI_IRecv` / `MPI_Barrier` / `MPI_Irsend`
* `mpitypes.c`, `mpitypes.h`: Code for initialization of custom MPI-Datatypes.
Thomas Ponweiser
committed
* `mpitype_indexed_int`: Creates MPI-Datatype for exchanging transfer of non-contiguous halo data.
Thomas Ponweiser
committed
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
## Release Date
2016-01-18
## Version History
* 2016-01-18: Initial Release on PRACE CodeVault repository
## Contributors
* Thomas Ponweiser - [thomas.ponweiser@risc-software.at](mailto:thomas.ponweiser@risc-software.at)
## Copyright
This code is available under Apache License, Version 2.0 - see also the license file in the CodeVault root directory.
## Languages
This sample is entirely written in C.
## Parallelisation
This sample uses MPI-3 for parallelisation.
## Level of the code sample complexity
Intermediate / Advanced
## Compiling
Thomas Ponweiser
committed
Follow the compilation instructions given in the main directory of the kernel samples directory (`/hpc_kernel_samples/README.md`).
Thomas Ponweiser
committed
## Running
To run the program, use something similar to
Thomas Ponweiser
committed
either on the command line or in your batch script.
### Command line arguments
* `-v [0-3]`: Specify the output verbosity level - 0: OFF; 1: INFO (Default); 2: DEBUG; 3: TRACE;
* `-g [rank]`: Debug MPI process with specified rank. Enables debug output for the specified rank (otherwise only output of rank 0 is written) and, if compiled with `-CFLAGS="-g -DDEBUG_ATTACH"`, enables a waiting loop for the specified rank which allows to attach a debugger.
* `-n [ncells-per-proc]`: Approximate average number of mesh cells per processor; Default: 16k (= 16 * 1024).
* `-N [ncells-total]`: Approximate total number of mesh cells (a nearby cubic number will be chosen)
* `-e [edge-length]`: Edge length of cube (mesh domain).
* `-w [halo-width]`: Halo width (in number of cells); Default: 1.
Thomas Ponweiser
committed
* `-i [iterations]`: Number of iterations for halo-exchange; Default: 100.
* Selecting halo-exchange mode:
* `--graph` (Default): Use MPI Graph topology and neighborhood collectives. If supported, this allows MPI to reorder the processes in order to choose a good embedding of the virtual topology to the physical machine.
* `--collective`: Use `MPI_Alltoallw`.
* `--p2p`: Use "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
* `--p2p-sync`: Use *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
* `--p2p-ready`: Use *ready send*, i.e. `MPI_IRecv` / `MPI_Barrier` / `MPI_Irsend`.
For large numbers as arguments to the options `-i`, `-n` or `-N`, the suffixes 'k' or 'M' may be used. For example, `-n 16k` specifies approximately 16 * 1024 mesh cells per processor; `-N 1M` specifies approximately 1024 * 1024 (~1 million) mesh cells in total.
### Example
If you run
Thomas Ponweiser
committed
the output should look similar to
Thomas Ponweiser
committed
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
Configuration:
* Verbosity level: DEBUG (2)
* Mesh domain (cube): x: [ 0, 64); y: [ 0, 64); z: [ 0, 64); cells: 262144
* Halo transfer mode: Sparse collective - MPI_Neighbor_alltoallw
* Number of iterations: 100
Cells per processor (min-max): 10912 - 22599
Examining neighborhood topology for MPI Graph communicator creation...
Found 4 neighbors for rank 0:
1 4 5 6
Creating MPI Graph communicator...
INFO: MPI reordered ranks: NO
Setting up index mappings and MPI types...
Adjacency matrix:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 X X X X
1 X X X X X X X X X X X
2 X X X X X
3 X X X X
4 X X X X
5 X X X X X X X X
6 X X X X X X X X
7 X X X X
8 X X X X X
9 X X X X X
10 X X X X X X X X
11 X X X X X X X X
12 X X X X X X X
13 X X X X X
14 X X X X X X
15 X X X X X X X X X X
Number of adjacent cells per neighbor (min-max): 31 - 1122
Exchanging halo information (100 iterations)...
Validating...
Validation successful.
Thomas Ponweiser
committed
## Benchmarks
### Communication mode comparison Supermuc Phase 1 Sandybridge
Communication mode comparison test on Supermuc with 10M mesh cells per proc with different communication modes and different number of cores.
Hardware: Supermuc Phase 1 Thin Nodes. Intel Sandy Bridge-EP Xeon E5-2680 8C @2.7GHz, 16 Cores per node.
Commandline:
mpiexec -n {ranks} ./8_unstructured_haloex {flags} -v1 -n 10M -i100
where `{ranks}` is the number of cores, `{flags}` is [`collective`, `p2p`, `p2p-sync`, `p2p-ready`].
As Supermuc Phase 1 Thin nodes are 8 core dual socket systems, 16 tasks per node are used.
![chart](hpc_kernel_samples/unstructured_grids/halo_exchange/benchmarks/SupermucChart.PNG)
Thomas Ponweiser
committed
## Known issues
Thomas Ponweiser
committed
### OpenMPI issue #1304
Thomas Ponweiser
committed
There is a [known issue for OpenMPI](https://github.com/open-mpi/ompi/issues/1304) when a MPI Datatype is marked for deallocation (with `MPI_Type_free`) while still in use by non-blocking collective operations. If you are using OpenMPI and get a segmentation fault in MPI_Bcast, try to re-compile with:
make CFLAGS="-DOMPI_BUG_1304" clean all
This just disables two critical calls to `MPI_Type_free`.