# README - Halo Exchange Example
## Description
Domain decomposition is a widely used work distribution strategy for many HPC applications (such as CFD codes, cellular automata, etc.). In many cases, the individual sub-domains logically overlap along their boundaries and mesh cells in these overlapping regions (called *halo cells* or *ghost cells*) need to be iteratively updated with data from neighboring processes; a communication pattern which we refer to as **halo-exchange**.
This code sample demonstrates how to implement halo-exchange for structured or unstructured grids using advanced MPI features, in particular:
* How to use MPI-3's **distributed graph topology** and **neighborhood collectives** (a.k.a. sparse collectives), i.e. `MPI_Dist_graph_create_adjacent` and `MPI_Neighbor_alltoallw`.
* How to use MPI-Datatypes to **send and receive non-contiguous data** directly, avoiding send and receive buffer packing and unpacking.
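
To make the second point concrete: in a box-shaped local array, a halo region is a set of short rows scattered through memory. The sketch below computes one block length and one displacement per row, which is the form of input that `MPI_Type_indexed` expects. The function name `halo_blocks`, its parameters, and the x-fastest memory layout are assumptions made for illustration; the sample's own `mpitype_indexed_int` may be organized differently.

```c
/* Hypothetical helper, not taken from the sample's sources.
 * Assumed layout: the local field is an array of int with nx * ny cells
 * per z-plane and x varying fastest.
 * The region to transfer is the sub-box [x0,x1) x [y0,y1) x [z0,z1).
 * Every x-row of that region is contiguous in memory, so the function
 * writes one (block length, displacement) pair per row, both counted in
 * elements, and returns the number of blocks written. */
int halo_blocks(int nx, int ny,
                int x0, int x1, int y0, int y1, int z0, int z1,
                int *blocklengths, int *displacements)
{
    int count = 0;
    for (int z = z0; z < z1; z++) {
        for (int y = y0; y < y1; y++) {
            blocklengths[count] = x1 - x0;
            displacements[count] = x0 + nx * (y + ny * z);
            count++;
        }
    }
    return count;
}
```

Passing the resulting arrays to `MPI_Type_indexed` with `MPI_INT` as the old type, followed by `MPI_Type_commit`, would yield a datatype that describes the region in place, so no packing buffer is needed.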
For the sake of simplicity, this code sample does not deal with loading and managing any actual mesh data structure. It rather attempts to mimic the typical communication characteristics (i.e. the neighborhood relationships and message size variations between neighbors) for halo-exchange on a 3D unstructured mesh. For this purpose, a simple cube with edge-length *E* is used as global "mesh" domain, consisting of *E*³ regular hexahedral cells. A randomized, iterative algorithm is used for decomposing this cube into irregularly aligned box-shaped sub-domains.
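
A decomposition of this kind could be sketched as follows. This is not the algorithm from `box.c`; it is a minimal stand-in that repeatedly cuts the box with the most cells at a random position along its longest axis. The type `box_t` and the functions `box_cells` and `decompose_cube` are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical box type: half-open extents [lo, hi) along x, y and z. */
typedef struct { int lo[3]; int hi[3]; } box_t;

/* Number of cells in a box. */
long box_cells(const box_t *b)
{
    return (long)(b->hi[0] - b->lo[0]) * (b->hi[1] - b->lo[1])
                                       * (b->hi[2] - b->lo[2]);
}

/* Split the cube [0,E)^3 into n boxes. `boxes` must have room for n
 * entries. Returns the number of boxes produced, which is less than n
 * only if every box has shrunk to a single cell. */
int decompose_cube(int E, int n, unsigned seed, box_t *boxes)
{
    int count = 1;
    for (int d = 0; d < 3; d++) { boxes[0].lo[d] = 0; boxes[0].hi[d] = E; }
    srand(seed);
    while (count < n) {
        /* Pick the box with the most cells. */
        int big = 0;
        for (int i = 1; i < count; i++)
            if (box_cells(&boxes[i]) > box_cells(&boxes[big])) big = i;
        /* Pick its longest axis. */
        int axis = 0;
        for (int d = 1; d < 3; d++)
            if (boxes[big].hi[d] - boxes[big].lo[d] >
                boxes[big].hi[axis] - boxes[big].lo[axis]) axis = d;
        int len = boxes[big].hi[axis] - boxes[big].lo[axis];
        if (len < 2) break;               /* largest box is a single cell */
        /* Random cut position strictly inside the box: lo < cut < hi. */
        int cut = boxes[big].lo[axis] + 1 + rand() % (len - 1);
        boxes[count] = boxes[big];        /* new box takes the upper part */
        boxes[count].lo[axis] = cut;
        boxes[big].hi[axis] = cut;        /* old box keeps the lower part */
        count++;
    }
    return count;
}
```

Because every cut splits one box into two disjoint parts, the boxes always cover the cube exactly once. The sub-domain sizes produced by this stand-in can be far less balanced than those of the actual sample.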
Moreover, no actual computation is performed. The halo exchange is simply repeated a configurable number of times (`-i [N]`), and the exchanged data is validated once at the end of the program.
The code sample is structured as follows:
* `box.c`, `box.h`: Simple data structure for box-shaped "mesh" sub-domains and functions for decomposition, intersection, etc.
* `configuration.c`, `configuration.h`: Command-line parsing and basic logging facilities.
* `field.c`, `field.h`: Data structure for mesh-associated data; merely an array of integers in this sample.
* `main.c`: The main program.
* `mesh.c`, `mesh.h`: Stub implementation of a mesh data-structure; provides neighborhood topology information for communication.
* `mpicomm.c`, `mpicomm.h`: **Probably the most interesting part**, implementing the core message-passing functionality:
    * `mpi_create_graph_communicator`: Creates an MPI graph topology communicator with `MPI_Dist_graph_create_adjacent`.
    * `mpi_halo_exchange_int_sparse_collective`: Halo exchange with `MPI_Neighbor_alltoallw`.
    * `mpi_halo_exchange_int_collective`: Halo exchange with `MPI_Alltoallw`.
    * `mpi_halo_exchange_int_p2p_default`: Halo exchange with "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
    * `mpi_halo_exchange_int_p2p_synchronous`: Halo exchange with *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
    * `mpi_halo_exchange_int_p2p_ready`: Halo exchange with *ready send*, i.e. `MPI_Irecv` / `MPI_Barrier` / `MPI_Irsend`.
* `mpitypes.c`, `mpitypes.h`: Code for initialization of custom MPI-Datatypes.
    * `mpitype_indexed_int`: Creates an MPI datatype for the transfer of non-contiguous halo data.
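
As an illustration of the kind of helpers that `box.c` and `mesh.c` are described as providing, the following sketch intersects two boxes and derives a neighbor list from the box layout. The names `box_t`, `box_intersect` and `find_neighbors` are hypothetical, and treating two sub-domains as neighbors whenever one box, grown by the halo width, overlaps the other is an assumption rather than the sample's documented rule.

```c
/* Hypothetical box type: half-open extents [lo, hi) along x, y and z. */
typedef struct { int lo[3]; int hi[3]; } box_t;

/* Intersect a and b. Returns 1 and stores the overlap in *out if it is
 * non-empty; returns 0 otherwise (*out is then unspecified). */
int box_intersect(const box_t *a, const box_t *b, box_t *out)
{
    for (int d = 0; d < 3; d++) {
        out->lo[d] = a->lo[d] > b->lo[d] ? a->lo[d] : b->lo[d];
        out->hi[d] = a->hi[d] < b->hi[d] ? a->hi[d] : b->hi[d];
        if (out->lo[d] >= out->hi[d]) return 0;
    }
    return 1;
}

/* Collect all ranks whose box overlaps box `me` after growing it by
 * `halo` cells on every side. `neighbors` must have room for
 * nboxes - 1 entries. Returns the number of neighbors found. */
int find_neighbors(const box_t *boxes, int nboxes, int me, int halo,
                   int *neighbors)
{
    box_t grown = boxes[me], overlap;
    int count = 0;
    for (int d = 0; d < 3; d++) { grown.lo[d] -= halo; grown.hi[d] += halo; }
    for (int r = 0; r < nboxes; r++)
        if (r != me && box_intersect(&grown, &boxes[r], &overlap))
            neighbors[count++] = r;
    return count;
}
```

With a symmetric relationship like this one, the same list can serve as both the sources and the destinations arguments of `MPI_Dist_graph_create_adjacent`, and each overlap box describes the cells to be exchanged with that neighbor.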
## Release Date
2016-01-18
## Version History
* 2016-01-18: Initial Release on PRACE CodeVault repository
## Contributors
* Thomas Ponweiser - [thomas.ponweiser@risc-software.at](mailto:thomas.ponweiser@risc-software.at)
## Copyright
This code is available under Apache License, Version 2.0 - see also the license file in the CodeVault root directory.
## Languages
This sample is entirely written in C.
## Parallelisation
This sample uses MPI-3 for parallelisation.
## Level of the code sample complexity
Intermediate / Advanced
## Compiling
Building the sample requires only a working MPI implementation that supports MPI-3. To compile, simply run:

    make

If you need to use an MPI wrapper compiler other than `mpicc`, e.g. `mpiicc`, type:

    make MPICC=mpiicc

In order to specify further compilation flags, e.g. `-g`, type:

    make CFLAGS="-g"

## Running
To run the program, use something similar to

    mpirun -n [nprocs] ./haloex

either on the command line or in your batch script.
### Command line arguments
* `-v [0-3]`: Specify the output verbosity level - 0: OFF; 1: INFO (Default); 2: DEBUG; 3: TRACE;
* `-g [rank]`: Debug the MPI process with the specified rank. Enables debug output for that rank (otherwise only the output of rank 0 is written) and, if compiled with `CFLAGS="-g -DDEBUG_ATTACH"`, enables a waiting loop on that rank so that a debugger can be attached.
* `-n [ncells-per-proc]`: Approximate average number of mesh cells per processor; Default: 16k (= 16 * 1024).
* `-N [ncells-total]`: Approximate total number of mesh cells (a nearby cubic number will be chosen)
* `-e [edge-length]`: Edge length of cube (mesh domain).
* `-w [halo-width]`: Halo width (in number of cells); Default: 1.
* `-i [iterations]`: Number of iterations for halo-exchange; Default: 100.
* Selecting halo-exchange mode:
    * `--graph` (Default): Use MPI graph topology and neighborhood collectives. If supported, this allows MPI to reorder the processes in order to choose a good embedding of the virtual topology onto the physical machine.
    * `--collective`: Use `MPI_Alltoallw`.
    * `--p2p`: Use "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
    * `--p2p-sync`: Use *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
    * `--p2p-ready`: Use *ready send*, i.e. `MPI_Irecv` / `MPI_Barrier` / `MPI_Irsend`.
For large numbers as arguments to the options `-i`, `-n` or `-N`, the suffixes 'k' or 'M' may be used. For example, `-n 16k` specifies approximately 16 * 1024 mesh cells per processor; `-N 1M` specifies approximately 1024 * 1024 (~1 million) mesh cells in total.
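
The suffix handling and the choice of a cubic cell count can be sketched in a few lines. The function names `parse_count` and `nearest_cube_edge` are hypothetical, and picking the cube closest to the requested cell count is an assumption; the sample only states that "a nearby cubic number" is chosen.

```c
#include <stdlib.h>

/* Parse a count such as "100", "16k" or "1M", where k = 1024 and
 * M = 1024 * 1024. Returns -1 for malformed or negative input. */
long parse_count(const char *s)
{
    char *end;
    long value = strtol(s, &end, 10);
    if (end == s || value < 0) return -1;
    if (*end == 'k') { value *= 1024L; end++; }
    else if (*end == 'M') { value *= 1024L * 1024L; end++; }
    return *end == '\0' ? value : -1;
}

/* Edge length E whose cube E * E * E is closest to ncells (ncells >= 1). */
int nearest_cube_edge(long ncells)
{
    long e = 1;
    while ((e + 1) * (e + 1) * (e + 1) <= ncells) e++;
    /* Now e^3 <= ncells < (e+1)^3; pick the closer of the two cubes. */
    long below = ncells - e * e * e;
    long above = (e + 1) * (e + 1) * (e + 1) - ncells;
    return (int)(above < below ? e + 1 : e);
}
```

For 16 processes and the default of 16k cells per processor, the requested total is 16 * 16384 = 262144 cells, which is exactly 64³ and therefore gives an edge length of 64.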
### Example
If you run

    mpirun -n 16 ./haloex -v 2

the command line output should look similar to

    Configuration:
    * Verbosity level: DEBUG (2)
    * Mesh domain (cube): x: [ 0, 64); y: [ 0, 64); z: [ 0, 64); cells: 262144
    * Halo transfer mode: Sparse collective - MPI_Neighbor_alltoallw
    * Number of iterations: 100
    Cells per processor (min-max): 10912 - 22599
    Examining neighborhood topology for MPI Graph communicator creation...
    Found 4 neighbors for rank 0:
    1 4 5 6
    Creating MPI Graph communicator...
    INFO: MPI reordered ranks: NO
    Setting up index mappings and MPI types...
    Adjacency matrix:
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    0 X X X X
    1 X X X X X X X X X X X
    2 X X X X X
    3 X X X X
    4 X X X X
    5 X X X X X X X X
    6 X X X X X X X X
    7 X X X X
    8 X X X X X
    9 X X X X X
    10 X X X X X X X X
    11 X X X X X X X X
    12 X X X X X X X
    13 X X X X X
    14 X X X X X X
    15 X X X X X X X X X X
    Number of adjacent cells per neighbor (min-max): 31 - 1122
    Exchanging halo information (100 iterations)...
    Validating...
    Validation successful.

# Known issues
## OpenMPI issue #1304
There is a [known issue for OpenMPI](https://github.com/open-mpi/ompi/issues/1304) when an MPI datatype is marked for deallocation (with `MPI_Type_free`) while still in use by non-blocking collective operations. If you are using OpenMPI and get a segmentation fault in `MPI_Bcast`, try recompiling with:

    make CFLAGS="-DOMPI_BUG_1304" clean all

This just disables two critical calls to `MPI_Type_free`.