# README - Halo Exchange Example ## Description Domain decomposition is a widely used work distribution strategy for many HPC applications (such as CFD codes, cellular automata, etc.). In many cases, the individual sub-domains logically overlap along their boundaries and mesh cells in these overlapping regions (called *halo cells* or *ghost cells*) need to be iteratively updated with data from neighboring processes; a communication pattern which we refer to as **halo-exchange**. This code sample demonstrates how to implement halo-exchange for structured or unstructured grids using advanced MPI features, in particular: * How to use MPI-3's **distributed graph topology** with and **neighborhood collectives** (a.k.a. sparse collectives), i.e. `MPI_Dist_graph_create_adjacent` and `MPI_Neighbor_alltoallw` * How to use MPI-Datatypes to **send and receive non-contiguous data** directly, avoiding send and receive buffer packing and unpacking.
For the sake of simplicity, this code sample does not deal with loading and managing any actual mesh data structure. It rather attempts to mimic the typical communication characteristics (i.e. the neighborhood relationships and message size variations between neighbors) for halo-exchange on a 3D unstructured mesh. For this purpose, a simple cube with edge-length *E* is used as global "mesh" domain, consisting of *E*³ regular hexahedral cells. A randomized, iterative algorithm is used for decomposing this cube into irregularly aligned box-shaped sub-domains.
Moreover, no actual computation is performed. Only halo-exchange takes place for a configurable number of times (`-i [N]`) and the exchanged information is validated once at the end of the program.
The code sample is structured as follows:
* `box.c`, `box.h`: Simple data structure for box-shaped "mesh" sub-domains and functions for decomposition, intersection, etc.
* `configuration.c`, `configuration.h`: Command-line parsing and basic logging facilities. * `field.c`, `field.h`: Data structure for mesh-associated data; merely an array of integers in this sample. * `main.c`: The main program. * `mesh.c`, `mesh.h`: Stub implementation of a mesh data-structure; provides neighborhood topology information for communication. * `mpicomm.c`, `mpicomm.h`: **Probably the most interesting part**, implementing the core message-passing functionality: * `mpi_create_graph_communicator`: Creates an MPI Graph topology communicator with `MPI_Dist_graph_create_adjacent`. * `mpi_halo_exchange_int_sparse_collective`: Halo exchange with `MPI_Neighbor_alltoallw`. * `mpi_halo_exchange_int_collective`: Halo exchange with `MPI_Alltoallw`. * `mpi_halo_exchange_int_p2p_default`: Halo exchange with "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`. * `mpi_halo_exchange_int_p2p_synchronous`: Halo exchange with *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`. * `mpi_halo_exchange_int_p2p_ready`: Halo exchange with *ready send*, i.e. `MPI_IRecv` / `MPI_Barrier` / `MPI_Irsend` * `mpitypes.c`, `mpitypes.h`: Code for initialization of custom MPI-Datatypes.
* `mpitype_indexed_int`: Creates MPI-Datatype for exchanging transfer of non-contiguous halo data.
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
## Release Date 2016-01-18 ## Version History * 2016-01-18: Initial Release on PRACE CodeVault repository ## Contributors * Thomas Ponweiser - [email@example.com](mailto:firstname.lastname@example.org) ## Copyright This code is available under Apache License, Version 2.0 - see also the license file in the CodeVault root directory. ## Languages This sample is entirely written in C. ## Parallelisation This sample uses MPI-3 for parallelisation. ## Level of the code sample complexity Intermediate / Advanced ## Compiling In order to build the sample, only a working MPI implementation supporting MPI-3 must be available. To compile, simply run:
If you need to use a MPI wrapper compiler other than `mpicc`, e.g. `mpiicc`, type: make MPICC=mpiicc In order to specify further compilation flags, e.g. `-g`, type: make CFLAGS="-g" ## Running To run the program, use something similar to mpirun -n [nprocs] ./haloex either on the command line or in your batch script. ### Command line arguments * `-v [0-3]`: Specify the output verbosity level - 0: OFF; 1: INFO (Default); 2: DEBUG; 3: TRACE; * `-g [rank]`: Debug MPI process with specified rank. Enables debug output for the specified rank (otherwise only output of rank 0 is written) and, if compiled with `-CFLAGS="-g -DDEBUG_ATTACH"`, enables a waiting loop for the specified rank which allows to attach a debugger. * `-n [ncells-per-proc]`: Approximate average number of mesh cells per processor; Default: 16k (= 16 * 1024). * `-N [ncells-total]`: Approximate total number of mesh cells (a nearby cubic number will be chosen) * `-e [edge-length]`: Edge length of cube (mesh domain).
* `-w [halo-width]`: Halo width (in number of cells); Default: 1.
* `-i [iterations]`: Number of iterations for halo-exchange; Default: 100. * Selecting halo-exchange mode: * `--graph` (Default): Use MPI Graph topology and neighborhood collectives. If supported, this allows MPI to reorder the processes in order to choose a good embedding of the virtual topology to the physical machine. * `--collective`: Use `MPI_Alltoallw`. * `--p2p`: Use "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`. * `--p2p-sync`: Use *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`. * `--p2p-ready`: Use *ready send*, i.e. `MPI_IRecv` / `MPI_Barrier` / `MPI_Irsend`. For large numbers as arguments to the options `-i`, `-n` or `-N`, the suffixes 'k' or 'M' may be used. For example, `-n 16k` specifies approximately 16 * 1024 mesh cells per processor; `-N 1M` specifies approximately 1024 * 1024 (~1 million) mesh cells in total. ### Example If you run mpirun -n 16 ./haloex -v 2
119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
the command line output should look similar to Configuration: * Verbosity level: DEBUG (2) * Mesh domain (cube): x: [ 0, 64); y: [ 0, 64); z: [ 0, 64); cells: 262144 * Halo transfer mode: Sparse collective - MPI_Neighbor_alltoallw * Number of iterations: 100 Cells per processor (min-max): 10912 - 22599 Examining neighborhood topology for MPI Graph communicator creation... Found 4 neighbors for rank 0: 1 4 5 6 Creating MPI Graph communicator... INFO: MPI reordered ranks: NO Setting up index mappings and MPI types... Adjacency matrix: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 X X X X 1 X X X X X X X X X X X 2 X X X X X 3 X X X X 4 X X X X 5 X X X X X X X X 6 X X X X X X X X 7 X X X X 8 X X X X X 9 X X X X X 10 X X X X X X X X 11 X X X X X X X X 12 X X X X X X X 13 X X X X X 14 X X X X X X 15 X X X X X X X X X X Number of adjacent cells per neighbor (min-max): 31 - 1122 Exchanging halo information (100 iterations)... Validating... Validation successful.
# Known issues ## OpenMPI issue #1304 There is a [known issue for OpenMPI](https://github.com/open-mpi/ompi/issues/1304) when a MPI Datatype is marked for deallocation (with `MPI_Type_free`) while still in use by non-blocking collective operations. If you are using OpenMPI and get a segmentation fault in MPI_Bcast, try to re-compile with: make CFLAGS="-DOMPI_BUG_1304" clean all This just disables two critical calls to `MPI_Type_free`.