# README - Halo Exchange Example
## Description
Domain decomposition is a widely used work distribution strategy for many HPC applications (such as CFD codes, cellular automata, etc.). In many cases, the individual sub-domains logically overlap along their boundaries and mesh cells in these overlapping regions (called *halo cells* or *ghost cells*) need to be iteratively updated with data from neighboring processes; a communication pattern which we refer to as **halo-exchange**.
This code sample demonstrates how to implement halo-exchange for structured or unstructured grids using advanced MPI features, in particular:
* How to use MPI-3's **distributed graph topology** together with **neighborhood collectives** (a.k.a. sparse collectives), i.e. `MPI_Dist_graph_create_adjacent` and `MPI_Neighbor_alltoallw`.
* How to use MPI-Datatypes to **send and receive non-contiguous data** directly, avoiding send and receive buffer packing and unpacking.
For the sake of simplicity, this code sample does not deal with loading and managing any actual mesh data structure. It rather attempts to mimic the typical communication characteristics (i.e. the neighborhood relationships and message size variations between neighbors) for halo-exchange on a 3D unstructured mesh. For this purpose, a simple cube with edge-length *E* is used as global "mesh" domain, consisting of *E*³ regular hexahedral cells. A randomized, iterative algorithm is used for decomposing this cube into irregularly aligned box-shaped sub-domains.
Moreover, no actual computation is performed. The halo-exchange is simply repeated a configurable number of times (`-i [N]`), and the exchanged information is validated once at the end of the program.
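To make the decomposition idea more concrete, here is a minimal sketch of one possible randomized, iterative splitting scheme. It is not the algorithm from `box.c`: the type `box_t`, the functions `box_volume` and `decompose_cube`, and the rule "split the largest box at a random plane across its longest axis" are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical box type: half-open cell index ranges [lo, hi) per axis. */
typedef struct { int lo[3]; int hi[3]; } box_t;

/* Number of cells contained in a box. */
long box_volume(const box_t *b)
{
    long v = 1;
    for (int d = 0; d < 3; d++)
        v *= (long)(b->hi[d] - b->lo[d]);
    return v;
}

/* Decompose a cube of edge length e into at most n boxes.
 * `boxes` must have room for n entries. In every step, the box with the
 * most cells is cut at a random plane perpendicular to its longest axis.
 * Returns the number of boxes produced (fewer than n only if nothing can
 * be split any further). This splitting rule is an assumption. */
int decompose_cube(int e, int n, box_t *boxes, unsigned seed)
{
    assert(e > 0 && n > 0 && boxes != NULL);
    srand(seed);
    for (int d = 0; d < 3; d++) {
        boxes[0].lo[d] = 0;
        boxes[0].hi[d] = e;
    }
    int count = 1;
    while (count < n) {
        /* Find the box with the largest number of cells. */
        int big = 0;
        for (int i = 1; i < count; i++)
            if (box_volume(&boxes[i]) > box_volume(&boxes[big]))
                big = i;
        /* Find its longest axis. */
        int axis = 0;
        for (int d = 1; d < 3; d++)
            if (boxes[big].hi[d] - boxes[big].lo[d] >
                boxes[big].hi[axis] - boxes[big].lo[axis])
                axis = d;
        int len = boxes[big].hi[axis] - boxes[big].lo[axis];
        if (len < 2)
            break; /* the largest box is a single cell */
        /* Random cut position that leaves both halves non-empty. */
        int cut = boxes[big].lo[axis] + 1 + rand() % (len - 1);
        boxes[count] = boxes[big];
        boxes[big].hi[axis] = cut;
        boxes[count].lo[axis] = cut;
        count++;
    }
    return count;
}
```

For example, `decompose_cube(64, 16, boxes, 42u)` fills `boxes` with 16 non-overlapping boxes whose volumes add up to 64³ = 262144 cells, independent of the seed.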
The code sample is structured as follows:
* `box.c`, `box.h`: Simple data structure for box-shaped "mesh" sub-domains and functions for decomposition, intersection, etc.
* `configuration.c`, `configuration.h`: Command-line parsing and basic logging facilities.
* `field.c`, `field.h`: Data structure for mesh-associated data; merely an array of integers in this sample.
* `main.c`: The main program.
* `mesh.c`, `mesh.h`: Stub implementation of a mesh data-structure; provides neighborhood topology information for communication.
* `mpicomm.c`, `mpicomm.h`: **Probably the most interesting part**, implementing the core message-passing functionality:
  * `mpi_create_graph_communicator`: Creates an MPI Graph topology communicator with `MPI_Dist_graph_create_adjacent`.
  * `mpi_halo_exchange_int_sparse_collective`: Halo exchange with `MPI_Neighbor_alltoallw`.
  * `mpi_halo_exchange_int_collective`: Halo exchange with `MPI_Alltoallw`.
  * `mpi_halo_exchange_int_p2p_default`: Halo exchange with "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
  * `mpi_halo_exchange_int_p2p_synchronous`: Halo exchange with *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
  * `mpi_halo_exchange_int_p2p_ready`: Halo exchange with *ready send*, i.e. `MPI_Irecv` / `MPI_Barrier` / `MPI_Irsend`.
* `mpitypes.c`, `mpitypes.h`: Code for initialization of custom MPI-Datatypes.
  * `mpitype_indexed_int`: Creates an MPI-Datatype for the transfer of non-contiguous halo data.
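The non-contiguous halo data mentioned above can be described by a list of array offsets. The following sketch shows how such offsets might be computed for box-shaped sub-domains. All names (`box_t`, `box_grow`, `box_intersect`, `box_displacements`) and the storage layout (local data stored for the own box grown by the halo width, x index running fastest) are assumptions for illustration and need not match `box.c` or `mpitypes.c`. An offset list of this kind is what an indexed MPI datatype constructor such as `MPI_Type_create_indexed_block` expects as its displacement array.

```c
#include <assert.h>

/* Hypothetical box type: half-open cell index ranges [lo, hi) per axis. */
typedef struct { int lo[3]; int hi[3]; } box_t;

/* Return b enlarged by w cells on every side (own cells plus halo). */
box_t box_grow(box_t b, int w)
{
    for (int d = 0; d < 3; d++) {
        b.lo[d] -= w;
        b.hi[d] += w;
    }
    return b;
}

/* Intersect a and b. Returns 1 and stores the overlap in *out if it is
 * non-empty; returns 0 otherwise (*out is then unspecified). */
int box_intersect(const box_t *a, const box_t *b, box_t *out)
{
    for (int d = 0; d < 3; d++) {
        out->lo[d] = a->lo[d] > b->lo[d] ? a->lo[d] : b->lo[d];
        out->hi[d] = a->hi[d] < b->hi[d] ? a->hi[d] : b->hi[d];
        if (out->lo[d] >= out->hi[d])
            return 0;
    }
    return 1;
}

/* Write the linear array offsets of all cells of `region` into `displs`,
 * relative to the storage box `store` (assumed layout: x fastest, then y,
 * then z). `region` must lie inside `store`. Returns the number of cells. */
int box_displacements(const box_t *store, const box_t *region,
                      int *displs, int max)
{
    int nx = store->hi[0] - store->lo[0];
    int ny = store->hi[1] - store->lo[1];
    int n = 0;
    for (int z = region->lo[2]; z < region->hi[2]; z++)
        for (int y = region->lo[1]; y < region->hi[1]; y++)
            for (int x = region->lo[0]; x < region->hi[0]; x++) {
                assert(n < max);
                displs[n++] = ((z - store->lo[2]) * ny
                               + (y - store->lo[1])) * nx
                              + (x - store->lo[0]);
            }
    return n;
}
```

With an own box of 4 x 4 x 4 cells and a halo width of 1, the storage box is 6 x 6 x 6. A neighbor touching the +x face then contributes 16 halo cells, whose offsets are 6 apart within a z-plane and therefore not contiguous in memory.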
## Release Date
2016-01-18
## Version History
* 2016-01-18: Initial Release on PRACE CodeVault repository
## Contributors
* Thomas Ponweiser - [thomas.ponweiser@risc-software.at](mailto:thomas.ponweiser@risc-software.at)
## Copyright
This code is available under the Apache License, Version 2.0. See also the license file in the CodeVault root directory.
## Languages
This sample is entirely written in C.
## Parallelisation
This sample uses MPI-3 for parallelisation.
## Level of the code sample complexity
Intermediate / Advanced
## Compiling
To build the sample, all you need is a working MPI implementation that supports MPI-3. To compile, simply run:

    make
If you need to use an MPI wrapper compiler other than `mpicc`, e.g. `mpiicc`, type:

    make MPICC=mpiicc
In order to specify further compilation flags, e.g. `-g`, type:

    make CFLAGS="-g"
## Running
To run the program, use something similar to

    mpirun -n [nprocs] ./haloex

either on the command line or in your batch script.
### Command line arguments
* `-v [0-3]`: Specify the output verbosity level - 0: OFF; 1: INFO (Default); 2: DEBUG; 3: TRACE;
* `-g [rank]`: Debug the MPI process with the specified rank. Enables debug output for the specified rank (otherwise only the output of rank 0 is written) and, if compiled with `CFLAGS="-g -DDEBUG_ATTACH"`, enables a waiting loop for the specified rank so that a debugger can be attached.
* `-n [ncells-per-proc]`: Approximate average number of mesh cells per processor; Default: 16k (= 16 * 1024).
* `-N [ncells-total]`: Approximate total number of mesh cells (a nearby cubic number will be chosen)
* `-e [edge-length]`: Edge length of cube (mesh domain).
* `-w [halo-width]`: Halo width (in number of cells); Default: 1.
* `-i [iterations]`: Number of iterations for halo-exchange; Default: 100.
* Selecting halo-exchange mode:
  * `--graph` (Default): Use MPI Graph topology and neighborhood collectives. If supported, this allows MPI to reorder the processes in order to choose a good embedding of the virtual topology to the physical machine.
  * `--collective`: Use `MPI_Alltoallw`.
  * `--p2p`: Use "normal" send, i.e. `MPI_Irecv` / `MPI_Isend`.
  * `--p2p-sync`: Use *synchronous send*, i.e. `MPI_Irecv` / `MPI_Issend`.
  * `--p2p-ready`: Use *ready send*, i.e. `MPI_Irecv` / `MPI_Barrier` / `MPI_Irsend`.
For large numbers as arguments to the options `-i`, `-n` or `-N`, the suffixes 'k' or 'M' may be used. For example, `-n 16k` specifies approximately 16 * 1024 mesh cells per processor; `-N 1M` specifies approximately 1024 * 1024 (~1 million) mesh cells in total.
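The suffix rule can be illustrated with a short sketch. The function name `parse_count` and its error convention (returning -1) are hypothetical and not taken from `configuration.c`; overflow for very large inputs is not handled here.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a non-negative count with an optional binary suffix:
 * "100" -> 100, "16k" -> 16 * 1024, "1M" -> 1024 * 1024.
 * Returns -1 for malformed input. Hypothetical helper; overflow of very
 * large values is not checked. */
long parse_count(const char *s)
{
    assert(s != NULL);
    char *end;
    long value = strtol(s, &end, 10);
    if (end == s || value < 0)
        return -1; /* no digits found, or negative number */
    if (*end == 'k') {
        value *= 1024L;
        end++;
    } else if (*end == 'M') {
        value *= 1024L * 1024L;
        end++;
    }
    /* Anything left after the optional suffix is an error. */
    return *end == '\0' ? value : -1;
}
```

Here `parse_count("16k")` yields 16384 and `parse_count("1M")` yields 1048576, matching the two examples given above.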
### Example
If you run

    mpirun -n 16 ./haloex -v 2
the command line output should look similar to

    Configuration:
    * Verbosity level: DEBUG (2)
    * Mesh domain (cube): x: [ 0, 64); y: [ 0, 64); z: [ 0, 64); cells: 262144
    * Halo transfer mode: Sparse collective - MPI_Neighbor_alltoallw
    * Number of iterations: 100
    Cells per processor (min-max): 10912 - 22599
    Examining neighborhood topology for MPI Graph communicator creation...
    Found 4 neighbors for rank 0:
    1 4 5 6
    Creating MPI Graph communicator...
    INFO: MPI reordered ranks: NO
    Setting up index mappings and MPI types...
    Adjacency matrix:
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    0 X X X X
    1 X X X X X X X X X X X
    2 X X X X X
    3 X X X X
    4 X X X X
    5 X X X X X X X X
    6 X X X X X X X X
    7 X X X X
    8 X X X X X
    9 X X X X X
    10 X X X X X X X X
    11 X X X X X X X X
    12 X X X X X X X
    13 X X X X X
    14 X X X X X X
    15 X X X X X X X X X X
    Number of adjacent cells per neighbor (min-max): 31 - 1122
    Exchanging halo information (100 iterations)...
    Validating...
    Validation successful.
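The mesh size in this run follows from the defaults: 16 processes with 16k cells each give 16 * 16 * 1024 = 262144 cells, which is exactly 64³. When the requested total is not a perfect cube, a nearby cubic number is chosen. The sketch below shows one plausible way to do this, picking the edge length whose cube is closest to the request. The function name `nearest_cube_edge` and the "closest cube" rule are assumptions; the actual rounding used by the sample may differ.

```c
#include <assert.h>

/* Return the edge length e for which e^3 is closest to n (n > 0).
 * Uses integer arithmetic only. On a tie, the smaller edge is returned.
 * Hypothetical helper, not taken from the sample's sources. */
int nearest_cube_edge(long n)
{
    assert(n > 0);
    long e = 1;
    /* Largest e with e^3 <= n. */
    while ((e + 1) * (e + 1) * (e + 1) <= n)
        e++;
    long below = n - e * e * e;
    long above = (e + 1) * (e + 1) * (e + 1) - n;
    return (int)(above < below ? e + 1 : e);
}
```

For the run above, `nearest_cube_edge(16L * 16 * 1024)` returns 64. For `-N 1M` (1048576 cells), this rule would give an edge length of 102, i.e. 1061208 cells.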