In this example, we solve the heat equation. The idea is to iteratively apply a 5-point stencil.

Sequential
----------

[sequential.chpl](src/sequential.chpl) is a sequential implementation of the heat equation written in Chapel. The stencil computation is the most time-consuming part of the code and looks like this:

```
for (i,j) in Interior //Iterate over all non-border cells
{
  //Assign each cell in 'T' the mean of its neighboring cells in 'A'
  T[i,j] = (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]) / 5;
}
```

Basically, each *interior* element in `T` gets the mean of the corresponding element in `A` as well as the neighboring elements. Since `for` is a sequential language construct in Chapel, a single CPU-core will execute this code.
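For readers less familiar with Chapel, the same computation can be sketched in plain Python (an illustrative sketch only; the function name and the 4×4 test grid are made up for this example and are not part of the Chapel sources):

```python
# Plain-Python sketch of the 5-point stencil: every interior cell of T
# becomes the mean of the corresponding cell in A and its four neighbors.
def stencil_step(A):
    n = len(A)
    T = [row[:] for row in A]  # copy; border cells keep their values
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            T[i][j] = (A[i][j] + A[i-1][j] + A[i+1][j]
                       + A[i][j-1] + A[i][j+1]) / 5
    return T

# 4x4 grid: zero everywhere except one hot cell
A = [[0.0] * 4 for _ in range(4)]
A[1][1] = 5.0
T = stencil_step(A)
print(T[1][1], T[1][2], T[2][2])  # 1.0 1.0 0.0
```

Note that the result is written to a separate array `T`, exactly as in the Chapel code, so one sweep never reads its own partial results.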

In order to improve the performance, we can tell Chapel to use threads to execute the stencil operations in parallel ([single_machine.chpl](src/single_machine.chpl)). We do that by replacing `for` with `forall`, which tells Chapel to execute the iterations over `Interior` in parallel.

It is our responsibility to make sure that each iteration in the `forall` loop is independent in order not to introduce race conditions.

In this case, the iterations are clearly independent, since we only write to `T` and never read from it:

```
forall (i,j) in Interior //Iterate over all non-border cells
{
  //Assign each cell in 'T' the mean of its neighboring cells in 'A'
  T[i,j] = (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]) / 5;
}
```
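The independence argument can be illustrated outside Chapel as well. Below is a hedged Python sketch (not the Chapel implementation; the function name and the row split are invented for illustration) in which two threads each write a disjoint set of interior rows of `T` while only reading `A`, so no locking is required:

```python
from concurrent.futures import ThreadPoolExecutor

def stencil_rows(A, T, rows):
    """Apply the 5-point mean to the given interior rows.

    Each worker only reads A and writes its own disjoint rows of T,
    so the iterations are independent and race-free.
    """
    n = len(A)
    for i in rows:
        for j in range(1, n - 1):
            T[i][j] = (A[i][j] + A[i-1][j] + A[i+1][j]
                       + A[i][j-1] + A[i][j+1]) / 5

n = 6
A = [[1.0] * n for _ in range(n)]  # uniform temperature
T = [row[:] for row in A]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Split the interior rows 1..4 between two workers
    pool.submit(stencil_rows, A, T, [1, 2])
    pool.submit(stencil_rows, A, T, [3, 4])
print(T[2][3])  # 1.0 -- the mean of five cells that are all 1.0
```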

In order to improve the performance even further, we can tell Chapel to execute the stencil operation in parallel on multiple machines ([multiple_machines.chpl](src/multiple_machines.chpl)).

We still use the `forall` loop construct, but we have to tell Chapel how to distribute `A` and `T` between the multiple machines. For that, we use the `dmapped` language construct when defining the `Grid` and `Interior` domains:

```
use BlockDist;

const Grid = {0..n+1, 0..n+1} dmapped Block({1..n, 1..n});
const Interior = {1..n, 1..n} dmapped Block({1..n, 1..n});

var A, T : [Grid] real;//Zero initialized as default
```

We tell Chapel to use the same *block* distribution for the `Grid` and `Interior` domains such that each index in `Grid` has the same location as the corresponding index in `Interior`. Because they use the same distribution, no communication is needed when accessing the same index. For example, the operation `A[2,4] + T[2,4]` can be done locally on the machine that *owns* index `[2,4]`. However, it also means that an operation such as `A[2,4] + T[3,4]` will generally require communication.
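To make the locality argument concrete, here is a small Python sketch of a toy block distribution (the row-wise partitioning and the `owner` function are assumptions for illustration; Chapel's `Block` distribution actually blocks over all dimensions):

```python
def owner(i, n_rows, n_locales):
    """Toy 1D block distribution: rows are split into contiguous
    blocks and each block is owned by one locale (machine)."""
    rows_per_locale = (n_rows + n_locales - 1) // n_locales
    return i // rows_per_locale

n_rows, n_locales = 8, 8  # one row per locale

# A[2,4] + T[2,4]: both arrays use the same distribution, so the
# same index lives on the same locale -> a purely local operation.
print(owner(2, n_rows, n_locales) == owner(2, n_rows, n_locales))  # True

# A[2,4] + T[3,4]: rows 2 and 3 belong to different locales,
# so this addition requires communication between machines.
print(owner(2, n_rows, n_locales), owner(3, n_rows, n_locales))  # 2 3
```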

Now, let's run it (note that `-nl 8` tells Chapel to use eight locales):

It is very important that all arrays in the calculation use similar `dmapped` layouts. For example, if we do not use `dmapped` when defining `Interior`, we get horrible performance:
