To compile and run these examples, you only need a working Chapel compiler and ``make``. You can download Chapel at http://chapel.cray.com/.

In this example, we solve the heat equation. The idea is to apply a 5-point stencil on a domain iteratively until equilibrium.
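Written out, and assuming the stencil is the plain five-cell average described by the code comments in this document (``A`` holds the current grid and ``T`` the updated grid), one sweep would update every interior cell ``(i,j)`` as follows:

```latex
T_{i,j} = \frac{A_{i,j} + A_{i-1,j} + A_{i+1,j} + A_{i,j-1} + A_{i,j+1}}{5}
```

Repeating this sweep until ``T`` and ``A`` no longer differ noticeably is what "until equilibrium" means here.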

Sequential
----------

`sequential.chpl <src/sequential.chpl>`_ is a sequential implementation of the heat equation written in Chapel. The stencil computation is the most time-consuming part of the code and looks like this::

    for (i,j) in Interior do //Iterate over all non-border cells
    {
      //Assign each cell in 'T' the mean of its neighboring cells in 'A'
      T[i,j] = (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]) / 5;
    }

Basically, each *interior* element in ``T`` gets the mean of the corresponding element in ``A`` as well as the neighboring elements. Since ``for`` is a sequential language construct in Chapel, a single CPU-core will execute this code.
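To show how the stencil sits inside the iteration towards equilibrium, here is a minimal, self-contained sketch. It is not the contents of ``sequential.chpl``: the grid size ``n``, the tolerance ``epsilon``, the cap ``maxIterations``, the boundary condition, and the convergence measure ``delta`` are all assumptions made for illustration.

```chapel
// Hypothetical settings; none of these values come from sequential.chpl.
config const n = 8;                 // interior cells per dimension
config const epsilon = 1e-6;        // stop when one sweep changes less than this in total
config const maxIterations = 10000; // safety cap on the number of sweeps

const Grid = {0..n+1, 0..n+1};      // whole domain, including the border
const Interior = {1..n, 1..n};      // non-border cells only

var A, T : [Grid] real;             // zero-initialized by default

// Assumed boundary condition: hold the left border column at a fixed temperature.
for i in 0..n+1 do A[i, 0] = 100.0;

var delta = max(real);
var iterations = 0;
while delta > epsilon && iterations < maxIterations {
  delta = 0.0;
  for (i,j) in Interior {
    // Mean of the cell and its four direct neighbors in 'A'
    T[i,j] = (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]) / 5;
    // Accumulate the total absolute change of this sweep
    delta += abs(T[i,j] - A[i,j]);
  }
  // Copy the new interior back; the border keeps its fixed values.
  for (i,j) in Interior do A[i,j] = T[i,j];
  iterations += 1;
}
writeln("Stopped after ", iterations, " sweeps (delta = ", delta, ")");
```

Only ``for`` loops are used, so every sweep runs on a single core. Because the settings are declared with ``config const``, they can be changed on the command line, for example ``--n=100``.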

Multi-core
----------

To improve performance, we can tell Chapel to use threads to execute the stencil operations in parallel (`single_machine.chpl <src/single_machine.chpl>`_). We do that by replacing ``for`` with ``forall``, which tells Chapel to execute the iterations over ``Interior`` in parallel.

It is our responsibility to make sure that each iteration in the ``forall`` loop is independent in order not to introduce race conditions.

In this case the iterations are clearly independent: each one reads only from ``A`` and writes only its own element of ``T``, and ``T`` is never read::

    forall (i,j) in Interior do //Iterate over all non-border cells
    {
      //Assign each cell in 'T' the mean of its neighboring cells in 'A'
      T[i,j] = (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]) / 5;
    }
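Independence is easy to lose once the loop also has to update a shared value, such as a running total used to test for convergence. The sketch below shows one way to keep such a loop free of races, using a ``+ reduce`` intent so that each task accumulates into its own private copy of the total. The variable ``delta``, the grid size ``n``, and the boundary condition are assumptions for illustration and are not taken from ``single_machine.chpl``.

```chapel
config const n = 8;  // hypothetical number of interior cells per dimension

const Grid = {0..n+1, 0..n+1};   // whole domain, including the border
const Interior = {1..n, 1..n};   // non-border cells only
var A, T : [Grid] real;          // zero-initialized by default

// Assumed boundary condition: hold the left border column at a fixed temperature.
for i in 0..n+1 do A[i, 0] = 100.0;

var delta = 0.0;
// '+ reduce delta' gives every task a private accumulator and sums them
// into the outer 'delta' when the loop finishes.
forall (i,j) in Interior with (+ reduce delta) {
  T[i,j] = (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]) / 5;
  delta += abs(T[i,j] - A[i,j]);
}
writeln("Total change in this sweep: ", delta);
```

The writes to ``T[i,j]`` stay independent exactly as before. Only the accumulation needs the reduce intent, because every iteration contributes to the same scalar.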

Multiple Machines
-----------------

To improve performance even further, we can tell Chapel to execute the stencil operation in parallel on multiple machines (`multiple_machines.chpl <src/multiple_machines.chpl>`_).

We still use the ``forall`` loop construct, but we have to tell Chapel how to distribute ``A`` and ``T`` across the machines. For that, we use the ``dmapped`` language construct when defining the ``Grid`` and ``Interior`` domains::

    var A, T : [Grid] real; //Zero initialized as default

We tell Chapel to use the same *block* distribution for the ``Grid`` and ``Interior`` domains so that each index in ``Grid`` is stored at the same location as the corresponding index in ``Interior``. Because they use the same distribution, no communication is needed when accessing the same index. For example, the operation ``A[2,4] + T[2,4]`` can be done locally on the machine that *owns* index ``[2,4]``. However, it also means that an operation such as ``A[2,4] + T[3,4]`` may require communication, namely whenever indices ``[2,4]`` and ``[3,4]`` are owned by different machines.
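The following sketch shows what such ``dmapped`` declarations might look like and how to ask Chapel which locale (machine) owns a given element. It is not copied from ``multiple_machines.chpl``: the grid size ``n`` and the choice of bounding box are assumptions. It uses the ``Block`` spelling from the ``BlockDist`` module; newer Chapel releases, to our knowledge, name the distribution ``blockDist``, so the ``dmapped`` clause may need adjusting for your compiler version.

```chapel
use BlockDist;

config const n = 8;  // hypothetical number of interior cells per dimension

// Assumption: both domains are mapped with the same bounding box, so an
// index (i,j) is assigned to the same locale in 'Grid' and in 'Interior'.
const Grid = {0..n+1, 0..n+1} dmapped Block(boundingBox={0..n+1, 0..n+1});
const Interior = {1..n, 1..n} dmapped Block(boundingBox={0..n+1, 0..n+1});

var A, T : [Grid] real;  // zero-initialized by default

// Every element can report the locale that stores it.
writeln("A[2,4] is stored on locale ", A[2,4].locale.id);
writeln("T[2,4] is stored on locale ", T[2,4].locale.id);
writeln("T[3,4] is stored on locale ", T[3,4].locale.id);
```

The first two lines always report the same locale, because ``A`` and ``T`` are declared over the same distributed domain. The third agrees only if index ``[3,4]`` falls in the same block as ``[2,4]``. On a single locale all three report locale 0.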

In HPC, it is very important to use ``dmapped`` so that you minimize the communication requirements of your application.