README.md


# From Serial to Parallel: A simple training using the Martix-Vector multiplication algorithm


## Intro

This training's purpose is to select a simple algorithm, and starting from a basic serial implementation to explore multiple parallel options and implementations.
In this case, we used the simple Matrix-Vector multiplication algorithm, because of its simplicity and parallelization posibilities. All our experiments were implemented and tested on the GRNET's ARIS HPC. Each program subdirectory ( GPUs, MPI, OpenMP ) contains the corresponding programs source code, submit scripts and makefiles. 

## Directory breakdown
```
Training
├── External_Functions
├── GPUs
├── MPI
├── OpenMP
├── Outputs
└── Serial
```

## External_Functions

This directory contains basic helper functions used by most of our programs. These are included and compiled along with the programs in their own directories. Changing this directory's location requires updating the program makefiles.

## Serial
A basic Serial Implementation of the Matrix-Vector multiplication algorithm, mostly used for error-checking and speedup calculation.

## OpenMP
OpenMP is the simplest parallelization tool for shared memory architectures, and thus this is where we start from. In this directory we start with a simple OpenMP 'parallel for' implementation (OpenMP.c), which scales only for a small number of cores. Then, we update this simple program to utilize thread affinity/binding (which is done externally by the calling script), by initializing data to the correct sockets/caches with first touch policy (OpenMP_aff.c). 

## MPI
To further scale in multiple nodes, we use a non-shared memory model tool, MPI (Intel MPI in our case for compiling). We start with a bacic MPI implementation which scales (theoritically) to any number of nodes/cores (MPI.c). Then, in order to utilize shared memory better we implement a hybrid MPI-OpenMP version (MPI-OpenMP.c - MPI for multinode and OpenMP internally in every node for shared memory multicore utilization). In both cases, computation time scales smoothly, but inter-process communication time poses a big problem (because of the small computational intensity of the Matrix-Vector kernel).

## GPUs
Finally, we implement our base-algorithm with CUDA in a Nvidia GPU(cuda_SingleGPU.cu + dmv_gpu.cu). We invoke 3 different kernels, starting from a simple-naive one and improving him as we go (in the second kernel we transpose the matrix to achieve coalesced memory access, and in the third one we also use the block shared memory (shmem) to utilize bandwidth better). To test our implementations we also implement a cuBLAS (Nvidia parallel BLAS routine library) version (cuBLAS_SingleGPU.cu). Then, we create a final hybrid cuBlAS-MPI version (cuBLAS_MultiGPU.cu) in order to utilize a possible multi-gpu/node architecture (MPI inter-process communication is still a big problem for the Matrix-Vector kernel, but in a more computational intensive scenario a huge scale-up is possible). 

## Compilation/Running
All executables can be created by running the Makefiles in the corresponding directories. There is also a global-maker in the project root directory. Every program directory contains a slurm file for execution in the ARIS system (for other systems corresponding adjustments must be made). See compile.mk in External_Functions directory for all the available compiler options.