# GPU implementations of the matrix-vector multiplication algorithm using:

- cuBLAS (BLAS routines implemented on the GPU by NVIDIA)
  - 07/09/2017: Completed
  - 13/09/2017: Modified to use unified memory
- cuBLAS_MultiGPU (cuBLAS implementation across multiple GPUs/nodes)
  - 26/09/2017: Completed
- cuda_SingleGPU (three CUDA kernels showing the optimization steps in writing GPU code)
  - 02/10/2017: Completed kernel 1
  - 03/10/2017: Completed kernels 2 and 3

Tested environments:
- Intel Xeon E5-2660 v3 (Haswell) CPU with Linux x86_64, NVIDIA Tesla K40 GPUs and `cuda/8.0.61`
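The cuBLAS version was modified to use unified memory. The following is a minimal sketch of what that combination can look like, not the repository's code. The function name `gemv_cublas_managed` is hypothetical, and the sketch assumes single precision (`cublasSgemv`) and a column-major matrix, which is the layout cuBLAS expects. `cudaMallocManaged` returns pointers that are valid on both host and device, so the inputs are written directly from the host with no `cudaMemcpy`.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* y = A * x for an m-by-n matrix A in column-major order (cuBLAS layout).
   Buffers come from cudaMallocManaged, so host and GPU share the pointers.
   Returns 0 on success, -1 on any CUDA or cuBLAS error. */
int gemv_cublas_managed(int m, int n, const float *A_h, const float *x_h, float *y_h) {
    float *A = NULL, *x = NULL, *y = NULL;
    cublasHandle_t handle = NULL;
    const float alpha = 1.0f, beta = 0.0f;   /* y = alpha*A*x + beta*y */
    int status = -1;

    if (m < 1 || n < 1) return -1;
    if (cudaMallocManaged((void **)&A, sizeof(float) * m * n) != cudaSuccess ||
        cudaMallocManaged((void **)&x, sizeof(float) * n) != cudaSuccess ||
        cudaMallocManaged((void **)&y, sizeof(float) * m) != cudaSuccess) goto cleanup;

    /* The host writes straight into managed memory (no GPU work is running yet). */
    for (int i = 0; i < m * n; ++i) A[i] = A_h[i];
    for (int j = 0; j < n; ++j) x[j] = x_h[j];

    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) goto cleanup;
    if (cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, A, m, x, 1, &beta, y, 1)
            != CUBLAS_STATUS_SUCCESS) goto cleanup;
    /* cublasSgemv is asynchronous; on pre-Pascal GPUs such as the K40 the host
       must not touch managed memory until the device has been synchronized. */
    if (cudaDeviceSynchronize() != cudaSuccess) goto cleanup;
    for (int i = 0; i < m; ++i) y_h[i] = y[i];
    status = 0;
cleanup:
    if (handle) cublasDestroy(handle);
    cudaFree(A); cudaFree(x); cudaFree(y);
    return status;
}
```

For example, the 2-by-3 matrix `[[1, 2, 3], [4, 5, 6]]` is passed column by column as `{1, 4, 2, 5, 3, 6}` with `lda = m = 2`.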
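The multi-GPU version spans GPUs and nodes. The node-level part (for example MPI) is not sketched here. Within one node, one simple decomposition, assumed here and not confirmed as the repository's, gives each GPU a contiguous block of rows of A and its own cuBLAS handle. `gemv_multi_gpu` and `MAX_GPUS` are hypothetical names. A is stored row-major, so each row block is a contiguous chunk. Read in column-major order, that chunk is its own transpose, which is why the call uses `CUBLAS_OP_T`.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define MAX_GPUS 16  /* hypothetical upper bound for this sketch */

/* y = A * x, A m-by-n in row-major order, rows split across visible GPUs.
   Returns 0 on success, -1 on error. */
int gemv_multi_gpu(int m, int n, const float *A_h, const float *x_h, float *y_h) {
    float *dA[MAX_GPUS] = {0}, *dx[MAX_GPUS] = {0}, *dy[MAX_GPUS] = {0};
    cublasHandle_t h[MAX_GPUS] = {0};
    int first[MAX_GPUS + 1];                 /* first[g]: first row owned by GPU g */
    const float alpha = 1.0f, beta = 0.0f;
    int ngpu = 0, status = -1;

    if (m < 1 || n < 1) return -1;
    if (cudaGetDeviceCount(&ngpu) != cudaSuccess || ngpu < 1) return -1;
    if (ngpu > MAX_GPUS) ngpu = MAX_GPUS;
    if (ngpu > m) ngpu = m;                  /* every GPU gets at least one row */
    for (int g = 0; g <= ngpu; ++g) first[g] = (int)((long long)m * g / ngpu);

    /* First pass: upload each row block and launch its gemv (asynchronous). */
    for (int g = 0; g < ngpu; ++g) {
        int rows = first[g + 1] - first[g];
        if (cudaSetDevice(g) != cudaSuccess) goto cleanup;
        if (cudaMalloc((void **)&dA[g], sizeof(float) * rows * n) != cudaSuccess ||
            cudaMalloc((void **)&dx[g], sizeof(float) * n) != cudaSuccess ||
            cudaMalloc((void **)&dy[g], sizeof(float) * rows) != cudaSuccess) goto cleanup;
        if (cublasCreate(&h[g]) != CUBLAS_STATUS_SUCCESS) goto cleanup;
        if (cudaMemcpy(dA[g], A_h + (size_t)first[g] * n, sizeof(float) * rows * n,
                       cudaMemcpyHostToDevice) != cudaSuccess ||
            cudaMemcpy(dx[g], x_h, sizeof(float) * n, cudaMemcpyHostToDevice) != cudaSuccess)
            goto cleanup;
        /* Column-major view: n-by-rows matrix = transpose of the row block,
           so OP_T computes (row block) * x into dy[g]. */
        if (cublasSgemv(h[g], CUBLAS_OP_T, n, rows, &alpha, dA[g], n, dx[g], 1,
                        &beta, dy[g], 1) != CUBLAS_STATUS_SUCCESS) goto cleanup;
    }
    /* Second pass: gather. cudaMemcpy uses the same default stream as the
       gemv, so it waits for that GPU's result before copying. */
    for (int g = 0; g < ngpu; ++g) {
        if (cudaSetDevice(g) != cudaSuccess ||
            cudaMemcpy(y_h + first[g], dy[g], sizeof(float) * (first[g + 1] - first[g]),
                       cudaMemcpyDeviceToHost) != cudaSuccess) goto cleanup;
    }
    status = 0;
cleanup:
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        if (h[g]) cublasDestroy(h[g]);
        cudaFree(dA[g]); cudaFree(dx[g]); cudaFree(dy[g]);
    }
    return status;
}
```

Launching every GPU's gemv before collecting any results lets the devices compute concurrently. Per-GPU streams and pinned host memory would also overlap the transfers, at the cost of more code.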
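The cuda_SingleGPU kernels are described only as successive optimization steps. A common starting point, shown here as an assumption and not necessarily the repository's kernel 1, assigns one thread per output row. `gemv_naive` and `gemv_naive_host` are hypothetical names, and A is assumed to be row-major.

```cuda
#include <cuda_runtime.h>

/* Hypothetical naive kernel: thread `row` computes y[row] = sum_j A[row][j] * x[j].
   A is m-by-n, stored row-major. */
__global__ void gemv_naive(int m, int n, const float *A, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[(size_t)row * n + j] * x[j];
        y[row] = sum;
    }
}

/* Host wrapper with explicit device buffers and copies. Returns 0 on success. */
int gemv_naive_host(int m, int n, const float *A_h, const float *x_h, float *y_h) {
    float *A = NULL, *x = NULL, *y = NULL;
    const int threads = 128;                        /* assumed block size */
    const int blocks = (m + threads - 1) / threads; /* enough threads for all rows */
    int status = -1;

    if (m < 1 || n < 1) return -1;
    if (cudaMalloc((void **)&A, sizeof(float) * m * n) != cudaSuccess ||
        cudaMalloc((void **)&x, sizeof(float) * n) != cudaSuccess ||
        cudaMalloc((void **)&y, sizeof(float) * m) != cudaSuccess) goto cleanup;
    if (cudaMemcpy(A, A_h, sizeof(float) * m * n, cudaMemcpyHostToDevice) != cudaSuccess ||
        cudaMemcpy(x, x_h, sizeof(float) * n, cudaMemcpyHostToDevice) != cudaSuccess)
        goto cleanup;

    gemv_naive<<<blocks, threads>>>(m, n, A, x, y);
    if (cudaGetLastError() != cudaSuccess) goto cleanup;  /* launch errors */
    /* Same (default) stream as the kernel, so this copy waits for it to finish. */
    if (cudaMemcpy(y_h, y, sizeof(float) * m, cudaMemcpyDeviceToHost) != cudaSuccess)
        goto cleanup;
    status = 0;
cleanup:
    cudaFree(A); cudaFree(x); cudaFree(y);
    return status;
}
```

In this layout, neighbouring threads read elements of A that are `n` floats apart, so the loads are not coalesced. Typical next steps store A column-major or let a whole warp cooperate on one row with a shared-memory or shuffle reduction. These are not claimed to be what kernels 2 and 3 do.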