Commit 7e4af317 authored by Thomas Steinreiter's avatar Thomas Steinreiter
Browse files

finished readme, fixed name in cmake

parent f7568f00
......@@ -16,7 +16,7 @@ if ("${DWARF_PREFIX}" STREQUAL "")
set(DWARF_PREFIX 1_dense)
endif()
set(NAME ${DWARF_PREFIX}_gemm_tbb_lapack)
set(NAME ${DWARF_PREFIX}_mklblas)
if (OPENMP_FOUND AND Boost_FOUND AND MKL_FOUND)
enable_language(CXX)
......
# README - MKL Blas Gemm Example
## Description
Matrix multiplications can be done in different ways which vary in performance. This example should point out how massive these performance differences really are.
This code sample demonstrates:
* Different implementations to do a matrix multiplication
* Serial
* IJK Algorithm
* IKJ Algorithm
* Parallel
* OpenMP parallelized IKJ Algorithm
* Intel MKL BLAS Library
* Benchmark implementations with different matrix dimensions.
Benchmark design:
The program runs the multiplication with matrix dimensions from 2-2048 with single precision floats. For timing accuracy multiple repetitions are used and the average time per multiplication is calculated. The program output is a CSV-Text with dimension size and time per multiplication in seconds.
## Release Date
2016-10-24
## Version History
* 2016-10-24: Initial Release on PRACE CodeVault repository
## Contributors
* Thomas Steinreiter - [thomas.steinreiter@risc-software.at](mailto:thomas.steinreiter@risc-software.at)
## Copyright
This code is available under Apache License, Version 2.0 - see also the license file in the CodeVault root directory.
## Languages
This sample is entirely written in C++14.
## Parallelisation
This sample uses MPI-3 for parallelization.
## Level of the code sample complexity
Intermediate
## Compiling
Follow the compilation instructions given in the main directory of the kernel samples directory (`/hpc_kernel_samples/README.md`).
## Running
In your current working directory, to run the program you may use something similar to
```
1_dense_mklblas blas
```
either on the command line or in your batch script.
### Command line arguments
* `<mode>`: Specity the implementation used for the multiplication (obligatory)
* `serialijk` Use the serial IJK Algorithm
* `serialikj` Use the serial IKJ Algorithm
* `parallel` Use the parallel (OpenMp) IKJ Algorithm
* `serialijk` Use the parallel Intel MKL Library
### Example
If you run
```
1_dense_mklblas blas
```
the output should look similar to
```
2;6.59846e-06
4;6.6475e-06
8;5.93214e-06
16;6.80383e-06
32;8.44547e-06
64;2.10678e-05
128;0.000102738
256;0.000789877
512;0.00560496
1024;0.0301494
2048;0.338072
```
### Benchmarks
Tested on Intel Core i7-6820HQ @ 2,70GHz (4 physical cores, 8 logical).
Compiler: Intel(R) 64, Version 17.0.0.109 Build 20160721 (MKL 2017.0) on Windows 10 1607.
![chart](hpc_kernel_samples/dense_linear_algebra/gemm/mklblas/benchmarks/Chart.PNG)
Be aware the chart has logarithmic scale.
The benchmarks show some interesting facts:
* The *serial IKJ* implementation is brutally faster than IJK (~30x @2048)
* *Serial IKJ* is the fastest smaller sizes (2-16). *MKL Blas* is the fastest for larger sizes (32-2048).
* The *Parallel OpenMP IKJ implementation* is never the fastest. **Manually parallelizing matrix multiplication is not recommended!**
\ No newline at end of file
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment