# README - MKL Blas Gemm Example
## Description
Matrix multiplication can be implemented in several ways that differ widely in performance. This example demonstrates just how large those differences are.
This code sample demonstrates:
* Different implementations of a matrix multiplication (the two serial loop orders are sketched below)
  * Serial
    * IJK Algorithm
    * IKJ Algorithm
  * Parallel
    * OpenMP parallelized IKJ Algorithm
    * Intel MKL BLAS Library
* Benchmarking these implementations with different matrix dimensions
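
The performance gap between the two serial variants comes from the loop order: IKJ walks through `B` and `C` row by row, which is much friendlier to the cache than IJK. A minimal sketch of the two orderings, assuming square matrices stored row-major in flat `std::vector<float>` buffers (names are illustrative, not taken from the sample code):

```cpp
#include <cstddef>
#include <vector>

// IJK order: the inner loop strides through B column-wise (cache-unfriendly).
void gemm_ijk(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// IKJ order: the inner loop walks B and C row-wise (contiguous, cache-friendly).
// Assumes C is zero-initialised on entry.
void gemm_ikj(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}
```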
Benchmark design:
The program runs the multiplication with matrix dimensions from 2 to 2048 using single-precision floats. For timing accuracy, each multiplication is repeated several times and the average time per multiplication is calculated. The program output is CSV text containing the dimension and the time per multiplication in seconds.
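
A benchmark harness along these lines could produce the CSV output shown further down (a sketch only; the repetition count and the `benchmark`/`gemm` names are assumptions, not the sample's actual code):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Times one GEMM kernel for dimensions 2..2048 and prints "dimension;seconds" lines.
template <typename Gemm>
void benchmark(Gemm gemm, int repetitions = 10) {
    for (std::size_t n = 2; n <= 2048; n *= 2) {
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);
        const auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repetitions; ++r)
            gemm(A, B, C, n);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("%zu;%g\n", n, elapsed.count() / repetitions);
    }
}
```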
## Release Date
2016-10-24
## Version History
* 2016-10-24: Initial Release on PRACE CodeVault repository
## Contributors
* Thomas Steinreiter - [thomas.steinreiter@risc-software.at](mailto:thomas.steinreiter@risc-software.at)
## Copyright
This code is available under Apache License, Version 2.0 - see also the license file in the CodeVault root directory.
## Languages
This sample is entirely written in C++14.
## Parallelisation
This sample uses OpenMP for parallelisation; the Intel MKL BLAS variant is additionally multithreaded internally.
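
The OpenMP variant amounts to little more than distributing the outer loop of the IKJ kernel sketched above across threads (an illustrative sketch, not the sample's exact pragma placement):

```cpp
#include <cstddef>
#include <vector>

// IKJ order with the outer i-loop shared across OpenMP threads.
// Assumes C is zero-initialised on entry.
void gemm_ikj_omp(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n) {
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(n); ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}
```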
## Level of the code sample complexity
Intermediate
## Compiling
Follow the compilation instructions given in the main directory of the kernel samples directory (`/hpc_kernel_samples/README.md`).
## Running
To run the program from your current working directory, use something similar to
```
1_dense_mklblas blas
```
either on the command line or in your batch script.
### Command line arguments
* `<mode>`: Specify the implementation used for the multiplication (obligatory)
  * `serialijk` Use the serial IJK Algorithm
  * `serialikj` Use the serial IKJ Algorithm
  * `parallel` Use the parallel (OpenMP) IKJ Algorithm
  * `blas` Use the parallel Intel MKL BLAS Library
### Example
If you run
```
1_dense_mklblas blas
```
the output should look similar to
```
2;6.59846e-06
4;6.6475e-06
8;5.93214e-06
16;6.80383e-06
32;8.44547e-06
64;2.10678e-05
128;0.000102738
256;0.000789877
512;0.00560496
1024;0.0301494
2048;0.338072
```
### Benchmarks
Tested on an Intel Core i7-6820HQ @ 2.70 GHz (4 physical cores, 8 logical).
Compiler: Intel(R) 64, Version 17.0.0.109 Build 20160721 (MKL 2017.0) on Windows 10 1607.
![chart](hpc_kernel_samples/dense_linear_algebra/gemm/mklblas/benchmarks/Chart.PNG)
Note that the chart uses a logarithmic scale.
* The *serial IKJ* implementation is dramatically faster than IJK (roughly 30x at dimension 2048).
* *Serial IKJ* is the fastest for small sizes (2-16). *MKL BLAS* is the fastest for larger sizes (32-2048).
* The *parallel OpenMP IKJ* implementation is never the fastest. **Manually parallelising matrix multiplication is not recommended!**
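
For reference, the MKL BLAS variant essentially reduces to a single `cblas_sgemm` call, which is what makes it both the fastest and the least error-prone option. A sketch assuming square, row-major, single-precision matrices (not the sample's exact call site):

```cpp
#include <cstddef>
#include <vector>
#include <mkl.h>

// C = A * B for square n x n row-major single-precision matrices via MKL BLAS.
void gemm_blas(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, std::size_t n) {
    const MKL_INT m = static_cast<MKL_INT>(n);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, m, m,                // dimensions M, N, K
                1.0f, A.data(), m,      // alpha, A, leading dimension of A
                B.data(), m,            // B, leading dimension of B
                0.0f, C.data(), m);     // beta, C, leading dimension of C
}
```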