Single GPU CUDA Version(N=25000, M=25000): t= 1775.302799 ms
Single GPU CUDA Coalesced Version(N=25000, M=25000): t= 31.533570 ms
Single GPU CUDA shmem Version(N=25000, M=25000): t= 26.177359 ms
Single GPU cuBLAS Version(N=25000, M=25000): t= 26.734490 ms