Issue
According to a table in this paper, numpy's np.dot
performance is comparable to a CUDA implementation of matrix multiplication, in experiments with 320x320
matrices. I was able to replicate that speedup for np.dot
on my machine with reasonable precision. The paper's CUDA-with-Numba code ran much slower for me, though, with a speedup of about 1200 instead of the 49258 reported.
Why is numpy's implementation so fast?
Edit: here's the code, taken from the paper; I only added the timeit
calls. I ran it on the laptop whose specs are given below.
CUDA
import numpy as np
from numba import cuda

# Naive matrix-multiplication kernel: one thread per output element.
@cuda.jit('void(float64[:, :], float64[:, :], float64[:, :], int32)')
def cu_matmul(a, b, c, n):
    x, y = cuda.grid(2)
    if (x >= n) or (y >= n):
        return
    c[x, y] = 0
    for i in range(n):
        c[x, y] += a[x, i] * b[i, y]

device = cuda.get_current_device()
tpb = device.WARP_SIZE            # threads per block dimension (32)
n = 320
bpg = (n + tpb - 1) // tpb        # blocks per grid dimension
grid_dim = (bpg, bpg)
block_dim = (tpb, tpb)

A = np.random.random((n, n)).astype(np.float64)
B = np.random.random((n, n)).astype(np.float64)
C = np.empty((n, n), dtype=np.float64)

# Copy the inputs to the GPU; allocate the output there without copying.
dev_A = cuda.to_device(A)
dev_B = cuda.to_device(B)
dev_C = cuda.to_device(C, copy=False)

cu_matmul[grid_dim, block_dim](dev_A, dev_B, dev_C, n)  # launch (asynchronous, returns None)
dev_C.copy_to_host(C)
assert np.allclose(np.dot(A, B), C)
Numpy
np.dot(A, B)
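The timeit calls aren't reproduced above. A minimal sketch of how both versions might be timed, assuming the definitions from the CUDA block; note that Numba kernel launches are asynchronous, so cuda.synchronize() is needed or you time only the launch itself:

import timeit

def run_cuda():
    cu_matmul[grid_dim, block_dim](dev_A, dev_B, dev_C, n)
    cuda.synchronize()  # wait for the kernel to actually finish

def run_numpy():
    np.dot(A, B)

reps = 100
print(f"CUDA:  {timeit.timeit(run_cuda, number=reps) / reps * 1e3:.3f} ms per call")
print(f"NumPy: {timeit.timeit(run_numpy, number=reps) / reps * 1e3:.3f} ms per call")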
System specs
Solution
Aside from what @norok2 links to, there is also the large overhead of transferring the data to and from the GPU. This becomes significant in several cases:
- The work you do on the GPU is comparable in cost to the data transfer overhead, i.e. you only perform one operation on less than a MB of data.
- The problem doesn't scale well. This is the case if your data size or your underlying problem doesn't allow the GPU to use its parallel processing units sufficiently.
- There are too many branches in your parallel code. This usually means a whole group of parallel processors has to wait on each branch (branching hardware is usually shared per group of arithmetic units on a GPU), slowing down the whole computation.
The first two points apply here: 320x320 is not extremely large (each float64 matrix is only about 0.8 MB), and a single multiplication is the only thing you're doing. CPUs are far from being obsoleted by GPUs, and this sort of case proves exactly that. The sketch below separates the transfer time from the kernel time to make the first point visible.
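A sketch of that measurement, assuming the kernel, grid/block dimensions, and arrays from the question's code (cuda.device_array_like allocates on the GPU without copying):

import time

start = time.perf_counter()
dev_A = cuda.to_device(A)            # host -> device copies
dev_B = cuda.to_device(B)
dev_C = cuda.device_array_like(C)    # allocation only, no copy
cuda.synchronize()
t_in = time.perf_counter() - start

start = time.perf_counter()
cu_matmul[grid_dim, block_dim](dev_A, dev_B, dev_C, n)
cuda.synchronize()                   # launches are asynchronous
t_kernel = time.perf_counter() - start

start = time.perf_counter()
dev_C.copy_to_host(C)                # device -> host copy
t_out = time.perf_counter() - start

print(f"transfer in:  {t_in * 1e3:.3f} ms")
print(f"kernel:       {t_kernel * 1e3:.3f} ms")
print(f"transfer out: {t_out * 1e3:.3f} ms")

For comparison, np.dot on float64 arrays dispatches to whatever optimized BLAS library NumPy was built against (np.show_config() shows which one), which is a large part of why the CPU side holds up so well here.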
Answered By - rubenvb