Issue
Given a numpy array of shape 4000x4000x3, which is an image with 3 channels, I want to subtract a different value from each channel. I can do it as follows:
Implementation 1
values = [0.43, 0.44, 0.45]
image -= values
Or,
Implementation 2
values = [0.43, 0.44, 0.45]
for i in range(3):
    image[..., i] -= values[i]
Surprisingly, the latter solution is 20x faster than the former when operating on an image of this size, but I don't clearly understand why. Could you please explain what makes the second implementation faster? Thanks.
Simple script for confirmation:
import time
import numpy as np
image = np.random.rand(4000, 4000, 3).astype("float32")
values = [0.43, 0.44, 0.45]
st = time.time()
for i in range(3):
    image[..., i] -= values[i]
et = time.time()
print("Implementation 2", et - st)
st = time.time()
image -= values
et = time.time()
print("Implementation 1", et - st)
Output
Implementation 2 0.030953645706176758
Implementation 1 0.8593623638153076
Solution
TL;DR: the performance issue is caused by multiple factors:
- Numpy iterators introduce a big overhead on small broadcasted arrays;
- values is implicitly converted to a np.float64 array, causing a slow np.float64-based subtraction;
- the memory layout is sub-optimal in all implementations.
Numpy internal iterators
One issue comes from the Numpy implementation. Numpy uses inefficient code for handling small arrays that are broadcasted. This is due to an internal concept called Numpy iterators, which introduce a significant overhead when arrays are small because of the repeated iterations over the same small array. This concept is necessary in Numpy to make the computing code more generic and to support many features, especially broadcasting.
On top of that, Numpy cannot benefit from using SIMD instructions in this case because the array is too small to even fit in a SIMD register of mainstream CPUs.
A simple way to prove this hypothesis is to flatten the array and analyse the performance of subtracting arrays of different sizes containing the same repeated values:
view = image.reshape(-1, 3)
%time view -= np.tile(values, 1)
view = image.reshape(-1, 6)
%time view -= np.tile(values, 2)
view = image.reshape(-1, 12)
%time view -= np.tile(values, 4)
view = image.reshape(-1, 24)
%time view -= np.tile(values, 8)
view = image.reshape(-1, 384)
%time view -= np.tile(values, 128)
view = image.reshape(-1, 3*4000)
%time view -= np.tile(values, 4000)
# Pathological case: huge generated array
view = image.reshape(-1, 3*4000*1000)
%time view -= np.tile(values, 4000*1000)
Here are results on my machine with an Intel Xeon W-2255 CPU using Numpy 1.20.3:
Implementation 2 0.06000018119812012
Implementation 1 0.1734762191772461
Wall time: 175 ms
Wall time: 137 ms
Wall time: 106 ms
Wall time: 95 ms
Wall time: 67 ms <-----
Wall time: 69 ms
Wall time: 108 ms
One can see that the bigger the repeated array, the better, up to a point. However, when the subtracted array is too big, generating the repeated array using np.tile takes too much time. Indeed, the generated array is then so big that it does not fit in the (fast) caches of the CPU but only in the (slow) DRAM. Since the operation is rather memory bound, it can be more than twice slower.
One can also see that the fastest timing of the flattened variants is close to "Implementation 2". It is a bit slower, probably still due to the Numpy internal iterators.
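As a side check, one can isolate the iterator overhead from the dtype issue discussed in the next section by comparing a float32 broadcasted subtraction against a plain scalar subtraction on the same data. This is a minimal sketch (timings vary by machine; the helper names are illustrative):
import timeit
import numpy as np

image = np.random.rand(4000, 4000, 3).astype("float32")
vals32 = np.array([0.43, 0.44, 0.45], dtype=np.float32)

# Broadcasting a tiny 3-item array over 16 million pixels: the small
# inner dimension forces Numpy through its slow generic iterator path.
t_broadcast = timeit.timeit(lambda: image - vals32, number=5) / 5

# Subtracting a single scalar: no small broadcasted array is involved,
# so the fast contiguous (SIMD-friendly) loop is used.
t_scalar = timeit.timeit(lambda: image - np.float32(0.43), number=5) / 5

print(f"broadcast 3-item array: {t_broadcast:.3f} s")
print(f"scalar subtraction:     {t_scalar:.3f} s")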
Datatypes and implicit conversions
Another big issue is that values is a list of float objects. It is implicitly converted to a 1D array of np.float64 values, while image is of type np.float32. Due to the type promotion rules in Numpy, the operation is done using the np.float64 type, which is significantly slower and not needed here anyway.
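The implicit promotion can be checked directly. Here is a small sketch (expected dtypes shown as comments):
import numpy as np

image = np.zeros((2, 2, 3), dtype=np.float32)
values = [0.43, 0.44, 0.45]

# A Python list of floats becomes a float64 array...
print(np.asarray(values).dtype)                   # float64

# ...and float32 combined with a float64 array promotes to float64,
# so the subtraction is computed in (slower) double precision.
print(np.result_type(image, np.asarray(values)))  # float64
print((image - values).dtype)                     # float64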
We can fix that by creating the Numpy array explicitly with the right data-type: values = np.array(values, dtype=np.float32). Here are the results of the same (above) benchmark code but now using a np.float32 array:
Wall time: 75 ms
Wall time: 46 ms
Wall time: 39 ms
Wall time: 31 ms
Wall time: 23 ms
Wall time: 18 ms <-----
We can see that using only np.float32-based operations is much faster.
More explanations
Could you please explain what makes the second implementation faster?
The second implementation operates on np.float32 values and does not suffer from the iterator overhead, since there is no broadcasting done internally. This is because the right-hand side is a unique float scalar. Numpy directly converts the value to a np.float32 internally in this case for the sake of performance, since the left-hand side is a np.float32 array. You can check that by manually converting the value to np.float32. Note that Numpy apparently converts the value to a np.float32 internally even when it is a np.float64 one.
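This can be verified with a quick dtype check. Note that the np.float64 scalar case relies on value-based casting, which Numpy 1.x used at the time of writing; Numpy 2.x (NEP 50) promotes it to float64 instead:
import numpy as np

image = np.zeros(4, dtype=np.float32)

# A plain Python float right-hand side keeps the computation in float32.
print((image - 0.43).dtype)              # float32

# In Numpy 1.x, value-based casting keeps float32 even for a np.float64
# scalar; Numpy 2.x (NEP 50) promotes the result to float64 instead.
print((image - np.float64(0.43)).dtype)  # float32 on 1.x, float64 on 2.x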
However, this implementation is not the most efficient because it walks through the whole image array 3 times (so the whole array needs to be read from and written back to the DRAM 3 times).
Final optimized code
Put shortly, you can write:
st = time.time()
image -= np.tile(np.array(values, dtype=np.float32), image.shape[1]).reshape(-1, 3)
et = time.time()
print("Implementation 3", et - st) # 0.018 seconds
Notes on memory layouts
While repeating values image.shape[1] times is guaranteed to work and should generally be quite fast, the best solution is to avoid the height x width x components layout, which tends to be inefficient (fundamentally not SIMD friendly), especially with Numpy. Instead, it is better to use the component x height x width layout (or possibly height x component x width, which tends to be a good trade-off between the two). See this post for example. Note that, for the same reason, it is much better to use a (2,N)-shaped array than the common (N,2) one.
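For illustration, here is a minimal sketch of the channel-first variant, assuming the image can be stored transposed (the names are hypothetical):
import numpy as np

# component x height x width: each channel is one large contiguous block.
image_chw = np.random.rand(3, 4000, 4000).astype("float32")
values = np.array([0.43, 0.44, 0.45], dtype=np.float32)

# Broadcasting a (3, 1, 1) array subtracts a single scalar per contiguous
# plane, which is SIMD friendly and avoids the tiny-inner-dimension issue.
image_chw -= values.reshape(3, 1, 1)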
Answered By - Jérôme Richard