Issue
Given a numpy array of shape 4000x4000x3, which is an image with 3 channels, I want to subtract a different value from each channel. I can do it as follows:
Implementation 1
values = [0.43, 0.44, 0.45]
image -= values
Or,
Implementation 2
values = [0.43, 0.44, 0.45]
for i in range(3):
    image[..., i] -= values[i]
Surprisingly, the latter solution is 20x faster than the former when operating on an image of this size, but I don't clearly understand why. Could you please explain what makes the second implementation faster? Thanks.
Simple script for confirmation:
import time
import numpy as np
image = np.random.rand(4000, 4000, 3).astype("float32")
values = [0.43, 0.44, 0.45]
st = time.time()
for i in range(3):
    image[..., i] -= values[i]
et = time.time()
print("Implementation 2", et - st)
st = time.time()
image -= values
et = time.time()
print("Implementation 1", et - st)
Output
Implementation 2 0.030953645706176758
Implementation 1 0.8593623638153076
Solution
TL;DR: the performance issue is caused by multiple factors:
- Numpy iterators introduce a big overhead on small broadcasted arrays;
- values is implicitly converted to a np.float64 array, causing a slow np.float64-based subtraction;
- the memory layout is sub-optimal in all implementations.
Numpy internal iterators
One issue comes from the Numpy implementation. Numpy uses inefficient code for handling small arrays that are broadcasted. This is due to an internal concept called Numpy iterators, which introduce a significant overhead when arrays are small because of the repeated iterations over the same small array. This concept is necessary in Numpy to make the computing code more generic and to support many features, especially broadcasting.
On top of that, Numpy cannot benefit from using SIMD instructions in this case because the array is too small to even fit in a SIMD register of mainstream CPUs.
A simple way to prove this hypothesis is to flatten the array and analyse the performance of subtracting arrays of different sizes containing the same repeated values:
view = image.reshape(-1, 3)
%time view -= np.tile(values, 1)
view = image.reshape(-1, 6)
%time view -= np.tile(values, 2)
view = image.reshape(-1, 12)
%time view -= np.tile(values, 4)
view = image.reshape(-1, 24)
%time view -= np.tile(values, 8)
view = image.reshape(-1, 384)
%time view -= np.tile(values, 128)
view = image.reshape(-1, 3*4000)
%time view -= np.tile(values, 4000)
# Pathological case: huge generated array
view = image.reshape(-1, 3*4000*1000)
%time view -= np.tile(values, 4000*1000)
Here are results on my machine with an Intel Xeon W-2255 CPU using Numpy 1.20.3:
Implementation 2 0.06000018119812012
Implementation 1 0.1734762191772461
Wall time: 175 ms
Wall time: 137 ms
Wall time: 106 ms
Wall time: 95 ms
Wall time: 67 ms <-----
Wall time: 69 ms
Wall time: 108 ms
One can see that the bigger the repeated array, the better, up to a point. However, when the subtracted array is too big, generating the repeated array using np.tile takes too much time. Indeed, the generated array is then so big that it does not fit in the (fast) caches of the CPU but only in the (slow) DRAM. Since the operation is rather memory bound, it can be more than twice slower.
One can also see that the fastest timing of the flattened variants is close to "Implementation 2". It is a bit slower, probably still due to the Numpy internal iterators.
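As a side check, one can isolate the iterator overhead from the dtype issue discussed in the next section by comparing a float32 broadcasted subtraction against a plain scalar subtraction on the same data. This is a minimal sketch (timings vary by machine; the helper names are illustrative):
import timeit
import numpy as np

image = np.random.rand(4000, 4000, 3).astype("float32")
vals32 = np.array([0.43, 0.44, 0.45], dtype=np.float32)

# Broadcasting a tiny 3-item array over 16 million pixels: the small
# inner dimension forces Numpy through its slow generic iterator path.
t_broadcast = timeit.timeit(lambda: image - vals32, number=5) / 5

# Subtracting a single scalar: no small broadcasted array is involved,
# so the fast contiguous (SIMD-friendly) loop is used.
t_scalar = timeit.timeit(lambda: image - np.float32(0.43), number=5) / 5

print(f"broadcast 3-item array: {t_broadcast:.3f} s")
print(f"scalar subtraction:     {t_scalar:.3f} s")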
Datatypes and implicit conversions
Another big issue is that values is a list of float objects. It is implicitly converted to a 1D array of np.float64 values, while image is of type np.float32. Due to the type promotion rules in Numpy, the operation is done using the np.float64 type, which is significantly slower and not needed here anyway.
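The implicit promotion can be checked directly. Here is a small sketch (expected dtypes shown as comments):
import numpy as np

image = np.zeros((2, 2, 3), dtype=np.float32)
values = [0.43, 0.44, 0.45]

# A Python list of floats becomes a float64 array...
print(np.asarray(values).dtype)                   # float64

# ...and float32 combined with a float64 array promotes to float64,
# so the subtraction is computed in (slower) double precision.
print(np.result_type(image, np.asarray(values)))  # float64
print((image - values).dtype)                     # float64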
We can fix that by creating the Numpy array explicitly with the right data-type: values = np.array(values, dtype=np.float32). Here are the results of the same (above) benchmark code but now using a np.float32 array:
Wall time: 75 ms
Wall time: 46 ms
Wall time: 39 ms
Wall time: 31 ms
Wall time: 23 ms
Wall time: 18 ms <-----
We can see that using only np.float32-based operations is much faster.
More explanations
Could you please explain what makes the second implementation faster?
The second implementation operates on np.float32 values and does not suffer from the iterator overhead, since there is no broadcasting done internally. This is because the right-hand side is a unique float scalar. Numpy directly converts the value to a np.float32 internally in this case for the sake of performance, since the left-hand side is a np.float32 array. You can check that by manually converting the value to np.float32. Note that Numpy apparently converts the value to a np.float32 internally even when it is a np.float64 one.
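This can be verified with a quick dtype check. Note that the np.float64 scalar case relies on value-based casting, which Numpy 1.x used at the time of writing; Numpy 2.x (NEP 50) promotes it to float64 instead:
import numpy as np

image = np.zeros(4, dtype=np.float32)

# A plain Python float right-hand side keeps the computation in float32.
print((image - 0.43).dtype)              # float32

# In Numpy 1.x, value-based casting keeps float32 even for a np.float64
# scalar; Numpy 2.x (NEP 50) promotes the result to float64 instead.
print((image - np.float64(0.43)).dtype)  # float32 on 1.x, float64 on 2.x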
However, this implementation is not the most efficient because it walks through the whole image array 3 times (so the whole array needs to be read from and written back to the DRAM 3 times).
Final optimized code
Put shortly, you can write:
st = time.time()
image -= np.tile(np.array(values, dtype=np.float32), image.shape[1]).reshape(-1, 3)
et = time.time()
print("Implementation 3", et - st) # 0.018 seconds
Notes on memory layouts
While repeating values image.shape[1] times is guaranteed to work and should generally be quite fast, the best solution is to avoid the height x width x components layout, which tends to be inefficient (fundamentally not SIMD friendly), especially with Numpy. Instead, it is better to use the component x height x width layout (or possibly height x component x width, which tends to be a good trade-off between the two). See this post for example. Note that, for the same reason, it is much better to use a (2,N)-shaped array than the common (N,2) one.
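For illustration, here is a minimal sketch of the channel-first variant, assuming the image can be stored transposed (the names are hypothetical):
import numpy as np

# component x height x width: each channel is one large contiguous block.
image_chw = np.random.rand(3, 4000, 4000).astype("float32")
values = np.array([0.43, 0.44, 0.45], dtype=np.float32)

# Broadcasting a (3, 1, 1) array subtracts a single scalar per contiguous
# plane, which is SIMD friendly and avoids the tiny-inner-dimension issue.
image_chw -= values.reshape(3, 1, 1)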
Answered By - Jérôme Richard