Issue
NumPy and pandas are well known for their underlying acceleration, i.e. vectorization.
Conditional evaluation is a common kind of expression that occurs in code everywhere.
However, when I use the pandas DataFrame apply function in the intuitive way, the conditional evaluation seems very slow.
An example of my apply code looks like:
def condition_eval(df):
    x = df['x']
    a = df['a']
    b = df['b']
    if x <= a:
        d = round((x - a) / 0.01) - 1
        if d < -10:
            d = -10
    elif x >= b:
        d = round((x - b) / 0.01) + 1
        if d > 10:
            d = 10
    else:
        d = 0
    return d

df['eval_result'] = df.apply(condition_eval, axis=1)
Problems of this kind tend to have the following properties:
- the result for each row can be computed from that row's data alone, and always uses multiple columns.
- every row uses the same computation algorithm.
- the algorithm may contain complex conditional branches.
What's the best practice in NumPy/pandas for solving this kind of problem?
Some further thoughts.
In my opinion, one reason vectorization can be effective is that the underlying CPU has vector instructions (e.g. SIMD, Intel AVX), which rely on the fact that the computational instructions behave deterministically, i.e. whatever the input data is, the result is available after a fixed number of CPU cycles. Thus, parallelizing such operations is easy.
However, branch execution in a CPU is much more complicated. First of all, different branches of the same conditional have different execution paths, so they may take different numbers of CPU cycles. Modern CPUs also use tricks like branch prediction, which add more uncertainty.
So I wonder whether and how pandas tries to accelerate this kind of vectorized conditional evaluation, and whether there is a better practice for such computational workloads.
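To make the contrast concrete, here is a minimal sketch of the usual vectorized pattern: instead of branching per element, the candidate result is computed for every element and then selected element-wise with a boolean mask. The arrays are made-up toy values and branchy is a hypothetical helper used only for comparison; neither comes from the code above.

```python
import numpy as np

# Toy data (hypothetical values, not from the original post).
x = np.array([0.95, 1.00, 1.05, 1.20])
a = np.array([1.00, 1.00, 1.00, 1.00])

def branchy(xi, ai):
    # Scalar view: one branch decision per element.
    if xi <= ai:
        return xi - ai
    return 0.0

# Vectorized view: build a boolean mask, compute the "taken" branch
# for all elements, then blend with the fallback value via the mask.
mask = x <= a
blended = np.where(mask, x - a, 0.0)

# Reference result from the element-by-element loop.
looped = np.array([branchy(xi, ai) for xi, ai in zip(x, a)])
```

Note that x - a is evaluated for all four elements, including those where the mask is False; the extra arithmetic is the price paid for avoiding a per-element branch.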
Solution
This should be equivalent:
import pandas as pd
import numpy as np

def get_eval_result(df):
    # One condition per branch, checked in order; ge matches the question's x >= b.
    conditions = (
        df.x.le(df.a),
        df.x.ge(df.b),
    )
    # Each choice is computed for every row, then clamped at -10 or 10.
    choices = (
        np.where((d := df.x.sub(df.a).div(0.01).round().sub(1)).lt(-10), -10, d),
        np.where((d := df.x.sub(df.b).div(0.01).round().add(1)).gt(10), 10, d),
    )
    return np.select(conditions, choices, 0)

df = df.assign(eval_result=get_eval_result)
My answer basically calculates the result of every branch, and then uses np.where and np.select to specify which of those results should be used for each row. This could be optimized slightly, but since it uses purely vectorized functions, it should be far faster than using .apply.
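One possible simplification, sketched below, is to replace the inner np.where calls with Series.clip and then check the vectorized result against the row-wise version. This is only a sketch under assumptions: the sample data is made up, eval_vectorized is a hypothetical name, and the second condition uses >= to mirror the question's elif.

```python
import numpy as np
import pandas as pd

def condition_eval(row):
    # Row-wise reference with the same logic as the question's apply function.
    x, a, b = row["x"], row["a"], row["b"]
    if x <= a:
        d = max(round((x - a) / 0.01) - 1, -10)
    elif x >= b:
        d = min(round((x - b) / 0.01) + 1, 10)
    else:
        d = 0
    return d

def eval_vectorized(df):
    # Compute each branch for all rows and clamp it with clip.
    low = ((df["x"] - df["a"]) / 0.01).round().sub(1).clip(lower=-10)
    high = ((df["x"] - df["b"]) / 0.01).round().add(1).clip(upper=10)
    # Conditions are checked in order, like if / elif; the default is the else branch.
    conditions = [
        (df["x"] <= df["a"]).to_numpy(),
        (df["x"] >= df["b"]).to_numpy(),
    ]
    return np.select(conditions, [low.to_numpy(), high.to_numpy()], default=0)

# Made-up sample data that exercises all three branches and both clamps.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0.8, 1.2, 1000), "a": 0.95, "b": 1.05})

expected = df.apply(condition_eval, axis=1).to_numpy(dtype=float)
result = eval_vectorized(df)
```

The vectorized result comes back as a float array, so cast it with astype(int) if integer output is needed. Timing both versions on your own data (for example with timeit) is the most reliable way to see how large the speedup is.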
Answered By - BeRT2me