Issue
How can I have the same functions as shift() and cumsum() from pandas in pyspark?
import pandas as pd
temp = pd.DataFrame(data=[['a',0],['a',0],['a',0],['b',0],['b',1],['b',1],['c',1],['c',0],['c',0]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID').apply(lambda x: (x["X"].shift() != x["X"]).cumsum()).reset_index()['X']
print(temp)
My question is how to achieve in pyspark.
Solution
Pyspark have handle these type of queries with Windows utility functions. you can read its documentation here
Your pyspark code would be something like this :
from pyspark.sql import functions as F
from pyspark.sql Import Window as W
window = W.partitionBy('id').orderBy('time'?)
new_df = (
df
.withColumn('shifted', F.lag('X').over(window))
.withColumn('isEqualToPrev', (F.col('shifted') == F.col('X')).cast('int'))
.withColumn('cumsum', F.sum('isEqualToPrev').over(window))
)
Answered By - Dawyi
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.