Issue
I am reading data from a CSV file and converting each row into a Python class object. But when I try to iterate over the resulting RDD of user-defined class objects, I get errors like:
_pickle.PicklingError: Can't pickle <class '__main__.User'>: attribute lookup User on __main__ failed
I'm adding the relevant part of the code here:
class User:
    def __init__(self, line):
        self.user_id = line[0]
        self.location = line[1]
        self.age = line[2]

def create_user(line):
    user = User(line)
    return user

def print_user(line):
    user = line
    print(user.user_id)
conf = (SparkConf().setMaster("local").setAppName("exercise_set_2").set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
users = sc.textFile("BX-Users.csv").map(lambda line: line.split(";"))
users_objs = users.map(lambda entry: create_user(entry))
users_objs.map(lambda entry: print_user(entry))
For the above code, I only get output like:
PythonRDD[93] at RDD at PythonRDD.scala:43
CSV data source URL (needs a zip extraction): HERE
UPDATE: Changing the code to include collect() results in the error again. I still have to try with pickle; I have never used it before. If anyone has a sample, I could do it easily.
users_objs = users.map(lambda entry: create_user(entry)).collect()
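For context, this error can be reproduced without Spark at all: pickle serializes an instance by storing a reference to its class's module and name, so the name must be importable wherever the object is deserialized. In the sketch below, deleting the module-level name stands in for a Spark worker process that never defined User (the sample user ID is made up):

```python
import pickle

class User:
    def __init__(self, line):
        self.user_id = line[0]

# pickle stores the instance's class by reference (module + name), so the
# name must be resolvable at unpickling time. Deleting the module-level
# name here mimics a worker that cannot look up __main__.User.
_Hidden = User
del User

try:
    pickle.dumps(_Hidden(["276725"]))
    raised = False
except pickle.PicklingError:
    raised = True
```

After this runs, `raised` is True: the dump fails with a PicklingError because the class can no longer be found by name.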
Solution
Okay, found an explanation. Storing the class in a separate file makes it picklable automatically, so I stored the User class inside user.py and added the following import to my code.
from user import User
Contents of user.py:
class User:
    def __init__(self, line):
        self.user_id = line[0]
        self.location = line[1]
        self.age = line[2]
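Why this works can be sketched without Spark: once User lives in an importable module, pickle can store instances by reference and any process with user.py on its path (such as a Spark worker) can re-import the class on load. The temporary directory below is only to keep the sketch self-contained, and the sample field values are made up:

```python
import os
import pickle
import sys
import tempfile
import textwrap

# Write the User class into its own module, mirroring the user.py fix.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "user.py"), "w") as f:
    f.write(textwrap.dedent("""\
        class User:
            def __init__(self, line):
                self.user_id = line[0]
                self.location = line[1]
                self.age = line[2]
    """))

sys.path.insert(0, module_dir)
from user import User

# Instances now pickle by reference to user.User and round-trip cleanly.
blob = pickle.dumps(User(["276725", "tyler, texas, usa", "NULL"]))
restored = pickle.loads(blob)
print(restored.user_id)  # 276725
```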
As mentioned in an earlier answer, I can use collect() (an RDD method) on the created User objects. So the following code prints all user IDs, as I wanted.
for user_obj in users.map(lambda entry: create_user(entry)).collect():
    print_user(user_obj)
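Note that the earlier bare users_objs.map(...) call printed nothing because Spark transformations are lazy: map only builds a new RDD, and work happens when an action such as collect() runs. Python generators behave the same way, which gives a Spark-free analogy (the sample lines below mimic the semicolon-separated "User-ID;Location;Age" shape and are not real BX-Users.csv rows):

```python
class User:
    def __init__(self, line):
        self.user_id = line[0]
        self.location = line[1]
        self.age = line[2]

# Two made-up lines in the same semicolon-separated shape as the CSV.
lines = [
    "276725;tyler, texas, usa;NULL",
    "276726;stockton, california, usa;18",
]

# Like an RDD transformation, a generator does no work when defined:
users = (User(line.split(";")) for line in lines)

# Consuming the generator plays the role of collect():
ids = [u.user_id for u in users]
print(ids)  # ['276725', '276726']
```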
Answered By - Mitty