Issue
Probably just a pipe dream, but is there a way to access PySpark's executors and send jobs to them manually from a Jupyter or Zeppelin notebook?
It probably goes against all PySpark convention as well, but the idea is to access a running EMR cluster's executors (workers) and send them Python scripts to run. Something like Python's multiprocessing, where the pool is instead the executors themselves, and you just feed it a map or a list of the Python script's path + arguments, or a function.
pyspark_executors = sc.getExecutors()  # hypothetical API -- no such method exists in PySpark

def square(number):
    return number ** 2

numbers = [1, 2, 3, 4, 5]

with pyspark_executors.Pool() as pool:  # hypothetical executor-backed pool
    results = pool.map(square, numbers)

print(results)
Solution
You could, if the Spark runtime can be configured with Spark plugins (which let you send messages from the driver to all executors, etc.), but the runtime on the executors would also have to have all of the relevant Python installed, not just the JVM.
The skill/effort it would take to get that working is high; since you are on Amazon, perhaps this answer is more useful: https://stackoverflow.com/a/71431264/1028537
That said, for the above example you could write it directly in PySpark/Spark, so you'd likely gain nothing by trying to leverage the underlying Spark stack in that way.
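For reference, here is a minimal sketch of that same computation written directly in PySpark, assuming an existing SparkContext named sc (as you would already have in a Jupyter or Zeppelin notebook on EMR). The mapping work is distributed across the executors for you:

def square(number):
    return number ** 2

numbers = [1, 2, 3, 4, 5]

# parallelize() ships the data to the cluster, map() runs square() on the
# executors, and collect() brings the results back to the driver.
results = sc.parallelize(numbers).map(square).collect()
print(results)  # [1, 4, 9, 16, 25]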
Answered By - Chris