Issue
Background: I'm just getting started with scikit-learn, and read this at the bottom of the page about joblib versus pickle:
"it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string"
I read this Q&A on pickle, "Common use-cases for pickle in Python", and wonder if the community here can share the differences between joblib and pickle. When should one use one over the other?
Solution
- joblib is usually significantly faster on large numpy arrays because it has special handling for the array buffers of the numpy data structure. To find out about the implementation details, you can have a look at the source code. It can also compress that data on the fly while pickling, using zlib or lz4.
- joblib also makes it possible to memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it, which in turn makes it possible to share memory between processes.
- If you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
- Since PEP 574 (pickle protocol 5) was merged in Python 3.8, it is now much more efficient (memory-wise and CPU-wise) to pickle large numpy arrays using the standard library. Large arrays in this context mean 4GB or more.
- But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory mapped mode with mmap_mode="r".
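The points above can be sketched roughly as follows. This is only a minimal illustration, not a benchmark: the payload, the file names, and the zlib level are made up for the example, and it assumes joblib and numpy are installed and Python is 3.8 or newer.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np

# Hypothetical payload: a dict with a nested numpy array.
data = {"weights": np.arange(100_000, dtype=np.float64), "name": "demo"}

# Illustrative file names inside a throwaway directory.
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "data.joblib")
zipped_path = os.path.join(tmp_dir, "data.joblib.z")

# Uncompressed dump: this is the form that can be memory mapped later.
joblib.dump(data, raw_path)

# Compressed dump: zlib (level 3 chosen arbitrarily) applied while writing.
joblib.dump(data, zipped_path, compress=("zlib", 3))

# Memory-mapped, read-only load: the nested array is backed by the file
# instead of being copied into memory.
mapped = joblib.load(raw_path, mmap_mode="r")

# A compressed file is loaded fully into memory.
restored = joblib.load(zipped_path)

# Standard-library pickle with protocol 5 (Python 3.8+), to bytes in memory.
blob = pickle.dumps(data, protocol=5)
roundtrip = pickle.loads(blob)
```

Note that the memory-mapped load is done on the uncompressed file, in line with the answer's remark that memory mapping applies to uncompressed joblib pickles.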
Answered By - ogrisel