Issue
The Dask tutorials state: "For too large datasets (larger than a single machine's memory), the scikit-learn estimators may not be able to cope."
Am I correct in saying: if I have 9.42 GB of "available physical memory" showing up on the System page, then a "too large dataset" for scikit would be anything in excess of 9.42 GB of data?
Solution
Roughly, yes, though the exact upper limit on data size depends on several factors: which other objects are already in memory, how a given algorithm handles copies or subsets of the data, and so on. A rough rule of thumb for pandas is to keep a DataFrame to about 1/5 of the available RAM, and a similar ballpark is reasonable for scikit-learn (with the caveat that much depends on the exact code and algorithm used); see the sketch below for one way to check this.
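As a rough check (a minimal sketch, assuming psutil is installed; the file data.csv is hypothetical), you can compare a DataFrame's in-memory footprint against currently available RAM before handing it to scikit-learn:

```python
import pandas as pd
import psutil

df = pd.read_csv("data.csv")  # hypothetical file

df_bytes = df.memory_usage(deep=True).sum()          # actual footprint of the DataFrame
available_bytes = psutil.virtual_memory().available  # RAM currently available

# Rule of thumb from the answer: keep the DataFrame to roughly 1/5 of available RAM,
# leaving headroom for copies made during fitting.
if df_bytes > available_bytes / 5:
    print(f"DataFrame uses {df_bytes / 1e9:.2f} GB; consider Dask or out-of-core methods.")
```

The 1/5 factor is only a heuristic; estimators that copy or transform the data internally can need considerably more headroom.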
Note that some scikit-learn estimators can be fit incrementally on partial data via partial_fit; see the scikit-learn documentation on out-of-core (incremental) learning.
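For example, here is a minimal sketch of out-of-core fitting with partial_fit, assuming a hypothetical numeric CSV big.csv whose last column is the class label; SGDClassifier is used only as an illustration of an estimator that supports partial_fit:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])  # partial_fit needs the full set of classes up front
model = SGDClassifier(random_state=0)

# Stream the file in chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    X = chunk.iloc[:, :-1].to_numpy()
    y = chunk.iloc[:, -1].to_numpy()
    model.partial_fit(X, y, classes=classes)
```

Because only one chunk is held in memory at a time, the full dataset can exceed RAM as long as each chunk fits.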
Answered By - SultanOrazbayev