Issue
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

df = pd.read_csv('tech_clustring.csv')
# Note: df.Sector == ('Technology' or 'Communication Services') only matches
# 'Technology', because the `or` evaluates first; use .isin() for both sectors.
df_tech = df[df.Sector.isin(['Technology', 'Communication Services'])].copy()
tech_scaled = StandardScaler().fit_transform(df_tech.iloc[:, 2:6])
cluster = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
# The error occurs on this line
tech_cluster = cluster.fit(tech_scaled)
df_tech
Symbol Sector Open High Low Close
2470 AAOI Technology 2.15 2.160 2.120 2.13
2471 AAOI Technology 2.13 2.300 2.130 2.21
2472 AAOI Technology 2.20 2.240 2.130 2.13
2473 AAOI Technology 2.13 2.130 1.950 2.00
2474 AAOI Technology 1.96 1.980 1.870 1.95
... ... ... ... ... ... ...
955519 ZUO Technology 8.51 8.620 8.390 8.61
955520 ZUO Technology 8.58 8.680 8.350 8.42
955521 ZUO Technology 8.44 9.030 8.295 9.01
955522 ZUO Technology 9.15 9.331 8.820 8.89
955523 ZUO Technology 8.96 8.960 8.580 8.69
[134268 rows x 6 columns]
The shape of tech_scaled:
tech_scaled.shape
(134268, 4)
The size of tech_scaled in bytes:
tech_scaled.nbytes
4296576
The full output:
Traceback (most recent call last):
File "/home/ahmed/PycharmProjects/pythonProject/iter_10.py", line 73, in <module>
tech_cluster = cluster.fit(tech_scaled)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/base.py", line 1152, in wrapper
return fit_method(estimator, *args, **kwargs)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 979, in fit
return self._fit(X)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 1071, in _fit
out = memory.cache(tree_builder)(
File "/home/ahmed/.local/lib/python3.10/site-packages/joblib/memory.py", line 353, in __call__
return self.func(*args, **kwargs)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 187, in wrapper
return func(*args, **kwargs)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 295, in ward_tree
out = hierarchy.ward(X)
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 833, in ward
return linkage(y, method='ward', metric='euclidean')
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 1059, in linkage
y = distance.pdist(y, metric)
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/spatial/distance.py", line 2220, in pdist
return pdist_fn(X, out=out, **kwargs)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 67.2 GiB for an array with shape (9013880778,) and data type float64
Solution
This line in the error message is important for understanding why this is running out of memory:
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 1059, in linkage
y = distance.pdist(y, metric)
This is calling scipy.spatial.distance.pdist, which computes the distance between every pair of vectors in your dataset. To return the result, it must allocate a condensed distance array of N(N-1)/2 entries, where N is the number of vectors.
With N = 134268, that is 134268 × 134267 / 2 entries × 8 bytes per float64 ≈ 67.16 GiB, which matches the 67.2 GiB in the error message.
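The arithmetic can be checked directly (a quick sketch using the N from the question):

```python
# Size of the condensed distance array pdist would allocate for N = 134268 rows.
n = 134268
entries = n * (n - 1) // 2       # one entry per unordered pair of vectors
gib = entries * 8 / 2**30        # float64 = 8 bytes per entry
print(entries, round(gib, 2))    # 9013880778 entries, about 67.16 GiB
```

The entry count matches the array shape `(9013880778,)` reported in the traceback.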
Since AgglomerativeClustering computes this full pairwise distance matrix before any merging begins, it cannot handle a dataset of this size with the amount of memory you have. Because the allocation happens before any linkage steps run, no choice of linkage will avoid the failure.
I suggest either:
- Reducing the dataset size, for example by selecting 1,000 rows at random.
- Using a clustering algorithm that can handle a dataset of this size. (KMeans, for example, needs only O(N) memory and can cluster this dataset.)
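Both workarounds can be sketched as follows; synthetic data stands in for tech_scaled, with the row, column, and cluster counts taken from the question:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(134268, 4))    # stand-in for tech_scaled

# Option 1: hierarchical clustering on a random subsample of 1,000 rows,
# which shrinks the pdist allocation from ~67 GiB to ~4 MB.
idx = rng.choice(len(X), size=1000, replace=False)
sub_labels = AgglomerativeClustering(n_clusters=5, linkage='ward').fit_predict(X[idx])

# Option 2: KMeans on the full dataset; its memory use grows linearly with N.
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```

If you need hierarchical labels for the full dataset, one common pattern is to cluster the subsample and then assign each remaining row to the nearest subsample cluster.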
Answered By - Nick ODell