Issue
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

df = pd.read_csv('tech_clustring.csv')
# Note: df.Sector == ('Technology' or 'Communication Services') only matches
# 'Technology', because the `or` evaluates first; use .isin() for both sectors.
df_tech = df[df.Sector.isin(['Technology', 'Communication Services'])].copy()
tech_scaled = StandardScaler().fit_transform(df_tech.iloc[:, 2:6])
cluster = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
# The error occurs on this line
tech_cluster = cluster.fit(tech_scaled)
df_tech
Symbol Sector Open High Low Close
2470 AAOI Technology 2.15 2.160 2.120 2.13
2471 AAOI Technology 2.13 2.300 2.130 2.21
2472 AAOI Technology 2.20 2.240 2.130 2.13
2473 AAOI Technology 2.13 2.130 1.950 2.00
2474 AAOI Technology 1.96 1.980 1.870 1.95
... ... ... ... ... ... ...
955519 ZUO Technology 8.51 8.620 8.390 8.61
955520 ZUO Technology 8.58 8.680 8.350 8.42
955521 ZUO Technology 8.44 9.030 8.295 9.01
955522 ZUO Technology 9.15 9.331 8.820 8.89
955523 ZUO Technology 8.96 8.960 8.580 8.69
[134268 rows x 6 columns]
The shape of tech_scaled:
tech_scaled.shape
(134268, 4)
The size of tech_scaled in bytes:
tech_scaled.nbytes
4296576
The full output:
Traceback (most recent call last):
File "/home/ahmed/PycharmProjects/pythonProject/iter_10.py", line 73, in <module>
tech_cluster = cluster.fit(tech_scaled)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/base.py", line 1152, in wrapper
return fit_method(estimator, *args, **kwargs)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 979, in fit
return self._fit(X)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 1071, in _fit
out = memory.cache(tree_builder)(
File "/home/ahmed/.local/lib/python3.10/site-packages/joblib/memory.py", line 353, in __call__
return self.func(*args, **kwargs)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 187, in wrapper
return func(*args, **kwargs)
File "/home/ahmed/.local/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 295, in ward_tree
out = hierarchy.ward(X)
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 833, in ward
return linkage(y, method='ward', metric='euclidean')
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 1059, in linkage
y = distance.pdist(y, metric)
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/spatial/distance.py", line 2220, in pdist
return pdist_fn(X, out=out, **kwargs)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 67.2 GiB for an array with shape (9013880778,) and data type float64
Solution
This line in the error message is important for understanding why this is running out of memory:
File "/home/ahmed/.local/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 1059, in linkage
y = distance.pdist(y, metric)
This is calling scipy.spatial.distance.pdist, which computes the distance between every pair of vectors in your dataset. To return the result, it must allocate a condensed distance array of N(N-1)/2 entries, where N is the number of vectors.
With N = 134268, that is 134268 × 134267 / 2 entries × 8 bytes per float64 ≈ 67.16 GiB, which matches the 67.2 GiB in the error message.
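The arithmetic can be checked directly (a quick sketch using the N from the question):

```python
# Size of the condensed distance array pdist would allocate for N = 134268 rows.
n = 134268
entries = n * (n - 1) // 2       # one entry per unordered pair of vectors
gib = entries * 8 / 2**30        # float64 = 8 bytes per entry
print(entries, round(gib, 2))    # 9013880778 entries, about 67.16 GiB
```

The entry count matches the array shape `(9013880778,)` reported in the traceback.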
Since AgglomerativeClustering computes this full pairwise distance matrix before any merging begins, it cannot handle a dataset of this size with the amount of memory you have. Because the allocation happens before any linkage steps run, no choice of linkage will avoid the failure.
I suggest either:
- Reducing the dataset size, for example by selecting 1,000 rows at random.
- Using a clustering algorithm that can handle a dataset of this size. (KMeans, for example, needs only O(N) memory and can cluster this dataset.)
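Both workarounds can be sketched as follows; synthetic data stands in for tech_scaled, with the row, column, and cluster counts taken from the question:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(134268, 4))    # stand-in for tech_scaled

# Option 1: hierarchical clustering on a random subsample of 1,000 rows,
# which shrinks the pdist allocation from ~67 GiB to ~4 MB.
idx = rng.choice(len(X), size=1000, replace=False)
sub_labels = AgglomerativeClustering(n_clusters=5, linkage='ward').fit_predict(X[idx])

# Option 2: KMeans on the full dataset; its memory use grows linearly with N.
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```

If you need hierarchical labels for the full dataset, one common pattern is to cluster the subsample and then assign each remaining row to the nearest subsample cluster.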
Answered By - Nick ODell