Issue
I would like to share my plan and idea for applying asynchronous programming to machine learning. Here is my idea; consider the following code:
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

mylabel = LabelEncoder()
data = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    usecols=lambda x: x not in ["PassengerId", "Ticket"])
# print(data.columns)

st = time.time()

# Process the data
data['Name'] = data['Name'].map(lambda x: x.split(',')[1].split('.')[0])  # keep the title only
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Sex'] = mylabel.fit_transform(data['Sex'])
data['Cabin'] = data['Cabin'].fillna(data['Cabin'].value_counts().index[0])
data['Cabin'] = data['Cabin'].map(lambda x: x[0])  # keep the deck letter only
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].value_counts().index[0])
# print(data.isnull().any())
# print(data.dtypes)

# Convert the remaining categorical columns
cat_columns = data.select_dtypes(include=['object']).columns
print(cat_columns)
data[cat_columns] = data[cat_columns].apply(LabelEncoder().fit_transform)

# Separate the target and split into train and test sets
y = data['Survived'].values
X = data.drop('Survived', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

eclf1 = VotingClassifier(estimators=[('svc', SVC(probability=True)),
                                     ('LR', LogisticRegression(max_iter=5000)),
                                     ('cart', DecisionTreeClassifier())],
                         voting='soft')
eclf1.fit(X_train, y_train)
print(eclf1.score(X_test, y_test))

# get the end time
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
# print(data.head())
This is standard machine learning code applying the VotingClassifier algorithm to the Titanic data (which is, by the way, very small in size). My run took approximately:
Execution time: 0.10833930969238281 seconds
That is pretty small, right? So the code runs very fast, but one main reason is that the dataset is tiny. What about a dataset with millions or even billions of rows, or a size beyond human imagination? During this week I have started learning the asyncio library; here is the corresponding official Python documentation page: asyncio.
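To show where I am coming from, here is the basic pattern I have been practising (just a toy sketch; the task names are made up). As I understand it, asyncio interleaves I/O-bound tasks rather than running CPU-bound code in parallel, which is partly why I am asking:

import asyncio

async def work(name, delay):
    # Simulate an I/O-bound task; await yields control to other tasks
    await asyncio.sleep(delay)
    return f'{name} finished'

async def main():
    # Run both coroutines concurrently and collect their results
    results = await asyncio.gather(work('task-1', 1), work('task-2', 1))
    print(results)

asyncio.run(main())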
Now, if we look at the "process the data" section:
# Process the data
data['Name'] = data['Name'].map(lambda x: x.split(',')[1].split('.')[0])
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Sex'] = mylabel.fit_transform(data['Sex'])
data['Cabin'] = data['Cabin'].fillna(data['Cabin'].value_counts().index[0])
data['Cabin'] = data['Cabin'].map(lambda x: x[0])
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].value_counts().index[0])
Surely we can divide this into portions and run the code in parallel, right? We should not have to wait on any one column; the other columns could be processed at the same time. My question is: how can I do it?
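Something like the sketch below is what I imagine (untested; fill_age and fill_embarked are just illustrative wrappers around two of the column transforms above, and it operates on the data frame from the code earlier; I am aware that pandas holds the GIL for most of this work, and that concurrent assignment into one DataFrame is not guaranteed thread-safe):

import asyncio

async def process_columns(data):
    # Hypothetical per-column helpers wrapping the pandas expressions above
    def fill_age():
        data['Age'] = data['Age'].fillna(data['Age'].mean())

    def fill_embarked():
        data['Embarked'] = data['Embarked'].fillna(
            data['Embarked'].value_counts().index[0])

    # Run both transforms concurrently in worker threads (Python 3.9+)
    await asyncio.gather(
        asyncio.to_thread(fill_age),
        asyncio.to_thread(fill_embarked),
    )

asyncio.run(process_columns(data))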
I have another question:
eclf1 = VotingClassifier(estimators=[('svc', SVC(probability=True)),
                                     ('LR', LogisticRegression(max_iter=5000)),
                                     ('cart', DecisionTreeClassifier())],
                         voting='soft')
eclf1.fit(X_train, y_train)
Can I apply asyncio to the VotingClassifier? Or should I train those algorithms separately and then apply a mode function to the predictions? If you can, please give just a small example so I can digest those details.
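For the second option, this is roughly what I have in mind (again an untested sketch; the majority vote plays the role of the mode function, and X_train, y_train, X_test, y_test are the splits from the code above; whether threads give a real speedup depends on whether the estimators release the GIL while fitting):

import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

models = [SVC(probability=True),
          LogisticRegression(max_iter=5000),
          DecisionTreeClassifier()]

# Fit each model in its own worker thread (fit returns the model itself)
with ThreadPoolExecutor() as executor:
    fitted = list(executor.map(lambda m: m.fit(X_train, y_train), models))

# Stack per-model class predictions: shape (n_models, n_samples)
preds = np.stack([m.predict(X_test) for m in fitted])

# Majority vote (the "mode") across models for each test sample
majority = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(),
                               axis=0, arr=preds)
print('majority-vote accuracy:', (majority == y_test).mean())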
Solution
It goes something like this, but I haven't tested the code (sorry). If you really want it async, you will need to:
- have a beefy computer (which I do not), and
- merge the parameters of multiple classifiers, each trained asynchronously.
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from concurrent.futures import ThreadPoolExecutor

def split_dataframe(df, labels, chunk_size=10000):
    # Split a DataFrame and its labels into aligned chunks
    chunks = []
    num_chunks = (len(df) + chunk_size - 1) // chunk_size
    for i in range(num_chunks):
        chunks.append((df.iloc[i * chunk_size:(i + 1) * chunk_size],
                       labels.iloc[i * chunk_size:(i + 1) * chunk_size]))
    return chunks

def fit_classifier_async(classifier, train_data, train_labels):
    classifier.fit(train_data, train_labels)
    return classifier

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='target')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Example data (replace this with your actual data); here it is simply
# the Iris training split with a fresh index
example_df = X_train.reset_index(drop=True)
example_labels = y_train.reset_index(drop=True)

# Create individual classifiers
svc_classifier = SVC(probability=True)
lr_classifier = LogisticRegression(max_iter=5000)
cart_classifier = DecisionTreeClassifier()

# Create the VotingClassifier
eclf1 = VotingClassifier(estimators=[
    ('svc', svc_classifier),
    ('LR', lr_classifier),
    ('cart', cart_classifier)
], voting='soft')

# Split the dataset into (data, labels) chunks; with this tiny dataset
# everything fits into a single chunk -- lower chunk_size to see several
chunks = split_dataframe(example_df, example_labels)

# Asynchronously fit a fresh copy of the classifier on each chunk
# (cloning avoids several threads fitting the same object at once)
with ThreadPoolExecutor() as executor:  # set max_workers here if needed
    fitted_classifiers = list(executor.map(
        lambda chunk: fit_classifier_async(clone(eclf1), chunk[0], chunk[1]),
        chunks
    ))

# Combine the fitted copies into a single ensemble model by grafting
# the fitted estimators from the remaining copies onto the first one
combined = fitted_classifiers[0]
for fitted_classifier in fitted_classifiers[1:]:
    combined.estimators_ += fitted_classifier.estimators_

# Make predictions on the test set
predictions = combined.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')
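As a side note: if the goal is only to fit the three base estimators in parallel (rather than to chunk the data), scikit-learn already supports that out of the box, since VotingClassifier takes an n_jobs parameter that parallelizes the per-estimator fits via joblib:

# Fit the three base estimators in parallel across all available cores
eclf_parallel = VotingClassifier(estimators=[
    ('svc', SVC(probability=True)),
    ('LR', LogisticRegression(max_iter=5000)),
    ('cart', DecisionTreeClassifier())
], voting='soft', n_jobs=-1)  # n_jobs=-1 means "use all cores"

eclf_parallel.fit(X_train, y_train)
print(eclf_parallel.score(X_test, y_test))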
Answered By - Sy Ker