Sunday, December 5, 2021

[FIXED] How to use TensorFlow embeddings in scikit-learn models?

December 05, 2021 machine-learning, pandas, python-3.x, scikit-learn, tensorflow

Issue

I am trying to use text data as input to a linear regression model. I convert the text to vectors with the pretrained Universal Sentence Encoder from TensorFlow Hub, but this returns a tf.Tensor, and I am not able to split the data into training and test sets for a scikit-learn linear regression model. My target feature is continuous.

The code below gives me embeddings, i.e. one 512-dimensional vector for each text in the text column of my pandas DataFrame:

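import tensorflow_hub as hub

# Load the pretrained Universal Sentence Encoder from TensorFlow Hub
model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(model_url)
# Embed every excerpt; the result is a tf.Tensor of shape (n_rows, 512)
embeddings = model(train['excerpt'])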

This is how the data looks:

         id                  excerpt                                  target    
0   c12129c31   When the young people returned to the ballroom...   -0.340259   
1   85aa80a4c   All through dinner time, Mrs. Fayre was somewh...   -0.315372   
2   b69ac6792   As Roger had predicted, the snow departed as q...   -0.580118   
3   dd1000b26   And outside before the palace a great garden w...   -1.054013   
4   37c1b32fb   Once upon a time there were Three Bears who li...   0.247197

This is how the embeddings look:

<tf.Tensor: shape=(2834, 512), dtype=float32, numpy=
array([[-0.06747025,  0.02054032, -0.01223458, ...,  0.03468879,
        -0.04216784,  0.01212691],
       [-0.01053216,  0.01346854,  0.01992477, ...,  0.03078162,
        -0.0226634 ,  0.04429556],
       [-0.10778417,  0.01735378,  0.00803178, ...,  0.00345916,
         0.00552441, -0.02448413],
       ...,
       [ 0.0364146 ,  0.02996029, -0.06757646, ..., -0.00335971,
        -0.01381749, -0.08319554],
       [ 0.0042374 ,  0.02291174, -0.04473154, ..., -0.02009053,
        -0.00428826, -0.06476445],
       [-0.0141812 ,  0.03879716,  0.03304171, ...,  0.06709221,
        -0.05016331,  0.00868828]], dtype=float32)>

Now I want to use these embeddings as input to a linear regression model, or any other regression model, in scikit-learn. However, I am not able to split the data using train_test_split(); it raises this error: TypeError: Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices, got array([1434, 2653, 2620, ..., 749, 2114, 2389])

This is how I am splitting the data:

X_train, X_test, y_train, y_test = train_test_split(embeddings, train['target'], test_size=0.2, shuffle=True)

Solution

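You are passing a tf.Tensor to train_test_split. To shuffle and split the rows, scikit-learn indexes the input with an array of integer indices, which a tf.Tensor does not support; that is what the TypeError is reporting. Instead, pass a NumPy array, which you can get with embeddings.numpy():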

X_train, X_test, y_train, y_test = train_test_split(embeddings.numpy(), train['target'], test_size=0.2, shuffle=True)
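
For context, here is a minimal sketch of how the converted array might then be fed to a scikit-learn regressor. It is an illustration, not part of the original answer: the random `embeddings` and `target` arrays are stand-ins for the real Universal Sentence Encoder output and the `target` column, and `to_numpy` is a hypothetical helper that calls `.numpy()` when the object provides it (as a tf.Tensor does).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def to_numpy(embeddings):
    """Hypothetical helper: use .numpy() if available (e.g. tf.Tensor), else np.asarray."""
    if hasattr(embeddings, "numpy"):
        return embeddings.numpy()
    return np.asarray(embeddings)


# Stand-in data (assumption): random vectors with the same width (512) as the
# encoder output, and a random continuous target.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512)).astype(np.float32)
target = rng.normal(size=200)

# Convert first, then split; train_test_split can index a NumPy array freely.
X = to_numpy(embeddings)
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.2, shuffle=True, random_state=0
)

# Fit an ordinary least-squares model on the embedding vectors.
reg = LinearRegression().fit(X_train, y_train)
predictions = reg.predict(X_test)
print(X_train.shape, X_test.shape)  # (160, 512) (40, 512)
```

With the real data, `embeddings` would be the tensor returned by the model and `target` would be `train['target']`; any other scikit-learn regressor can be swapped in for `LinearRegression` in the same way.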

Answered By - Abhishek Prajapat

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
