Issue
I am trying to use text data as input to a linear regression model. I convert the text to vectors using the Universal Sentence Encoder from TensorFlow Hub as a pretrained model, but this gives me tf.Tensors, and now I am not able to split the data into training and testing sets for a scikit-learn linear regression model. My target feature is continuous.
The code below gives me embeddings (i.e. one 512-dimensional vector for each text in the text column of my pandas DataFrame):
import tensorflow_hub as hub
model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(model_url)
embeddings = model(train['excerpt'])
This is how the data looks:
id excerpt target
0 c12129c31 When the young people returned to the ballroom... -0.340259
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... -0.315372
2 b69ac6792 As Roger had predicted, the snow departed as q... -0.580118
3 dd1000b26 And outside before the palace a great garden w... -1.054013
4 37c1b32fb Once upon a time there were Three Bears who li... 0.247197
This is how the embeddings look:
tf.Tensor: shape=(2834, 512), dtype=float32, numpy=
array([[-0.06747025, 0.02054032, -0.01223458, ..., 0.03468879,
-0.04216784, 0.01212691],
[-0.01053216, 0.01346854, 0.01992477, ..., 0.03078162,
-0.0226634 , 0.04429556],
[-0.10778417, 0.01735378, 0.00803178, ..., 0.00345916,
0.00552441, -0.02448413],
...,
[ 0.0364146 , 0.02996029, -0.06757646, ..., -0.00335971,
-0.01381749, -0.08319554],
[ 0.0042374 , 0.02291174, -0.04473154, ..., -0.02009053,
-0.00428826, -0.06476445],
[-0.0141812 , 0.03879716, 0.03304171, ..., 0.06709221,
-0.05016331, 0.00868828]], dtype=float32)
Now I want to use these embeddings as input to a linear regression model (or any other regression model) in scikit-learn, but I am not able to split the data using train_test_split(). It gives me this error:
TypeError: Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices, got array([1434, 2653, 2620, ..., 749, 2114, 2389])
This is how I am splitting the data:
X_train, X_test, y_train, y_test = train_test_split(embeddings, train['target'], test_size=0.2, shuffle=True)
Solution
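You are passing a tensor to train_test_split. It shuffles the rows by indexing with an array of integer indices, which a tf.Tensor does not support (hence the TypeError). Instead, pass a NumPy array, like this: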
X_train, X_test, y_train, y_test = train_test_split(embeddings.numpy(), train['target'], test_size=0.2, shuffle=True)
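For completeness, here is a minimal end-to-end sketch of the split followed by a scikit-learn fit. It is only an illustration under some assumptions: the random arrays are stand-ins for the real embeddings and target (so the score it prints is meaningless), to_numpy is a hypothetical helper that is not part of TensorFlow or scikit-learn, and random_state=42 is an arbitrary choice added for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def to_numpy(x):
    """Hypothetical helper: eager tf.Tensor objects expose .numpy();
    anything else is passed through np.asarray."""
    return x.numpy() if hasattr(x, "numpy") else np.asarray(x)


# Stand-in data (assumption): random values with the shapes from the question.
# With the real data these would be model(train['excerpt']) and train['target'].
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2834, 512)).astype(np.float32)
target = rng.normal(size=2834)

X = to_numpy(embeddings)
y = to_numpy(target)

# train_test_split can index NumPy arrays with its shuffled integer indices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
print(X_train.shape, X_test.shape)
print("R^2 on held-out split:", reg.score(X_test, y_test))
```

Any other scikit-learn regressor could be swapped in for LinearRegression in the same way, since they all accept NumPy arrays.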
Answered By - Abhishek Prajapat