Issue
I have a series of tweets that I've converted to tokens. Among them are the following:
- geraldkutney happen realize happen conveniently rename catch yet emergency post fact come government
- michaelemann burn happen chickenshittily change get make stupid argument good deed go unpunished
- rickcaughell thomas_6278 coderedearth jrockstrom jordanbpeterson fact exxon predict good accuracy would happen temperature today back 1970s 80 prof model accurate
Note that the first two tweets have 13 tokens each and the third has 20; 'happen' appears twice in the first tweet and once in each of the others.
Using the following code, I have created TF-IDF values:
import sklearn.feature_extraction.text as sk_text

vectoriser = sk_text.TfidfVectorizer()
vectoriser.fit(twit_api['text_clean'])
twit_vec = vectoriser.transform(twit_api['text_clean'])
twit_vec.columns = vectoriser.get_feature_names_out()
tokens_enc = twit_vec.toarray()
When I look at the tf-idf values for the word 'happen' in each of these, I get the values 0.41124561276932653, 0.18906439908376366, and 0.1523571031416618.
This is with the following code, varying the index into row_nos for each tweet:
print(tokens_enc[row_nos[0], vectoriser.vocabulary_['happen']])
These values don't appear consistent to me. I would expect the first value to be double the second, since the term frequency of 'happen' is exactly double (two occurrences in 13 tokens versus one in 13), but that doesn't appear to be the case.
Have I misunderstood something?
Solution
TfidfVectorizer accepts many parameters, and the ones you don't specify take their default values. In your case, the three parameters affecting you are (from the scikit-learn documentation):
# ---------------------------
norm : {'l1', 'l2'} or None, default='l2'
Each output row will have unit norm, either:
‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
‘l1’: Sum of absolute values of vector elements is 1. See normalize.
None: No normalization.
# ---------------------------
use_idf : bool, default=True
Enable inverse-document-frequency reweighting. If False, idf(t) = 1.
# ---------------------------
smooth_idf : bool, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
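Concretely, with these defaults the value for each term is its raw count in the document, multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, after which each row is divided by its own Euclidean (l2) length. Because that length depends on every term in the row, the 2:1 count ratio for 'happen' does not survive normalization even though both tweets have 13 tokens. Here is a minimal sketch that replicates the default computation by hand (the two-document corpus docs is made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["happen realize happen fact", "burn happen change fact"]  # hypothetical toy corpus

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts (the tf part)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                             # documents containing each term

idf = np.log((1 + n_docs) / (1 + df)) + 1                 # smooth_idf=True formula
tfidf = counts * idf                                      # use_idf=True reweighting
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # norm='l2'

print(np.allclose(tfidf, TfidfVectorizer().fit_transform(docs).toarray()))  # True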
So there is smoothing and normalization happening behind the scenes, but you can turn all of it off. Try something like the following:
vectoriser = sk_text.TfidfVectorizer(norm=None, use_idf=False, smooth_idf=False)
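With use_idf=False and norm=None (and the default sublinear_tf=False), the stored values reduce to raw term counts, so, assuming row_nos indexes the three tweets above, you should see exactly 2.0 for the first tweet and 1.0 for the other two:

tokens_enc = vectoriser.fit_transform(twit_api['text_clean']).toarray()
for row in row_nos:
    print(tokens_enc[row, vectoriser.vocabulary_['happen']])  # expect 2.0, 1.0, 1.0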
Without your dataset I cannot test this, but referring back to the documentation for these parameters should help with any further issues.
Answered By - Jesse Sealand