Issue
I have a series of tweets that I've converted to tokens. Among them are the following:
- geraldkutney happen realize happen conveniently rename catch yet emergency post fact come government
- michaelemann burn happen chickenshittily change get make stupid argument good deed go unpunished
- rickcaughell thomas_6278 coderedearth jrockstrom jordanbpeterson fact exxon predict good accuracy would happen temperature today back 1970s 80 prof model accurate
Note that the first two tweets have 13 tokens each and the third has 20; 'happen' appears twice in the first tweet and once in each of the others.
Using the following code, I have created TF-IDF values:
import sklearn.feature_extraction.text as sk_text

vectoriser = sk_text.TfidfVectorizer()
vectoriser.fit(twit_api['text_clean'])
twit_vec = vectoriser.transform(twit_api['text_clean'])
twit_vec.columns = vectoriser.get_feature_names_out()
tokens_enc = twit_vec.toarray()
When I look at the tf-idf values for the word 'happen' in each of these, I get the values 0.41124561276932653, 0.18906439908376366, and 0.1523571031416618.
This is with the following code, varying the index into row_nos for each tweet:
print(tokens_enc[row_nos[0], vectoriser.vocabulary_['happen']])
These values don't appear consistent to me. I would expect the first value to be double the second, since the term frequency of 'happen' is exactly double (two occurrences in 13 tokens versus one in 13), but that doesn't appear to be the case.
Have I misunderstood something?
Solution
TfidfVectorizer accepts many parameters, and the ones you don't specify take their default values. In your case, the three parameters affecting you are (from the scikit-learn documentation):
# ---------------------------
norm : {'l1', 'l2'} or None, default='l2'
Each output row will have unit norm, either:
‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
‘l1’: Sum of absolute values of vector elements is 1. See normalize.
None: No normalization.
# ---------------------------
use_idf : bool, default=True
Enable inverse-document-frequency reweighting. If False, idf(t) = 1.
# ---------------------------
smooth_idf : bool, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
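Concretely, with these defaults the value for each term is its raw count in the document, multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, after which each row is divided by its own Euclidean (l2) length. Because that length depends on every term in the row, the 2:1 count ratio for 'happen' does not survive normalization even though both tweets have 13 tokens. Here is a minimal sketch that replicates the default computation by hand (the two-document corpus docs is made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["happen realize happen fact", "burn happen change fact"]  # hypothetical toy corpus

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts (the tf part)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                             # documents containing each term

idf = np.log((1 + n_docs) / (1 + df)) + 1                 # smooth_idf=True formula
tfidf = counts * idf                                      # use_idf=True reweighting
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # norm='l2'

print(np.allclose(tfidf, TfidfVectorizer().fit_transform(docs).toarray()))  # True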
So there is smoothing and normalization happening behind the scenes, but you can turn all of it off. Try something like the following:
vectoriser = sk_text.TfidfVectorizer(norm=None, use_idf=False, smooth_idf=False)
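With use_idf=False and norm=None (and the default sublinear_tf=False), the stored values reduce to raw term counts, so, assuming row_nos indexes the three tweets above, you should see exactly 2.0 for the first tweet and 1.0 for the other two:

tokens_enc = vectoriser.fit_transform(twit_api['text_clean']).toarray()
for row in row_nos:
    print(tokens_enc[row, vectoriser.vocabulary_['happen']])  # expect 2.0, 1.0, 1.0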
Without your dataset I cannot test this, but referring back to the documentation for these parameters should help with any further issues.
Answered By - Jesse Sealand