Issue
The scikit-learn documentation says:
If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1.
However, why would df(d, t) = 0? If a term doesn't occur in any text, the dictionary wouldn't contain the term in the first place, would it?
Solution
This feature is useful in TfidfVectorizer. According to the documentation, this class can be given a predefined vocabulary. A word from that vocabulary that never appears in the training data has df = 0, so without smoothing its idf would involve a division by zero. If such a word then occurs in the test data, smooth_idf allows it to be processed successfully.
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['apple mango', 'mango banana']
test_texts = ['apple banana', 'mango orange']
# 'orange' is in the vocabulary but never appears in train_texts, so its df is 0
vocab = ['apple', 'mango', 'banana', 'orange']

# Fit both vectorizers on the same training texts with the same fixed vocabulary
vectorizer1 = TfidfVectorizer(smooth_idf=True, vocabulary=vocab).fit(train_texts)
vectorizer2 = TfidfVectorizer(smooth_idf=False, vocabulary=vocab).fit(train_texts)

print(vectorizer1.transform(test_texts).todense())  # works okay
print(vectorizer2.transform(test_texts).todense())  # raises a ValueError
Output:
[[ 0.70710678 0. 0.70710678 0. ]
[ 0. 0.43016528 0. 0.90275015]]
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
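To see where the infinity comes from, here is a rough pure-Python sketch of the arithmetic for the same example. It is not scikit-learn's actual implementation: the helper names (document_frequency, idf, tfidf_row) are made up for illustration, tokenization is a plain whitespace split, and the unsmoothed formula is assumed to be log(n / df) + 1, i.e. the quoted formula without the added ones.

```python
import math

train_texts = ['apple mango', 'mango banana']
vocab = ['apple', 'mango', 'banana', 'orange']

def document_frequency(texts, vocabulary):
    # Number of texts containing each vocabulary term
    # (whitespace split; a simplification of the real tokenizer).
    token_sets = [set(text.split()) for text in texts]
    return {term: sum(term in tokens for tokens in token_sets)
            for term in vocabulary}

def idf(n, df, smooth):
    # Smoothed: the formula quoted in the question.
    if smooth:
        return math.log((1 + n) / (1 + df)) + 1
    # Unsmoothed (assumed form): n / df is undefined for df == 0,
    # represented here as infinity.
    if df == 0:
        return math.inf
    return math.log(n / df) + 1

def tfidf_row(text, idf_by_term, vocabulary):
    # Term count times idf, followed by L2 normalization of the row.
    tokens = text.split()
    weights = [tokens.count(term) * idf_by_term[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

n = len(train_texts)
df = document_frequency(train_texts, vocab)
smoothed = {t: idf(n, df[t], smooth=True) for t in vocab}
unsmoothed = {t: idf(n, df[t], smooth=False) for t in vocab}

print(df)                    # 'orange' has a document frequency of 0
print(smoothed['orange'])    # finite: log(3 / 1) + 1
print(unsmoothed['orange'])  # inf
print(tfidf_row('mango orange', smoothed, vocab))
```

With smoothing, the row computed for 'mango orange' comes out at roughly 0.4302 for mango and 0.9028 for orange, which agrees with the second row of the output above. Without smoothing, the idf of 'orange' is infinite, which is the kind of value the ValueError complains about.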
Answered By - David Dale