What is TFIDF?

Unlike traditional supervised Machine Learning (ML) problems built on numeric features, text data is hard to work with because computers do not understand text directly. To store text, we have come up with standards for representing characters as numbers, such as the ASCII, Latin-1, and UTF-8 encodings. Similarly, our ML models need text to be encoded as numbers before they can use it. Converting text into such a numeric representation is part of the preprocessing step of the ML pipeline.

Let's say we have a list of sentences:

text_data = ['A big Whale.',
             'The quick brown fox.',
             'The Giraffe that cried.']

We can lowercase each word, remove any punctuation, and split on whitespace. This tokenizes each sentence into words, so we can count how many times each word appears in each sentence.

import pandas as pd

def create_doc_term_df(corpus, vectorizer):
    # Learn the vocabulary and compute one row of term values per sentence.
    doc_term_matrix = vectorizer.fit_transform(corpus)
    # Convert the sparse matrix to a DataFrame with one column per term.
    df = pd.DataFrame(doc_term_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    df.index.name = 'Sentence'
    return df

Let's run the function and get the output:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print(create_doc_term_df(text_data, vectorizer).to_string())
'''Output: 
          big  brown  cried  fox  giraffe  quick  that  the  whale
Sentence                                                          
0           1      0      0    0        0      0     0    0      1
1           0      1      0    1        0      1     0    1      0
2           0      0      1    0        1      0     1    1      0
'''
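
To make the counting step concrete, here is a minimal hand-rolled sketch of what the table above represents. It is not how `CountVectorizer` is implemented; the `tokenize` helper and its regular expression are assumptions chosen to mimic the vectorizer's default behaviour of lowercasing and keeping only tokens of two or more word characters, which is why the single-letter word "A" has no column.

```python
import re
from collections import Counter

text_data = ['A big Whale.',
             'The quick brown fox.',
             'The Giraffe that cried.']

def tokenize(sentence):
    """Lowercase the sentence and keep runs of 2+ word characters (hypothetical helper)."""
    return re.findall(r"\b\w\w+\b", sentence.lower())

# One Counter per sentence: word -> number of occurrences in that sentence.
counts = [Counter(tokenize(sentence)) for sentence in text_data]

# The vocabulary is every distinct token, sorted alphabetically.
vocabulary = sorted(set(word for c in counts for word in c))

print(vocabulary)
for row in counts:
    # A Counter returns 0 for words that are missing from the sentence.
    print([row[word] for word in vocabulary])
```

Each printed row lines up with one row of the document-term table.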

Term frequency (tf) gives higher relevance to words that appear more often within a sentence. The formula below calculates this frequency.

$$\text{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k=1}^{K} n_{k,j}}$$

where $n_{i,j}$ is the number of times word $i$ appears in sentence $j$, and the denominator sums the counts of all $K$ distinct words in sentence $j$.
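
As a rough illustration of the formula, the sketch below computes the term frequency of each word in one sentence from the toy example. The function name `term_frequency` is made up for this example, and the sentence is assumed to be already tokenized.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {word: count / total number of tokens} for one tokenized sentence."""
    counts = Counter(tokens)
    total = sum(counts.values())  # the denominator: counts of all words in the sentence
    return {word: n / total for word, n in counts.items()}

tf = term_frequency(['the', 'quick', 'brown', 'fox'])
print(tf)  # every word appears once out of four tokens, so each tf is 0.25
```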

IDF

Now that we have calculated the frequency of each word within a sentence, we must also calculate how many sentences each word appears in. This lets us compare the relevance of each word not just against other words in its sentence but across all sentences. This step is important in our pipeline because it normalizes our data across all sentences.

Think of a spam email classifier. The phrase "urgent! click here!" would have a high term frequency within a spam email; however, across all the emails you have received, these words appear in only a few. The purpose of inverse document frequency is to boost words that are rare across documents and to down-weight words that appear in almost every document, such as "the".

The formula for calculating $\text{idf}$ is

$$\text{idf}(w) = \log\left(\frac{N}{\text{df}_w}\right)$$

where $\text{df}_w$ is the number of documents that contain the word $w$, e.g. $\text{df}_{\text{big}} = 1$ and $\text{df}_{\text{the}} = 2$. Also, $N$ is the number of documents (here, sentences) in our list. For our toy example, $N=3$.
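
A small sketch of this calculation on the toy sentences, using the plain (unsmoothed) formula with the natural logarithm and taking $N$ as the number of sentences. The helper name `inverse_document_frequency` and the pre-tokenized `documents` list are assumptions made for illustration.

```python
import math

# The toy sentences, already lowercased and tokenized.
documents = [['big', 'whale'],
             ['the', 'quick', 'brown', 'fox'],
             ['the', 'giraffe', 'that', 'cried']]

def inverse_document_frequency(word, documents):
    """log(N / df), where df is the number of documents containing the word.

    Assumes the word appears in at least one document, otherwise df is zero.
    """
    n_documents = len(documents)
    df = sum(1 for doc in documents if word in doc)
    return math.log(n_documents / df)

print(inverse_document_frequency('big', documents))  # log(3 / 1)
print(inverse_document_frequency('the', documents))  # log(3 / 2)
```

Because "the" appears in two of the three sentences, it receives a lower idf than "big", which appears in only one.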

Putting it all Together, TFIDF

The last step is to get the $\text{tfidf}$. We simply multiply the $\text{tf}$ and the $\text{idf}$.

$$\text{tfidf}_{i,j}(\text{word}) = \text{tf}_{i,j} \cdot \text{idf}(\text{word})$$

scikit-learn provides TfidfVectorizer for this:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
print(create_doc_term_df(text_data, vectorizer).to_string())
'''output
               big     brown     cried       fox   giraffe     quick      that      the     whale
Sentence                                                                                         
0         0.707107  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000  0.707107
1         0.000000  0.528635  0.000000  0.528635  0.000000  0.528635  0.000000  0.40204  0.000000
2         0.000000  0.000000  0.528635  0.000000  0.528635  0.000000  0.528635  0.40204  0.000000
'''
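
The numbers above do not match a direct application of tf times idf, because `TfidfVectorizer` with its default settings uses raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below reproduces the second row (sentence 1) by hand under those assumptions; the variable names are illustrative only.

```python
import math

n_documents = 3
# Raw counts and document frequencies for "The quick brown fox."
counts = {'brown': 1, 'fox': 1, 'quick': 1, 'the': 1}
document_frequency = {'brown': 1, 'fox': 1, 'quick': 1, 'the': 2}

# Smoothed idf: ln((1 + N) / (1 + df)) + 1
idf = {w: math.log((1 + n_documents) / (1 + df)) + 1
       for w, df in document_frequency.items()}

# Unnormalized tf-idf, then divide by the row's Euclidean (L2) norm.
raw = {w: counts[w] * idf[w] for w in counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

print({w: round(v, 6) for w, v in tfidf.items()})
```

The word "the" ends up with a smaller weight than the other three words because it also appears in another sentence.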