Unlike with traditional supervised Machine Learning (ML) problems, the challenge with text data is that computers do not know how to deal with it directly. Just as we have agreed on standards for representing characters as numbers, such as ASCII, Latin-1, or UTF-8 encodings, our ML models need the text converted into a numerical representation they can work with. Dealing with text in this way is referred to as the preprocessing step of the ML pipeline.
Let's say we have a list of sentences:
```python
text_data = ['A big Whale.',
             'The quick brown fox.',
             'The Giraffe that cried.']
```
We can lowercase each word, remove any punctuation, and split on whitespace. This tokenizes each sentence into individual words so that we can count how many times each word appears in each sentence.
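To make the idea concrete, here is a minimal sketch of that manual preprocessing (the simple_tokenize helper is only for illustration and is not used again later):

```python
import string

def simple_tokenize(sentence):
    # Lowercase, strip punctuation, and split on whitespace
    cleaned = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

for sentence in text_data:
    print(simple_tokenize(sentence))
# ['a', 'big', 'whale']
# ['the', 'quick', 'brown', 'fox']
# ['the', 'giraffe', 'that', 'cried']
```

In practice, scikit-learn's CountVectorizer handles the tokenization and counting for us (note that its default tokenizer keeps only tokens of two or more characters, so the word 'a' will be dropped from the vocabulary). The helper below wraps it and returns the document-term matrix as a DataFrame: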
```python
import pandas as pd

def create_doc_term_df(corpus, vectorizer):
    # Fit the vectorizer on the corpus and build the document-term matrix
    doc_term_matrix = vectorizer.fit_transform(corpus)
    # One row per sentence, one column per word in the vocabulary
    df = pd.DataFrame(doc_term_matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
    df.index.name = 'Sentence'
    return df
```
Let's run the function and look at the output:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
print(create_doc_term_df(text_data, vectorizer).to_string())
'''Output:
          big  brown  cried  fox  giraffe  quick  that  the  whale
Sentence
0           1      0      0    0        0      0     0    0      1
1           0      1      0    1        0      1     0    1      0
2           0      0      1    0        1      0     1    1      0
'''
```
Higher weight is given to words that appear more often within a sentence. The term frequency (tf) formula below captures this:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{i,k}}$$

where $i$ refers to the index of the sentence, $j$ refers to the index of the word, and $n_{i,j}$ is the number of times the $j$'th word appears in the $i$'th sentence.
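As a minimal sketch (reusing text_data and the create_doc_term_df helper from above), the term frequencies can be computed by dividing each row of the count matrix by its row sum:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

counts = create_doc_term_df(text_data, CountVectorizer()).to_numpy()
# Divide each count by the total number of (counted) words in that sentence
tf = counts / counts.sum(axis=1, keepdims=True)
```

Note that scikit-learn's TfidfVectorizer simply uses the raw counts as the term frequency by default and only normalizes each row at the very end, so its intermediate numbers differ slightly from this textbook definition.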
Now that we have calculated the frequency of each word within a sentence, we must also calculate how many sentences each word appears in. In other words, we are comparing the relevance of a word in one sentence against its prevalence across all sentences. The reason this is an important step in our pipeline is that it normalizes our data across all sentences.
Think of a spam email classifier. A phrase like "urgent! click here!" would have a high term frequency within that email; however, across all the emails you have received, these words appear in only a small fraction of messages. The purpose of inverse document frequency is to drown out words that show up in almost every document and give more weight to words, like these, that do not appear in other documents.
The formula for calculating the inverse document frequency (idf) is

$$\mathrm{idf}_j = \log\frac{N}{\mathrm{df}_j}$$

where $\mathrm{df}_j$ is the number of documents that contain the word $j$. E.g. $\mathrm{df}_{\text{the}} = 2$. Also, $N$ is the number of documents (sentences) in our list. For our toy example, $N = 3$.
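A minimal sketch of this calculation, again reusing the count matrix from above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

counts = create_doc_term_df(text_data, CountVectorizer()).to_numpy()
N = counts.shape[0]              # number of sentences, 3 in our toy example
df = (counts > 0).sum(axis=0)    # number of sentences containing each word
idf = np.log(N / df)             # inverse document frequency per word
```

Be aware that scikit-learn's TfidfVectorizer defaults to a smoothed variant, $\mathrm{idf}_j = \ln\frac{1 + N}{1 + \mathrm{df}_j} + 1$, so the values it produces will not exactly match $\log\frac{N}{\mathrm{df}_j}$.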
The last step is to get the tf-idf score. We simply multiply the tf and the idf:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_j$$
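Putting the pieces together, here is a sketch that mirrors scikit-learn's defaults (raw counts as tf, the smoothed idf, and l2 normalization of each row):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

counts = create_doc_term_df(text_data, CountVectorizer()).to_numpy()
N = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed idf used by TfidfVectorizer's defaults
idf = np.log((1 + N) / (1 + df)) + 1

tfidf = counts * idf                                           # tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)   # l2-normalize each row
```

TfidfVectorizer bundles all of these steps, so we can reuse the same helper: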
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
print(create_doc_term_df(text_data, vectorizer).to_string())
'''Output:
               big     brown     cried       fox   giraffe     quick      that      the     whale
Sentence
0         0.707107  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000  0.707107
1         0.000000  0.528635  0.000000  0.528635  0.000000  0.528635  0.000000  0.40204  0.000000
2         0.000000  0.000000  0.528635  0.000000  0.528635  0.000000  0.528635  0.40204  0.000000
'''
```