Hi! How are you?
Today, let's see how we can represent the text data of a corpus in array format. As we know, computers only understand numbers, so before running any machine learning algorithm we have to encode the data into some numerical format that the algorithm can find patterns in and build a model from. In Natural Language Processing, and text analysis in particular, the data is text, so converting the raw text into numbers is a mandatory step before feeding it to an algorithm. There are various ways to do this; let's discuss them.

The first is the Bag of Words (BoW) model. It is simply a way of counting how many times each word appears in a corpus. (Here, corpus means the entire dataset of text.) Let's take 3 sentences:
- "It is going to rain today"
- "I am going to drink coffee"
- "I am going to capital today"
If we perform Bag of Words on the above example, we first count the number of times each individual word appears in the corpus.
Term | Frequency |
--- | --- |
going | 3 |
to | 3 |
i | 2 |
am | 2 |
today | 2 |
it | 1 |
is | 1 |
rain | 1 |
drink | 1 |
coffee | 1 |
capital | 1 |
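By the way, this count table can be reproduced with a few lines of plain Python. Here is a minimal sketch using collections.Counter (the variable names are purely illustrative):

from collections import Counter

corpus = ["It is going to rain today",
          "I am going to drink coffee",
          "I am going to capital today"]

# Count every lower-cased word across the whole corpus
word_counts = Counter(word for sentence in corpus for word in sentence.lower().split())
print(word_counts.most_common())
# [('going', 3), ('to', 3), ('i', 2), ('am', 2), ('today', 2), ('it', 1), ...]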
Now, if we represent it in tabular form, the Bag of Words representation looks like this.
Document No | going | it | to | i | am | is | rain | today | drink | coffee | capital |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
3 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
But we can already see a problem with this Bag of Words representation: all words carry the same importance. In the given dataset, the word 'going' is present in every sentence, while words like 'rain', 'coffee', and 'capital' appear in only one sentence each and carry the main essence of that sentence. Yet in the BoW model they all get the value 1, so the representation does not capture how important a word is, which can be problematic for downstream tasks.

The second problem is that no order is maintained, which means the semantic information is not preserved. Text is sequential data, so the order of words is very important, but the BoW model ignores it entirely. This can cause problems when we work with models that need the data in its proper order to learn from it.

If you want to perform Bag of Words in Python with sklearn, you can do it like this:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

doc = ["It is going to rain today",
       "I am going to drink coffee",
       "I am going to capital today"]

# Learn the vocabulary and build the document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(doc)

# get_feature_names() was removed in recent sklearn releases; use get_feature_names_out()
column = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=column)
df
# Note: the default tokenizer only keeps words of 2+ characters, so the single-letter word "i" is dropped
In order to solve the problems with the Bag of Words model, we use something called TF-IDF. So what is TF-IDF? TF-IDF stands for Term Frequency - Inverse Document Frequency. Here, Term Frequency is the ratio of the number of occurrences of a word in a document to the total number of words in that document:

tf(t, d) = f(t, d) / (total number of terms in d)

where f(t, d) is the raw count of term t in document d, i.e., the number of times that term t occurs in document d. There are various other ways to define term frequency.
From the above example, the term frequency of the word 'going' in document 1 is: 'going' appears once in the document, and the document has 6 words in total, so tf(going) = 1/6 = 0.1666. Similarly, the tf of the word 'to' in document 1 is tf(to) = 1/6 = 0.1666.
So, let's calculate the term frequency for all the terms:
Term | TF value (doc1) | TF value (doc2) | TF value (doc3) |
--- | --- | --- | --- |
going | 0.1666 | 0.1666 | 0.1666 |
to | 0.1666 | 0.1666 | 0.1666 |
i | 0 | 0.1666 | 0.1666 |
am | 0 | 0.1666 | 0.1666 |
it | 0.1666 | 0 | 0 |
is | 0.1666 | 0 | 0 |
rain | 0.1666 | 0 | 0 |
today | 0.1666 | 0 | 0.1666 |
drink | 0 | 0.1666 | 0 |
coffee | 0 | 0.1666 | 0 |
capital | 0 | 0 | 0.1666 |
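These per-document values follow directly from the definition: each document has 6 words and every word occurs at most once, so each present term gets 1/6 = 0.1666. Here is a tiny, purely illustrative helper that computes the same values (the function name tf is my own):

def tf(term, document):
    # occurrences of `term` divided by the total number of words in the document
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

doc1 = "It is going to rain today"
print(round(tf("going", doc1), 4))   # 0.1667
print(round(tf("coffee", doc1), 4))  # 0.0, since 'coffee' does not occur in document 1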
Since we have calculated the term frequency, let's discuss Inverse Document Frequency (IDF). IDF is calculated as the log of the ratio of the total number of documents to the number of documents that contain the particular term. It measures how much information a word provides, i.e., how common or how rare the word is across the given corpus:

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where
- N is the total number of documents in the corpus, N = |D|
- |{d ∈ D : t ∈ d}| is the number of documents where the term t appears (i.e., tf(t, d) ≠ 0). If the term is not in the corpus, this leads to a division by zero, so it is common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.
So, let's calculate the IDF value of some terms (using a base-10 logarithm). The word 'going' is present in all three documents and there are 3 documents in total, so the idf value of 'going' must be idf(going) = log(3/3) = log(1) = 0. This tells us that since 'going' is present in all 3 documents, it carries no importance at all; the same holds for 'to'. If we calculate the idf value of 'today', which appears in 2 of the 3 documents, it becomes: idf(today) = log(3/2) = 0.17609. And if we calculate the idf value of 'coffee', which appears in only one document, it becomes: idf(coffee) = log(3/1) = 0.47712. So, let's see what the IDF value of each term becomes.
Term | IDF value |
--- | --- |
going | 0 |
to | 0 |
i | 0.17609 |
am | 0.17609 |
today | 0.17609 |
it | 0.47712 |
is | 0.47712 |
rain | 0.47712 |
drink | 0.47712 |
coffee | 0.47712 |
capital | 0.47712 |
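Here is a minimal sketch of the same calculation (names are illustrative, base-10 log, and no smoothing is applied, so an unseen term would divide by zero):

import math

docs = ["It is going to rain today",
        "I am going to drink coffee",
        "I am going to capital today"]

def idf(term, documents):
    # number of documents that contain the term (an unseen term would divide by zero,
    # which is why the 1 + adjustment mentioned above is often used)
    containing = sum(1 for d in documents if term.lower() in d.lower().split())
    return math.log10(len(documents) / containing)

print(round(idf("going", docs), 5))   # 0.0      -> present in all 3 documents
print(round(idf("today", docs), 5))   # 0.17609  -> present in 2 of the 3 documents
print(round(idf("coffee", docs), 5))  # 0.47712  -> present in only 1 document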
Now, it's time to do the magic: calculate TF-IDF. It is simply the product of Term Frequency and Inverse Document Frequency. For example, the TF-IDF value of the word 'today' in document 1 is TFIDF(today) = TF(today) × IDF(today) = 0.1666 × 0.17609 = 0.02933. Doing this for every term and every document gives the following table.
Document No | going | it | to | i | am | is | rain | today | drink | coffee | capital |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1 | 0 | 0.07948 | 0 | 0 | 0 | 0.07948 | 0.07948 | 0.02933 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0.02933 | 0.02933 | 0 | 0 | 0 | 0.07948 | 0.07948 | 0 |
3 | 0 | 0 | 0 | 0.02933 | 0.02933 | 0 | 0 | 0.02933 | 0 | 0 | 0.07948 |
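Putting the two parts together, here is a small end-to-end sketch (illustrative names, base-10 log, no smoothing) that reproduces this table up to small rounding differences:

import math
import pandas as pd

docs = ["It is going to rain today",
        "I am going to drink coffee",
        "I am going to capital today"]
tokenized = [d.lower().split() for d in docs]
vocab = sorted(set(word for words in tokenized for word in words))

def tfidf(term, words, corpus):
    tf = words.count(term) / len(words)                      # term frequency in this document
    doc_freq = sum(1 for other in corpus if term in other)   # documents containing the term
    return tf * math.log10(len(corpus) / doc_freq)           # tf x idf

table = pd.DataFrame([{term: round(tfidf(term, words, tokenized), 5) for term in vocab}
                      for words in tokenized])
table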
The table above is the final TF-IDF text representation for the example corpus. You can try TF-IDF in sklearn with the code below.
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the `doc` list and pandas import from the Bag of Words example above
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(doc)
column = vectorizer.get_feature_names_out()  # as above, get_feature_names() is no longer available
df = pd.DataFrame(X.toarray(), columns=column)
df
If you have tried TF-IDF in sklearn, you will notice that the results are quite different. That is because sklearn's TfidfVectorizer computes things a little differently: by default it uses a smoothed IDF with the natural log (1 is added to the document counts and to the IDF itself) and then L2-normalizes each document vector. The method described above is the root idea behind TF-IDF, but in practice it is tuned like this for large-scale use.
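For the curious, here is a rough sketch of that calculation under TfidfVectorizer's default settings (smoothed IDF with the natural log, plus L2 normalization of each document row); the variable names are my own:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

doc = ["It is going to rain today",
       "I am going to drink coffee",
       "I am going to capital today"]

counts = CountVectorizer().fit_transform(doc).toarray()        # raw term counts per document
doc_freq = (counts > 0).sum(axis=0)                            # number of documents containing each term
idf = np.log((1 + len(doc)) / (1 + doc_freq)) + 1              # smoothed IDF with natural log
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each document row

print(np.allclose(tfidf, TfidfVectorizer().fit_transform(doc).toarray()))  # True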
If you are still confused about TF-IDF, let me know in the comments. Until then, enjoy learning! The code for this tutorial can also be found at this link.