Text Representation Techniques in NLP — Machine Learning

Bhargav Joshi
Mar 21, 2022 · 5 min read

Introduction

Natural Language Processing (NLP) is a branch of artificial intelligence that deals with human language, enabling a system to understand and respond to it. Data is the most important part of any data science project and should always be represented in a way that makes understanding and modeling easy. It is often said that if we feed excellent features to a poor model and bad features to a well-optimized model, the poor model will perform far better than the optimized one. So in this article, we will study how features can be extracted from text data and used in the machine learning modeling process, and why feature extraction from text is a bit more difficult than for other types of data.

Introduction to Text Representation

The first question that arises is: what is feature extraction from text? Feature extraction is a general term, also known as text representation or text vectorization, for the process of converting text into numbers. We call it vectorization because once the text is converted into numbers, it is in vector form.

Now the second question would be: why do we need feature extraction? Machines can only understand numbers, so to make them able to work with language we need to convert it into numeric form.

Limitations of information extraction methods and techniques for heterogeneous unstructured big data

During the recent era of big data, a huge volume of unstructured data has been produced in the form of audio, video, images, text, and animation. Making effective use of this unstructured data is a laborious and tedious task. Information extraction (IE) systems help to extract useful information from this wide variety of unstructured data, and several techniques and methods have been proposed for the purpose. However, most studies on IE are limited to a single data type, such as text, image, audio, or video, so existing IE techniques face real limitations and challenges across the different varieties of unstructured data, and the growth of unstructured big data amplifies them.

Techniques for Feature Extraction

1. One-Hot Encoding
One-hot encoding means converting each word of a document into a V-dimensional vector (where V is the vocabulary size); stacking the vectors for all the words of a document gives a two-dimensional array. This technique is very intuitive, meaning it is simple and you can code it yourself, but that is about its only advantage.
Now, to perform all the techniques in Python, let us open a Jupyter notebook and create a sample data frame of a few sentences.
import numpy as np
import pandas as pd

# A small sample corpus with a binary output label for each sentence
sentences = ['First word', 'Second word', 'Third word']
df = pd.DataFrame({"text": sentences, "output": [1, 1, 0]})

Now we can perform one-hot encoding using scikit-learn's pre-built OneHotEncoder class, or implement it ourselves in plain Python. After encoding, each sentence becomes a 2-D array whose shape depends on the number of words it contains.
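As a rough illustration, here is a minimal manual sketch that reuses the df created above; the one_hot_encode helper and its vocabulary handling are my own additions for demonstration, not part of the original post.

import numpy as np

# Build the vocabulary from the sample corpus
vocab = sorted({word for sentence in df['text'] for word in sentence.split()})
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot_encode(sentence):
    # Return a (number of words, V) array with one one-hot row per word
    vectors = np.zeros((len(sentence.split()), len(vocab)), dtype=int)
    for row, word in enumerate(sentence.split()):
        vectors[row, word_to_idx[word]] = 1
    return vectors

# Each sentence produces a 2-D array; its first dimension is the sentence length
print(one_hot_encode(df['text'][0]))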

2. Bag Of Words
It is one of the most widely used text vectorization techniques, mostly applied in text classification tasks. Bag of words is somewhat similar to one-hot encoding, but instead of one binary row per word we keep a single row per document and enter the count of each word in that document. So we create a vocabulary and, for each document, record how many times each vocabulary word occurs in it. Let us go back to the IDE and implement the bag-of-words model using the CountVectorizer class of scikit-learn.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
bow = cv.fit_transform(df['text'])
# To see the vocabulary and the vectors it has created, you can use the code below.
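For example, in place of the results image from the original post (vocabulary_ and toarray are standard scikit-learn/SciPy attributes; the exact output depends on the sample sentences):

print(cv.vocabulary_)   # mapping from each word to its column index
print(bow.toarray())    # document-term count matrix, one row per sentence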

3. N-Grams
This technique is similar to bag of words. All the techniques we have seen so far build the vocabulary from single words, so they cannot capture word order or multi-word context. The N-gram technique addresses this by constructing the vocabulary from sequences of N consecutive words. When we build an N-gram representation we need to specify what we want, such as bigrams or trigrams. If the requested N-gram size cannot be formed from the documents, CountVectorizer ends up with an empty vocabulary and throws an error; our sample sentences are only two words long, so we cannot go beyond bigrams, let alone a 4-gram or 5-gram model. Let us try bigrams and observe the output.

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) builds the vocabulary from bigrams only
cv = CountVectorizer(ngram_range=(2, 2))
bow = cv.fit_transform(df['text'])
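To see the bigram vocabulary it built (get_feature_names_out is the name in scikit-learn 1.0 and later; older releases use get_feature_names):

print(cv.get_feature_names_out())  # the bigrams that make up the vocabulary
print(bow.toarray())               # bigram counts, one row per sentence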

4. TF-IDF (Term Frequency and Inverse Document Frequency)

This technique does not work in the same way as the ones above: it gives a different value (weight) to each word in a document. The core idea is that a word which appears many times in a document but rarely in the rest of the corpus is very important for that document, so it should receive a higher weight. This weight is calculated from two terms, TF and IDF: to find the weight of a word, we compute its TF and IDF and multiply the two.

Term Frequency (TF)
Term frequency is the number of occurrences of a word in a document divided by the total number of terms in that document. For example, if a five-word sentence contains the word "people" once, its term frequency is 1/5. It tells us how frequently a particular word occurs in a particular document.
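A minimal sketch of this calculation (the term_frequency helper and the five-word example sentence are made up here for illustration):

def term_frequency(term, document):
    # occurrences of the term divided by the total number of terms in the document
    words = document.lower().split()
    return words.count(term) / len(words)

# A hypothetical five-word sentence containing "people" once -> 1/5
print(term_frequency("people", "Many people enjoy reading books"))  # 0.2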

Inverse Document Frequency (IDF)
Inverse document frequency is the log of the total number of documents in the corpus divided by the number of documents that contain the term T. If a word appears in every document, the log evaluates to zero and the word's contribution would be ignored, so scikit-learn uses a slightly different implementation: it adds one to the result, which is why the TF-IDF values you observe are a bit higher. A word that appears in only a single document gets a higher IDF.
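A rough sketch of both variants (the helper functions are my own illustration; the smoothed formula matches scikit-learn's default smooth_idf=True behavior):

import numpy as np

def idf(term, documents):
    # standard form: log(N / df), N = number of documents, df = documents containing the term
    n = len(documents)
    df_count = sum(term in doc.lower().split() for doc in documents)
    return np.log(n / df_count)

def sklearn_idf(term, documents):
    # scikit-learn's smoothed form: log((1 + N) / (1 + df)) + 1, which is never zero
    n = len(documents)
    df_count = sum(term in doc.lower().split() for doc in documents)
    return np.log((1 + n) / (1 + df_count)) + 1

Now let us apply TfidfVectorizer to our sample data frame.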

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()
# One term (TF) keeps track of how frequently a word occurs, while the other (IDF) keeps track of how rare it is across the corpus.
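To inspect the learned weights (idf_ is a standard TfidfVectorizer attribute after fitting; get_feature_names_out requires scikit-learn 1.0 or later):

print(tfidf.get_feature_names_out())  # the vocabulary terms
print(tfidf.idf_)                     # the IDF weight learned for each term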

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Thanks for reading :)
