
NLP Mastery Part 2

Himanshu Tripathi
Published in DataDrivenInvestor
5 min read · May 2, 2021

So, in the previous article, we learned how to represent a text as a vector, how to build a vocabulary, and how to extract features from a given text corpus.

If you haven’t read the previous article yet, check it out 🙌

👉 https://medium.datadriveninvestor.com/nlp-mastery-part-1-93cec31a457

In this article, we’re going to learn about the major concepts of text preprocessing.

Tokenization and StopWords

The first technique we’re going to learn is called tokenization, and the second is called stop-word removal. Specifically, we’ll learn how to use these techniques to preprocess our text corpus.

Tokenization

Tokenization is simply splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
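As a minimal sketch of this idea, here is a simple word-level tokenizer built with Python’s standard `re` module (the regex pattern and function name are my own illustration, not code from this series — real pipelines often use a library tokenizer such as NLTK’s or spaCy’s instead):

```python
import re

def tokenize(text):
    # Lowercase the text, then pull out runs of letters, digits,
    # and apostrophes -- each match is one token.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

print(tokenize("Tokenization splits a sentence into tokens."))
# ['tokenization', 'splits', 'a', 'sentence', 'into', 'tokens']
```

Note that real tokenizers handle many more cases (contractions, punctuation, Unicode), but the core operation is the same: one string in, a list of tokens out.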

So why is tokenization important?

We know that machine learning and deep learning models work on numerical input, not text, so how do we turn a whole sentence into numerical data? If we try to convert the entire sentence into a number at once, the individual words lose their meaning and our model’s performance will be very poor. Tokenization solves this: we split the sentence into tokens first, and then map each token to a number.
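To make the token-to-number step concrete, here is one common way to do the mapping: assign each unique token an integer id from a vocabulary (a hand-rolled sketch for illustration; the helper name `build_vocab` is my own, and libraries like scikit-learn do this for you):

```python
def build_vocab(tokens):
    # Assign each unique token an integer id,
    # in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```

Because each word keeps its own id, the model can learn something about each word individually, which is exactly what we lose if the whole sentence is collapsed into one value.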

