NLP Mastery Part 2

So, in the previous article, we learned how to represent text as a vector, how to build a vocabulary, and how to extract features from a given text corpus.
If you haven’t checked the previous article → then check it out 🙌
👉 https://medium.datadriveninvestor.com/nlp-mastery-part-1-93cec31a457
So in this article, we're going to learn about two major preprocessing concepts.
Tokenization and StopWords
The first technique we're going to learn is called tokenization, and the second is called stop word removal. Specifically, we'll learn how to use these techniques to preprocess our text corpus.
Tokenization
Tokenization is simply splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
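Here's a minimal sketch of what tokenization looks like in practice. This is a simple rule-based tokenizer built with Python's standard `re` module (libraries like NLTK or spaCy provide more sophisticated tokenizers, but the idea is the same):

```python
import re

def tokenize(text):
    # Lowercase the text, then split it into tokens by matching
    # runs of letters, digits, and apostrophes. Punctuation and
    # whitespace act as token boundaries.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Tokenization splits a sentence into smaller units.")
print(tokens)
# → ['tokenization', 'splits', 'a', 'sentence', 'into', 'smaller', 'units']
```

Each word becomes its own token, and the trailing period is dropped.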
So why is tokenization important?
We know that machine learning and deep learning models work on numerical input, not raw text. So how do we turn a sentence into numbers? If we convert the whole sentence into a single numerical value, the individual words lose their meaning and our model's performance suffers badly. Tokenization solves this: by breaking the sentence into tokens first, each word can be mapped to its own number, and the model can learn from word-level patterns.
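To make this concrete, here's a small sketch (the function names `build_vocab` and `encode` are my own, not from any particular library) showing how tokens can be mapped to integer ids, which is the numerical form a model can actually consume:

```python
def build_vocab(tokens):
    # Assign each unique token an integer id, reserving id 0 for
    # out-of-vocabulary words via an "<unk>" placeholder.
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Replace each token with its id; unseen tokens fall back to 0.
    return [vocab.get(tok, 0) for tok in tokens]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
print(encode(tokens, vocab))
# → [1, 2, 3, 4, 1, 5]
```

Note that both occurrences of "the" map to the same id, so the model sees that it's the same word, something that would be impossible if we had encoded the whole sentence as one number.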