
NLP Mastery Part 2

Himanshu Tripathi
Published in DataDrivenInvestor
5 min read · May 2, 2021

So, in the previous article, we learned how to represent a text as a vector, how to build a vocabulary, and how to extract features from a given text corpus.

If you haven’t read the previous article yet, check it out 🙌

👉 https://medium.datadriveninvestor.com/nlp-mastery-part-1-93cec31a457

In this article, we’re going to learn about the major concepts of text preprocessing.

Tokenization and StopWords

The first technique we’re going to learn is called tokenization, and the second is called stop-word removal. Specifically, we’ll learn how to use these techniques to preprocess our text corpus.

Tokenization

Tokenization is simply splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
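As a minimal sketch of this idea, here is a simple word-level tokenizer built with Python’s standard `re` module (the regex pattern and function name are my own illustration, not code from this series — real pipelines often use a library tokenizer such as NLTK’s or spaCy’s instead):

```python
import re

def tokenize(text):
    # Lowercase the text, then pull out runs of letters, digits,
    # and apostrophes -- each match is one token.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

print(tokenize("Tokenization splits a sentence into tokens."))
# ['tokenization', 'splits', 'a', 'sentence', 'into', 'tokens']
```

Note that real tokenizers handle many more cases (contractions, punctuation, Unicode), but the core operation is the same: one string in, a list of tokens out.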

So why is tokenization important?

We know that machine learning and deep learning models work on numerical input, not text, so how do we turn a whole sentence into numerical data? If we try to convert the entire sentence into a number at once, the individual words lose their meaning and our model’s performance will be very poor. Tokenization solves this: we split the sentence into tokens first, and then map each token to a number.
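To make the token-to-number step concrete, here is one common way to do the mapping: assign each unique token an integer id from a vocabulary (a hand-rolled sketch for illustration; the helper name `build_vocab` is my own, and libraries like scikit-learn do this for you):

```python
def build_vocab(tokens):
    # Assign each unique token an integer id,
    # in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```

Because each word keeps its own id, the model can learn something about each word individually, which is exactly what we lose if the whole sentence is collapsed into one value.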

