The Art of Tokenization in Text Preprocessing: Day 2

Exploring Basic to Advanced NLP Tokenization Techniques

Cognitive Creator
15 min read · Nov 18, 2023

Table of Contents: We will cover the following

· Introduction
· What is Tokenization?
· Types of Tokenization
· Basic Tokenization Types
· Subword Tokenization
· Specialized Tokenization
· Advanced Tokenization
· Conclusion
· What’s Coming Up Next?

Introduction

Tokenization is one of the fundamental steps in text preprocessing: it sets the stage for every later operation in the NLP pipeline. In this article we explore the different ways text can be segmented, each with its own advantages and trade-offs. Think of tokenization as dissecting language: breaking a continuous stream of text into units that are manageable and analyzable.
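To make the idea of "breaking text into units" concrete, here is a minimal sketch using only Python's standard `re` module. The function names `simple_word_tokens` and `simple_sentence_tokens` are hypothetical helpers chosen for illustration, and the regex rules are deliberately naive assumptions; they are not the tokenizers discussed later in the series.

```python
import re


def simple_word_tokens(text):
    """Split text into word and punctuation tokens with a naive regex."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sentence_tokens(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the closing punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "Tokenization breaks text into units. Each unit is a token!"
words = simple_word_tokens(text)
sentences = simple_sentence_tokens(text)
print(words)
print(sentences)
```

The same string yields different units depending on the rule applied: twelve word-level tokens (punctuation included) versus two sentence-level tokens. Rules this simple will mishandle abbreviations such as "Dr." and contractions, which is one reason more careful tokenizers exist.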


Recap of Day 1

Before we get to the heart of tokenization, let’s briefly revisit what we covered on Day 1. We commenced our…

