Basic Text Processing for NLP

Learn the basic text processing steps for Natural Language Processing (NLP): tokenization, stopword removal, stemming, n-gram creation, and vocabulary building, each with a short Python example.

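  1. Tokenize text:
import re
text = "This is a sample text. It contains multiple sentences!"
# Lowercase, then extract runs of word characters; punctuation is dropped.
tokens = re.findall(r'\w+', text.lower())
  2. Remove stopwords:
# A tiny illustrative stopword list; real lists are much longer.
stopwords = {'a', 'the', 'is', 'and', 'of'}
filtered_tokens = [word for word in tokens if word not in stopwords]
  3. Stem words:
# Requires the nltk package (pip install nltk).
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in filtered_tokens]
  4. Create n-grams:
def create_ngrams(tokens, n=2):
    # Slide a window of size n over the tokens and join each window with spaces.
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
bigrams = create_ngrams(stems)
  5. Build vocabulary:
# Sort the unique stems so each word gets the same index on every run.
vocab = {word: idx for idx, word in enumerate(sorted(set(stems)))}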
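
The steps above can be chained into one small pipeline. The following is a minimal sketch, not a definitive implementation: it uses only the standard library, so the stemming step is left out to avoid the NLTK dependency. The helper names (tokenize, remove_stopwords, build_vocab, encode) are hypothetical and chosen for illustration; only create_ngrams comes from the steps above.

```python
import re

# Toy stopword list from the steps above; real lists are much longer.
STOPWORDS = {'a', 'the', 'is', 'and', 'of'}

def tokenize(text):
    """Lowercase the text and extract runs of word characters."""
    return re.findall(r'\w+', text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

def create_ngrams(tokens, n=2):
    """Join each window of n consecutive tokens with a space."""
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def build_vocab(tokens):
    """Assign an index to each unique token, in sorted order."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}

def encode(tokens, vocab):
    """Map tokens to their ids, skipping tokens missing from vocab."""
    return [vocab[t] for t in tokens if t in vocab]

text = "This is a sample text. It contains multiple sentences!"
tokens = remove_stopwords(tokenize(text))
bigrams = create_ngrams(tokens)
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(bigrams[:2])  # ['this sample', 'sample text']
print(ids)          # [6, 3, 5, 1, 0, 2, 4]
```

Sorting before enumerating keeps the word-to-index mapping stable between runs, which matters when the ids are saved or fed to a model later.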

Advanced techniques:

  • Lemmatization for better word normalization
  • POS tagging for grammatical analysis
  • Named Entity Recognition (NER)
  • Dependency parsing for sentence structure
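
To illustrate the first item, here is a toy sketch of what lemmatization does differently from stemming: it maps a word to a dictionary form instead of chopping off a suffix. Everything in it is an assumption made for illustration. The IRREGULAR table and the toy_lemmatize function are hypothetical, and the suffix rules are deliberately crude. A real lemmatizer, such as NLTK's WordNetLemmatizer, relies on a full lexicon and usually on part-of-speech information.

```python
# Hypothetical lookup table for irregular forms; a real lemmatizer
# draws on a complete lexicon instead of a handful of entries.
IRREGULAR = {'mice': 'mouse', 'went': 'go', 'better': 'good', 'children': 'child'}

def toy_lemmatize(word):
    """Return a dictionary-like base form using a lookup plus crude suffix rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    # 'studies' -> 'study'
    if word.endswith('ies') and len(word) > 4:
        return word[:-3] + 'y'
    # Strip a plural 's', but leave endings like 'glass', 'this', 'bonus' alone.
    if word.endswith('s') and not word.endswith(('ss', 'is', 'us')) and len(word) > 3:
        return word[:-1]
    return word

words = ['Mice', 'studies', 'sentences', 'glass', 'went']
print([toy_lemmatize(w) for w in words])
# ['mouse', 'study', 'sentence', 'glass', 'go']
```

The rules here will misfire on many real words, which is why practical lemmatizers combine a dictionary with POS tags instead of relying on suffix patterns.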

Read more: NLTK Documentation