Basic Text Processing for NLP
Learn basic text processing techniques for Natural Language Processing with practical examples.
- Tokenize text:
import re
text = "This is a sample text. It contains multiple sentences!"
# Lowercase the text, then extract runs of word characters (punctuation is dropped)
tokens = re.findall(r'\w+', text.lower())
- Remove stopwords:
stopwords = {'a', 'the', 'is', 'and', 'of'}
filtered_tokens = [word for word in tokens if word not in stopwords]
- Stem words:
# Requires the third-party nltk package
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in filtered_tokens]
- Create n-grams:
def create_ngrams(tokens, n=2):
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
- Build vocabulary:
# Sort the unique stems so each word gets the same index on every run
vocab = {word: idx for idx, word in enumerate(sorted(set(stems)))}
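The steps above can be chained into one small pipeline. The following is a minimal sketch, not a definitive implementation: it uses only the standard library, so the stemming step is left out, and the helper names `preprocess`, `build_vocab`, and `encode` are assumptions introduced for illustration. It also counts bigrams and maps each token to its vocabulary index.

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "is", "and", "of"}

def preprocess(text):
    """Lowercase, tokenize on word characters, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def create_ngrams(tokens, n=2):
    """Join every run of n consecutive tokens into one space-separated string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocab(tokens):
    """Map each unique token to an index; sorting keeps the mapping reproducible."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}

def encode(tokens, vocab):
    """Convert tokens to indices; tokens missing from the vocabulary are skipped."""
    return [vocab[t] for t in tokens if t in vocab]

text = "This is a sample text. It contains multiple sentences!"
tokens = preprocess(text)
bigram_counts = Counter(create_ngrams(tokens, 2))
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(tokens)
print(ids)
```

Skipping unknown tokens in `encode` is one possible design choice; reserving a dedicated index for out-of-vocabulary words is a common alternative.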
Advanced techniques:
- Lemmatization for better word normalization
- POS tagging for grammatical analysis
- Named Entity Recognition (NER)
- Dependency parsing for sentence structure
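To illustrate why lemmatization can normalize words better than stemming, here is a toy sketch. It assumes a tiny hand-written lookup table (`LEMMA_TABLE`, hypothetical) and a deliberately crude suffix stripper (`naive_stem`, also hypothetical and not the Porter algorithm). Real lemmatizers rely on a full lexicon and usually on part-of-speech information.

```python
# Hypothetical lookup table for illustration; a real lemmatizer uses a full lexicon.
LEMMA_TABLE = {
    "mice": "mouse",
    "better": "good",
    "ran": "run",
    "running": "run",
    "studies": "study",
}

def toy_lemmatize(token):
    """Return the dictionary form if the token is known, else the token unchanged."""
    return LEMMA_TABLE.get(token, token)

def naive_stem(token):
    """Crude suffix stripping, shown only for contrast (not the Porter algorithm)."""
    for suffix in ("ing", "ies", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

words = ["mice", "better", "running", "studies"]
lemmas = [toy_lemmatize(w) for w in words]
stems = [naive_stem(w) for w in words]

for word, lemma, stem in zip(words, lemmas, stems):
    print(f"{word}: lemma={lemma}, stem={stem}")
```

The lookup recovers irregular forms such as "mice" and "better", which suffix stripping cannot reach, while the stripped stems are not always real words.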
Read more: NLTK Documentation