A Quick Introduction to NLTK
Natural Language Toolkit (NLTK)
Intro to a few NLTK concepts
- Tokenization
- Stemming & Lemmatization
- POS Tagging
- Stop Words
Tokenization
Tokenization splits text into words or sentences.
import nltk
# first run: nltk.download('punkt') to fetch the tokenizer data ('punkt_tab' in newer NLTK versions)
nltk.word_tokenize(text) # splits the text into words
nltk.sent_tokenize(text) # splits the text into sentences
By default, word_tokenize uses a Treebank-style tokenizer.
Other tokenizers include the whitespace-based, punctuation-based, Tweet and MWE tokenizers, a few of which are sketched below.
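A minimal sketch of a few of these alternatives (the sample sentence is my own):
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TweetTokenizer
text = "Don't worry, be happy! :-)"
WhitespaceTokenizer().tokenize(text) # splits on whitespace only, punctuation stays attached
WordPunctTokenizer().tokenize(text) # splits punctuation into separate tokens
TweetTokenizer().tokenize(text) # tweet-aware: keeps emoticons like ':-)' intact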
Frequency of Words
The FreqDist class from nltk.probability lets us look at the frequency of words in a tokenized text.
import nltk
from nltk.probability import FreqDist
tokenized_words = nltk.word_tokenize(text) # split the text into words
freq = FreqDist(tokenized_words)
freq.most_common(5) # check the 5 most common words
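FreqDist also behaves like a dictionary, so individual counts are easy to look up:
freq['the'] # count of the token 'the' (0 if absent)
freq.N() # total number of tokens counted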
Stemming & Lemmatization
Stemming - Suffix Stripping
Advantages: Word generalisation, reduced vocabulary size.
Disadvantages: Too simplistic, loss of information, may become obsolete with better language models.
A few well known stemmers are:
- PorterStemmer
- LancasterStemmer
- SnowballStemmer
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer() # note the parentheses - we need an instance
snowball = SnowballStemmer('english') # using english as the lang
word = 'demonstration'
print(porter.stem(word))
# demonstr
print(lancaster.stem(word))
# demonst
print(snowball.stem(word))
# demonstr
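A quick way to compare the three stemmers side by side, reusing the stemmer objects created above (the word list is my own choice):
for w in ['running', 'flies', 'happily', 'organization']:
    print(w, porter.stem(w), lancaster.stem(w), snowball.stem(w))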
Lemmatization - reducing a word to its dictionary form (lemma)
The WordNet lemmatizer uses the word's POS tag to pick the correct lemma.
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() # needs the WordNet corpus: nltk.download('wordnet')
lemmatizer.lemmatize("better", pos="a") # pos="a" marks the word as an adjective
# 'good'
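The pos argument matters: the same surface form can map to different lemmas. WordNet uses 'n' (noun), 'v' (verb), 'a' (adjective) and 'r' (adverb):
lemmatizer.lemmatize("running", pos="v")
# 'run'
lemmatizer.lemmatize("running", pos="n")
# 'running' (already a valid noun lemma)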
POS Tagging
Categorising words as verbs, nouns, adjectives, etc.
To find the POS tags of the words in a sentence:
import nltk
text = "......"
tokenized_sents = nltk.tokenize.sent_tokenize(text)
tokenized_words = nltk.word_tokenize(tokenized_sents[0])
nltk.pos_tag(tokenized_words) # needs nltk.download('averaged_perceptron_tagger')
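On a concrete sentence (my own example), the result is a list of (word, tag) pairs using the Penn Treebank tagset; exact tags can vary slightly by NLTK version:
nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps"))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ')]
nltk.help.upenn_tagset('JJ') # prints what a given tag means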
Training our own POS taggers:
Unigrams - trained on single words, no context
Bigrams - use the previous word as context
Trigrams - use the previous two words as context
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
from nltk.corpus import brown # data from the nltk corpus
brown_data = brown.tagged_sents(categories='mystery')
# in practice the data is split into train and test sets (sketched below)
unigram_tagger = UnigramTagger(brown_data)
bigram_tagger = BigramTagger(brown_data)
trigram_tagger = TrigramTagger(brown_data)
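A minimal sketch of that split plus backoff chaining, so the trigram tagger can fall back to shorter contexts for unseen sequences (the 90/10 split is my own choice):
split = int(len(brown_data) * 0.9)
train_data, test_data = brown_data[:split], brown_data[split:]
unigram = UnigramTagger(train_data)
bigram = BigramTagger(train_data, backoff=unigram)
trigram = TrigramTagger(train_data, backoff=bigram)
trigram.accuracy(test_data) # .evaluate(test_data) in older NLTK versions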
Stop Words
Filler words like 'me', 'my', 'we', 'our', etc., which are usually filtered out before analysis.
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english') # language names are lowercase; needs nltk.download('stopwords')
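A common use is filtering stop words out of a tokenized text, reusing tokenized_words from above (a minimal sketch):
stop_words = set(stop_words) # set membership checks are faster
filtered_words = [w for w in tokenized_words if w.lower() not in stop_words]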
N-Grams
N-grams are contiguous sequences of n tokens: bigrams (n = 2), trigrams (n = 3), and so on.
The point is to consider each token together with its surrounding context.
from nltk import bigrams, trigrams, everygrams
list(bigrams(tokenized_words))
list(trigrams(tokenized_words))
list(everygrams(tokenized_words, 2, 5)) # 2-grams up to 5-grams
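On a tiny token list the structure is easy to see:
list(bigrams(['a', 'b', 'c']))
# [('a', 'b'), ('b', 'c')]
list(everygrams(['a', 'b', 'c'], 1, 2))
# every 1-gram and 2-gram: ('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')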
A very rapid introduction to NLTK - not perfect, but I will update this as I understand it better :)