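In natural language processing, we often encounter two or more words that share a common root. A search involving any of these words should treat them all as the same word, namely the root. It therefore becomes essential to reduce every word to its root form, and the NLTK library provides methods that perform this reduction and return the root word.
The program below uses the Porter Stemming Algorithm for stemming.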
import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize requires NLTK's Punkt tokenizer data, installed via nltk.download()
porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# First, split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)

# Next, find the stem of each token
for w in nltk_tokens:
    print("Actual: %s Stem: %s" % (w, porter_stemmer.stem(w)))
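Running the code above produces the following output. Recent NLTK releases lowercase the stem by default, so the first line may show "Stem: it" instead.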
Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room
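To illustrate why mapping words to a shared stem helps search, here is a minimal sketch of a stem-based lookup. The helper names build_index and search and the sample documents are made up for this illustration, and plain str.split() stands in for nltk.word_tokenize so that no tokenizer data is needed.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def build_index(docs):
    """Map each stem to the set of document ids that contain it (hypothetical helper)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Simple whitespace tokenization; an assumption made to keep the sketch self-contained
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing a word with the same stem as the query."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


# Sample documents invented for this example
docs = [
    "the server connected to the database",
    "a new connection was opened",
    "readers prefer learning new skills",
]

index = build_index(docs)
print(search(index, "connecting"))
print(search(index, "skill"))
```

The query "connecting" matches the first two documents even though neither contains that exact word, because "connected", "connection" and "connecting" all reduce to the stem "connect".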
The program below uses the WordNet lexical database for lemmatization.
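import nltk
from nltk.stem import WordNetLemmatizer

# lemmatize requires the WordNet corpus, installed via nltk.download("wordnet")
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# Split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)

# Look up the lemma of each token (treated as a noun by default)
for w in nltk_tokens:
    print("Actual: %s Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))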
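Running the code above produces the following output. Because lemmatize() treats every word as a noun unless told otherwise, verb forms such as "originated" and "learning" are left unchanged.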
Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room