nltk.pos_tag函数
nltk.pos_tag()函数是一种用来进行词性标注的工具。
def pos_tag(tokens, tagset=None, lang='eng'):
"""
Use NLTK's currently recommended part of speech tagger to
tag the given list of tokens.
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.
:param tokens: Sequence of tokens to be tagged
:type tokens: list(str)
:param tagset: the tagset to be used, e.g. universal, wsj, brown
:type tagset: str
:param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
:type lang: str
:return: The tagged tokens
:rtype: list(tuple(str, str))
"""
tagger = _get_tagger(lang)
return _pos_tag(tokens, tagset, tagger)
从其源码中我们可以发现它默认支持英文
另外,我对语法标注的过程函数 _get_tagger的源码也进行了了解
def _get_tagger(lang=None):
if lang == 'rus':
tagger = PerceptronTagger(False)
ap_russian_model_loc = 'file:' + str(find(RUS_PICKLE))
tagger.load(ap_russian_model_loc)
else:
tagger = PerceptronTagger()
return tagger
可以发现其核心实现在于 PerceptronTagger()class PerceptronTagger(TaggerI):
Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal.
See more implementation details here:
http://spacy.io/blog/part-of-speech-POS-tagger-in-python/
>>> from nltk.tag.perceptron import PerceptronTagger
Train the model
>>> tagger = PerceptronTagger(load=False)
>>> tagger.train([[('today','NN'),('is','VBZ'),('good','JJ'),('day','NN')],
... [('yes','NNS'),('it','PRP'),('beautiful','JJ')]])
>>> tagger.tag(['today','is','a','beautiful','day'])
[('today', 'NN'), ('is', 'PRP'), ('a', 'PRP'), ('beautiful', 'JJ'), ('day', 'NN')]
Use the pretrain model (the default constructor)
>>> pretrain = PerceptronTagger()
>>> pretrain.tag('The quick brown fox jumps over the lazy dog'.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
>>> pretrain.tag("The red cat".split())
[('The', 'DT'), ('red', 'JJ'), ('cat', 'NN')]
从上述代码中, 可以得出在PerceptronTagger()函数没有参数的时候,使用的是已经训练好的词性标注工具。
初次学习,后续还会慢慢摸索