
Natural Language Processing with Python: Study Notes, Chapter 1

1. Find the words in a file that end in "ing".

In [5]: for line in open("regular_express.txt"):
   ...:     for word in line.split():
   ...:         if word.endswith('ing'):
   ...:             print word
   ...:
drafting
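
The session above is Python 2 (print is a statement). A minimal Python 3 sketch of the same loop follows, assuming the input is any iterable of text lines. The helper name words_ending_with and the sample lines are made up for illustration.

```python
def words_ending_with(lines, suffix="ing"):
    """Return every whitespace-separated word that ends with `suffix`."""
    return [word
            for line in lines
            for word in line.split()
            if word.endswith(suffix)]


# Hypothetical sample lines standing in for the contents of a file.
sample_lines = ["she was drafting a letter", "nothing to see here"]
matches = words_ending_with(sample_lines)
print(matches)
```

To scan a real file, an open file object can be passed in place of the list, since iterating over a file yields its lines.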

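2. In Python 2, dividing one integer by another truncates the result. To get ordinary (true) division for all numbers, import the following: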

>>> 1/3
0
>>> from __future__ import division
>>> 1/3
0.33333333333333331
>>> 1.0/3.0
0.33333333333333331
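
For comparison, Python 3 makes true division the default, so the import is not needed there. A short sketch (the variable names are only for illustration):

```python
# In Python 3, / is always true division and // is floor division.
true_quotient = 1 / 3
floor_quotient = 1 // 3
print(true_quotient, floor_quotient)
```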

 

3. We can join the words of a list into a single string, or split a string into a list.

>>> ''.join(['money','python'])
'moneypython'
>>> 'money python'.split
<built-in method split of str object at 0x0211B340>
>>> 'money python'.split()
['money', 'python']
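
Note that ''.join() glues the words together with nothing in between. A small Python 3 sketch showing that the separator string decides what goes between the items, so that join and split undo each other (the sample words are arbitrary):

```python
words = ['money', 'python']

# The string that join() is called on is placed between the items.
joined = ' '.join(words)
print(joined)

# split() with no argument splits on runs of whitespace.
print(joined.split())

# An explicit separator can be given as well.
parts = 'a,b,c'.split(',')
print(parts)
```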

 

 

 

4. The following line says "from NLTK's book module, load everything". The book module contains all the data you need to read this chapter.

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

 

5. Type text1 followed by a period, then the function name concordance, and put "monstrous" in parentheses to look up the word monstrous in Moby Dick.

>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
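
To get a feel for what a concordance view involves, here is a rough Python 3 sketch over a plain list of tokens. It is not NLTK's implementation: the function keyword_in_context, its width parameter, and the sample sentence are assumptions made for illustration.

```python
def keyword_in_context(tokens, target, width=4):
    """Return (left, word, right) for each case-insensitive match of target."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, already split into tokens.
tokens = "a most monstrous size and a Monstrous fable indeed".split()
hits = keyword_in_context(tokens, "monstrous", width=2)
for left, word, right in hits:
    print(left, '[' + word + ']', right)
```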

 

6. Find other words that appear in contexts similar to those of a given word.

>>> text1.similar("monstrous")
Building word-context index...
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving

 

 

7. The function common_contexts lets us examine the contexts shared by two or more words.

>>> text2.common_contexts(["monstrous","very"])
Building word-context index...
a_lucky a_pretty am_glad be_glad is_pretty
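
In the output above, a_pretty means that both words occur between "a" and "pretty". A simplified Python 3 sketch of that idea, treating a context as the pair (previous token, next token). The helper names contexts and shared_contexts and the sample sentence are hypothetical, and NLTK's own method does more (for example, case normalization).

```python
def contexts(tokens, word):
    """Return the set of (previous, next) token pairs surrounding word."""
    return {(tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] == word}


def shared_contexts(tokens, words):
    """Return the contexts common to all words, formatted as prev_next."""
    shared = contexts(tokens, words[0])
    for w in words[1:]:
        shared &= contexts(tokens, w)
    return sorted(prev + '_' + nxt for prev, nxt in shared)


# Hypothetical sample text, already split into tokens.
tokens = "it is a very pretty day and a monstrous pretty house".split()
shared = shared_contexts(tokens, ["monstrous", "very"])
print(shared)
```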

 

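8. We can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed with a dispersion plot. Each vertical stripe represents one occurrence of a word, and each row represents the entire text.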

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
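
The data behind such a plot is just the list of token positions for each word. A minimal Python 3 sketch that collects those positions without drawing anything (the function word_offsets and the sample sentence are made up for illustration):

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Hypothetical sample text, already split into tokens.
tokens = "freedom and duties of citizens demand freedom".split()
offsets = word_offsets(tokens, ["citizens", "freedom"])
print(offsets)
```

Plotting each list of positions along one row per word would give the stripes described above.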

 

 

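9. Let's try generating some random text in the various styles we have just seen. Although the text is random, it reuses common words and phrases from the source text, which gives us a sense of its style and content.

>>> text3.generate()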
Building ngram index...
In the day that the one people shall be his wife ; and she said ,
because he had made an he and the tree of life , I stood upon the
earth , which is in thine hand upon the earth with you ; for thou
knowest my service which I command thee . And we said unto his
brethren , and said , I give thee ? separate thyself , I pray thee ,
and worshipped the LORD was with me ? Whereas thou hast not suffered
me to speak unto the God of your fathers . Moreover
>>> text3.generate()
In the mount of Gilead . And Bilhah Rachel ' s son ; and the fowls of
the knowledge of good and evil , bless the lads ; and Jacob held his
peace until they have brought guiltiness upon us . For the LORD
scatter them in ward . And Joseph took an oath betwixt us and our lan
Wherefore shall we die in thy sight . And when they spake unto his
father , and kissed him : but he found them in Israel in lying with
Jacob ' s venison , that the firstborn said unto Cain ,
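
One simple way to produce text of this kind is to record which words follow which, then walk those records at random. The Python 3 sketch below uses bigrams only and is an assumption for illustration, not a description of how NLTK's generate() works. The names build_followers and random_walk and the sample sentence are hypothetical.

```python
import random
from collections import defaultdict


def build_followers(tokens):
    """Map each token to the list of tokens observed right after it."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers


def random_walk(followers, start, length=10, seed=0):
    """Generate up to `length` tokens by repeatedly picking a random follower."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    word, output = start, [start]
    for _ in range(length - 1):
        candidates = followers.get(word)
        if not candidates:  # dead end: no token ever followed this one
            break
        word = rng.choice(candidates)
        output.append(word)
    return output


# Hypothetical sample text, already split into tokens.
tokens = "and the LORD said unto him and the man said unto the LORD".split()
model = build_followers(tokens)
generated = random_walk(model, "and", length=8)
print(' '.join(generated))
```

Because frequent followers appear more often in each list, they are also chosen more often, which is why the output echoes the phrasing of the source.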

 

 

10. Let's find the length of a text from start to finish, counted in the words and punctuation symbols that appear in it.

>>> len(text3)
44764

 

set(text3) gives the vocabulary of text3.

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah', 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan', 'Am', 'Amal', 'Amalek', 'Amalekites', 'Ammon', 'Amorite', 'Amorites', 'Amraphel', 'An', 'Anah', 'Anamim', 'And', 'Aner', 'Angel', 'Appoint', 'Aram', 'Aran', 'Ararat', 'Arbah', 'Ard', 'Are', 'Areli', 'Arioch', 'Arise', 'Arkite', 'Arodi', 'Arphaxad', 'Art', 'Arvadite', 'As', 'Asenath', 'Ashbel', 'Asher', 'Ashkenaz', 'Ashteroth', 'Ask', 'Asshur', 'Asshurim', 'Assyr', 'Assyria', 'At', 'Atad', 'Avith', 'Baalhanan', 'Babel', 'Bashemat

 

This shows that each word is used about 16 times on average.

>>> from __future__ import division
>>> len(text3) / len(set(text3))
16.050197203298673

Count how many times a word occurs in the text.

>>> text3.count("smote")
5
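
The same measurements work on any plain Python list of tokens. A small Python 3 sketch, where the helper name average_uses and the sample sentence are made up for illustration:

```python
def average_uses(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))


# Hypothetical sample text, already split into tokens.
tokens = "in the beginning God created the heaven and the earth".split()

print(len(tokens))           # total number of tokens
print(sorted(set(tokens)))   # the vocabulary: each distinct token once
print(average_uses(tokens))  # tokens per distinct token
print(tokens.count("the"))   # occurrences of one particular token
```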

 

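11. We can pick out the 1st, 173rd, or 14,278th word in a printed text. Similarly, we can identify an element of a Python list by the order in which it occurs in the list. The number that represents this position is called the element's index.

>>> text4[173]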
'awaken'
>>> text4.index('awaken')
173
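
Two details are easy to miss here: indexes start at 0, and index() reports only the first occurrence of a word. A short Python 3 sketch on a made-up token list:

```python
# Hypothetical sample tokens.
tokens = ['to', 'awaken', 'the', 'nation', 'to', 'its', 'duties']

second = tokens[1]              # indexes start at 0, so this is the 2nd token
first_to = tokens.index('to')   # 'to' occurs twice; only position 0 is reported
middle = tokens[2:4]            # a slice: items 2 and 3, item 4 is excluded

print(second, first_to, middle)
```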

 

12. Build a sorted vocabulary from a list of tokens and take its last two items.

>>> saying =['ni','hao', 'huang','chengdu','ni', 'chifanlema', 'wo','huang','ele']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
['ni', 'wo']

 

 

13. We use FreqDist to find the 50 most frequent words in Moby Dick. One way to do this would be to keep a counter for each vocabulary item.

>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 19317 samples and 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']

>>> fdist1['whale']  # "whale" occurs 906 times
906
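
The session above relies on keys() returning a list ordered by frequency. In Python 3, keys() returns a view that cannot be sliced, so asking for the most common items directly is the safer habit. A sketch of the counter idea using collections.Counter from the standard library, on a made-up token list:

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
tokens = "the whale , the sea , and the ship".split()

counts = Counter(tokens)          # one counter per distinct token
top_two = counts.most_common(2)   # (token, count) pairs, most frequent first
print(top_two)
print(counts['whale'])
```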

 

 

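14. A cumulative frequency plot of the 50 most frequent words in Moby Dick. These words account for nearly half of all the tokens.

>>> fdist1.plot(50, cumulative=True)

fdist1.hapaxes() lists the words that occur only once (hapaxes).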
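
Both ideas can be tried without plotting. The Python 3 sketch below computes the running share of tokens covered by the most frequent words, and the words that occur exactly once. It uses collections.Counter on a made-up token list and is not NLTK's code.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

# Words that occur exactly once.
hapaxes = sorted(w for w, n in counts.items() if n == 1)

# Running share of all tokens covered, most frequent word first.
running, cumulative = 0, []
for word, n in counts.most_common():
    running += n
    cumulative.append((word, running / len(tokens)))

print(hapaxes)
print(cumulative[:2])
```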

 

15. To find the words in the text's vocabulary that are more than 15 characters long:

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

 

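16. Below are all the words in the chat corpus that are longer than 7 characters and occur more than 7 times. len(w) > 7 ensures that the words are longer than seven characters, and fdist5[w] > 7 ensures that they occur more than seven times.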

>>> fdist5=FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']
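
The two conditions can be tried on a small scale with collections.Counter. A Python 3 sketch in which the function frequent_long_words, its threshold parameters, and the sample tokens are assumptions for illustration:

```python
from collections import Counter


def frequent_long_words(tokens, min_length=7, min_count=7):
    """Words longer than min_length that occur more than min_count times."""
    counts = Counter(tokens)
    return sorted(w for w in counts
                  if len(w) > min_length and counts[w] > min_count)


# Hypothetical sample tokens; thresholds are lowered to suit the tiny sample.
tokens = ["everyone"] * 3 + ["hi"] * 5 + ["seriously"] * 2 + ["tomorrow"]
result = frequent_long_words(tokens, min_length=7, min_count=1)
print(result)
```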

 

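17. A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that their words cannot be replaced by similar words; for example, maroon wine sounds very odd. To find collocations, we start by extracting the word pairs, also called bigrams, from the text, which is easy to do with the function bigrams(). We then look for bigrams that occur more often than we would expect from the frequencies of the individual words. The collocations() function does this for us.

The collocations found in a text are very characteristic of its style.

>>> text4.collocations()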
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
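
The first step, extracting bigrams, is easy to sketch in Python 3 by pairing each token with its successor. The code below only counts raw pairs, so it does not perform the comparison against individual word frequencies that makes a pair a collocation. The helper name word_pairs and the sample sentence are made up.

```python
from collections import Counter


def word_pairs(tokens):
    """Return the list of adjacent token pairs (bigrams)."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical sample text, already split into tokens.
tokens = "fellow citizens of the United States and fellow citizens abroad".split()
pair_counts = Counter(word_pairs(tokens))
most_frequent_pair = pair_counts.most_common(1)
print(most_frequent_pair)
```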

 

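18. First compute the distribution of word lengths in text1, then list how many words there are of each length, then find the most frequent length, and finally find the proportion of all words that have that length.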

>>> fdist = FreqDist([len(w) for w in text1])
>>> fdist
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist.keys()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]
>>> fdist.items()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
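
The value returned by freq(3) is simply the count for length 3 divided by the total number of tokens, here 50223 / 260819. The same steps on a small made-up sample, sketched in Python 3 with collections.Counter:

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
tokens = "the cat sat on a mat by the door".split()

# Count how many tokens there are of each length.
length_counts = Counter(len(w) for w in tokens)

# The most frequent length and how often it occurs.
most_common_length, count = length_counts.most_common(1)[0]

# Its share of all tokens, which is what freq() reports.
share = count / len(tokens)
print(most_common_length, count, share)
```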


 

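19. Further analysis of word length might help us understand differences between authors, genres, or languages. The table below summarizes the functions defined in NLTK's frequency distribution class.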

 

 

 

 

20. We can also use the functions listed in Table 1-4 to test various properties of words.

 

 

 

>>> [w for w in text1 if w.isdigit()]
['1851', '890', '1671', '1652', '500', '1668', '1729', '1772', '1778', '40', '1690', '1821', '10', '440', '1839', '1840', '13', '1846', '1828', '1828', '1', '2', '3', '4', '5', '6', '7', '1836', '1839', '1833', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '1750', '1788', '000', '000', '4', '000', '000', '20', '000', '000', '7', '000', '000', '25', '26', '27', '28', '29', '30', '31', '32', '1820', '1839', '1776', '1850', '1', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '1851', '45', '1820', '45', '1807', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '1671', '1793', '1807', '1825', '1836', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '500', '76', '77', '78', '79', '80', '81', '2000', '82', '83', '84', '85', '1851', '86', '87', '88', '89', '1695', '000', '000', '1492', '90', '3', '3', '91', '92', '1791', '93', '94', '95', '96', '97', '98', '99', '100', '101', '1775', '1775', '1726', '1778', '1819', '180', '400', '000', '60', '000', '150', '000', '550', '000', '72', '000', '2', '800', '20', '000', '144', '000', '550', '10', '800', '10', '800', '30', '180', '5', '400', '550', '102', '103', '104', '1779', '1842', '25', '000', '105', '3', '1825', '13', '000', '4', '000', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135']

 

>>> sorted([w for w in set(text1) if w.endswith('ableness')])
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
>>> sorted([w for w in set(text4) if 'gnt' in w])
['Sovereignty', 'sovereignties', 'sovereignty']
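
These tests are ordinary Python string methods, so they can be tried on any list of words. A Python 3 sketch on a made-up token list, where each filter uses one test:

```python
# Hypothetical sample tokens.
tokens = ['1851', 'Whale', 'sovereignty', 'comfortableness', 'THE', 'sea']

digits = [w for w in tokens if w.isdigit()]             # only digit characters
titlecase = [w for w in tokens if w.istitle()]          # initial capital only
uppercase = [w for w in tokens if w.isupper()]          # all cased letters upper
suffixed = [w for w in tokens if w.endswith('ableness')]
containing = [w for w in tokens if 'gnt' in w]          # substring test

print(digits, titlecase, uppercase, suffixed, containing)
```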