1 Find the words in a file that end in "ing".
In [5]: for line in open("regular_express.txt"):
   ...:     for word in line.split():
   ...:         if word.endswith('ing'):
   ...:             print word
   ...:
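The same filter can be written as a small reusable function. This is only a minimal sketch in Python 3: `words_ending_with` is a hypothetical helper name, and the sample lines are made up to stand in for the contents of regular_express.txt, which are not shown in these notes.

```python
def words_ending_with(lines, suffix="ing"):
    """Return every whitespace-separated word in `lines` that ends with `suffix`."""
    return [word for line in lines for word in line.split() if word.endswith(suffix)]

# Made-up sample lines standing in for the file contents (an assumption).
sample_lines = ["I am reading a book", "nothing is missing here"]
print(words_ending_with(sample_lines))
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.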
2 To get true (floating-point) division for all numbers in Python 2, import division from the __future__ module:
>>> 1/3
>>> from __future__ import division
>>> 1/3
>>> 1.0/3.0
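For comparison, here is a short sketch of the two kinds of division as they behave in Python 3, where `/` is already true division and no import is needed. The variable names are only illustrative.

```python
# Python 3: "/" is always true division, "//" is floor division.
true_quotient = 1 / 3        # a float close to 0.333
floor_quotient = 1 // 3      # the integer 0
float_quotient = 1.0 / 3.0   # the same value as true_quotient

print(true_quotient, floor_quotient, float_quotient)
```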
>>> ''.join(['money','python'])
>>> 'money python'.split
<built-in method split of str object at 0x0211B340>
>>> 'money python'.split()
['money', 'python']
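A small sketch of how join and split relate to each other: the string that join is called on is used as the separator, and split with no argument splits on whitespace. The variable names are only illustrative.

```python
joined = "".join(["money", "python"])    # empty separator: words run together
spaced = " ".join(["money", "python"])   # a single space as the separator
parts = spaced.split()                   # split on whitespace, back to a list

print(joined)
print(spaced)
print(parts)
```

Note that split must be called with parentheses; without them the expression only refers to the method object, as in the session above.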
4 Load everything from NLTK's book module. The book module contains all the data you need to read this chapter.
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
5 Next, type text1 followed by a period, then the function name concordance, and put "monstrous" in parentheses, to look up the word monstrous in Moby Dick.
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
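To make the idea of a concordance concrete, here is a minimal sketch that works on a plain list of tokens. It is not NLTK's implementation: `simple_concordance` and its `width` parameter are hypothetical names, and the matching is case-insensitive only because the output above lists both "monstrous" and "Monstrous".

```python
def simple_concordance(tokens, target, width=3):
    """Return (left, word, right) for each occurrence of `target`,
    with up to `width` tokens of context on each side."""
    target = target.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# A made-up token list, loosely based on the lines shown above.
tokens = "a most monstrous size and the Monstrous Pictures of Whales".split()
for left, word, right in simple_concordance(tokens, "monstrous"):
    print(left, "[" + word + "]", right)
```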
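6 Look at a word's contexts. similar() lists other words that appear in similar contexts, and common_contexts() shows the contexts shared by two or more words.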
>>> text1.similar("monstrous")
Building word-context index...
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving
>>> text2.common_contexts(["monstrous","very"])
Building word-context index...
a_lucky a_pretty am_glad be_glad is_pretty
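The notion of a shared context can be sketched without NLTK by treating the context of a word as the pair (previous word, next word). This is only an illustration under that assumption; `contexts_by_word` and `shared_contexts` are hypothetical helper names, and the sample sentence is made up.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        pair = (tokens[i - 1].lower(), tokens[i + 1].lower())
        contexts[tokens[i].lower()].add(pair)
    return contexts

def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    contexts = contexts_by_word(tokens)
    return set.intersection(*(contexts[w.lower()] for w in words))

tokens = "a very pretty house and a monstrous pretty garden ; I am very glad".split()
shared = shared_contexts(tokens, ["monstrous", "very"])
print(sorted("_".join(pair) for pair in shared))
```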
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
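A dispersion plot shows where in a text each word occurs. The data behind such a plot is just a list of positions per word, which the following sketch computes; `word_offsets` is a hypothetical helper name and the token list is made up.

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions where it occurs."""
    return {word: [i for i, token in enumerate(tokens) if token == word]
            for word in words}

tokens = "citizens value freedom and freedom needs citizens".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])
print(offsets)
```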
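9 Let's try generating some random text in the various styles we have just seen. Although the text is random, it reuses common words and phrases from the source text, which gives us a sense of its style.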
>>> text3.generate()
Building ngram index...
In the day that the one people shall be his wife ; and she said ,
because he had made an he and the tree of life , I stood upon the
earth , which is in thine hand upon the earth with you ; for thou
knowest my service which I command thee . And we said unto his
brethren , and said , I give thee ? separate thyself , I pray thee ,
and worshipped the LORD was with me ? Whereas thou hast not suffered
me to speak unto the God of your fathers . Moreover
>>> text3.generate()
In the mount of Gilead . And Bilhah Rachel ' s son ; and the fowls of
the knowledge of good and evil , bless the lads ; and Jacob held his
peace until they have brought guiltiness upon us . For the LORD
scatter them in ward . And Joseph took an oath betwixt us and our lan
Wherefore shall we die in thy sight . And when they spake unto his
father , and kissed him : but he found them in Israel in lying with
Jacob ' s venison , that the firstborn said unto Cain ,
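One simple way to produce random text that reuses the phrases of a source is a random walk over its bigrams. The sketch below illustrates that idea only; it makes no claim about how NLTK's generate() works internally, and `build_bigram_table`, `generate_text` and the `seed` parameter are hypothetical names.

```python
import random
from collections import defaultdict

def build_bigram_table(tokens):
    """Map each token to the list of tokens that follow it in the source."""
    followers = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        followers[current].append(following)
    return followers

def generate_text(tokens, length=10, seed=0):
    """Random walk over the bigram table, starting from the first token."""
    rng = random.Random(seed)
    followers = build_bigram_table(tokens)
    word = tokens[0]
    output = [word]
    for _ in range(length - 1):
        if not followers[word]:      # dead end: nothing ever follows this word
            break
        word = rng.choice(followers[word])
        output.append(word)
    return output

source = "In the day that the LORD God made the earth and the heavens".split()
print(" ".join(generate_text(source, length=8, seed=1)))
```

Every adjacent pair of words in the output also occurs in the source, which is why the result echoes the source's style even though the sequence is random.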
>>> len(text3)
set(text3) gives the vocabulary of text3.
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah', 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan', 'Am', 'Amal', 'Amalek', 'Amalekites', 'Ammon', 'Amorite', 'Amorites', 'Amraphel', 'An', 'Anah', 'Anamim', 'And', 'Aner', 'Angel', 'Appoint', 'Aram', 'Aran', 'Ararat', 'Arbah', 'Ard', 'Are', 'Areli', 'Arioch', 'Arise', 'Arkite', 'Arodi', 'Arphaxad', 'Art', 'Arvadite', 'As', 'Asenath', 'Ashbel', 'Asher', 'Ashkenaz', 'Ashteroth', 'Ask', 'Asshur', 'Asshurim', 'Assyr', 'Assyria', 'At', 'Atad', 'Avith', 'Baalhanan', 'Babel', 'Bashemat
This shows that each word is used 16 times on average.
>>> from __future__ import division
>>> len(text3) / len(set(text3))
>>> text3.count("smote")
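The same measurements work on any list of tokens. Here is a minimal sketch on a made-up list, written for Python 3 (where `/` is already true division); `lexical_diversity` is a hypothetical helper name.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

tokens = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]
print(len(tokens))                # number of tokens
print(sorted(set(tokens)))        # the vocabulary (distinct tokens)
print(lexical_diversity(tokens))  # tokens per distinct token
print(tokens.count("the"))        # occurrences of one word
```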
11 We can pick out the 1st, 173rd, or 14,278th word in a printed text. Similarly, we can identify an element of a Python list by its position in the list.
>>> text4[173]
>>> text4.index('awaken')
>>> saying =['ni','hao', 'huang','chengdu','ni', 'chifanlema', 'wo','huang','ele']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
['ni', 'wo']
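A short sketch of the list operations used here, on a made-up list: indexing by position, looking up the position of a word, and slicing from the end. Positions start at 0, and index() returns the first occurrence only.

```python
words = ["ni", "hao", "huang", "chengdu", "ni", "chifanlema"]

first = words[0]                   # the element at position 0
position = words.index("huang")    # position of the first "huang"
last_two = words[-2:]              # slice holding the last two elements
first_ni = words.index("ni")       # "ni" occurs twice; the first position wins

print(first, position, last_two, first_ni)
```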
13 We use FreqDist to find the 50 most common words in Moby Dick. One way to do this by hand would be to keep a counter for each vocabulary item.
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 19317 samples and 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906    # 'whale' occurs 906 times
14 Cumulative frequency plot of the 50 most frequent words in Moby Dick; these words account for nearly half of all the tokens.
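The claim behind such a plot can be checked numerically: sum the counts of the n most frequent words and divide by the total number of tokens. The sketch below does this with the standard library's Counter rather than NLTK; `cumulative_share` is a hypothetical helper name and the token list is made up.

```python
from collections import Counter

def cumulative_share(tokens, n):
    """Fraction of all tokens covered by the n most frequent token types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

tokens = "the whale and the sea and the ship".split()
print(cumulative_share(tokens, 2))   # share covered by the two most frequent words
```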
15 Find the words in the text's vocabulary that are more than 15 characters long.
>>> V=set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
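16 Here are all the words in the chat corpus that are longer than 7 characters and occur more than 7 times. len(w) > 7 ensures that every word is longer than seven characters, and fdist5[w] > 7 ensures that each word occurs more than 7 times.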
>>> fdist5=FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']
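The combined length-and-frequency filter can be sketched with the standard library's Counter in place of FreqDist. `long_frequent_words` and its parameters are hypothetical names, and the token list is made up so the result is easy to check by hand.

```python
from collections import Counter

def long_frequent_words(tokens, min_length, min_count):
    """Words longer than `min_length` that occur more than `min_count` times."""
    counts = Counter(tokens)
    return sorted(w for w in counts
                  if len(w) > min_length and counts[w] > min_count)

tokens = ["something"] * 3 + ["hi"] * 5 + ["tomorrow"] + ["everyone"] * 2
print(long_frequent_words(tokens, min_length=7, min_count=1))
```

Both conditions must hold: "hi" is frequent but too short, and "tomorrow" is long enough but occurs only once.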
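17 A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not.
A characteristic of collocations is that their words cannot be replaced by words with similar senses. For example, maroon wine sounds odd.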
>>> text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
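A rough feel for collocations can be had by counting adjacent word pairs and discarding pairs that contain very common words. The sketch below does only that; NLTK's collocations() scores candidates differently, so this is an illustration of the idea rather than a reimplementation. `frequent_bigrams` and the tiny `STOPWORDS` set are assumptions made for the example.

```python
from collections import Counter

# A tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = {"the", "of", "a", "and", "is"}

def frequent_bigrams(tokens, min_count=2):
    """Adjacent word pairs seen at least `min_count` times, with no stopwords."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, count in pairs.most_common()
            if count >= min_count and not (set(pair) & STOPWORDS)]

tokens = ("the red wine and the red wine of the United States "
          "and the United States").split()
print(frequent_bigrams(tokens))
```

Pairs such as "the red" are just as frequent as "red wine" in this sample, which is why the stopword filter is needed to keep them out.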
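18 First find the distribution of word lengths in text1, then list the count for each length, then find the most frequent length, and finally find the proportion of all tokens that have that length.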
>>> fdist = FreqDist([len(w) for w in text1])
>>> fdist
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist.keys()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]
>>> fdist.items()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
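The steps above can be reproduced on a small scale with the standard library's Counter. This is a sketch on a made-up sentence, not the NLTK FreqDist API; the last line redoes the proportion for word length 3 using the two counts that appear in the transcript above (50223 tokens of length 3 out of 260819).

```python
from collections import Counter

tokens = "I saw a whale and the whale saw me".split()

length_counts = Counter(len(w) for w in tokens)            # word length -> count
most_common_length, count = length_counts.most_common(1)[0]
share = count / len(tokens)                                # proportion of tokens

print(sorted(length_counts.items()))
print(most_common_length, count, share)

# The same proportion for Moby Dick, from the counts in the transcript.
moby_share = 50223 / 260819
print(round(moby_share, 4))
```

So words of length 3 make up roughly 19% of the tokens in the book.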
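19 Further analysis of word length might help us understand differences between authors, genres, or languages. The following table summarizes the functions defined in NLTK's frequency distribution class.
20 We can also use the functions listed in Table 1-4 to test various properties of words.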
>>> [w for w in text1 if w.isdigit()]
['1851', '890', '1671', '1652', '500', '1668', '1729', '1772', '1778', '40', '1690', '1821', '10', '440', '1839', '1840', '13', '1846', '1828', '1828', '1', '2', '3', '4', '5', '6', '7', '1836', '1839', '1833', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '1750', '1788', '000', '000', '4', '000', '000', '20', '000', '000', '7', '000', '000', '25', '26', '27', '28', '29', '30', '31', '32', '1820', '1839', '1776', '1850', '1', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '1851', '45', '1820', '45', '1807', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '1671', '1793', '1807', '1825', '1836', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '500', '76', '77', '78', '79', '80', '81', '2000', '82', '83', '84', '85', '1851', '86', '87', '88', '89', '1695', '000', '000', '1492', '90', '3', '3', '91', '92', '1791', '93', '94', '95', '96', '97', '98', '99', '100', '101', '1775', '1775', '1726', '1778', '1819', '180', '400', '000', '60', '000', '150', '000', '550', '000', '72', '000', '2', '800', '20', '000', '144', '000', '550', '10', '800', '10', '800', '30', '180', '5', '400', '550', '102', '103', '104', '1779', '1842', '25', '000', '105', '3', '1825', '13', '000', '4', '000', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135']
>>> sorted([w for w in set(text1) if w.endswith('ableness')])
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
>>> sorted([w for w in set(text4) if 'gnt' in w])
['Sovereignty', 'sovereignties', 'sovereignty']
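The word tests used in this section are ordinary Python string methods and operators, so they can be tried on any list of strings. A minimal sketch on a made-up list follows; note that the variable in the condition must be the same one the comprehension loops over, which is what went wrong with `term` earlier.

```python
words = ["1851", "Whale", "comfortableness", "sovereignty", "THE"]

digits = [w for w in words if w.isdigit()]               # only digit characters
ableness = [w for w in words if w.endswith("ableness")]  # given suffix
with_gnt = [w for w in words if "gnt" in w]              # contains a substring
titlecase = [w for w in words if w.istitle()]            # initial capital only
uppercase = [w for w in words if w.isupper()]            # all cased letters upper

print(digits, ableness, with_gnt, titlecase, uppercase)
```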