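Getting Started with Python: Strings

Basic operations

Strings are among the most commonly used data types in Python. A string is immutable: it cannot be modified in place, so any "change" actually creates a new string. String literals are delimited by single quotes (''), double quotes (""), or triple quotes (""" """ or ''' ''')

Creating strings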

a = 'abc'
b = "123"
print(a, b)
# abc 123
a + b
# 'abc123'
a * 10
# 'abcabcabcabcabcabcabcabcabcabc'

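Here + concatenates two strings and * repeats a string. Triple quotes create a string that can span several lines; line breaks are kept exactly as typed, while escape sequences such as \t and \n are still interpreted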

c = """
123
abc
@#$%^&*()
\t!\n"""
print(c)
# 
# 123
# abc
# @#$%^&*()
# 	!

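A statement can also be continued on the next line with a backslash (\)

a + "; " \
    "\t+"
# 'abc; \t+'
# Continue an expression on the next line; the backslash must be the last character on its line
a + "; " + \
    b
# 'abc; 123'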

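As in R, strings may contain escape sequences, and most of them mean the same thing; \n is a newline in both languages. The numeric escapes differ, though. In Python an octal escape is written \nnn, with up to three digits n in 0-7, and a hexadecimal escape is written \xnn, with exactly two digits n in 0-9, a-f, or A-F.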

print("\110\145\154\154\157\40\127\157\162\154\144\41")
# Hello World!
print("\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21")
# Hello World!

To have a string taken literally, without interpreting escape sequences, prefix it with the letter r or R (a raw string)

print(r"\110\145\154\154\157\40\127\157\162\154\144\41")
# \110\145\154\154\157\40\127\157\162\154\144\41
print(R"\n+\t")
# \n+\t

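Accessing strings

A string is essentially a sequence, so nearly every operation that sequences support also works on strings. Operations that would modify the string are not allowed, of course; attempting one simply raises an exception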

a = "Beautiful is better than ugly."
a[0]
# 'B'
a[-2]
# 'y'
a[::2]
# 'Batfli etrta gy'
a[::-1]
# '.ylgu naht retteb si lufituaeB'
a[0] = 2
# TypeError: 'str' object does not support item assignment
'a' in a
# True
len(a)
# 30
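
Since item assignment is rejected, the usual way to "change" a string is to build a new one from slices of the old one. A minimal sketch (the variable names s, t, and u are only for illustration):

```python
# Strings are immutable, so "editing" means building a new string.
s = "Beautiful is better than ugly."

# Replace the first character by slicing and concatenating.
t = "b" + s[1:]

# Replace a word in the middle the same way.
i = s.index("ugly")
u = s[:i] + "UGLY" + s[i + 4:]

print(t)  # beautiful is better than ugly.
print(u)  # Beautiful is better than UGLY.
print(s)  # the original string is unchanged
```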

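Built-in methods

The string type has many useful built-in methods, for example

Method	Purpose	Method	Purpose
`capitalize`	Uppercase the first character	`casefold`	Return a version suitable for caseless comparison
`count`	Count occurrences of a substring within an optional range	`center`	Center the string; padding defaults to spaces
`rjust`	Right-justify	`ljust`	Left-justify
`startswith`	Test whether the string starts with a prefix	`endswith`	Test whether the string ends with a suffix
`expandtabs`	Replace every `\t` with spaces	`find`	Index of the first occurrence of a substring, or `-1` if not found
`format`	Format the string	`format_map`	Format the string using a mapping
`index`	Like `find`, but raises an exception if not found	`join`	Join the items of an iterable, using this string as the separator
`lower`	Convert to lowercase	`upper`	Convert to uppercase
`maketrans`	Build a character translation table	`translate`	Map each character through a translation table
`replace`	Replace occurrences of a substring	`split`	Split the string into a list on a separator
`strip`	Remove leading and trailing whitespace	`swapcase`	Swap the case of each letter
`title`	Capitalize the first letter of each word	`partition`	Split at a separator into a tuple of length `3`
`encode`	Encode the string to bytes	`zfill`	Left-pad a numeric string with `0` to the given width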

Alignment and padding

print(a.ljust(50, '.'))
print(a.rjust(50, '='))
print(a.center(50, '-'))
# Beautiful is better than ugly.....................
# ====================Beautiful is better than ugly.
# ----------Beautiful is better than ugly.----------
'123'.zfill(10)
# '0000000123'

Searching and counting

a.count('u')
# 3
a.find('is')
# 10
a.find('you')
# -1
a.rfind('u')
# 25
a.index('a')
# 2
a.index('a', 3)
# 22
a.rindex('a')
# 22
a.index('you')
# ValueError: substring not found

String tests

a.endswith('.txt')
# False
a.isalnum()         # only letters and digits?
# False
'123'.isnumeric()   # only Unicode numeric characters
# True
u"一二三四".isnumeric()
# True
'123'.isdigit()     # all characters are digits (e.g. 0-9)
# True
'Ac'.isupper()
# False
'Ac'.istitle()
# True

Splitting and joining

a = "Beautiful is better than ugly."
a.split()            # splits on whitespace by default
# ['Beautiful', 'is', 'better', 'than', 'ugly.']
a.split('is')
# ['Beautiful ', ' better than ugly.']
a.split(maxsplit=1)  # maximum number of splits
# ['Beautiful', 'is better than ugly.']
a.rsplit(maxsplit=1)
# ['Beautiful is better than', 'ugly.']
'\nacb\tjkl\t\n'.strip()
# 'acb\tjkl'
a.rstrip('.')
# 'Beautiful is better than ugly'
a.partition('is')    # split into a tuple of length 3
# ('Beautiful ', 'is', ' better than ugly.')
a.rpartition(' ')
# ('Beautiful is better than', ' ', 'ugly.')
';'.join(a.split())  # join the pieces with a semicolon
# 'Beautiful;is;better;than;ugly.'

Replacing

a.replace('u', 'U')                    # one replacement rule per call
# 'BeaUtifUl is better than Ugly.'
a.replace('u', 'U', 1)
# 'BeaUtiful is better than ugly.'
trans = str.maketrans('ATCG', 'TAGC')  # build a translation table; every key is a single character
'ATTTGCGCGCGCTAAA'.translate(trans)
# 'TAAACGCGCGCGATTT'
trans = str.maketrans({'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'})
'ATTTGCGCGCGCTAAA'.translate(trans)
# 'TAAACGCGCGCGATTT'
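
Two further forms of str.maketrans are worth knowing: a third string argument lists characters to delete, and in the dictionary form the values may be longer strings or None. The sketch below illustrates both; the sample inputs are made up for the example.

```python
# Three-argument form: the third string lists characters to delete.
table = str.maketrans("ATCG", "TAGC", "N-")
cleaned = "AT-NGC".translate(table)
print(cleaned)  # TACG

# Dict form: keys are single characters; values may be longer strings,
# or None to delete the character.
expand = str.maketrans({"&": "&amp;", "<": "&lt;", ">": "&gt;", "\n": None})
escaped = "a<b & c>d\n".translate(expand)
print(escaped)  # a&lt;b &amp; c&gt;d
```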

Sorting

a = 'aaTGcGcGCgCtAAA'
sorted(a)
# ['A', 'A', 'A', 'C', 'C', 'G', 'G', 'G', 'T', 'a', 'a', 'c', 'c', 'g', 't']
sorted(a, key=lambda x: x.lower())  # key is a function called on each element before comparing
# ['a', 'a', 'A', 'A', 'A', 'c', 'c', 'C', 'C', 'G', 'G', 'G', 'g', 'T', 't']
sorted(a, key=lambda x: x.translate(trans))
# ['T', 'G', 'G', 'G', 'C', 'C', 'A', 'A', 'A', 'a', 'a', 'c', 'c', 'g', 't']

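String formatting

Percent formatting

Python supports C-style string formatting through the % (modulo) operator, also known as the string formatting or interpolation operator. It is used as format % values: each % conversion specification in format is replaced, by position or by mapping key, with the corresponding item from values

A conversion specification has the form: %[(somename)][flags][width][.precision][len]typecode

Field	Meaning
`%`	Required; marks the start of the specification
`(somename)`	Optional; mapping key, replaced by the value stored under that key in `values`
`flags`	Optional; conversion flags
`width`	Optional; minimum field width. If given as `*`, the width is read from the next item of `values`, and the object to convert comes after the width and the optional precision
`precision`	Optional; the number after `.` sets the precision. If given as `*`, the precision is read from the next item of `values`, and the value to convert comes after the precision
`len`	Optional; length modifier `h`, `l`, or `L`, accepted but ignored
`typecode`	Required; conversion type

Possible values for flags

`flag`	Meaning
`'0'`	Pad numeric values on the left with `0`
`'-'`	Left-align (the default is right-aligned)
`' '`	Put a space before positive numbers so that they line up with negative ones
`'+'`	Always show the sign, including `+` for positive numbers

The conversion types are

Type	Meaning	Type	Meaning
`'d'`	Decimal integer	`'o'`	Octal
`'x'`	Lowercase hexadecimal	`'X'`	Uppercase hexadecimal
`'e'`	Lowercase scientific notation	`'E'`	Uppercase scientific notation
`'f'`	Fixed-point floating point	`'g'`	Floating point; scientific notation if the exponent is less than `-4` or not less than the precision, fixed-point otherwise
`'c'`	Single character (an integer code point or a one-character string)	`'r'`	String; the object is converted with `repr`
`'s'`	String; the object is converted with `str`	`'a'`	String; the object is converted with `ascii`
`'%'`	A literal percent sign

Formatting numbers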

print('% .3f' % (-10.123000))
# -10.123
print('% .3f' % (10.123000))
#  10.123
print("%+10x" % 10)
#         +a
print("%03d" % 7)
# 007
print("%7.3f" % 2.3)
#   2.300
print("%6.3e" % 0.00314159265358)
# 3.142e-03
print("%s\t%s" % ('hello', 'world'))
# hello	world
print("%*.3f" % (7, 2.34567))     # width 7
#   2.346
print("%*.*f" % (7, 4, 2.34567))  # width 7, precision 4
#  2.3457
print("%*.*f%%" % (7, 2, 2.3456))
#    2.35%

Formatting strings and characters

print("%s\t%s" % ('hello', 'world'))
# hello	world
print("%10s$%-12s$" % ('hello', 'world'))
#      hello$world       $
print("%c < %c" % (65, 97))
# A < a
print("List: %r" % ([65, 97]))
# List: [65, 97]
class Test:
    def __init__(self, data):
        self.data = data
        
    def __repr__(self):
        return '__repr__:' + repr(self.data)
    
    def __str__(self):
        return '__str__:' + str(self.data)
t = Test([1, 2, 3])
print("%s\n%r" % (t, t))
# __str__:[1, 2, 3]
# __repr__:[1, 2, 3]

Formatting with a mapping

print('%(name)s is %(age)3d years old' % {'name': 'Tom', 'age': 19})
# Tom is  19 years old

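The `format` method

Strings can also be formatted with the str.format() method. Most of its features parallel % formatting. Replacement fields are marked with a pair of braces {}; to output a literal brace, double it: {{ or }}.

The syntax is: "{[field_name][!conversion][:format_spec]}"

Field	Meaning
`field_name`	Selects the object to format: a number (positional argument) or a keyword (named argument); if omitted, arguments are taken in order
`conversion`	Introduced by `!`; type conversion: `r`, `s`, or `a`
`format_spec`	Introduced by `:`; the formatting details follow

In effect, the arguments of format are passed to the matching brace fields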

"First {0}".format(1, 2)
# 'First 1'
"1: {} 2: {}".format('a', 'b')
# '1: a 2: b'
"My quest is {name}".format(1, name='Tom')
# 'My quest is Tom'
"players: {players[0]}".format(players=[1, 2, 3])
# 'players: 1'
"players: {{ {players[1]} }}".format(players=[1, 2, 3])
# 'players: { 2 }'
class Person:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name
        
t = Person('Tom', 'Cruise')
"first name : {0.first_name}, last name: {person.last_name}".format(t, person=t)
# 'first name : Tom, last name: Cruise'

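As with % formatting, conversion calls the corresponding string conversion function

t = Test([1, 2, 3])  # t was rebound to a Person above; switch back to a Test instance
"str() method: {0!s}".format(t)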
# 'str() method: __str__:[1, 2, 3]'
"repr() method: {test!r}".format(test=t)
# 'repr() method: __repr__:[1, 2, 3]'
"ascii() method: {!a}".format("G ë ê k s f ? r G ? e k s")
# "ascii() method: 'G \\xeb \\xea k s f ? r G ? e k s'"

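format_spec is the format specifier, with the following form:

[[fill]align][sign][#][0][width][grouping_option][.precision][type]

Field	Meaning	Values
`fill`	Padding character; only allowed when `align` is given	Any character; defaults to a space
`align`	Alignment: left, right, padding between sign and digits, or centered	`<`, `>`, `=`, `^`
`sign`	Sign handling for numbers	`+`, `-`, space
`#`	Alternate form; for example, adds the `0b`, `0o`, or `0x` prefix to binary, octal, and hexadecimal integers
`0`	Equivalent to a `fill` of `0` with an `align` of `=`
`width`	Minimum width of the formatted field
`grouping_option`	Thousands separator for numbers	`_`, `,`
`precision`	Precision (for `f`, the number of digits after the decimal point)
`type`	Conversion type	Adds a binary type `b`

Text alignment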

'{:=<30}'.format('left aligned')
# 'left aligned=================='
'{:+>30}'.format('right aligned')
# '+++++++++++++++++right aligned'
'{:-^30}'.format('centered')
# '-----------centered-----------'
'{:0=+10}'.format(1234567)
# '+001234567'
'{:0=-10}'.format(-1234567)
# '-001234567'

Thousands separators

'{:,}'.format(1234567890)
# '1,234,567,890'
'{:_}'.format(1234567890)
# '1_234_567_890'

Base conversion

"int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(27)
# 'int: 27;  hex: 1b;  oct: 33;  bin: 11011'
# Use # to add the prefix automatically
"int: {0:d};  hex: {0:#x};  oct: {0:#o};  bin: {0:#b}".format(67)
# 'int: 67;  hex: 0x43;  oct: 0o103;  bin: 0b1000011'

Percentages

"Percent: {:.2%}".format(0.1415926)
# 'Percent: 14.16%'

A few more complex examples

'{:02X}{:02X}{:02X}{:02X}'.format(*[0, 1, 168, 18])
# '0001A812'
int(_, 16)  # in the interactive interpreter, _ holds the most recently evaluated result
# 108562
for num in [18, 96, 5, 7, 24]:
    for base in 'bodx':
        print('{:10{base}}'.format(num, base=base), end=' ')
    print()
#      10010         22         18         12 
#    1100000        140         96         60 
#        101          5          5          5 
#        111          7          7          7 
#      11000         30         24         18 
# format_map takes a dictionary directly, so no keyword arguments are needed
pos = {
    'chrom': 'chr1', 
    'start': 189675224,
    'end': 189679632
}
'{chrom}:{start}-{end}'.format_map(pos)
# 'chr1:189675224-189679632'

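Python 3.6 introduced another way to format strings: the f-string, marked by an f in front of the opening quote. It works much like the format method, except that an expression can be written directly inside the braces (in the field_name position); the expression is evaluated and its value is inserted into the string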

f'1 + 2 = {1 + 2}'
# '1 + 2 = 3'
f"{pos['chrom']}:{pos['start']}-{pos['end']}"
# 'chr1:189675224-189679632'
f'{3.1415926: 10}'
# ' 3.1415926'
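
The conversion and format_spec parts described for format carry over to f-strings as well. A short sketch, using made-up variables name, price, and width:

```python
name, price, width = "coffee", 3.5, 12

# Format specs go after a colon, as in str.format(); a nested field
# ({width}) can supply part of the spec.
line = f"{name:<{width}}|{price:>8.2f}"
print(line)

# !r applies repr() before formatting.
quoted = f"{name!r}"
print(quoted)  # 'coffee'

# Any expression is allowed, including method calls and arithmetic.
total = f"{name.upper()} x 3 = {price * 3:.1f}"
print(total)  # COFFEE x 3 = 10.5
```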

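Template strings

Template strings are mainly intended for text translation (internationalization). They support $-based substitution, similar to variable expansion in the shell. The name after $ must be a valid identifier and may be wrapped in braces; the first non-identifier character ends the placeholder.

This feature is provided by the Template class in the standard-library string module

from string import Template

A simple example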

t = Template('my name is $name')
t.substitute(name='Tom')
# 'my name is Tom'
d = {
    'name': 'Tom',
    'age': 20
}
Template('$name is $age years old.').substitute(d)
# 'Tom is 20 years old.'

Two $ characters are an escape: they produce a single $ instead of starting a placeholder

Template('Give $name $$100').substitute(d)
# 'Give Tom $100'

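When a placeholder is immediately followed by characters that are valid in an identifier but are not part of the placeholder, use braces to separate them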

Template('${sub}thing will be OK!').substitute(sub='some')
# 'something will be OK!'

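substitute has a drawback: it raises an exception when something cannot be substituted, namely a KeyError when no value was passed for a placeholder and a ValueError when the text after $ is not a valid placeholder. The safe_substitute method instead leaves anything it cannot resolve in the output unchanged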

Template('Give $name $100').substitute(d)
# ValueError: Invalid placeholder in string: line 1, col 12
Template('Give $name $100').safe_substitute(d)
# 'Give Tom $100'
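
The example above fails because $100 is not a valid placeholder. The other failure case, a placeholder for which no value was supplied, can be sketched as follows (the template text is made up for the example):

```python
from string import Template

t = Template('$name is $age years old.')

# A missing key makes substitute() raise KeyError ...
try:
    t.substitute(name='Tom')
except KeyError as exc:
    missing = exc.args[0]
    print('missing key:', missing)  # missing key: age

# ... while safe_substitute() leaves the placeholder untouched.
partial = t.safe_substitute(name='Tom')
print(partial)  # Tom is $age years old.
```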

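The Template class can be subclassed to customize the formatting syntax; how to do that is beyond the scope of this book.

Regular expressions

Regular expression syntax is largely the same everywhere; what differs between programming languages is the implementation. We covered regular expressions in R earlier, and here we go through them in Python as well. Comparing the two makes it easier to understand how regular expressions are used, how the two languages approach them, and what they have in common.

In Python, regular expressions are provided by the built-in re module. The third-party regex module offers an API compatible with the standard-library re module, plus additional features and more complete Unicode support. For most purposes, though, the standard library is enough.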

import re

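Basic characters

Until the matching functions are introduced in detail, we mainly use findall to show match results. It takes three parameters

pattern: the matching rule
string: the target string to search
flags: matching flags

The function scans from left to right and returns all matches, in order, as a list

Python also has a set of characters with special meaning

. ^ $ * + ? { } [ ] \ | ( )

Special characters

Metacharacter	Meaning
`.`	Matches any character except a newline; with the `DOTALL` flag it matches any character at all
`\`	Escapes the following character; `\\` matches a literal backslash
`|`	`A|B` matches either `A` or `B`; if `A` matches, `B` is not tried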

re.findall('ab.', 'abc') 
# ['abc']
re.findall('ab.', 'ab\n') 
# []
re.findall('ab.', 'ab\n', re.DOTALL)
# ['ab\n']
re.findall('\.', 'aa')
# []
re.findall('\.', 'aa.')
# ['.']

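The backslash plague: the backslash is an escape character, so when the text to match contains \ characters, a corresponding number of extra backslashes has to be added to escape them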

re.findall('\\\\ab', '\\abc') # ['\\ab']

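In a regular expression that uses backslashes heavily, this leads to lots of repeated backslashes and makes the resulting string hard to read. The solution is raw string notation, in which the string literal no longer treats the backslash as an escape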

re.findall(r'\\ab', '\\abc')  # ['\\ab']
re.findall(r'\n', '\n')       # ['\n']

To match a literal | character, escape it; either \| or [|] works

re.findall('a|b', 'acb')
# ['a', 'b']
re.findall('[|]', 'ab|c')
# ['|']

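Boundary matching

Metacharacter	Meaning
`^`	Matches the start of the string; in `MULTILINE` mode it also matches at the start of each line (right after a `\n`). Inside a character set it means negation
`$`	Matches the end of the string, or just before a newline at the end of the string; in `MULTILINE` mode it also matches at the end of each line (just before a `\n`).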

re.findall('^ab', 'abcda\nabddd')                                     
# ['ab']
re.findall('^ab', 'abcda\nabddd', re.MULTILINE)
# ['ab', 'ab']
re.findall('[^a]', 'aaa\nbbb')
# ['\n', 'b', 'b', 'b']
re.findall('ab$', 'abcdab\nabdab')                                   
# ['ab']
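# Is the ab in the result above the one before or after \n? The following test shows it is the one after
re.findall('ab.$', 'abcdab1\nabdab2')
# ['ab2']
# In MULTILINE mode, both are matched
re.findall('ab$', 'abcdab\nabdab', re.MULTILINE)
# ['ab', 'ab']
# Matching $ against a string that ends with a newline gives two empty matches: one before the newline and one at the end of the string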
re.findall('$', 'abcdab1\n')
# ['', '']

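Quantifiers

Quantifier	Meaning
`*`	Matches the preceding item `0` or more times
`+`	Matches the preceding item `1` or more times
`?`	Matches the preceding item `0` or `1` times
`{m}`	Matches the preceding item exactly `m` times
`{m,n}`	Matches the preceding item between `m` and `n` times, with `m` as the lower bound and `n` as the upper bound
`*?`, `+?`, `??`, `{m,n}?`	Non-greedy versions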

re.findall('ab*', 'a')
# ['a']
re.findall('ab*', 'ab')
# ['ab']
re.findall('ab*', 'abbbbbbbbbbbbbb')
# ['abbbbbbbbbbbbbb']

re.findall('ab+', 'a')
# []
re.findall('ab+', 'ab')
# ['ab']

re.findall('ab?', 'a')
# ['a']
re.findall('ab?', 'ab')
# ['ab']
re.findall('ab?', 'abbbbbbbbbbbbbb')
# ['ab']

Exact count

re.findall('a{3}', 'aa') 
# []
re.findall('a{3}', 'aaaaa')
# ['aaa']

Count range

re.findall('a{3, 5}', 'aaaa')  # no space is allowed between "3," and "5"
# []
re.findall('a{3,5}', 'aaaa')
# ['aaaa']
re.findall('a{3,}', 'aaaa')
# ['aaaa']
re.findall('a{,5}', 'aaaa')
# ['aaaa', '']

The quantifiers above are all greedy: they match as much text as possible. Adding ? after one of them makes it non-greedy, so it matches as little text as possible

re.findall('<.*>', '<a>bcd>')
# ['<a>bcd>']
re.findall('<.*?>', '<a>bcd>')
# ['<a>']

A non-greedy range stays as close to its lower bound as possible

re.findall('a{3,}?', 'aaaaa')
# ['aaa']

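Character sets

Metacharacter	Meaning
`[ ]`	Denotes a set of characters. To match the brackets themselves, escape them: `\[`, `\]`

A range of characters is written with -: for example, [a-j] means the lowercase letters a through j and [1-6] the digits 1 through 6. To include a literal -, escape it or put it first or last in the set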

re.findall('[abc]', 'ab.')
# ['a', 'b']
re.findall('[a\-z]', '-')
# ['-']
re.findall('[-a]', '-')
# ['-']
re.findall('[a-]', '-')
# ['-']

Special characters lose their special meaning inside a set; for example, [(+*)] matches only the literal characters '(', '+', '*', and ')'

re.findall('[(+*)]', '+-*/()')
# ['+', '*', '(', ')']

Character classes such as \w and \d work as they do in R

re.findall('[\w]', 'abfagg-/*-')
# ['a', 'b', 'f', 'a', 'g', 'g']
re.findall('\d+', '123,bcd,001')
# ['123', '001']

A set is negated with ^, which only means negation when it is the first character of the set

re.findall('[^\w]', 'abfagg-/*-')
# ['-', '/', '*', '-']

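To match a literal [ or ], escape it with a backslash or, for ], put it first in the set

re.findall('\]', 'abc]')     # escaped with a backslash
# [']']
re.findall('[]{}]', ']abc')  # placed first in the set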
# [']']

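Capturing groups

Metacharacter	Meaning
`(...)`	Matches the expression inside the parentheses and marks the start and end of a group, whose contents can be retrieved afterwards

Each pair of parentheses forms a group. A group can be referenced as \number, so \1 is the first group. To match a literal ( or ), escape it or put it inside a character set: [(], [)].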

re.findall('a(b+)', 'abbb')
# ['bbb']
re.findall(r'(b)a\1', 'bab')
# ['b']

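Extension notation

The (?...) extension notation starts with ? inside the parentheses; the first character after it determines the syntax.

Inline flags

One or more of the letters 'a', 'i', 'L', 'm', 's', 'u', 'x' can follow the ?, and the pattern itself comes after the closing parenthesis. Each letter corresponds to a flags argument value, and the two ways of setting it are equivalent

Letter	flag	Meaning
`'a'`	`re.A` or `re.ASCII`	ASCII-only matching
`'i'`	`re.I` or `re.IGNORECASE`	Ignore case
`'L'`	`re.L` or `re.LOCALE`	Locale-dependent matching (bytes patterns only)
`'m'`	`re.M` or `re.MULTILINE`	Multi-line mode
`'s'`	`re.S` or `re.DOTALL`	`.` matches every character
`'u'`	`re.U`	Unicode matching; on by default in Python 3
`'x'`	`re.X` or `re.VERBOSE`	Verbose mode

Note: 'a', 'L', and 'u' are mutually exclusive as inline flags and cannot be combined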

re.findall('(?i)ab', 'Ab')      # ignore case
# ['Ab']
re.findall('ab', 'Ab', re.I)    # equivalent to the line above
# ['Ab']
re.findall('(?si)ab.', 'Ab\n')  # s and i combined
# ['Ab\n']
re.findall('^a.', 'ab\nac')     # without multi-line mode
# ['ab']
re.findall('(?m)^a.', 'ab\nac') # multi-line mode
# ['ab', 'ac']
re.findall('(?s)ab.', 'ab\n')   # . matches every character
# ['ab\n']

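Verbose mode lets you write more readable regular expressions by splitting them into sections and adding comments; whitespace in the pattern is ignored

re.findall(r"""(?x)\d +  # integer part
                \.       # decimal point
                \d *     # fractional part
                """, '3.1415na')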
# ['3.1415']

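Non-capturing groups

(?:...) is the non-capturing version of a parenthesized group: the substring it matches cannot be retrieved after the match or referenced later in the pattern. It combines well with | and {m}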

re.findall('(abc){2}', 'abcabc')
# ['abc']
re.findall('(?:abc){2}', 'abcabc')
# ['abcabc']

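This shows the difference between the two versions. With a capturing group, findall returns the text matched inside the parentheses as the result. With a non-capturing group, the parenthesized pattern is simply matched as part of the surrounding pattern and the whole match is returned. Here are some examples with nested groups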

re.findall('(a(bc))cbs', 'abccbs')
# [('abc', 'bc')]
re.findall('(a(?:bc))cbs', 'abccbs')
# ['abc']
re.findall('(abc)|cbs', 'cbs')
# ['']
re.findall('(?:abc)|cbs', 'cbs')
# ['cbs']

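Named groups

(?P<name>…) gives a group a name. Each name can belong to only one group and can be defined only once within a regular expression. (?P=name) refers back to whatever the group with that name captured, for example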

re.findall('(?P<name>abc)\\1', 'abcabc')
# ['abc']
re.findall('(?P<name>abc)(?P=name)', 'abcabc')
# ['abc']

Comments

(?#...) adds a comment to the pattern; its contents are ignored.

re.findall('abc(?#这是注释)123', 'abc123')
# ['abc123']

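Lookaround

(?=…) matches the preceding pattern only if … matches right after it; this is called a positive lookahead assertion. (?!…) matches the preceding pattern only if … does not match right after it; this is called a negative lookahead assertion

# Match 'Isaac ' only when it is followed by 'Asimov'
re.findall('Isaac (?=Asimov)', 'Isaac Asimov, Isaac Ash')
# ['Isaac ']
# Match 'Isaac ' only when it is not followed by 'Asimov'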
re.findall('Isaac. (?!Asimov)', 'Isaac1 Asimov, Isaac2 Ash')
# ['Isaac2 ']

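Clear enough, isn't it? Since there are assertions about what follows, it makes sense to have assertions about what comes before as well.

(?<=…) matches only if the current position is preceded by …; this is called a positive lookbehind assertion. (?<!…) matches only if the current position is not preceded by …; this is called a negative lookbehind assertion. They mirror the rules above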

re.findall('(?<=Isaac )Asimov.', 'Isaac Asimov1, Asimov2')
# ['Asimov1']
re.findall('(?<!Isaac )Asimov.', 'Isaac Asimov1, Asimov2')
# ['Asimov2']
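
Because lookaround assertions match a position rather than any text, they are handy for inserting something at specific places. The sketch below combines a lookbehind and a lookahead to add thousands separators; add_commas is a hypothetical helper written for this example, and it assumes the input consists only of digits.

```python
import re

def add_commas(digits):
    # Insert a comma at every position that has a digit before it
    # (lookbehind) and a multiple of three digits after it, up to the
    # end of the string (lookahead).
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)

print(add_commas('1234567'))  # 1,234,567
print(add_commas('999'))      # 999
```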

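Conditional matching: `(?(id/name)yes-pattern|no-pattern)`

If the group with the given id or name matched, the engine tries to match yes-pattern; otherwise it tries no-pattern. no-pattern is optional and can be omitted.

This is a bit like an if/else ternary expression, where id and name are the group number and group name

re.findall('(<)?(\w+@\w+(?:\.\w+))(?(1)>|$)', '<user_name@host163.com>')
# [('<', 'user_name@host163.com')]
re.findall('(<)?(\w+@\w+(?:\.\w+))(?(1)>|$)', 'user_name@host163.com>')
# []
re.findall('(<)?(\w+@\w+(?:\.\w+))(?(1)>|$)', '<user_name@host163.com')
# [('', 'user_name@host163.com')]
re.findall('(<)?(\w+@\w+(?:\.\w+))(?(1)>|$)', 'user_name@host163.com')
# [('', 'user_name@host163.com')]

Let's take this regular expression apart. The first group captures <, and the ? after it makes that < optional. The second group holds the e-mail address format, where \w stands for letters, digits, and the underscore. The third group is nested inside the second and declared non-capturing; it matches the . of the address and the characters after it. In the last group, (?(1)>|$), the 1 refers to the first group: if that group matched, a > is required; otherwise the end of the string ($) is required.

So the pattern is meant to match <user_name@host163.com> and user_name@host163.com, but not <user_name@host163.com or user_name@host163.com>

Then why does the third call still return a result? Because findall searches from every position, and with the ? there are two cases to try: when the < is consumed, no > can be found; when the < is skipped, the match starts one character later and user_name@host163.com is matched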

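Special sequences

As in R, Python has a set of special sequences, including

Sequence	Meaning	Sequence	Meaning
`\A`	Matches only at the start of the string	`\Z`	Matches only at the end of the string
`\b`	Matches a word boundary	`\B`	Matches a non-word boundary
`\d`	Matches a decimal digit	`\D`	Matches a non-digit
`\s`	Matches a whitespace character	`\S`	Matches a non-whitespace character
`\w`	Matches a letter, digit, or underscore	`\W`	Matches anything other than a letter, digit, or underscore

In a normal (non-raw) string, a word boundary has to be written with a double backslash, \\b, because \b on its own is the backspace character. A word boundary is the position between a \w and a \W character, or between a \w character and the start or end of the string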

re.findall(r'\bHello\b', 'Hello world! Hellooo')
# ['Hello']
re.findall('\\bHello\\b', 'Hello world! Hellooo')  # equivalent to the line above
# ['Hello']
re.findall(r'\bHello\b', 'Hello world! Hello.')
# ['Hello', 'Hello']
re.findall('\BHello\B', 'Hello worldHello123')
# ['Hello']

Matching digits and text

re.findall('\d+', 'ab123d\nabc')
# ['123']
re.findall('\D+', 'ab123d\nabc')
# ['ab', 'd\nabc']
re.findall('\s+', 'ab12 3d\nab\tc')
# [' ', '\n', '\t']
re.findall('\S+', 'ab12 3d\nab\tc')
# ['ab12', '3d', 'ab', 'c']
re.findall('\w+', 'user_name@host163.com')
# ['user_name', 'host163', 'com']
re.findall('\W+', 'user_name@host163.com')
# ['@', '.']

Matching the start and end

re.findall('\Aab.', 'abccadc\nabC', re.MULTILINE)
# ['abc']
re.findall(r'^ab.', 'abccadc\nabC')
# ['abc']
re.findall(r'dd\Z', 'abddacdd')
# ['dd']
re.findall(r'dd$', 'abddacdd')
# ['dd']

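Common functions

The examples so far have only used findall for matching, but there are several other useful functions

Function	Purpose	Function	Purpose
`search`	Scan the whole string and return the first match	`match`	Match at the start of the string
`fullmatch`	Match the entire string	`split`	Split a string; the separator can be a regular expression
`findall`	Return all matches	`finditer`	Return an iterator over the matches
`sub`	Replace matches	`subn`	Return the new string and the number of replacements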

re.search(pattern, string, flags=0)

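Scans through string for the first location where the pattern matches and returns the corresponding match object; returns None if nothing matches. Note that this is different from finding a zero-length match.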

ans = re.search('abc', 'abcdd')
if ans:
    print('Search result: ', ans.group())
else:
    print('No match')
# Search result:  abc

re.match(pattern, string, flags=0)

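Matches at the start of string and returns a match object on success; returns None if there is no match.

Note: even in multi-line mode, re.match() only matches at the start of the string, not at the start of each line.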

ans = re.match('abc', 'abcdd')
if ans:
    print('match result: ', ans.group())
else:
    print('No match')
# match result:  abc
ans = re.match('abc', 'babcdd')
if ans:
    print('match result: ', ans.group())
else:
    print('No match')
# No match

re.fullmatch(pattern, string, flags=0)

The whole string must match the regular expression; returns a match object if it does, and None otherwise

ans = re.fullmatch('abc.dd', 'abcddd')
if ans:
    print('Match result: ', ans.group())
else:
    print('No match')
# Match result:  abcddd

re.split(pattern, string, maxsplit=0, flags=0)

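Splits string by the occurrences of pattern. If pattern contains capturing parentheses, the separators are included in the result list as well. maxsplit sets the maximum number of splits; the remainder of the string is returned as the last element of the list.

re.split(r'\W+', 'Words, words, words.')     # split on runs of non-word characters (anything but letters, digits, underscore)
# ['Words', 'words', 'words', '']
re.split(r'(\W+)', 'Words, words, words.')   # the separators are kept in the result list
# ['Words', ', ', 'words', ', ', 'words', '.', '']
re.split(r'\W+', 'Words, words, words.', maxsplit=1)  # split once
# ['Words', 'words, words.']
re.split('(?i)[a-f]+', '0a3aB9')             # split on characters in [a-f], ignoring case
re.split('[a-f]+', '0a3aB9', flags=re.I)
# ['0', '3', '9']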

re.findall(pattern, string, flags=0)

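Scans from left to right and returns all matches, in order, as a list. All the earlier examples use findall, so no further example is given here.

re.finditer(pattern, string, flags=0)

Much like findall, except that it returns an iterator of match objects

for ans in re.finditer(r'\w+', 'Words, words, words.'):
    print(ans.group(), end='\t')
# Words	words	words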

re.sub(pattern, repl, string, count=0, flags=0)

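Replaces the substrings of string that match pattern with repl and returns the resulting string; if the pattern is not found, string is returned unchanged. The optional count argument is the maximum number of replacements (a non-negative integer); by default all matches are replaced.

repl can be a string or a function. If it is a string, backslash escapes in it are processed: \n is converted to a newline; unknown escapes such as \& are left alone; and group references such as \2 are replaced with the substring matched by group 2 of the pattern.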

re.sub('\w+', '123', 'hello, world, hello python')
# '123, 123, 123 123'
re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
       r'static PyObject*\npy_\1(void)\n{',
       'def myfunc():')
# 'static PyObject*\npy_myfunc(void)\n{'

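Here \1 refers to the first capturing group, that is, the function name.

If repl is a function, it is called for every non-overlapping occurrence of pattern. It takes a match object as its argument and returns the replacement string.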

def dashrepl(matchobj):
    if matchobj.group(0) == '-': 
        return ' '
    else: 
        return '-'

re.sub('-{1,2}', dashrepl, 'pro----gram-files')
# 'pro--gram files'
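
A replacement function is most useful when the new text has to be computed from the match. A minimal sketch (the function double and the sample strings are made up for the example):

```python
import re

def double(match):
    # group(0) is the matched text; the return value replaces it.
    return str(int(match.group(0)) * 2)

result = re.sub(r'\d+', double, '3 apples and 12 pears')
print(result)  # 6 apples and 24 pears

# Captured groups are available on the match object as well.
swapped = re.sub(r'(\w+)=(\w+)',
                 lambda m: f'{m.group(2)}={m.group(1)}',
                 'a=1, b=2')
print(swapped)  # 1=a, 2=b
```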

re.subn(pattern, repl, string, count=0, flags=0)

Same as sub(), but returns a tuple (new string, number of replacements)

re.subn('\w+', '123', 'hello, world, hello python')
# ('123, 123, 123 123', 4)

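Match objects

The previous section mentioned match objects. As the name suggests, a match object stores information about a successful match; when nothing matches, None is returned instead. A simple if statement is therefore enough to test for a match. Both match and search return a match object on success, for example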

match = re.search(pattern, string)
if match:
    process(match)

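Match objects have a number of methods and attributes. The methods include

Method	Purpose	Method	Purpose
`group`	Return one or more subgroups of the match	`groups`	Return a tuple of all subgroups
`groupdict`	Return the named groups as a dictionary	`span`	Range of the match
`start`	Start position of the match	`end`	End position of the match

Match.group([group1, ...])

With a single argument, the result is a string; with several arguments, the result is a tuple with one item per argument; with no arguments, group1 defaults to 0 and the whole match is returned. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError is raised. If the pattern uses named groups, the groupN arguments can also be group names. If a group matched several times, only its last match is returned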

m = re.match(r"(\w+) (\w+)", "Lebron James, Kobe")
m.group()
# 'Lebron James'
m.group(0)
# 'Lebron James'
m.group(1)
# 'Lebron'
m.group(2)
# 'James'
m.group(1, 2)
# ('Lebron', 'James')
m.group(3)
# IndexError: no such group
# Named groups
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Lebron James")
m.group('first_name')  # == m.group(1); numeric indexes still work
# 'Lebron'
m.group('last_name')   # == m.group(2)
# 'James'
m = re.match(r"(..)+", "aabbcc")  # a group that matched several times returns its last match
m.group(1)
# 'cc'

In fact, group(n) can also be written with square-bracket indexing on the match object

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Lebron James")
m[0]
# 'Lebron James'
m[1]
# 'Lebron'
m[2]
# 'James'

Match.groups(default=None)

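Returns a tuple containing all the subgroups of the match. The default argument sets the value used for groups that did not take part in the match; it defaults to None.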

m = re.match(r"(\d+)\.(\d+)", "3.1415926")
m.groups()
# ('3', '1415926')
m = re.match(r"(\d+)\.?(\d+)?", "345")  # the second group does not take part in the match
m.groups()
# ('345', None)
m.groups(-1)                            # value to use for groups that did not match
# ('345', -1)

Match.groupdict(default=None)

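Returns a dictionary of all the named groups of the match, keyed by group name. The default argument works as above. If the pattern has no named groups, an empty dictionary is returned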

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Lebron James")
m.groupdict()
# {'first_name': 'Lebron', 'last_name': 'James'}
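
Named groups and groupdict together give a compact way to turn a line of text into a record. The sketch below assumes a hypothetical log format of the form "<date> <LEVEL> <message>"; both the format and the sample line are made up for illustration.

```python
import re

# Hypothetical log line format: "<date> <LEVEL> <message>"
pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)'

m = re.match(pattern, '2024-01-31 ERROR disk full')
if m:
    record = m.groupdict()   # one dictionary entry per named group
    print(record['level'])   # ERROR
    print(record)
```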

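Match.start([group]) and Match.end([group])

Return the start and end index of the substring matched by group; group defaults to 0, meaning the whole match. If group exists but did not take part in the match, -1 is returned.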

m = re.match(r"(\w+) (\w+)", "Lebron James, Kobe")
m.start(), m.end()
# (0, 12)
m.start(1), m.end(1)
# (0, 6)
m.start(2), m.end(2)
# (7, 12)
string = "Lebron James, Kobe"
m.group(1) == string[m.start(1): m.end(1)]
# True

Match.span([group])

For a match m, returns the 2-tuple (m.start(group), m.end(group)). Note that if group did not take part in the match, the result is (-1, -1). group defaults to 0, the whole match.

m = re.match(r"(\w+) (\w+)", "Lebron James, Kobe")
m.span()
# (0, 12)

Two attributes are worth knowing

Match.re: the regular expression object that produced this match
Match.string: the string that was searched

m = re.match(r"(\w+) (\w+)", "Lebron James, Kobe")
m.string
# 'Lebron James, Kobe'
m.re
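# re.compile('(\\w+) (\\w+)')

That brings us to the next topic: the regular expression object. So what is a regular expression object?

Regular expression objects

A regular expression string, once compiled, becomes a regular expression object. The compile function compiles a pattern into such an object, which can be reused many times and lets the program run more efficiently

re.compile(pattern, flags=0)

pattern: the regular expression pattern
flags: matching flags, such as re.MULTILINE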

prog = re.compile(pattern)
result = prog.match(string)

# is equivalent to
result = re.match(pattern, string)

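Regular expression objects offer the top-level functions introduced above as methods, such as search, match, and findall. The searching methods (search, match, fullmatch, findall, and finditer) also accept two extra parameters, pos and endpos, which set where in the string the search starts and stops.

Take search: it scans the string and returns a match object if a match is found, or None otherwise.

p.search(string, 0, endpos) is equivalent to p.search(string[:endpos], 0). Using pos is not quite the same as slicing: '^' still matches only at the real start of the string, and the reported positions refer to the original string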

p = re.compile("dog")

m = p.search('a dog')
m.group()
# 'dog'
m = p.search('a dog', 3)
m 
# None
m = p.search('a dog', 2, 5)
m.group()
# 'dog'
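
How pos differs from slicing the string can be seen with an anchored pattern. A small sketch (the patterns and the sample string are chosen only for illustration):

```python
import re

p = re.compile(r'^dog')

# With pos, '^' still refers to the real start of the string, so this fails ...
anchored = p.search('a dog', 2)
print(anchored)               # None

# ... whereas slicing creates a new string that starts at index 2.
sliced = p.search('a dog'[2:])
print(sliced.span())          # (0, 3)

# With pos, match positions are reported relative to the original string.
q = re.compile('dog')
print(q.search('a dog', 2).span())  # (2, 5)
```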

Another example, the match method

p = re.compile("aaa")
m = p.match("hello aaa bbb")
m
# m is None
m = p.match("hello aaa bbb", 6)
m.group()
# 'aaa'

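The other methods of regular expression objects are much the same as the module's top-level functions, apart from the pos and endpos parameters that some of them accept to limit the search range, so they are not covered again here.

File reading and writing

The main string operations in Python have now been covered. Next comes a brief look at how strings are used when reading and writing files.

The built-in open function opens a file object, which we then operate on. A file can be opened in a variety of modes; there are many combinations, but they boil down to three kinds of access (read, write, and append) and two file formats (text and binary)

Mode	Description	Mode	Description
`t`	Text mode	`b`	Binary mode
`x`	Create the file and write; fails if the file exists	`+`	Read and write
`r`	Read only; pointer at the start	`r+`	Like `r`, plus writing
`w`	Create the file and write; an existing file is overwritten	`w+`	Like `w`, plus reading
`a`	Append; pointer at the end; the file is created if missing	`a+`	Like `a`, plus reading

Files are read and written as text by default. For binary reading or writing, add b; for example, rb reads a binary file.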

Writing files

Open a text file for writing; the write method writes a string to the file

zen = [
    'Beautiful is better than ugly.',
    'Explicit is better than implicit.',
    'Simple is better than complex.',
    'Complex is better than complicated.',
    'Flat is better than nested.',
    'Sparse is better than dense.'
]
f = open('this.txt', 'w')
for line in zen:
    f.write(line)
    f.write('\n')
f.close()

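Alternatively, writelines writes a list of strings to the file. It does not add a newline after each string, though, so the newlines still have to be added by hand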

f = open('this.txt', 'w')
f.writelines([s+'\n' for s in zen])
f.close()

Remember to close a file promptly once you are done with it. The closed attribute tells whether a file has been closed

f.closed
# True

Reading files

Open a text file read-only; the read method reads the whole file at once and returns it as a string

f = open('this.txt', 'r')
for line in f.read().split('\n'):
    print(line)
f.close()

Or use readline to read one line at a time (the := operator used here requires Python 3.8 or later)

f = open('this.txt', 'r')
while line := f.readline():
    print(line, end='')
f.close()

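Or use readlines, which returns a list of lines. Its optional argument is a size hint rather than a line count: no further lines are read once the total size of the lines read so far exceeds the hint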

f = open('this.txt', 'r')
while True:
    lines = f.readlines(3)
    if not lines:
        break
    for line in lines:
        print(line, end='')
f.close()
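
Beyond the three methods above, a file object can also be iterated over directly, which yields one line at a time without an explicit readline loop. A self-contained sketch; it writes its own scratch file in a temporary directory, so the file path and contents are assumptions of the example rather than the this.txt used above.

```python
import os
import tempfile

lines = ['Beautiful is better than ugly.', 'Explicit is better than implicit.']

# Write a throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'this.txt')
f = open(path, 'w', encoding='utf-8')
f.writelines(s + '\n' for s in lines)
f.close()

# A file object yields its lines one by one when iterated over.
f = open(path, 'r', encoding='utf-8')
read_back = [line.rstrip('\n') for line in f]
f.close()

print(read_back)
```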

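The file pointer

The file pointer works like the cursor in a text editor: everything is read or written at the current position, and the position advances as we go. While reading or writing a file, tell returns the current position of the file pointer and seek moves it.

seek can move the pointer in three ways:

seek(offset, 0): move to offset (non-negative) bytes from the start of the file; this is the default
seek(offset, 1): move offset (positive or negative) bytes from the current position
seek(offset, 2): move offset (usually negative) bytes relative to the end of the file

In text mode only seeks relative to the start of the file are allowed, with seek(0, 2) as the one exception; the other forms require binary mode.

f = open('file_pointer.txt', 'a+')  # append mode, readable and writable
f.write(zen[0] + '\n')              # write 31 characters
f.tell()                            # current position of the file pointer
# 31
f.seek(0)                           # move the pointer to the start
f.write(zen[1] + '\n')              # write 34 characters
f.tell()
# 65
f.seek(0, 2)                        # move to the end
f.write(zen[2] + '\n')              # write 31 characters
f.tell()
# 96
f.seek(0)                           # move to the start, ready to read
f.read().split('\n')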
# ['Beautiful is better than ugly.',
#  'Explicit is better than implicit.',
#  'Simple is better than complex.',
#  '']
f.close()

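As this shows, data cannot be inserted at the start of the file this way: in append mode, data is always written at the end of the file, wherever the file pointer is

Context management

File operations are prone to errors. If an exception occurs or the file is never closed, the result can be a corrupted file or a leaked file handle. The context-management mechanism of the with statement ensures the file is closed safely when the block exits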

with open('file_pointer.txt') as f:
    for line in f.read().strip().split('\n'):
        print(line)
# Beautiful is better than ugly.
# Explicit is better than implicit.
# Simple is better than complex.
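
Coming back to the earlier observation that append mode always writes at the end: one possible way to put text at the start of a file is to open it in r+ mode, read the old content, and write the new text followed by the old. This is only a sketch for small files, using a scratch file in a temporary directory as an assumption of the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'file_pointer.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('second line\n')

# r+ opens for reading and writing without truncating the file,
# and writes go to the current pointer position rather than the end.
with open(path, 'r+', encoding='utf-8') as f:
    old = f.read()
    f.seek(0)                       # back to the start
    f.write('first line\n' + old)   # overwrite with new text + old content

with open(path, encoding='utf-8') as f:
    content = f.read()
print(content)
```

Each with block closes its file on exit, even if an exception is raised inside it.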

Random access to lines

The linecache module can fetch any given line from a text file

import linecache

linecache.getline('this.txt', 2)
# 'Explicit is better than implicit.\n'

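Directory operations

Directory operations are also common in Python: joining paths, getting the current directory, listing the files in a directory, and so on. The standard-library os module handles all of these; some frequently used functions are listed below

Function	Purpose	Function	Purpose
`os.getcwd`	Get the current working directory	`os.chdir`	Change the current working directory
`os.mkdir`	Create a directory	`os.rmdir`	Remove a directory
`os.makedirs`	Create nested directories	`os.listdir`	List the entries of a directory
`os.path.exists`	Test whether a path exists	`os.path.join`	Join one or more path components
`os.path.split`	Split a path into `(head, tail)`	`os.path.dirname`	Return the directory part of a path
`os.path.basename`	Return the final component of a path	`os.path.abspath`	Return the absolute version of a path

import os

os.getcwd()
# '/Users/dengxsh/Documents'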
os.chdir('/Users/dengxsh/Documents/WorkSpace/')
os.listdir()
# ['Go', 'PyCharm', 'image', 'IntelliJ', 'VSCode', 'Jupyter', 'Qt5']
os.path.exists('Go')
# True
path = '/Users/dengxsh/Documents/Python'
os.path.exists(path)
# False
os.path.basename(path)
# 'Python'
os.path.dirname(path)
# '/Users/dengxsh/Documents'
os.path.split(path)
# ('/Users/dengxsh/Documents', 'Python')
os.path.join(path, 'str', 'path')
# '/Users/dengxsh/Documents/Python/str/path'
os.path.abspath('.')
# '/Users/dengxsh/Documents/WorkSpace'
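
The paths above are specific to one machine. A version that runs anywhere can work inside a temporary directory, as in this sketch; the directory and file names ('str', 'path', 'demo.txt') are arbitrary choices for the example.

```python
import os
import tempfile

base = tempfile.mkdtemp()                    # scratch directory
nested = os.path.join(base, 'str', 'path')   # build a path portably

os.makedirs(nested)                          # creates intermediate directories too
print(os.path.exists(nested))                # True

# Create an empty file, then list the directory.
open(os.path.join(nested, 'demo.txt'), 'w').close()
print(os.listdir(nested))                    # ['demo.txt']

head, tail = os.path.split(nested)
print(tail)                                  # path
print(os.path.basename(head))                # str
```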