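Cookbook, Chapter 2: Strings and Text Processing

I'm 大力眼睛, a blogger on 靠谱客. This post walks through Chapter 2 of the cookbook, on strings and text processing. I'm sharing it here in the hope that it serves as a useful reference.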

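Splitting a string on any of multiple delimiters

The split() method of string objects only covers simple cases: it cannot handle multiple delimiters or a variable amount of whitespace after a delimiter.

Using re.split()

import re

line = 'asdf dasdasdas; dasdasda,     dadasd'
print(re.split(r'[;,\s]\s*', line))

['asdf', 'dasdasdas', 'dasdasda', 'dadasd']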

Using a capture group

import re

line = 'asdf dasdasdas; dasdasda,     dadasd'
print(re.split(r'(;|,|\s)\s*', line))

['asdf', ' ', 'dasdasdas', ';', 'dasdasda', ',', 'dadasd']

As we can see, a capture group also returns the text matched inside the parentheses, i.e. the delimiters.

Having the delimiters is often useful, for example to rebuild the string from them.

import re

line = 'asdf dasdasdas; dasdasda,     dadasd'
fields = re.split(r'(;|,|\s)\s*', line)
# [::2] takes every second element starting at index 0: 0, 2, 4, ...
values = fields[::2]
# [1::2] takes every second element starting at index 1
delimiters = fields[1::2] + ['']
print(values)
print(delimiters)
# ''.join() concatenates the pieces
print(''.join(v + d for v, d in zip(values, delimiters)))

['asdf', 'dasdasdas', 'dasdasda', 'dadasd']
[' ', ';', ',', '']
asdf dasdasdas;dasdasda,dadasd

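If you don't want the delimiters in the result but still need parentheses to group parts of the regular expression, use a non-capturing group, (?:...).

import re

line = 'asdf dasdasdas; dasdasda,     dadasd'
fields = re.split(r'(?:;|,|\s)\s*', line)
print(fields)

['asdf', 'dasdasdas', 'dasdasda', 'dadasd']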

Matching text at the start or end of a string

Using str.startswith() or str.endswith()

filename = 'spam.txt'
print(filename.endswith('txt'))
print(filename.startswith('file'))
print('http://www.python.org'.startswith('http:'))

True
False
True

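Using a tuple to supply multiple choices

filename.startswith(('txt', 'spam'))

Note:
The tuple cannot be replaced with a list or a set. If the choices are in a list or set, convert them with tuple() first.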
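Combining startswith()/endswith(), a tuple, and any()

import os

# dirname is assumed to hold the path of the directory being checked
if any(name.endswith(('.c', '.h')) for name in os.listdir(dirname)):
    pass  # at least one .c or .h file is present

This checks whether a directory contains any files of particular types.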

Matching strings with shell-style wildcards

Using fnmatch()

from fnmatch import fnmatch

print(fnmatch('foo.txt', '*.txt'))
print(fnmatch('foo.txt', '?oo.txt'))
print(fnmatch('Dat45.csv', 'Dat[0-9]*'))

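Note that whether fnmatch() is case-sensitive depends on the operating system.

Using fnmatchcase()

from fnmatch import fnmatchcase
# fnmatchcase() compares case-sensitively on every operating system
print(fnmatchcase('foo.txt', '*.TXT'))

False

Note:
fnmatch() sits between the simple string methods and full regular expressions. If simple wildcards are all you need, it is a good choice.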
Matching and searching for text patterns

If the text you want to match is a simple literal, use startswith(), endswith(), or find().

For more complex matching, use regular expressions.

Using regular expressions

import re

text1 = '3/25/2018'
text2 = 'March 25, 2018'

if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no')
if re.match(r'\d+/\d+/\d+', text2):
    print('yes')
else:
    print('no')

yes
no

If you will match many strings against the same pattern, compile the regular expression with compile() first.

Using compile()

import re

text1 = '03/25/2018'

datepat = re.compile(r'\d+/\d+/\d+')

if datepat.match(text1):
    print('yes')
else:
    print('no')

yes

Note that match() only looks for the pattern at the start of the string. To find every occurrence of the pattern in the text, use findall().

Using findall() to find all occurrences

import re

text = 'Today is 03/25/2018.Yesterday is 03/24/2018.'

datepat = re.compile(r'\d+/\d+/\d+')

print(datepat.findall(text))

['03/25/2018', '03/24/2018']

Using capture groups

import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

m = datepat.match('03/25/2018')

print(m)

print(m.group(0), m.group(1), m.group(2), m.group(3), m.groups())

text = 'Today is 03/25/2018.Yesterday is 03/24/2018'

print(datepat.findall(text))

for month, day, year in datepat.findall(text):
    print('{}-{}-{}'.format(month, day, year))

<_sre.SRE_Match object; span=(0, 10), match='03/25/2018'>
03/25/2018 03 25 2018 ('03', '25', '2018')
[('03', '25', '2018'), ('03', '24', '2018')]
03-25-2018
03-24-2018

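Capture groups make later processing easier because they split the match into its parts.

findall() finds every match and returns them as a list. To process the matches one at a time, use finditer().

Using finditer()

import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')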

text = 'Today is 03/25/2018.Yesterday is 03/24/2018'

for m in datepat.finditer(text):
    print(m.groups()) 

('03', '25', '2018')
('03', '24', '2018')

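A short summary of using regular expressions:
1. Compile the regular expression to get a pattern object.
2. Match with methods such as match(), search(), findall(), and finditer().

Note that match() only checks the start of the string, which sometimes gives a result you did not intend.

import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')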

m = datepat.match('11/27/2012dzxdzxdzx')

print(m)

print(m.group())

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>
11/27/2012

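To reject trailing text like this, end the pattern with the end marker $.

One point to remember: if you will use the same pattern many times, compile it with compile() first to reduce overhead.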
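The end-marker fix mentioned above can be sketched as follows. This is a minimal illustration, not code from the original post: the name strict_datepat is an assumption, and re.fullmatch() is shown as a standard-library alternative that requires the whole string to match.

```python
import re

# Hypothetical name: the same date pattern, anchored at the end with $
strict_datepat = re.compile(r'(\d+)/(\d+)/(\d+)$')

# Trailing text after the date now prevents a match
rejected = strict_datepat.match('11/27/2012dzxdzxdzx')
accepted = strict_datepat.match('11/27/2012')

# Alternative: fullmatch() succeeds only if the entire string matches
full = re.fullmatch(r'(\d+)/(\d+)/(\d+)', '11/27/2012')

print(rejected)
print(accepted.group())
print(full.groups())
```

Keep in mind that $ also matches just before a trailing newline, so fullmatch() is the stricter of the two.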

Searching and replacing text

For simple literal strings, use str.replace().

Using str.replace()

text = 'how do you'

print(text.replace('do', 'are'))

how are you

For more complex cases, use re.sub().

Using re.sub()

import re

text = 'Today is 03/25/2018.Yesterday is 03/24/2018'

# \3 refers to the third capture group, i.e. the year
# The first argument of re.sub() is the pattern to match; the second is the replacement pattern
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))

Today is 2018-03-25.Yesterday is 2018-03-24

Using re.compile()

import re

text = 'Today is 03/25/2018.Yesterday is 03/24/2018'

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
print(datepat.sub(r'\3-\1-\2', text))

Today is 2018-03-25.Yesterday is 2018-03-24

For more complicated substitutions, you can use a callback function to compute the replacement.

Using a callback function

import re
from calendar import month_abbr

# The callback receives a match object, as returned by match() or search();
# call group() on it to extract specific parts of the match
def change_date(m):
    # m.group(1) is the month; indexing month_abbr converts it to the abbreviated month name
    mon_name = month_abbr[int(m.group(1))]
    # Each {} is a placeholder that format() fills with the corresponding argument
    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

text = 'Today is 03/25/2018.Yesterday is 03/24/2018'

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
print(datepat.sub(change_date, text))

Today is 25 Mar 2018.Yesterday is 24 Mar 2018

Putting it together: datepat.sub() uses the compiled regular expression to find a match, i.e. a match object, and passes it to the callback. The callback processes it, and its return value replaces the matched text. That completes one substitution; the second follows in the same way, and so on until no matches remain.

To find out how many substitutions were made, use re.subn().

Using re.subn()

import re

text = 'Today is 03/25/2018.Yesterday is 03/24/2018'

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
newtext, n = datepat.subn(r'\3-\1-\2', text)

print(newtext)
print(n)

Today is 2018-03-25.Yesterday is 2018-03-24
2

Case-insensitive search and replace

Using the re.IGNORECASE flag

import re

text = 'UPPER PYTHON,lower python, Mixed Python'
print(re.findall('python', text, flags = re.IGNORECASE))
print(re.sub('python', 'case', text, flags = re.IGNORECASE))

['PYTHON', 'python', 'Python']
UPPER case,lower case, Mixed case

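This approach has a limitation: the replacement text does not follow the case of the matched text (for example, an upper-case PYTHON is still replaced with lower-case case).

Using a support function

import re

# matchcase() is called once, when re.sub() is invoked, and returns replace().
# re.sub() then passes each match object to replace(). Because replace() is
# defined inside matchcase(), it can read word and adapt its case.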
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

text = 'UPPER PYTHON,lower python, Mixed Python'
print(re.sub('python', matchcase('case'), text, flags = re.IGNORECASE))

We can see that the replacement case follows the casing of each matched python: all upper case, all lower case, or capitalized.

Finding the shortest match

Consider this example first.

import re

str_pat = re.compile(r'"(.*)"')
text1 = 'Computer says "no."'
print(str_pat.findall(text1))

text2 = 'Computer says "no." Phone says "yes."'
print(str_pat.findall(text2))

['no.']
['no." Phone says "yes.']

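Note that the second call matched the whole of no." Phone says "yes. where we wanted "no." and "yes." separately. This is because * in a regular expression is greedy, so the match is always the longest one that satisfies the pattern. To fix this, add ? after the *.

Using the ? modifier for the shortest match

import re

str_pat = re.compile(r'"(.*?)"')
text2 = 'Computer says "no." Phone says "yes."'
print(str_pat.findall(text2))

['no.', 'yes.']

Adding ? after * makes the match non-greedy, so it produces the shortest match.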

Matching across multiple lines with regular expressions

Consider this example, which matches comments.

import re

comment = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is a comment */'
text2 = '''/* this is a
              multiline comment */
        '''
print(comment.findall(text1))
print(comment.findall(text2))

[' this is a comment ']
[]

The multiline case fails to match. This is because '.' matches any character except a newline.
Solution 1

Using ?:

import re
# Note the non-capturing group (?:...) used here
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
text2 = '''/* this is a
              multiline comment */
        '''
print(comment.findall(text2))

[' this is a\n              multiline comment ']

Solution 2

Using re.DOTALL

import re

comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
text2 = '''/* this is a
              multiline comment */
        '''
print(comment.findall(text2))

[' this is a\n              multiline comment ']

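Note:
re.DOTALL works well for simple cases, but with a very complex regular expression it may cause problems. The preferred approach is therefore to write a pattern that is correct without the flag, i.e. solution 1.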

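Normalizing Unicode text to a standard form

When working with Unicode strings, you need to make sure every string uses the same underlying representation.
In Unicode, some characters can be represented in more than one way. Consider an example.

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'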

print(s1)
print(s2)
print(s1 == s2)
print(len(s1))
print(len(s2))

Spicy Jalapeño
Spicy Jalapeño
False
14
15

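"Spicy Jalapeño" has two representations here: the first uses the single precomposed character \u00f1, and the second uses n followed by the combining character \u0303.
To solve this, use the unicodedata module to convert the strings to the same form.

Using unicodedata.normalize()

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

# The first argument of normalize() specifies the normalization form
# NFC means fully composed
# NFD means fully decomposed, using combining characters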
t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2)
print(ascii(t2))

t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t4))

True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'

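Summary:
1. Normalization matters a great deal, because you do not know what form your input text is in, especially when you have no control over the input.
2. Normalization can also be used to filter text, as shown below.

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
t1 = unicodedata.normalize('NFD', s1)
# unicodedata.combining() identifies combining characters
print(''.join(c for c in t1 if not unicodedata.combining(c)))

Spicy Jalapeno

Note that the combining character has been filtered out.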

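Working with Unicode characters in regular expressions

The regular expression library already recognizes some basic Unicode character classes. For example, \d matches any Unicode digit.

import re

num = re.compile(r'\d+')
print(num.match('123'))

# Specific Unicode characters can be written with escape sequences;
# these three are Arabic-Indic digits
print(num.match('\u0661\u0662\u0663'))

<_sre.SRE_Match object; span=(0, 3), match='123'>
<_sre.SRE_Match object; span=(0, 3), match='١٢٣'>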

Here is a trickier example.
[Image: code screenshot, not preserved]
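Since the screenshot is missing, here is one possible illustration of the kind of difficulty meant, assuming the point is the interaction between case-insensitive matching and Unicode case conversion. The word and variable names are illustrative choices, not taken from the original image.

```python
import re

# Case-insensitive pattern containing the German sharp s (U+00DF)
pat = re.compile('stra\u00dfe', re.IGNORECASE)
s = 'stra\u00dfe'

direct = pat.match(s)           # matches the original spelling
upper = s.upper()               # the sharp s uppercases to two letters, 'SS'
after_upper = pat.match(upper)  # no longer matches, despite IGNORECASE

print(direct.group())
print(upper)
print(after_upper)
```

The upper-cased string is one character longer than the original, so a pattern written for the original spelling cannot match it.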

To sum up the key point about handling Unicode:
Always normalize, so that all strings are converted to the same representation.

Stripping unwanted characters from strings

Using strip(), lstrip(), rstrip()

s = '      hello word       \n'
print(s.strip())
print(s.lstrip())
print(s.rstrip())

t = '---------helloxxxxxxx'
print(t.lstrip('-'))
print(t.strip('-x'))

hello word
hello word       

      hello word
helloxxxxxxx
hello

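Notes:
1. By default strip() removes whitespace (including newlines), but you can also specify which characters to remove.
2. strip() only handles the left and right ends of the string, not the middle. To remove whitespace in the middle, use replace() or re.sub().

Stripping whitespace while iterating

# filename is assumed to hold the path of the file to read
with open(filename) as f:
    # This is a generator expression, so it returns an iterator instead of building a temporary list
    lines = (line.strip() for line in f)
    for line in lines:
        pass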

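Sanitizing and cleaning up text

At the simplest level, you can clean text with str.upper(), str.lower(), str.replace(), or re.sub(), or normalize it with unicodedata.normalize() as described earlier. To go a step further, use translate().

Using translate()

[Image: code screenshot, not preserved]

Note that translate() works as a character-by-character mapping over the text.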
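As the screenshot is not available, the following is a minimal sketch of translate() used as a mapping. The sample string and the name remap are assumptions made for illustration.

```python
# Sample text containing a form feed, a tab, and a carriage return
s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'

# translate() takes a table keyed by code point:
# map tab and form feed to a space, and delete carriage returns (None)
remap = {
    ord('\t'): ' ',
    ord('\f'): ' ',
    ord('\r'): None,
}
a = s.translate(remap)
print(repr(a))
```

Characters that do not appear in the table, such as the accented letters and the final newline, pass through unchanged.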

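Going further: removing all combining characters
[Image: code screenshot, not preserved]
Here dict.fromkeys() first builds a dictionary whose keys are all combining characters and whose values are None. translate() then uses this dictionary to map every combining character to nothing, which removes it.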
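The dict.fromkeys() approach described above might look like the sketch below. The name cmb_chrs and the sample text are assumptions; the text is decomposed with NFD first so that accents become separate combining characters.

```python
import sys
import unicodedata

# Table mapping every combining character's code point to None (delete)
cmb_chrs = dict.fromkeys(
    c for c in range(sys.maxunicode + 1)
    if unicodedata.combining(chr(c))
)

# Decompose first, so each accent is a separate combining character
b = unicodedata.normalize('NFD', 'p\xfdt\u0125\xf6\xf1 is awesome\n')
cleaned = b.translate(cmb_chrs)
print(repr(cleaned))
```

Building the table scans the whole Unicode range once, so it is worth creating it a single time and reusing it.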

Another example: replacing Unicode digit characters with their ASCII equivalents
[Image: code screenshot, not preserved]
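One way to build such a digit table is sketched here, assuming the goal is to map every character in Unicode category Nd (decimal digit) to the matching ASCII digit. The name digitmap is illustrative.

```python
import sys
import unicodedata

# Map each Unicode decimal digit to the code point of its ASCII equivalent
digitmap = {
    c: ord('0') + unicodedata.digit(chr(c))
    for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == 'Nd'
}

x = '\u0661\u0662\u0663'  # Arabic-Indic digits one, two, three
converted = x.translate(digitmap)
print(converted)
```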

Using encode() and decode()

[Image: code screenshot, not preserved]
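A minimal sketch of the encode/decode route, under the assumption that the aim is to drop accents: decompose the text, encode to ASCII while ignoring anything that does not fit, then decode back to a string. The sample text is an assumption.

```python
import unicodedata

a = 'p\xfdt\u0125\xf6\xf1 is awesome\n'

# NFD splits each accented letter into a base letter plus a combining mark
b = unicodedata.normalize('NFD', a)

# Encoding with errors='ignore' discards the non-ASCII combining marks
cleaned = b.encode('ascii', 'ignore').decode('ascii')
print(repr(cleaned))
```

This only makes sense when the end result is meant to be plain ASCII, since every non-ASCII character is discarded.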

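Summary:
1. The simpler the approach, the faster it usually is. For example, str.replace() is often the fastest, even if you have to call it several times.
2. translate() is also fast for character-to-character mapping or character deletion.
3. There is no universally best method. The sensible approach is to try several and measure which performs best.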

Aligning text strings

Using ljust(), rjust(), center()

text = 'Hello World'
# Pad to 20 characters with spaces; the text is left-justified
print(text.ljust(20))
# Pad to 20 characters with spaces; the text is right-justified
print(text.rjust(20))
# Pad to 20 characters with spaces; the text is centered
print(text.center(20))

Hello World         
         Hello World
    Hello World     

# Specify a fill character
print(text.rjust(20, '='))
print(text.center(20, '*'))

=========Hello World
****Hello World*****

Using format()

text = 'Hello World'
print(format(text, '>20'))
print(format(text, '<20'))
print(format(text, '^20'))
print(format(text, '=>20s'))
print(format(text, '*^20s'))

         Hello World
Hello World         
    Hello World     
=========Hello World
****Hello World*****

# Note that format() can format several strings at once
print('{:>10s} {:>10s}'.format('Hello', 'World'))

     Hello      World

One thing to note about format():
It is not specific to strings; it works with values of any type, which makes format() more general than the other methods.

x = 1.2345
print(format(x, '>10'))
print(format(x, '^10.2f'))

    1.2345
   1.23

Using '%'

text = 'Hello World'
print('%-20s' % text)
print('%20s' % text)

Hello World         
         Hello World

Summary:
format() is the most powerful and the most general of these methods. Where possible, avoid the '%' operator.

Combining and concatenating strings

If the strings you want to combine are in a sequence or iterable, the fastest way is join().

Using join()

parts = ['Is', 'Chicago', 'Not', 'Chicago?']

print(' '.join(parts))

Is Chicago Not Chicago?

If you are only combining a few strings, '+' is good enough.

Using '+'

a = 'is'
b = 'you'

print(a + ' ' + b)

is you

'+' can stand in for format()
[Image: code screenshot, not preserved]

To combine string literals, you can simply write them next to each other.

a = 'Hello ' 'World'

print(a)

Hello World

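Note:
When combining a large number of strings, '+' is very expensive because of the memory copies and garbage collection involved (as in Java, where you would switch to StringBuilder for many strings). In that case, always use join().

Using a generator expression

[Image: code screenshot, not preserved]
Note that the conversion with str() and the concatenation happen together in a single generator expression.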
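A minimal sketch of converting and joining in one step, with sample data chosen for illustration:

```python
# Mixed-type data: join() needs strings, so convert each item on the fly
data = ['ACME', 50, 91.1]

# The generator expression feeds str(d) values straight into join(),
# without building an intermediate list
row = ','.join(str(d) for d in data)
print(row)
```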

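Sometimes string concatenation is unnecessary. Consider an example.
[Image: code screenshot, not preserved]

When I/O is involved, string concatenation calls for some trade-offs.
[Image: code screenshot, not preserved]
With version 1, when chunk1 and chunk2 are small, it can be faster than version 2, because version 2 performs two I/O operations and pays extra overhead for them. When chunk1 and chunk2 are large, however, version 1 becomes expensive, because it creates a large temporary result and copies memory.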
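The two points above can be sketched as follows. This is a reconstruction under assumptions: "version 1" is taken to be a single write of the concatenated chunks and "version 2" two separate writes, as the description of their costs suggests, and io.StringIO stands in for a real file.

```python
import io

a, b, c = 'ACME', '50', '91.1'

# Unnecessary concatenation: building the string by hand just to print it
manual = a + ':' + b + ':' + c
print(manual)
# print() can insert the separator itself, with no concatenation
print(a, b, c, sep=':')

chunk1, chunk2 = 'Hello ', 'World'

# Version 1 (assumed): concatenate first, then one write
f1 = io.StringIO()
f1.write(chunk1 + chunk2)

# Version 2 (assumed): two separate writes, no temporary string
f2 = io.StringIO()
f2.write(chunk1)
f2.write(chunk2)

print(f1.getvalue() == f2.getvalue())
```

Both versions produce the same output; they differ only in how many write calls and temporary strings are involved.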

Producing fragments with a generator function

def sample():
    yield 'Is'
    yield 'Chicago'
    yield 'Not'
    yield 'Chicago'

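text = ''.join(sample())

# f is assumed to be an open, writable text file
for part in sample():
    f.write(part)

# A few lines are enough to implement buffering, which is pretty cool (Python, that is)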
def combine(source, maxsize):
    parts = []
    size = 0
    for part in source:
        parts.append(part)
        size += len(part)
        if size >= maxsize:
            yield ''.join(parts)
            parts = []
            size = 0
    yield ''.join(parts)

for part in combine(sample(), 32768):
    f.write(part)

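Note that the generator function knows nothing about how the fragments are combined. It only produces the data (it is just the carrier).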

Interpolating variables in strings

Using the format() method of string objects

s = '{name} has {n} messages'
print(s.format(name = 'xxx', n = 'xxx'))

xxx has xxx messages

If the variables are already defined, you can combine format_map() with vars().

name = 'xxx'
n = 'xxx'

s = '{name} has {n} messages'
print(s.format_map(vars()))

xxx has xxx messages

Another feature of vars() is that it also works with instances.

class Info:
    def __init__(self, name, n):
        self.name = name
        self.n = n

s = '{name} has {n} messages'
a = Info('xxx', 11)
print(s.format_map(vars(a)))

xxx has 11 messages

Note that one drawback of format() and format_map() is that they cannot handle missing values.

s = '{name} has {n} messages'
print(s.format(name = 'xxx'))

Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\test.py", line 3, in <module>
    print(s.format(name = 'xxx'))
KeyError: 'n'

One solution is to define a dict subclass with a __missing__() method.

Using __missing__()

class safesub(dict):
    def __missing__(self, key):
        return '{' + key + '}'

name = 'xxx'
n = 'xxx'
del(name)
s = '{name} has {n} messages'
print(s.format_map(safesub(vars())))

{name} has xxx messages

Note that the missing value was left unreplaced.

If you find yourself using format_map() often, wrap it in a helper function. This one relies on a "frame hack".

Using a frame hack

import sys

class safesub(dict):
    def __missing__(self, key):
        return '{' + key + '}'

def sub(text):
    return text.format_map(safesub(sys._getframe(1).f_locals))


name = 'xxx'
n = 111

print(sub('Hello {name}'))
print(sub('{n} hello'))
print(sub('your favourite color is {color}'))

Hello xxx
111 hello
your favourite color is {color}

Other ways to interpolate variables

[Image: code screenshot, not preserved]
Note that neither of these two approaches is as good as format().
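The screenshot is missing, so the two approaches are not known for certain. A plausible pair, shown here purely as an assumption, is %-formatting with a mapping and string.Template from the standard library. The sample values are illustrative.

```python
import string

values = {'name': 'Guido', 'n': 37}

# Older %-style formatting with named keys taken from a mapping
old_style = '%(name)s has %(n)s messages.' % values

# Template strings use $name placeholders and substitute() with a mapping
templ = string.Template('$name has $n messages.')
from_template = templ.substitute(values)

print(old_style)
print(from_template)
```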

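Summary:
1. The advantage of format() is that it combines with the string formatting features (alignment, padding, numerical formatting).
2. f_locals is a dictionary holding a copy of the local variables, so modifying it has no side effects on them.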

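Reformatting text to a fixed number of columns

Using the textwrap module

[Image: code screenshot, not preserved]

Using get_terminal_size() to get the terminal window size

[Image: code screenshot, not preserved]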
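A minimal sketch of both steps, with sample text chosen for illustration. Note one deliberate substitution: shutil.get_terminal_size() is used here rather than os.get_terminal_size(), because the shutil version falls back to a default size when no terminal is attached.

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Wrap to at most 40 columns per line
wrapped = textwrap.fill(s, width=40)
# Same width, with the first line indented
indented = textwrap.fill(s, width=40, initial_indent='    ')

# Width of the current terminal (or a fallback default)
columns = shutil.get_terminal_size().columns

print(wrapped)
print(indented)
print(columns)
```

The terminal width can be passed as the width argument so the text adapts to the window.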

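Handling HTML and XML entities

Using html.escape()

[Image: code screenshot, not preserved]

Using the errors argument when encoding to ASCII

[Image: code screenshot, not preserved]

Using an HTML parser or XML parser

[Image: code screenshot, not preserved]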
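The three steps named above might be sketched as below. The sample strings are assumptions, and for the last step the sketch uses the standard-library helpers html.unescape() and xml.sax.saxutils.unescape() in place of a full parser.

```python
import html
from xml.sax.saxutils import unescape

s = 'Elements are written as "<tag>text</tag>".'

# Escape <, >, & and, by default, quotes
escaped = html.escape(s)
# Leave quotes alone
unquoted = html.escape(s, quote=False)

# Non-ASCII characters become numeric character references
t = 'Spicy Jalape\u00f1o'
ascii_bytes = t.encode('ascii', errors='xmlcharrefreplace')

# Going the other way: replace entities with the characters they stand for
from_html = html.unescape('Spicy &quot;Jalape&#241;o&quot;.')
from_xml = unescape('The prompt is &gt;&gt;&gt;')

print(escaped)
print(unquoted)
print(ascii_bytes)
print(from_html)
print(from_xml)
```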

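Tokenizing text

Suppose you have a string like this: text = 'foo = 23 +42 * 10'

To tokenize the string, you need some way of recognizing patterns, for example to turn it into a form like this
[Image: code screenshot, not preserved]

The first step in tokenizing is to define every possible token, including whitespace. The way to do this is with regular expressions and named capture groups.

Using regular expressions and named capture groups

[Image: code screenshot, not preserved]

Note that the general form is (?P<TOKENNAME>pattern).

With the first step done, the second step is to use scanner().

Using scanner()

[Image: code screenshot, not preserved]

Combining a generator with scanner()

[Image: code screenshot, not preserved]

Using a generator expression to filter tokens

[Image: code screenshot, not preserved]

Summary:
1. You must account for every possible token. If any text is left unmatched, the scan stops, which is why whitespace needs its own token.
2. The order of the tokens also matters.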
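The steps above can be sketched end to end as follows. The token names and patterns are assumptions chosen to fit the sample string, and the scanner() method of compiled patterns is an undocumented part of the re module, so treat this as an illustration rather than a guaranteed API.

```python
import re
from collections import namedtuple

# Step 1: one named capture group per token type (names are illustrative)
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
Token = namedtuple('Token', ['type', 'value'])

# Step 2: a generator that calls scanner.match() until it returns None
def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        # lastgroup is the name of the group that matched
        yield Token(m.lastgroup, m.group())

text = 'foo = 23 +42 * 10'

# Filter out whitespace tokens with a generator expression
tokens = list(tok for tok in generate_tokens(master_pat, text)
              if tok.type != 'WS')
for tok in tokens:
    print(tok)
```

Because the alternatives are tried in order, a longer pattern must come before any shorter pattern that is a prefix of it.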

Writing a parser

BNF:
[Image: code screenshot, not preserved]
EBNF:

An expression evaluator

import re
import collections

# One named capture group per token type
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
MINUS = r'(?P<MINUS>-)'
TIMES = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN, WS]))

Token = collections.namedtuple('Token', ['type', 'value'])

def generate_tokens(text):
    # Yield every token except whitespace
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match, None):
        tok = Token(m.lastgroup, m.group())
        if tok.type != 'WS':
            yield tok

class ExpressionEvaluator:
    def parse(self, text):
        self.tokens = generate_tokens(text)
        self.tok = None
        self.nexttok = None
        self.advance()
        return self.expr()

    def advance(self):
        # Move one token forward, keeping one token of lookahead
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self, toktype):
        # Consume the next token only if it has the given type
        if self.nexttok and self.nexttok.type == toktype:
            self.advance()
            return True
        else:
            return False

    def _expect(self, toktype):
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)

    def expr(self):
        expval = self.term()

        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                expval += right
            elif op == 'MINUS':
                expval -= right

        return expval

    def term(self):
        termval = self.factor()

        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval *= right
            elif op == 'DIVIDE':
                termval /= right

        return termval

    def factor(self):
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            expval = self.expr()
            self._expect('RPAREN')
            return expval
        else:
            raise SyntaxError('Expected NUM or LPAREN')

e = ExpressionEvaluator()
print(e.parse('2'))
print(e.parse(' 2 + (3 * 3) + 4 / 5 + (5 / 5) * 10'))

2
21.8

A parser

# Builds a parse tree of nested tuples instead of computing a value
class ExpressionTreeBuilder(ExpressionEvaluator):
    def expr(self):
        expval = self.term()

        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                expval = ('+', expval, right)
            elif op == 'MINUS':
                expval = ('-', expval, right)

        return expval

    def term(self):
        termval = self.factor()

        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval = ('*', termval, right)
            elif op == 'DIVIDE':
                termval = ('/', termval, right)

        return termval

    def factor(self):
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            expval = self.expr()
            self._expect('RPAREN')
            return expval
        else:
            raise SyntaxError('Expected NUM or LPAREN')

e = ExpressionTreeBuilder()
print(e.parse('2 + 3 * (4 + 5 / 6)'))

('+', 2, ('*', 3, ('+', 4, ('/', 5, 6))))

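Summary of the parsing steps (better to defer to the original text here, since my own summary might mislead)
[Image: screenshot of the original text, not preserved]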

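Performing text operations on byte strings

Byte strings support most of the built-in operations of text strings
[Image: code screenshot, not preserved]

[Image: code screenshot, not preserved]

Note that when applying a regular expression to a byte string, the pattern must also be given as bytes
[Image: code screenshot, not preserved]

A few differences between byte strings and text strings to be aware of:
1. Indexing yields an integer rather than a character.
2. Printing a byte string gives output of the form b'xxxxxx'.
3. format() is not supported on byte strings.
4. Using a byte string as a filename disables filename encoding and decoding.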
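Several of the points above can be checked with a short sketch. The sample data is an assumption made for illustration.

```python
import re

data = b'Hello World'

# Most text-style operations work, as long as the arguments are bytes too
first_word = data[0:5]
starts = data.startswith(b'Hello')
words = data.split()
replaced = data.replace(b'Hello', b'Hello Cruel')

# A regular expression applied to bytes must itself be a bytes pattern
fields = re.split(rb'[:,]', b'FOO:BAR,SPAM')

# Indexing a byte string gives an integer, not a one-character string
first_byte = data[0]

print(first_word, starts, words, replaced)
print(fields)
print(first_byte)
```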

Summary:
When processing text, use text strings wherever possible!


Category: python
Published: 2023-10-28 21:26:20
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_o_18_fx_13__7_g0.html
