Learning Python (4th Edition), Chapter 36: Unicode and Byte Strings


I'm 飘逸小丸子, a blogger on 靠谱客. This post summarizes Chapter 36 of Learning Python (4th Edition), Unicode and byte strings, and is shared here for reference.

1. From this chapter on, the book moves into advanced topics

2. Python 3.0 treats ASCII text as a simple kind of Unicode

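3. When no encoding name is given, Python 3.0 encodes and decodes text with a default encoding

sys.getdefaultencoding() reports the default used by str.encode() and bytes.decode(), which is UTF-8 in Python 3; the default for text files opened without an encoding argument depends on the platform and locale settings (on Windows it is often a code page rather than ASCII or UTF-8)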

>>> import sys

>>> sys.platform # the operating system platform
'win32'
>>> sys.getdefaultencoding() # the default encoding name
'utf-8'
>>>
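
A small sketch of how the two defaults can be inspected side by side. The variable names are illustrative only, and the locale value depends on the machine, so no output is shown here.

```python
import locale
import sys

# Default used by str.encode() / bytes.decode() when no name is passed
codec_default = sys.getdefaultencoding()

# Encoding preferred by the current locale; in many Python 3 versions
# open() falls back on this for text files when no encoding is given
locale_default = locale.getpreferredencoding(False)

print(codec_default)
print(locale_default)
```

Passing an explicit encoding name to open(), encode(), and decode() avoids depending on either default.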

4.ASCII

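ASCII defines character codes from 0 to 127 and allows each character to be stored in a single 8-bit byte (only 7 of those bits are actually used)

For example:

>>> ord('a') # the ASCII integer value of the character 'a' is 97
97
>>> hex(97) # hexadecimal form of the integer 97
'0x61'
>>> chr(97) # convert the integer back to the corresponding ASCII character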
'a'
>>>

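To accommodate special characters, the Latin-1 standard was added. It uses every possible value of an 8-bit byte (0-255) to represent characters and assigns the values 128-255, which lie outside the ASCII range, to special characters

For example:

>>> chr(199) # this value is above 127 (outside the ASCII range) and gives a special character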
'Ç'
>>> chr(196)
'Ä'
>>>
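
As a minimal sketch of the difference between the two ranges, the snippet below encodes the same character with both codecs. The variable names are illustrative only.

```python
ch = chr(199)                          # 'Ç', code point 199

latin1_bytes = ch.encode('latin-1')    # fits in a single byte
print(latin1_bytes, len(latin1_bytes))

ascii_error = None
try:
    ch.encode('ascii')                 # 199 is outside ASCII's 0-127 range
except UnicodeEncodeError as exc:
    ascii_error = exc
    print('ascii cannot encode it:', exc.reason)
```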

5.Unicode

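Unicode text is often called a "wide-character" string, because a single character may be represented by more than one byte.

Unicode is typically used in internationalized programs to represent European and Asian character sets, which contain more characters than an 8-bit byte can represent.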
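
To make "wider than one byte" concrete, here is a small sketch using one CJK character (chosen purely for illustration). It compares the character's code point with the number of bytes two common encodings need for it.

```python
ch = '中'                              # illustrative character only
code_point = ord(ch)                   # far beyond the 0-255 range of one byte
utf8_bytes = ch.encode('utf-8')
utf16_bytes = ch.encode('utf-16-le')

print(code_point, hex(code_point))
# One character in the str, but several bytes once encoded
print(len(ch), len(utf8_bytes), len(utf16_bytes))
```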

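6. Encoding and decoding

Encoding translates a string into its raw-bytes form according to a chosen encoding name

Decoding translates a raw byte string back into a string according to its encoding name

The usual round trip is to encode a string into raw bytes and later decode those raw bytes back into a string.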
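
The round trip can be sketched as follows. It also shows, as an assumption-free but easy-to-miss case, that decoding with a different encoding name may succeed silently and return different text.

```python
text = 'Ä'                             # code point 196

raw = text.encode('utf-8')             # str -> bytes
restored = raw.decode('utf-8')         # bytes -> str, same encoding name
print(raw, restored == text)

# A mismatched name does not raise here, but the result is not the original
mismatched = raw.decode('latin-1')
print(repr(mismatched))
```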

7.UTF-8

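UTF-8 is the most widely used Unicode encoding. ASCII and Latin-1 can also be treated as encodings of Unicode text, although each covers only a small part of the character set;

ASCII is a subset of both Latin-1 and UTF-8 (that is, for character codes below 128, the UTF-8 encoding is binary-compatible with ASCII);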
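
A short check of that compatibility, as a sketch with illustrative names: pure ASCII text yields identical bytes under all three encodings, while a character above 127 does not.

```python
ascii_text = 'spam'
same = (ascii_text.encode('ascii')
        == ascii_text.encode('latin-1')
        == ascii_text.encode('utf-8'))
print(same)

non_ascii = chr(199)                   # 'Ç'
as_latin1 = non_ascii.encode('latin-1')
as_utf8 = non_ascii.encode('utf-8')
print(as_latin1, as_utf8)              # one byte versus two bytes
```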

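8. Python 3.0 has three string types: one for text and two for binary data (non-text data such as images)

(1) str represents Unicode text (both 8-bit and wider characters)

(2) bytes represents binary data

(3) bytearray is a mutable variant of bytes

Conversion between str and bytes: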

>>> a = '123'
>>> type(a)
<class 'str'>
>>> b = a.encode() # str to raw bytes, first way (encode)
>>> type(b)
<class 'bytes'>
>>> b2 = bytes(a,encoding='utf-8') # str to raw bytes, second way
>>> type(b2)
<class 'bytes'>
>>> c = b.decode() # raw bytes to str, first way (decode)
>>> type(c)
<class 'str'>
>>> c2 = str(b,encoding='utf-8') # raw bytes to str, second way
>>> type(c2)
<class 'str'>
>>>

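9. In Python 3.0, string formatting works only on str, not on bytes. Since Python 3.5 the % operator is also supported on bytes (the session below comes from such a version), but .format is still unavailable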

>>> b'%s'%99
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'int'
>>> '%s'%99
'99'
>>> b'{0}'.format(99)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'
>>> '{0}'.format(99)
'99'

As the error messages show, a bytes object has no .format method, and with % the value substituted for %s must itself be bytes-like. A working form:

>>> b'%s'%b'11'
b'11'
>>>
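
Two ways to get an integer into a bytes value, sketched under the assumption of Python 3.5 or later for the first one (where % on bytes is available). The variable names are illustrative.

```python
# %d accepts an integer directly when the format string is bytes (3.5+)
via_percent = b'%d' % 99

# Alternative that does not rely on bytes formatting:
# format as str first, then encode the result
via_str = '{0}'.format(99).encode('ascii')

print(via_percent, via_str)
```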

10. Using bytearray (a mutable bytes) in Python 3.0

>>> s = 'spam'
>>> c = bytearray(s,'latin1') # encode with latin-1
>>> c
bytearray(b'spam')
>>> c[1] = b'Y'[0] # bytearray is mutable; b'Y'[0] is the integer 89
>>> c
bytearray(b'sYam')
>>>
>>> c[1] = b'Y' # wrong: a bytes object cannot be assigned to a single index
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required

The reason is that each item of a bytearray is an integer:

>>> c[1]
89
>>>

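>>> b = bytearray(s,'utf-8') # encode with utf-8
>>> b
bytearray(b'spam')
>>> b[1] # 112 is 'p'; for ASCII text latin-1 and utf-8 give the same bytes (c[1] was 89 only because it had been overwritten with 'Y')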
112
>>>
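
The integer-versus-bytes distinction applies to the other in-place operations too. A minimal sketch, with an illustrative variable name:

```python
buf = bytearray('spam', 'latin-1')

buf[1] = ord('Y')        # index assignment takes an integer in range(256)
buf[2:3] = b'XX'         # slice assignment takes a bytes-like object
buf.append(ord('!'))     # append takes an integer
buf.extend(b'??')        # extend takes a bytes-like object

print(buf)
print(buf.decode('latin-1'))
```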

11. The re pattern-matching module

Used for searching, splitting, and replacing
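
A brief sketch of the three operations on a made-up sample string. The last lines show the point relevant to this chapter: with bytes data, the pattern has to be bytes as well.

```python
import re

text = 'spam, eggs;ham'                       # sample data for illustration

found = re.search(r'e\w+', text)              # search: first match
parts = re.split(r'[,;]\s*', text)            # split on , or ; plus spaces
replaced = re.sub(r'[,;]\s*', ' | ', text)    # replace every separator

print(found.group(), parts, replaced)

# bytes subject requires a bytes pattern
byte_parts = re.split(rb'[,;]\s*', text.encode('ascii'))
print(byte_parts)
```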