forked from liuzhongkai123/Machine-Learning-Practice
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
eece821
commit b02a555
Showing
1 changed file
with
153 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 文本和字节序列 \n", | ||
"这一节开始复习python3中的字符串处理,相比python2中糟糕繁琐复杂易错的字符串系统,python3大大改进了这一方面。 \n", | ||
"\n", | ||
"## 字符与字节 \n", | ||
"我们通常所说的字符串即字符的序列,那么如何定义字符呢? \n", | ||
"py3里直接定义为“Unicode字符”,Unicode标准中,有如下区分: \n", | ||
"\n", | ||
"+ 字符标识 又称码位 由十进制来表示是一个0~1114111的数字,在Unicode标准中,通常用4~6个十六进制数表示,前缀为U+,所以我们通常看到的例如“U+1D11E”及为一个字符标识。 \n", | ||
"+ 编码 即字符的具体表述,编码是码位到字节序列转换的算法,例如UTF-8中,大写字母A(U+20AC)的编码是\\x41,在UTF-16LE中,编码为\\x41\\x00" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"字符串‘喵喵喵’的长度: 3\n", | ||
"字符串的utf-8编码: b'\\xe5\\x96\\xb5\\xe5\\x96\\xb5\\xe5\\x96\\xb5'\n", | ||
"该字符串utf-8编码长度: 9\n", | ||
"再将编码解码为字符标识: 喵喵喵\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# 例子 \n", | ||
"s = \"喵喵喵\"\n", | ||
"print(\"字符串‘喵喵喵’的长度:\", len(s))\n", | ||
"b = s.encode('utf-8')\n", | ||
"# 对字符串编码,可以看出一个喵被编码成三位\n", | ||
"print(\"字符串的utf-8编码:\", b)\n", | ||
"print(\"该字符串utf-8编码长度:\", len(b))\n", | ||
"print(\"再将编码解码为字符标识:\", b.decode('utf-8'))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"那么为啥不直接用字符标识呢(比如直接用十进制表示),这通常都是为提高传输与存储的效率。 \n", | ||
" \n", | ||
"说完字符再来谈谈字节,字节全名是二进制类型,包括bytes 和bytearray两个对象。其中其对象的每个元素都是一个0~255的整数" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"b'\\xe5\\x96\\xb5\\xe5\\x96\\xb5\\xe5\\x96\\xb5'\n", | ||
"打印第一个元素: 229\n", | ||
"b'cat'\n", | ||
"打印第一个元素: 99\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# 仍以喵喵喵为例 \n", | ||
"miao = bytes(\"喵喵喵\", encoding='utf-8')\n", | ||
"print(miao)\n", | ||
"print('打印第一个元素:', miao[0]) \n", | ||
"# 换成英文 \n", | ||
"miao = bytes(\"cat\", encoding='utf-8')\n", | ||
"print(miao)\n", | ||
"print('打印第一个元素:', miao[0]) \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"上述代码演示了bytes对象中单个元素的本质,注意到,打印中文时,出现的是utf-8编码,而打印英文时直接出现英文,这是因为英文属于ASCII字符,而ASCII字符(和制表符换行和回车\\t\\n\\r)会直接以字符本身显示,其他字节的值会以编码序列显示。 \n", | ||
"这里只是简单介绍二进制类型,如果需要深入处理二进制数据,通常会需要用到`struct`和`memoryview`这两个模块,可以届时阅读相关细节。 \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 编码 \n", | ||
"\n", | ||
"python中自带有100多编码解码器,包括ASCII,latin1,utf-8,cp437等等,不同编码能表示的字符范围也不尽相同。 \n", | ||
"其中,我们通常会遇到的除了ASCII编码外,还有: \n", | ||
"+ utf-8 目前最常用的编码,linux常用,web常用(约80%的网页使用0)。 \n", | ||
"+ utf-16le 一种utf-6的形式,也是现在windows的通用编码。 \n", | ||
"+ GBK 最常见的中文编码。 \n", | ||
"\n", | ||
"utf系列编码能处理任何字符串,但是如果使用一些较为小众或者古老的编码例如cp437,常常会出现问题(因为文本中出现了这些编码处理不了的字符)。 \n", | ||
"\n", | ||
"py3默认使用utf-8编码来对源代码(.py文件)编码,但是有时你在某些平台上打开文件时可能会默认使用其他编码,所以可以在文件开头加一行coding注释来指明你使用的编码,例如: \n", | ||
"`# coding:utf-8` \n", | ||
"\n", | ||
"另一个有趣的点是,在python3中,源码中是可以用非ASCII名称的,也就是说你可以用中文/俄文/日文等等来作为变量名,当然很多人会认为用中文显得不专业,不过这点倒无可厚非,主要是中文的变量名效率有些低,还要在中英文输入法中切来切去。 \n", | ||
"总的来说,除非是跨国公司或者开源代码,其他代码尽可以使用对开发团队易于阅读和开发的语言来做名称。 \n", | ||
"\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### 找出字节序列的编码 \n", | ||
"\n", | ||
"那么有一个问题也许会困扰大家,对于给定一段字节序列,它的编码未知,那么我们如何确定它的编码呢。 \n", | ||
"简单说,并没有办法完全确定。只能通过trial-error,即试探然后看看对不对来找,有一个python包叫`Chardet`在这方面做了一些工作。 \n", | ||
"\n", | ||
"\n", | ||
"## 处理文本文件 \n", | ||
"\n", | ||
"在处理文本文件时有一个\"Unicode三明治\"原则,我们知道三明治好吃的部分是中间(喜欢吃面包的不要打我),那么中间部分越大越好。 \n", | ||
"对应到文本处理中,在程序中,最好只有一次解码(读入输入文件)和一次编码(输出到文件),中间的业务逻辑完全只处理字符串对象而不涉及二进制。 \n", | ||
"那么只要保证输入输出时编码正确,你在python3中就基本不会遇到编码问题。" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.1" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |