Chapter 2: Text Wrangling and Cleaning (tokenization, stop word removal, stemming, lemmatization, etc.)

Psyche ·

Updated: 2024-11-13


The text processing pipeline:

Tokenization -> stop word removal -> stemming or lemmatization
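
To make the order of these steps concrete, here is a minimal standard-library sketch of the pipeline. It does not use NLTK: the stop word set STOPWORDS and the helper names strip_suffix and preprocess are made up for illustration, and the suffix rule is far cruder than a real stemmer.

```python
import re

# Tiny illustrative stop word set (an assumption, not NLTK's list)
STOPWORDS = {"the", "is", "a", "to", "we", "this"}

def strip_suffix(word):
    """Crude stand-in for stemming: drop one common suffix if enough remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"\w+", text.lower())            # 1. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 2. remove stop words
    return [strip_suffix(t) for t in tokens]             # 3. crude stemming

print(preprocess("This is the first day we walked to the parks"))
# ['first', 'day', 'walk', 'park']
```

Each stage consumes the previous stage's output, which is why the order matters: stop words are matched against whole tokens, before any suffixes are cut off.
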

1. A quick look at the basic contents of a JSON file:

example.json:
{
  "array": [1, 2, 3, 4],
  "boolean": "True",
  "object": {
    "a": "b"
  },
  "string": "Hello World"
}

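Simple code to process it:

import json
# Open the file; the with block closes it automatically
with open("example.json", encoding="utf-8") as jsonfile:
    # Load the data
    data = json.load(jsonfile)
print(data['array'], data['boolean'], data['object'], data['string'])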

The result:

[1, 2, 3, 4] True {'a': 'b'} Hello World
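
One detail in example.json is easy to miss: the value of "boolean" is the quoted string "True", not a JSON boolean. The sketch below parses an inline string instead of the file to show the difference; the extra "flag" key is an assumption added only for comparison.

```python
import json

# Same style of data as example.json, plus a real JSON boolean ("flag")
raw = '{"boolean": "True", "flag": true, "array": [1, 2, 3, 4]}'
data = json.loads(raw)

print(type(data["boolean"]).__name__)  # str
print(type(data["flag"]).__name__)     # bool
print(type(data["array"]).__name__)    # list
```

A JSON true (lowercase, unquoted) becomes Python's True, while a quoted "True" stays a string, so a check like data["boolean"] is True would fail for the file above.
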
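
2. Sentence splitting
Text cleaning should come first, for example stripping unnecessary characters out of HTML as shown earlier and dropping very short tokens.
Sentence splitting means dividing a large block of raw text into a series of sentences.
Splitting sentences with sent_tokenize: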

from nltk.tokenize import sent_tokenize
# sent_tokenize splits text into sentences based on sentence boundary detection
inputstring = " This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"
all_sent = sent_tokenize(inputstring)
print(all_sent)

The result:

[' This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !', '!']
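
For comparison, here is a naive rule-based splitter written with the standard re module. It is only a sketch to show why boundary detection is harder than it looks: the function name naive_sent_split and the rule "split after . ! or ? followed by whitespace" are assumptions for illustration, not how sent_tokenize works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(naive_sent_split(" This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"))
# ['This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !!']

print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second call shows the weakness of a fixed rule: the abbreviation "Dr." is treated as the end of a sentence. Cases like this are the reason to prefer a dedicated sentence tokenizer over a hand-written regular expression.
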

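3. Tokenization
There are several tokenizers to choose from. The simplest is the split() method of Python's string type, which splits on whitespace. word_tokenize() does the same job more robustly, and regexp_tokenize() is another option that splits the same string according to a regular expression.
The code is as follows:

Using split():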

s = "Hi Everyone ! This is the first day we go to school."
print(s.split())

The result:

['Hi', 'Everyone', '!', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school.']

Using word_tokenize:

from nltk.tokenize import word_tokenize
s = "Hi Everyone ! This is the first day we go to school."
all_word=word_tokenize(s)
print(all_word)

The result:

['Hi', 'Everyone', '!', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school', '.']

Using regexp_tokenize:

The regular expression \w+ picks out words and numbers, while \d+ extracts only the runs of digits.

from nltk.tokenize import regexp_tokenize
s = "Hi Everyone ! This is the first day we go to school."
# Use a raw string so that \w reaches the regex engine unchanged
all_word = regexp_tokenize(s, pattern=r'\w+')
print(all_word)

The result:

['Hi', 'Everyone', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school']
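
The difference between \w+ and \d+ is easier to see on a string that actually contains digits. The sketch below uses Python's built-in re.findall rather than NLTK, so it only approximates regexp_tokenize; the sample sentence is made up for illustration.

```python
import re

s = "Room 101 opens at 9 and closes at 17"

words_and_numbers = re.findall(r"\w+", s)  # runs of letters, digits, underscore
numbers_only = re.findall(r"\d+", s)       # runs of digits only

print(words_and_numbers)
# ['Room', '101', 'opens', 'at', '9', 'and', 'closes', 'at', '17']
print(numbers_only)
# ['101', '9', '17']
```

Note that both patterns return strings; the digits would still need int() if they are to be used as numbers.
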

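4. Stemming
An example:
eating, eaten, eats -> eat
Stemming reduces different inflected forms of a word to the same root. Even a basic rule-based approach, such as removing -s/-es, -ing, or -ed, can reach a precision of more than 70%.
Simple code: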

from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
# Create a Porter stemmer
pst = PorterStemmer()
# Create a Lancaster stemmer
lst = LancasterStemmer()
print(lst.stem("eating"))
print(pst.stem("shopping"))

The result:

eat
shop
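
The rule-based idea mentioned above (strip -s/-es, -ing, -ed) can be sketched in a few lines of plain Python. This is a toy for illustration, not the Porter or Lancaster algorithm: the function name naive_stem, the minimum stem length of three characters, and the rule that undoes a doubled final consonant are all assumptions made here.

```python
def naive_stem(word):
    """Strip one common English suffix; a toy rule set, not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "shopp" -> "shop"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ["eating", "eats", "shopping", "walked", "driving", "eaten"]:
    print(w, "->", naive_stem(w))
# eating -> eat
# eats -> eat
# shopping -> shop
# walked -> walk
# driving -> driv
# eaten -> eaten
```

The last two lines show the limits of suffix stripping: "driving" loses its final e, and the irregular form "eaten" is not reduced at all.
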

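5. Lemmatization
Lemmatization is more methodical: it uses context and part of speech to determine the base form (lemma) of a word.
Simple code: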

from nltk.stem import WordNetLemmatizer
wlem=WordNetLemmatizer()
print(wlem.lemmatize("I want to shopping"))

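The result:

I want to shopping

The sentence comes back unchanged because lemmatize() expects a single word (treated as a noun unless a pos argument is given), not a whole sentence. To lemmatize a sentence, tokenize it first and lemmatize each token with a suitable part of speech.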
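
To show why part of speech matters, here is a toy dictionary-lookup lemmatizer in plain Python. It is not WordNet: the table LEMMA_TABLE and the function toy_lemmatize are hypothetical and cover only the handful of words listed.

```python
# Hypothetical lookup table: (surface form, part of speech) -> lemma
LEMMA_TABLE = {
    ("drove", "v"): "drive",
    ("ate", "v"): "eat",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the input unchanged when it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("saw", pos="v"))        # see
print(toy_lemmatize("saw", pos="n"))        # saw
print(toy_lemmatize("drove", pos="v"))      # drive
print(toy_lemmatize("I want to shopping"))  # I want to shopping
```

The same surface form "saw" maps to different lemmas depending on its part of speech, and a whole sentence passed as one string is simply not found, which mirrors the unchanged output seen above.
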

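6. Stop word removal
Stop words contribute little to a document or a query. There are two ways to obtain a stop word list:
Method 1: compile one by hand or find an existing list online.
Method 2: build one from word frequencies.
NLTK ships with a stop word corpus.
Simple code: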

# Import the stop word corpus
from nltk.corpus import stopwords
# Get the English stop words
stoplist = stopwords.words('english')
# Take a look at which words are in the list
print(stoplist)
text = "This is just a test"
# Convert the text to lowercase
text = text.lower()
print(text)
# Drop the words that appear in the stop word list
cleanwordlist = [word for word in text.split() if word not in stoplist]
print(cleanwordlist)

The result:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did',
'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as',
'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
'between', 'into', 'through', 'during', 'before', 'after', 'above',
'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
'don', 'should', 'now']
this is just a test
['test']
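
Method 2, building the list from frequencies, can be sketched with collections.Counter. This is a minimal illustration under simple assumptions: the three sample documents, the function name build_stoplist, and the choice to take the top_n most frequent tokens are all made up here; a real corpus would need a far larger sample and a considered cutoff.

```python
from collections import Counter

def build_stoplist(documents, top_n=5):
    """Return the top_n most frequent lowercase tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(top_n)]

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]

stoplist = build_stoplist(docs, top_n=1)
print(stoplist)  # ['the']

cleaned = [w for w in docs[0].split() if w not in stoplist]
print(cleaned)   # ['cat', 'sat', 'on', 'mat']
```

With such a tiny sample only "the" stands out; on real data the most frequent tokens tend to be function words, which is what makes frequency a usable signal.
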

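7. Spelling correction
A very basic spell checker can be built with plain dictionary lookup, or with fuzzy string matching. The most commonly used approach is the edit-distance algorithm; see later chapters for details.
Simple code: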

from nltk.metrics import edit_distance
print(edit_distance("rain","shine"))

The result:

3
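
To show what the number 3 means, here is a plain-Python sketch of the classic dynamic-programming edit distance (insertions, deletions, and substitutions each cost 1), followed by a very small corrector built on it. The function names levenshtein and correct and the three-word vocabulary are assumptions for illustration; this is not NLTK's implementation.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda v: levenshtein(word, v))

print(levenshtein("rain", "shine"))                    # 3
print(correct("shcool", ["school", "shine", "rain"]))  # school
```

For "rain" and "shine" the three edits are: substitute r with s, substitute a with h, and insert the final e.
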

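What is the difference between stemming and lemmatization?
In my view, stemming is a reduction that chops off the end of a word, so it may turn driving into driv rather than drive. Lemmatization transforms a word according to its context and part of speech, for example drove -> drive.

Chapter summary:
This chapter was mainly about text processing. We covered:
sentence splitting, tokenization, stemming, lemmatization, stop word removal, and spelling correction.

Note:
Can we still run other NLP operations after stop words have been removed?
The answer is no. Typical NLP applications such as part-of-speech tagging and sentence segmentation rely on context to generate tags for a given text. Once the stop words are removed, that context no longer exists.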
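
A small illustration of how much context disappears. The set STOP_SUBSET below is a hand-copied subset of the NLTK stop word list printed in section 6, and remove_stopwords is a hypothetical helper; both are only for demonstration.

```python
# Subset of the English stop words shown earlier (all appear in that list)
STOP_SUBSET = {"to", "be", "or", "not", "this", "is", "the"}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_SUBSET]

print(remove_stopwords("To be or not to be"))      # []
print(remove_stopwords("This is not the answer"))  # ['answer']
```

The first sentence vanishes entirely, and the second loses its negation, so any step that depends on word order or function words should run before stop word removal.
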

Author: LYsdu
