Fundamentals of Natural Language Processing

Octavia · Updated: 2024-11-13 · 550 views

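Contents: text preprocessing; language models; recurrent neural network basics; machine translation and related techniques; attention mechanism and Seq2seq models; Transformer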

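I. Text Preprocessing

Preprocessing text data usually involves four steps:

1. Read in the text.
2. Tokenize it.
3. Build a vocabulary that maps each token to a unique index.
4. Convert the text from a sequence of tokens to a sequence of indices so that it can be fed to a model.

Code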


# Text preprocessing, step by step
# 1. Read in the text
import collections
import re
def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        # Lowercase each line and replace every run of non-letter characters with a space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines
lines = read_time_machine()
print('# sentences %d' % len(lines))
# 2. Tokenization: split each sentence into tokens, turning it into a sequence of words
def tokenize(sentences, token = 'word'):
    # Split every sentence into words or into characters
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        raise ValueError('unknown token type ' + token)
# test
tokens = tokenize(lines)
print(tokens[0:2])
# 3. Build the vocabulary: models work on numbers rather than strings, so each token is mapped to a unique index
class Vocab(object):
    def __init__(self, tokens, min_freq = 0, use_special_tokens = False):
        counter = count_corpus(tokens)
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        # The special-token strings were blank in the source (probably stripped as HTML tags);
        # '<pad>', '<bos>', '<eos>' and '<unk>' are assumed here.
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        # Keep tokens that are frequent enough and not already present
        self.idx_to_token += [token for token, freq in self.token_freqs
                                if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx
    def __len__(self):
        return len(self.idx_to_token)
    def __getitem__(self, tokens):
        # A single token returns one index; a list or tuple returns a list of indices
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]
    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]
def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # a Counter (dict subclass) recording how often each token occurs
# test
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[0:10])
# 4. Convert tokens to indices
# With the vocabulary, a sentence can be converted from a sequence of words to a sequence of indices
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
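# Tokenizing with existing tools: spaCy and NLTK
# `text` is not defined in the source; the sentence below is an assumed example
text = "Mr. Chen doesn't agree with my suggestion."
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])
from nltk.tokenize import word_tokenize
from nltk import data
data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
print(word_tokenize(text))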
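
The four steps can also be tried without the dataset file. The following is a minimal sketch on a hypothetical two-line corpus; the names `raw_lines` and `to_indices` and the choice to sort the vocabulary alphabetically are assumptions made for illustration, not part of the original code.

```python
import collections
import re

# Hypothetical two-line corpus used only for illustration.
raw_lines = ["The Time Machine, by H. G. Wells", "The time traveller smiled."]

# Step 1: normalise each line (lowercase, runs of non-letters become one space).
lines = [re.sub('[^a-z]+', ' ', line.strip().lower()).strip() for line in raw_lines]

# Step 2: split every line into word tokens.
tokens = [line.split() for line in lines]

# Step 3: build the vocabulary; index 0 is reserved for unknown words.
# Sorting alphabetically is an assumption that keeps the indices reproducible.
counter = collections.Counter(tk for line in tokens for tk in line)
idx_to_token = ['<unk>'] + sorted(counter)
token_to_idx = {tk: i for i, tk in enumerate(idx_to_token)}

# Step 4: map tokens to indices, falling back to 0 for unseen words.
def to_indices(words):
    return [token_to_idx.get(w, 0) for w in words]

print(tokens[1])                          # ['the', 'time', 'traveller', 'smiled']
print(to_indices(tokens[1]))              # [6, 7, 8, 5]
print(to_indices(['the', 'spaceship']))   # [6, 0]
```

Calling `split()` without an argument, unlike `split(' ')`, does not produce empty-string tokens when a line starts or ends with a space.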

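II. Language Models
A piece of natural-language text can be viewed as a discrete time series. Given a sequence of words w_1, w_2, …, w_T of length T, the goal of a language model is to evaluate how plausible the sequence is, that is, to compute the probability of the sequence.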

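1. Language models
Assuming that each word in the sequence w_1, w_2, …, w_T is generated in turn, we have

P(w_1, w_2, …, w_T) = ∏_{t=1}^{T} P(w_t ∣ w_1, …, w_{t−1})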

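The parameters of a language model are the word probabilities and the conditional probabilities of a word given the words before it. If the training set is a large text corpus, such as all Wikipedia entries, the probability of a word can be estimated by its relative frequency in the training set.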
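
As a rough illustration of estimating these parameters by counting, the sketch below uses a hypothetical toy corpus; `p_word` and `p_next` are names invented here, and conditioning on only one previous word is a simplifying assumption.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))

def p_word(w):
    """P(w): relative frequency of w in the corpus."""
    return unigram_counts[w] / len(corpus)

def p_next(w, prev):
    """P(w | prev): count(prev, w) divided by count(prev followed by any word)."""
    prev_total = sum(c for (first, _), c in bigram_counts.items() if first == prev)
    return bigram_counts[(prev, w)] / prev_total if prev_total else 0.0

print(p_word("the"))         # 3 of the 9 tokens are "the"
print(p_next("cat", "the"))  # "the" is followed by "cat" in 2 of 3 cases
```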
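2. n-grams
As the sequence gets longer, the cost of computing and storing the joint probabilities of multiple words grows exponentially. n-grams simplify the model through the Markov assumption: a word depends only on the n words before it, i.e. a Markov chain of order n. For n=1, for example, P(w_3 ∣ w_1, w_2) = P(w_3 ∣ w_2). Based on a Markov chain of order n−1, the language model can be rewritten as

P(w_1, w_2, …, w_T) ≈ ∏_{t=1}^{T} P(w_t ∣ w_{t−(n−1)}, …, w_{t−1})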

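This is the n-gram model, a probabilistic language model based on a Markov chain of order n−1.
Problem: when n is small, the model is often inaccurate. In a unigram model, for instance, the three-word sentences “你走先” and “你先走” have the same probability. When n is large, however, an n-gram model has to compute and store a large number of word frequencies and adjacent multi-word frequencies.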
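
The word-order problem can be checked on a small example. The sketch below assumes a hypothetical three-sentence corpus of single-character tokens and uses unsmoothed counts; `unigram_prob` and `bigram_prob` are illustrative names, and exact fractions are used so the comparison is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy corpus: "你先走" occurs, "你走先" never does.
sentences = [list("你先走"), list("我先走"), list("你先说")]

unigrams = Counter(ch for s in sentences for ch in s)
total = sum(unigrams.values())
bigrams = Counter(pair for s in sentences for pair in zip(s[:-1], s[1:]))
# How often each token appears as the first element of a bigram.
starts = Counter(ch for s in sentences for ch in s[:-1])

def unigram_prob(seq):
    """Product of independent word probabilities; word order is ignored."""
    p = Fraction(1)
    for ch in seq:
        p *= Fraction(unigrams[ch], total)
    return p

def bigram_prob(seq):
    """P(first word) times P(word | previous word) for each later word."""
    p = Fraction(unigrams[seq[0]], total)
    for prev, ch in zip(seq[:-1], seq[1:]):
        p *= Fraction(bigrams[(prev, ch)], starts[prev]) if starts[prev] else 0
    return p

print(unigram_prob("你先走") == unigram_prob("你走先"))  # True
print(bigram_prob("你先走"), bigram_prob("你走先"))       # 4/27 0
```

The bigram model separates the two orders, but it already needs a count for every adjacent pair, which is the storage cost the text refers to.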
Code

# 1. Read the dataset
with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
    corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[: 40])
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[: 10000]
# 2. Build the character index
idx_to_char = list(set(corpus_chars))  # deduplicate to get the index-to-character mapping
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # character-to-index mapping
vocab_size = len(char_to_idx)
print(vocab_size)
corpus_indices = [char_to_idx[char] for char in corpus_chars]  # convert every character to its index
sample = corpus_indices[: 20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)
def load_data_jay_lyrics():
    with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
        corpus_chars = f.read()
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    corpus_chars = corpus_chars[0:10000]
    idx_to_char = list(set(corpus_chars))
    char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
    vocab_size = len(char_to_idx)
    corpus_indices = [char_to_idx[char] for char in corpus_chars]
    return corpus_indices, char_to_idx, idx_to_char, vocab_size
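# Sampling sequential data
'''
During training we read a random mini-batch of samples and labels each time. Unlike the experimental data in earlier chapters, a sample of sequential data usually consists of consecutive characters. Suppose the number of time steps is 5 and the sample is the 5 characters “想”“要”“有”“直”“升”. Its label sequence consists of the characters that follow each of them in the training set, i.e. “要”“有”“直”“升”“机”, so X = “想要有直升” and Y = “要有直升机”.
'''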
import torch
import random
# Random sampling
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because, for a sequence of length n, X can cover at most its first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # floor division: number of non-overlapping samples
    example_indices = [i * num_steps for i in range(num_examples)]  # start position of each sample in corpus_indices
    random.shuffle(example_indices)
    def _data(i):
        # Return the subsequence of length num_steps that starts at i
        return corpus_indices[i: i + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    for i in range(0, num_examples, batch_size):
        # Pick batch_size random samples each time
        batch_indices = example_indices[i: i + batch_size]  # start positions of the samples in the current batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)
# test
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
# Adjacent (consecutive) sampling
# In adjacent sampling, two successive mini-batches are adjacent in position on the original sequence
def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the part of the sequence that is kept
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y
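
To see how adjacent sampling lays out its batches, here is a torch-free sketch that reproduces the same slicing with plain lists. `consecutive_batches` is a hypothetical helper written for this illustration, not the function above.

```python
def consecutive_batches(seq, batch_size, num_steps):
    """Yield (X, Y) pairs of nested lists laid out as in adjacent sampling."""
    usable = len(seq) // batch_size * batch_size  # drop the tail that does not fit
    row_len = usable // batch_size
    # Each row is one contiguous stretch of the sequence.
    rows = [seq[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    for b in range((row_len - 1) // num_steps):
        start = b * num_steps
        X = [row[start:start + num_steps] for row in rows]
        Y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield X, Y

batches = list(consecutive_batches(list(range(30)), batch_size=2, num_steps=6))
for X, Y in batches:
    print('X:', X, '\nY:', Y, '\n')
```

With a sequence of 30 integers, the first row of the second batch starts at 6, right after the first row of the first batch ends at 5, so each row continues where it left off in the previous batch.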

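III. Recurrent Neural Networks
The goal is to predict the next character of a sequence from the current input and the past inputs. A recurrent neural network introduces a hidden variable H, with H_t denoting its value at time step t. H_t is computed from X_t and H_{t−1}, so it can be seen as recording the sequence information up to the current character, and it is used to predict the next character of the sequence.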
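
A minimal sketch of this recurrence in plain Python follows. The text only says that H_t depends on X_t and H_{t−1}; the tanh activation, the parameter names `W_xh`, `W_hh`, `b_h`, and the sizes below are assumptions borrowed from the common vanilla RNN formulation.

```python
import math
import random

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step for a single example: H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h)."""
    h_t = []
    for j in range(len(b_h)):
        s = b_h[j]
        s += sum(x_t[i] * W_xh[i][j] for i in range(len(x_t)))
        s += sum(h_prev[k] * W_hh[k][j] for k in range(len(h_prev)))
        h_t.append(math.tanh(s))
    return h_t

random.seed(0)
num_inputs, num_hiddens = 4, 3  # assumed toy sizes
W_xh = [[random.gauss(0, 0.1) for _ in range(num_hiddens)] for _ in range(num_inputs)]
W_hh = [[random.gauss(0, 0.1) for _ in range(num_hiddens)] for _ in range(num_hiddens)]
b_h = [0.0] * num_hiddens

# Feed a short sequence of one-hot inputs; the hidden state is carried across steps.
h = [0.0] * num_hiddens
states = []
for idx in [0, 2, 1]:
    x = [1.0 if i == idx else 0.0 for i in range(num_inputs)]
    h = rnn_step(x, h, W_xh, W_hh, b_h)
    states.append(h)
print(states)
```

An output layer that maps each hidden state to a distribution over the next character is left out here.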


LSTM

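IV. Machine Translation and Related Techniques
Machine translation (MT) automatically translates a piece of text from one language into another; solving this problem with neural networks is usually called neural machine translation (NMT). Main characteristics: the output is a sequence of words rather than a single word, and the length of the output sequence may differ from that of the source sequence.

Detailed structure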

We use the softmax function to obtain the attention weights:

The final output is the weighted sum of the values:
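
These two steps can be sketched in a few lines of plain Python. The scoring function is not shown in the text, so scaled dot-product scoring is assumed here, and the names `query`, `keys`, `values`, and `dot_product_attention` are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, keys, values):
    """Return (output, weights) for one query against n key/value pairs."""
    d = len(query)
    # Score of the query against every key (scaled dot product, an assumed choice).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # The output is the weighted sum of the value vectors.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
# A zero query scores every key equally, so the output is the plain mean of the values.
output, weights = dot_product_attention([0.0, 0.0], keys, values)
print(weights, output)
```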

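VI. Transformer
The Transformer uses the attention mechanism to capture sequence dependencies in parallel and processes the tokens at every position of the sequence at the same time. These advantages let the model perform well while greatly reducing training time.

Multi-head attention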



Author: ai_XZP_master
                    
 
                
