Text Sentiment Analysis: Data Preprocessing

Jasmine · Updated: 2024-09-20

Data preprocessing code:
Source: Text Sentiment Analysis

import pickle

import numpy as np
import pandas as pd
# Import paths for older Keras 2.x; newer releases expose
# keras.utils.to_categorical and keras.utils.pad_sequences instead.
from keras.utils import np_utils
from keras.preprocessing.sequence import pad_sequences


def load_data(filepath, input_shape=20):
    df = pd.read_csv(filepath)
    # Unique labels and unique review texts
    labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())
    # Build character-level features: the vocabulary is the set of all characters
    string = ''
    for word in vocabulary:
        string += word
    vocabulary = set(string)
    # Lookup dictionaries; character ids start at 1 because 0 is used for padding
    word_dictionary = {word: i+1 for i, word in enumerate(vocabulary)}
    with open('word_dict.pk', 'wb') as f:
        pickle.dump(word_dictionary, f)
    inverse_word_dictionary = {i+1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i for i, label in enumerate(labels)}
    with open('label_dict.pk', 'wb') as f:
        pickle.dump(label_dictionary, f)
    output_dictionary = {i: label for i, label in enumerate(labels)}
    vocab_size = len(word_dictionary.keys())  # vocabulary size
    label_size = len(label_dictionary.keys())  # number of label classes
    # Pad every sequence to input_shape; shorter sequences are filled with 0 at the end
    x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    # One-hot encode the labels, then unwrap the extra nesting level
    y = [[label_dictionary[sent]] for sent in df['label']]
    y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
    y = np.array([list(_[0]) for _ in y])
    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary

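Statement 1: labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

Example output for labels ('正面' means positive, '负面' means negative):

['正面', '负面']

Purpose: collects the distinct labels and the distinct review texts from the dataset.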
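
To make statement 1 concrete: `Series.unique()` returns the distinct values in order of first appearance. The sketch below mimics that behavior in plain Python so it runs without pandas. The sample rows and the helper `unique_in_order` are assumptions for illustration, not part of the original code.

```python
# Hypothetical sample rows standing in for the 'label' and 'evaluation' columns.
rows = [
    ("正面", "我喜欢你"),
    ("负面", "我不喜欢你"),
    ("正面", "我爱你"),
    ("正面", "我喜欢你"),  # duplicate review
]


def unique_in_order(values):
    """Return distinct values in order of first appearance."""
    return list(dict.fromkeys(values))


labels = unique_in_order(label for label, _ in rows)
vocabulary = unique_in_order(text for _, text in rows)

print(labels)      # ['正面', '负面']
print(vocabulary)  # ['我喜欢你', '我不喜欢你', '我爱你']
```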

Statement 2:

string = ''
for word in vocabulary:
    string += word
vocabulary = set(string)

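Purpose: joins all reviews into one string and takes the set of its characters. This character-level vocabulary is what the dictionaries are built from.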
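
Statement 2 can also be written as a single expression. A minimal sketch, assuming `vocabulary` is a list of review strings (the sample values are made up):

```python
# Hypothetical list of unique reviews, as produced by statement 1.
vocabulary = ["我爱你", "我喜欢你", "我不喜欢你"]

# Loop version from the article: concatenate, then take the set of characters.
string = ''
for word in vocabulary:
    string += word
chars_loop = set(string)

# Equivalent one-liner: join once instead of concatenating in a loop.
chars_join = set(''.join(vocabulary))

print(sorted(chars_join))
```

Because a set has no guaranteed iteration order, the ids assigned to characters in the next step can differ from run to run, which is one reason the dictionary is saved to disk.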

Statement 3:

word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
with open('word_dict.pk', 'wb') as f:
    pickle.dump(word_dictionary, f)
inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
label_dictionary = {label: i for i, label in enumerate(labels)}
with open('label_dict.pk', 'wb') as f:
    pickle.dump(label_dictionary, f)
output_dictionary = {i: labels for i, labels in enumerate(labels)}

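This builds the lookup dictionaries. Each one works like a hash table: word_dictionary gives every character in the data an integer id, inverse_word_dictionary maps ids back to characters, and label_dictionary and output_dictionary do the same for labels. With these, a sentence can be turned into a row of integers. Character ids start at 1 because 0 is used for padding.
For example, "我爱你", "我喜欢你" and "我不喜欢你" could be encoded, padded to length 5, as
1 2 3 0 0
1 4 5 3 0
1 6 4 5 3
This integer matrix is what the model is trained on later.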
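
The example above can be reproduced with a short pure-Python sketch. To get exactly the ids shown, it numbers characters in order of first appearance instead of iterating over a set (an assumption made here for reproducibility), and `encode_and_pad` is a hypothetical helper that mimics padding with 0 at the end.

```python
sentences = ["我爱你", "我喜欢你", "我不喜欢你"]

# Number characters from 1 in order of first appearance; 0 is kept for padding.
chars = dict.fromkeys(''.join(sentences))
word_dictionary = {ch: i + 1 for i, ch in enumerate(chars)}
inverse_word_dictionary = {i: ch for ch, i in word_dictionary.items()}


def encode_and_pad(sentence, mapping, maxlen):
    """Map each character to its id, then pad with 0 at the end up to maxlen."""
    ids = [mapping[ch] for ch in sentence][:maxlen]
    return ids + [0] * (maxlen - len(ids))


matrix = [encode_and_pad(s, word_dictionary, maxlen=5) for s in sentences]
for row in matrix:
    print(*row)
# 1 2 3 0 0
# 1 4 5 3 0
# 1 6 4 5 3
```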

pickle.dump

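pickle.dump(obj, file, protocol=None)
Note: serializes the object obj and writes it to the open file object file. The optional protocol argument selects the pickle format. Protocol 0 is the original human-readable text format and was the default in Python 2; Python 3 defaults to a binary protocol (pickle.DEFAULT_PROTOCOL). In Python 3 the file must be opened in binary mode ('wb'), as in the code above.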
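
A minimal round-trip sketch of how a saved dictionary can be read back later with `pickle.load`. The dictionary contents and the temporary directory are assumptions for illustration; the article itself writes `word_dict.pk` to the working directory.

```python
import os
import pickle
import tempfile

word_dictionary = {"我": 1, "爱": 2, "你": 3}  # made-up sample mapping

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "word_dict.pk")

    # pickle writes bytes, so the file is opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(word_dictionary, f)

    # Read it back, again in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == word_dictionary)  # True
```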

enumerate

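enumerate() takes an iterable (such as a list, tuple, or string) and yields (index, item) pairs, so a loop gets both each element and its position. Counting starts at 0 unless a start value is passed.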
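
A small sketch of `enumerate` as it is used in statement 3. The `start=1` argument is an alternative to writing `i + 1` by hand; the sample characters are made up.

```python
chars = ["我", "爱", "你"]

# enumerate yields (index, item) pairs, counting from 0 by default.
pairs = list(enumerate(chars))
print(pairs)  # [(0, '我'), (1, '爱'), (2, '你')]

# Same mapping as {word: i + 1 for i, word in enumerate(chars)}.
word_dictionary = {ch: i for i, ch in enumerate(chars, start=1)}
print(word_dictionary)  # {'我': 1, '爱': 2, '你': 3}
```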

Statement 4:

x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
y = [[label_dictionary[sent]] for sent in df['label']]
y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
y = np.array([list(_[0]) for _ in y])

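After the second-to-last line, which converts y to one-hot, y is a list of arrays, each of shape (1, label_size):

[array([[1., 0.]], dtype=float32), array([[1., 0.]], dtype=float32)]

The last line therefore takes row 0 of each array and stacks the rows with np.array, giving a single one-hot matrix with one row per sample.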
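
The extra nesting is easier to see without Keras. In the sketch below, `one_hot` is a hypothetical pure-Python stand-in for `to_categorical` (it is not the Keras implementation), and the label list is made up. Because each label id is wrapped in its own list, every sample becomes a one-row matrix, and the last step unwraps that row.

```python
def one_hot(ids, num_classes):
    """Stand-in for to_categorical: one row of 0.0/1.0 values per id in ids."""
    return [[1.0 if k == i else 0.0 for k in range(num_classes)] for i in ids]


label_dictionary = {"正面": 0, "负面": 1}
df_labels = ["正面", "正面", "负面"]  # made-up 'label' column

# Each id is wrapped in a list, as in the article: [[0], [0], [1]]
y = [[label_dictionary[label]] for label in df_labels]
y = [one_hot(ids, num_classes=len(label_dictionary)) for ids in y]
print(y)  # [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]]

# Take row 0 of each one-row matrix to get one flat one-hot vector per sample.
y = [rows[0] for rows in y]
print(y)  # [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

Mapping the labels to a flat list of ids first and converting that list in one call should avoid the extra nesting altogether.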

pad_sequences syntax:

keras.preprocessing.sequence.pad_sequences(sequences, 
 maxlen=None,
 dtype='int32',
 padding='pre',
 truncating='pre', 
 value=0.)

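sequences: a list of lists (two levels of nesting) of ints or floats
maxlen: None or an integer, the target sequence length. Longer sequences are truncated and shorter ones are padded; where this happens is controlled by truncating and padding (both default to the start of the sequence). If None, the length of the longest sequence is used.
dtype: data type of the returned NumPy array
padding: 'pre' or 'post'; whether to pad at the start or the end of each sequence
truncating: 'pre' or 'post'; whether to remove values from the start or the end of sequences longer than maxlen
value: the value used for padding, 0 by default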
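
To illustrate how `padding` and `truncating` interact, here is a pure-Python approximation for lists of ints. `pad_sequences_sketch` is a hypothetical name; it is not the Keras implementation and ignores `dtype`.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding='pre',
                         truncating='pre', value=0):
    """Approximate pad_sequences for plain Python lists."""
    if maxlen is None:
        maxlen = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate: 'pre' drops items from the start, 'post' from the end.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Pad: 'pre' puts the fill before the data, 'post' after it.
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded


x = [[1, 2, 3], [1, 6, 4, 5, 3]]
print(pad_sequences_sketch(x, maxlen=4))
# [[0, 1, 2, 3], [6, 4, 5, 3]]
print(pad_sequences_sketch(x, maxlen=4, padding='post', truncating='post'))
# [[1, 2, 3, 0], [1, 6, 4, 5]]
```

The article's code passes padding='post', so short reviews keep their characters at the front and are filled with 0 at the end.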

to_categorical syntax:

to_categorical(y, num_classes=None, dtype='float32')
Converts integer class labels to one-hot encoding.

Author: 深夜喝牛奶
