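NLP: Text Preprocessing (1)

Text Preprocessing (1)

Preface
1. Traditional/Simplified Chinese Conversion
2. String Splitting
3. Removing Consecutive Repeated Characters and Punctuation
4. Inspecting the String Length Distribution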

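Preface

This post covers four basic text preprocessing steps: converting between Traditional and Simplified Chinese, splitting strings, removing consecutive repeated characters and punctuation, and inspecting the distribution of string lengths.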

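1. Traditional/Simplified Chinese Conversion

from opencc import OpenCC
cc = OpenCC('t2s')  # 't2s': Traditional to Simplified; 's2t': Simplified to Traditional
print(cc.convert('中國'))
# output => 中国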

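2. String Splitting

import re
# Keep only Chinese characters, ASCII letters, and Arabic digits
reg = "[^0-9A-Za-z\u4e00-\u9fa5]"
text = '！@#我BU知道zhe是什么。。111..？'
print(re.sub(reg, '', text))
# output => 我BU知道zhe是什么111

text2 = '今天。。你吃饭！！了吗？？？吃的...什么;啊'
# Split on sentence punctuation or an ASCII ellipsis ("...").
# The ellipsis sits in a capturing group, so re.split keeps it in the result
# and emits None for every delimiter matched by the other alternative.
parts = re.split(r'[!?！？。;…]|(\.{3})', text2)
print(parts)
# output => ['今天', None, '', None, '你吃饭', None, '', None, '了吗', None, '', None, '', None, '吃的', '...', '什么', None, '啊']
# Drop the empty strings and the None placeholders
l = [i for i in parts if i != '' and i is not None]
print(l)
# output => ['今天', '你吃饭', '了吗', '吃的', '...', '什么', '啊']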
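If the delimiters themselves are not needed, a variant of the split above can treat a whole run of punctuation as a single separator, which avoids the None and empty-string entries in the first place. The following is only a sketch under that assumption, not part of the original snippet; the helper name split_sentences is hypothetical.

```python
import re

# Hypothetical helper: one or more delimiters in a row count as a single split point.
# The group is non-capturing, so no delimiter text or None ends up in the result.
_DELIMS = re.compile(r'(?:[!?！？。;…]|\.{3})+')

def split_sentences(text):
    """Split text on runs of sentence punctuation, dropping empty pieces."""
    return [piece for piece in _DELIMS.split(text) if piece]

print(split_sentences('今天。。你吃饭！！了吗？？？吃的...什么;啊'))
# output => ['今天', '你吃饭', '了吗', '吃的', '什么', '啊']
```

Unlike the capturing-group version, this drops the "..." token as well, so it only fits cases where the ellipsis carries no information you want to keep.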

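3. Removing Consecutive Repeated Characters and Punctuation

from itertools import groupby
text = '好淡啊啊啊啊aaa,好像兑了水!!!!~~~'
# groupby groups consecutive equal characters; keep one character per group
s = ''.join(ch for ch, _ in groupby(text))
print(s)  # output => 好淡啊a,好像兑了水!~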
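Note that groupby collapses every run, including letters and digits, so a string such as "111" would shrink to "1". When only repeated punctuation should be collapsed, a regular expression with a backreference is one possible alternative. This is a sketch under that assumption; the helper name collapse_punct is hypothetical.

```python
import re

# Hypothetical helper: collapse runs of the same punctuation mark only.
# [^\w\s] matches a character that is neither a word character
# (letters, digits, CJK characters, underscore) nor whitespace;
# \1+ matches one or more further repetitions of that same character.
_REPEATED_PUNCT = re.compile(r'([^\w\s])\1+')

def collapse_punct(text):
    """Replace each run of an identical punctuation mark with a single copy."""
    return _REPEATED_PUNCT.sub(r'\1', text)

print(collapse_punct('好淡啊啊啊啊aaa,好像兑了水!!!!~~~'))
# output => 好淡啊啊啊啊aaa,好像兑了水!~
```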

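4. Inspecting the String Length Distribution

import pandas as pd
import matplotlib.pyplot as plt

# file_path must be set to a CSV file that has a 'comment' column
df = pd.read_csv(file_path)
df['text_len'] = df['comment'].map(lambda x: len(str(x)))
# Summary statistics, including the 50th, 80th, and 90th percentiles
print(df['text_len'].describe(percentiles=[0.5, 0.8, 0.9]))
# Normalized histogram of text lengths
plt.hist(df['text_len'], bins=30, rwidth=0.9, density=True)
plt.show()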
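A common use of these percentiles is choosing a maximum sequence length that covers most samples. The sketch below shows that idea with the standard library only, using the nearest-rank definition of a percentile; pandas interpolates linearly by default, so its values can differ slightly. The function name length_at_percentile and the sample comments are made up for illustration.

```python
import math

def length_at_percentile(texts, p):
    """Nearest-rank percentile of len(str(t)) over texts, for 0 < p <= 1."""
    lengths = sorted(len(str(t)) for t in texts)
    if not lengths:
        raise ValueError("texts is empty")
    # Smallest rank whose share of the samples is at least p (ranks are 1-based)
    rank = max(1, math.ceil(p * len(lengths)))
    return lengths[rank - 1]

# Made-up sample data with lengths 1, 2, 3, 5, and 18
comments = ['a', 'bb', 'ccc', 'ddddd', 'e' * 18]
print(length_at_percentile(comments, 0.8))  # output => 5
```

Here a cutoff of 5 characters covers 80% of the samples, while the single very long comment would be truncated.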
