Research on the Refinement of Text Corpora
[Abstract]: A text corpus is the foundation of text data mining. Many text corpora are derived from real-world work and daily life, and their categories are usually defined by industry experts. The data set in this paper comes from the mayor's public telephone office. Because industry categories change over time, the corpus inevitably contains a large amount of incorrectly classified data. The corpus is too large for experts to proofread item by item, so data mining methods must be used to find the misclassified data, which industry experts then proofread one by one. The purpose of this paper is to screen out the misclassified data in the corpus so that industry experts can correct its classification. The paper addresses the discrimination and classification of text data: it first discusses the techniques and workflow of text classification, then the properties of the naive Bayes method, and finally the refinement of the text corpus and methods for selecting data with category discrimination errors. An empirical analysis is given.
With big data, having industry experts correct text data manually is unrealistic, because it consumes a great deal of manpower, material, and financial resources. Labeling text data categories in batches according to certain rules is another approach; it avoids the shortcomings of direct expert classification, but the accuracy of the resulting labels is low. A third method combines the two: the text data is first labeled in batches, and the text data whose category labels are wrong is handed to industry experts for manual labeling. The text data in the corpus is then corrected using the expert-labeled data. The study of text corpus refinement in this paper is based on this third method.
Different methods are used to extract the text data with category discrimination errors from the corpus. In every method, the text data whose category is discriminated wrongly is taken to be the text data most likely to carry a labeling error. The purpose of text corpus refinement is to extract the text data most likely to be mislabeled; this data is handed to industry experts for manual labeling, and the categories in the corpus are then corrected based on the expert-labeled data. This paper first introduces the general process of text data classification and then the naive Bayes classification algorithm; finally, it studies the purpose and methods of text corpus preprocessing, feature extraction, text corpus refinement, and the extraction of text data with category discrimination errors. The emphasis of the paper is on methods for extracting text data with category discrimination errors.
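The screening idea described above can be illustrated with a minimal sketch. It is not the thesis's actual procedure: the leave-one-out scheme, the Laplace smoothing value, the function names (`train_naive_bayes`, `predict`, `flag_suspects`), and the toy tokens and categories are all assumptions made for illustration. Each document is classified by a multinomial naive Bayes model trained on the rest of the corpus, and documents whose predicted category differs from their assigned category are flagged for expert review.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Collect the counts needed for multinomial naive Bayes.

    docs is a list of token lists; alpha is the Laplace smoothing
    constant (an assumed value, not taken from the thesis).
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def predict(model, tokens):
    """Return the category with the highest log posterior score."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            smoothed = word_counts[label][tok] + alpha
            score += math.log(smoothed / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def flag_suspects(docs, labels):
    """Leave-one-out screening: flag documents whose predicted
    category disagrees with the category assigned in the corpus."""
    suspects = []
    for i in range(len(docs)):
        model = train_naive_bayes(docs[:i] + docs[i + 1:],
                                  labels[:i] + labels[i + 1:])
        predicted = predict(model, docs[i])
        if predicted != labels[i]:
            suspects.append((i, labels[i], predicted))
    return suspects


# Hypothetical, already-tokenized toy corpus; the last item is
# deliberately given the wrong category.
docs = [
    ["bus", "route", "delay"],
    ["road", "traffic", "jam"],
    ["bus", "stop", "road"],
    ["heating", "pipe", "cold"],
    ["radiator", "cold", "heating"],
    ["heating", "bill", "pipe"],
    ["bus", "road", "delay"],
]
labels = ["traffic", "traffic", "traffic",
          "heating", "heating", "heating", "heating"]

suspects = flag_suspects(docs, labels)
for index, assigned, predicted in suspects:
    print(index, assigned, predicted)
```

The flagged items are only candidates. On a corpus this small, a mislabeled document can also pull correctly labeled neighbours into the suspect list, which is why the final decision is left to industry experts.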
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: H08