Research on the Refinement of Text Corpora

Published: 2018-10-11 09:07

[Abstract]: Text corpora are the foundation of text data mining. Many corpora originate in day-to-day operational work, and their categories are usually defined by domain experts. The data set in this thesis comes from the Mayor's Public Hotline Office. Because the industry categories have changed over time, the corpus inevitably contains many mislabeled records, and it is too large for experts to proofread record by record. Data mining methods are therefore needed to find the records that are likely misclassified, so that domain experts only have to check those. The subject of this thesis is the screening of misclassified data in the corpus, so that domain experts can correct the category labels.

The thesis treats this as a problem of discriminant classification of text data. It first outlines the techniques and general workflow of text classification, then introduces the naive Bayes classification algorithm and discusses its properties, and finally studies corpus refinement: corpus preprocessing, feature-word extraction, the purpose and methods of refinement, and the selection of texts whose category is judged to be wrong, supported by an empirical analysis.

With big data, having domain experts label or correct every text by hand is unrealistic, because it consumes large amounts of manpower, material resources, and money. Labeling texts in batches according to fixed rules is an alternative that avoids the cost of direct expert labeling, but the accuracy of the resulting labels is relatively low. The thesis proposes a third approach that combines the two: label the texts in batches, hand the texts whose labels appear to be wrong to domain experts for manual labeling, and then use the expert-labeled texts to correct the corpus. Corpus refinement is built on this third approach. Several different methods are used to extract the texts whose category is judged wrong, and the texts judged wrong by all of the methods are the ones most likely to carry an incorrect label. The goal of refinement is to extract these texts, pass them to domain experts for manual labeling, and finally correct the categories in the corpus on the basis of the expert labels. The main focus of the thesis is on methods for extracting texts whose category is judged wrong.
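The selection rule described in the abstract (flag a text only when every method disagrees with its recorded category) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract names naive Bayes but does not say which methods are combined, so the two variants used here (naive Bayes on word counts and on binary word presence), the leave-one-out scheme, the function names `train_nb`, `predict_nb`, and `flag_suspects`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, binary=False):
    """Fit a multinomial naive Bayes model from tokenized documents.

    binary=True counts each word at most once per document
    (an assumed second "method", not taken from the thesis).
    """
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # per-class word frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        if binary:
            tokens = set(tokens)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(model, tokens, binary=False):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    if binary:
        tokens = set(tokens)
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):      # sorted for deterministic tie-breaking
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in tokens:
            if tok in vocab:              # words unseen in training are skipped
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def flag_suspects(docs, labels, variants=(False, True)):
    """Leave-one-out: flag a document if every variant rejects its recorded label."""
    suspects = []
    for i, (tokens, recorded) in enumerate(zip(docs, labels)):
        rest_docs = docs[:i] + docs[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        wrong_everywhere = all(
            predict_nb(train_nb(rest_docs, rest_labels, b), tokens, b) != recorded
            for b in variants
        )
        if wrong_everywhere:
            suspects.append(i)
    return suspects


# Toy corpus (hypothetical): the last record is about water but labeled "traffic".
docs = [
    ["water", "pipe", "leak"], ["water", "meter", "leak"],
    ["pipe", "burst", "water"], ["water", "bill", "meter"],
    ["road", "traffic", "jam"], ["bus", "route", "traffic"],
    ["road", "light", "broken"], ["traffic", "light", "road"],
    ["pipe", "leak", "water", "water"],
]
labels = ["water"] * 4 + ["traffic"] * 5
suspects = flag_suspects(docs, labels)
print(suspects)  # → [8]
```

Requiring agreement among all variants keeps the list handed to experts short: a text is only sent for manual review when no method supports its recorded category. In a real corpus, the toy token lists would be replaced by the output of word segmentation and feature-word extraction.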
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: H08
