Research and Implementation of Web Chinese Text Classification Technology

Published: 2018-01-03 05:02

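Keywords: Research and Implementation of Web Chinese Text Classification Technology; Source: Wuhan University of Technology, 2014 master's thesis; Document type: degree thesis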

[Abstract]: In the information age, the rapid development of the Web and the spread of the Internet have brought great convenience to work and daily life, and the network has become an important source of information. However, because the Internet is heterogeneous and open, it is flooded with junk information. How to manage this endless stream of online information effectively, and how to find potentially useful knowledge quickly and accurately, have become current research topics. An important way to cope with complex web content is to classify it, and text is still the main form in which web pages are presented, so text classification is at the core of the problem. It is also a foundational technology for search engines and for information retrieval and filtering, and this broad applicability gives its study practical significance. Web Chinese text classification combines Internet technology with traditional text classification. In brief, it learns a classification model from Web Chinese documents of known category and then uses that model to determine the category of unknown documents. The whole process comprises preprocessing the Web Chinese text, selecting a feature word set, representing the text, computing term weights, and classifying samples. This thesis first explains the key technologies of Web Chinese text classification, summarizes the research background and current state, and analyzes the research workflow, and then carries out both theoretical and implementation work. On the theoretical side, after analyzing the shortcomings of existing methods, it improves several steps of the classification process. For the particular setting of the Web, it proposes handling text by region and in stages before feature selection, assigning different weights to text at different positions in a page. The chi-square statistic has three weaknesses: it considers only document frequency and ignores term frequency; a term that is rare in the target category but common in the other categories is very likely to be selected as a feature; and its formula does not sufficiently penalize terms that are evenly distributed. To address these, the thesis introduces a term-frequency compensation factor, a category-proportion factor, and an intra-class distribution factor, which are multiplied onto the traditional formula as compensation, with good results. For the classification algorithm, the thesis focuses on KNN and, after analyzing its principle in depth, summarizes its strengths and weaknesses. Because computing text similarity with the inner product formula in KNN is rather coarse (the thesis gives an example), it proposes a refinement based on a similarity proximity coefficient. Experiments designed for this purpose show that the improved measures raise precision, recall, and F1 to varying degrees. On the implementation side, the thesis designs a small software tool for Web Chinese text classification experiments, consisting of a web page collection module for building the sample corpus, a classification module for text processing and classification, and an evaluation module for evaluating and comparing the final results. It also presents the main design scheme and some of the key techniques used.
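As a rough illustration of the two baselines the abstract starts from, the sketch below computes the traditional chi-square score of a term for one category and classifies a document with KNN using inner-product similarity. It is not the thesis's improved method: the abstract does not define the three compensation factors or the similarity proximity coefficient, so none of them appear here. The function names and the dict-based sparse vectors are assumptions made for this example.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Traditional chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    Only document counts enter the formula; term frequency is ignored.
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def inner_product(u, v):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k most similar documents.

    training: list of (vector, label) pairs (hypothetical layout).
    """
    ranked = sorted(
        training,
        key=lambda item: inner_product(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because `(a * d - b * c)` is squared, a term that is rare inside the category and common outside it scores as highly as one with the opposite pattern, which is one of the weaknesses the abstract describes.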
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1

[References]