Research on Chinese Web Page Classification Algorithms

Published: 2018-01-30 23:12

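Keywords: Chinese web page classification; vector space model; word co-occurrence graph; KNN　Source: Jiangsu University of Science and Technology, 2013 master's thesis　Document type: degree thesis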

[Abstract]: With the rapid development of the Internet and related technologies, a massive and heterogeneous body of Web information resources has appeared online. How to extract and generate knowledge from this mass of unstructured data and find the content people are interested in has become an important problem in urgent need of a solution. Chinese web page classification, as one of the key technologies for addressing this problem, has increasingly become a research focus, and it is ever more widely applied in search engines, information push, information filtering, automatic question answering, and other fields. This thesis describes in detail the key technologies of Chinese web page classification, including web page preprocessing, feature extraction, and the mainstream web page classification algorithms. It discusses feature extraction methods such as TF-IDF, mutual information, the χ² statistic, information gain, and expected cross entropy, and it analyzes in detail the basic ideas and the main advantages and disadvantages of the mainstream classification algorithms, namely the minimum distance algorithm, KNN, naive Bayes (NB), and support vector machines (SVM). In feature extraction, the traditional vector space model (VSM) ignores the fact that terms are mutually dependent and semantically related. Word co-occurrence graph methods handle this problem fairly well, but most current mainstream methods compute feature term weights in a simple, mechanical way. The improved word co-occurrence graph method proposed in this thesis takes the semantic information between words into account without neglecting the important influence of high-frequency words on topic representation. Experiments show that the method is simple to implement and achieves relatively high accuracy. Among classification algorithms, KNN is very widely used. A notable drawback of KNN, however, is that its computational cost grows linearly with the size of the training set, so it becomes very time-consuming when the training set is large. To address this, the thesis proposes an improved KNN algorithm whose main idea is to improve the strategy for finding the nearest neighbors of the text to be classified, thereby raising the running efficiency of KNN and reducing its complexity. Finally, experiments evaluate the performance of the KNN, NB, and SVM algorithms, and comparative experimental data for the proposed improved KNN algorithm show that it does improve classification efficiency and reduce algorithmic complexity.
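
The baseline that the abstract criticizes can be sketched as follows. This is a minimal illustration of plain TF-IDF weighting with brute-force KNN, not the thesis's improved method, whose neighbor search strategy the abstract does not specify. The function names, the smoothed IDF formula, the similarity-weighted vote, and the toy data are all assumptions made for illustration, and the documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing (+1) so terms found in every document keep some weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query_tokens, train_vectors, train_labels, idf, k=3):
    """Brute-force KNN: compare the query against every training vector."""
    tf = Counter(query_tokens)
    # Terms unseen in training have no IDF and are dropped from the query.
    query = {t: (c / len(query_tokens)) * idf[t]
             for t, c in tf.items() if t in idf}
    # One similarity per training document, so the cost of a single
    # classification grows linearly with the size of the training set.
    scored = sorted(
        ((cosine(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # similarity-weighted vote among the k nearest
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical, pre-segmented tokens).
train_docs = [
    ["football", "match", "goal", "team"],
    ["basketball", "team", "score", "match"],
    ["stock", "market", "bank", "rate"],
    ["bank", "loan", "rate", "market"],
]
train_labels = ["sports", "sports", "finance", "finance"]
train_vectors, idf = tfidf_vectors(train_docs)

print(knn_classify(["team", "goal", "match"], train_vectors, train_labels, idf))
```

The loop over all training vectors in knn_classify is the linear cost the abstract refers to. Any speed-up has to avoid scoring every training document for each query, which is the direction the abstract says the thesis takes by changing the neighbor search strategy.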
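[Degree-granting institution]: Jiangsu University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP393.092;TP391.1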
