
Research on Chinese Web Page Classification Based on KNN and Related Links

Published: 2018-03-07 00:20

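  Topic: Chinese web page classification. Focus: web page extraction. Source: Harbin Engineering University, master's thesis, 2008. Document type: degree thesis.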


[Abstract]: With the rapid development of the Internet, the volume of online information is growing exponentially. Faced with disorganized web resources, people need to classify and organize massive numbers of web pages so that they can quickly retrieve the target they want and the information associated with it. Automatic web page classification provides a key technique for processing and organizing web pages at scale, and it is an important way to organize information resources reasonably and effectively. Improving the precision and recall of web page classification remains a goal that researchers continue to pursue.

This thesis uses a Chinese web page body-text extraction method that combines three steps: processing of page markup, filtering of noise information, and extraction of the body text. The method extracts the body text of Chinese web pages reasonably well. Links in a web page fall into two main types: links related to the topic of the page are called related links, and links unrelated to the topic, such as navigation bars and advertising links, are called irrelevant links. The related-link extraction algorithm proposed here extracts the related links in Chinese web pages well; it has low time complexity, and its precision and recall are both satisfactory. Based on the vector space model, the thesis selects feature words from web pages by term frequency, classifies Chinese web pages with the KNN machine learning algorithm, and designs and implements a Chinese web page classifier. It compares five classification settings: page title only, body text only, related links only, body text combined with related links, and title combined with related links. The comparison confirms the hypothesis that related links in Chinese web pages have a positive effect on web page classification, and it also yields a classification method.

In open tests, the experimental data show that the proposed method, which combines body text with related links, requires a relatively small training set and achieves an F1 value above 92% for every category, an improvement over traditional web page classification.
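The classification pipeline described in the abstract (term-frequency feature selection, vector space model, KNN) can be sketched roughly as follows. This is only an illustration under several assumptions, not the thesis's implementation: the abstract does not state the similarity measure, the value of k, or how body text and related links are combined, so the sketch assumes cosine similarity, k = 3, and simple concatenation of the two token lists. It also assumes pages are already segmented into words; the function names and the toy English tokens are hypothetical.

```python
import math
from collections import Counter


def select_features(docs, top_n):
    """Pick the top_n most frequent terms across the training documents."""
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    return [term for term, _ in totals.most_common(top_n)]


def to_vector(tokens, features):
    """Map a token list to a term-frequency vector over the selected features."""
    counts = Counter(tokens)
    return [counts[f] for f in features]


def cosine(a, b):
    """Cosine similarity (an assumed choice; the abstract does not name one)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def combine(body_tokens, link_tokens):
    """Assumed combination rule: concatenate body and related-link tokens."""
    return body_tokens + link_tokens


# Toy training pages: (body tokens, related-link anchor tokens, label).
train_pages = [
    (["football", "match", "goal"], ["league", "football"], "sports"),
    (["basketball", "match", "score"], ["league", "team"], "sports"),
    (["stock", "market", "price"], ["fund", "stock"], "finance"),
    (["bank", "rate", "market"], ["fund", "loan"], "finance"),
]

train_docs = [combine(body, links) for body, links, _ in train_pages]
train_labels = [label for _, _, label in train_pages]
features = select_features(train_docs, top_n=20)
train_vectors = [to_vector(doc, features) for doc in train_docs]

# Classify a new page from its body text plus its related links.
query = to_vector(combine(["match", "team"], ["league"]), features)
predicted = knn_classify(query, train_vectors, train_labels, k=3)
print(predicted)
```

Swapping `combine` for a function that returns only the body tokens, or only the link tokens, gives the kind of side-by-side comparison of settings that the abstract reports.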
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree granted]: 2008
[Classification number]: TP393.092





Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1577140.html

