Research on the Application of Clustering Algorithms in Web Page Classification
Published: 2018-08-20 12:28
[Abstract]: In recent years, as information technology has developed, the number of web pages has grown sharply and the volume of information online has surged. Users search this information through search engines, which help them filter out large amounts of irrelevant content. Search engine systems have now entered a third era characterized by intelligence and user-friendliness. What most distinguishes this era from the previous two is the application of artificial intelligence techniques to search engine systems, and the clustering algorithm is the most important of these. Clustering groups the results returned by a search engine into several classes so that users can search in a targeted way. Most existing search engines cluster web pages on page content alone. After analyzing existing clustering algorithms, this thesis optimizes a commonly used algorithm: it applies the CBC algorithm to web page clustering and introduces the search terms as the primary reference data, improving CBC in areas such as feature weight calculation by increasing the weight of the search terms during clustering. The improved CBC algorithm was implemented and compared against the traditional K-means algorithm on a data set; the results show that it is more accurate than traditional K-means and also has a fairly clear advantage in efficiency. Finally, building on the improved clustering algorithm, the thesis designs a Chinese-language clustering system with a modular design covering web page crawling, analysis, and classification, and proposes ideas for improving the algorithm and for future work.
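The abstract does not give the thesis's actual weighting formula, so the following is only a minimal sketch of one plausible reading of "increasing the weight of the search terms" in feature weight calculation. It assumes a smoothed TF-IDF weight that is multiplied by a constant for terms appearing in the query. The names `boosted_tfidf`, `cosine`, and the `boost` parameter are hypothetical and are not taken from the thesis or from CBC itself.

```python
import math
from collections import Counter


def boosted_tfidf(docs, query_terms, boost=2.0):
    """Return one {term: weight} dict per tokenized document.

    Assumption: weights are smoothed TF-IDF, and any term that also
    appears in the search query is scaled by `boost` (hypothetical).
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    query = set(query_terms)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {}
        for term, count in tf.items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            weight = (count / len(tokens)) * idf
            if term in query:
                weight *= boost  # emphasize search terms
            vec[term] = weight
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy search results for the hypothetical query "jaguar".
docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "dealer"],
    ["jaguar", "cat", "jungle"],
]
vectors = boosted_tfidf(docs, ["jaguar"], boost=2.0)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Vectors weighted this way could then be passed to any similarity-based clustering step. Note that boosting a term shared by every result raises all pairwise similarities, so in practice the boost factor would need tuning against a labeled data set.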
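[Degree-granting institution]: Beijing University of Chemical Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1
Article ID: 2193609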
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2193609.html