An Improved Strategy for the HITS Algorithm in Web Mining
Published: 2018-04-30 18:25
Topic: Web mining + HITS algorithm; Source: Jilin University, master's thesis, 2013
[Abstract]: The 21st century is an era of steadily increasing informatization and rapidly developing network technology. More and more information is being consolidated, stored, and transmitted over the Internet. In an age in which almost nothing can be done without computers, obtaining the information one needs quickly and efficiently from this mass of data is a problem that everyone must face.

Search engine technology has made it possible to crawl and retrieve information on the network. No technology is perfect, however: because search engines are designed for general-purpose use, they have no preference in how they select pages and therefore cannot crawl with greater precision. Web pages contain many kinds of complex information in complex structural forms, which makes accurate page analysis, information extraction, and retrieval particularly difficult. Web mining grew out of traditional data mining. It performs relevance analysis on the structural information, text, or other content of web pages, and can thereby extract structured information suitable for data mining from the semi-structured documents in those pages. This thesis investigates how to provide an effective and accurate information retrieval scheme.

The thesis first introduces two classical link analysis algorithms in Web mining, HITS and PageRank, and analyzes their strengths and weaknesses. HITS is chosen as the base algorithm. Experiments show that HITS is insensitive to the timeliness of information and cannot identify redundant, invalid links. To address this, the thesis proposes a method based on a time-decay parameter that modifies the traditional HITS algorithm, yielding the TM-HITS algorithm. Experiments were carried out on both simulated data and real data obtained by web crawling. The results indicate that the algorithm effectively retrieves more timely pages and is better at avoiding interference from useless inbound links, whether malicious or not, such as advertising links and invalid pages.

Finally, drawing on these experiments and improvements, the thesis offers an outlook on future trends in Web mining and proposes a feasible approach to an information retrieval model that uses the two link analysis algorithms in combination. In this approach, link analysis modules integrating different algorithms are deployed on the server side and the client side, so that searches can be run at different levels of precision according to each user's needs. Machine learning can also be introduced to refine the model continuously, with the aim of supporting intelligent retrieval and deeper requirements such as letting users customize the retrieval service to their own preferences.
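
The abstract does not give the TM-HITS formula, so the following is only a rough sketch of how a time-decay parameter might be attached to HITS. It assumes, purely for illustration, that each link is weighted by an exponential decay of its age, so that older links contribute less to hub and authority scores. The function names, the `link_age_days` input, and the 180-day half-life are hypothetical and are not taken from the thesis. The sketch covers only the timeliness aspect, not the detection of invalid links.

```python
import math
from collections import defaultdict


def hits(edges, weights=None, iterations=50):
    """Power-iteration HITS on a directed graph.

    edges:   list of (source, target) pairs.
    weights: optional dict mapping (source, target) -> non-negative weight.
             A missing entry counts as 1.0, which gives classic HITS.
    Returns (authority, hub) dicts, each normalised to unit L2 length.
    """
    weights = weights or {}
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of pages linking in.
        new_auth = defaultdict(float)
        for s, t in edges:
            new_auth[t] += weights.get((s, t), 1.0) * hub[s]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: new_auth[n] / norm for n in nodes}

        # Hub update: weighted sum of authority scores of pages linked to.
        new_hub = defaultdict(float)
        for s, t in edges:
            new_hub[s] += weights.get((s, t), 1.0) * auth[t]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: new_hub[n] / norm for n in nodes}

    return auth, hub


def time_decay_weights(link_age_days, half_life_days=180.0):
    """ASSUMPTION: weight = exp(-lambda * age), with lambda set by a half-life.

    This decay form is an illustrative choice, not the thesis's definition.
    """
    lam = math.log(2) / half_life_days
    return {edge: math.exp(-lam * age) for edge, age in link_age_days.items()}


# Toy graph: three hub pages, one stale target ("old"), one fresh target ("new").
edges = [("h1", "old"), ("h2", "old"), ("h2", "new"), ("h3", "new")]

# Hypothetical link ages in days: links to "old" are two years old.
link_age_days = {
    ("h1", "old"): 720.0,
    ("h2", "old"): 720.0,
    ("h2", "new"): 10.0,
    ("h3", "new"): 10.0,
}

classic_auth, _ = hits(edges)
decay = time_decay_weights(link_age_days)
decayed_auth, _ = hits(edges, decay)

print("classic:", classic_auth["old"], classic_auth["new"])
print("decayed:", decayed_auth["old"], decayed_auth["new"])
```

On this symmetric toy graph, classic HITS has no basis for preferring either target page, whereas the decayed weights shift authority toward the page with recent inbound links. Whether TM-HITS applies its decay to links, to pages, or to the scores themselves cannot be determined from the abstract alone.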
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP311.13
Article ID: 1825669
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1825669.html