
Research on Topic Crawling Technology in Vertical Search Engines

Published: 2018-04-15 02:15

  Topics: topic crawling + Wikipedia; Source: Chongqing University, master's thesis, 2012


[Abstract]: With the rapid development of Internet technology, traditional general-purpose search engines have gradually revealed shortcomings such as low coverage and inaccurate results. Vertical search engines emerged to meet users' need for precise search. A vertical search engine uses topic crawling to collect Web pages related to a particular domain (topic) and provides retrieval services for that domain. Topic crawling is therefore the core of a vertical search engine and directly affects its performance. This thesis studies the key techniques of topic crawling: topic description, priority prediction for candidate links, and adaptive crawling strategies. The main contributions are as follows.

(1) A Wikipedia-based topic description method is proposed. A clear and accurate topic description is the foundation of a topic crawler, and the form of the description also determines how topic relevance is computed. Most existing algorithms describe the topic with a feature set and compute relevance by mechanically matching feature words. This ignores the semantic relations among feature words and makes their distribution too sparse, which weakens the description of the topic. Some methods introduce ontologies or semantic dictionaries to analyze the semantic associations between words, but few ontologies are available, and semantic dictionaries tend to be closed, limited in vocabulary, and slow to update. To address these shortcomings, this thesis uses Wikipedia, which is easy to obtain, promptly updated, and objective in its descriptions, as background knowledge. It builds the topic vector space from the category tree, maps the topic description document to a vector that describes the topic, and introduces semantic analysis into the relevance computation. A disambiguation reference table handles cases where a word maps to an unsuitable concept or has several senses. Experiments show that the method significantly improves on traditional methods in both sum of information and precision.

(2) A candidate-link priority prediction method based on web page segmentation is proposed. The predicted priority of candidate links determines the direction and outcome of topic crawling. Most existing algorithms predict priority from page content, anchor text, and anchor text context, but pages contain noise such as advertisements, the context of an anchor text is hard to delimit, and the anchor text itself carries little information. This thesis therefore first segments the page into blocks by depth-first traversal, filtering out some noise nodes, and then predicts the priority of candidate links by combining three sources: page content text, block text, and anchor text. Experiments show that introducing page segmentation effectively improves topic crawling performance.

(3) Two adaptive methods are proposed, one based on information gain and one based on the ratio of the sum of information. The initial topic description obtained from the concept hierarchy of the category tree is often not sufficiently objective or accurate. After every fixed number of crawled pages, the crawler therefore recomputes over all pages crawled so far using the two adaptive methods and automatically feeds back updated weights for each concept in the topic vector space, refining the topic description. Experiments show that both methods achieve incremental topic crawling. Pages crawled with the information-gain method are more relevant to the topic than those crawled with the sum-of-information-ratio method, while the sum-of-information-ratio method is more stable overall.

Finally, a prototype topic crawling system is designed and implemented, and a series of experiments run on it are used to validate and analyze the proposed methods.
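As a rough illustration of the abstract's second contribution, the sketch below scores a candidate link by combining topic similarity over three text sources (page content, enclosing block, anchor text) and feeds the score into a best-first crawl frontier. It is not the thesis's implementation: the bag-of-words vectors stand in for the Wikipedia concept vectors, and the names tf_vector, link_priority, Frontier, the example URLs, and the weights (0.3, 0.4, 0.3) are all assumptions made for this example.

```python
import heapq
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed stand-in for a concept vector)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_priority(topic, page_text, block_text, anchor_text,
                  weights=(0.3, 0.4, 0.3)):
    """Weighted sum of topic similarity for page, block and anchor text.

    The weights are illustrative assumptions, not values from the thesis.
    """
    w_page, w_block, w_anchor = weights
    return (w_page * cosine(topic, tf_vector(page_text))
            + w_block * cosine(topic, tf_vector(block_text))
            + w_anchor * cosine(topic, tf_vector(anchor_text)))


class Frontier:
    """Best-first crawl frontier: pops the highest-priority URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Ignore URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority


# Usage: two links found on the same page, one in a topical block, one in an ad block.
topic = tf_vector("topic crawler vertical search engine")
page = "a survey of vertical search engine technology"
frontier = Frontier()
frontier.push("http://example.com/crawler",
              link_priority(topic, page, "topic crawler design notes", "focused crawler"))
frontier.push("http://example.com/ads",
              link_priority(topic, page, "buy cheap shoes today", "special offer"))
next_url, score = frontier.pop()  # the link from the topical block comes out first
```

Because both links share the same page-level score, the block and anchor terms are what separate them here, which is the intuition behind adding block text as a third signal.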
[Degree-granting institution]: Chongqing University
[Degree]: Master's
[Year conferred]: 2012
[Classification number]: TP391.3



