Research on a Topic Crawling Strategy Based on Domain Ontology and a Similar Concept Context Graph

Published: 2018-05-24 13:12

  Topics: topic crawler + formal concept analysis; Source: Xihua University, master's thesis, 2012


[Abstract]: In recent years the amount of information on the Internet has grown exponentially, which makes it harder for people to find the information they need. An efficient and accurate search engine for organizing and retrieving useful information has therefore become increasingly necessary. The crawler is an important component of a search engine; its main job is to collect documents from the Web. Because crawlers built for general-purpose search engines consume large amounts of disk space and network bandwidth, and the accuracy of the resulting search results is relatively low, topic-specific search engines, which are intelligent, personalized, domain-oriented, and specialized, have quickly become a research focus in both academia and industry.
A topic crawler aims to collect web pages relevant to a predefined topic rather than traversing the entire Web. It relies on the observation that a page relevant to a topic tends to link to other pages on the same topic. A central problem for a topic crawler is how to assign an appropriate priority score to unvisited URLs during the crawl so as to maintain a high harvest rate. To address this problem, this thesis proposes a topic crawling strategy based on domain ontology and formal concept analysis. The strategy first builds a core similarity graph from WordNet and concept relatedness, then combines it with concept lattice knowledge to build a similar concept context graph. Finally, it combines the relevance of each URL's anchor text to the topic with link analysis to compute priority scores for the URLs waiting to be crawled, and these scores determine the order in which the URLs are visited.
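The ordering step described above can be sketched as a best-first frontier. This is only an illustration under stated assumptions, not the scoring formula of the thesis: the `Frontier` class, the Jaccard word overlap used as anchor-text relevance, the `parent_score` argument standing in for the link-analysis and context-graph signal, and the 0.6/0.4 weights are all hypothetical.

```python
import heapq
import itertools


def anchor_relevance(anchor_text, topic_terms):
    """Hypothetical relevance: Jaccard overlap of anchor words and topic terms."""
    words = set(anchor_text.lower().split())
    topic = set(topic_terms)
    if not words or not topic:
        return 0.0
    return len(words & topic) / len(words | topic)


class Frontier:
    """Best-first queue of unvisited URLs, highest priority first."""

    def __init__(self, topic_terms, anchor_weight=0.6, parent_weight=0.4):
        self.topic_terms = topic_terms
        self.anchor_weight = anchor_weight  # assumed weight, not from the thesis
        self.parent_weight = parent_weight  # assumed weight, not from the thesis
        self._heap = []
        self._seen = set()
        self._order = itertools.count()     # tie-breaker: earlier insert wins

    def add(self, url, anchor_text, parent_score):
        """Score a newly discovered URL once and queue it."""
        if url in self._seen:
            return
        self._seen.add(url)
        score = (self.anchor_weight * anchor_relevance(anchor_text, self.topic_terms)
                 + self.parent_weight * parent_score)
        # heapq is a min-heap, so the score is stored negated.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self):
        """Return the (url, score) pair with the highest priority."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Made-up URLs and scores, purely for illustration.
frontier = Frontier(topic_terms={"concept", "lattice", "crawler"})
frontier.add("http://example.org/a", "concept lattice tutorial", parent_score=0.8)
frontier.add("http://example.org/b", "holiday photos", parent_score=0.8)
frontier.add("http://example.org/c", "focused crawler survey", parent_score=0.2)
first_url, first_score = frontier.pop()
```

A heap keeps both insertion and extraction at logarithmic cost, which matters because the frontier of a real crawl grows much faster than it is consumed.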
The main contributions of the thesis are as follows:
1. A method for measuring semantic relatedness is proposed. Semantic relatedness measures how closely documents or words are related in meaning; it reflects the degree of association between two objects. Drawing on the rich semantics contained in the WordNet domain ontology and on several existing relatedness measures, the thesis derives the measure that it applies.
2. A method for building a similar concept context graph is proposed. Base pages that represent the crawl topic, and the current pages that those base pages link to, are collected and analyzed to obtain a base concept lattice, a current concept lattice, and a set of feature words that describe the crawl topic. The feature word set is first expanded with synonyms from the WordNet lexicon to produce an extended feature word set. The semantic relatedness measure is then used to build the core similarity graph. Finally, the algorithm proposed in the thesis builds the similar concept context graph from the core similarity graph, the base concept lattice, and the current concept lattice.
3. A strategy for predicting URL priority scores based on semantic link analysis and the similar concept context graph is proposed. Anchor text is usually a short summary of a page's topic written from the point of view of the page that cites it, so it is taken to be the best indicator of that topic. The thesis proposes a method for computing the relevance of anchor text to the topic and, combined with the similar concept context graph built above, a method for computing URL priority scores; the crawl is then guided by the size of these scores.
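The concept lattices mentioned in item 2 come from formal concept analysis, where a formal concept is a pair (extent, intent): a set of objects together with exactly the attributes they all share. The sketch below enumerates the formal concepts of a toy context by closing every subset of objects. The page names and feature words are made up, and the brute-force enumeration is an assumption chosen for brevity; the abstract does not say which lattice-construction algorithm the thesis uses.

```python
from itertools import combinations


def common_attributes(objects, context):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def common_objects(attributes, context):
    """Objects that carry every attribute in `attributes`."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of objects, so suitable for toy contexts only.
    """
    concepts = set()
    objects = list(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context)
            extent = common_objects(intent, context)
            concepts.add((extent, intent))
    return concepts


# Toy context: pages (objects) x feature words (attributes). Made-up data.
context = {
    "page1": {"crawler", "ontology"},
    "page2": {"crawler", "lattice"},
    "page3": {"crawler", "ontology", "lattice"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by inclusion of their extents yields the lattice; here the most general concept contains all three pages with the single shared word `crawler`.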
Finally, the thesis uses three metrics (recall, recall-precision, and F-Measure) to compare the proposed topic crawling strategy with a breadth-first crawling strategy, a context-graph-based strategy, a relevant-context-graph-based strategy, and a concept-context-graph-based strategy. The experiments show that, under the same conditions, the proposed strategy has certain advantages, which supports the effectiveness and feasibility of the method.
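For readers unfamiliar with these metrics, the following sketch computes precision, recall, and the balanced F-measure over sets of URLs. It uses the standard textbook definitions, which may differ from the exact formulas in the thesis; the function name and the sample URL identifiers are hypothetical.

```python
def precision_recall_f1(crawled, relevant):
    """Standard set-based precision, recall, and balanced F-measure."""
    crawled, relevant = set(crawled), set(relevant)
    hits = len(crawled & relevant)          # relevant pages actually fetched
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up crawl result: 4 pages fetched, 3 pages relevant, 2 in common.
crawled = ["u1", "u2", "u3", "u4"]
relevant = ["u1", "u2", "u5"]
p, r, f = precision_recall_f1(crawled, relevant)
```

With these inputs precision is 2/4, recall is 2/3, and the F-measure is their harmonic mean, 4/7.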
[Degree-granting institution]: Xihua University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3



