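Focused Crawler Algorithm Based on Formal Concept Analysis
Published: 2018-04-27 07:20
Topics: formal concept analysis + concept lattice; Source: Minzu University of China (中央民族大学), master's thesis, 2013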
[Abstract]: The rapid growth of the mobile Internet poses major challenges for search engines. How search engines adapt to this change and deliver better retrieval services has become a widely discussed problem, and web crawler algorithms, a key component of search engines, have become an active research topic. General-purpose web crawlers crawl at large scale and collect pages with heterogeneous content, so they cannot support focused crawling of the specific information and topics a user cares about. Topic-oriented (focused) web crawlers selectively crawl pages related to a topic, which reduces the number of pages crawled, improves crawl precision, and meets users' need to search on specific topics.

Formal concept analysis (FCA) is a data analysis method based on concept lattices. Since the theory was proposed, its intuitive and concise knowledge representation has attracted wide attention from researchers, and it has been applied in software engineering, library and information science, data mining, and many other fields.

By studying the principles of existing focused crawlers, this thesis proposes applying formal concept analysis to focused crawler algorithms: concept lattices are used in topic relevance analysis and in the ranking algorithm, thereby improving both. The main work of the thesis is as follows.

First, through a study of formal concept analysis theory, the thesis examines its core structure, the concept lattice, including the relationships between concepts in the lattice and the structure of the lattice itself, which motivates integrating concept lattices into focused crawler algorithms.

Second, the thesis studies the principles of focused crawlers, including their architecture, search strategies, the PageRank ranking algorithm, and topic relevance. It improves the concept-lattice-based topic relevance algorithm and uses it to compute topic relevance in the crawler. It also analyzes the shortcomings of the PageRank ranking algorithm and, on that basis, proposes an improved PageRank algorithm that incorporates concept lattices.
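For readers unfamiliar with formal concept analysis, the following is a minimal sketch of how formal concepts (the nodes of a concept lattice) can be derived from a small formal context. It is not the thesis's algorithm: the context `CONTEXT`, its page and term names, and the helper functions are all hypothetical, and the brute-force enumeration is only practical for toy inputs.

```python
from itertools import combinations

# Hypothetical formal context: objects are pages, attributes are terms.
CONTEXT = {
    "page_a": {"crawler", "search"},
    "page_b": {"crawler", "lattice"},
    "page_c": {"crawler", "search", "lattice"},
}


def common_attributes(objects, context):
    """Attributes shared by every object in `objects`.

    For an empty object set this is the full attribute set.
    """
    result = set().union(*context.values())
    for obj in objects:
        result &= context[obj]
    return frozenset(result)


def common_objects(attributes, context):
    """Objects that possess every attribute in `attributes`."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)


def formal_concepts(context):
    """Enumerate all formal concepts as (extent, intent) pairs.

    Each subset of objects is closed: its shared attributes form the
    intent, and the objects having that intent form the extent.
    Exponential in the number of objects, so only for tiny contexts.
    """
    objects = list(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context)
            extent = common_objects(intent, context)
            concepts.add((extent, intent))
    return concepts


if __name__ == "__main__":
    # Print concepts from the most specific (smallest extent) upward.
    for extent, intent in sorted(formal_concepts(CONTEXT), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by inclusion of their extents gives the concept lattice. A focused crawler could then, for example, compare a page's terms against the intents of topic-related concepts, though how the thesis defines topic relevance is not stated in the abstract.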
[Degree-granting institution]: Minzu University of China (中央民族大学)
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
Article ID: 1809776
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1809776.html