Research on Ontology-Based Topic Relevance Algorithms

Published: 2019-07-01 15:38

[Abstract]: Specialized search engines provide valuable information and services for a specific domain, a specific user group, or a specific need, and are one of the future directions of web information retrieval. Because web resources are vast and growing rapidly, the first problem a specialized search engine must solve is how to acquire, efficiently and accurately, the web information for a specific domain or topic, that is, the target web resources, which comprise web pages and links. The core of this problem is how to compute the topic relevance of the target web resources, which covers both evaluating the topic relevance of web pages and predicting the topic relevance of links. Existing topic relevance algorithms mostly compute relevance at the character level and are comparatively weak at handling concepts or semantics, so relevance judgments are inaccurate and the precision of the acquired topic information is low. Because ontologies have strong semantic expressiveness, this study introduces ontologies, using them to represent the topic and to conceptualize web pages. Based on a comparative analysis of classical topic relevance algorithms, it selects algorithms with higher accuracy and efficiency, namely a web page topic relevance evaluation algorithm and a link topic relevance prediction algorithm. It then designs and implements a topic-driven web crawling system with higher efficiency and semantic processing capability, the ontology-based topic crawler system, and finally verifies the effectiveness of the algorithms experimentally.
After summarizing and reviewing the relevant literature, the study addresses the low accuracy and low efficiency of topic information acquisition by selecting suitable topic relevance algorithms, using harvest rate and time efficiency as the selection criteria. To improve accuracy, it compares the KNN classification algorithm, the concept space vector model (CSVM) algorithm, and the ontology-based topic relevance evaluation algorithm, and selects the ontology-based algorithm, which maps the concepts in a web page onto the ontology to compute the page's topic relevance. To improve efficiency, it compares the topic-sensitive PageRank algorithm, an algorithm based on link text content, and the ontology-based link topic relevance prediction algorithm, and selects the ontology-based algorithm. This algorithm combines Q-learning with a naive Bayes classifier to predict the long-term value of each link and chooses the links to crawl by comparing those long-term values; the Q-learner receives its feedback from the page topic relevance values computed by the ontology-based web page topic relevance evaluation algorithm. Building on the selected algorithms, the study designs the ontology-based topic crawler system, constructs a small apple ontology, and uses the apple topic as an example to describe the system's workflow in detail. Finally, the system is implemented and compared, with harvest rate as the metric, against a crawler guided by the breadth-first algorithm and a crawler guided by the Best-First algorithm. The experimental results show that the topic crawler guided by the ontology-based topic relevance algorithms achieves a higher harvest rate and has greater potential for crawling topic-relevant web resources.
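The two quantities the abstract relies on, page topic relevance and harvest rate, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm. The toy apple ontology, its concept weights, the normalization by total ontology weight, and the relevance threshold are all assumptions made for illustration, and the function names `page_relevance` and `harvest_rate` are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical toy "apple" ontology: concept -> weight.
# Assumption: weights shrink as a concept sits farther from the root topic.
APPLE_ONTOLOGY: Dict[str, float] = {
    "apple": 1.0,
    "orchard": 0.5,
    "cultivar": 0.25,
    "fuji": 0.25,
}


def page_relevance(tokens: Iterable[str], ontology: Dict[str, float]) -> float:
    """Score a page by the ontology concepts it mentions.

    Assumed scoring rule: sum the weights of the distinct ontology
    concepts found in the page and divide by the total ontology weight,
    giving a value in [0, 1].
    """
    present = {t.lower() for t in tokens if t.lower() in ontology}
    total = sum(ontology.values())
    if total == 0:
        return 0.0
    return sum(ontology[c] for c in present) / total


def harvest_rate(scores: List[float], threshold: float) -> float:
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not scores:
        return 0.0
    relevant = sum(1 for s in scores if s >= threshold)
    return relevant / len(scores)


# Three crawled pages, already tokenized (tokenization is out of scope here).
pages = [
    ["Apple", "orchard", "news"],
    ["stock", "price"],
    ["fuji", "cultivar"],
]
scores = [page_relevance(p, APPLE_ONTOLOGY) for p in pages]
rate = harvest_rate(scores, threshold=0.25)
```

In this sketch a crawler that fetched these three pages would count two of them as relevant at a threshold of 0.25. A real system would also need concept extraction from page text and a principled way to derive weights from the ontology structure, neither of which the abstract specifies.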
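[Degree-granting institution]: Chinese Academy of Agricultural Sciences
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3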
