Research on a Web Page Ranking Model for Topic Crawlers Based on Ontology Concept Similarity

Published: 2018-10-04 19:47

[Abstract]: Compared with general-purpose search engines, topic-specific search engines that focus on a single domain can collect information with higher precision and give users a better retrieval service. The topic crawler is the core module of a topic-specific search engine, so improving the domain relevance of the information it retrieves is especially important. However, because web resources are enormous in scale and grow in a highly dynamic way, crawl results still contain many irrelevant pages, which lowers crawling efficiency. To address this problem, this thesis studies the relevance analysis techniques used in topic crawler design, chiefly web page ranking algorithms, and analyzes and summarizes the strengths and weaknesses of current ranking algorithms. Drawing on the characteristics of the salt lake domain and on the strength of ontologies in expressing semantics, it proposes a new web page ranking algorithm based on ontology concept similarity in order to improve the accuracy of topic relevance computation. The method first selects suitable web pages as the initial seed set. It then builds a salt lake domain ontology to obtain an ontology concept set, classifies the concepts, and assigns them weights. Next, a concept similarity measure is used to compute the similarity between every concept in a web page and the concepts in the ontology concept set. Pages are ranked by their combined score, and high-scoring pages are stored in the topic crawler in preparation for later page collection. Finally, experiments show that the algorithm greatly reduces irrelevant results, raises the topic relevance of the collected pages, and also improves retrieval precision.
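The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual similarity formula, ontology, or weights, so everything below is an assumption made for illustration: the tiny is-a hierarchy in PARENT, the values in WEIGHTS, the path-based measure 1 / (1 + edge distance), and the function names similarity, page_score, and rank_pages are all hypothetical stand-ins.

```python
from typing import Dict, List, Optional

# Hypothetical mini-ontology as child -> parent (is-a) links.
# Concept names are illustrative, not taken from the thesis.
PARENT: Dict[str, Optional[str]] = {
    "salt lake": None,
    "brine": "salt lake",
    "mineral": "salt lake",
    "lithium": "mineral",
    "potash": "mineral",
}

# Hypothetical weights assigned to the ontology concept set.
WEIGHTS: Dict[str, float] = {"salt lake": 1.0, "brine": 0.8, "lithium": 0.6}


def ancestors(concept: str) -> List[str]:
    """Return the concept followed by its ancestors up to the root."""
    chain: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    return chain


def similarity(a: str, b: str) -> float:
    """Assumed path-based measure: 1 / (1 + edge distance); 0.0 for unknown concepts."""
    if a not in PARENT or b not in PARENT:
        return 0.0
    chain_a, chain_b = ancestors(a), ancestors(b)
    # The first node of chain_a that also appears in chain_b is the lowest common ancestor.
    for depth_a, node in enumerate(chain_a):
        if node in chain_b:
            return 1.0 / (1 + depth_a + chain_b.index(node))
    return 0.0


def page_score(page_concepts: List[str]) -> float:
    """Weighted average, over ontology concepts, of the best match among the page's concepts."""
    if not page_concepts:
        return 0.0
    total = sum(
        weight * max(similarity(concept, p) for p in page_concepts)
        for concept, weight in WEIGHTS.items()
    )
    return total / sum(WEIGHTS.values())


def rank_pages(pages: Dict[str, List[str]]) -> List[str]:
    """Return page identifiers sorted by descending score."""
    return sorted(pages, key=lambda url: page_score(pages[url]), reverse=True)
```

In this sketch a page mentioning "lithium" and "brine" outranks one mentioning only "potash", which in turn outranks a page with no ontology concepts at all. A real implementation would replace the toy hierarchy and measure with the domain ontology and similarity method defined in the thesis.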
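[Degree-granting institution]: Beijing Information Science and Technology University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1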

[References]