Research on the Design and Application of the Rainbow (虹) Retrieval System Based on Academic Networks

Published: 2018-05-15 12:43

Topics: academic network + literature retrieval; Source: Shandong University, master's thesis, 2017

[Abstract]: With the rapid development of the mobile Internet and cloud computing, the volume of data generated, acquired, processed, and stored across all industries is growing exponentially. Big data, a hallmark of the new era, shapes social production and daily life through its diverse, multi-form, and interconnected nature. In academia, the accumulated literature has reached the order of 100 million documents, which poses a great challenge to traditional retrieval methods. Traditional literature retrieval ranks results mainly by a single kind of document information, such as the relevance between query terms and document content or the citation count of a document. It considers neither the relationships between nodes in the academic network nor the attributes of the nodes themselves, so the results suffer from weak relevance, topic drift, and low retrieval quality. In addition, traditional academic retrieval systems mainly provide literature retrieval, whereas recommendations of authoritative domain experts can better guide researchers' work and direction. How to mine deeper link-structure semantic information from massive academic data and build an expert retrieval system is therefore also an important research topic. Advances in data mining and distributed computing provide effective means of addressing these problems.
This thesis addresses two scenarios, literature retrieval and expert retrieval. By constructing academic information networks, it optimizes the retrieval methods and designs the retrieval system as an application. First, in the literature retrieval system, document nodes are ranked by importance using the link-analysis algorithm PageRank, and the shortcomings of PageRank are addressed in two ways: (1) the different attributes of nodes in the academic information network are used to compute the authority of each document node, and the weight allocation strategy of PageRank is modified according to document authority, yielding the SQT-Rank algorithm and improving ranking performance; (2) because the volume of literature data is very large in a big data setting, SQT-Rank is parallelized with the MapReduce programming model, improving computational performance. Second, compared with homogeneous information networks, heterogeneous information networks carry richer link-structure semantic information. In the expert retrieval system, to support deeper data mining and analysis, an academic heterogeneous information network is first constructed, and six relation matrices involving documents, experts, and journals are extracted from it. Then, based on a unified framework in which documents, experts, and journals mutually reinforce one another, the MR-Rank algorithm for ranking expert importance is proposed, producing a fairer and more reasonable expert ranking. Finally, building on these methods, the architecture of the academic-network-based Rainbow (虹) retrieval system is designed and its functions are implemented. The system architecture comprises data acquisition, data storage, data indexing, data analysis, and visualization of results. The data analysis stage extracts, cleans, and transforms the academic data and computes the importance of document and expert nodes, and the ranking results are then presented to the user visually in a specified form.
In summary, this thesis addresses accurate retrieval of massive literature collections and recommendation of domain experts in the context of big data. It constructs homogeneous and heterogeneous academic network models, mines node importance in these networks with the optimized document-ranking algorithm SQT-Rank and the expert-ranking algorithm MR-Rank, and applies the Rainbow retrieval system to recommend high-quality documents and experts, improving the user's retrieval experience.
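The abstract states that SQT-Rank modifies the weight allocation of PageRank according to document authority, but it does not give the formula. The sketch below is therefore only an illustration of that general idea, not the thesis's algorithm: it assumes each citing paper splits its rank among the papers it cites in proportion to a precomputed authority score. The function name `authority_weighted_pagerank`, the `authority` dictionary, and the default of 1.0 for missing scores are all hypothetical.

```python
def authority_weighted_pagerank(citations, authority, damping=0.85, iterations=50):
    """Rank papers in a citation graph.

    citations: dict mapping a paper to the list of papers it cites.
    authority: dict mapping a paper to a non-negative authority score
               (assumed to be computed elsewhere; defaults to 1.0 if missing).
    """
    # Collect every paper that appears as a citing or cited node.
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every node starts with the teleportation share.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            cited = citations.get(p, [])
            total = sum(authority.get(q, 1.0) for q in cited)
            if not cited or total == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[p] / n
                for q in nodes:
                    new_rank[q] += share
                continue
            # Assumed allocation rule: split rank in proportion to the
            # authority of each cited paper instead of splitting it evenly.
            for q in cited:
                new_rank[q] += damping * rank[p] * authority.get(q, 1.0) / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_citations = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    toy_authority = {"A": 1.0, "B": 1.0, "C": 3.0}
    ranks = authority_weighted_pagerank(toy_citations, toy_authority)
    for paper, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(paper, round(score, 4))
```

With all authority scores equal, this reduces to ordinary PageRank with uniform handling of dangling nodes. Each iteration is a sum over incoming contributions per node, which is the shape of computation that a MapReduce job can express as a map over edges followed by a reduce per target node.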
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3

[References]