Information Extraction and Visual Analysis Based on Personal Profiles

Published: 2018-07-13 12:34

[Abstract]: With the spread of the Internet, the volume of online information has grown explosively, and the types of information have become increasingly diverse. Among these data types, one class can be called "personal profiles": résumés, personal home pages, biography pages in online encyclopedias, and the like. Such data makes it possible to infer social relationships between people. For example, if two people studied at the same university during overlapping periods, they are quite likely to have been classmates. The social network obtained from this kind of analysis is highly valuable and can be applied to many problems, such as identifying the most influential nodes and community detection, both common tasks in social network analysis. This thesis presents a system for information extraction and visual analysis of personal profile data and describes in detail the main algorithms involved. The system has two main functions: it extracts information from personal profiles to build an entity-based association network and uses that network to predict social relationships between people; and, on top of this network, it performs a shallow analysis of each person's importance or influence by computing PageRank.

We divide the construction of this network into two steps. First, we build an association network composed of entities of multiple types, which can be regarded as a domain-specific heterogeneous information network. This step involves structuring the personal profile data, including entity recognition and event extraction. Given the characteristics of the data, we implement event extraction by clustering sentences on syntactic parse tree similarity and combining the result with rule-based extraction. The second step establishes relationships between person-name nodes through path analysis over the constructed network. Before doing so, we need to add relationships between the other types of nodes in order to obtain reasonably complete path information. Given the characteristics of heterogeneous networks, we use different methods to construct relationships between different types of nodes.

Visual analysis of this information network consists mainly of ranking people by importance, or influence, through PageRank. In a visualization setting, given the limits of human cognition and of display resolution, we argue that the rank order of nodes matters more than the actual PageRank values. The PageRank computation should therefore stop early, as soon as the relative order of the nodes has essentially stopped changing. Existing research on improving PageRank falls into two branches. One seeks to accelerate the convergence of the traditional power method from a mathematical standpoint; the other uses Monte Carlo methods to approximate the PageRank result. Neither is well suited to approximating the rank order of nodes. The first aims to speed up convergence while maintaining accuracy; the second is highly efficient, but it is better at identifying high-ranking nodes than at approximating the order among them. The second part of the thesis therefore proposes the Early-stop algorithm, which has two steps: Grouping and Parallel Updating. Grouping determines the approximate range of each node's rank by simulating random walks; Parallel Updating adjusts the order of closely ranked nodes within a small range through parallel updates. Experimental results show that the Early-stop algorithm effectively improves the accuracy of the approximated order of high-ranking nodes.

The main contributions of this thesis are as follows: it proposes a system for data extraction and analysis based on personal profiles, covering the whole process from information extraction to visual analysis; it points out that visual analysis lowers the precision required of the computed results, and on that basis proposes the Early-stop algorithm for fast PageRank approximation; and it shows through extensive experiments that the Early-stop algorithm approximates node rankings more accurately than the most recent random-simulation algorithms available at the time.
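The observation that rank order matters more than exact PageRank values can be illustrated with a small sketch. The code below is not the thesis's Early-stop algorithm (it has no Grouping or Parallel Updating step); it is plain power iteration that, as an assumption made for illustration, stops once the top-k order has stayed unchanged for a few consecutive iterations. The function name and the `top_k` and `patience` parameters are hypothetical.

```python
def pagerank_until_order_stable(graph, damping=0.85, top_k=3,
                                patience=3, max_iter=100):
    """Power-iteration PageRank that stops on a stable top-k order.

    graph maps each node to a list of its successors; every successor
    must also appear as a key. Returns (top_k_order, ranks, iterations).
    """
    nodes = sorted(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    prev_order, stable, order, iteration = None, 0, [], 0

    for iteration in range(1, max_iter + 1):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new_rank[w] += share
        rank = new_rank

        # Compare orderings, not values: ties are broken by node name.
        order = sorted(nodes, key=lambda v: (-rank[v], v))[:top_k]
        stable = stable + 1 if order == prev_order else 0
        prev_order = order
        if stable >= patience:
            break

    return order, rank, iteration


# Hypothetical toy graph: "c" is cited by everyone, "d" by no one.
example_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
top, ranks, iterations = pagerank_until_order_stable(example_graph)
print(top, iterations)
```

On this toy graph the top two nodes swap places during the first few iterations, which is why the sketch waits for several unchanged iterations rather than stopping the first time the order repeats.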
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
