Information Extraction and Visual Analysis Based on Personal Files
[Abstract]: With the spread of the Internet, the amount of information online has exploded, and the types of information have become more diverse. One of these data types can be called "personal files": resumes, personal home pages, biography pages in online encyclopedias, and so on. From such data, possible social relationships among people can be inferred. For example, if two people studied at the same university during overlapping periods, they are likely to be classmates. A social network obtained through this kind of analysis is valuable and can be applied to a number of problems, such as those most commonly studied in social network analysis. This paper introduces a system for information extraction and visual analysis of personal file data and describes the main algorithms involved. The system has two main functions: extracting information from personal files to build an entity-based association network and predict social relationships among people; and, based on this network, carrying out a preliminary PageRank-based analysis of people's importance or influence. The network is built in two steps. The first step establishes an association network composed of various types of entities, which can be regarded as a heterogeneous information network for a specific domain. This step involves structuring the personal file data, including entity recognition and event extraction; to extract events, we combine clustering based on the similarity of syntactic parse trees with rule-based extraction. The second step uses path analysis over the established association network to establish relationships between person-name nodes. Before this, the relationships between other types of nodes need to be supplemented in order to obtain more comprehensive path information.
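The classmate example above can be illustrated with a minimal sketch. This is not the system described in the thesis: the `Enrollment` record, the `candidate_classmates` helper, and the sample data are all hypothetical, and the rule used (same school, overlapping year ranges) is only the simplest reading of the example in the abstract.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Enrollment:
    """Hypothetical structured record extracted from a personal file."""
    person: str
    school: str
    start: int  # first year of attendance (inclusive)
    end: int    # last year of attendance (inclusive)


def overlaps(a: Enrollment, b: Enrollment) -> bool:
    """True if both records name the same school and their year ranges intersect."""
    return a.school == b.school and a.start <= b.end and b.start <= a.end


def candidate_classmates(records):
    """Return the set of person pairs that attended one school at the same time."""
    pairs = set()
    for a, b in combinations(records, 2):
        if a.person != b.person and overlaps(a, b):
            pairs.add(tuple(sorted((a.person, b.person))))
    return pairs


# Made-up sample data, for illustration only.
records = [
    Enrollment("Alice", "University X", 2008, 2012),
    Enrollment("Bob", "University X", 2011, 2015),
    Enrollment("Carol", "University X", 2013, 2017),
    Enrollment("Dave", "University Y", 2008, 2012),
]
print(sorted(candidate_classmates(records)))
```

Such pairs are only candidate relationships; the abstract describes the prediction as a likelihood, not a certainty.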
Considering the characteristics of heterogeneous networks, we use different methods to build the relationships between different types of nodes. Visual analysis of the information network relies mainly on PageRank to compute the importance of people. In a visual environment, given the limits of human cognition and the precision of display devices, we argue that the ranking of nodes matters more than their actual PageRank values. The PageRank computation should therefore stop early, once the relative order of the nodes no longer changes. Research on improving PageRank falls into two branches. One speeds up the convergence of the traditional power method from a mathematical point of view; the other approximates PageRank results with Monte Carlo methods. Neither is well suited to approximating the ranking of nodes. The first accelerates convergence on the premise of maintaining accuracy. The second is very efficient, but it is better at identifying high-ranking nodes than at ordering them correctly. The second part of this paper therefore proposes the Early-stop algorithm, which consists of two steps: Grouping and Parallel Updating. Grouping uses simulated random walks to determine the approximate range of each node's rank; Parallel Updating then adjusts the order of nodes with nearby ranks within a small range through parallel updates. Experimental results show that the Early-stop algorithm effectively improves the accuracy of the approximate ordering of high-ranking nodes. The main contributions of this paper are as follows. It proposes a system for extracting and analyzing personal file data that covers the whole process from information extraction to visual analysis.
It points out that visual analysis lowers the precision required of the computed results, and on that basis proposes Early-stop, a fast approximate PageRank algorithm. Extensive experiments show that the Early-stop algorithm approximates node rankings more accurately than the latest stochastic simulation algorithm available at the time.
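The idea of stopping once the node order stabilizes can be sketched as follows. This is a simplified illustration only, not the thesis's Early-stop algorithm: it runs the ordinary power method and stops when the ranking has stayed the same for several consecutive iterations, with no Grouping or Parallel Updating step. The function name, the `patience` parameter, and the sample graph are assumptions made for this example.

```python
def pagerank_rank_stable(graph, damping=0.85, patience=3, max_iter=200):
    """Power-method PageRank that stops once the node ordering is unchanged
    for `patience` consecutive iterations (an illustrative stopping rule).

    `graph` maps every node to a list of its out-neighbours; it must be
    non-empty and every neighbour must also appear as a key.
    """
    nodes = sorted(graph)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    prev_order, stable = None, 0
    for iteration in range(1, max_iter + 1):
        # Mass held by nodes without out-links is spread uniformly.
        dangling = sum(scores[v] for v in nodes if not graph[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * scores[v] / len(graph[v])
                for w in graph[v]:
                    new[w] += share
        scores = new
        # Rank by descending score; ties are broken by node name.
        order = sorted(nodes, key=lambda v: (-scores[v], v))
        stable = stable + 1 if order == prev_order else 0
        prev_order = order
        if stable >= patience:
            break
    return order, scores, iteration


# Made-up four-node graph, for illustration only.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
order, scores, iterations = pagerank_rank_stable(graph)
print(order, iterations)
```

A stable order over a few iterations does not guarantee that the final ranking has been reached, since the scores of closely ranked nodes can still cross later. That gap is one reason a dedicated step for adjusting nearby ranks, as the abstract describes, may be needed.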
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Lü Chaoyang (吕朝阳); Information Extraction and Visual Analysis Based on Personal Files [D]; Shandong University; 2017
Article ID: 2119385
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2119385.html