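Information Extraction Based on Conditional Random Fields and Visualization of Intelligence Information
Published: 2018-05-21 15:55
Topic of this article: CRFs + CRFsuite; Reference: master's thesis, North China University of Technology, 2017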
[Abstract]: In recent years networks have developed rapidly, and network security threats have grown with them. The rapid growth in the volume, velocity, and variety of network data raises the problem of how to fuse, store, and manage massive heterogeneous data. As Internet information grows explosively, information about people grows geometrically as well, yet the result is rich in data and poor in information. Text remains the main source from which people obtain information, so how to extract person information effectively from massive amounts of text has become a topic of wide interest. The traditional approach extracts and analyzes such text by manual statistics. It is fairly accurate but consumes a great deal of human effort, which makes extraction very inefficient. It can no longer meet the demand for efficient information acquisition, and this gave rise to information extraction technology. Based on a study of network data and information extraction models, the main contributions of this thesis are as follows:
1. An extraction rule for person information is proposed. The rule is built by studying the format and characteristics of network data, and it has three parts: the feature leading words of the person information, the location where it appears, and the method. The location is one of three types: Body, Cookies, or Url. The method indicates whether the current session uses GET or POST. The feature leading words are the three keywords immediately preceding the position of the person-information value; they are separated and extracted by word segmentation and filtering. Extraction with this rule yields person information accurately.
2. A person-attribute-oriented information extraction method based on CRFsuite is proposed. CRFsuite is an implementation of the conditional random fields (CRFs) algorithm for labeling sequence data; the model trains quickly and achieves high accuracy. By learning from existing fields, the feature leading words, locations, and methods of person information in network data are extracted, and the person information extraction rules are built from them. CRFsuite is used to train these into a model, and the model is applied to network data to match person information and build a structured person-information database, which finally yields intelligence data in structured form.
3. A visual analysis system is designed and implemented. The system displays the relationships among the structured "intelligence" items produced by information extraction in graphical form, and links virtual-identity information to real-person information. This turns "information" into "intelligence" and ultimately converts an advantage in information resources into an advantage in decision-making.
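The three-part extraction rule described in the abstract (leading word, location, method) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the class and function names (`ExtractionRule`, `HttpRequest`, `apply_rule`) are hypothetical, and the sketch simplifies the rule by treating a single leading word as the field name that directly precedes the value, rather than using three segmented keywords or a trained CRF model.

```python
from dataclasses import dataclass
from http.cookies import SimpleCookie
from typing import Dict, Optional
from urllib.parse import parse_qsl, urlsplit


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: which field holds a person attribute, and where."""
    attribute: str     # name of the person attribute, e.g. "email"
    leading_word: str  # simplified: the key that directly precedes the value
    location: str      # "Body", "Cookies", or "Url"
    method: str        # "GET" or "POST"


@dataclass(frozen=True)
class HttpRequest:
    """Hypothetical, minimal view of one captured HTTP request."""
    method: str
    url: str
    cookies: str = ""  # raw Cookie header value
    body: str = ""     # assumed to be form-urlencoded


def _fields(request: HttpRequest, location: str) -> Dict[str, str]:
    """Parse the key/value pairs found at the given location."""
    if location == "Url":
        return dict(parse_qsl(urlsplit(request.url).query))
    if location == "Body":
        return dict(parse_qsl(request.body))
    if location == "Cookies":
        jar = SimpleCookie()
        jar.load(request.cookies)
        return {name: morsel.value for name, morsel in jar.items()}
    raise ValueError(f"unknown location: {location!r}")


def apply_rule(rule: ExtractionRule, request: HttpRequest) -> Optional[str]:
    """Return the attribute value if the rule matches the request, else None."""
    if request.method.upper() != rule.method.upper():
        return None
    return _fields(request, rule.location).get(rule.leading_word)
```

For example, a rule `ExtractionRule("email", "email", "Body", "POST")` applied to a POST request whose body is `email=a%40b.com&pwd=x` would return the decoded e-mail value, while the same rule applied to a GET request returns `None`. In the thesis the rules are learned and applied through a CRFsuite model; the dictionary lookup here only stands in for that matching step.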
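[Degree-granting institution]: North China University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.08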
Article ID: 1919845
Article link: https://www.wllwen.com/guanlilunwen/ydhl/1919845.html