Research on Chinese Personal Social Relation Extraction Based on News Data
[Abstract]: As the Internet continues to grow, so does the volume of information and data it contains. Information extraction aims to mine structured data from the massive amount of unstructured data on the Internet. Entity relation extraction, a subtask of information extraction, has become a research hotspot in data mining and information retrieval, and personal relation extraction is one branch of it. Personal relation triples can be used to build personal relation networks and question answering systems, so they have high application value. However, current research on relation extraction focuses mainly on English corpora, while relation extraction on Chinese data has progressed slowly and remains difficult. Machine-learning-based relation extraction has become a popular research topic because of its good extraction performance. According to how the training data is obtained, this paper studies three approaches, based on semi-supervised learning, distant supervision, and unsupervised learning. The main contributions are as follows: 1. Supervised relation extraction depends heavily on manually annotated training data, and manual annotation is costly. To achieve high extraction performance with only a small amount of labeled data, this paper studies semi-supervised relation extraction. A semi-supervised algorithm based on label propagation can improve relation extraction when labeled data is scarce, but selecting the training samples at random affects extraction performance.
To improve the extraction performance of the label propagation algorithm, this paper combines label propagation with active learning to extract personal relations. The method actively selects the samples that are most helpful for relation classification, which reduces the number of uninformative labeled samples and improves system performance for the same amount of labeled data. 2. Current relation extraction research usually uses distant supervision to construct training data automatically, but the basic assumption of distant supervision is not always accurate, so noisy data is introduced into the training data. This paper proposes a scoring-function-based method for filtering noise from the training data, which reduces the noise in training data obtained through distant supervision. In addition, because the accuracy of current relation extraction systems is not ideal, this paper applies word vector techniques: it extracts word-vector-based features from single-sentence text and adds them to the commonly used relation extraction feature set to improve the performance of the personal relation extraction system. 3. All of the above methods require relation types to be predefined before relation instances of those types can be extracted. This restricts the relation types that the extraction model can obtain, so relation triples of new relation types cannot be obtained. Therefore, this paper proposes an unsupervised relation extraction method that requires neither training data nor predefined relation types.
In this method, person pairs with a high degree of association are first obtained from news title data as the targets of relation extraction. The news data of the associated persons is then preprocessed, and the keywords in their co-occurrence sentences are extracted using TF-IDF. Next, associations between words are derived from co-occurrence information and used to build a keyword association network. Finally, the relation between the persons is obtained through graph clustering analysis of the association network.
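The unsupervised steps described above (TF-IDF keywords, a keyword association network, graph clustering) can be illustrated with a minimal sketch. It is not the thesis's implementation, and everything specific in it is an assumption: the sentences are toy data and are taken to be already segmented into tokens, IDF is computed across person pairs with a smoothed formula, edge weights are plain sentence-level co-occurrence counts, and connected components of a thresholded graph stand in for the graph clustering step, which the abstract does not specify. The function names `tfidf_keywords`, `cooccurrence_edges`, and `cluster_keywords` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_keywords(corpus, pair, top_k=5):
    """Rank tokens in one pair's co-occurrence sentences by TF-IDF.

    TF is counted over the pair's sentences; IDF is computed over pairs
    (an assumption: the abstract does not say what counts as a document).
    """
    n_pairs = len(corpus)
    # Document frequency: in how many pairs' sentence sets does a token occur?
    df = Counter()
    for sentences in corpus.values():
        df.update({tok for sent in sentences for tok in sent})
    tf = Counter(tok for sent in corpus[pair] for tok in sent)
    total = sum(tf.values())
    scores = {
        tok: (count / total) * (math.log((1 + n_pairs) / (1 + df[tok])) + 1)
        for tok, count in tf.items()
    }
    # Sort by score descending, then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:top_k]


def cooccurrence_edges(sentences, keywords):
    """Count, for each keyword pair, the sentences in which both appear."""
    keyset = set(keywords)
    edges = Counter()
    for sent in sentences:
        present = sorted(keyset.intersection(sent))
        edges.update(combinations(present, 2))
    return edges


def cluster_keywords(keywords, edges, min_weight=1):
    """Group keywords into connected components of the thresholded graph.

    Connected components are a simple stand-in for graph clustering.
    """
    neighbours = {kw: set() for kw in keywords}
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in keywords:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Toy, hypothetical data: tokenized co-occurrence sentences per person pair.
corpus = {
    ("Person A", "Person B"): [
        ["married", "wedding", "ceremony"],
        ["wedding", "wife", "attended"],
        ["coach", "team", "said"],
        ["coach", "training", "team"],
    ],
    ("Person C", "Person D"): [
        ["attended", "ceremony", "said"],
        ["training", "said"],
    ],
}

pair = ("Person A", "Person B")
keywords = tfidf_keywords(corpus, pair, top_k=5)
edges = cooccurrence_edges(corpus[pair], keywords)
clusters = cluster_keywords(keywords, edges, min_weight=1)
print(keywords)
print(clusters)
```

In this sketch each resulting keyword cluster would be read as a candidate description of the relation between the two persons. A real system would need Chinese word segmentation and stop-word removal before this step, and would likely use a stronger clustering algorithm than connected components.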
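[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1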