
Semantic Relation Mining of Unstructured Data in the Field of Ethnic Information Resources

Published: 2018-09-07 11:17
[Abstract]: Unstructured text data makes up an important part of ethnic information resources. Developing and using it fully, and disseminating it effectively, would help promote economic and social development and cultural exchange among ethnic groups. This thesis mines semantic relations from unstructured text in the field of ethnic information resources and studies the key problems that arise during mining. The main research contents are as follows:
1. Word segmentation of text in the field of ethnic information resources. A dictionary-based method using maximum string matching performs coarse segmentation, bidirectional maximum string matching detects overlapping (intersection-type) ambiguities, and statistics gathered from a raw corpus of the field are used to resolve them. A Chinese word segmenter is implemented on the basis of these algorithms.
2. New word detection. Because text in this field contains a large number of domain-specific terms, a large-scale domain corpus is used to detect new words. To deal with the massive multi-feature data this produces and the slow speed of computing the statistics, the thesis proposes an N-Gram-based multi-feature detection algorithm for massive corpora under the MapReduce parallel computing model. The algorithm uses N-Grams to generate candidate words, takes the chi-square statistic, left and right entropy, and word frequency as features, parallelizes the feature computation, and applies rules to decide whether a candidate is a new word. New word detection for the field is implemented on the basis of this algorithm.
3. Entity relation mining. After the relevant named entities in the field are recognized, the relations among them are mined. Because it is difficult to define a complete entity relation schema in advance, and very difficult to build a large relation-annotated corpus, the thesis applies open information extraction based on unsupervised learning to mine entity relations from text, and designs and implements a platform for mining relations among named entities in the field.
Mining semantic relations from unstructured data in the field of ethnic information resources in this way addresses the problem of ethnic resource management and service.
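The ambiguity-detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's implementation, so everything here is an assumption: the function names `forward_max_match`, `backward_max_match`, and `find_ambiguity` are hypothetical, the dictionary is a plain Python set, and a sentence is flagged as ambiguous simply when the forward and backward segmentations disagree.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def find_ambiguity(text, vocab):
    """Return both segmentations and whether they disagree (assumes a non-empty vocab)."""
    max_len = max(map(len, vocab))
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    return fwd, bwd, fwd != bwd
```

With the hypothetical dictionary `{"研究", "研究生", "生命", "起源"}`, the sentence `研究生命起源` segments as 研究生 / 命 / 起源 in the forward pass and 研究 / 生命 / 起源 in the backward pass, so it is flagged as ambiguous. Choosing between the two readings would then rely on corpus statistics, which this sketch does not attempt.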
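The new word detection idea (N-Gram candidates scored by frequency and left/right entropy, then filtered by rules) can be sketched in the same spirit. This is an illustration under stated assumptions, not the thesis's algorithm: it runs in a single process and only mimics the map and reduce phases with plain functions, it omits the chi-square feature, and the names `map_ngrams`, `reduce_features`, `detect_new_words` as well as the thresholds `min_freq` and `min_entropy` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_ngrams(sentence, max_n):
    """Map step: emit (candidate, (left_char, right_char)) for each n-gram of length 2..max_n."""
    for n in range(2, max_n + 1):
        for i in range(len(sentence) - n + 1):
            left = sentence[i - 1] if i > 0 else "<s>"
            right = sentence[i + n] if i + n < len(sentence) else "</s>"
            yield sentence[i:i + n], (left, right)


def entropy(counter):
    """Shannon entropy (in bits) of a distribution given as counts."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def reduce_features(pairs):
    """Reduce step: aggregate frequency and neighbor entropies per candidate."""
    lefts, rights = defaultdict(Counter), defaultdict(Counter)
    for cand, (left, right) in pairs:
        lefts[cand][left] += 1
        rights[cand][right] += 1
    return {
        cand: {
            "freq": sum(lefts[cand].values()),
            "left_entropy": entropy(lefts[cand]),
            "right_entropy": entropy(rights[cand]),
        }
        for cand in lefts
    }


def detect_new_words(corpus, known_words, max_n=3, min_freq=3, min_entropy=1.0):
    """Rule step: keep frequent unknown candidates whose neighbors on both sides vary enough."""
    pairs = (pair for sentence in corpus for pair in map_ngrams(sentence, max_n))
    feats = reduce_features(pairs)
    return sorted(
        cand for cand, f in feats.items()
        if cand not in known_words
        and f["freq"] >= min_freq
        and min(f["left_entropy"], f["right_entropy"]) >= min_entropy
    )
```

A candidate that appears often and is surrounded by many different characters on both sides is more likely to be a free-standing word, which is why the rule requires both entropies to pass the threshold. In a real MapReduce job the map and reduce functions would run on separate workers keyed by candidate string; the thresholds here are placeholders that would need tuning on an actual corpus.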
[Degree-granting institution]: Yunnan Normal University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1

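Related master's theses (top 1)

1 Huang Peng; Semantic Relation Mining of Unstructured Data in the Field of Ethnic Information Resources [D]; Yunnan Normal University; 2016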



Article ID: 2228106


Link to this article: https://www.wllwen.com/jingjilunwen/jiliangjingjilunwen/2228106.html

