Research and Implementation of Several Key Text Mining Technologies

Published: 2018-05-01 18:59

Topics: text mining + new word discovery; Source: Harbin Institute of Technology, 2017 master's thesis

[Abstract]: Text mining refers to computer processing of text, such as information extraction, meaning analysis, classification and labeling, and association analysis, that extracts information and even knowledge usable by people. The growth of the Internet industry and the informatization of other industries has supplied text mining with abundant corpus resources, while also demanding continual improvement in the accuracy, effectiveness, computational efficiency, and personalization of text mining systems. Text mining must extract valuable information from plain text and thereby provide a foundation for informatization. New words belonging to a specific semantic category, text event types, text event elements, and document summaries are among the most widely used kinds of such information. This thesis studies and implements methods for several core problems in text mining: new word discovery for a specific semantic category, event type recognition on the ACE2005 corpus together with event element recognition built on the event type information, and automatic summarization of single and multiple documents. The new word discovery, event recognition, and automatic summarization systems were each evaluated on their own annotated corpora and achieved fairly satisfactory results.
For new word discovery in a specific semantic category, this thesis takes into account the high cost of annotating a corpus with categories and starts from the observation that new words of the same category share similar contexts. It designs a new word discovery method based on bootstrapping and soft pattern matching. A new word is split into several parts according to its semantic characteristics, and the sentence containing it is divided into multiple slots around those parts. Each candidate new word is scored by the word-vector similarity and the word-frequency-vector similarity between its slots and those of the already labeled new words, and high-scoring candidates are added to the labeled set. In experiments on an electronic medical record corpus, symptom new words were split into two parts, body part and characteristic, and symptom new word discovery reached an F-score of 81.40%.
For event type recognition and event element recognition on the ACE2005 corpus, this thesis improves on other researchers' methods based on support vector machine classifiers. In event type recognition, it uses the positions of the candidate trigger words in the same sentence and information about the events they trigger to add features related to candidate trigger words and candidate elements, and it optimizes the text preprocessing. In event element recognition, it also adds new features related to entity, value, and time labels. Both methods were evaluated on the Chinese and English ACE2005 corpora, which carry event labels and the corresponding entity, time, and value annotations. Event type recognition reached an F-score of 64.2%, and event element recognition reached an F-score of 63.7%.
For automatic summarization, this thesis combines the TextRank algorithm with clustering. BM25 and several sentence similarity algorithms set the edge weights in TextRank's undirected graph model, clustering is used to try to reduce redundant information in the automatic summary, and the relationships between sentences and documents serve as the basis for summary extraction. The system was run on the DUC2001 and DUC2002 corpora in single-document and multi-document experiments at several summary lengths, evaluated with the ROUGE tool, and achieved good results.
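The abstract describes weighting the edges of the TextRank graph with BM25. The following is a minimal sketch of that idea, not the thesis implementation. The function names, the parameter values (`k1`, `b`, `damping`, `iterations`), and the choice to average the two directional BM25 scores to obtain an undirected weight are all assumptions made for illustration. The clustering step for redundancy reduction and the other sentence similarity measures are omitted.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """BM25 score of the token list `query` against each token list in `docs`.

    Assumes every sentence in `docs` is non-empty.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in set(query):  # each distinct query term counted once
            if term not in tf:
                continue
            # "+1" IDF variant, which is always positive.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def textrank_bm25(sentences, damping=0.85, iterations=50):
    """Weighted TextRank scores for tokenized sentences."""
    n = len(sentences)
    raw = [bm25_scores(s, sentences) for s in sentences]
    # BM25 is asymmetric; average both directions for an undirected edge
    # (an assumption of this sketch).
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = (raw[i][j] + raw[j][i]) / 2
    out_sum = [sum(row) for row in w]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            incoming = sum(
                w[j][i] / out_sum[j] * rank[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_rank.append((1 - damping) / n + damping * incoming)
        rank = new_rank
    return rank


def summarize(sentences, k=2):
    """Return the k highest-ranked sentences in their original order."""
    rank = textrank_bm25(sentences)
    top = sorted(range(len(sentences)), key=lambda i: rank[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no terms with any other sentence gets no incoming edge weight, so its score stays at the baseline `(1 - damping) / n` and it is the last candidate for the summary.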
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

[References]