Research on Methods and Key Techniques for Chinese Microblog Text Normalization

Published: 2018-04-01 03:27

Topic: Chinese microblog. Focus: text normalization. Source: PhD dissertation, Wuhan University, 2016.


[Abstract]: In recent years, microblogging has become one of the most important social media, owing to its short-text form, immediacy, and fission-style (viral) propagation. It now serves as a major medium for news consumption, interpersonal communication, self-expression, social sharing, and civic participation, and as an important platform for public opinion, brand and product promotion, and traditional media distribution. However, microblog text contains a large number of non-standard (informal) words, which degrades the performance of traditional natural-language-processing tools, so text normalization has become an important preprocessing step for microblog text analysis. Unlike English informal words, which are typically out-of-vocabulary, Chinese informal words take more complex forms, such as phonetic substitutions, abbreviations, paraphrases, and neologisms. This dissertation studies the normalization of Chinese microblog text. Traditional methods usually treat an informal word as a spelling error and normalize it with a noise-channel or translation model; other methods approach normalization from a semantic perspective but still face key challenges. Guided by the linguistic characteristics of Chinese microblog text, this dissertation investigates three key problems: learning the senses of informal words, mining informal-word/standard-word pair relations, and jointly handling normalization and word segmentation. The specific contributions are as follows.

1. A word sense induction model based on a lexical-chain hypergraph. Most informal words in microblogs carry new senses, so identifying them can be cast as a disambiguation task; conventional dictionaries, however, no longer suffice, and the key is to learn or induce word senses from microblog text itself. Word sense induction is an unsupervised task whose goal is to induce the senses of a target word from large-scale text. The proposed model uses lexical chains to represent the higher-order semantic relations among multiple instances of the target word, then builds a hypergraph from these chains, capturing complex higher-order semantic relations from a global perspective. Experiments demonstrate the model's effectiveness, show the influence of lexical chains on system performance, and show that a word's number of senses and its semantic granularity strongly affect induction performance.

2. Mining informal-word/standard-word pairs with learned embeddings. An informal word usually corresponds to a fixed standard word, and building a dictionary of informal words aids normalization; the key is to mine such pairs from large-scale microblog text. Assuming that an informal word and its standard form share a sense, the dissertation proposes a multi-sense embedding model that, unlike traditional multi-sense embeddings in which different words' sense representations are learned independently, learns global multi-sense embeddings and synonym relations simultaneously. By incorporating window-position information, the model effectively resolves the representation-bias problem. On top of this model, with filtering and classification as post-processing, a framework is proposed for mining informal-standard word pairs from large-scale microblog corpora. Experiments demonstrate the method's effectiveness.

3. A joint model of word segmentation, POS tagging, and text normalization. To address the segmentation problem in Chinese microblogs, the dissertation proposes a joint model that extends a transition-based joint segmentation and POS-tagging model with additional transition actions for normalization. Segmentation is performed on normalized text, while good segmentation in turn helps detect informal words and thus benefits normalization. The model can be trained effectively on standard annotated corpora, overcoming the shortage of annotated microblog data. The model is scored with two kinds of features: features over standard text serve as shared features and features over non-standard text as domain features, naturally augmenting the feature set and giving the model good domain adaptability. Experiments show that the joint model lets the three tasks benefit from one another, and that statistical language features help improve their performance.
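The lexical-chain idea behind the first contribution can be conveyed with a deliberately simplified sketch (hypothetical code, not the dissertation's model): each occurrence context of the target word contributes a set of related words standing in for a lexical chain (one hyperedge), and contexts whose chains overlap are grouped into the same induced sense.

```python
def induce_senses(contexts, min_overlap=1):
    """Toy word-sense induction: each context of the target word is
    reduced to a set of content words standing in for a lexical chain
    (one hyperedge), and contexts are greedily merged into the same
    sense cluster when their chains share enough vocabulary."""
    senses = []  # each sense: (accumulated chain vocabulary, member context ids)
    for cid, words in enumerate(contexts):
        chain = set(words)
        for vocab, members in senses:
            if len(chain & vocab) >= min_overlap:
                members.append(cid)
                vocab |= chain  # grow this sense's chain vocabulary
                break
        else:
            senses.append((chain, [cid]))  # no overlap: start a new sense
    return [members for _, members in senses]

# Two senses of an ambiguous target (e.g. "apple"): fruit vs. company.
contexts = [
    ["eat", "fruit", "juice"],
    ["fruit", "tree", "orchard"],
    ["iphone", "company", "stock"],
    ["stock", "company", "ceo"],
]
print(induce_senses(contexts))  # → [[0, 1], [2, 3]]
```

The actual model scores higher-order relations over the full hypergraph and partitions it globally; this greedy merge only illustrates the intuition that shared lexical chains signal a shared sense.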
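The second contribution's pair-mining step rests on the assumption that an informal word and its standard form share a sense and therefore lie close in embedding space. A minimal sketch under that assumption (the vectors, function names, and threshold below are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_pairs(informal_words, standard_vocab, vectors, threshold=0.8):
    """Toy pair mining: propose the most similar in-vocabulary word as the
    normalization of each informal word, keeping only high-confidence
    candidates (the dissertation adds filtering and a classifier on top
    of a step like this)."""
    pairs = {}
    for w in informal_words:
        best, score = max(
            ((s, cosine(vectors[w], vectors[s])) for s in standard_vocab),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            pairs[w] = best
    return pairs

# Hypothetical 3-d embeddings; "神马" is microblog slang for "什么" ("what").
vectors = {
    "神马": [0.90, 0.10, 0.00],
    "什么": [0.85, 0.15, 0.05],
    "今天": [0.00, 0.20, 0.95],  # "today", an unrelated standard word
}
print(mine_pairs(["神马"], ["什么", "今天"], vectors))  # → {'神马': '什么'}
```

Real multi-sense embeddings would compare the appropriate sense vector of each word rather than a single vector per word; the single-vector version above keeps the sketch self-contained.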

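For the third contribution, the interaction between segmentation and normalization can be caricatured with a greedy longest-match decoder (a stand-in for the actual transition-based, feature-scored joint model; the lexicon and informal dictionary below are invented): recognizing an informal entry as a single unit simultaneously fixes the segmentation and triggers its rewrite.

```python
def joint_normalize_segment(sentence, norm_dict, lexicon):
    """Toy joint decoding: scan left to right, greedily take the longest
    match against either the standard lexicon or the informal-word
    dictionary (standing in for SEP/APP transition actions), and rewrite
    an informal match to its standard form (a substitution action)."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):  # longest match first
            cand = sentence[i:j]
            if cand in lexicon or cand in norm_dict:
                words.append(norm_dict.get(cand, cand))
                i = j
                break
        else:
            words.append(sentence[i])  # unknown character: one-char word
            i += 1
    return words

lexicon = {"今天", "天气", "好"}   # standard words: "today", "weather", "good"
norm_dict = {"灰常": "非常"}       # informal "灰常" → standard "非常" ("very")
print(joint_normalize_segment("今天天气灰常好", norm_dict, lexicon))
# → ['今天', '天气', '非常', '好']
```

The dissertation's model instead learns to score every transition jointly with POS tagging, so normalization decisions are not dictionary-only; this sketch shows only why the two tasks benefit each other.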
[Degree-granting institution]: Wuhan University
[Degree level]: PhD
[Year conferred]: 2016
[CLC classification]: TP391.1





