Research on Key Technologies for Document Fusion

Published: 2019-02-27 19:13

[Abstract]: Document fusion is a key technology for organizing text and integrating information, and an important foundation for natural language generation. It aims to integrate important information across multiple documents and generate a concise, fluent summary. Unlike conventional summarization, the summary must cover the information shared across the topical document collection and also reflect the important differences among the documents; it is not only a distillation of key content but places greater emphasis on integrating related content. Important research questions include how to obtain the topic concepts in a document collection and the topic developments that extend from them, how to order the key information of the whole collection in a logical and coherent way, and how to cluster and organize passages or sentences by topic content. This thesis explores the key technologies involved in document fusion from three aspects:

1. Document fusion integrates related information about the same event or object. Taking news events as an example, different reports of the same event present different information from different perspectives, and follow-up reports bring new related information as the event develops. To remove redundant information effectively and obtain topics and topic-related information, this thesis proposes an object-merging framework based on fuzzy multiset theory. A merge function is applied to the multisets corresponding to the document collection and to the fuzzy multisets corresponding to the concepts in individual documents; a validity evaluation function is then used to evaluate and optimize the merge function, yielding the key concepts and their related words.

2. Document fusion requires a logically coherent arrangement of content. With the sentence as the processing granularity, sentences that carry key concepts and development clues are extracted from the document collection and ordered with ranking-fusion techniques to form a new text structure that is logically coherent and highly readable. This thesis combines and models the sentences to be ordered using topic sentence clustering and a graph model, transforming sentence ordering into the kind of path optimization problem that continuous Hopfield neural networks handle well: a shortest path is sought among the nodes of the graph corresponding to each topic cluster, and the order in which the path is output is taken as the optimal ordering.

3. Document fusion must solve the basic problem of partitioning topic content. Because domain background knowledge is lacking, topic clustering for specific events or specific domains remains difficult, which shows in the difficulty of extracting relevant features effectively for such clustering problems. This thesis proposes acquiring domain knowledge from a domain ontology to guide feature selection, representing the features with the vector space model, and obtaining a fuzzy equivalence relation matrix through matrix transformations to perform clustering. The method is unsupervised: it requires no manual labeling of the data in advance and no training process, so it offers high flexibility and a high degree of automation when organizing documents in specialized domains.
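The clustering step described in the abstract (vector space features transformed into a fuzzy equivalence relation matrix) can be illustrated with a minimal sketch. The abstract does not say which matrix transformation the thesis uses, so the sketch assumes a common textbook construction: cosine similarity as the fuzzy similarity relation, a max-min transitive closure computed by repeated squaring, and a λ-cut to read off the clusters. The function names, the toy vectors, and the threshold are hypothetical and are not taken from the thesis.

```python
import math


def cosine_similarity_matrix(vectors):
    """Fuzzy similarity relation (reflexive, symmetric) over non-negative vectors."""
    n = len(vectors)
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
            elif norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                # Clamp to guard against floating-point overshoot above 1.
                sim[i][j] = min(1.0, dot / (norms[i] * norms[j]))
    return sim


def max_min_compose(r, s):
    """Max-min composition of two square fuzzy relations."""
    n = len(r)
    return [
        [max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r):
    """Square the relation until it stops changing; the fixed point is transitive."""
    while True:
        squared = max_min_compose(r, r)
        if squared == r:
            return r
        r = squared


def lambda_cut_clusters(equiv, lam):
    """Group items whose membership in the equivalence relation is at least lam."""
    n = len(equiv)
    assigned = [False] * n
    clusters = []
    for i in range(n):
        if assigned[i]:
            continue
        members = [j for j in range(n) if equiv[i][j] >= lam]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    return clusters


if __name__ == "__main__":
    # Toy term-weight vectors for four documents (hypothetical data).
    docs = [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 0.8, 0.1, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.1, 0.9, 1.0],
    ]
    equiv = transitive_closure(cosine_similarity_matrix(docs))
    print(lambda_cut_clusters(equiv, 0.8))  # [[0, 1], [2, 3]]
```

Raising or lowering the threshold λ changes the granularity of the partition without any retraining, which fits the abstract's description of the approach as unsupervised and training-free.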
[Degree-granting institution]: Jilin University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP391.1

[Similar literature]