Research on a Topic Identification Method Based on Clique Subgroup Clustering in Multi-Relation Text Graphs

Published: 2018-12-11 11:29

[Abstract]: Now that the internet has become the primary channel for scholarly communication and information dissemination, more and more institutions publish their research results in electronic form, and these electronic text resources contain rich semantic information. Faced with this volume of material, researchers find it hard to grasp the main content of the texts quickly. How to present the core topics of text resources efficiently and accurately, help researchers focus on the important related information in a text collection, and improve research efficiency has long been an important problem in text mining. Building on useful existing work and on the characteristics of terms and term relations in text, this paper proposes the following overall approach: represent the terms in a text, together with the co-occurrence, syntactic, and semantic relations among them, as a graph; identify the closely associated subgroups (cliques) in the text relation graph; and cluster those subgroups to reveal the sub-topics of the text. The research covers two aspects: (1) overlaying and merging the terms in a text collection and the various relation attributes among them to build a multi-relation text overlay model; (2) clustering on the basis of the similarity distance between cliques and their semantic labels to identify the important sub-topics contained in the text collection. A text collection was built from the literature of the last five years on the topic "migraine disorders", and two validation experiments were carried out on the proposed method. Experiment 1 compared the method's output with the indexing terms assigned by domain experts, grouped by semantic type; the results are consistent with the experts' grouping. Experiment 2 compared the method with the widely used LDA method; the proposed method improved on LDA in both precision and recall. Both experiments support the effectiveness of the method.
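The pipeline described in the abstract (overlay several relation types on one term graph, find cliques, cluster the cliques into sub-topics) can be illustrated with a minimal sketch. The abstract does not give the paper's actual similarity measure, threshold, or clustering algorithm, so everything below is an assumption made for illustration: Jaccard overlap stands in for the clique similarity distance, single-link merging stands in for the clustering step, the semantic labels are omitted, and the function names and toy term relations are hypothetical.

```python
from itertools import combinations


def build_overlay_graph(relations):
    """Merge several relation layers into one undirected term graph.

    relations: iterable of (term_a, term_b, relation_type) triples.
    Returns {term: {neighbor: set of relation types on that edge}}.
    """
    graph = {}
    for a, b, rel in relations:
        if a == b:
            continue  # ignore self-loops
        graph.setdefault(a, {}).setdefault(b, set()).add(rel)
        graph.setdefault(b, {}).setdefault(a, set()).add(rel)
    return graph


def maximal_cliques(graph, min_size=3):
    """Enumerate maximal cliques with Bron-Kerbosch (with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda v: len(p & set(graph[v])))
        for v in list(p - set(graph[pivot])):
            neighbors = set(graph[v])
            expand(r | {v}, p & neighbors, x & neighbors)
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def jaccard(a, b):
    """Assumed similarity between two cliques: shared terms / all terms."""
    return len(a & b) / len(a | b)


def cluster_cliques(cliques, threshold=0.5):
    """Single-link merging: join clusters while any cross pair is similar enough.

    Returns one frozenset of terms per cluster (a candidate sub-topic).
    """
    clusters = [[c] for c in cliques]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if any(jaccard(a, b) >= threshold
                   for a in clusters[i] for b in clusters[j]):
                clusters[i].extend(clusters[j])
                del clusters[j]
                merged = True
                break
    return [frozenset().union(*cluster) for cluster in clusters]


# Toy relations, invented for illustration only (not data from the paper).
RELATIONS = [
    ("migraine", "triptan", "cooccurrence"),
    ("migraine", "triptan", "semantic"),
    ("migraine", "sumatriptan", "cooccurrence"),
    ("triptan", "sumatriptan", "semantic"),
    ("migraine", "rizatriptan", "cooccurrence"),
    ("triptan", "rizatriptan", "semantic"),
    ("migraine", "aura", "cooccurrence"),
    ("migraine", "photophobia", "cooccurrence"),
    ("aura", "photophobia", "syntactic"),
]

if __name__ == "__main__":
    graph = build_overlay_graph(RELATIONS)
    for topic in cluster_cliques(maximal_cliques(graph)):
        print(sorted(topic))
```

In this toy graph the two treatment-related cliques share two of their four terms and merge into one sub-topic, while the symptom-related clique shares only "migraine" with them and stays separate. A real implementation would also need the term extraction and relation extraction steps that feed the graph, which the sketch takes as given.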
[Author affiliations]: National Science Library, Chinese Academy of Sciences; Wuhan Library, Chinese Academy of Sciences
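[Funding]: Young Talent Frontier Field Project of the National Science Library, Chinese Academy of Sciences, "Research on a Graph-Pattern-Based Method for Semantic Topic Annotation of Scientific and Technical Literature" (G160081001)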
[Classification number]: G254

[Similar literature]