Design and Application of a Content Association Method Based on Word Embedding Models

Published: 2018-11-15 13:38

【Abstract】: Associations between text contents are common in everyday life. They may be citations between articles in a scholarly literature system, or the correspondence between reader comments and the original article in an online forum. Such associations give users a useful channel of connection and make their understanding of a text more objective and comprehensive. They can also support downstream tasks such as information retrieval, summarization, and content management. However, the growing scale of corpus data means this task cannot rely on manual effort alone, so an automated method for content association is needed. To date, most content association methods compute similarity from traditional syntactic or semantic features, and their main problems stem from the limitations of shallow, surface-level features of texts and words. In recent years, word embedding models have performed very well in natural language processing tasks, particularly in capturing deep semantics. In this thesis, we propose a content association method that uses word embedding models as features. We first study the structure and principles of the model in depth, then evaluate in detail the word vectors trained under different parameter settings, and finally run experiments on three corpora: English biology research papers, and Chinese and English online forum data (天涯杂谈 and the Guardian). A comparison with traditional methods confirms the effectiveness of the proposed method.
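【Degree-granting institution】: Beijing University of Posts and Telecommunications (北京邮电大学)
【Degree level】: Master's
【Year degree granted】: 2016
【Classification number】: TP391.1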

【Related literature】