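Research on Chinese-Vietnamese Bilingual News Topic Discovery
Published: 2018-03-01 00:20
Keywords: news elements; Hadoop; Chinese-Vietnamese comparable corpus; bilingual word similarity; Chinese-Vietnamese bilingual topics. Source: Kunming University of Science and Technology, master's thesis, 2017. Document type: degree thesis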
[Abstract]: With the advance of Internet information technology, exchanges between China and Vietnam and neighboring regions in politics, economics, culture and other areas have grown increasingly close. As news is the main carrier of information exchange between the two countries, it has become especially important to discover news topics that concern both countries, and to follow how those topics develop and evolve, in a timely and effective way. Existing work has not fully exploited the associations among news page elements for topic discovery. In addition, Chinese-Vietnamese parallel corpora are scarce, Chinese-Vietnamese bilingual dictionaries are hard to build, and statistical machine translation is not yet fully mature. To address these problems, this thesis proposes a Chinese news topic discovery method that incorporates the associations among page elements, and a Chinese-Vietnamese cross-lingual topic discovery method based on word similarity learned from comparable corpora.

(1) News stories within a topic are thematically related, and stories in the same topic also tend to have close publication times, co-occurring entities, and co-occurring event elements. The associations among these elements strongly influence topic discovery, so the thesis proposes a Chinese news topic discovery method that incorporates them. First, word-frequency-based TF-IDF weights are computed for word features to build a vector space representation of each document, and the cosine similarity between these vectors gives the initial similarity matrix of the news pages. Next, the element associations between different news documents serve as semi-supervised constraint information to correct the initial similarity matrix. The corrected matrix is then clustered with the affinity propagation (AP) algorithm, and news topics are extracted from the resulting document clusters.
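The first two steps of this pipeline can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how element associations are turned into constraints, so the rule in `element_links` (shared entity plus publication dates within `max_days`), the additive `boost`, and the toy English tokens standing in for segmented Chinese text are all assumptions made for illustration. The corrected matrix would then be passed to a clustering step such as AP, which is not shown here.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Build one sparse {term: weight} TF-IDF vector per document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(vectors):
    """Initial page-by-page similarity matrix (diagonal fixed at 1.0)."""
    n = len(vectors)
    return [[1.0 if i == j else cosine(vectors[i], vectors[j])
             for j in range(n)] for i in range(n)]

def element_links(pages, max_days=2):
    """Hypothetical constraint rule: link two pages when they share at least
    one entity and were published within max_days of each other."""
    links = []
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            shared = pages[i]["entities"] & pages[j]["entities"]
            close = abs(pages[i]["day"] - pages[j]["day"]) <= max_days
            if shared and close:
                links.append((i, j))
    return links

def apply_constraints(sim, links, boost=0.3):
    """Correct the matrix by raising the similarity of linked pairs
    (an assumed additive correction, capped at 1.0)."""
    adjusted = [row[:] for row in sim]
    for i, j in links:
        adjusted[i][j] = adjusted[j][i] = min(1.0, sim[i][j] + boost)
    return adjusted

# Toy pages: tokens, named entities, and publication day (all made up).
pages = [
    {"tokens": ["summit", "trade", "hanoi", "agreement"],
     "entities": {"Hanoi"}, "day": 1},
    {"tokens": ["leaders", "sign", "trade", "agreement"],
     "entities": {"Hanoi"}, "day": 2},
    {"tokens": ["typhoon", "coast", "flood", "rescue"],
     "entities": {"Da Nang"}, "day": 9},
]

sim = similarity_matrix(tfidf_vectors([p["tokens"] for p in pages]))
links = element_links(pages)
adjusted = apply_constraints(sim, links)
```

In this toy data only the first two pages share an entity and a close publication date, so only their similarity is raised. Unrelated pairs keep their original cosine scores.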
Finally, comparative experiments show that topic discovery incorporating news element associations outperforms the same method without the constraint information.

(2) A comparable corpus consists of news articles that arise naturally in two different languages over the same period and that are thematically related across the two languages. On this basis, the thesis proposes a Chinese-Vietnamese cross-lingual topic discovery method based on word similarity in comparable corpora. First, a Chinese-Vietnamese comparable corpus is used to train word vectors that represent the words of both languages. Using these vectors, the similarity between each Chinese query word and the Vietnamese words is computed, and Vietnamese candidate expansion words are selected by similarity value. Then, based on these bilingual word similarities, a Chinese news topic is translated into an expanded Vietnamese query. The Vietnamese words obtained by query expansion are used to search the Vietnamese corpus and return the documents relevant to the query, and the AP algorithm clusters these documents to obtain the Vietnamese events related to the Chinese text. Comparative experiments show that this comparable-corpus-based query translation method performs better than the traditional bilingual LDA method in cross-lingual topic analysis.

(3) A prototype system for Chinese-Vietnamese bilingual public opinion topic discovery was designed and implemented. The system makes it quick and convenient to see how China and Southeast Asian countries report on a given news topic, together with the topic details. It provides an experimental platform for further research on this subject and resources for later study of the evolution of Chinese-Vietnamese bilingual news topics.
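The query-expansion step in (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the three-dimensional vectors are invented toy values rather than vectors trained on a comparable corpus, and `top_k`, `min_sim`, and the match-count ranking in `retrieve` are hypothetical choices that the abstract does not specify. The final AP clustering of the retrieved documents is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_query(query_words, zh_vectors, vi_vectors, top_k=2, min_sim=0.5):
    """For each Chinese query word, keep the top_k most similar Vietnamese
    words whose similarity is at least min_sim."""
    expansion = []
    for word in query_words:
        if word not in zh_vectors:
            continue  # no vector for this query word, so nothing to expand
        scored = sorted(
            ((cosine(zh_vectors[word], vec), vi_word)
             for vi_word, vec in vi_vectors.items()),
            reverse=True,
        )
        for score, vi_word in scored[:top_k]:
            if score >= min_sim and vi_word not in expansion:
                expansion.append(vi_word)
    return expansion

def retrieve(expansion, vi_docs):
    """Return indices of Vietnamese documents that contain at least one
    expansion word, ranked by how many distinct expansion words they match."""
    terms = set(expansion)
    hits = [(len(terms & set(doc)), idx) for idx, doc in enumerate(vi_docs)]
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    return [idx for count, idx in ranked if count > 0]

# Toy vectors in a shared bilingual space (values are made up).
zh_vectors = {"贸易": [0.9, 0.1, 0.0], "台风": [0.0, 0.1, 0.9]}
vi_vectors = {
    "thương_mại": [0.8, 0.2, 0.1],
    "kinh_tế": [0.7, 0.4, 0.0],
    "bão": [0.1, 0.0, 0.9],
    "lũ": [0.0, 0.3, 0.8],
}

# Toy word-segmented Vietnamese documents.
vi_docs = [
    ["hiệp_định", "thương_mại", "việt_nam", "trung_quốc"],
    ["bão", "đổ_bộ", "miền_trung"],
    ["kinh_tế", "thương_mại", "tăng_trưởng"],
]

expansion = expand_query(["贸易"], zh_vectors, vi_vectors)
matched_docs = retrieve(expansion, vi_docs)
```

With these toy values the query word 贸易 ("trade") expands to `thương_mại` and `kinh_tế`, and the retrieval step returns the two trade-related documents while skipping the storm report.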
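[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1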
Article ID: 1549605
Article link: https://www.wllwen.com/jingjilunwen/jiliangjingjilunwen/1549605.html