Research on Key Technologies for Chinese-Vietnamese Bilingual Topic Evolution in Internet News
Published: 2018-07-16 12:39
[Abstract]: Vietnam and China have close ties, so analyzing how topics evolve over time in large collections of Chinese and Vietnamese news text is valuable for cultural exchange between the two peoples. Topic evolution analysis aims to present the topics users care about in a concise, ordered form, so that users can follow how a topic arose and developed. A Chinese-Vietnamese topic text collection is a set of texts that describe the same content in two languages. Whichever language a text is written in, it contains identical or similar event elements, such as the object, time, place, and event trigger word. This commonality makes it possible to construct Chinese-Vietnamese topic element pairs that link the two languages. Working from an existing Chinese-Vietnamese topic text collection, this thesis applies an evolution analysis method based on subtopic association and makes the following two contributions. 1. A hypergraph-based method for extracting topic elements from Chinese-Vietnamese bilingual news is proposed. Event elements are first extracted from the news with a trigger-word-driven method. A topic hypergraph model is then built on top of them, with Chinese and Vietnamese event elements as nodes and the sentences of the Chinese-Vietnamese text collection as hyperedges, and the initial weights of nodes and hyperedges are computed with a probability evaluation function. Finally, a PageRank random walk scores the Chinese-Vietnamese event elements, which yields the Chinese-Vietnamese topic elements. Experimental results show that this method performs significantly better than event element extraction that considers only a single text.
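The abstract does not give the probability evaluation function or the exact walk, so the following is only a minimal sketch of a PageRank-style random walk on a weighted hypergraph, under stated assumptions: the walker picks a hyperedge containing its current node with probability proportional to the hyperedge weight, then moves to a node of that hyperedge chosen uniformly. Initial node weights are omitted for brevity. The function name `hypergraph_pagerank`, its parameters, and the toy element labels are hypothetical, not taken from the thesis.

```python
def hypergraph_pagerank(hyperedges, edge_weights, damping=0.85, iters=200, tol=1e-12):
    """Score nodes with a PageRank-style random walk on a weighted hypergraph.

    hyperedges   -- list of sets of node labels (here: one set per sentence)
    edge_weights -- list of positive floats, one per hyperedge
    Assumed walk: node -> incident hyperedge (prob. proportional to weight)
                  -> uniformly chosen node of that hyperedge.
    """
    nodes = sorted({v for e in hyperedges for v in e})
    n = len(nodes)

    # Weighted degree: total weight of the hyperedges that contain each node.
    degree = {v: 0.0 for v in nodes}
    for e, w in zip(hyperedges, edge_weights):
        for v in e:
            degree[v] += w

    score = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleport term, spread uniformly over all nodes.
        new = {v: (1.0 - damping) / n for v in nodes}
        for e, w in zip(hyperedges, edge_weights):
            if not e:
                continue  # an empty hyperedge carries no probability mass
            # Probability mass that enters hyperedge e from its member nodes.
            mass = sum(score[u] * w / degree[u] for u in e)
            share = damping * mass / len(e)
            for v in e:
                new[v] += share
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


# Hypothetical toy input: three sentences, each reduced to its event elements.
sentences = [
    {"object", "time", "trigger"},
    {"object", "place"},
    {"object", "trigger"},
]
scores = hypergraph_pagerank(sentences, [1.0, 1.0, 1.0])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy input the element that occurs in every sentence ranks first, which matches the intuition that elements shared across many sentences are the likeliest topic elements.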
2. A subtopic-association-based method for Chinese-Vietnamese bilingual topic evolution analysis is proposed. First, the k-means algorithm produces an initial subtopic set, which then serves as the sample instances for a kNN-based single-pass clustering method that yields the subtopic set in each time slice. Next, the similarity between subtopics in different time windows is computed with a formula that combines cosine similarity and KL distance. Finally, the topic evolution analysis steps proposed in this thesis determine the relationships between subtopics in different time slices. Compared with computing similarity from KL distance alone or from the cosine formula alone, the proposed method is more effective.
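The abstract does not state the mixed formula itself, so the sketch below only illustrates one plausible way to combine the two measures. It assumes each subtopic is a non-negative term-weight vector over a shared vocabulary, turns a symmetrized KL divergence into a similarity with `exp(-d)`, and interpolates linearly with cosine similarity. The weight `lam`, the smoothing constant `eps`, and all function names are hypothetical choices rather than the thesis's definitions.

```python
import math


def cosine(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def symmetric_kl(p, q, eps=1e-9):
    """Symmetrized KL divergence after additive smoothing and normalization."""
    ps = [a + eps for a in p]
    qs = [b + eps for b in q]
    sp, sq = sum(ps), sum(qs)
    ps = [a / sp for a in ps]
    qs = [b / sq for b in qs]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(ps, qs))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(ps, qs))
    return 0.5 * (kl_pq + kl_qp)


def mixed_similarity(p, q, lam=0.5):
    """Assumed mix: lam * cosine + (1 - lam) * exp(-symmetric KL)."""
    return lam * cosine(p, q) + (1.0 - lam) * math.exp(-symmetric_kl(p, q))


# Hypothetical subtopic vectors from two adjacent time slices.
earlier = [0.50, 0.30, 0.20]
later_close = [0.45, 0.35, 0.20]
later_far = [0.05, 0.15, 0.80]

print(mixed_similarity(earlier, later_close))
print(mixed_similarity(earlier, later_far))
```

Under these assumptions a subtopic in a later time slice would be linked to the earlier subtopic with which it scores highest, provided the score clears a chosen threshold. The threshold is another free parameter that the abstract does not specify.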
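[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
Article ID: 2126415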
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2126415.html