Research on Chinese-English Cross-Language Topic Discovery Based on the ICE-LDA Model
Published: 2018-10-17 17:05
[Abstract]: In recent years the Internet has developed rapidly against the backdrop of globalization, and cross-language web data mining has become a hot topic in public opinion analysis at home and abroad. Detecting hot topics in Chinese and English web environments effectively and in real time is crucial for tracking public opinion and how it develops. Online news, an important component of online public opinion, has become a major source of fast and convenient information thanks to the widespread adoption of the Internet. First, this paper collects Chinese and English online news as its data source and proposes ICE-LDA, a model improved from LDA, for discovering co-occurring topics across the Chinese and English web environments. The topics produced by the model are vectorized and then compared using JS distance and a similarity measure over topic-text distributions. Second, from the mixed Chinese-English news data gathered by the crawler, the paper builds both a comparable parallel corpus and a non-comparable corpus for topic modeling. During modeling, the TF-IDF algorithm is used to extract feature words from the documents, which removes meaningless noise words and improves the topic feature representation. Finally, two different topic vectorization methods are used to model cross-language co-occurring topic discovery. Experimental results show that, on the real dataset built with the crawler designed in this paper, the improved topic model can discover cross-language co-occurring topics in the comparable corpus without requiring prior topic pairs, and can also discover co-occurring topics when the corpora are unbalanced.
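The two generic building blocks named in the abstract, TF-IDF feature-word filtering and JS distance between topic vectors, can be sketched as follows. This is a minimal illustration and not the ICE-LDA implementation. The function names `tfidf_keep_top` and `js_distance`, the `top_n` cutoff, and the choice of the base-2 square-root form of JS distance are all assumptions. It also assumes the two topic vectors already live in a shared vocabulary space, which the abstract does not spell out.

```python
import math
from collections import Counter


def tfidf_keep_top(docs, top_n):
    """Keep only the top_n highest-scoring TF-IDF terms of each tokenized document.

    Hypothetical helper: the abstract only says TF-IDF is used to drop noise words.
    """
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    filtered = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        keep = set(sorted(scores, key=scores.get, reverse=True)[:top_n])
        filtered.append([term for term in doc if term in keep])
    return filtered


def js_distance(p, q):
    """Jensen-Shannon distance between two probability vectors of equal length.

    Assumption: distance is the square root of the base-2 JS divergence,
    so the result lies in [0, 1].
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))


if __name__ == "__main__":
    docs = [["the", "market", "rises"], ["the", "team", "wins"]]
    print(tfidf_keep_top(docs, top_n=2))

    chinese_topic = [0.7, 0.2, 0.1]  # made-up topic vectors, for illustration only
    english_topic = [0.6, 0.3, 0.1]
    print(js_distance(chinese_topic, english_topic))
```

A smaller JS distance means the two topic distributions are more alike, so a pair of Chinese and English topics whose distance falls below a chosen threshold could be treated as candidates for a co-occurring topic.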
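[Affiliation]: Institute of Cyberspace Security, Sichuan University; College of Computer Science, Sichuan University
[Funding]: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Young Teacher Start-up Fund (2015SCU11079)
[Classification No.]: TP391.1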
Article ID: 2277362
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2277362.html