
Research on English-Chinese Cross-Language Topic Detection and Tracking Technology

Published: 2018-04-22 02:20

  Topics: cross-language topic detection + cross-language topic tracking; Source: Minzu University of China (中央民族大学), doctoral dissertation, 2013


[Abstract]: The world has gradually entered an age of information and digitization. According to the 30th CNNIC survey report, by the end of June 2012 China had 538 million Internet users and 2.5 million websites; online news had 392 million users, a usage rate of 73.0% among Internet users. Because online news is simple and fast to publish, the Internet has become the "fourth medium" of news communication. Ordinary readers want to find the news topics that interest them in this mass of online resources, and they also want to follow news topics from other countries. Cross-language detection and tracking of online news topics has therefore gradually become a focus of research interest in China and abroad.
Current research on cross-language topic detection and tracking faces several challenging problems. First, the means available for describing the text of online news reports are limited, and describing the topics of reports in a multilingual environment is harder still. Second, cross-language topic detection and tracking must process news reports in multiple languages, so bridging the language gap is one of the first technical problems to solve. Third, how to develop existing techniques further and apply them to topic detection and tracking deserves more investigation. In response to these problems, this dissertation studies English-Chinese cross-language topic detection and tracking, in the hope of making a modest contribution to language processing technology and of offering a reference for processing texts in China's many ethnic languages.
The research covers five parts: text analysis of cross-language news reports, methods for building cross-language topic models, methods for building the corpus, cross-language topic detection, and cross-language topic tracking.
First, the author starts from the essential factors of news reports and analyzes their core elements from two perspectives: how news is cognitively understood, and the characteristics of news itself. The analysis suggests that lexical processing is one effective way to describe a text, and that news elements can also serve as a means of distinguishing one report from another.
Second, starting from the relationship among report, topic, and event, the dissertation sets out the basic approach to modeling news reports in cross-language topic detection and tracking (CLTDT) research. It analyzes the characteristics and shortcomings of commonly used text representation models and argues that early models do not describe the report-topic-event relationship in sufficient depth. To reveal the topics latent in news text, the LSI and LDA models are selected for text modeling experiments, and the experiments compare and analyze how well each model describes news report text.
Building on this theoretical analysis and experimental verification, we propose carrying out cross-language topic detection and tracking on the basis of an English-Chinese comparable corpus. Through corpus collection, metadata processing, news event classification, word segmentation and annotation, and named entity annotation, the dissertation attempts to build an "English-Chinese comparable corpus of cross-language news reports". The English and Chinese news report texts in this corpus form the basis for detecting and tracking news topics in a cross-language setting.
Drawing on current cross-language processing techniques and research on the LDA model, and in line with the aims of this study, the author proposes a cross-language joint LDA (CLU-LDA) model. The model can perform retrospective event detection on English and Chinese news reports and can also discover new events. In cross-language topic tracking, a prior topic model is used to infer the topics of sample news reports. With this prior knowledge and the comparable corpus, we can both trace how the topics of a news event develop over a time series and track specific news reports effectively.
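The tracking idea in the last paragraph, inferring a report's topics from a prior topic model and then comparing reports across languages, can be illustrated with a small sketch. This is not the dissertation's CLU-LDA model or its inference procedure. The toy topic-word table `TOPICS`, the function names, the smoothing constant, and the similarity threshold are all hypothetical, and the inference shown is a simple fold-in EM that holds the topic-word probabilities fixed. The only idea taken from the abstract is that English and Chinese reports are mapped into one shared topic space before being compared.

```python
import math

# Hypothetical prior topic model over a joint English/Chinese vocabulary.
# In practice these probabilities would be learned from a comparable corpus.
TOPICS = {
    "earthquake": {"earthquake": 0.3, "rescue": 0.2, "地震": 0.3, "救援": 0.2},
    "election": {"election": 0.3, "vote": 0.2, "选举": 0.3, "投票": 0.2},
}
SMOOTH = 1e-6  # probability assigned to a word the topic has never seen


def infer_topic_mixture(tokens, topics=TOPICS, alpha=0.1, iters=50):
    """Estimate a report's topic mixture by fold-in EM with fixed topic-word probabilities."""
    names = list(topics)
    theta = {k: 1.0 / len(names) for k in names}
    for _ in range(iters):
        counts = {k: alpha for k in names}  # alpha acts as a small pseudo-count
        for word in tokens:
            # E-step: share of this token attributed to each topic
            weights = {k: theta[k] * topics[k].get(word, SMOOTH) for k in names}
            total = sum(weights.values())
            for k in names:
                counts[k] += weights[k] / total
        # M-step: renormalise the expected counts into a distribution
        norm = sum(counts.values())
        theta = {k: counts[k] / norm for k in names}
    return theta


def cosine(p, q):
    """Cosine similarity between two topic mixtures with the same keys."""
    dot = sum(p[k] * q[k] for k in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


def is_on_topic(story_tokens, seed_tokens, threshold=0.8):
    """Treat a report as on-topic if its mixture is close to the seed report's mixture."""
    story = infer_topic_mixture(story_tokens)
    seed = infer_topic_mixture(seed_tokens)
    return cosine(story, seed) >= threshold
```

With an English seed report tokenized as `["earthquake", "rescue", "rescue"]`, the Chinese report `["地震", "救援"]` is judged on-topic and `["选举", "投票"]` is not. Applying the same check to reports ordered by publication time would give a rough picture of how a topic develops.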

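[Degree-granting institution]: Minzu University of China (中央民族大学)
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: H15; H315; H087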

