Research on Topic-Feature Relation Extraction from Domain UGC Text and Its Applications
Topics: domain text + UGC; Source: University of Electronic Science and Technology of China, 2016 doctoral dissertation
[Abstract]: In the Web 2.0 era, social media has made users both consumers and publishers of information. New data is generated on the web at every moment, network data resources have accumulated on a massive scale, and we have entered the era of big data. Big data is a double-edged sword: it holds enormous value, but its sheer volume and varied data structures pose great challenges for information processing. Text is one of the oldest forms of information storage, and user-generated content (UGC) text accounts for a large share of network data resources. Massive UGC text contains rich information, especially domain-specific information. In recent years, text mining has been applied as a powerful tool in natural language processing research to extract useful information from documents. However, because its authors vary widely in skill, UGC text tends to be casual in expression and non-standard in writing, which makes information extraction from massive UGC text very challenging. In addition, traditional information extraction methods produce numerous and complicated information relations, which hinders users' understanding. In an age of information explosion, information mined from text must meet user needs and be easy to understand and remember. It is therefore essential to extract information from UGC text by topic and, based on the relationships among multiple topics, to build an information extraction and management system driven by user needs. Motivated by these considerations, this dissertation studies information extraction from massive UGC text and related applications. The research content and main conclusions are as follows:
(1) Compound new word discovery based on dependency relations between word units. The quality of word segmentation determines the quality of the final text mining results. Because traditional word segmentation software does not handle compound new words in UGC text well, this dissertation proposes FPSMC, a new statistics-based method for compound new word discovery that requires neither a dictionary nor prior corpus training. The method first mines candidate compound new words using frequent sequential patterns, then filters the candidates by computing their sequence maximum confidence (Max-confidence), and iterates repeatedly to obtain the compound new words present in the text. Experimental results show that FPSMC extracts compound new words well on UGC text datasets. Compared with other compound new word extraction algorithms, FPSMC is better at discovering named entities among compound new words, such as person names, place names, organization names, proper nouns, and time expressions. Generally speaking, most named entities are topic words in UGC text. FPSMC's good performance on compound new word extraction therefore helps reveal the behavioral preferences users express in UGC text datasets, and lays a sound foundation for subsequent topic identification, feature extraction, and business application analysis.
(2) Topic boundary delimitation and feature word extraction for in-domain text. Topics are important information elements implicit in UGC text, and organizing UGC text by topic lets users obtain the information in it more conveniently and comprehensively. Topics extracted by traditional techniques are often disturbed by common hot words, and many of the topic-related features they mine are coarse-grained, generalized features. This dissertation therefore proposes a new association analysis method for document data that identifies "hot topic words and topic boundaries" in massive UGC, then segments the UGC text along the hot topic boundaries and finds the "local feature words" associated with each hot topic word. Experiments show that the proposed TVS algorithm effectively screens out interference from high-frequency words and captures a domain's hot topic words and their local features from large-scale web text data. Adaptability and scalability experiments further show that the algorithm applies to different types of text datasets, and that it can be implemented with parallel computing while also maintaining good mining performance on a single computer.
(3) Application of multi-topic relations and feature extraction in UGC text. Traditional topic discovery and extraction methods have difficulty identifying and clarifying the relationships between topics in UGC text. These relationships also carry information, and they can effectively help information users understand and grasp it. Using travel blog text data and the corresponding multi-topic relation and feature extraction methods, this dissertation mines popular tourist attraction topics, the local features of those topics, and the relationships among attraction topics, and on this basis builds a tourism information extraction and management system oriented to travelers' needs. Starting from the three needs travelers face, namely "where to go", "what to do there", and "how to go about it", the system comprises four modules: travel blog text preprocessing, extraction of popular tourist attractions and their TOI, regionalization of popular tourist attractions, and travel route discovery and recommendation, which together address those three needs. Each module was demonstrated with example experiments on a Beijing travel blog dataset, and the results are presented using visualization techniques. The experiments show that the system effectively extracts the tourism information travelers need from large-scale travel blog text data and helps travelers plan their trips.
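The two-stage procedure described for FPSMC in item (1) of the abstract (mine frequent candidates, filter by maximum confidence, iterate) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the abstract does not give the algorithm's details, so the sketch assumes that candidates are adjacent pairs of word units, that Max-confidence is the standard measure max(P(b|a), P(a|b)) estimated from counts, and that accepted pairs are merged greedily from left to right. The function names, thresholds, and toy data are all hypothetical.

```python
from collections import Counter


def merge_pairs(sentence, accepted, created):
    """Greedily merge adjacent units whose pair is in `accepted`, left to right.

    Every merged unit is also recorded in the `created` set.
    """
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in accepted:
            unit = sentence[i] + sentence[i + 1]
            merged.append(unit)
            created.add(unit)
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


def discover_compounds(sentences, min_support=2, min_max_conf=0.8, max_rounds=5):
    """Iteratively merge frequent adjacent pairs with high max-confidence.

    Assumed simplification: only adjacent pairs are considered in each round;
    longer compounds emerge because merged units take part in later rounds.
    """
    sentences = [list(s) for s in sentences]
    compounds = set()
    for _ in range(max_rounds):
        unit_counts = Counter(u for s in sentences for u in s)
        pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
        accepted = set()
        for (a, b), n in pair_counts.items():
            if n < min_support:  # frequent-pattern filter (assumed threshold)
                continue
            # Max-confidence: the larger of the two conditional frequencies.
            max_conf = max(n / unit_counts[a], n / unit_counts[b])
            if max_conf >= min_max_conf:
                accepted.add((a, b))
        if not accepted:  # nothing new to merge, so stop iterating
            break
        sentences = [merge_pairs(s, accepted, compounds) for s in sentences]
    return compounds


# Toy, over-segmented input (hypothetical data, not from the dissertation).
sentences = [
    ["我", "去", "颐和", "园"],
    ["颐和", "园", "很", "美"],
    ["去", "颐和", "园", "玩"],
    ["去", "长城", "很", "冷"],
]
print(discover_compounds(sentences))  # prints {'颐和园'}
```

In this toy input, the pair ("颐和", "园") occurs three times and each unit never appears without the other, so its max-confidence is 1.0 and it is merged. The pair ("去", "颐和") is frequent but reaches only 2/3, so it is rejected. Both thresholds would need tuning on real data.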
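[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.1; F592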
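Related doctoral dissertations (top 1)
1 Xu Hualin (徐华林); Research on Topic-Feature Relation Extraction from Domain UGC Text and Its Applications [D]; University of Electronic Science and Technology of China; 2016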
Article ID: 1860744
Article link: https://www.wllwen.com/shoufeilunwen/jjglbs/1860744.html