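Design and Implementation of Product Name Entity Recognition and Normalization Algorithms for Chinese Weibo
Published: 2018-01-21 17:55
Keywords: Weibo; product name entity recognition; cascaded conditional random fields; word vectors; entity normalization. Source: Beijing Institute of Technology, master's thesis, 2015. Document type: degree thesis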
[Abstract]: With the development of the Internet, social network platforms such as Weibo have risen steadily. Users are no longer merely readers of information but also its publishers, and the Internet has shifted from an information-publishing platform to a platform for interactive communication. The massive volume of posts on Weibo platforms such as Sina and Tencent carries enormous commercial value. As one of the fastest-spreading social media with one of the largest user bases, Weibo has become an important source of information. In the Internet era, online marketing, public opinion monitoring, and business intelligence are attracting growing attention from enterprises, and accurately recognizing product name entities in massive Weibo data is the foundation and prerequisite for network public opinion monitoring and business intelligence. Current approaches to recognizing product name entities in Weibo posts still rely on methods designed for traditional media text. They overlook the scarce context, frequent ellipsis, and nonstandard expression that characterize Weibo posts, which leads to poor recognition performance and serious entity ambiguity. To address these problems, this thesis studies product name entity recognition for Weibo text. The main work and contributions are as follows:
1) A product name entity recognition method based on a cascaded conditional random field model and a product knowledge base is proposed. By introducing a product entity knowledge base with attribute classification, the method improves recognition performance; experiments show that precision and recall on entities with complex structure improve by 0.6% and 3.2%, respectively.
2) A feature selection method based on a word vector model that incorporates global contextual semantic information is proposed. To compensate for the lack of contextual semantic information in Weibo text, the method selects features in two ways, word vectors and word clustering; word clustering reduces the demands placed on the training corpus. Experiments show that the word vector and word clustering methods raise the overall F1 score of product name entity recognition by 3.12% and 3.34%, respectively.
3) A product name entity normalization method based on global and local context information and on user interaction relationships is proposed. Experiments show that it improves the F1 score by 6.92% over a knowledge-base-based method.
4) A prototype system for product name entity recognition and normalization on Weibo text is designed and implemented. The system takes into account the precision and recall of recognition and normalization as well as time and space efficiency, and supports two processing modes: post-by-post processing and batch processing.
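The normalization idea in contribution 3 can be illustrated with a minimal sketch. It is not the thesis's algorithm: the toy knowledge base, the `normalize` function, the overlap-based score, and the `local_weight` parameter are all hypothetical, and "global context" is assumed here to mean tokens gathered from outside the post itself (for example, the user's other posts or posts by users they interact with). Tokens are assumed to be already segmented.

```python
# Hypothetical toy knowledge base: canonical product name -> aliases and
# context words that tend to co-occur with that product.
KNOWLEDGE_BASE = {
    "Apple iPad Air": {
        "aliases": {"ipad air", "air"},
        "context": {"平板", "ipad", "苹果"},
    },
    "Apple MacBook Air": {
        "aliases": {"macbook air", "air"},
        "context": {"笔记本", "mac", "苹果"},
    },
}


def candidates(mention):
    """Return every canonical name whose alias set contains the mention."""
    key = mention.strip().lower()
    return [name for name, entry in KNOWLEDGE_BASE.items()
            if key in entry["aliases"]]


def normalize(mention, local_tokens, global_tokens=(), local_weight=0.7):
    """Pick the candidate whose context words best overlap the post.

    local_tokens:  segmented tokens of the post containing the mention.
    global_tokens: tokens from outside the post (an assumption of this sketch).
    Returns None when the knowledge base has no candidate for the mention.
    """
    cands = candidates(mention)
    if not cands:
        return None
    local = {t.lower() for t in local_tokens}
    global_ = {t.lower() for t in global_tokens}

    def score(name):
        ctx = KNOWLEDGE_BASE[name]["context"]
        # Weighted count of context words seen locally and globally.
        return (local_weight * len(ctx & local)
                + (1 - local_weight) * len(ctx & global_))

    # On a tie, max() keeps the first candidate in knowledge-base order.
    return max(cands, key=score)


if __name__ == "__main__":
    print(normalize("Air", ["新", "平板", "很", "轻"]))
    print(normalize("air", ["很", "轻"], global_tokens=["笔记本", "mac"]))
```

The second call shows why context beyond the post matters for short texts: the post itself gives no disambiguating word, so only the assumed global tokens separate the two candidates.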
[Degree-granting institution]: Beijing Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.1
Article ID: 1452166
Article link: https://www.wllwen.com/guanlilunwen/yingxiaoguanlilunwen/1452166.html