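Research on Product Feature Extraction and Sentiment Analysis Based on Chinese Online Reviews
Published: 2018-04-26 03:31
Topics: review mining + feature extraction; Source: Southeast University, master's thesis, 2016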
[Abstract]: With the spread of Internet applications and the rapid growth of e-commerce, online shopping has become a common and important mode of consumption. Online reviews are an important data asset of e-commerce websites: they are the body of text in which users, after buying a product online, record their subjective or objective attitudes and opinions about it, and they hold great potential value for both online shoppers and merchants. Reading and understanding such a volume of reviews manually is clearly infeasible. Review mining techniques offer an effective way to address this problem and have become a research focus in China and abroad. Review mining consists mainly of two parts, feature extraction and sentiment analysis. Focusing on the mining of Chinese online reviews, this thesis makes the following contributions:
1) Construction of a corpus of Chinese online reviews for electronic products. A customized crawler automatically fetches and parses the HTML of electronic-product reviews from JD.com (Jingdong) and Taobao. The raw reviews are then filtered and cleaned using the initial review filtering criteria proposed in this thesis and segmented with the Chinese Academy of Sciences word segmentation tool. After stop words are removed, word frequencies are counted and stored in a database, and the preprocessed data is finally stored in an ES cluster.
2) An efficient two-pass pruning algorithm for feature extraction from Chinese online reviews. To address the insufficient precision and recall of traditional sequential pattern mining, the traditional GSP algorithm is combined with a statistics-based word-pair co-occurrence measure to extract and prune features. The resulting feature set lays the foundation for the subsequent sentiment analysis.
3) Construction of Chinese syntactic patterns. A syntactic parser is used to parse the reviews, and the frequency of each dependency relation in the corpus is counted. Based on a study of the dependency patterns and the characteristics of online reviews, seven dependency patterns are constructed, and an extraction algorithm based on semantic distance and punctuation is proposed to extract tuples consisting of a feature and an opinion.
Finally, a classification model based on 11 features is built, and SVM, logistic regression, and Bayesian classifiers are compared with a baseline model in several experiments. Feature selection and ranking yield the five features most relevant to the classification result. The experimental results demonstrate the effectiveness and ease of use of the proposed method.
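The feature-pruning step described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual formulas, thresholds, or GSP configuration, so everything below is an assumption made for illustration: the first pass uses a plain per-review support count on single tokens as a stand-in for GSP sequential pattern mining, and the second pass uses a simple co-occurrence ratio with a seed opinion lexicon as a stand-in for the word-pair co-occurrence measure. The function names, the thresholds `min_support` and `min_cooccurrence`, and the toy data are all hypothetical.

```python
from collections import Counter


def frequent_candidates(reviews, min_support):
    """First pass: keep tokens that occur in at least `min_support` reviews.

    Simplified stand-in for GSP: single tokens only, no sequences.
    """
    doc_freq = Counter()
    for tokens in reviews:
        doc_freq.update(set(tokens))  # count each token once per review
    return {word for word, count in doc_freq.items() if count >= min_support}


def cooccurrence_score(reviews, candidate, opinion_words):
    """Share of the reviews containing `candidate` that also contain an opinion word."""
    containing = [set(tokens) for tokens in reviews if candidate in tokens]
    if not containing:
        return 0.0
    hits = sum(1 for tokens in containing if tokens & opinion_words)
    return hits / len(containing)


def prune_features(reviews, opinion_words, min_support=2, min_cooccurrence=0.5):
    """Second pass: drop candidates that rarely co-occur with opinion words."""
    candidates = frequent_candidates(reviews, min_support) - opinion_words
    return sorted(
        word
        for word in candidates
        if cooccurrence_score(reviews, word, opinion_words) >= min_cooccurrence
    )


# Toy, already-segmented reviews (hypothetical data, not from the thesis corpus).
reviews = [
    ["屏幕", "很", "清晰"],    # "screen very clear"
    ["屏幕", "不错"],          # "screen not-bad"
    ["电池", "耐用"],          # "battery durable"
    ["电池", "不错"],          # "battery not-bad"
    ["快递", "昨天", "到"],    # "delivery yesterday arrived"
    ["快递", "今天", "到"],    # "delivery today arrived"
]
opinion_words = {"清晰", "不错", "耐用"}  # hypothetical seed opinion lexicon

features = prune_features(reviews, opinion_words)
print(features)
```

On this toy data, "屏幕" (screen) and "电池" (battery) survive both passes, while "快递" (delivery) and "到" (arrived) are frequent enough to pass the support threshold but are pruned because they never appear alongside an opinion word. A real pipeline would presumably mine multi-word sequences and tune both thresholds on annotated data.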
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1
Article No.: 1804319
Article link: https://www.wllwen.com/jingjilunwen/dianzishangwulunwen/1804319.html