Research on Product Spam Review Detection Based on Credibility Propagation

Published: 2018-04-23 00:04

Topics: text mining + spam review detection; Source: Chongqing University, master's thesis, 2016

[Abstract]: With the development of Internet technology, the network has greatly changed how people express themselves and interact with others. Online reviews play a crucial role in today's e-commerce: consumers often read reviews of a product or store online before making a purchase decision. However, because the network contains a large number of spam reviews, consumers can be misled into buying low-quality goods, which seriously harms the shopping experience, and merchants can suffer reputational damage from malicious reviews. The automated detection of spam reviews has therefore become a research hotspot in recent years. This thesis systematically summarizes the state of the art in spam review detection and analyzes the relevant algorithms and techniques.

Traditional performance evaluation relies on manual annotation, which is labor-intensive and hard to automate. To address this, the thesis proposes two spam-discrimination indicators for measuring the performance of a detection algorithm. The main idea is to compare the data samples before and after detection in terms of recommender-system accuracy and the positive-feedback rate of reviews. This offers a new perspective on the effect of spam review detection and can complement the traditional evaluation approach.

The thesis uses a credibility score to measure how trustworthy reviews, reviewers, and products are. By analyzing the key factors that affect review credibility, it extracts three features (review text length, attribute coverage, and time distribution) to compute an initial credibility score for each review. In addition, to extract the attribute dictionary, it combines word-frequency statistics with a topic-word model and builds the extraction model with the third-party tool word2vec. Experiments show that this approach yields a richer and more accurate attribute dictionary.

Inspired by review graphs and Web fact finding, the thesis observes that the credibility of reviews, reviewers, and products influence one another, whereas most earlier work treated each of the three as an isolated object of study and ignored the relationships among them. It therefore proposes a spam review detection algorithm based on credibility propagation. The algorithm abstracts reviews, reviewers, and products into a graph model; starting from the initial review credibility scores and following the network formed by the three, it builds a model that computes credibility scores for reviews, reviewers, and products, revises the review scores, and excludes reviews whose score falls below a credibility threshold. Experiments show that the algorithm achieves some improvement in both precision and recall.
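The abstract describes the credibility-propagation step only in outline, so the following is a minimal sketch of the general idea rather than the thesis's actual model. Everything specific here is an assumption made for illustration: the `Review` record, the equal feature weights, the averaging update, the mixing factor `alpha`, the iteration count, and the threshold value do not come from the source. Each review is assumed to arrive with its three features (length, attribute coverage, time distribution) already normalized to [0, 1].

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """Hypothetical record; the three features are assumed pre-normalized to [0, 1]."""
    review_id: str
    reviewer_id: str
    product_id: str
    length_score: float
    coverage_score: float
    time_score: float


def initial_score(review, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Initial credibility as a weighted sum of the three features (weights are assumed)."""
    w_len, w_cov, w_time = weights
    return (w_len * review.length_score
            + w_cov * review.coverage_score
            + w_time * review.time_score)


def propagate(reviews, alpha=0.5, iterations=20):
    """Iteratively revise review scores using reviewer and product credibility.

    Assumed update rule: a reviewer's (or product's) credibility is the mean
    score of its reviews, and each review's revised score mixes its initial
    score with the mean of its reviewer's and product's credibility.
    """
    init = {r.review_id: initial_score(r) for r in reviews}
    score = dict(init)
    for _ in range(iterations):
        by_reviewer = defaultdict(list)
        by_product = defaultdict(list)
        for r in reviews:
            by_reviewer[r.reviewer_id].append(score[r.review_id])
            by_product[r.product_id].append(score[r.review_id])
        reviewer_cred = {k: sum(v) / len(v) for k, v in by_reviewer.items()}
        product_cred = {k: sum(v) / len(v) for k, v in by_product.items()}
        score = {
            r.review_id: alpha * init[r.review_id]
            + (1 - alpha) * 0.5 * (reviewer_cred[r.reviewer_id]
                                   + product_cred[r.product_id])
            for r in reviews
        }
    return score


def filter_credible(reviews, threshold=0.5, **kwargs):
    """Keep only reviews whose revised score reaches the (assumed) threshold."""
    score = propagate(reviews, **kwargs)
    return [r for r in reviews if score[r.review_id] >= threshold]
```

Because every update is a convex combination of values in [0, 1], the scores stay in that range. A reviewer whose reviews all start with low scores pulls each of them down further, which is the mutual-influence effect the abstract points to; the real thesis may weight or normalize these relationships quite differently.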
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1

[Similar literature]