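Research on Topic and Feature Extraction from Online Reviews Based on Word Vectors
Published: 2018-04-29 06:18
Topics: online reviews + feature extraction; Source: University of Electronic Science and Technology of China, master's thesis, 2016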
[Abstract]: Information technology and the Internet have transformed how people exchange information and have given rise to a new mode of transaction, e-commerce. As e-commerce has matured, people have become increasingly keen to buy goods and services online, and many researchers have shifted their study of consumer behavior from offline to online settings. E-commerce has been a popular research area in recent years. The convenient, fast interaction brought by Web 2.0 lets users easily leave behavioral traces and express their views and opinions online, and the rapid growth of the online shopping population has left e-commerce websites with large volumes of shopping data, including large amounts of unstructured review text. For consumers, these reviews support more effective purchasing decisions; for manufacturers, they reflect market feedback on products and company services. Compared with conventional survey methods such as questionnaires and consultations, online product review data are larger in volume and more direct. Online reviews on e-commerce websites are written spontaneously and casually by consumers, so they tend to be loosely structured and short, and this sparsity makes them hard to study. Moreover, e-commerce websites carry thousands upon thousands of products, and the volume of their reviews exceeds what humans can read and judge. In short, the problems of big data and sparsity make such research difficult.
Earlier studies of online product reviews mostly worked at the document level, considering features such as sentence structure, grammatical characteristics, and word frequency, or used probabilistic models to study topic features at the latent semantic level. Although these studies achieved some results, they overlooked the semantic information of the sentence as a whole when processing text. With today's greater computing power, neural network language models account for text generation and the expression of meaning at the semantic level. This thesis uses a neural network to map online review text from the traditional document space into a high-dimensional semantic vector space and clusters the mined review-feature seed words, achieving better results in topic and feature extraction for online reviews. In addition, to address the lack of ground truth for large volumes of data, the thesis uses an improved perplexity metric, based on the principle of maximum entropy, to demonstrate the reliability of the proposed method. The improved perplexity metric can also be extended into a unified evaluation metric for clustering problems in big-data settings, which is a contribution to research on big data, and it offers a reasonable way to compare algorithms when ground truth is unavailable.
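As a rough illustration of the clustering step mentioned in the abstract, the sketch below groups a handful of seed words by their vectors using plain k-means. The word list, the 2-D vectors, the choice of k-means, the initial centroids, and the function name `kmeans` are all assumptions made for illustration; the abstract does not state which embedding model or clustering algorithm the thesis uses.

```python
import math

def kmeans(vectors, centroids, iterations=10):
    """Plain k-means: returns a cluster index for each input vector."""
    assignment = []
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of the vectors assigned to it.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

# Hypothetical 2-D "word vectors"; real embeddings have far more dimensions.
seed_words = {
    "screen": [0.9, 0.1],
    "display": [0.8, 0.2],
    "battery": [0.1, 0.9],
    "charging": [0.2, 0.8],
}

words = list(seed_words)
vectors = [seed_words[w] for w in words]

# Use two of the vectors as initial centroids (an arbitrary choice).
labels = kmeans(vectors, [list(vectors[0]), list(vectors[2])])

# Collect the words of each cluster; each cluster stands for one candidate feature.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
```

In this toy setup, words whose vectors lie close together end up in the same cluster, which is the intuition behind treating each cluster of seed words as one product feature.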
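[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1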