
A Distributed Multi-Source E-Commerce Data Fusion and Analysis System

Published: 2018-07-15 09:56
[Abstract]: With the spread of the Internet and mobile smart devices and the rapid development of the logistics industry, e-commerce has become an increasingly important part of daily life and of the national economy. As the vehicle for online shopping, e-commerce platforms hold large amounts of valuable data. This data can be used to reconstruct the environment in which users shop online and to analyze how that environment influences user behavior, to analyze the behavioral patterns of product markets and offer recommendations to merchants, and to analyze the state of the national economy, so it has high research value. E-commerce data analysis and mining is the process of analyzing and mining e-commerce data to obtain valuable information. It is a branch of data mining, but it also has its own particularities. Several problems must be solved in this process. Data collection and preprocessing are difficult. Multiple data sources lack direct links to one another, data from a single e-commerce source has low credibility and completeness, and fusion analysis across data sources is lacking. Single-machine data mining systems cannot meet the processing demands of massive e-commerce data, so a distributed data mining system is needed, yet some commonly used data mining algorithms are inefficient in distributed implementations.

The main work of this thesis falls into three parts.

(1) Targeted, concrete data analysis and mining is carried out according to the characteristics of e-commerce data. Starting from the definition and collection of e-commerce data, the thesis analyzes the types of data that e-commerce websites contain, collects the data required by the analysis, and designs a data storage format. E-commerce data contains a large share of semi-structured and unstructured data, is poorly standardized, and is noisy. The thesis addresses these characteristics at the preprocessing stage and develops solutions that ensure good data quality. It then applies several data mining methods to the data, including association analysis, clustering, linear regression, and artificial neural networks.

(2) A method for fusing e-commerce data from multiple sources is designed and implemented. It fuses data from different e-commerce websites, and the fused data is then used for data mining. The thesis analyzes the structure of product information on e-commerce websites and designs the fusion method around that structure. Through preprocessing and text analysis, it extracts hierarchical features consisting of product name, product attribute name, and product attribute content. It designs an unsupervised learning algorithm that works when the correspondence between product parameters in different data sources is unknown: starting from seed features, the algorithm learns from and matches the data, using multiple product parameters to find matching products and matching parameters step by step. This reduces the computational cost of data fusion. Compared with fusion based on a single parameter, it improves the accuracy of product entity resolution. The criterion for treating two products as the same can be set flexibly, yielding matching results under different criteria. When the fused data is used for prediction, accuracy improves over using data from a single source.

(3) A Hadoop-based distributed e-commerce data mining system is designed, and hierarchical clustering on Hadoop is improved and implemented. The thesis analyzes the characteristics of distributed computing architectures and designs a distributed data analysis and mining system based on Hadoop. Hadoop handles iteration poorly, and hierarchical clustering requires many iterations, so traditional hierarchical clustering runs inefficiently on Hadoop. Based on the principles of hierarchical clustering and the structure of Hadoop, the thesis designs an improved hierarchical clustering algorithm. When inter-cluster distances are monotonically increasing, the algorithm does not change the clustering result, and it can merge multiple clusters in a single clustering pass. This reduces the number of iterations and substantially improves the computational efficiency of hierarchical clustering on Hadoop. The thesis also explores the case where multi-dimensional product features are unavailable: similarity between products is computed indirectly from logs of how users use the products, and hierarchical clustering is then applied to obtain product clusters. Experiments verify the feasibility of this approach.
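The idea in part (3), merging several clusters in one pass without changing the result, can be illustrated with a small sketch. The abstract does not give the thesis's exact merge rule, so the code below is an assumption: it uses the well-known reciprocal-nearest-neighbor rule (merge every pair of clusters that are each other's nearest neighbor) with single linkage, for which merge distances are monotone. It runs on one machine in plain Python rather than on Hadoop, and the function names `single_link` and `rnn_round_clustering` are hypothetical.

```python
from itertools import combinations


def single_link(a, b, dist):
    """Single-linkage distance between two clusters (frozensets of point ids)."""
    return min(dist[frozenset((i, j))] for i in a for j in b)


def rnn_round_clustering(points, dist):
    """Agglomerative clustering that merges all mutual-nearest pairs per round.

    Illustrative assumption, not the thesis's published algorithm.
    Returns (merges sorted by height, number of rounds).
    """
    clusters = [frozenset([p]) for p in points]
    merges = []
    rounds = 0
    while len(clusters) > 1:
        rounds += 1
        # Find each cluster's nearest neighbor and the distance to it.
        nearest = {}
        for c in clusters:
            others = [o for o in clusters if o != c]
            nn = min(others, key=lambda o: single_link(c, o, dist))
            nearest[c] = (nn, single_link(c, nn, dist))
        # Merge every pair of clusters that are each other's nearest neighbor.
        merged = set()
        next_clusters = []
        for c in clusters:
            if c in merged:
                continue
            nn, d = nearest[c]
            if nearest[nn][0] == c:
                merged.update((c, nn))
                merges.append((d, c | nn))
                next_clusters.append(c | nn)
        # Clusters without a mutual partner carry over unchanged.
        next_clusters.extend(c for c in clusters if c not in merged)
        clusters = next_clusters
    merges.sort(key=lambda m: m[0])
    return merges, rounds


# Toy data: five points on a line, identified by their coordinates.
coords = [0, 1, 5, 7, 20]
dist = {frozenset((a, b)): abs(a - b) for a, b in combinations(coords, 2)}
merges, rounds = rnn_round_clustering(coords, dist)
heights = [h for h, _ in merges]
print(heights, rounds)
```

On this toy input the four merges finish in three rounds, whereas a one-merge-per-iteration loop would need four. On a MapReduce platform each round would correspond to roughly one job, which is why fewer rounds matter there. The equivalence with standard greedy clustering relies on the linkage being monotone (single, complete, and average linkage qualify) and, in this sketch, on the pairwise distances being distinct.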
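[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13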



Link to this article: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2123692.html

