A Distributed Multi-Data-Source E-commerce Data Fusion and Analysis System
[Abstract]: With the spread of the Internet, the popularization of mobile smart devices, and the rapid development of the logistics industry, e-commerce has become an important part of people's lives and of the national economy. As the carrier of online shopping, e-commerce platforms hold large amounts of valuable data. From this data one can reconstruct the environment in which users shop online and analyze how that environment influences user behavior. One can also analyze the behavioral patterns of the commodity market, offer recommendations to merchants, and assess the state of the national economy, so the data has high research value. E-commerce data analysis and mining, a form of data mining, is the process of analyzing and mining e-commerce data to obtain valuable information. Several problems remain to be solved in this process: data acquisition and preprocessing; the lack of direct links between multiple data sources; the low credibility and completeness of data from a single e-commerce site, together with the lack of fusion analysis across data sources; the inability of single-machine data mining systems to meet the mass data processing demands of e-commerce, which calls for a distributed data mining system; and the low efficiency of some common data mining algorithms in distributed implementations. The main work of this thesis falls into three parts. (1) Data analysis and mining tailored to the particular characteristics of e-commerce data. Starting from the definition of e-commerce data and its acquisition, the thesis analyzes the data types found on e-commerce sites, collects the data required by the analysis, and designs the data storage format.
The collected data contain a large share of semi-structured and unstructured content, are not standardized, and are noisy. Preprocessing solutions tailored to these characteristics are proposed to ensure good data quality. Data mining methods such as association analysis, clustering, linear regression, and artificial neural networks are then used to analyze and mine the e-commerce data. (2) A data fusion method for multiple data sources is designed and implemented: data from different e-commerce websites are fused, and the fused data are used in data mining. The thesis analyzes the structural features of commodity information on e-commerce sites, designs a multi-site data fusion method according to those features, and extracts commodity names, commodity attribute names, and commodity attribute contents through preprocessing and text analysis. An unsupervised learning algorithm is designed for the case where the correspondence between commodity parameters in different data sources is unknown. It learns from the characteristics of seeds and uses several commodity parameters to progressively find matching commodities and matching commodity parameters, which reduces the computation required for fusion. Compared with fusion on a single parameter, it improves the accuracy with which commodity entities are unified, and the criterion for treating two commodities as the same can be set flexibly, giving matching results under different criteria. The fused data are then used for prediction, and prediction accuracy improves over using data from a single source. (3) A Hadoop-based distributed data mining system is designed, and hierarchical clustering is improved and implemented under Hadoop.
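The seed-driven matching described in point (2) can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes that seeds are products whose normalized names are identical on both sites, that an attribute correspondence is accepted once enough seeds agree on it, and that two products count as the same when a configurable fraction of mapped attributes agree. All function names, field names, and sample records are hypothetical, and the computation-reduction step mentioned in the abstract is left out.

```python
from collections import Counter


def normalize(text):
    """Lowercase and drop whitespace so '4 GB' and '4gb' compare equal."""
    return "".join(str(text).lower().split())


def find_seeds(source_a, source_b):
    """Seed pairs: products with identical normalized names (an assumption)."""
    by_name = {normalize(p["name"]): p for p in source_b}
    return [(p, by_name[normalize(p["name"])])
            for p in source_a if normalize(p["name"]) in by_name]


def learn_attribute_mapping(seeds, min_support=2):
    """Vote for attribute-name pairs whose values agree across seed pairs."""
    votes = Counter()
    for a, b in seeds:
        for key_a, val_a in a["attrs"].items():
            for key_b, val_b in b["attrs"].items():
                if normalize(val_a) == normalize(val_b):
                    votes[(key_a, key_b)] += 1
    mapping = {}
    for (key_a, key_b), count in votes.most_common():
        # Keep the mapping one-to-one and require enough supporting seeds.
        if (count >= min_support and key_a not in mapping
                and key_b not in mapping.values()):
            mapping[key_a] = key_b
    return mapping


def match_products(source_a, source_b, mapping, threshold=1.0):
    """Pair products whose share of agreeing mapped attributes >= threshold."""
    matches = []
    for a in source_a:
        for b in source_b:
            shared = [(ka, kb) for ka, kb in mapping.items()
                      if ka in a["attrs"] and kb in b["attrs"]]
            if not shared:
                continue
            agree = sum(normalize(a["attrs"][ka]) == normalize(b["attrs"][kb])
                        for ka, kb in shared)
            if agree / len(shared) >= threshold:
                matches.append((a["name"], b["name"]))
    return matches


# Hypothetical records from two sites that name their attributes differently.
SOURCE_A = [
    {"name": "Phone X1", "attrs": {"brand": "Acme", "memory": "4GB", "color": "black"}},
    {"name": "Phone X2", "attrs": {"brand": "Acme", "memory": "8GB", "color": "white"}},
    {"name": "Acme X3 smartphone", "attrs": {"brand": "Acme", "memory": "6GB", "color": "blue"}},
]
SOURCE_B = [
    {"name": "phone x1", "attrs": {"maker": "ACME", "ram": "4 GB", "colour": "Black"}},
    {"name": "PHONE X2", "attrs": {"maker": "ACME", "ram": "8 GB", "colour": "White"}},
    {"name": "X3 by Acme (blue)", "attrs": {"maker": "ACME", "ram": "6 GB", "colour": "Blue"}},
    {"name": "X3 by Acme (red)", "attrs": {"maker": "ACME", "ram": "6 GB", "colour": "Red"}},
]

if __name__ == "__main__":
    mapping = learn_attribute_mapping(find_seeds(SOURCE_A, SOURCE_B))
    print(mapping)
    print(match_products(SOURCE_A, SOURCE_B, mapping, threshold=1.0))
    print(match_products(SOURCE_A, SOURCE_B, mapping, threshold=0.6))
```

In this toy data the two identically named phones act as seeds and reveal that "brand", "memory", and "color" on one site correspond to "maker", "ram", and "colour" on the other. The third phone, named differently on each site, is then matched on all three attributes, and lowering the threshold admits the red variant as well, which mirrors the adjustable "same commodity" criterion.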
The characteristics of distributed computing architectures are analyzed, and a Hadoop-based distributed data analysis and mining system is designed. Because Hadoop is poorly suited to iterative computation, traditional hierarchical clustering requires a very large number of iterations on Hadoop. Based on the principle of the hierarchical clustering algorithm and the structural characteristics of Hadoop, an improved hierarchical clustering algorithm is designed. Provided that the distance between clusters increases monotonically, it leaves the clustering result unchanged while merging many clusters in a single clustering pass, which reduces the number of iterations and greatly improves the performance of hierarchical clustering under Hadoop. The feasibility of the method is verified for the case where multi-dimensional feature information is lacking, by first computing the similarity between commodities and then applying hierarchical clustering to those similarities.
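The idea of merging many clusters per pass without changing the result can be sketched on a single machine. The abstract does not specify the thesis's procedure, so the example below substitutes a well-known technique: with a linkage whose merge distances never decrease (complete linkage is used here), every pair of clusters that are each other's nearest neighbour can be merged in the same round, and the resulting hierarchy equals that of the one-merge-per-step algorithm when cluster distances are distinct. Treating the abstract's "monotonically increasing distance" as this property is an assumption, and the function names and sample points are hypothetical. Each round stands in for what would be one distributed pass.

```python
import itertools
import math


def complete_linkage(a, b, points):
    """Distance between clusters a and b: the largest point-to-point distance."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)


def sequential_merges(points):
    """Classic agglomerative clustering: merge the single closest pair per step."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: complete_linkage(pair[0], pair[1], points))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges


def batched_merges(points):
    """Merge every mutual-nearest-neighbour pair in the same round."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges, rounds = [], 0
    while len(clusters) > 1:
        rounds += 1
        # Nearest other cluster for each cluster.
        nearest = {
            c: min((o for o in clusters if o != c),
                   key=lambda o: complete_linkage(c, o, points))
            for c in clusters
        }
        # A pair is merged only if each side is the other's nearest neighbour.
        paired = {frozenset((c, n)) for c, n in nearest.items()
                  if nearest[n] == c}
        merged = [frozenset().union(*pair) for pair in paired]
        in_pair = set().union(*paired)
        clusters = [c for c in clusters if c not in in_pair] + merged
        merges.extend(merged)
    return merges, rounds


# Hypothetical 2-D points forming four tight pairs.
POINTS = [(0, 0), (1, 0), (10, 0), (11.5, 0),
          (0, 10), (2, 10), (10, 10), (12.5, 10)]

if __name__ == "__main__":
    batched, rounds = batched_merges(POINTS)
    print(rounds, "rounds versus", len(sequential_merges(POINTS)), "sequential steps")
    print(set(batched) == set(sequential_merges(POINTS)))
```

On these eight points the sequential version needs seven merge steps while the batched version finishes in three rounds and produces the same set of clusters. On Hadoop, where every iteration carries job start-up and I/O cost, that reduction in rounds is the kind of saving the abstract describes.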
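[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13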