Distributed Graph Clustering and Its Application in E-Commerce Data Mining

Published: 2018-05-18 21:47

Topics: distributed clustering + graph clustering; source: master's thesis, Donghua University, 2013

[Abstract]: A graph is a common data structure made up of nodes and the edges that connect them, and it has become a standard tool for modeling complex objects and the relationships among them. On an e-commerce website, every customer login and every purchase generates transaction records in the site's back-end database. These records can be used to build many kinds of customer relationship graphs. Taking customers who buy the same item as an example, each node of the graph represents a customer, and an edge indicates that two customers have purchased the same item on the site. Like other types of data, such a customer relationship graph holds a wealth of information and knowledge, and it has practical value for customer relationship management on e-commerce websites.
Graph clustering uses clustering techniques to find clusters in a graph whose members are tightly connected internally and loosely connected to the rest of the graph. It has been applied in practice to community detection in social networks and to protein complex detection. In the customer relationship graph described above, graph clustering can be used to mine distinct customer clusters. A mined cluster may indicate that its customers share similar interests and preferences, or that they have a similar family structure, age range, and so on. This information can guide an e-commerce website in making personalized product recommendations, formulating more targeted marketing strategies, and improving its operations.
Some mainstream e-commerce websites, such as Taobao and Yihaodian (一号店), have very large customer bases, so the relationship graphs formed by their customers are also very large. Faced with this volume of data, a single workstation cannot meet the demand in either CPU computing power or memory, and the clustering analysis cannot run properly. How to mine customer clusters effectively from large-scale customer relationship graphs has become a shared concern in the industry.
MapReduce is a parallel programming model that can link hundreds or even thousands of computers, pooling their resources into a large machine cluster, and it is particularly well suited to the parallel processing of large-scale data. Building on these strengths, this thesis attempts to combine MapReduce with traditional graph clustering methods, proposes a distributed graph clustering method, and applies it to the practical task of customer relationship discovery.
The thesis takes the "Steel Trade Website Transaction Data Analysis" project, in which the author participated, as its application example. Using the five years of transaction data accumulated by a steel trading company from 2006 to 2011, it applies graph clustering to identify steel trade customer groups, providing decision support for the company in formulating effective steel sales strategies. Specifically, the main research contents are as follows:
1) The thesis first introduces the related technologies, including data mining, graph clustering, the MapReduce parallel framework, and its open-source implementation Hadoop.
2) Taking a steel trade e-commerce website as a concrete example and considering the actual characteristics of steel trade transaction data, it then describes the construction of the steel trade transaction data warehouse and discusses in detail the modeling of the steel trade customer relationship graph.
3) Based on the MapReduce framework, the thesis proposes a distributed graph clustering algorithm, MR-LSH, to address the problem of using LSH for scalable parallel clustering of large-scale graph data in a distributed environment. The algorithm combines the MapReduce parallel framework with Locality Sensitive Hashing (LSH), yielding an LSH-based distributed graph clustering algorithm that runs within MapReduce. The thesis presents the idea behind MR-LSH and its implementation framework in detail, and describes how each step of the framework is implemented.
On this basis, the thesis uses the customer relationship graph generated from the steel trading company's 2006 to 2011 transaction data to demonstrate, through a worked example, the feasibility and practicality of the proposed distributed graph clustering in e-commerce data mining. The experimental results show that the system is safe and reliable, easy to maintain, and has good scalability.
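The abstract does not give the internals of MR-LSH, so the following is only a minimal illustrative sketch of the two ideas it names: building a customer graph from co-purchases, and grouping nodes with locality sensitive hashing in a map step and a reduce step. Everything specific here is an assumption and not the thesis's method: MinHash over each customer's closed neighbourhood as the LSH family, the banding parameters, the function names, and the sample item names. Plain in-memory Python functions stand in for Hadoop map and reduce tasks.

```python
from collections import defaultdict
from itertools import combinations
import hashlib


def build_customer_graph(transactions):
    """Build an adjacency dict from (customer, item) records.

    Two customers are linked when they bought the same item.
    Customers who share no item with anyone do not appear as nodes.
    """
    buyers = defaultdict(set)
    for customer, item in transactions:
        buyers[item].add(customer)
    graph = defaultdict(set)
    for customers in buyers.values():
        for a, b in combinations(sorted(customers), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def _hash(seed, value):
    # Deterministic hash family indexed by seed (assumed choice: MD5).
    digest = hashlib.md5(f"{seed}:{value}".encode()).hexdigest()
    return int(digest, 16)


def minhash_signature(members, num_hashes):
    # One MinHash value per seed; similar sets tend to agree on many values.
    return tuple(min(_hash(seed, m) for m in members)
                 for seed in range(num_hashes))


def map_phase(graph, num_hashes=8, bands=4):
    """Map step: emit ((band index, band values), node) pairs."""
    rows = num_hashes // bands
    for node, neighbours in graph.items():
        # Closed neighbourhood, so members of a clique share the same set.
        closed = neighbours | {node}
        sig = minhash_signature(closed, num_hashes)
        for b in range(bands):
            yield (b, sig[b * rows:(b + 1) * rows]), node


def reduce_phase(pairs):
    """Reduce step: nodes sharing a band key form a candidate cluster."""
    buckets = defaultdict(set)
    for key, node in pairs:
        buckets[key].add(node)
    return {frozenset(nodes) for nodes in buckets.values() if len(nodes) > 1}


# Hypothetical sample data: A, B, C bought "rebar"; D, E bought "coil".
transactions = [("A", "rebar"), ("B", "rebar"), ("C", "rebar"),
                ("D", "coil"), ("E", "coil")]
graph = build_customer_graph(transactions)
clusters = reduce_phase(map_phase(graph))
print(clusters)
```

In this toy input the customers in each group have identical closed neighbourhoods, so they hash to the same bucket in every band and come out as two candidate clusters. On real data the buckets would only be candidates, and a further step (for example, checking actual edge density inside each bucket) would be needed before treating them as clusters.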
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP311.13;F713.36

[References]