Design and Implementation of a Distributed Hierarchical Clustering Algorithm for Massive Commodity Data

Published: 2019-03-16 15:14

【Abstract】: Advances in computer science and information technology have made it easy for enterprises to collect and store large volumes of data. Data that is merely collected, however, only occupies storage space and adds little to the value of the enterprise, so enterprises have begun to mine it for information. Traditionally, experts analyzed and interpreted the data, an approach that grows increasingly difficult as data volume and the number of attributes rise sharply. How to discover knowledge automatically and effectively in huge databases, and to process it further into the business intelligence that enterprises depend on, has therefore become an important issue for enterprises and institutions in the twenty-first century. In production practice, the conflict between the rate at which data grows and the large amount of time that analysis takes has become increasingly prominent. Data mining is a technology that emerged to address the shortcomings of traditional analysis methods when processing large-scale data. By applying self-learning algorithms to large-scale data sets, it extracts knowledge and information that is hidden in the data and otherwise hard to obtain. As the main regulator of national commodity imports and exports, customs is both the producer and the owner of massive import and export data. As the informatization of its business processes has deepened and matured, customs has largely achieved fairly complete data-based supervision and digital operation capabilities. At the same time, the conflict between relatively limited means of data analysis and the growing volume of data and business complexity is increasingly evident. How to classify and manage the enormous number of declared commodities effectively has become an urgent problem in customs supervision.
Centered on a customs commodity data analysis project, this thesis implements a series of processing modules for commodity data on top of the MapReduce framework, forming a distributed clustering system for commodity data. The main work covers preprocessing of commodity data, TF-IDF calculation, inverted index construction, similarity matrix calculation, and single-linkage hierarchical clustering. Finally, the hierarchical clustering results are used to organize the customs commodity data, providing a precise basis for grouped statistics to the customs intelligence analysis and assessment module, and the system has produced results in practical application.
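The processing steps named in the abstract (TF-IDF calculation, inverted index construction, similarity calculation, single-linkage hierarchical clustering) can be illustrated with a minimal single-machine sketch. This is not the thesis's MapReduce implementation: the function names, the cosine similarity measure, the similarity threshold, and the sample commodity descriptions are all assumptions made for illustration. The clustering step relies on the fact that cutting a single-linkage dendrogram at a similarity threshold gives the connected components of the graph whose edges are the pairs at or above that threshold.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def inverted_index(vectors):
    """Map each term with a positive weight to the ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            if weight > 0:
                index[term].append(doc_id)
    return index


def cosine_similarities(vectors, index):
    """Cosine similarity for document pairs that share at least one indexed term."""
    dots = defaultdict(float)
    for term, ids in index.items():
        # ids are in increasing order, so every key (i, j) has i < j.
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                dots[(i, j)] += vectors[i][term] * vectors[j][term]
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
    return {(i, j): dot / (norms[i] * norms[j]) for (i, j), dot in dots.items()}


def single_linkage(n, sims, threshold):
    """Cut the single-linkage hierarchy at `threshold` using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the most similar pairs first, stopping below the threshold.
    for (i, j), s in sorted(sims.items(), key=lambda kv: -kv[1]):
        if s < threshold:
            break
        parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for doc_id in range(n):
        clusters[find(doc_id)].append(doc_id)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Made-up, already tokenized commodity descriptions.
    docs = [
        ["cotton", "shirt", "men"],
        ["cotton", "shirt", "women"],
        ["steel", "pipe", "seamless"],
        ["steel", "pipe", "welded"],
    ]
    vectors = tf_idf(docs)
    sims = cosine_similarities(vectors, inverted_index(vectors))
    print(single_linkage(len(docs), sims, threshold=0.3))  # [[0, 1], [2, 3]]
```

The inverted index restricts similarity computation to pairs that share at least one term, which avoids filling in a dense similarity matrix. In a distributed setting each of these functions would plausibly correspond to one or more map and reduce stages, but that mapping is not specified in the abstract.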
【Degree-granting institution】: Zhejiang University
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP311.13

【Related literature】