云环境下具有隐私保护的K-means聚类算法研究与设计

发布时间：2018-05-07 15:39

本文选题：隐私保护 + 数据挖掘　；参考：《哈尔滨工业大学》2017年硕士论文

【摘要】：众所周知,K-means聚类是数据挖掘中非常经典和常用的方法之一,它通过计算数据项之间的距离可以把相似的数据项聚集在一起。随着信息化、数字化、网络化进程加速,经济全球化已成为一种不可逆的趋势,聚类算法中的数据来源越来越多样化,数据安全越来越重要。考虑到数据会来自多个参与方,在这些数据中可能包含关于参与方的敏感信息或私人信息,如果这些信息在多个参与方之间共享,那么数据的隐私性将不能得到保证。具有隐私保护的联合数据挖掘可以在保护用户数据和挖掘结果隐私性的同时,对多个参与方的联合数据库进行数据挖掘,进一步提取出有用的信息。因此,如何设计出具有隐私保护的联合数据挖掘算法成为一个需要解决的难题。半诚实模型在许多情况下是符合实际场景的,该模型下数据的隐私性是通过各个参与方始终遵循协议来保证的。但是为保证数据的隐私性,该模型下的解决方案通常因为计算消耗和通信消耗较高,所以实际中并不可行。如今,随着科学技术的进步,越来越多的企业将数据存储在云平台,同时分布式云计算框架也为处理大数据提供强大的计算能力。本论文将借助云计算强大的计算能力提升算法的效率,保证算法的可行性。针对具有隐私保护的数据挖掘中存在的性能问题,本论文开展了对现有具有隐私保护的数据挖掘算法的深入研究,进而在水平划分的数据集上提出一种高效的具有隐私保护的K-means聚类算法,该算法支持有两个数据拥有者和云平台同时存在的存储外包和计算外包。数据以密文形式存储在云端,云平台通过与两个数据拥有者交互,完成在双方的联合数据集上K-means聚类数据挖掘的任务。本论文分别设计不同的安全协议解决具有隐私保护的K-means聚类算法中的三个技术难题:解决密文距离计算问题的安全距离计算协议、解决密文比较问题的安全比较协议和解决密文除法问题的安全电路协议。进而将这些安全协议应用到聚类算法框架中,实现具有隐私保护的K-means聚类算法。本论文从理论上分析了该算法的时间复杂度、空间复杂度和通讯复杂度,给出该算法在半诚实模型下的安全性证明,并且证明该算法在重计算质心点阶段允许参与方中最多有一个方为恶意方的安全性,最后通过实验计算加密数据的时间消耗和一次迭代过程中各参与方的时间消耗,验证了算法的可行性。
[Abstract]:It is well known that K-means clustering is one of the most classical and commonly used methods in data mining. It can gather similar data items together by calculating the distance between data items. With the acceleration of information, digitization and networking, economic globalization has become an irreversible trend. Data sources in clustering algorithms are becoming more and more diverse, and data security is becoming more and more important. Considering that the data will come from more than one participant, sensitive information or private information about the participants may be included in the data. If the information is shared among the participants, the privacy of the data will not be guaranteed. The joint data mining with privacy protection can not only protect the privacy of user data and mining results, but also mine the federated database of multiple participants to extract useful information. Therefore, how to design a joint data mining algorithm with privacy protection becomes a difficult problem to be solved. In many cases, the semi-honest model is in line with the actual scenario, and the privacy of the data in the model is guaranteed by the agreement that each participant always follows. However, in order to ensure the privacy of the data, the solution under the model is usually not feasible because of the high computation consumption and communication consumption. Nowadays, with the development of science and technology, more and more enterprises store data on cloud platform, and distributed cloud computing framework also provides powerful computing power for big data. This paper will improve the efficiency of the algorithm with the help of the powerful computing power of cloud computing, and ensure the feasibility of the algorithm. Aiming at the performance problems in data mining with privacy protection, this paper has carried out in-depth research on existing data mining algorithms with privacy protection. Furthermore, an efficient K-means clustering algorithm with privacy protection on horizontally partitioned data sets is proposed, which supports both storage outsourcing and computing outsourcing with two data owners and cloud platforms. The data is stored in the cloud in the form of ciphertext. By interacting with two data owners, the cloud platform completes the task of K-means clustering data mining on the joint data sets of both parties. In this paper, we design different security protocols to solve the three technical problems in K-means clustering algorithm with privacy protection: the secure distance computing protocol to solve the problem of ciphertext distance calculation. Security comparison protocol to solve ciphertext comparison problem and secure circuit protocol to solve ciphertext division problem. Then, these security protocols are applied to the clustering algorithm framework to implement the K-means clustering algorithm with privacy protection. In this paper, the time complexity, space complexity and communication complexity of the algorithm are analyzed theoretically, and the security proof of the algorithm under semi-honest model is given. It is proved that the algorithm allows the security of up to one of the participants to be malicious in the phase of recalculating the centroid point. Finally, the time consumption of encrypted data and the time consumption of each participant in an iterative process are calculated experimentally. The feasibility of the algorithm is verified.
【学位授予单位】：哈尔滨工业大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP309;TP311.13

【参考文献】