Research on Privacy Protection Techniques for Crowdsourcing Databases

Published: 2018-07-20 17:59

[Abstract]: A crowdsourcing database is a new type of database that uses a crowdsourcing platform to combine human intelligence with machine computation in order to answer queries that traditional relational databases handle poorly. Its core idea is to publish a query and the corresponding data set on the Internet as crowdsourcing tasks, which are ultimately handed to ordinary Internet users and solved with human intelligence. However, if a data set containing private information is sent to Internet users without any processing, that information may be leaked. Privacy has been studied in the traditional database field for many years, and data anonymization has proved effective in practical applications such as data publishing. Existing anonymization techniques, however, cannot simply be applied to crowdsourcing databases. First, crowdsourcing databases are usually large and stored in a distributed manner across different nodes, and existing algorithms cannot process such large-scale, distributed data efficiently. Second, existing algorithms lose too much task-related information, which lowers the quality of task completion. To improve the completion quality of crowdsourcing tasks, the Two-Phase Partition anonymization algorithm, which is based on space partitioning, uses sampling to retain more task-related information and thereby improves the utility of the anonymized data. The first phase, Pre-Partition, takes the sample coordinates as candidate split points, performs a global partition of the space, designs an evaluation function based on the true values, and selects the optimal set of split points. The second phase, Further-Partition, takes the output of the first phase as its candidate split points, performs a kd-tree-based local partition of the space, and then replaces the data values according to the resulting subspace boundaries to complete the anonymization.
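The second phase can be pictured with a small sketch. The abstract does not give the evaluation function, the privacy parameter, or the split rule, so everything below is an assumption made for illustration: a hypothetical minimum group size `k`, a balance-based choice among the candidate split points, and replacement of each record by the bounding box of its group. It is a generic kd-tree-style partition, not the thesis's actual algorithm.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Tuple[float, float], ...]  # one (low, high) interval per dimension


def bounding_box(points: Sequence[Point]) -> Box:
    """Smallest axis-aligned box that contains every point."""
    dims = len(points[0])
    return tuple(
        (min(p[d] for p in points), max(p[d] for p in points)) for d in range(dims)
    )


def further_partition(
    points: Sequence[Point],
    candidates: Sequence[Sequence[float]],
    k: int,
    depth: int = 0,
) -> List[List[Point]]:
    """Recursively split the points, kd-tree style, using only candidate split values.

    candidates[d] holds the allowed split values for dimension d (assumed here to
    be the output of an earlier pre-partition step). A split is accepted only if
    both sides keep at least k points (assumed privacy constraint).
    """
    dims = len(points[0])
    for offset in range(dims):
        d = (depth + offset) % dims  # cycle through dimensions, starting at depth
        best = None
        for c in candidates[d]:
            left = [p for p in points if p[d] <= c]
            right = [p for p in points if p[d] > c]
            if len(left) >= k and len(right) >= k:
                imbalance = abs(len(left) - len(right))
                if best is None or imbalance < best[0]:
                    best = (imbalance, left, right)
        if best is not None:
            _, left, right = best
            return further_partition(left, candidates, k, depth + 1) + further_partition(
                right, candidates, k, depth + 1
            )
    return [list(points)]  # no valid split in any dimension: this group is a leaf


def anonymize(
    points: Sequence[Point], candidates: Sequence[Sequence[float]], k: int
) -> List[Box]:
    """Replace every record by the bounding box of the leaf group it falls into."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not points:
        return []
    result: List[Box] = []
    for group in further_partition(points, candidates, k):
        box = bounding_box(group)
        result.extend([box] * len(group))
    return result
```

For example, `anonymize([(1, 1), (2, 1), (8, 9), (9, 9)], [[5], [5]], k=2)` splits once on the first dimension at 5 and generalizes each pair of records to its shared box. Restricting splits to a candidate set is what lets an earlier, sample-driven phase steer where the boundaries may fall.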
To handle large-scale, distributed crowdsourcing databases efficiently, a parallel anonymization framework based on MapReduce is used to parallelize the Two-Phase Partition algorithm. The framework uses hashing to repartition the original data set into multiple sub-datasets, anonymizes each sub-dataset separately, and then merges the results into a complete anonymized data set. Experiments show that, compared with existing algorithms, the single-machine Two-Phase Partition algorithm improves query accuracy by more than 20%, and that accuracy increases as the sample ratio grows. After the algorithm is parallelized with the framework, query accuracy is slightly lower than that of the single-machine version, but the decrease is within 5%, and execution cost grows linearly with the size of the data set. The parallel anonymization scheme is therefore suitable for addressing the privacy problem of large-scale, distributed crowdsourcing databases.
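The hash-repartition, anonymize, and merge flow can be simulated in a single process. This is a minimal sketch under stated assumptions, not the thesis's MapReduce implementation: the abstract does not say which key is hashed or how the jobs are structured, so the record id as hash key, the function names, and the placeholder `box_anonymizer` (which stands in for the real per-partition algorithm) are all hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Coords = Tuple[float, ...]
Box = Tuple[Tuple[float, float], ...]
Record = Tuple[int, Coords]  # (record id, coordinates)


def map_phase(records: Sequence[Record], num_partitions: int) -> Dict[int, List[Record]]:
    """Map step: assign each record to a sub-dataset by hashing its id.

    zlib.crc32 is used because it is deterministic across runs and machines.
    """
    buckets: Dict[int, List[Record]] = defaultdict(list)
    for rec_id, coords in records:
        key = zlib.crc32(str(rec_id).encode("utf-8")) % num_partitions
        buckets[key].append((rec_id, coords))
    return buckets


def box_anonymizer(coords_list: Sequence[Coords]) -> List[Box]:
    """Placeholder anonymizer: generalize every row to one shared bounding box."""
    dims = len(coords_list[0])
    box = tuple(
        (min(c[d] for c in coords_list), max(c[d] for c in coords_list))
        for d in range(dims)
    )
    return [box] * len(coords_list)


def reduce_phase(
    buckets: Dict[int, List[Record]],
    anonymizer: Callable[[Sequence[Coords]], List[Box]],
) -> List[Tuple[int, Box]]:
    """Reduce step: anonymize each sub-dataset on its own, then merge the outputs."""
    merged: List[Tuple[int, Box]] = []
    for key in sorted(buckets):
        group = buckets[key]
        anonymized = anonymizer([coords for _, coords in group])
        merged.extend((rec_id, box) for (rec_id, _), box in zip(group, anonymized))
    return merged


def parallel_anonymize(
    records: Sequence[Record],
    num_partitions: int,
    anonymizer: Callable[[Sequence[Coords]], List[Box]] = box_anonymizer,
) -> List[Tuple[int, Box]]:
    """Run the map and reduce steps in sequence (sequential stand-in for a cluster)."""
    return reduce_phase(map_phase(records, num_partitions), anonymizer)
```

Because each sub-dataset is anonymized without seeing the others, the boundaries it picks can differ from those a single global run would choose, which is one plausible reading of why the abstract reports slightly lower query accuracy for the parallel version.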
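【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year degree conferred】: 2016
【Classification number】: TP309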
