Research on Privacy Protection Methods for the Joint Publication of Set-Valued Data and Social Networks

Published: 2019-04-18 15:17

[Abstract]: With the rapid development and spread of the Internet, applications such as WeChat, Facebook, and online shopping platforms generate massive amounts of data. The latent associations among these data carry great social and economic value, for example in group behavior analysis and in supporting business decisions. Before data are released to data miners they must be privacy-protected, because they usually contain private information about many users that could otherwise be disclosed. Data privacy protection has been an active research area in recent years and has produced many results, but existing work mainly targets a single type of data. In the big data era, data mining draws on multiple sources; for example, social network data and transactional data are mined together to address the cold-start problem in shopping recommender systems. With multi-source data, the attacker's background knowledge grows and new privacy problems arise, so existing privacy protection methods are no longer suitable for the joint publication of multi-source data.

Compared with relational data, set-valued data are high-dimensional and sparse, so privacy methods designed for relational data do not carry over to them; protecting set-valued data with the k-anonymity model, for instance, causes excessive information loss. The ρ-uncertainty model balances privacy protection and information loss well in this setting, and many recent results on set-valued data privacy build on it. Social network data also have many protection models, such as k-degree anonymity and l-diversity, which meet their privacy requirements by adding or deleting edges or nodes. These models protect a single type of data, but when social network data and set-valued data are published jointly, the additional background knowledge can raise the probability of disclosing a victim's information above ρ, which violates the privacy requirement. For the joint publication of social network data and set-valued data, this thesis therefore proposes a grouped ρ-uncertainty privacy protection model. The main work is as follows.

First, the existing privacy protection models for set-valued data and for social network data are analyzed, and an attack model for joint data publication is proposed against which the existing single-data-type models are no longer adequate. Given background knowledge of any items in the set-valued data, the ρ-uncertainty model guarantees that the probability of inferring a sensitive item does not exceed ρ. This holds when the set-valued data are published alone. When they are published together with a social network, however, an attacker who also knows how many friends the victim has in the social application, that is, the degree of the victim's node in the social network data, can infer the victim's sensitive items in the set-valued data with probability greater than ρ, so the privacy requirement is not met.

Second, to counter this attack model, the thesis combines the ρ-uncertainty model with the degree anonymity model and proposes the grouped ρ-uncertainty privacy protection model. The model first requires a generalization tree built from item attributes, for example generalizing apple and banana to fruit. The set-valued data are then grouped according to the generalization tree: records whose non-sensitive items share the same parent node in the tree are placed in the same group. The model requires every group to satisfy ρ-uncertainty, and the thesis proves that if every group satisfies ρ-uncertainty, the data as a whole also satisfy it. Finally, the social network nodes are grouped consistently with the grouping of the set-valued data and anonymized within each group so that all nodes in a group have the same degree. Under the background knowledge described above, the probability of successfully inferring a victim's sensitive items is then below ρ, which meets the anonymity requirement.

Third, based on the grouped ρ-uncertainty model, the thesis designs a privacy protection algorithm. To reduce information loss and improve data utility, the algorithm processes the set-valued data by combining local generalization with partial deletion (suppression). It applies top-down local generalization and, when the data do not satisfy the privacy requirement, uses partial deletion to restore it. Moving an item down the generalization tree reduces information loss, but the partial deletion it may trigger increases loss, so the information loss before and after each step is compared: the step is accepted if the resulting information loss is lower and rejected otherwise. When anonymizing the social network data, the algorithm tries to preserve the community structure by preferentially deleting edges between communities and preferentially adding edges within communities, which reduces the impact of edge additions and deletions on that structure.

Finally, to validate the practicality of the algorithm, the thesis evaluates the utility of the set-valued data in terms of information loss and related measures, and the utility of the social network data in terms of the Jaccard similarity coefficient and related measures. The experimental results show that the algorithm retains good data utility while protecting privacy.
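The degree-based attack described in the abstract can be illustrated with a toy example. The sketch below is not the thesis's algorithm: the user records, item names, threshold value, and the helpers `inference_probability` and `equalize_degrees` are all hypothetical, and assigning a common degree directly stands in for the edge additions and deletions a real anonymization would perform. It only shows how knowing a victim's node degree can push the inference probability above ρ, and how equal degrees within a group bring it back down.

```python
from fractions import Fraction

RHO = Fraction(1, 2)  # hypothetical privacy threshold

# Hypothetical joint release: each user has a set-valued record and a node degree.
users = {
    "u1": {"items": {"apple", "pregnancy_test"}, "degree": 5},
    "u2": {"items": {"apple", "milk"}, "degree": 2},
    "u3": {"items": {"apple"}, "degree": 2},
    "u4": {"items": {"apple", "banana"}, "degree": 3},
}


def inference_probability(users, known_items, sensitive_item, known_degree=None):
    """Share of records consistent with the attacker's background knowledge
    that also contain the sensitive item."""
    candidates = [
        u for u in users.values()
        if known_items <= u["items"]
        and (known_degree is None or u["degree"] == known_degree)
    ]
    if not candidates:
        return Fraction(0)
    hits = sum(1 for u in candidates if sensitive_item in u["items"])
    return Fraction(hits, len(candidates))


def equalize_degrees(users, group, target_degree):
    """Return a copy in which every user in `group` has `target_degree`.
    Assumption: this replaces the real edge edits of degree anonymization."""
    return {
        name: {**u, "degree": target_degree} if name in group else dict(u)
        for name, u in users.items()
    }


# Attacker knows only that the victim bought "apple".
items_only = inference_probability(users, {"apple"}, "pregnancy_test")

# Attacker additionally knows the victim has 5 friends.
with_degree = inference_probability(
    users, {"apple"}, "pregnancy_test", known_degree=5
)

# All four users form one group (they share a fruit item) and get the same degree.
anonymized = equalize_degrees(users, {"u1", "u2", "u3", "u4"}, target_degree=3)
after = inference_probability(
    anonymized, {"apple"}, "pregnancy_test", known_degree=3
)

print(items_only <= RHO, with_degree <= RHO, after <= RHO)
```

With item knowledge alone, one of four matching records contains the sensitive item, so the probability stays under the threshold. Adding the degree narrows the candidates to a single record, and once all nodes in the group share a degree, the degree no longer distinguishes the victim.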
[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP309
