Information Protection Based on a Related Entity Retrieval Model

Published: 2018-05-21 09:42

Topics: information protection + entity retrieval; Source: Fudan University, master's thesis, 2012

[Abstract]: With advances in natural language processing, data mining, and related technologies, and especially with the widespread use of search engines, information that was once scattered can now be organized efficiently, and ordinary users can easily obtain the information they want from the web. Powerful web information retrieval is a double-edged sword, however: as users gain faster access to external knowledge, hiding their own private information becomes increasingly difficult. Even when the information that users post on web applications such as forums, blogs, and social networks was originally safe, an attacker who uses a search engine to infer related entities may still cause that information to leak. Traditional information protection research is concentrated in the database and information security fields: the former mainly studies protection of structured data, and the latter mainly studies security along the transmission path. As a key subproject of the 863 Program project "Research on Key Technologies for Web-Based User Data Security Protection", this thesis studies the relatedness of sensitive information in large-scale unstructured data and builds a framework for protecting sensitive information in the Internet environment; the relevant research background lies mainly in information retrieval and natural language processing. Building on search engines and taking the characteristics of Internet user data into account, the thesis combines a range of text mining and information retrieval techniques and proposes a multi-angle association model that predicts potential leaks of user information through related entity retrieval, thereby protecting that information. The main work of the thesis is as follows:
● It reviews the state of research on information protection, the traditional protection methods of the database and information security fields, and the techniques and methods involved in protecting large-scale unstructured data.
● It proposes an information protection framework based on a related entity retrieval algorithm, builds a multi-angle entity association model, and improves the model's retrieval results through deep mining of authoritative homepages.
● On the basis of this framework, it designs and implements an information protection system that operates on a massive Internet corpus. The system's related entity retrieval module was evaluated on the related entity task data set of TREC 2010. Compared with other entity retrieval methods, such as those based on BM25 and Bayesian models, the proposed method scores higher on every evaluation metric, which demonstrates the accuracy and applicability of the model and the effectiveness of the approach.
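The abstract names BM25 as one of the baselines but does not give its formulation. For orientation, the following is a minimal sketch of standard Okapi BM25 scoring, not the thesis's own model. The function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document for a tokenized query.

    Assumes `docs` is a non-empty list of non-empty token lists.
    `k1` and `b` are common default values, not values from the thesis.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "1 +" inside the log keeps weights non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus: which posts mention both query entities?
docs = [
    ["alice", "works", "at", "fudan", "university"],
    ["bob", "likes", "football"],
    ["alice", "posted", "on", "a", "forum", "about", "fudan"],
]
query = ["alice", "fudan"]
scores = bm25_scores(query, docs)
```

In this toy corpus the second document shares no term with the query and scores zero, while the first outranks the third because it is shorter and both contain each query term once.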
[Degree-granting institution]: Fudan University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP311.13

[References]