HDFS-Based Centralized Storage and Retrieval System for Electronic Files
Published: 2018-07-15 09:54
[Abstract]: As government informatization advances in China, electronic files have developed rapidly, and the number of electronic files produced in government work now exceeds the number of paper files. Compared with paper files, the management of electronic files is still immature, especially with respect to storage: because electronic files are easy to transmit and preserve, they no longer have to be stored in a geographically dispersed way. Centralized storage can effectively strengthen control over electronic files, improve office efficiency, reduce staffing costs, and address problems such as file loss and leakage. At the same time, how centralized storage of massive numbers of electronic files is realized directly affects the implementation and efficiency of the whole system. Cloud storage is a networked online storage model in which data is kept in virtualized storage pools; hardware permitting, it can provide almost unlimited low-cost storage capacity, and it can efficiently solve the problem of centrally storing massive numbers of electronic files. The Hadoop Distributed File System (HDFS), an open-source cloud storage file system based on the design of the Google File System (GFS), has become a focus of cloud storage research because of its excellent performance and reliability in handling very large files. However, electronic files in e-government are mostly small files, and HDFS performs poorly when storing and accessing massive numbers of small files. To address this weakness, this thesis proposes a strategy that uses a storage cache and a read cache to improve the efficiency of storing and accessing massive numbers of small files. The basic idea is to design and implement HDFS middleware that satisfies storage and access requests while reducing the number of HDFS accesses, thereby improving storage and access efficiency.
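As a rough illustration of how a middleware layer might cut the number of HDFS accesses on the write path, the sketch below packs small files into a few in-memory buffers and writes each buffer out as one merged unit. It is only a sketch under assumptions: the class `StorageCache`, the best-fit buffer choice, the flush-the-fullest rule, and the `flush` callback standing in for an HDFS write are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

File = Tuple[str, bytes]  # (file name, file content)


@dataclass
class Buffer:
    capacity: int
    used: int = 0
    files: List[File] = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - self.used


class StorageCache:
    """Hypothetical write-side cache: packs small files into buffers and
    hands each full buffer to `flush` as a single backend write."""

    def __init__(self, n_buffers: int, capacity: int,
                 flush: Callable[[List[File]], None]):
        self.capacity = capacity
        self.buffers = [Buffer(capacity) for _ in range(n_buffers)]
        self.flush = flush  # assumption: stands in for one HDFS write

    def put(self, name: str, data: bytes) -> None:
        size = len(data)
        if size > self.capacity:
            # Too large to buffer: write it through on its own.
            self.flush([(name, data)])
            return
        fitting = [b for b in self.buffers if b.free() >= size]
        if fitting:
            # Assumed selection rule: best fit (least space left over).
            target = min(fitting, key=lambda b: b.free() - size)
        else:
            # No buffer has room: flush the fullest one and reuse it.
            target = max(self.buffers, key=lambda b: b.used)
            self._flush_buffer(target)
        target.files.append((name, data))
        target.used += size
        if target.free() == 0:
            self._flush_buffer(target)

    def _flush_buffer(self, buf: Buffer) -> None:
        if buf.files:
            self.flush(buf.files)
        buf.files = []
        buf.used = 0

    def close(self) -> None:
        # Write out whatever is still buffered.
        for buf in self.buffers:
            self._flush_buffer(buf)


writes: List[List[File]] = []
cache = StorageCache(n_buffers=2, capacity=10, flush=writes.append)
for i in range(6):
    cache.put(f"doc{i}.txt", b"abcd")
cache.close()
print(len(writes))  # 3 merged writes for 6 small files
```

A real middleware would also have to record where each small file sits inside the merged unit so that it can be read back later; that index is left out here.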
The storage cache strategy sets up multiple buffers; when small files are stored, choosing among the buffers in an optimized way raises buffer utilization and thus reduces the number of HDFS accesses. The read cache strategy manages the entire fixed-size read cache with a buddy system and sets an efficiency threshold for each cache segment; the threshold controls how the cache is updated, so that cache utilization is maximized, file accesses are served from the read cache as often as possible, and the number of HDFS accesses is reduced. For security, the thesis applies multi-level encryption to protect the confidentiality and privacy of electronic files during centralized storage and access. Finally, a prototype system is implemented and tested to demonstrate the feasibility and usability of these ideas.
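To make the read-cache idea more concrete, here is a minimal sketch of a fixed-size cache whose space is handed out by a buddy allocator and whose segments are evicted when they fall below an efficiency threshold. The abstract does not define the threshold or the update rule, so everything specific here is an assumption: efficiency is taken to be hits per byte of the allocated block, one shared threshold is used for all segments, and the names `BuddyAllocator`, `ReadCache`, and `load` are hypothetical.

```python
from typing import Callable, Dict, List, Optional


class BuddyAllocator:
    """Buddy allocation over a fixed region; block sizes are powers of two."""

    def __init__(self, total: int, min_block: int):
        assert 0 < min_block <= total
        assert total & (total - 1) == 0 and min_block & (min_block - 1) == 0
        self.total, self.min_block = total, min_block
        self.free: Dict[int, List[int]] = {total: [0]}  # block size -> offsets
        self.allocated: Dict[int, int] = {}             # offset -> block size

    def alloc(self, size: int) -> Optional[int]:
        want = self.min_block
        while want < size:          # round up to a power-of-two block
            want *= 2
        cur = want
        while cur <= self.total and not self.free.get(cur):
            cur *= 2                # look for a larger free block
        if cur > self.total:
            return None
        offset = self.free[cur].pop()
        while cur > want:           # split, keeping the lower half
            cur //= 2
            self.free.setdefault(cur, []).append(offset + cur)
        self.allocated[offset] = want
        return offset

    def release(self, offset: int) -> None:
        size = self.allocated.pop(offset)
        while size < self.total:
            buddy = offset ^ size   # address of the sibling block
            if buddy not in self.free.get(size, []):
                break
            self.free[size].remove(buddy)   # merge with the free buddy
            offset = min(offset, buddy)
            size *= 2
        self.free.setdefault(size, []).append(offset)


class ReadCache:
    def __init__(self, total: int, min_block: int, threshold: float):
        self.buddy = BuddyAllocator(total, min_block)
        self.threshold = threshold  # assumed meaning: minimum hits per byte
        self.entries: Dict[str, dict] = {}

    def get(self, name: str, load: Callable[[str], bytes]) -> bytes:
        entry = self.entries.get(name)
        if entry is not None:
            entry["hits"] += 1
            return entry["data"]
        data = load(name)           # cache miss: one backend (HDFS) read
        self._insert(name, data)
        return data

    def _efficiency(self, entry: dict) -> float:
        return entry["hits"] / self.buddy.allocated[entry["offset"]]

    def _insert(self, name: str, data: bytes) -> None:
        offset = self.buddy.alloc(len(data))
        if offset is None:
            # Evict segments below the threshold, then retry once.
            victims = [n for n, e in self.entries.items()
                       if self._efficiency(e) < self.threshold]
            for n in victims:
                self.buddy.release(self.entries.pop(n)["offset"])
            offset = self.buddy.alloc(len(data))
        if offset is not None:
            self.entries[name] = {"offset": offset, "data": data, "hits": 0}
```

In this sketch a file that cannot be placed even after eviction is simply returned to the caller without being cached, which keeps frequently hit segments in place.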
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP333; TP391.3
Article ID: 2123688
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2123688.html