Design and Implementation of a Large-Scale Image Storage and Indexing System

Published: 2018-04-19 04:10

Topics: image storage + key-value pairs; source: Huazhong University of Science and Technology, 2013 master's thesis

[Abstract]: With the spread of digital devices, household image collections have become diverse in type and are growing explosively in volume, beyond what ordinary users can manage. This creates a need for an efficient storage and retrieval system for large-scale image files. To address it, a prototype system for large-scale image storage management and retrieval is designed and implemented. The system adopts a client/server (C/S) architecture, provides data upload and semantic extension, and employs an efficient retrieval mechanism with optimization techniques. Specifically, data upload uses the efficient and reliable File Transfer Protocol (FTP) to transfer user image files to the server for storage; semantic extension of images is performed on the client, where the added semantics are defined and saved as extended attributes; and a key-value database based on a layered index structure is implemented in server memory. For a key-value insertion, the key is first added to the first-layer Bloom filter to build the retrieval set; the key is then hashed to obtain the address of a second-layer balanced binary search tree (AVL tree); finally, the pair is inserted into that AVL tree. For a query, the query condition is first filtered by the first-layer Bloom filter, then hashed to obtain the address of the second-layer AVL tree, and finally looked up in that tree. The create, read, update, and delete interfaces of the in-memory key-value database are exposed to the client through remote calls. Finally, the in-memory index is synchronized to a log file on disk by combining append-only writes to the log with snapshots, which ensures the reliability of the in-memory index information. Experimental results show that the key-value-based in-memory layered index can write about 48,600 key-value pairs per second and read about 377,800 key-value pairs per second. For a directory containing 140,000 files, the average query time using the find command that ships with Linux is about 0.5 seconds. Assuming each file has 10 attributes, building the in-memory index for the resulting 1,400,000 key-value pairs takes 30.78 seconds; afterward, a query through the in-memory index takes about 30 microseconds, improving query performance by three orders of magnitude.
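[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333;TP391.3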
