Design and Implementation of Distributed Storage Optimization for College Entrance Examination (Gaokao) Data

Published: 2019-01-14 11:33

[Abstract]: In recent years, the rapid development of the information industry across all sectors has driven explosive growth in industry data, and the college entrance examination (gaokao) field is no exception. Each year's examination produces a massive amount of data, and how to store it quickly and efficiently is an important research question. Faced with TB-scale or even PB-scale data, traditional relational databases increasingly struggle to store it. The emergence of large-scale data has given rise to a number of storage technologies. Among them, Google's GFS and Apache's HDFS are two typical distributed storage technologies for big data, with HDFS currently the more popular of the two. HDFS allows enterprises to store massive data in a distributed manner on clusters built from many inexpensive machines. However, its storage model, in which a single master node controls multiple slave data nodes, is prone to a master-node bottleneck. For the gaokao data studied in this thesis, if HDFS alone is used for storage, then when a large number of candidates query their scores online at the same time, requests from many different clients all flood into the HDFS master node, which places a very heavy load on it. To address this problem, this thesis studies and analyzes HDFS distributed storage technology in depth and proposes an HDFS+MongoDB distributed storage scheme to relieve the HDFS master-node bottleneck, so that the distributed storage of gaokao data is better optimized and candidates can query their scores more efficiently.
Based on the above analysis, the main research work of this thesis is as follows: (1) The background and significance of the topic are established, and the current state of development of the technologies used in the thesis is analyzed: distributed storage technology, gaokao informatization technology, and the Spark big data platform. (2) The master-node bottleneck that arises when gaokao data is stored with HDFS is analyzed, and an optimized HDFS+MongoDB scheme for distributed storage of gaokao data is proposed. (3) Following the specific requirements of the admissions and examination institute, a detailed requirements analysis is carried out for a gaokao score query system that uses the optimized storage scheme, covering the user perspective, functional requirements, and non-functional requirements. Based on this analysis, a detailed design is given in four areas: overall system structure, system functions, the system database, and HDFS+MongoDB distributed storage. (4) A concrete implementation is given based on the detailed design. The system's functions are tested with black-box testing, and its performance is tested in terms of response time, throughput, and concurrency; the results preliminarily meet the expected goals. Finally, the main research content of the thesis is summarized and directions for future work are identified.
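The master-node bottleneck described in the abstract can be illustrated with a small toy model. The sketch below is not the thesis's implementation and does not call HDFS or MongoDB. It only assumes that, in the single-master design, every score query first touches one master, whereas in a key-sharded store (one possible way a MongoDB deployment could be organized) each query is routed by a hash of the candidate's exam ID. The names `shard_for` and `simulate`, the ID format, and the shard count are all hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(exam_id: str, num_shards: int) -> int:
    """Map an exam ID to a shard index using a stable hash (hypothetical routing rule)."""
    digest = hashlib.md5(exam_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def simulate(exam_ids, num_shards):
    """Count how many requests each node receives under the two routing models."""
    master_load = Counter()  # single-master model: every request hits one node
    shard_load = Counter()   # sharded model: requests are spread by hashed key
    for exam_id in exam_ids:
        master_load["master"] += 1
        shard_load[shard_for(exam_id, num_shards)] += 1
    return master_load, shard_load


# Hypothetical workload: 10,000 candidates each issue one score query.
exam_ids = [f"2017{n:08d}" for n in range(10000)]
master_load, shard_load = simulate(exam_ids, num_shards=4)

print("busiest node, single master:", max(master_load.values()))
print("busiest node, 4 shards:", max(shard_load.values()))
```

In this model the single master receives every request, while the busiest shard receives only a fraction of them. Real systems differ in many ways (caching, replication, metadata versus data traffic), so the numbers only show the direction of the effect.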
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP333
