Research on a Storage and Retrieval System for Intelligent-Transportation Big Data at the Hundred-Billion-Record Scale

Published: 2018-03-08 04:29

Topic: intelligent transportation　Focus: big data　Source: Hangzhou Dianzi University, 2017 master's thesis　Document type: degree thesis

[Abstract]: As China's urbanization expands and urban residents' incomes keep rising, the number of private cars keeps growing, bringing a series of traffic problems with it. To make urban traffic easier to manage, intelligent transportation systems have emerged. By introducing modern technology and adapting to each city's specific needs, they collect and process traffic information in real time, assess current traffic conditions, and regulate traffic accordingly, which matters greatly for keeping urban traffic efficient and sustainable. Data storage and retrieval is one of the core components of an intelligent transportation system. In real deployments at public security bureau sites, road surveillance generates massive amounts of data every day: Zhejiang Province alone produces several hundred million vehicle-passage records per day, and the data arrive at random. Traditional relational databases, constrained by strict table schemas, cannot handle reads and writes at this scale. Moreover, once a single table reaches a certain size, its index itself becomes too large, so the database's retrieval function cannot meet retrieval requirements and can easily bring the whole system down. This thesis studies these problems in depth and designs a storage and retrieval system for intelligent-transportation big data at the hundred-billion-record scale. The system adopts a distributed cluster scheme: built on the distributed framework Hadoop, the cluster is designed as a master-slave architecture, with Zookeeper for consistency management and Yarn for resource management and allocation. To keep the cluster stable, load balancing and high availability are implemented with a virtual IP and Zookeeper; these mechanisms handle highly concurrent connections and single points of failure and keep the externally visible address consistent. To address the difficulty of storing and retrieving massive data, the search engine Solr and the non-relational database HBase are introduced to implement the storage and retrieval scheme. Because highly concurrent incoming data can easily make Solr unstable, a real-time buffering and consumption strategy for high-concurrency data is designed on top of Kafka and Spark Streaming. To address the high latency of retrieval over massive data, the thesis designs what the author calls the Solr Core-partitioning algorithm and the time-compression algorithm, which together achieve second-level retrieval over one hundred billion records, and adds a page-turning cache to improve paging on the client. Finally, the system is tested. The results show that it runs stably and stores massive data of multiple types efficiently. With one hundred billion vehicle-passage records stored in the database, retrieval over this TB-scale data under a variety of query conditions responds within 1 s.
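The abstract names a Solr Core-partitioning algorithm but does not describe how it works. As a rough illustration only, the sketch below assumes one plausible reading: records are split into one core per calendar day, so a time-bounded query visits only the cores that overlap its range. The core naming scheme (`passrec_YYYYMMDD`) and both helper functions are hypothetical and are not taken from the thesis; no Solr API is called.

```python
from datetime import datetime, timedelta
from typing import List

# Hypothetical naming scheme: one core per calendar day.
CORE_PREFIX = "passrec_"


def core_for(ts: datetime) -> str:
    """Name of the (assumed) daily core that stores a record captured at ts."""
    return CORE_PREFIX + ts.strftime("%Y%m%d")


def cores_for_range(start: datetime, end: datetime) -> List[str]:
    """Daily cores a query over [start, end] has to visit, newest day first."""
    if end < start:
        raise ValueError("end must not precede start")
    names = []
    day = end.date()
    # Walk backwards one day at a time until the start day is covered.
    while day >= start.date():
        names.append(CORE_PREFIX + day.strftime("%Y%m%d"))
        day -= timedelta(days=1)
    return names


if __name__ == "__main__":
    # A two-hour window straddling midnight touches only two daily cores.
    print(cores_for_range(datetime(2017, 2, 28, 23, 0), datetime(2017, 3, 1, 1, 0)))
```

Under this assumption, the cost of a query grows with the number of days in its time range rather than with the total number of stored records, which is one way a narrow time filter could keep latency low at this scale.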
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: U495;TP311.13

[References]