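Hadoop-Based Quality Analysis and Processing of Taxi Data
Published: 2018-01-22 13:54
Keywords: Hadoop; data quality; data cleaning; stopping points. Source: Wuhan University of Technology, master's thesis, 2015. Document type: degree thesis.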
[Abstract]: Through its Intelligent Transportation System (ITS) program, Shenzhen has built a public intelligent-transportation information platform. The platform collects massive volumes of traffic data every day, and these data carry rich traffic information. High-quality traffic data is a precondition for ITS to make correct decisions. In practice, however, equipment failures, environmental interference, operator error and other factors mean that the raw data inevitably suffer from quality problems such as missing and redundant records. Driven by project requirements, this thesis uses a cloud computing platform built on Hadoop to analyze the quality of Shenzhen's massive taxi data and then processes the data with data quality as the goal. The main work is as follows:
(1) The achievements and shortcomings of existing work, in China and abroad, on data quality assessment and data cleaning are reviewed, and the research content of this thesis is derived from that review.
(2) Following the project requirements, an evaluation system is designed that combines the analytic hierarchy process (AHP) from decision science with historical data. AHP is used to compute the weights of the evaluation indicators, and a data quality score is obtained using the expectation of the historical data as the baseline. This quantifies data quality problems and shows the state of data quality directly.
(3) Based on the characteristics of Shenzhen taxi data, quality evaluation schemes are proposed for GPS data and for operation data. The main factors affecting data quality are identified first and evaluation indicators are defined for each. Then, for the redundant, incomplete and erroneous data present in the datasets, corresponding evaluation rule algorithms are proposed to judge whether records satisfy the conditions.
(4) Data quality is improved on the basis of the quality analysis results for Shenzhen taxi data. The thesis focuses on duplicate data cleaning and proposes a MapReduce-based block-wise deduplication algorithm to delete duplicate records. Hadoop-based cleaning schemes are then proposed separately for GPS data and for operation data. The schemes target incomplete, redundant and erroneous data, and migrate traditional cleaning techniques to the cloud platform.
(5) The cleaned, high-quality GPS data is applied to the study of taxi stopping points. A DBSCAN-based stopping point detection algorithm is proposed that finds taxi stopping points in the trajectory data of taxis not carrying passengers. The algorithm has three steps: candidate point acquisition, candidate point filtering, and clustering of the candidate points. Candidate points are obtained with a candidate point detection algorithm and then filtered by their temporal and spatial attributes. After the strengths and weaknesses of various clustering algorithms are compared, DBSCAN is selected to cluster the stopping points.
The evaluation system is used to assess the quality of the taxi GPS data and operation data, yielding a data quality score for each of the two datasets. The scores directly reflect how good the data is and provide a basis for the subsequent cleaning tasks. The cleaning schemes designed from the evaluation results effectively improve data quality and support ITS in making correct decisions. The study of taxi stopping points on the cleaned data helps city managers better understand the situation of taxi drivers and also offers guidance to drivers looking for passengers.
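The abstract does not give the details of the MapReduce-based block deduplication in item (4), so the following is only a minimal single-process sketch of the general idea, not the thesis's algorithm. It assumes that records are blocked by vehicle ID and that two records are duplicates when all of their fields are identical. The record layout and the names `GpsRecord`, `map_phase`, `shuffle`, `reduce_phase` and `deduplicate` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class GpsRecord(NamedTuple):
    """Hypothetical GPS record layout; the real schema is not given in the abstract."""
    vehicle_id: str
    timestamp: str
    lon: float
    lat: float


def map_phase(records: Iterable[GpsRecord]) -> Iterator[Tuple[str, GpsRecord]]:
    """Map step: emit (block key, record). Blocking by vehicle ID is an assumption."""
    for rec in records:
        yield rec.vehicle_id, rec


def shuffle(pairs: Iterable[Tuple[str, GpsRecord]]) -> Dict[str, List[GpsRecord]]:
    """Stand-in for the framework's shuffle: group records by block key."""
    blocks: Dict[str, List[GpsRecord]] = defaultdict(list)
    for key, rec in pairs:
        blocks[key].append(rec)
    return blocks


def reduce_phase(block: Iterable[GpsRecord]) -> Iterator[GpsRecord]:
    """Reduce step: keep the first occurrence of each identical record in a block."""
    seen = set()
    for rec in block:
        if rec not in seen:
            seen.add(rec)
            yield rec


def deduplicate(records: Iterable[GpsRecord]) -> List[GpsRecord]:
    """Run map, shuffle and reduce in one process; blocks are visited in key order."""
    blocks = shuffle(map_phase(records))
    result: List[GpsRecord] = []
    for key in sorted(blocks):
        result.extend(reduce_phase(blocks[key]))
    return result
```

Because duplicates can only occur inside one block, each reducer compares records only within its own block, which is what lets the work be spread across nodes. On a real Hadoop cluster the grouping done by `shuffle` would be performed by the framework itself.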
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: U495
Article ID: 1454849
Article link: https://www.wllwen.com/kejilunwen/daoluqiaoliang/1454849.html