Research on Active-Learning-Based Cleaning Techniques for Semi-Structured Data

Published: 2018-09-11 17:04

[Abstract]: The rapid development of the Internet has produced massive amounts of data, which can be divided by structure into structured data, semi-structured data, and raw text. Structured data has a complete logical structure and descriptive information, so it can be widely used. Raw text contains comparatively little usable information and requires complex computation before it can be exploited. Semi-structured data lies between the two and is extremely widespread on the Internet: it has some structure, but that structure varies greatly, because the distinguishing markers between data items are complex and changeable and usually cannot be described in a fixed form. How to parse semi-structured data has therefore attracted attention. This thesis studies the cleaning of massive semi-structured data: identifying the valuable information in it, normalizing the data, and resolving the attribute of each field, so as to produce two-dimensional structured data with attribute annotations, which greatly facilitates subsequent analysis. To this end, the thesis proposes three methods: (1) a double-buffer-based parallel parsing method for multiple file types, which uses a double-buffered message queue and a thread pool to overcome the slow speed of serial parsing and to resolve the task backlog that arises in parallel parsing when different formats are parsed at different speeds; (2) a regular-expression-based attribute set identification method, which uses regular expressions to identify the attributes of fields in the data and derives the complete attribute set from attribute positions and the overall structure of the data; on this basis, a data normalization algorithm based on row and column statistics counts the number and positions of attributes, compares the statistics with the complete attribute set, and determines the column to which each field belongs, yielding structured data with attribute annotations; (3) an active-learning-based method for improving attribute identification accuracy: structured data with annotated attributes serves as the training set, the C4.5 algorithm is used to build a classification model, and an active-learning-based classifier optimization method further improves the model's attribute identification accuracy. The thesis proposes an uncertainty sampling algorithm based on a voting mechanism that selects the samples with the greatest influence on classifier accuracy, passes them to a human annotator for labeling, and updates the classification model. The result is an efficient, accurate, and highly available data cleaning method that raises the cleaning success rate on known data to above 95%.
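The abstract does not spell out how the voting-based uncertainty sampling works, so the following is only a minimal sketch of one common reading: several classifiers vote on each unlabeled sample, and the samples whose votes disagree most (highest vote entropy) are picked for labeling. The function names `vote_entropy` and `select_queries`, the three rule-based committee members, and the sample fields are all hypothetical stand-ins. In particular, the rules take the place of the C4.5 models the thesis uses, which are not reproduced here.

```python
import math
import re
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (in bits) of the label distribution among the votes."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_queries(pool, committee, k):
    """Return the k pool samples on which the committee disagrees the most."""
    scored = []
    for sample in pool:
        votes = [member(sample) for member in committee]
        scored.append((vote_entropy(votes), sample))
    # Highest entropy first; the sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in scored[:k]]


# Hypothetical committee: simple rules standing in for trained classifiers.
def by_at_sign(field):
    return "email" if "@" in field else "other"


def by_digits(field):
    return "phone" if re.fullmatch(r"\d{11}", field) else "other"


def by_length(field):
    return "phone" if len(field) == 11 else "other"


committee = [by_at_sign, by_digits, by_length]
pool = ["hello", "13800000000", "1380000@abc"]

# The ambiguous field gets three different votes, so it is queried first.
queries = select_queries(pool, committee, 1)
```

In an active learning loop, the selected samples would be labeled by a human annotator, added to the training set, and the classifiers retrained before the next round of selection.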
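[Degree-granting institution]: Harbin Institute of Technology (哈尔滨工业大学)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13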
