Research on Learning-Based Entity Resolution Methods in Data Warehouses

Published: 2018-05-31 13:27

Topic: data warehouse + data quality; Source: Kunming University of Science and Technology, 2017 master's thesis

[Abstract]: Entity resolution is a redundancy-identification technique for data quality management in data warehouses. As data volumes grow, the low efficiency and insufficient accuracy of traditional entity resolution methods have become increasingly apparent. This thesis reviews the theory of data warehouses and data quality, related research in China and abroad, and the main approaches to entity resolution. It then studies in depth the algorithmic principles, basic models, module design, and evaluation criteria of entity resolution for massive data. With the goals of improving recognition accuracy and reducing computation time, a learning-based parallel entity resolution algorithm is developed for the data sources of a tobacco group's data center and verified by simulation. The main contributions are as follows. (1) The threshold for the Canopy sets is determined from the similarity of key attributes in the tuples, and Canopy clustering is used to partition the massive set of entities into preliminary blocks. The tuples form overlapping subsets, which increases the fault tolerance of the algorithm. (2) For the candidate pairs of similar entities produced by blocking, a word-feature similarity measure is introduced that combines position encoding with the TF-IDF algorithm. Position encoding handles issues such as word abbreviations well; TF-IDF is insensitive to the order of character positions and assigns weights to words in attribute strings that can discriminate between categories. The strengths of the two are combined to extract a feature vector for each tuple pair. (3) To capture the nonlinear mapping between attribute similarity and tuple similarity, a neural network is used, exploiting its ability to approximate nonlinear functions to arbitrary precision. By learning the intrinsic relationships among attributes, the network dynamically adjusts its weights, thresholds, and other parameters to decide whether two tuples match. Because neural network training converges slowly and easily falls into local optima, the ant colony algorithm is used to optimize it. This overcomes a shortcoming of traditional entity matching, which decides whether a tuple pair refers to the same entity by checking whether the weighted sum of attribute similarities exceeds a manually chosen threshold. (4) Parallel processing of massive entity resolution is implemented on the Hadoop infrastructure. The method and framework are evaluated in simulation experiments on the data center's supplier data. A comparison with traditional entity resolution methods in terms of precision, recall, and F1 score shows that the learning-based algorithm achieves high recognition accuracy, and that recognition efficiency improves considerably as the number of nodes increases.
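The Canopy blocking step in point (1) of the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the token-set Jaccard similarity, the threshold values, the deterministic choice of centers, and the supplier names are all assumptions made for illustration, and the function names `jaccard`, `canopy_blocks`, and `candidate_pairs` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity of two key-attribute strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def canopy_blocks(records, loose, tight, sim=jaccard):
    """Group records into overlapping canopies.

    loose and tight are similarity thresholds with loose <= tight.
    A record joins a canopy if its similarity to the center is >= loose;
    it stops being a candidate center once that similarity is >= tight.
    """
    if loose > tight:
        raise ValueError("loose must not exceed tight")
    remaining = list(range(len(records)))
    canopies = []
    while remaining:
        # Canopy clustering usually picks centers at random; taking the
        # first remaining record keeps this sketch deterministic.
        center = remaining[0]
        sims = [sim(records[center], r) for r in records]
        canopies.append(
            [i for i, s in enumerate(sims) if i == center or s >= loose]
        )
        remaining = [i for i in remaining if i != center and sims[i] < tight]
    return canopies


def candidate_pairs(canopies):
    """Only pairs that share a canopy are passed on to detailed matching."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(canopy, 2))
    return pairs


# Made-up supplier names, for illustration only.
names = [
    "Yunnan Hongta Tobacco Group",
    "Hongta Tobacco Group Co",
    "Kunming Paper Supply",
    "Kunming Paper Supply Ltd",
    "Kunming Tobacco Group Paper Supply",
]
blocks = canopy_blocks(names, loose=0.25, tight=0.5)
pairs = candidate_pairs(blocks)
```

With these illustrative thresholds the last record falls into both canopies, which is the overlap the abstract credits with fault tolerance, and only 6 of the 10 possible pairs remain for the more expensive similarity computation.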
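[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13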
