Research on String Similarity Join Methods Based on the Hadoop Platform

Published: 2018-03-07 06:02

Topic: string similarity join　Focus: Hadoop　Source: Donghua University, 2017 master's thesis　Type: degree thesis

[Abstract]: With the widespread adoption and rapid development of Internet technologies such as e-commerce, social networks, and cloud computing, data volumes have grown sharply, and processing large-scale data has become a hot research topic. String similarity join is a basic data-processing operation with wide application in text retrieval, biological monitoring, information processing, pattern recognition, and data integration and cleaning. Several string similarity measures exist, including edit distance, Jaccard similarity, and cosine similarity; this thesis focuses on Jaccard similarity. String similarity join methods fall into two classes: traditional methods and methods based on distributed frameworks. Traditional methods include ALL-pairs, Ed-join, and Trie-tree; distributed-framework methods include MRSimJoin, MR_DSJ, and Fuzzy-Join. After studying and analyzing the traditional methods, the thesis finds that they are limited by the machine's memory, external storage, and CPU resources, which makes them unsuitable for similarity joins over large-scale data, whereas processing large-scale data with the Hadoop distributed framework is currently one of the main approaches. The thesis therefore studies how to process string similarity joins efficiently and in parallel on top of the Hadoop distributed framework. Its main contributions are: (1) It proposes a string similarity join model, SSJ-Model, which applies multiple filtering strategies and can perform similarity joins on strings incrementally. (2) It studies the operating principles of the Hadoop distributed framework and, using SSJ-Model, proposes Hmrdp-join, a Hadoop-based parallel string similarity join algorithm. (3) It optimizes Hmrdp-join in four ways: retaining some intermediate results of the MapReduce stages to avoid the time cost of copying data from disk; partitioning the data more effectively to balance the load between the map and reduce stages and avoid data skew; reusing existing information to avoid some repeated computation during the join; and adopting a grouping strategy to reduce multiple replication of strings. (4) Experiments on real data sets show that the optimized Hmrdp-join algorithm is more efficient.
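As a point of reference for the Jaccard measure and the idea of filtering candidate pairs before verification, the following is a minimal single-machine sketch in Python. It is not the thesis's SSJ-Model or Hmrdp-join: the whitespace tokenization, the function names, and the choice of a length filter are all assumptions made for illustration. The length filter relies on the fact that for token sets with |a| <= |b|, the Jaccard similarity cannot exceed |a| / |b|.

```python
from itertools import combinations


def tokenize(s: str) -> frozenset:
    """Split a string into a set of lowercase whitespace tokens (an assumed tokenization)."""
    return frozenset(s.lower().split())


def jaccard(a, b) -> float:
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def similarity_join(strings, threshold):
    """Return all pairs whose Jaccard similarity is at least `threshold`.

    Illustrative only: candidate pairs are enumerated naively, and a length
    filter discards pairs that cannot reach the threshold before the full
    similarity is computed.
    """
    # Sort by token-set size so the first element of each pair is the shorter one.
    records = sorted(((tokenize(s), s) for s in strings), key=lambda r: len(r[0]))
    results = []
    for (ta, sa), (tb, sb) in combinations(records, 2):
        # Length filter: Jaccard(ta, tb) <= len(ta) / len(tb) when len(ta) <= len(tb).
        if len(ta) < threshold * len(tb):
            continue
        sim = jaccard(ta, tb)
        if sim >= threshold:
            results.append((sa, sb, sim))
    return results


if __name__ == "__main__":
    data = ["apple banana cherry", "apple banana cherry date", "x y"]
    for left, right, sim in similarity_join(data, 0.7):
        print(left, "|", right, "|", sim)
```

The naive enumeration of all pairs is quadratic in the number of strings, which is the kind of cost that motivates distributing the join across a framework such as Hadoop.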

[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP301.6

[Related literature]