Design and Implementation of a Distributed System for Dynamic Detection and Deduplication of E-commerce Data

Published: 2018-05-12 04:31

Topics: e-commerce data + dynamic detection; Source: Beijing University of Posts and Telecommunications, 2016 master's thesis

[Abstract]: With the growth of the online population and the rapid development of e-commerce, e-commerce websites keep getting larger and the data they host is growing explosively. Because online shopping has become part of daily life, this data is now an important subject for researchers studying everyday economic activity, so collecting it efficiently matters. E-commerce sites, however, also contain a large amount of redundant data, which seriously degrades both the time efficiency of collection and the accuracy of the collected data. To keep dynamic crawling efficient, the data must therefore be checked dynamically during the crawl. Many deduplication algorithms exist, but they are general-purpose and do not exploit the characteristics of e-commerce data. This thesis first surveys and summarizes the characteristics of the major e-commerce websites in China, then uses them to propose a Bloom filter based on URL features and a web page deduplication algorithm based on URL fingerprints, and finally uses the new algorithms to design and implement a distributed e-commerce data deduplication system.

(1) A Bloom filter algorithm based on URL features. Motivated by the efficiency demands of real-time page analysis on e-commerce sites, this chapter analyzes the principle of deduplication with a traditional Bloom filter, points out that it ignores redundant information in URLs, and proposes an improved URL deduplication method: Bloom filtering based on URL feature extraction. The method first defines URL features; quantifies and extracts them with a correspondingly improved algorithm; trains URL filtering rules from the features; and finally strips the redundant information from each URL according to the rules before running the Bloom filter check. Experiments on more than 2 million records show that the improved Bloom filter is considerably more time-efficient and that the gain grows with data volume, which indicates that the method is effective and meets application needs.

(2) A web page deduplication algorithm based on URL fingerprints. Analysis of e-commerce sites shows that when several URLs map to the same page, those URLs are highly similar to one another. Analysis of traditional page deduplication algorithms shows that they must download a page before checking it, so they cannot improve collection efficiency. Based on these two observations, the thesis proposes a page deduplication algorithm based on URL fingerprints: it extracts and quantifies URL attributes, trains a URL fingerprint from them, and then judges how similar a URL is to other URLs by comparing fingerprints. Experiments on 2.2 million records show that the algorithm improves deduplication time efficiency by 11% while keeping the error rate low (1%), and that the effect becomes more pronounced as data volume grows.

(3) Design and implementation of a topic-based distributed deduplication system. After analyzing the principle and shortcomings of the traditional Bloom filter, this chapter designs a topic-based distributed deduplication system. To make the system efficient, reliable, and maintainable, it uses the deduplication methods studied in Chapters 3 and 4 and implements the system with the ZooKeeper and Thrift frameworks. Analysis shows that the topic-based distributed deduplication system is maintainable and reliable, and more time-efficient than a traditional distributed deduplication system.
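
The URL-feature-based Bloom filter described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The thesis trains its URL filtering rules from extracted features, whereas this sketch assumes a small hand-written set of redundant query parameters (`REDUNDANT_PARAMS`, hypothetical). The helper names `canonicalize` and `seen_before`, the example URLs, and the filter sizes are likewise assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rule set: query parameters assumed not to change page content.
# The thesis learns such rules from URL features; this fixed list is a stand-in.
REDUNDANT_PARAMS = {"spm", "ref", "utm_source", "utm_medium", "from"}


def canonicalize(url: str) -> str:
    """Drop assumed-redundant parameters and the fragment; sort what remains."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in REDUNDANT_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


class BloomFilter:
    """Plain Bloom filter; bit positions come from double hashing over SHA-256."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def seen_before(bloom: BloomFilter, url: str) -> bool:
    """Return True if the canonical URL was probably seen; otherwise record it."""
    key = canonicalize(url)
    if key in bloom:
        return True
    bloom.add(key)
    return False
```

With these assumptions, `https://shop.example.com/item?id=42&spm=a1.b2` and `https://shop.example.com/item?spm=zz&id=42#reviews` reduce to the same key, so the second is reported as a duplicate without being downloaded. As with any Bloom filter, a positive answer can be a false positive, while a negative answer is always correct.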
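[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP393.092;TP311.52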

[References]