Research on Network Data Collection Algorithms for Copyright Services

Published: 2019-04-24 08:19

[Abstract]: With the rapid development of the Internet, fast and cheap network transmission makes it easy for digital works to spread online, which poses unprecedented challenges for digital rights management. Unauthorized reposting or hotlinking of digital works on the Internet seriously harms the rights and interests of their rights holders. Detecting such unauthorized works effectively is an important part of network monitoring in copyright protection. General-purpose search engines cover a wide search range, collect data at a very large scale, and often return duplicate results, so research on network data collection algorithms for copyright services has practical significance. The thesis first introduces the components and working principles of general-purpose search engines and describes the key technologies of vertical search engines, such as web crawlers and information extraction. To address duplicate links in search, it discusses in detail the URL de-duplication strategies and crawl strategies of web crawlers: URL de-duplication with an in-memory hash algorithm, URL de-duplication with the embedded database Berkeley DB, and search strategies based on content analysis and on URL link analysis. It compares and analyzes the advantages and disadvantages of these algorithms. On this basis, the thesis designs a new URL de-duplication algorithm that combines the low memory consumption and high speed of the Bloom Filter with the stable performance of Berkeley DB, and that takes into account the relatively stable presentation format of digital music works and the shallow depth of the web pages hosting them. Depending on the requirements, de-duplication is performed with either the Bloom Filter or Berkeley DB, and each URL is reduced to its MD5 digest before being stored in and read from the embedded database, which further reduces storage space. To address the "myopia problem" of content-based evaluation algorithms and the "topic drift" of link-based evaluation algorithms, the thesis combines the strengths of the Shark Search and HITS algorithms, takes into account the mutually reinforcing relationship between content topics and links, and proposes a new topic-focused crawling strategy. Using the open-source Heritrix framework as a basis, the thesis designs a vertical search engine and experimentally analyzes the proposed URL de-duplication algorithm and search strategy. Its contributions are a new URL de-duplication algorithm and a search strategy that combines content and link evaluation, together with tests and analysis of the algorithms' efficiency.
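The abstract does not say how the Bloom Filter and the Berkeley DB store are wired together, so the following Python sketch shows only one plausible arrangement and should not be read as the thesis's actual design. It assumes the Bloom Filter acts as a fast in-memory pre-check and that an exact store of 16-byte MD5 digests confirms possible duplicates. The class names `BloomFilter` and `UrlDeduplicator`, the filter size, and the number of hash functions are hypothetical, and a plain Python set stands in for the embedded database.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; bit positions are derived from an MD5 digest."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Both sizes are illustrative assumptions, not values from the thesis.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, digest):
        # Split the 16-byte digest into two 64-bit integers and combine them
        # (double hashing) to obtain num_hashes bit positions.
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, digest):
        for pos in self._positions(digest):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, digest):
        # False means "definitely not seen"; True means "possibly seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(digest)
        )


class UrlDeduplicator:
    """Bloom filter pre-check backed by an exact store of MD5 digests."""

    def __init__(self):
        self.bloom = BloomFilter()
        # Stand-in for the embedded database (assumption): a set of digests.
        self.store = set()

    def is_new(self, url):
        """Return True and record the URL if it has not been seen before."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        # Consult the exact store only when the Bloom filter reports a
        # possible hit, so most new URLs never touch the slower store.
        if self.bloom.might_contain(digest) and digest in self.store:
            return False
        self.bloom.add(digest)
        self.store.add(digest)
        return True
```

Storing the fixed-length digest instead of the full URL is what keeps each record small. In a real crawler the set would be replaced by a persistent key-value store.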
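[Degree-granting institution]: North China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3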
