Research and Implementation of Web Page Collection Technology Based on Meta-Search Engines

Published: 2018-10-19 14:35

[Abstract]: With the rapid development of the Internet, online information has grown explosively, and government departments, enterprises, and public institutions that are sensitive to Internet information can no longer rely on manual monitoring alone to track trends on the Internet. To help users monitor and analyze online information in real time, a large number of Internet information processing platforms have emerged in recent years. Using high-performance computers, these platforms collect online information in a timely, accurate, and comprehensive manner and provide users with valuable analysis results. However, existing web page collection techniques still fall short in the timeliness, comprehensiveness, and effective rate of the collected data. They are also complex to design and hard to maintain, and they consume considerable manpower and material resources. To overcome these shortcomings, this thesis transfers meta-search technology to Internet information collection systems and proposes a web page collection technique based on meta-search engines, called collection-oriented meta-search. Experimental results show that, compared with existing web page collection techniques, the new technique can ensure the timeliness, comprehensiveness, and effective rate of the collected data. The main work of this thesis is as follows:
1) Traditional web page collection techniques are studied and analyzed in detail, the strengths and weaknesses of various web crawlers in meeting the collection needs of Internet information processing platforms are described, and a web page collection technique based on meta-search engines is proposed.
2) To address the problem that existing meta-search engines collect too few pages when used as a collection module, a query expansion technique based on local co-occurrence statistics (LCOOCS) is proposed, which obtains more relevant web pages by issuing more queries.
3) LCOOCS needs to perform text analysis on the initial retrieval results, whereas the results collected by a meta-search engine are all HTML page source code. To bridge this gap, a fully automatic main-text extraction algorithm, TextEx, is designed and implemented.
4) A collection-oriented meta-search system is designed and implemented. The query syntax and result-page structure of six major Internet search engines, including Baidu News (百度新闻) and Bing News (bing资讯), are summarized, and query submission and result downloading are automated.
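The abstract names query expansion based on local co-occurrence statistics but does not describe the procedure. As a rough illustration of the general idea only, and not the thesis's LCOOCS algorithm, the Python sketch below scores each term in the initial retrieval results by how often it shares a document with the query terms, then appends the top-scoring terms to the original query to form additional queries. The function names, the scoring rule, and the tokenized input format are all assumptions made for this example.

```python
from collections import Counter


def expand_query(query_terms, documents, top_k=3, stopwords=frozenset()):
    """Pick up to top_k expansion terms (hypothetical helper, not from the thesis).

    documents is a list of token lists taken from the initial retrieval
    results. A candidate term earns one point per query term for every
    document in which the two appear together.
    """
    query = set(query_terms)
    scores = Counter()
    for tokens in documents:
        vocab = set(tokens)
        hits = len(query & vocab)  # how many query terms this document contains
        if hits == 0:
            continue  # no co-occurrence with the query in this document
        for term in vocab - query - stopwords:
            scores[term] += hits
    # Highest score first; ties broken alphabetically so output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def expanded_queries(query_terms, documents, top_k=3):
    """Build one extra query string per expansion term (assumed strategy)."""
    base = " ".join(query_terms)
    return [f"{base} {term}" for term in expand_query(query_terms, documents, top_k)]


if __name__ == "__main__":
    # Toy initial results; a real system would tokenize the extracted page text.
    docs = [
        ["meta", "search", "engine", "crawler"],
        ["meta", "search", "engine", "query"],
        ["football", "score"],
    ]
    print(expanded_queries(["meta", "search"], docs, top_k=2))
```

Each expanded query would then be submitted to the underlying search engines as a separate request, which is one way to read the abstract's statement that more relevant pages are obtained by increasing the number of queries.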
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092;TP391.3

[References]