Research and Implementation of Web Page Collection Technology Based on Meta-Search Engines
[Abstract]: With the rapid development of the Internet and the explosive growth of online information, government departments and enterprises that are sensitive to Internet information can no longer rely on manual monitoring alone to follow online trends. To help users monitor and analyze online information in real time, a large number of Internet information processing platforms have emerged in recent years. Backed by high-performance computers, these platforms aim to collect online information in a timely, accurate and comprehensive manner and to provide users with valuable analysis results. However, existing web page collection techniques still fall short in the timeliness, comprehensiveness and efficiency of data collection, and they are complex to design and difficult to maintain, so they consume considerable manpower and material resources. To overcome these shortcomings, this thesis transfers meta-search technology to Internet information collection systems and proposes a web page collection technique based on meta-search engines, referred to as collection meta-search. Experimental results show that the new technique can ensure the timeliness, comprehensiveness and efficiency of data collection. The main work of this thesis is as follows: 1) Traditional web page collection techniques are studied and analyzed in detail, the strengths and weaknesses of various web crawlers in meeting the collection needs of Internet information processing platforms are set out, and on this basis the web page collection technique based on meta-search engines is proposed. 2) To address the problem that the collection scale is too small when an existing meta-search engine is used directly in the collection module, a query expansion technique based on local co-occurrence statistics (LCOOCS) is proposed, which obtains more relevant web pages by increasing the number of queries. 3) LCOOCS requires text analysis of the initial retrieval results, but the results collected by the meta-search engine are all HTML page source code, so an automatic main-text extraction algorithm, TextEx, is designed and implemented. 4) A collection meta-search system is designed and implemented. The query syntax and result-page structure of six Internet search engines, including Baidu News and Bing Information, are summarized and extracted, and query submission and result download are automated.
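The abstract does not give the details of LCOOCS, so the following is only a minimal sketch of the general idea of query expansion from local co-occurrence: terms that appear together with the query terms in the initial result documents are counted, and the most frequent ones are appended to the original query to form additional queries. The function names `expansion_terms` and `expanded_queries`, the document-level co-occurrence count, and the assumption that the documents are already tokenized are all illustrative choices, not the thesis's actual method.

```python
from collections import Counter


def expansion_terms(query_terms, top_docs, k=3, stopwords=frozenset()):
    """Return up to k terms that co-occur most often with the query terms.

    top_docs is a list of token lists taken from the initial results.
    Co-occurrence is counted once per document (an assumed simplification).
    """
    query = set(query_terms)
    scores = Counter()
    for doc in top_docs:
        tokens = set(doc)
        # Only documents containing at least one query term contribute.
        if not tokens & query:
            continue
        for term in tokens - query - stopwords:
            scores[term] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def expanded_queries(query_terms, top_docs, k=3):
    """Build one new query per expansion term by appending it to the original."""
    base = " ".join(query_terms)
    return [f"{base} {term}" for term in expansion_terms(query_terms, top_docs, k)]
```

For instance, with the query `["haze"]` and a handful of tokenized first-round results, `expanded_queries` would return strings such as the original query followed by one frequent co-occurring term each; every such string can then be submitted as a separate query, which is how the number of collected pages grows.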
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP393.092; TP391.3