Research on an RSS-Based Focused Web Crawler for University Website Groups
[Abstract]: With the rapid growth of the web and of the number of pages on it, people often have to read many pages to find the information they need, which wastes time and effort and still does not guarantee the latest or most complete information. Publishers of online information, for their part, want more users to see their content as soon as it appears. Much research addresses this need, including search engines backed by web crawlers and RSS-based information push, but each approach has its own limitations. Suppose we want the latest notices from all websites of a university, grouped by category, such as all of that university's research-related notices. A search engine gives unsatisfactory results for this task. RSS can push the latest information by category, but only for sites that offer an RSS feed, so it does not help with targets such as university website groups that were built early without RSS support. This thesis therefore studies an RSS-based focused web crawler to solve these problems and applies it to a university website group, with good results. The principle is to use a focused web crawler to fetch, analyze, and process the data of the target site group and then publish the results as RSS. Users can then subscribe through an RSS reader to the latest information from sites that have no feed of their own, which avoids both browsing large numbers of pages and missing information. The main contributions of this thesis are as follows:
(1) A new RSS-based focused web crawler is proposed. It lets users subscribe to and read, in an RSS reader, the latest information from websites that do not provide an RSS feed, and it filters out advertisements and other spam so that users do not have to hunt for the information themselves.
(2) Text is classified with the TF-IDF algorithm. The feature vectors of the different categories are extracted with a TF-IDF variant that is improved according to the characteristics of web pages, so that the extracted vectors represent their categories better and the classification results are more accurate.
(3) The crawler's incremental crawling is improved. Building on the traditional incremental crawling algorithm, a new algorithm for calculating the predicted update time is proposed. It brings the predicted time closer to the actual update time, which reduces system overhead and improves efficiency.
(4) The RSS-based focused web crawler is applied to a university website group, and the PageRank algorithm is modified according to the characteristics of such a group in order to raise the crawler's recall.
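The final step of the principle described in the abstract, publishing already-crawled and classified notices as an RSS feed, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `build_rss`, the item fields (`title`, `link`, `category`, `published`), and the `example.edu` URLs are all assumptions made for illustration, and crawling, spam filtering, and classification are assumed to have happened earlier.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(channel_title, channel_link, items):
    """Serialize crawled notices as an RSS 2.0 document (illustrative sketch).

    `items` is a list of dicts with the hypothetical keys
    'title', 'link', 'category', and 'published' (a timezone-aware datetime).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = f"Latest notices: {channel_title}"

    # Newest notices first, so readers see recent updates at the top.
    for item in sorted(items, key=lambda i: i["published"], reverse=True):
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "category").text = item["category"]
        # pubDate uses the RFC 822 style date format expected by RSS readers.
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])

    return ET.tostring(rss, encoding="unicode")


# Hypothetical crawl results for one category of a university site group.
crawled = [
    {
        "title": "Call for research project applications",
        "link": "http://example.edu/research/notice-1",
        "category": "research",
        "published": datetime(2012, 3, 1, 9, 0, tzinfo=timezone.utc),
    },
    {
        "title": "Research seminar schedule",
        "link": "http://example.edu/research/notice-2",
        "category": "research",
        "published": datetime(2012, 3, 5, 9, 0, tzinfo=timezone.utc),
    },
]

feed_xml = build_rss("Research notices", "http://example.edu/", crawled)
print(feed_xml)
```

One feed per category is a design assumption here; it mirrors the abstract's example of subscribing to all research-related notices of a university, but a single feed with `category` elements would serve the same purpose.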
[Degree-granting institution]: Nanchang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092
Article ID: 2329116
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2329116.html