Research on an RSS-Based Focused Web Crawler for University Website Groups

Published: 2018-11-13 12:32

[Abstract]: As the web grows rapidly and the number of pages keeps increasing, people often have to browse many pages to find the information they need. This costs time and effort and still does not guarantee the latest or most complete information. Publishers, for their part, want more users to read their information as soon as it is released. Much research has addressed this need, for example search engines backed by web crawlers and RSS information push, but each approach has its limitations. Suppose we want the latest notices from all websites of a university, grouped by category, such as all of its research-related notices. Searching with a search engine gives unsatisfactory results. RSS can push the latest information by category, but only for websites that provide an RSS feed; it cannot help with targets such as university website groups, which were built early on without RSS support. This thesis therefore studies an RSS-based focused web crawler to solve these problems and applies it to a university website group, with good results. The approach uses a focused web crawler to fetch, analyze, and process data from the target website group and then serves the results as RSS feeds. Users can thus subscribe by category, through an RSS reader, to the latest information even from websites that provide no RSS feed, which spares them from browsing large numbers of pages and from missing information through oversight. The main contributions of this thesis are:
(1) An RSS-based focused web crawler is proposed that lets users subscribe to and read, in an RSS reader, the latest information from websites that do not provide an RSS feed. It filters out advertisements and other junk content, sparing users the trouble of searching for information.
(2) Crawled web page text is classified using the TF-IDF algorithm, and the step that extracts per-category feature vectors with TF-IDF is improved to take the characteristics of web pages into account, so that the extracted feature vectors represent their categories better and classification is more accurate.
(3) Incremental crawling is improved: building on the traditional incremental crawling algorithm, a new algorithm for computing the predicted update time is proposed, which brings the prediction closer to the actual update time, reduces system overhead, and improves efficiency.
(4) The RSS-based focused web crawler is applied to a university website group, and the PageRank algorithm is improved according to the characteristics of such website groups to raise the crawler's recall.
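
The pipeline described in the abstract (crawl, classify by TF-IDF, publish as RSS) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: it uses plain smoothed TF-IDF with cosine similarity rather than the improved feature extraction the thesis proposes, whitespace tokenization, and invented names (`category_profiles`, `classify`, `build_rss`), sample texts, and `example.edu` links. Crawling itself is omitted; `pages` stands in for already-fetched notices.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text would need a word segmenter.
    return text.lower().split()


def category_profiles(labeled):
    """Build one TF-IDF feature vector per category from labeled sample texts."""
    merged = {cat: [tok for text in texts for tok in tokenize(text)]
              for cat, texts in labeled.items()}
    n = len(merged)
    # Document frequency: in how many categories each term appears.
    df = Counter(term for toks in merged.values() for term in set(toks))
    profiles = {}
    for cat, toks in merged.items():
        tf = Counter(toks)
        profiles[cat] = {
            # Smoothed IDF keeps weights positive when there are few categories.
            term: (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return profiles


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(text, profiles):
    """Return the category whose profile is most similar to the text."""
    tf = Counter(tokenize(text))
    return max(profiles, key=lambda cat: cosine(tf, profiles[cat]))


def build_rss(title, link, items):
    """Serialize items as a minimal RSS 2.0 document (title, link, description)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"Latest notices: {title}"
    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical training samples and crawled notices, for illustration only.
labeled = {
    "research": ["research grant project application notice",
                 "science fund project review"],
    "teaching": ["course exam schedule notice",
                 "teaching timetable course selection"],
}
pages = [
    {"title": "Call for research project applications",
     "link": "http://example.edu/a"},
    {"title": "Final exam schedule for course selection",
     "link": "http://example.edu/b"},
]

profiles = category_profiles(labeled)
feeds = {}
for page in pages:
    feeds.setdefault(classify(page["title"], profiles), []).append(page)

research_feed = build_rss("research", "http://example.edu/", feeds["research"])
```

Producing one feed per category is what would let a reader subscribe only to, say, research notices; a real deployment would also need the junk filtering, incremental re-crawling, and link ranking that the abstract mentions.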
[Degree-granting institution]: Nanchang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092
