Design and Implementation of a Text Crawler for a Personalized Recommendation System

Published: 2018-11-16 14:26

[Abstract]: Internet technology has developed rapidly in recent years, and retrieving information from the Internet has become an important way for people to find what they need. Information on the Internet is highly varied, spreads quickly, and exists in enormous volume. How to capture relevant information promptly and accurately under these conditions, in order to build the subject resource library for the personalized recommendation system in the education cloud, is an important problem that must be solved when establishing that library. To address it, this thesis combines the characteristics of the Internet with information extraction and web page processing techniques to design and implement the web crawler component of the personalized recommendation system, providing a crawling service with finer and more precise classification, more comprehensive and in-depth data, and more timely updates. The specific work is as follows:
1. The thesis reviews the current state of web crawler development, analyzes crawler architecture and implementation principles, and examines in depth how topic pages are distributed on the Web.
2. Search strategy. The thesis uses the URL (Uniform Resource Locator) string features, the anchor text, the parent page, and sibling URLs to compute and predict the topic relevance of each URL. URLs are then crawled in descending order of predicted relevance, so that pages highly relevant to the topic are downloaded first wherever possible.
3. Web page parsing. This stage covers encoding conversion, HTML (Hyper Text Markup Language) parsing, URL extraction, page denoising, and main-text extraction. The page encoding is obtained from the http-equiv attribute of the meta tag in the HTML head, and data downloaded from the Internet is read using that encoding. The main text is then extracted with a method that combines link analysis and statistics, which removes noise more effectively and improves the completeness of the extracted text. For most content-oriented pages the main text is extracted correctly.
4. Finally, a web crawler system is implemented on the basis of this design, and the results of running it are analyzed. The crawler presented in this thesis can be used in the personalized recommendation system of the education cloud. By acquiring, storing, analyzing, and recommending articles in a subject area, the system quickly recommends literature and related material of interest to users, thereby improving research efficiency.
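The search strategy in item 2 can be sketched as a best-first crawl frontier. The abstract names the four signals (URL string, anchor text, parent page, sibling URLs) but not how they are combined, so the weighted sum, the weights, and the names `keyword_overlap`, `predict_relevance`, and `Frontier` below are hypothetical choices made only for illustration, not the thesis's formula.

```python
import heapq
import itertools

# Hypothetical weights for the four signals named in the abstract.
# The thesis's actual formula is not given, so these are assumptions.
WEIGHTS = {"url": 0.2, "anchor": 0.4, "parent": 0.3, "sibling": 0.1}


def keyword_overlap(text, topic_terms):
    """Fraction of topic terms that occur in the text (case-insensitive)."""
    if not topic_terms:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for term in topic_terms if term.lower() in lowered)
    return hits / len(topic_terms)


def predict_relevance(url, anchor_text, parent_score, sibling_scores, topic_terms):
    """Weighted sum of the four signals, each expected to lie in [0, 1]."""
    sibling_avg = sum(sibling_scores) / len(sibling_scores) if sibling_scores else 0.0
    return (
        WEIGHTS["url"] * keyword_overlap(url, topic_terms)
        + WEIGHTS["anchor"] * keyword_overlap(anchor_text, topic_terms)
        + WEIGHTS["parent"] * parent_score
        + WEIGHTS["sibling"] * sibling_avg
    )


class Frontier:
    """Priority queue that returns the URL with the highest predicted score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: earlier pushes win

    def push(self, url, score):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In a crawl loop, each link extracted from a downloaded page would be scored with `predict_relevance` and pushed onto the frontier, and the next page to fetch would come from `pop()`. Substring matching against topic terms is a deliberately crude stand-in for whatever relevance model the thesis uses.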
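The encoding step in item 3 can be illustrated with a short sketch that reads the charset declared in a meta tag and decodes the downloaded bytes with it. The function names `detect_charset` and `decode_page`, the 4096-byte scan window, and the UTF-8 fallback are assumptions made for this example. The pattern also accepts the short `<meta charset=...>` form, which the abstract does not mention.

```python
import re

# Matches a charset declaration inside a <meta> tag, e.g.
#   <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
# and, as an added assumption, the short form <meta charset="utf-8">.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([a-zA-Z0-9_\-]+)""",
    re.IGNORECASE,
)


def detect_charset(raw, default="utf-8", head_bytes=4096):
    """Return the charset declared near the start of the page, or the default."""
    match = META_CHARSET.search(raw[:head_bytes])
    if not match:
        return default
    return match.group(1).decode("ascii").lower()


def decode_page(raw, default="utf-8"):
    """Decode raw page bytes using the declared charset, replacing bad bytes."""
    charset = detect_charset(raw, default)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown or misspelled charset label: fall back to the default.
        return raw.decode(default, errors="replace")
```

Scanning the raw bytes before decoding avoids a circular dependency: the declaration itself is ASCII, so it can be found without knowing the page encoding. A production crawler would usually also consult the HTTP `Content-Type` header, which this sketch leaves out.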
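[Degree-granting institution]: University of Chinese Academy of Sciences (School of Engineering Management and Information Technology)
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.09;TP391.3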

[References]