
Design and Implementation of a News Retrieval System for Specific Websites

Published: 2018-06-10 12:34

  Topics: news search + RSS; Source: master's thesis, South China University of Technology, 2013


[Abstract]: With the rapid development of the Internet, daily life depends on it more and more, and the explosive growth of online information poses a major challenge for search engines. People spend time every day browsing news websites to follow current events at home and abroad, but as the number of news portals grows, finding the news one cares about becomes harder. In many scenarios (public opinion monitoring, for example), users are interested only in news from a few specific websites, an option that general-purpose search engines do not offer. Such cases call for a news search system targeted at specific websites, one that collects, organizes, and serves the news its users are interested in.

This thesis aims to design and implement a news search system that is timely and accurate, user-configurable and customizable, and extensible. The system collects news from designated websites in real time and provides users with a personalized news search service. The thesis surveys the state of research on search engines and news search in China and abroad and, based on the main working principles of search engines, proposes a design for a news retrieval system targeted at specific websites. The system is implemented following the layered approach of MVC and is divided into a data acquisition layer, a business logic layer, and a presentation layer. It discovers the latest news reports through the RSS feeds of news websites, extracts the main text of web pages with the open-source Boilerpipe library, segments that text with the IK word segmenter, builds an inverted index over the pages, and finally offers users a personalized news search service. Based on the characteristics of news, the thesis also proposes a ranking algorithm for news search results built on four factors: relevance, freshness, news category, and source site.

The system is tested by collecting statistics on news acquisition, evaluating main-text extraction from news pages, and running functional tests on the Web service component of the news search system.
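The abstract names the four ranking factors but does not give the formula that combines them. The Python sketch below shows one plausible shape for such a ranker, as an illustration only: the weighted linear sum, the weight values, the exponential half-life decay used for freshness, and every name in it (`NewsHit`, `freshness`, `score`, `rank`, `category_prefs`, `source_prefs`) are assumptions introduced here, not the thesis's algorithm.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class NewsHit:
    """One search result. All fields are hypothetical, for illustration."""
    title: str
    relevance: float     # text-match score, assumed already normalized to [0, 1]
    published: datetime  # publication time, e.g. taken from the RSS item
    category: str
    source: str


def freshness(published, now, half_life_hours=24.0):
    """Exponential decay: 1.0 for a just-published story, 0.5 after one half-life."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


def score(hit, now, category_prefs, source_prefs, weights=(0.5, 0.3, 0.1, 0.1)):
    """Weighted sum of the four factors; the weights are arbitrary placeholders."""
    w_rel, w_fresh, w_cat, w_src = weights
    return (w_rel * hit.relevance
            + w_fresh * freshness(hit.published, now)
            + w_cat * category_prefs.get(hit.category, 0.0)
            + w_src * source_prefs.get(hit.source, 0.0))


def rank(hits, now, category_prefs, source_prefs):
    """Return hits sorted from highest to lowest combined score."""
    return sorted(hits,
                  key=lambda h: score(h, now, category_prefs, source_prefs),
                  reverse=True)


if __name__ == "__main__":
    now = datetime(2013, 5, 1, 12, 0)
    hits = [
        NewsHit("older exact match", 0.9, now - timedelta(hours=72), "sports", "site-a"),
        NewsHit("fresh and on-topic", 0.8, now - timedelta(hours=1), "tech", "site-b"),
    ]
    # Per-user preferences in [0, 1]; categories and sites not listed count as 0.
    for hit in rank(hits, now, {"tech": 1.0}, {"site-b": 1.0}):
        print(hit.title)
```

Keeping each factor in [0, 1] makes the weights directly comparable, and the per-user preference dictionaries are one simple way to express the personalization the abstract mentions. A real system would need to tune the weights and the half-life against actual usage data.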
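[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3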
