
Research on Web Information Retrieval Technology Based on Mobile Terminals

Published: 2018-04-04 01:26

  Topic: mobile Internet. Focus: information extraction. Source: Zhejiang Sci-Tech University, master's thesis, 2012


[Abstract]: With the rapid development of the mobile Internet, people have become accustomed to going online anytime and anywhere through mobile terminals such as phones. Web pages often contain large amounts of material unrelated to the content the reader cares about: navigation bars, advertisements, copyright notices and similar information. For mobile users, this material wastes time, because they are forced to browse through it, and wastes data traffic, because it has to be downloaded. It is therefore necessary to remove redundant information so that the content a page presents in response to a user's request is only what the user wants to see. For example, if a user only wants the definition of a term, the search engine should return just that definition. On this basis, and building on a study of web page purification techniques and the Lucene search engine, this thesis designs and develops a search system suited to retrieving topic text on mobile terminals such as phones.

First, the thesis gives an overview of the technologies the system relies on. It concentrates on techniques in the field of web page purification, including web page adaptation, web page segmentation and extraction of a page's topic information. It also describes in detail the technical and application characteristics of the Lucene development toolkit, chiefly its indexing and querying, and analyzes automatic summarization and regular expressions.

Then the thesis describes the system's two key modules. The first is the web page preprocessing module, which draws on the study of purification techniques and uses information extraction to obtain topic information. The second is the information retrieval module, which searches the topic information produced by the preprocessing module; it uses the Lucene search engine package, on top of an improved Chinese word segmentation method, to index and query that information.

Finally, the thesis presents the design of the whole system. The system implements three modules (web page collection, web page preprocessing and content service) and provides text information to users according to the keywords they enter. Experiments show that this approach both improves query accuracy and greatly reduces network traffic.
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
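
The two modules described in the abstract (page preprocessing that keeps only topic text, followed by indexing and querying of that text) can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses only the Python standard library rather than Lucene, it drops noise by tag name rather than by the thesis's information-extraction method, and it splits Chinese text into single characters rather than applying the improved word segmentation. The names `NOISE_TAGS`, `MainTextExtractor`, `extract_main_text`, `tokenize` and `InvertedIndex` are all hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Tags treated as page noise. This list is an assumption for the sketch,
# standing in for navigation bars, ads and copyright blocks.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise tags currently enclose us
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html):
    """Preprocessing step: return the page text with noise blocks removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercased Latin/digit words plus single CJK characters (unigrams)."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


class InvertedIndex:
    """Retrieval step: maps each term to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, html):
        text = extract_main_text(html)
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)
```

Because only the cleaned text is stored in `docs`, a client that asks for a document receives the topic text without the navigation or footer markup, which is the traffic saving the abstract refers to. A query term that appears only in a removed block does not match the page.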



