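Research and Implementation of Chinese Vertical Search Technology
Published: 2018-01-19 05:03
Keywords: search engine, vertical search, Nutch, Chinese word segmentation, text clustering Source: Hebei University of Science and Technology, 2012 master's thesis Document type: degree thesis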
[Abstract]: With the rapid development of the Internet, the number of Internet users in China keeps growing, online services are increasingly varied, the number of websites has risen sharply, and the information they hold expands daily. Faced with this vast body of information, how to retrieve satisfactory results accurately and efficiently, without wavering among too many choices and being swamped by the sea of information, has become the question users care about most. Vertical search engines emerged to meet this need, aiming to provide faster, higher-quality, and more specialized retrieval services. This thesis presents exploratory research on current hot topics in search engine technology, mainly covering: 1) the crawler's page-fetching process, the selection of the initial seed set, and the relationship between the number of threads opened at runtime and network resource overhead; 2) Chinese word segmentation methods, and the segmentation results of several currently popular segmenters, including ICTCLAS, JE, and paoding, when embedded in a vertical search engine; 3) the application of online web page clustering algorithms in Nutch, chiefly a comparison of how the Lingo and STC clustering algorithms in the open-source carrot2 perform; 4) search engine personalization, where the work mainly implements speech input, synonym conversion of queries, and the handling of heterogeneous documents. Vertical search is search over a focused set of resources related to a particular topic. Building on this study of key vertical search technologies, the thesis designs a vertical search engine system for campus life ("校园采风") at colleges and universities nationwide, using the Nutch framework. Tests of the system show that it achieves good precision.
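As a rough illustration of the dictionary-based segmentation discussed in item 2, the sketch below implements forward maximum matching, a classic baseline for Chinese word segmentation. It is not taken from the thesis and does not reproduce how ICTCLAS, JE, or paoding work internally; the function name `forward_max_match`, the `max_len` parameter, and the tiny `vocab` set are all assumptions made for this example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation (illustrative only).

    At each position, take the longest substring (up to max_len
    characters) found in vocab; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real segmenter would load a large lexicon.
vocab = {"垂直", "搜索", "搜索引擎", "引擎", "中文", "分词"}
print(forward_max_match("中文垂直搜索引擎", vocab))
```

Because the longest match wins, "搜索引擎" is kept as one token rather than split into "搜索" and "引擎"; whether that granularity suits a vertical search index is the kind of trade-off a segmenter comparison would need to examine.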
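[Degree-granting institution]: Hebei University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3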
Article ID: 1442724
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1442724.html