
Keyword Weight Calculation for Web Page Ranking

Published: 2018-11-01 16:11
[Abstract]: With the development of information technology and the growing popularity of the Internet, search engines have attracted wide attention. In recent years the most mainstream search engines have been those based on keyword retrieval. In such engines, the accuracy with which the weight of each word in a user's query is calculated directly affects the quality of the subsequent web page ranking, so calculating the weights of the words in the retrieval condition correctly is crucial.

This thesis seeks a method for calculating keyword weights in user queries that is oriented toward web page ranking, so that keyword-based search engines can reach a higher level in the ranking stage and lay a good foundation for subsequent retrieval processing. To that end, the work consists of three main parts:

Analysis of the characteristics of user queries. The relationship between the characteristics of 5,000 queries annotated with core words and the weights of their words is analyzed. The stop words found in the queries and the stop words in a modern Chinese corpus are analyzed, and the stop words in queries of different categories are analyzed with examples.

Keyword weight calculation for web page ranking. The user query log is segmented into words and tagged with parts of speech, and keyword extraction is treated as a classification task. Taking the characteristics of the queries into account, eight contextual features of each word are finally selected as the features for decision tree forest classification, and the calculation method of each feature is described. An error analysis of the experimental results is then carried out, and some rules are added to post-process the model's classification results.

Analysis of experimental results. The decision tree classification method is compared with traditional keyword extraction and weight calculation methods. About 1,000 queries are randomly sampled from the user query log for manual evaluation, and the precision and recall of the model are evaluated with cross-validation. The win rate of the model is compared with that of the traditional weight calculation method used in web page ranking. Finally, several queries are selected and searched on "Baidu" to determine how searching with the keyword sequence determined by the model, versus searching with the unprocessed query, affects the web page ranking. The experimental results show that the keyword extraction and weight calculation method adopted in this thesis is feasible for weight calculation in web page ranking.
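The evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions the thesis uses, so everything below is an assumption: words are labeled 1 (keyword) or 0 (not a keyword), the metric paired with recall is taken to be precision in the usual information-retrieval sense, and the win rate is taken to be the share of manually judged queries on which the model's result was preferred. The function names `precision_recall` and `win_rate` are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for per-word keyword labels (1 = keyword, 0 = not).

    Assumed definition; the thesis may compute these differently.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # correctly extracted keywords
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # words wrongly marked as keywords
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # keywords the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def win_rate(judgments: Sequence[str]) -> float:
    """Share of judged queries where the model beat the baseline.

    Each judgment is "model", "baseline", or "tie" (hypothetical labels);
    ties stay in the denominator, which is one possible convention.
    """
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if j == "model") / len(judgments)


if __name__ == "__main__":
    # Toy labels for the five words of one segmented query (made-up data).
    gold = [1, 0, 1, 1, 0]
    pred = [1, 1, 0, 1, 0]
    print(precision_recall(gold, pred))
    print(win_rate(["model", "model", "tie", "baseline"]))
```

In a cross-validated setup, these functions would be applied to the held-out fold of each split and the per-fold scores averaged. Whether ties count against the model in the win rate is a design choice that the abstract does not settle.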
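[Degree-granting institution]: Graduate School of Chinese Academy of Social Sciences
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3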

Article ID: 2304434

Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2304434.html