Keyword Weight Calculation for Web Page Ranking

Published: 2018-11-01 16:11

[Abstract]: With the development of information technology and the spread of the Internet, search engines have become increasingly important. The mainstream search engines of recent years are keyword-based, and in such engines the accuracy of the weights computed for the individual words of a user's query directly affects the quality of the subsequent web page ranking, so computing the weights of the words in the retrieval condition correctly is essential. This thesis seeks a method for computing the keyword weights of user queries for web page ranking, so that keyword-based search engines perform better at the ranking stage and provide a sound basis for subsequent retrieval processing. The work consists of three parts. (1) Analysis of the characteristics of user queries. Using 5,000 queries annotated with core words, the thesis analyzes the relationship between query characteristics and word weights, analyzes the stop words found in queries alongside those in a modern Chinese corpus, and analyzes stop words in queries of different categories, with examples. (2) Keyword weight calculation for web page ranking. The user query log is word-segmented and part-of-speech tagged, and keyword extraction is treated as a classification task. Based on the characteristics of queries, eight contextual features of each word are selected as the features for decision tree forest classification, and the calculation of each feature is described. An error analysis of the experimental results is then carried out, and rules are added to post-process the classifier's output. (3) Analysis of experimental results. The decision tree classification method is compared with traditional keyword extraction and weight calculation methods: about 1,000 queries are randomly sampled from the user query log for manual evaluation; the model's precision and recall are evaluated by cross-validation; the win rate of the model is compared with that of the traditional weight calculation method used in web page ranking; and several queries are searched on "Baidu" to compare the ranking obtained from the keyword sequence determined by the model with that obtained from the unprocessed query. The experimental results show that the keyword extraction and weight calculation method adopted in this thesis is feasible for weight calculation in web page ranking.
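The precision and recall evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas or data format, so everything here is an assumption: the function name `precision_recall`, the micro-averaging over queries, and the toy queries and words are hypothetical and are not taken from the thesis.

```python
from typing import Dict, Set, Tuple


def precision_recall(
    predicted: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall of extracted keywords.

    `gold` maps each query id to its manually annotated core words;
    `predicted` maps query ids to the words a model labeled as keywords.
    Queries that appear only in `predicted` are ignored.
    """
    true_positives = 0
    predicted_total = 0
    gold_total = 0
    for query_id, gold_words in gold.items():
        predicted_words = predicted.get(query_id, set())
        true_positives += len(predicted_words & gold_words)
        predicted_total += len(predicted_words)
        gold_total += len(gold_words)
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / gold_total if gold_total else 0.0
    return precision, recall


# Hypothetical toy data: two queries with annotated core words.
gold = {"q1": {"beijing", "weather"}, "q2": {"python", "tutorial"}}
predicted = {"q1": {"beijing", "weather", "how"}, "q2": {"python"}}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In a cross-validation setting, a function like this would be applied to the held-out fold of each split and the results averaged across folds.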
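[Degree-granting institution]: Graduate School of Chinese Academy of Social Sciences
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3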

[References]