[Abstract]: With the development of information technology and the growing popularity of the Internet, search engines have attracted increasing attention. In recent years, the mainstream search engines have been keyword-based. The accuracy of the weight assigned to each word in a user query statement directly affects the subsequent ranking of web pages, so computing word weights in the retrieval condition correctly is very important. This paper seeks a method for calculating keyword weights in user query statements, in order to improve keyword-based search engines and lay a good foundation for subsequent retrieval processing. The work consists of three parts. (1) Characteristics of user query statements. The paper analyzes the relationship between the characteristics of 5,000 query statements annotated with core words and the weights of their words, and examines the stop words in the query statements alongside the stop words in a modern Chinese corpus. Analysis and examples of stop words in query statements of different categories are also given. (2) Keyword weight calculation for web page ranking. The user query log is segmented and tagged with parts of speech, and keyword extraction is treated as a classification task. Based on the characteristics of query statements, eight contextual features of each word are selected as the features for decision-tree forest classification, and the calculation method for each feature is introduced. Error analysis is carried out on the experimental results, and rules are added to post-process the output of the classification model. (3) Analysis of experimental results. The results of the decision tree classification method are compared with those of traditional keyword extraction and weight calculation methods. About 1,000 query statements are randomly drawn from the user query log for manual evaluation. The accuracy and recall of the model are evaluated by cross-validation. The win rate of the model method against the traditional weight calculation method in web page ranking is compared. Several query statements are selected and searched on "Baidu" to observe how the ranking of web pages differs between the keyword sequence determined by the model and the search statement with no keyword processing. The experimental results show that the keyword extraction and weight calculation method used in this paper is feasible for weight calculation in web page ranking.
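The evaluation quantities mentioned above (accuracy and recall of keyword extraction, and the win rate against a traditional method) can be illustrated with a minimal Python sketch. This is not the thesis's own code. The function names, the use of word sets for the annotated core words, and the "model" / "baseline" / "tie" judgement labels are assumptions made for illustration, and "accuracy" is interpreted here as precision over the extracted keywords.

```python
from typing import Sequence, Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of predicted keywords against annotated core words.

    `predicted` and `gold` are hypothetical word sets for one query statement.
    """
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def win_rate(judgements: Sequence[str]) -> float:
    """Share of decided comparisons in which the model's ranking was preferred.

    Each judgement is one of "model", "baseline", or "tie" (assumed labels);
    ties are left out of the denominator.
    """
    wins = sum(1 for j in judgements if j == "model")
    losses = sum(1 for j in judgements if j == "baseline")
    decided = wins + losses
    return wins / decided if decided else 0.0
```

In a cross-validation setting, `precision_recall` would be averaged over the held-out query statements of each fold. Whether ties count in the win rate is a design choice; this sketch excludes them.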