A Term Co-occurrence Algorithm and the Effect of Co-occurring Terms on Retrieval System Ranking
Published: 2019-07-22 17:39
[Abstract]: To investigate how co-occurring terms affect the relevance of a retrieval system's ranking, this paper proposes a new term co-occurrence algorithm, FDC. The algorithm takes into account the co-occurrence frequency of terms within a document, their relative distance, and their co-document rate. A sample of query terms was selected from the query log of the Tianwang (天网) search engine; co-occurring terms were computed for each query term with both FDC and latent semantic indexing (LSI), and the original ranking was re-scored under the same scoring strategy in both cases. Evaluation with discounted cumulative gain (DCG) shows that the co-occurring terms obtained by FDC improve the relevance of the original ranking at the 99% confidence level, and that the co-occurring terms obtained by LSI yield an equally significant improvement. The results indicate that co-occurring terms can improve the relevance of a retrieval system's result ranking, and that the improvement does not depend on a particular algorithm.
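[Affiliation]: School of Electronics Engineering and Computer Science (信息科学技术学院), Peking University (listed once for each of the four authors)
[Funding]: Key Program of the National Natural Science Foundation of China (60435020); Doctoral Program Foundation of the Ministry of Education (20030001076)
[Classification number]: TP391.3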
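
The abstract evaluates rankings with discounted cumulative gain (DCG) but does not state which DCG variant was used. As a minimal sketch, assuming the common log2 discount and graded relevance judgments, the comparison between an original and a re-scored ranking might look like the following. The function name `dcg` and the relevance values are hypothetical and are not taken from the paper.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain of a ranked list of relevance grades.

    Assumption: the widely used form sum(rel_i / log2(i + 1)) with 1-based
    rank i. The paper may have used a different discount.
    """
    top = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(top, start=1))


# Hypothetical relevance grades for one query, listed in ranked order.
original_ranking = [1, 0, 2, 0, 3]   # ranking before re-scoring
rescored_ranking = [3, 2, 1, 0, 0]   # same documents after re-scoring

print("original DCG:", dcg(original_ranking))
print("re-scored DCG:", dcg(rescored_ranking))
```

A higher DCG for the re-scored list means that relevant documents moved toward the top. Testing whether such a gain holds at a given confidence level, as the abstract reports, would additionally require per-query DCG values and a significance test, which this sketch does not cover.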
Article ID: 2517777
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2517777.html