Chinese Fuzzy Matching Algorithm Based on a Pinyin Index
Published: 2019-08-05 11:19
[Abstract]: Mainstream commercial search engines rely mainly on exact keyword matching. To improve retrieval effectiveness when the user's input contains errors, an index-based fuzzy matching algorithm for Chinese is proposed. The algorithm uses three different measures of similarity between Chinese strings (Chinese characters, pinyin, and a pinyin-improved edit distance) to expand the user's query. Fuzzy matching is thereby transformed into multiple exact matches, and the exact-match results are ranked by their similarity to the query string. In the experiments, the method was applied to a corpus of web page text. With the pinyin-improved edit distance measure, the method achieved a precision of 60.42% and a recall of 50.41%, with only a small increase in time and space complexity.
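The query-expansion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `PINYIN` table is a tiny hypothetical stand-in for a full pinyin dictionary, the index is keyed on toneless pinyin, and the ranking uses a plain Levenshtein distance over characters rather than the pinyin-improved edit distance, whose definition the abstract does not give. All function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy pinyin table (toneless); a real system would need a full dictionary.
PINYIN = {
    "清": "qing", "青": "qing", "情": "qing",
    "华": "hua", "花": "hua", "话": "hua",
    "大": "da", "学": "xue", "北": "bei", "京": "jing",
}

def to_pinyin(term):
    """Map a term to a tuple of pinyin syllables, one per character."""
    return tuple(PINYIN[ch] for ch in term)

def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences (assumed stand-in measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def build_index(vocabulary):
    """Index every known term under its pinyin key."""
    index = defaultdict(list)
    for term in vocabulary:
        index[to_pinyin(term)].append(term)
    return index

def expand_query(query, index):
    """Turn one fuzzy query into exact-match candidates that share its pinyin,
    ranked by character-level distance to the query (ties broken by string order)."""
    candidates = index.get(to_pinyin(query), [])
    return sorted(candidates, key=lambda t: (edit_distance(query, t), t))

index = build_index(["清华", "青花", "情话", "北京", "清华大学"])
print(expand_query("青华", index))  # → ['清华', '青花', '情话']
```

Each returned candidate can then be sent to an ordinary exact-match search. Because the lookup is a single dictionary access on the pinyin key, the extra cost over exact matching stays small in this sketch.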
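[Author affiliation]: Department of Computer Science and Technology, Tsinghua University; Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology
[Funding]: Supported by the National Natural Science Foundation of China (60703051)
[Classification number]: TP391.1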
Article ID: 2523096
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2523096.html