A Chinese Fuzzy Matching Algorithm Based on a Pinyin Index

Published: 2019-08-05 11:19

[Abstract]: Mainstream commercial search engines rely mainly on exact keyword matching. To improve retrieval efficiency when user input contains errors, this paper proposes an index-based fuzzy matching algorithm for Chinese. The algorithm uses three measures of similarity between Chinese strings (characters, pinyin, and a pinyin-improved edit distance) to expand the user's query, turning one fuzzy match into multiple exact matches, and then ranks the exact-match results by their similarity to the query string. In experiments, the method was applied to a corpus of web page text. With the pinyin-improved edit distance measure, it achieved 60.42% precision and 50.41% recall with only a small increase in time and space complexity.
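The pipeline the abstract describes (expand the query, run exact lookups, rank by similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the pinyin-improved edit distance, so a plain Levenshtein distance over pinyin syllables stands in for it, and the `PINYIN` table and the names `to_pinyin`, `build_index`, and `fuzzy_lookup` are hypothetical.

```python
from collections import defaultdict

# Toy character-to-pinyin table (hypothetical; a real system needs a full lexicon).
PINYIN = {
    "清": "qing", "华": "hua", "青": "qing", "花": "hua",
    "大": "da", "学": "xue", "北": "bei", "京": "jing",
}

def to_pinyin(text):
    """Map each character to a pinyin syllable; unknown characters map to themselves."""
    return tuple(PINYIN.get(ch, ch) for ch in text)

def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences (assumed stand-in measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def build_index(terms):
    """Index terms by pinyin sequence, so homophones share one key."""
    index = defaultdict(set)
    for term in terms:
        index[to_pinyin(term)].add(term)
    return index

def fuzzy_lookup(query, index, max_dist=1):
    """Expand the query into nearby pinyin keys, look each up exactly, then rank."""
    q = to_pinyin(query)
    # Query expansion: every indexed key within max_dist syllable edits of the query.
    expanded = [(edit_distance(q, key), key) for key in index]
    scored = []
    for dist, key in expanded:
        if dist > max_dist:
            continue
        for term in index[key]:  # exact lookup on the expanded key
            # Rank by pinyin distance first, then by character-level distance.
            scored.append((dist, edit_distance(query, term), term))
    return [term for _, _, term in sorted(scored)]

index = build_index(["清华大学", "北京大学", "清华"])
print(fuzzy_lookup("青花大学", index))
print(fuzzy_lookup("青花大学", index, max_dist=2))
```

The misspelled query "青花大学" shares its pinyin with "清华大学", so the correct term is found at distance 0 even though two of its four characters differ. This sketch scans every key to expand the query; a production index would need a cheaper way to enumerate nearby keys.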
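[Author affiliation]: Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University
[Funding]: National Natural Science Foundation of China (60703051)
[Classification number]: TP391.1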

[References]