
Research on a Zhuang Word Segmentation Algorithm Based on an Improved Mutual Information Algorithm and t-Test Difference

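Published: 2018-06-22 05:28

  Topic: Zhuang word segmentation + improved MI algorithm; Source: Journal of South-Central University for Nationalities (Natural Science Edition) (《中南民族大学学报(自然科学版)》), 2017, Issue 04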


[Abstract]: Traditional Zhuang word segmentation treats the spaces between words as delimiters. In most cases this breaks up semantic words formed by several associated words and destroys the complete, independent meaning they express. Building on earlier work that used mutual information (MI) to measure the degree of association between adjacent words, this paper applies the improved mutual information algorithm MI^k and the t-test difference to Zhuang text segmentation for the first time. Combining their respective strengths in evaluating the static and dynamic binding ability of adjacent words, it proposes a hybrid algorithm, TD-MI^k, that joins MI^k with the t-test difference, and it compares the segmentation performance of the three methods: MI^k, t-test difference, and TD-MI^k. Experiments used texts from the Zhuang-language edition of People's Daily Online (人民网) as training and test corpora. The results show that all three methods extract semantic words from text fairly accurately and effectively, and that the TD-MI^k hybrid algorithm achieves the highest segmentation accuracy.
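The abstract names the two association measures but gives no formulas, so the following is only a minimal sketch under stated assumptions. It assumes the commonly used definitions MI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))) and, for a context v x y w, Δt(x:y) = t_{v,y}(x) - t_{x,w}(y) with t_{x,z}(y) = (p(z|y) - p(y|x)) / sqrt(var(p(z|y)) + var(p(y|x))). The function names, the count-based variance estimate, the default k = 2, and the placeholder tokens are all illustrative choices, not taken from the paper. The paper's rule for combining the two scores into TD-MI^k and its thresholds are not stated in the abstract and are deliberately left out.

```python
import math
from collections import Counter


def count_ngrams(tokens):
    """Count single words and adjacent word pairs in a space-delimited token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def mi_k(x, y, unigrams, bigrams, total, k=2):
    """Assumed form: MI^k(x, y) = log2(p(x, y)^k / (p(x) * p(y))).

    Returns -inf when the pair (x, y) never occurs adjacently.
    """
    c_xy = bigrams[(x, y)]
    if c_xy == 0:
        return float("-inf")
    p_xy = c_xy / total
    p_x = unigrams[x] / total
    p_y = unigrams[y] / total
    return math.log2(p_xy ** k / (p_x * p_y))


def t_score(x, y, z, unigrams, bigrams):
    """Assumed form: t_{x,z}(y), for the word y between predecessor x and successor z.

    Positive when y is more strongly tied to z than to x.
    Variances are estimated from counts as c(y,z)/c(y)^2 and c(x,y)/c(x)^2.
    """
    c_xy, c_yz = bigrams[(x, y)], bigrams[(y, z)]
    c_x, c_y = unigrams[x], unigrams[y]
    var = c_yz / c_y ** 2 + c_xy / c_x ** 2
    if var == 0:
        return 0.0
    return (c_yz / c_y - c_xy / c_x) / math.sqrt(var)


def t_diff(v, x, y, w, unigrams, bigrams):
    """Assumed form: Δt(x:y) = t_{v,y}(x) - t_{x,w}(y) in the context v x y w.

    Larger values suggest x and y should stay together as one semantic word.
    """
    return t_score(v, x, y, unigrams, bigrams) - t_score(x, y, w, unigrams, bigrams)


# Placeholder tokens standing in for space-delimited Zhuang words.
tokens = "a b c a b d a b c".split()
unigrams, bigrams = count_ngrams(tokens)
print(mi_k("a", "b", unigrams, bigrams, total=len(tokens)))
print(t_diff("c", "a", "b", "c", unigrams, bigrams))
```

In a segmenter built along these lines, each space between adjacent words would be scored with both measures, and the space would be kept as a boundary only when the scores fall below chosen thresholds. Those thresholds would have to be tuned on a real corpus.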
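[Author affiliations]: School of Computer Science, South-Central University for Nationalities; School of Computer and Information Engineering, Hechi University
[Funding]: Sub-project of the National Key Technology R&D Program (2015BAD29B01); Graduate Academic Innovation Fund of South-Central University for Nationalities (2017sycxjj051)
[CLC number]: TP391.1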


Article ID: 2051772


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2051772.html

