Automatic Word Segmentation of Middle Ancient Chinese Based on CRFs and Dictionary Information

Published: 2018-02-26 06:02

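Keywords: CRFs model; word segmentation consistency; Middle Ancient Chinese; automatic word segmentation. Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2017, Issue 05. Paper type: journal article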

[Abstract]: [Objective] To verify how segmentation consistency and corpus category affect the performance of CRFs-based word segmentation for Middle Ancient Chinese, and on that basis to further improve segmentation performance and reduce the workload of manual proofreading. [Methods] Taking historical texts, Buddhist scriptures, and fiction from the Middle Ancient period as example corpora, the study addresses automatic word segmentation of Middle Ancient Chinese by refining the segmentation guidelines and combining a CRFs model with a dictionary to eliminate the inconsistencies that tend to arise in manually segmented Middle Ancient Chinese text. Two kinds of features, character classification and dictionary information, are introduced into CRFs segmentation, and the most suitable feature template for each is selected through comparative experiments. [Results] The overall F-score of the segmentation results exceeds 99% in the closed test and reaches 89%-95% in the combined evaluation of the open test. [Limitations] The study of segmentation inconsistency focuses mainly on two-character words, so recognition of words with three or more characters (multi-character words) is somewhat weaker. [Conclusions] Provided that segmentation consistency is effectively improved, character classification and dictionary tagging features can effectively raise the precision of CRFs-based segmentation of Middle Ancient Chinese. The segmentation system proposed in the paper can also serve Chinese corpora of multiple categories from the Middle Ancient period.
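The abstract names two feature types, character classification and dictionary information, without giving their definitions or templates. The sketch below is only one plausible reading: the character classes, the B/M/E/S/O dictionary tags obtained by forward maximum matching, the four-tag B/M/E/S label scheme, and every function name are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical character classes; the paper's actual classification is not given in the abstract.
CHINESE_NUMERALS = set("一二三四五六七八九十百千萬万")
PUNCTUATION = set("，。、；：？！「」『』")


def char_class(ch):
    """Map a character to a coarse class (assumed classes: PUNC, NUM, OTHER)."""
    if ch in PUNCTUATION:
        return "PUNC"
    if ch in CHINESE_NUMERALS:
        return "NUM"
    return "OTHER"


def dict_tags(sentence, lexicon, max_len=4):
    """Tag characters by forward maximum matching against a lexicon.

    B/M/E mark the begin/middle/end of a matched multi-character word,
    S marks a single-character lexicon entry, O marks no match.
    """
    tags = ["O"] * len(sentence)
    i = 0
    while i < len(sentence):
        match = 1
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + n] in lexicon:
                match = n
                break
        if match > 1:
            tags[i] = "B"
            for j in range(i + 1, i + match - 1):
                tags[j] = "M"
            tags[i + match - 1] = "E"
        elif sentence[i] in lexicon:
            tags[i] = "S"
        i += match
    return tags


def words_to_labels(words):
    """Convert a gold segmentation (list of words) to per-character B/M/E/S labels."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return labels


def featurize(sentence, lexicon):
    """Build one feature dict per character: the character, its class, its dictionary tag, and its neighbours."""
    dtags = dict_tags(sentence, lexicon)
    rows = []
    for i, ch in enumerate(sentence):
        rows.append({
            "char": ch,
            "class": char_class(ch),
            "dict": dtags[i],
            "prev_char": sentence[i - 1] if i > 0 else "<BOS>",
            "next_char": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
        })
    return rows


if __name__ == "__main__":
    lexicon = {"世尊", "比丘"}  # toy lexicon for illustration only
    sentence = "世尊告諸比丘"
    gold = words_to_labels(["世尊", "告", "諸", "比丘"])
    for row, label in zip(featurize(sentence, lexicon), gold):
        print(row["char"], row["class"], row["dict"], label)
```

Each printed row pairs a character's features with its gold label, which is the column-per-feature layout that sequence-labelling toolkits typically take as training input; which columns a template actually combines is the part the paper reports selecting by comparative experiment.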
[Affiliation]: School of Chinese Language and Literature, Nanjing Normal University
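[Funding]: This work is one of the research outputs of two Major Programs of the National Social Science Fund of China, "Research on the Construction of a Corpus for the Study of the History of the Chinese Language" (No. 10&ZD117) and "Construction of a Classics Knowledge Base and Humanities Computing Research Based on the Sinological Index Series (《汉学引得丛刊》)" (No. 15ZDB127), and of the Ministry of Education Humanities and Social Sciences Youth Project "Construction and Quantitative Study of a Diachronic Chinese Vocabulary Database" (No. 16YJC740034).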
[Classification number]: TP391.1
