Research on Automatic Word Segmentation Methods for the Annotated Spoken Corpus of Lüsu

Published: 2018-05-22 18:37

Topic: Lüsu + Chinese annotation corpus; Source: Application Research of Computers (《计算机应用研究》), 2017, Issue 05

[Abstract]: Endangered-language documentation aims to rescue and permanently preserve all the information carried by the spoken form of an endangered language: acoustic, linguistic, literary, historical, and traditional-cultural. Lüsu (吕苏语) is an endangered language with no written records, so archiving its spoken corpus is of great significance. Automatic word segmentation of the Chinese annotation corpus of spoken Lüsu is foundational work for subsequently building a high-quality spoken Lüsu corpus and a Lüsu documentation system. Research on segmenting Lüsu annotation corpora is currently almost nonexistent. This paper analyzes the characteristics of Lüsu and applies Jieba (结巴), an automatic Chinese word segmentation method, to the Chinese annotation corpus of Lüsu. To address the mis-segmentation that the Jieba algorithm produces on this corpus, an improved Jieba algorithm is proposed. Experimental comparison shows that the improved Jieba method achieves higher accuracy and improves the segmentation of the Chinese annotation corpus of Lüsu.
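The abstract does not say how the Jieba algorithm was modified, so the following is only a minimal sketch of the dictionary-based core that Jieba-style segmenters are generally built on: a word graph over the sentence plus a dynamic-programming search for the most probable path. It is written in plain Python rather than with the Jieba library, it omits Jieba's HMM step for unknown words, and the dictionary entries, frequencies, and example sentence are all hypothetical. It illustrates one plausible source of mis-segmentation (a domain term missing from the dictionary) and one common remedy (adding that term), which may or may not match the paper's improvement.

```python
import math


def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # unknown character: fall back to a single-character token
    return dag


def segment(sentence, freq):
    """Segment by choosing the path with the highest total log probability."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    n = len(sentence)
    # route[i] = (best log probability of sentence[i:], end index of the first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1] + 1
        words.append(sentence[i:j])
        i = j
    return words


# Hypothetical toy dictionary (word -> frequency); not taken from the paper.
base = {"他": 50, "去": 50, "放": 30, "牛": 40}
# Hypothetical domain term added to the dictionary.
extended = dict(base)
extended["牦牛"] = 20

print(segment("他去放牦牛", base))      # ['他', '去', '放', '牦', '牛']
print(segment("他去放牦牛", extended))  # ['他', '去', '放', '牦牛']
```

With the base dictionary the term 牦牛 ("yak") is split into two single characters; once it is added as an entry, the search keeps it whole. The Jieba library itself supports a comparable adjustment through user dictionaries.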
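[Affiliations]: School of Computer and Information Engineering, Beijing Technology and Business University; Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences
[Funding]: Major Program of the National Social Science Fund of China (14ZDB156); Humanities and Social Sciences Research Planning Fund of the Ministry of Education of China (15YJCZH224)
[Classification number]: TP391.1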

[References]