Research on the Construction of a Tibetan Proverb Corpus for Information Processing (信息处理用藏语谚语语料库构建研究)
[Abstract]: Drawing on Tibetan proverbs and Gesar proverbs in the Amdo, Kham and Ü-Tsang dialects, this thesis builds a corpus of Tibetan proverbs that is automatically segmented and then manually proofread. It establishes principles for the word segmentation of proverbs and constructs both a proverb corpus and a proverb lexicon. Following the relevant literature, Tibetan proverbs are subdivided into twelve types; during collection and collation, the proverbs were further classified into 32 forms. The thesis examines the proverbs from three angles: the numerical distribution of the proverbs, their frequency, and the frequency of their vocabulary. Finally, the proverbs can be sorted and searched in three ways: by Tibetan dialect with Tibetan-Chinese parallel text, by alphabetical order, and by content category. The corpus serves two main purposes. First, it is a resource for computer-based Tibetan information processing systems. Second, it is a reference work for Tibetan language learning and a basic resource for the study of Tibetan proverbs, intended for learners and researchers. The thesis aims to lay groundwork for Tibetan information processing tasks such as syntactic classification and tagging, automatic word segmentation, syntactic research, phrase research, machine translation, search engines and electronic dictionary compilation. It also offers a new method and means for future research on Tibetan literature.
The thesis claims four contributions: first, it collects and collates a large number of scattered Tibetan proverbs, the most gathered to date; second, it classifies and tags them for computer information processing; third, it establishes a bilingual parallel corpus of Tibetan proverbs; fourth, it builds a retrieval program for Tibetan proverbs, which makes future study and research on bilingual teaching more convenient. The next step is to translate the Tibetan proverbs. Displaying the content, form, paragraph and syllable-pause tags together in mixed sorting when an entry is clicked remains a task for further research. The thesis holds that a high-quality database of Tibetan proverbs not only helps to understand and make use of the treasury of Tibetan proverbs, but also provides indispensable language material for the study of Tibetan language and literature. It also expands the text databases available for Tibetan natural language processing.
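The frequency statistics and the three sort-and-search views described in the abstract can be sketched as follows. This is only a minimal illustration, not the thesis's program: the `Entry` fields, the function names and the three sample phrases (common words, not proverbs from the corpus) are all assumptions. Splitting on the tsheg mark yields syllables, which stands in for the thesis's own word segmentation principles, and sorting by Unicode code point is only a rough stand-in for true Tibetan alphabetical order.

```python
from collections import Counter
from dataclasses import dataclass

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause-final mark


@dataclass(frozen=True)
class Entry:
    """Hypothetical record layout; the thesis does not specify its schema."""
    tibetan: str
    chinese: str
    dialect: str
    category: str


def syllables(text):
    """Split Tibetan text into syllables on tsheg, dropping shad marks."""
    parts = []
    for chunk in text.replace(SHAD, " ").split():
        parts.extend(s for s in chunk.split(TSHEG) if s)
    return parts


def syllable_frequencies(entries):
    """Count how often each syllable occurs across all entries."""
    counts = Counter()
    for entry in entries:
        counts.update(syllables(entry.tibetan))
    return counts


def search(entries, keyword=None, dialect=None, category=None):
    """Return the entries that satisfy every filter that is not None."""
    results = []
    for e in entries:
        if keyword is not None and keyword not in e.tibetan and keyword not in e.chinese:
            continue
        if dialect is not None and e.dialect != dialect:
            continue
        if category is not None and e.category != category:
            continue
        results.append(e)
    return results


def sorted_by_codepoint(entries):
    """Sort by code point; an approximation, not real Tibetan collation."""
    return sorted(entries, key=lambda e: e.tibetan)


# Placeholder entries for illustration only (ordinary words, not proverbs).
ENTRIES = [
    Entry("བོད་སྐད།", "藏语", "Amdo", "placeholder-a"),
    Entry("བོད་ཡིག", "藏文", "Kham", "placeholder-b"),
    Entry("སྐད་ཡིག", "语文", "Ü-Tsang", "placeholder-a"),
]

if __name__ == "__main__":
    for syllable, count in syllable_frequencies(ENTRIES).most_common():
        print(syllable, count)
    for entry in search(ENTRIES, category="placeholder-a"):
        print(entry.dialect, entry.tibetan, entry.chinese)
```

A production system would replace `sorted_by_codepoint` with a proper Tibetan collation and `syllables` with a segmenter that follows the segmentation principles the thesis defines.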
[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: H214
Article ID: 2352964
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2352964.html