Research on the Construction of a Tibetan Proverb Corpus for Information Processing

Published: 2018-11-24 08:22

[Abstract]: This thesis first builds a corpus of Tibetan proverbs by collecting, collating, and keying in proverbs from the three major Tibetan dialects (Amdo, Kham, and Ü-Tsang) together with the Gesar Proverbs (《格萨尔谚语》). The corpus was segmented automatically and then proofread by hand, and principles for segmenting the words of proverbs were established, yielding both the proverb corpus and a segmented-word database. By content, the proverbs are subdivided into twelve types, building on the classification in the existing literature. By form, the number of categories was extended to thirty-two in the course of collection and collation. The Tibetan Proverbs (《藏语谚语》) material is then studied from three angles: the distribution of the number of entries, the absolute frequency of words (occurrence counts), and their relative frequency. Finally, the entries are sorted and retrieved in three ways: as Tibetan-Chinese parallel text grouped by the three major dialect regions, in Tibetan alphabetical order, and by content category. The resource serves two main purposes. First, it is a proverb corpus for computer-based Tibetan information processing systems. Second, it is a reference work for learners of Tibetan and a basic resource for research on the vocabulary of Tibetan proverbs. The aim of the thesis is to lay groundwork for future Tibetan information processing tasks such as syntactic category tagging, automatic word segmentation, syntactic research, phrase research, machine translation, search engines, and electronic dictionary compilation, and to offer a new method and tool for future research on Tibetan literature.
The thesis claims four innovations: first, it collects and collates a large number of scattered Tibetan proverbs, which the author states is the largest such collection to date; second, it classifies and annotates the proverbs for computer information processing; third, it establishes a bilingual parallel corpus of Tibetan proverbs; fourth, it builds a retrieval program for Tibetan proverbs, which facilitates future study and research in bilingual teaching. The next step is to translate the collected proverb entries. A further task is to make the annotations of content, form, paragraph, and syllable pause appear together in an entry when that entry is clicked in the mixed-sort view. The thesis argues that a high-quality Tibetan proverb database not only helps to better master and use the treasure of Tibetan proverbs and supplies indispensable language material for research on Tibetan language and literature, but also expands the text resources available for Tibetan natural language processing.
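The word-frequency study mentioned in the abstract can be illustrated with a minimal sketch. The thesis does not publish its code or data format, so everything below is an assumption: the input is taken to be proverbs that have already been segmented into words separated by spaces, and the function names `word_stats` and `syllables` are hypothetical. The sketch computes, for each word, its occurrence count and its relative frequency, and splits a word into syllables at the Tibetan tsheg mark (U+0F0B).

```python
from collections import Counter

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def word_stats(segmented_proverbs):
    """Map each word to (count, relative frequency).

    Assumes each proverb is a string whose words are separated by
    whitespace, i.e. segmentation has already been done and checked.
    """
    counts = Counter(
        word
        for proverb in segmented_proverbs
        for word in proverb.split()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: (n, n / total) for word, n in counts.items()}


def syllables(word):
    """Split a Tibetan word into syllables at the tsheg, dropping empties."""
    return [s for s in word.split(TSHEG) if s]


if __name__ == "__main__":
    # Toy input: arbitrary word sequences, not real proverbs from the corpus.
    toy = [
        "བོད་སྐད་ གཏམ་དཔེ་",
        "གཏམ་དཔེ་ གཏམ་དཔེ་",
    ]
    for word, (n, freq) in sorted(word_stats(toy).items(), key=lambda kv: -kv[1][0]):
        print(word, n, round(freq, 3), syllables(word))
```

Sorting the output in Tibetan alphabetical order is deliberately left out: Tibetan dictionary order does not match plain code-point order, so a real implementation would need a proper collation library.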
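[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: H214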

[Similar literature]