
Research and Implementation of Multi-Script Recognition in Printed Mongolian Documents

Published: 2018-03-06 03:23

  Topic: Mongolian  Focus: document images  Source: Inner Mongolia University, 2017 master's thesis  Type: degree thesis


[Abstract]: Many optical character recognition (OCR) systems can recognize a single script. With globalization, however, documents increasingly mix several scripts. Some existing Mongolian documents contain not only Mongolian but also a certain amount of Chinese and English text, so a multi-script recognition system is needed. The multi-script recognition technique proposed in this thesis consists of two processes: document image preprocessing and script identification. Preprocessing proceeds as follows: first, text regions are separated from image regions and extracted; next, the text regions are divided into paragraphs; then vertical projection and Gaussian smoothing are used for column segmentation to obtain text columns; finally, connected-component analysis is used for character segmentation. During preprocessing, the coordinates of each character image in the original document image are recorded so that the page layout can be restored. The proposed Mongolian-Chinese-English recognition technique comprises a coarse classification stage and a fine classification stage. In the coarse stage, character images are classified using information such as their width and height, which roughly divides them into a Mongolian class, a Chinese class, and an English class. The Chinese class still contains some English and Mongolian in addition to Chinese, and the English class still contains some Chinese and Mongolian in addition to English, so further classification is required. In the fine stage, based on the coarse classification results, convolutional neural networks (CNNs) are applied separately to the Chinese class, the English class, and the punctuation/English/digit class. In tests on the experimental dataset, column segmentation accuracy reached 99.13% and character segmentation accuracy reached 97.87% in the preprocessing stage. In the fine classification stage, the proposed method achieved an average recognition accuracy of 99.41% on the Chinese class, 98.86% on the English class, and 98.34% on the punctuation/English/digit class.
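The column segmentation step named in the abstract (vertical projection followed by Gaussian smoothing) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `gaussian_kernel` and `segment_columns`, the kernel width `sigma`, and the relative cutoff `rel_threshold` are all assumptions chosen for illustration, and the input is assumed to be an already binarized text region with ink pixels set to 1.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel covering about +/- 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def segment_columns(binary, sigma=1.0, rel_threshold=0.5):
    """Split a binarized text region into vertical text columns.

    binary: 2-D array, 1 = ink, 0 = background (assumed input format).
    sigma, rel_threshold: hypothetical tuning parameters, not from the thesis.
    Returns a list of (start_x, end_x) ranges with end_x exclusive.
    """
    # Vertical projection: count ink pixels in every pixel column.
    profile = binary.sum(axis=0).astype(float)
    # Gaussian smoothing suppresses small gaps and speckle in the profile.
    smooth = np.convolve(profile, gaussian_kernel(sigma), mode="same")
    # A pixel column belongs to a text column if its smoothed ink count
    # exceeds a fraction of the profile maximum.
    is_text = smooth > rel_threshold * smooth.max()
    # Locate the start and end of every run of consecutive True values.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]


if __name__ == "__main__":
    # Synthetic page with two solid vertical text columns.
    page = np.zeros((20, 30), dtype=np.uint8)
    page[:, 5:10] = 1
    page[:, 18:24] = 1
    print(segment_columns(page))
```

Because traditional Mongolian is written in vertical columns, projecting onto the horizontal axis separates columns rather than lines. Each returned x-range could then be cropped and passed to connected-component analysis for the character segmentation step.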
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.4

