Research and Implementation of Layout Analysis and Recognition Post-Processing for Mongolian Document Images

Published: 2018-07-13 07:42

[Abstract]: Research on optical character recognition (OCR) has advanced rapidly in recent years, and recognition of Chinese, English, and other scripts has achieved notable results. The recognition rate is the most important performance metric of an OCR system. For a printed Mongolian character recognition system, completing the system and raising its recognition rate requires studying and implementing both the layout analysis that precedes recognition and the post-processing that follows it. This thesis therefore covers two topics: layout analysis of Mongolian document images, and post-processing of Mongolian recognition results.

Layout analysis is essential groundwork for printed Mongolian character recognition, yet it has received little study for Mongolian document images. These images come in many layout styles and often mix text, pictures, tables, and other elements, all of which makes recognition difficult. This thesis combines bottom-up and top-down layout analysis: connected components are labeled, merged, and removed so that non-text regions are discarded and only text remains. The text is then divided into paragraphs, and the position of each paragraph is recorded for use in later layout restoration.

In the Mongolian character recognition system, segmenting and recognizing a document image yields Mongolian glyph codes, whereas the encoding in common use today is the international standard encoding, so the recognition result must be converted. The post-processing considered in this thesis is this conversion from glyph codes to international standard codes. The conversion relies on a lookup dictionary. To build it, an existing dictionary in international standard code (covering 50,553 commonly used Mongolian words) is converted to a Word document, then to a PDF file, and finally to images. The images undergo layout analysis followed by column segmentation, word segmentation, and glyph-unit segmentation. The resulting glyph-unit images are fed to a trained convolutional neural network classifier, whose output is the Mongolian glyph code. The existing standard-code dictionary and the glyph codes obtained this way are then paired one-to-one to form the conversion dictionary. During post-processing, the system finds the dictionary entry whose glyph code matches the recognition result and reads off the corresponding international standard code, which completes the conversion.

The layout analysis technique studied in this thesis handles Mongolian document images in a variety of complex layouts, including removing non-text regions, dividing text regions into paragraphs, and marking paragraph positions. Tested on a set of samples, it reached a layout analysis accuracy of 97.87%. The post-processing converts glyph-code recognition results into international standard codes quickly, effectively, and accurately, making the printed Mongolian character recognition system more complete.
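The bottom-up step of the layout analysis (labeling connected components, also called connected domains, and removing those that are not text) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the 4-connectivity choice, and the size rule used to discard non-text components are all assumptions made for illustration, and the merging step is omitted.

```python
from collections import deque


def label_components(binary):
    """Find 4-connected ink regions and return their bounding boxes.

    `binary` is a list of rows of 0/1 ints (1 = ink). Each box is
    (top, left, bottom, right) with inclusive coordinates.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def drop_large_components(boxes, max_height, max_width):
    """Keep boxes small enough to be text (hypothetical rule).

    Larger components are assumed to be pictures or table frames.
    """
    kept = []
    for top, left, bottom, right in boxes:
        if bottom - top + 1 <= max_height and right - left + 1 <= max_width:
            kept.append((top, left, bottom, right))
    return kept


# Toy page: two small marks and one 3x3 block standing in for a picture.
page = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
]
boxes = label_components(page)
text_boxes = drop_large_components(boxes, max_height=2, max_width=2)
```

On a real scan the thresholds would have to be derived from the page itself, for example from the typical component size, and a production system would more likely use an image library's labeling routine than a hand-written flood fill.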
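The dictionary-based code conversion can be sketched in the same hedged way. The abstract does not specify how glyph codes are represented, so the tuples of integers below are hypothetical, and the sketch assumes, purely for illustration, that the international standard code is a Unicode string. The function names are also invented here.

```python
def build_conversion_dict(standard_words, glyph_codes):
    """Pair each standard-code word with the glyph code recognized for it.

    The two lists are assumed to be aligned one-to-one by position,
    as the abstract describes.
    """
    if len(standard_words) != len(glyph_codes):
        raise ValueError("the two lists must align one-to-one")
    return dict(zip(glyph_codes, standard_words))


def convert(glyph_code, table):
    """Return the standard code for a recognized glyph code, or None if absent."""
    return table.get(glyph_code)


# Hypothetical data: two words in Unicode Mongolian and made-up glyph codes.
standard_words = [
    "\u182e\u1823\u1829\u182d\u1823\u182f",  # "mongol"
    "\u1828\u1823\u182e",                    # "nom"
]
glyph_codes = [(12, 7, 31), (5, 7, 12)]

table = build_conversion_dict(standard_words, glyph_codes)
result = convert((12, 7, 31), table)
```

A hash-based dictionary makes each lookup constant time on average, which suits a table of about fifty thousand entries. Note that `dict(zip(...))` silently keeps only the last word if two words share a glyph code, so a real table would need to detect such collisions.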
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.4

[References]