Research and Implementation of Layout Analysis and Recognition Post-processing for Mongolian Document Images
[Abstract]: Optical character recognition (OCR) technology has developed rapidly in recent years, and the character recognition rate is the most important performance index of an OCR system. To complete a printed Mongolian character recognition system and raise its recognition rate, both the layout analysis that precedes recognition and the post-processing that follows it need to be studied and implemented. This thesis therefore covers two parts: layout analysis of Mongolian document images, and post-processing of Mongolian text recognition results.
Layout analysis is essential groundwork for printed Mongolian character recognition, yet little research has addressed it for Mongolian document images. Such images come in a variety of layouts, and the mixed arrangement of layout elements such as text, pictures, and tables makes recognition difficult. This thesis uses a layout analysis method that combines bottom-up and top-down strategies: by labeling, merging, and removing connected components, it removes the non-text parts and keeps only the text. After paragraph division, the position of each paragraph is recorded so that it can be used for later page restoration.
In the Mongolian character recognition system, segmenting and recognizing a document image yields Mongolian glyph codes. Post-processing in this thesis is the conversion of these glyph codes into international standard codes, using a method based on a code conversion dictionary. First, the existing international standard code dictionary (covering 50553 Mongolian words) is converted into a Word document, then into a PDF file, and finally into images. Layout analysis, column segmentation, word segmentation, and character segmentation are then carried out on these images. Each Mongolian character-component image obtained by segmentation is fed to a trained convolutional neural network classifier, which outputs a Mongolian glyph code. The entries of the international standard code dictionary and the glyph codes obtained are arranged in one-to-one correspondence to form the code conversion dictionary. During post-processing, the glyph code entry that matches a recognition result is located in this dictionary, and the international standard code at the same position is returned, which completes the conversion.
The layout analysis technique studied in this thesis can process Mongolian document images in many complex layouts: it removes the non-text parts, divides the text areas into paragraphs, and marks the paragraph positions. On a set of test samples, the accuracy of layout analysis reached 97.87%. The post-processing converts recognized Mongolian glyph codes into international standard codes quickly and accurately, which makes the printed Mongolian character recognition system more complete.
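The connected-component steps mentioned in the abstract (labeling, removing, merging) might be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names, the area threshold used to discard non-text components, the distance rule used for merging, and the order of the steps are all hypothetical, and the input is assumed to be a binarized page given as a list of rows of 0/1 values.

```python
from collections import deque

def connected_components(img):
    """Label 8-connected foreground pixels (value 1) and return one
    bounding box (top, left, bottom, right) per component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def box_area(box):
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)

def merge_close(boxes, gap):
    """Merge boxes whose nearest edges differ by at most `gap` in both
    axes, repeating until no pair can be merged (assumed rule)."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                near = (a[0] - gap <= b[2] and b[0] - gap <= a[2]
                        and a[1] - gap <= b[3] and b[1] - gap <= a[3])
                if near:
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes

def text_blocks(img, max_area, gap):
    """Drop components larger than `max_area` (assumed to be pictures or
    tables), then merge the remaining ones into text blocks."""
    small = [b for b in connected_components(img) if box_area(b) <= max_area]
    return merge_close(small, gap)
```

The returned boxes play the role of the paragraph positions that the abstract says are kept for page restoration. A real system would choose the thresholds from the page resolution and font size rather than fixed constants.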
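The dictionary-based code conversion described in the abstract can be illustrated with a small lookup table. This is a minimal sketch under assumptions: the glyph codes are represented as made-up integers, the standard codes as Unicode strings of arbitrary Mongolian letters chosen only for illustration, and the names `build_conversion_dict` and `convert` are hypothetical. The abstract describes finding the position of a matching glyph-code entry in an aligned dictionary; here the two aligned lists are folded into a hash map, which gives the same result with a direct lookup.

```python
def build_conversion_dict(glyph_entries, standard_entries):
    """Pair each glyph-code sequence with the standard-code word at the
    same position in the two aligned lists."""
    if len(glyph_entries) != len(standard_entries):
        raise ValueError("the two dictionaries must align one-to-one")
    return {tuple(g): s for g, s in zip(glyph_entries, standard_entries)}

def convert(recognized_words, conversion_dict):
    """Map each recognized glyph-code sequence to its standard code;
    sequences missing from the dictionary map to None."""
    return [conversion_dict.get(tuple(word)) for word in recognized_words]

# Toy data: glyph codes are invented integers, not the thesis's coding.
glyph_entries = [(3, 17), (5,)]
standard_entries = ["\u182A\u1820", "\u1820"]
table = build_conversion_dict(glyph_entries, standard_entries)
result = convert([(5,), (3, 17), (9,)], table)
```

Returning `None` for unknown sequences is a design choice of this sketch; how the thesis handles recognition results that have no dictionary entry is not stated in the abstract.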
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.4
Article number: 2118662
Article link: https://www.wllwen.com/shoufeilunwen/xixikjs/2118662.html