Research and System Design of an SVM-Based Recognition Method for Printed Mathematical Formulas

Published: 2018-04-30 11:30

Topic: formula recognition + SVM; Source: master's thesis, Shenyang University of Technology, 2015

[Abstract]: Optical character recognition (OCR) is a recognition technology that has been widely applied in recent years in banking, postal and telecommunications services, logistics, and other fields. Its purpose is to convert printed or handwritten characters supplied as images into editable symbols. Recognition of Chinese text, English text, and Arabic numerals in printed documents has already reached a high level, but mathematical formulas involve many symbol types, large variations, and complex structure, so recognizing them correctly and quickly remains difficult and calls for more effective methods. This thesis studies several key problems in printed mathematical formula recognition, focusing on fast and accurate correction of skewed images, effective segmentation of touching symbols, and symbol recognition with an SVM-based multi-layer classifier. To improve the efficiency and accuracy of page skew detection, a skew correction method based on connected-component analysis and the Hough transform is proposed: connected-component analysis gives a preliminary estimate of the skew angle, the longer connected components are used to delimit the text regions, and the Hough transform is then applied to the edge-detected page regions with different angular step sizes to obtain the final, precise skew angle. For touching symbols, candidate cut positions are determined from concave and convex contour features and a segmentation factor, and the segmented symbols are then verified by recognition. Because formula symbols are so numerous, their features are screened and categorized in detail to reduce the load on the classifier and improve classification accuracy, and on that basis a multi-layer classifier combining coarse and fine classification is built. In fine classification, the one-versus-one approach of the traditional DAG-SVM training model is improved with a one-versus-rest approach, which raises training efficiency, and the nodes of the DAG-SVM are rearranged according to inter-class separability, which reduces the effect of error accumulation on recognition.
Experiments and analysis show that the proposed algorithms detect the page skew angle efficiently, segment touching characters accurately, and recognize formula symbols effectively. Based on these methods, a printed mathematical formula recognition system was designed and implemented in VC++. The system takes a document image containing formulas as input and, after layout analysis, formula image preprocessing, formula symbol recognition, and formula structure analysis, outputs the result in LaTeX format. Analysis of the recognition results shows that the improved SVM classifier proposed in the thesis reaches a recognition rate of 94.7% on mathematical symbols, higher than that of existing SVM classifiers.
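The idea of running the Hough transform at different angular step sizes can be illustrated with a small coarse-to-fine search. The Python below is a hypothetical sketch, not the thesis's VC++ implementation: the function names, the 15-degree search range, the 1-degree and 0.1-degree steps, and the sum-of-squared-votes score are all assumptions, and a coarse Hough sweep stands in for the connected-component pre-estimate described in the abstract.

```python
import math
from collections import Counter


def hough_score(points, angle_deg, bin_size=1.0):
    """Vote each point into a rho bin for one candidate angle.

    Returns the sum of squared bin counts, which is largest when the
    points collapse onto few lines at that angle.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    bins = Counter(
        math.floor((y * cos_t - x * sin_t) / bin_size) for x, y in points
    )
    return sum(count * count for count in bins.values())


def sweep(points, center, half_range, step):
    """Return the best-scoring angle in [center - half_range, center + half_range]."""
    n = round(half_range / step)
    candidates = [center + k * step for k in range(-n, n + 1)]
    return max(candidates, key=lambda a: hough_score(points, a))


def estimate_skew(points, max_angle=15.0, coarse_step=1.0, fine_step=0.1):
    """Coarse sweep over the whole range, then a fine sweep around the winner."""
    coarse = sweep(points, 0.0, max_angle, coarse_step)
    return sweep(points, coarse, coarse_step, fine_step)


def synthetic_text_lines(skew_deg, n_lines=5, length=400, spacing=40):
    """Made-up test data: points on parallel baselines tilted by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    return [
        (x, 20 + i * spacing + x * slope)
        for i in range(n_lines)
        for x in range(length)
    ]


if __name__ == "__main__":
    points = synthetic_text_lines(3.4)
    print("estimated skew (degrees):", estimate_skew(points))
```

On the synthetic baselines the estimate should land close to the 3.4 degrees used to generate them. The two-pass search evaluates about 50 candidate angles, whereas a single pass at 0.1-degree resolution over the same range would evaluate about 300.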
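The fine-classification stage relies on the elimination-style evaluation of DAG-SVM, in which each pairwise decision discards one candidate class, so N classes need N - 1 decisions. The sketch below shows one possible reading of ordering the nodes by inter-class separability: the most separable remaining pair is always evaluated first, so the decisions least likely to be wrong sit at the top of the graph. Everything in it is an assumption for illustration: a nearest-centroid rule stands in for the trained binary SVMs, centroid distance stands in for the separability measure (the abstract does not specify one), and the class names and feature values are made up.

```python
import math
from itertools import combinations

# Hypothetical 2-D feature centroids for four symbol classes (made-up numbers).
CENTROIDS = {
    "plus": (0.0, 0.0),
    "minus": (4.0, 0.0),
    "equals": (0.0, 3.0),
    "sigma": (10.0, 10.0),
}


def separability(a, b, centroids):
    """Stand-in separability measure: distance between class centroids."""
    return math.dist(centroids[a], centroids[b])


def pairwise_winner(x, a, b, centroids):
    """Stand-in for a trained binary SVM deciding between classes a and b."""
    return a if math.dist(x, centroids[a]) <= math.dist(x, centroids[b]) else b


def dag_classify(x, centroids):
    """Eliminate one class per pairwise decision until a single class remains.

    Returns the surviving label and the number of pairwise evaluations.
    """
    remaining = sorted(centroids)
    evaluations = 0
    while len(remaining) > 1:
        # Evaluate the most separable remaining pair first.
        a, b = max(
            combinations(remaining, 2),
            key=lambda pair: separability(pair[0], pair[1], centroids),
        )
        winner = pairwise_winner(x, a, b, centroids)
        remaining.remove(b if winner == a else a)
        evaluations += 1
    return remaining[0], evaluations


if __name__ == "__main__":
    label, count = dag_classify((0.5, 2.6), CENTROIDS)
    print(label, count)
```

With four classes the loop always performs three pairwise evaluations, compared with the six that one-versus-one voting over all pairs would need.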
[Degree-granting institution]: Shenyang University of Technology
[Degree level]: Master's
[Year conferred]: 2015
[Classification number]: TP391.41
