Research on Scene Text Localization and Multi-Oriented Character Recognition Based on Convolutional Neural Networks
Topic: text localization + character recognition; Reference: PhD dissertation, Huazhong University of Science and Technology, 2016; Author: 朱安娜 (Zhu Anna)
【Abstract】: With the rapid development of intelligent transportation, navigation aids for the blind, and intelligent logistics, localizing and recognizing text in scene images, such as road signs, billboards, license plates, books, and product packaging, has become a research hotspot in computer vision. Scene text images suffer not only from low resolution, uneven illumination, defocus blur, and affine distortion, but also from complex and variable background textures such as trees, brick walls, and railings; the text itself varies in color, font, size, orientation, and layout. Applying existing optical character recognition (OCR) techniques directly therefore yields low accuracy and adapts poorly to changing environments, so localizing and recognizing scene text quickly, accurately, and robustly remains a challenging research problem. Extensive observation shows that although the background texture interference in scene text images is complex and variable, the texture features of character-stroke regions are relatively invariant. Building on this invariance, this dissertation uses convolutional neural networks (CNNs) to extract texture features of character-stroke regions, and combines them with the geometric features of strokes and with scene-context features of character regions to suppress background interference and improve the accuracy and adaptability of scene text localization. In addition, to make character recognition robust to changes in text orientation, we propose texture features and corresponding structural features computed at uniformly sampled points of a character, aggregated with a bag-of-features model and classified with a support vector machine (SVM). The dissertation therefore studies scene text localization and recognition along two directions, with the following results.

First, because the layered structure of a CNN learns rich high-level semantic information and can effectively extract features of target regions against complex background textures, we use a CNN to extract texture features of candidate characters and design a connected-component SVM classifier over joint geometric and texture features to suppress non-character components. Furthermore, to localize multi-oriented text regions precisely, skew-corrected candidate text regions are filtered using a geometric similarity measure and an SVM based on gradient statistics, removing background interference and yielding precise localization. The proposed method adapts well to changes in the position, angle, scale, and gray level of scene text, effectively suppresses complex background texture interference, and improves the precision and adaptability of scene text localization.

Second, using a scene-segmentation model, we propose a scene text localization method that combines scene context with a CNN. When classifying character and background regions, most methods consider only character-level features such as edge density, stroke width, or gradient distribution, and are easily misled by character-like backgrounds. We therefore propose using scene-context information from the area surrounding each candidate character to assist localization. TextonBoost and a fully connected conditional random field first estimate, for every pixel, the probability of belonging to each of 14 scene classes such as trees, road signs, walls, and sky; at the same time, maximally stable extremal regions (MSERs) are extracted from the image and expanded into rectangular blocks. The average of the probability vectors of all pixels in a block is then taken as that region's scene-context feature and combined with a CNN and an SVM classifier to separate characters from non-characters. Finally, character regions are grouped into text regions using the scene-context features together with geometric and color information. This method effectively suppresses complex background texture in scenes where text is unlikely to appear, improving localization accuracy.

Finally, to recognize text in different orientations, we propose a rotation-invariant character representation that combines region texture features with structural features. Existing scene text recognition techniques mostly handle horizontal characters and lack a general character representation. We design character features from the relative orientation and relative position of character structures: taking each uniform sampling point of the normalized character image in turn as a target, direction-free gradient statistics with respect to the other sampling points provide its texture features, while the corresponding spatial coordinate relations are recorded as structural features. Both feature types are aggregated with a bag-of-features model and classified with an SVM. Because the extracted features are rotation invariant, the model adapts to different text orientations. Experiments on a standard character dataset and an arbitrary-orientation character dataset show that the proposed method achieves high recognition accuracy.
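The scene-context step above reduces to averaging the per-pixel class-probability vectors inside each candidate rectangle. Below is a minimal NumPy sketch of that averaging, assuming the probability map from TextonBoost and the fully connected CRF has already been computed; the map shape, rectangle format, and class count here are illustrative, not the dissertation's actual interfaces:

```python
import numpy as np

def region_context_feature(prob_map: np.ndarray, rect: tuple) -> np.ndarray:
    """Average the per-pixel class-probability vectors inside a rectangle.

    prob_map: (H, W, C) array where prob_map[y, x] sums to 1 over C scene classes
    rect: (x, y, w, h) rectangular block enclosing a candidate character region
    """
    x, y, w, h = rect
    patch = prob_map[y:y + h, x:x + w]                       # (h, w, C) block
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)   # (C,) context vector

# toy example: 4x4 probability map over 3 hypothetical scene classes
H, W, C = 4, 4, 3
prob_map = np.full((H, W, C), 1.0 / C)
feat = region_context_feature(prob_map, (1, 1, 2, 2))
```

The resulting vector would then be concatenated with the CNN texture features before the SVM decides character vs. non-character.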
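The recognition model's rotation invariance comes from describing sampling points relative to one another rather than in absolute coordinates. The dissertation uses direction-free gradient statistics plus spatial relations; as a minimal, hypothetical illustration of why such pairwise relations survive rotation, the sketch below histograms the pairwise distances between sampling points and checks that the result is unchanged when the point set is rotated:

```python
import numpy as np

def pairwise_structure_descriptor(points: np.ndarray, n_bins: int = 8) -> np.ndarray:
    """Normalized histogram of pairwise distances between sampling points.

    Distances between points are unchanged by rotation, so the histogram is a
    rotation-invariant structural signature of the point set.
    points: (N, 2) coordinates of uniform sampling points, roughly in [0, 1]^2
    """
    diffs = points[:, None, :] - points[None, :, :]      # (N, N, 2) offsets
    dists = np.sqrt((diffs ** 2).sum(-1))                # (N, N) pairwise distances
    iu = np.triu_indices(len(points), k=1)               # count each pair once
    hist, _ = np.histogram(dists[iu], bins=n_bins, range=(0.0, np.sqrt(2.0)))
    return hist / max(hist.sum(), 1)                     # normalized histogram

# rotating the point set about its centroid leaves the descriptor unchanged
pts = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.9]])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
center = pts.mean(axis=0)
rotated = (pts - center) @ R.T + center
```

In the full method, per-point descriptors like these would be quantized against a learned codebook (the bag-of-features step) and the resulting histogram fed to the SVM classifier.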
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Doctoral
【Year conferred】: 2016
【Classification (CLC)】: TP391.41; TP183
Article ID: 1825371
Link: https://www.wllwen.com/shoufeilunwen/xxkjbs/1825371.html