Research on Spam Image Filtering Methods Based on Image Features and OCR
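Keywords: spam image; feature extraction; KNN; short text classification. Source: Nanjing University of Science and Technology, master's thesis, 2017. Document type: degree thesis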
[Abstract]: With the rapid growth of the Internet, email has become an important tool for everyday communication. Alongside useful information, users also receive advertising, pornography, fraud, Trojans, and even reactionary content. Such unwanted content consumes substantial network resources, raises risk for users, degrades the user experience, and constitutes spam. Spam has gradually evolved from text-based to image-based and mixed image-text forms. Filtering methods for text spam have been studied extensively, whereas filtering methods for image spam remain unsatisfactory. This thesis therefore studies techniques for filtering spam images in spam email.
The thesis designs a two-layer spam image filtering method that screens spam images stage by stage using low-level image features and OCR, raising the detection rate while lowering the false detection rate. According to the type of feature used, the method is divided into a feature-based filtering layer and a content-based filtering layer. The former is the first layer and performs coarse classification: it uses low-level image features to make an initial selection of spam images. The latter is the second layer and performs fine classification: it extracts keywords from the text recognized in the spam images and assigns each image to a spam category.
In the feature-based filtering layer, the thesis proposes a KNN filtering method based on confidence analysis. It first analyzes low-level features of spam and normal images, including color, gradient, and HOG. It then analyzes the KNN classification results and confidence distribution of each feature, and uses the confidence values to fuse the classification results of the multiple features, which reduces the false recognition rate. In the content-based filtering layer, the thesis first designs methods for detecting, segmenting, and recognizing text in spam images, including a single-character segmentation method based on the Fourier transform and projection that handles skewed text. It then proposes a chi-square test that incorporates relative term frequency for extracting keyword features from the text, which lowers the probability that low-frequency words are selected as features. Finally, it designs a short text classification method based on SVM and a prior corpus, which further classifies spam images into categories such as crime, education, insurance, and product promotion.
Experimental analysis and comparison were carried out on the SPAM public image set and on a collected and curated image set. The results show that the two-layer spam image filtering method achieves fairly satisfactory accuracy and false recognition rates.
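The abstract does not say how the confidence values are computed or fused, so the following is only a minimal sketch of one plausible reading. It assumes that confidence is the share of the k nearest neighbours that agree with the winning label, and that fusion sums the confidences per label across features. The function names `knn_vote` and `fuse_by_confidence` and the toy feature values are hypothetical, not taken from the thesis.

```python
import numpy as np


def knn_vote(train_x, train_y, query, k=3):
    """Classify one query vector with KNN and return (label, confidence).

    Assumption: confidence = fraction of the k nearest neighbours that
    carry the winning label. The thesis may define it differently.
    """
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    best = int(np.argmax(counts))
    return int(labels[best]), float(counts[best] / k)


def fuse_by_confidence(per_feature_results):
    """Sum the confidences per label over all features; return the top label."""
    totals = {}
    for label, conf in per_feature_results:
        totals[label] = totals.get(label, 0.0) + conf
    return max(totals, key=totals.get)


# Toy data: five training images, label 1 = spam, 0 = normal.
labels = np.array([1, 1, 1, 0, 0])
color_feat = np.array([[0.1], [0.2], [0.3], [0.8], [0.9]])     # made-up color feature
gradient_feat = np.array([[1.0], [4.0], [5.0], [2.0], [3.0]])  # made-up gradient feature

color_result = knn_vote(color_feat, labels, np.array([0.25]))
gradient_result = knn_vote(gradient_feat, labels, np.array([2.6]))
fused_label = fuse_by_confidence([color_result, gradient_result])
print(color_result, gradient_result, fused_label)
```

In this toy case the color feature votes spam unanimously while the gradient feature votes normal with only two of three neighbours, so the fused decision follows the more confident feature.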
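The abstract does not give the formula for the chi-square test with relative term frequency, so the following is a hedged sketch rather than the thesis's method. It computes the standard chi-square statistic from a term/class contingency table and, as an assumption, multiplies it by the term's share of all tokens in the target class. The names `chi_square` and `score_terms` and the toy documents are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/class table.

    a: in-class docs containing the term    b: out-of-class docs containing it
    c: in-class docs without the term       d: out-of-class docs without it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def score_terms(docs, labels, target):
    """Score each term seen in the target class.

    Assumed weighting: chi-square times the term's relative frequency
    among all tokens of the target class.
    """
    in_class = [doc for doc, y in zip(docs, labels) if y == target]
    out_class = [doc for doc, y in zip(docs, labels) if y != target]
    tf = Counter(tok for doc in in_class for tok in doc)
    total = sum(tf.values())
    scores = {}
    for term, freq in tf.items():
        a = sum(term in doc for doc in in_class)
        b = sum(term in doc for doc in out_class)
        c = len(in_class) - a
        d = len(out_class) - b
        scores[term] = chi_square(a, b, c, d) * freq / total
    return scores


# Toy corpus of tokenized OCR text (made up for illustration).
docs = [
    ["sale", "sale", "sale", "rare"],
    ["sale", "sale", "rare"],
    ["meeting", "notes"],
    ["meeting", "agenda"],
]
labels = ["promo", "promo", "other", "other"]
scores = score_terms(docs, labels, "promo")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Both "sale" and "rare" appear in every promo document and in no other document, so the plain chi-square statistic cannot separate them. The assumed frequency weighting ranks the more frequent "sale" higher, which is the kind of effect the abstract attributes to the method.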
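[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.41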