Design and Implementation of the CAPTCHA Recognition Module in the "Tianyancha" (天眼查) Distributed Crawler System
Published: 2018-10-24 09:55
[Abstract]: "Tianyancha" is a tool platform offering comprehensive enterprise information lookup and professional mining of relationships between enterprises. It supports queries on business registration information, litigation, trademarks and patents, outbound investment, bidding and tendering, records of dishonesty, abnormal operations, annual reports, recruitment, and news. It covers more than 80 million enterprises nationwide and is updated in step with the industry and commerce bureau websites. By crawling publicly available information on the Internet, the platform presents the relationships between entities visually, provides users with comprehensive and reliable enterprise data analysis, and helps them uncover hidden commercial relationships. It is suited to professionals in finance, investment, law, journalism, and business who need to keep up with a company's operating status and gain insight into its business information. When crawling public information, however, the crawler encounters many kinds of CAPTCHAs, such as filling in idioms, Chinese pinyin, arithmetic problems, and English letters and digits. Neither manual recognition nor traditional recognition techniques can keep up with large-scale data crawling. An efficient CAPTCHA recognition system is therefore needed to speed up information acquisition and to support future data mining. The thesis topic comes from a real application project at the company. Based on an analysis of Tianyancha's CAPTCHA recognition requirements, the thesis designs and implements a deep-learning-based CAPTCHA recognition system.
The specific work completed includes: requirements analysis for the CAPTCHA recognition system; design of the technical architecture; decomposition of the system into three relatively independent parts, namely a deep-learning-based CAPTCHA training subsystem, a CAPTCHA recognition service subsystem, and a crawler application subsystem, with the outline design, detailed design, and implementation of each; a matching architecture upgrade design for the existing Spring and Redis technical architecture; and functional testing of the system. The results have been deployed in Tianyancha's production environment, where the CAPTCHA recognition rate is high and crawling efficiency has improved substantially. The software produced in the thesis has also been granted a software copyright. This successful application shows that machine learning, and deep learning in particular, holds considerable promise for CAPTCHA recognition and merits further study.
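The three-part decomposition described in the abstract can be pictured with a minimal structural sketch. Everything below is an assumption made for illustration: the class names, the method names, and the stubbed "model" (a lookup table rather than a neural network) are hypothetical and are not taken from the thesis, whose actual interfaces are not given here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical type: a "model" maps raw image bytes to the predicted text.
Model = Callable[[bytes], str]


class TrainingSubsystem:
    """Stands in for the deep-learning training subsystem (stubbed)."""

    def train(self, labelled: Dict[bytes, str]) -> Model:
        # Placeholder only: memorise the labelled samples instead of
        # fitting a network. Unknown images map to an empty string.
        table = dict(labelled)
        return lambda image: table.get(image, "")


@dataclass
class RecognitionService:
    """Stands in for the recognition service subsystem."""

    model: Model

    def recognise(self, image: bytes) -> str:
        # Delegate to whatever model the training subsystem produced.
        return self.model(image)


class CrawlerClient:
    """Stands in for the crawler application subsystem."""

    def __init__(self, service: RecognitionService) -> None:
        self.service = service

    def handle_challenge(self, image: bytes) -> Optional[str]:
        # Return the predicted answer, or None when nothing was recognised.
        answer = self.service.recognise(image)
        return answer or None
```

The point of the sketch is only the separation of concerns: the crawler depends on the service, and the service depends on a trained model, so each part can be replaced or upgraded independently.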
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.52
Article ID: 2291052
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2291052.html