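Implementation of a Medical Advertisement Monitoring System Based on Web Content Mining
Published: 2018-04-27 00:04
Topics: Web content mining + web crawler; Source: Harbin University of Science and Technology (哈尔滨理工大学), master's thesis, 2011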
[Abstract]: With the rapid development of the Internet, the large population of Internet users is drawing more and more advertisers to the online advertising market, and the volume of online advertising has grown sharply. Illegal advertisements have proliferated along with it, and illegal medical advertisements are the most harmful. Because the volume of information online is enormous, manual review alone cannot keep up with collecting and processing it. Research into the relevant information technologies is therefore needed in order to build an automated monitoring system for online medical advertisements.
This thesis studies web crawling, web page information extraction, and web page classification in depth and proposes a solution for each. Building on these techniques, it implements an online medical advertisement monitoring system that addresses the problem of monitoring medical advertisements on the Internet reasonably well. The main work is as follows:
1. Existing web crawler technology is studied in depth, and the working principle of crawlers is described in detail. Based on the structure of web pages, and using open-source page extraction tools, a web page information extraction method is proposed. Test results show that the method achieves good efficiency and accuracy.
2. The current state and processing flow of web page classification are introduced, and the theory behind each module involved is explained in detail. On this basis, making full use of relevant open-source tools, an improvement is proposed to address the shortcomings of the χ² statistic in text classification, and a classification module is built to decide whether the information fetched by the crawler is medical information. Experimental results show that the module performs well.
3. A medical online advertisement monitoring system is designed and implemented. It automatically tracks and processes medical advertisements on the web, supports distributed computing, and offers good operability and a clear display interface.
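The abstract refers to the χ² statistic used for feature selection in text classification. As a point of reference, here is a minimal sketch of the standard (unimproved) statistic, χ²(t, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), computed from document counts. It does not reproduce the thesis's improved variant, which the abstract does not specify. The function names `chi_square` and `rank_terms` and the toy documents are hypothetical and chosen only for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a class (2x2 contingency table).

    a: in-class documents containing the term
    b: out-of-class documents containing the term
    c: in-class documents without the term
    d: out-of-class documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. an empty class): no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels, target):
    """Rank terms by chi-square score with respect to the class `target`.

    docs: list of token lists; labels: one class label per document.
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, so each term counts once per document.
        (pos_df if y == target else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus (hypothetical data, not from the thesis).
docs = [
    ["drug", "cure", "cancer"],
    ["drug", "discount"],
    ["football", "match"],
    ["match", "ticket", "discount"],
]
labels = ["medical", "medical", "other", "other"]

for term, score in rank_terms(docs, labels, "medical"):
    print(f"{term}\t{score:.3f}")
```

On this toy corpus, "drug" (present only in medical documents) and "match" (present only in non-medical documents) receive the same score, because the squared term discards the direction of the association. This is a commonly cited limitation of the plain statistic; whether it is the defect the thesis addresses is not stated in the abstract.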
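[Degree-granting institution]: Harbin University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP277;TP393.09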
Article ID: 1808264
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1808264.html