Research and Implementation of Website Rating Based on Machine Learning

Published: 2018-06-20 05:58

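Topics: content classification + deep learning; source: 2017 master's thesis, University of Electronic Science and Technology of China (电子科技大学)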

[Abstract]: Harmful content has long existed on the Internet and its volume keeps growing. Most of it is pornographic, and it also includes illegal content such as gambling and pyramid schemes. Many parts of society have contributed ideas and effort to cleaning up the Internet, and the state has issued laws and regulations to govern the online environment, yet harmful content persists despite repeated bans and has become rampant. Many systems for filtering harmful content already exist, in software or hardware form, but most of them work in isolation and each rebuilds its own blacklist. The goal of this system is to actively inspect website content and build a shared database of harmful content that provides common data support for filtering systems. The system studies deep learning algorithms for image classification and text classification and applies these newer algorithms to the task of classifying harmful content. Unlike traditional knowledge-engineering or statistical methods, which require manual feature extraction, deep learning learns feature extraction automatically and achieves higher classification accuracy in image recognition. For text classification, a new method is proposed: the long text of a web page is cut into short texts, each short text is classified, and the results are aggregated into the proportion of pornographic text on the page. The threshold on this proportion is adjusted for the audience served so as to meet the filtering needs of different groups. For image classification, deep convolutional models are the most effective. They have advanced considerably in recent years and have developed into several types, such as linear, local two-branch, and local multi-branch models.
This thesis studies how the different model types perform on the harmful-image classification task, trains several deep convolutional models by fine-tuning, and finally selects the most suitable image classification algorithm according to each model's computational cost and accuracy. The system design takes extensibility and portability fully into account, and old or idle machines can serve as worker nodes, which saves project funds. The system consists of five parts: a web crawler module, a text classification module, an image classification module, a data storage module, and a data display module. Of these, the web crawler, text classification, and image classification modules are the main research focus of this thesis.
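The text-classification idea in the abstract (cut the page text into short segments, classify each one, aggregate into a proportion, compare against an audience-specific threshold) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the 50-character segment length, the threshold values, and the keyword-based `keyword_stub` classifier are all hypothetical stand-ins chosen only to make the example runnable. A real system would plug a trained short-text classifier in place of the stub.

```python
from typing import Callable, Dict, List


def split_into_segments(text: str, max_len: int = 50) -> List[str]:
    """Cut page text into consecutive segments of at most max_len characters.

    The fixed-length cut and the value 50 are assumptions for illustration.
    """
    text = text.strip()
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def flagged_ratio(segments: List[str], is_flagged: Callable[[str], bool]) -> float:
    """Return the fraction of segments that the classifier flags."""
    if not segments:
        return 0.0
    flagged = sum(1 for segment in segments if is_flagged(segment))
    return flagged / len(segments)


# Hypothetical per-audience thresholds: a stricter (lower) value for children.
AUDIENCE_THRESHOLDS: Dict[str, float] = {"children": 0.05, "general": 0.30}


def should_block(text: str, is_flagged: Callable[[str], bool],
                 audience: str, max_len: int = 50) -> bool:
    """Block the page when its flagged ratio reaches the audience threshold."""
    ratio = flagged_ratio(split_into_segments(text, max_len), is_flagged)
    return ratio >= AUDIENCE_THRESHOLDS[audience]


def keyword_stub(segment: str) -> bool:
    """Stand-in for a trained short-text classifier (NOT the thesis model)."""
    return "BLOCKME" in segment


if __name__ == "__main__":
    # 200 characters, so four 50-character segments; only the second is flagged.
    page = "a" * 50 + "BLOCKME" + "a" * 143
    print(flagged_ratio(split_into_segments(page), keyword_stub))
    print(should_block(page, keyword_stub, "children"))
    print(should_block(page, keyword_stub, "general"))
```

Keeping the threshold in a per-audience table means the same classification result can serve several filtering policies without re-running the classifier on the page.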
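[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP393.092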

[References]