
Research on Rule-Based and Statistical Identification of Harmful Online Information

Published: 2018-12-16 13:21
[Abstract]: The rapid development of the Internet has had a profound impact on society and everyday life. As a carrier of information, the Internet offers advantages that traditional print media cannot match: it provides a high-quality platform for disseminating information in politics, economics, culture, and other fields, and it has created a new channel for interpersonal communication. Alongside this convenience, however, come negative effects. In the virtual online environment every user is reduced to a string of symbols, and the information and opinions published through personal web pages, Weibo, WeChat official accounts, forums, and other online media carry a degree of uncertainty. Even though many platforms apply pre-publication review and post-publication filtering, some users with concealed identities and poor moral or cultural awareness still flood every corner of the network with false, pornographic, politically sensitive, fraudulent, and superstitious content, corrupting public morals, misleading readers, and seriously harming people's physical and mental health.

As a social medium with an enormous user base, Weibo is a platform for sharing, spreading, and obtaining information on the basis of user relationships. Posts can be pushed to followers in real time through clients or the platform itself, enabling fast dissemination, and followers can interact with the author by commenting, or can repost, comment on, and favorite posts, which widens the reach and influence of each message. These same characteristics also make Weibo a hiding place for harmful information, and it has therefore become a subject of study for many researchers. To purify the online environment, shield minors from harmful information, and give Internet users a better search experience, it is necessary to control the publication and spread of such information and to strengthen supervision and management with appropriate measures.

To this end, this thesis takes the identification of harmful online information as its goal and conducts experimental research using existing Chinese text-mining techniques. A crawler collects the comments and reposts that Weibo users attach to selected posts as raw data. The raw data is cleaned of irrelevant symbols, segmented into words, annotated with dependency relations, and summarized with word-frequency statistics, and the processed data is used to extract a feature set for each text. To improve segmentation accuracy, a lexicon of harmful words is constructed, containing a basic word list, a synonym list, an abbreviation list, and a table of dependency relations between words. A statistics-based feature-extraction algorithm is combined with dependency analysis to extract text features effectively, and a text-classification model is built with the naive Bayes algorithm. The model is then applied to classifying user comments on Weibo; experimental tests of the classifier show that, compared with the unimproved baseline, both precision and recall increase noticeably. Finally, the thesis summarizes the research, states its contributions and limitations, and outlines how the work will be refined in follow-up studies.
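To make the described pipeline concrete, the following is a minimal sketch of one way it could be implemented in Python, assuming jieba for segmentation and scikit-learn for the naive Bayes classifier; the thesis does not name these libraries, the file name `bad_words.txt` is hypothetical, and the dependency-relation features and rule-based matching that the thesis combines with the statistical features are omitted here for brevity.

```python
# Hypothetical sketch of the pipeline described in the abstract: a user lexicon of
# harmful words guides segmentation, word-frequency counts serve as statistical
# features, and a naive Bayes classifier is trained and scored with precision/recall.
# Dependency-relation features from the thesis are NOT reproduced in this sketch.
import re

import jieba  # Chinese word segmentation (assumed tool, not specified by the thesis)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Hypothetical lexicon file: harmful words, synonyms, and abbreviations,
# one entry per line, so the segmenter keeps them as single tokens.
jieba.load_userdict("bad_words.txt")


def clean(text: str) -> str:
    """Strip URLs, @mentions, and symbols irrelevant to classification."""
    text = re.sub(r"https?://\S+|@\S+", " ", text)
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]+", " ", text)


def segment(text: str) -> str:
    """Return whitespace-joined tokens so CountVectorizer can reuse them."""
    return " ".join(jieba.lcut(clean(text)))


def train_and_evaluate(comments, labels):
    """comments: crawled Weibo comments/reposts; labels: 1 = harmful, 0 = normal."""
    docs = [segment(c) for c in comments]
    vectorizer = CountVectorizer(token_pattern=r"\S+")  # word-frequency features
    X = vectorizer.fit_transform(docs)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42)
    clf = MultinomialNB().fit(X_train, y_train)
    pred = clf.predict(X_test)
    return precision_score(y_test, pred), recall_score(y_test, pred)
```

In the thesis itself, the improvement in precision and recall comes from augmenting such word-frequency features with the dependency-relation tables of the harmful-word lexicon; a fuller implementation would add those relational features on top of this bag-of-words baseline.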
【Degree-granting institution】: Central China Normal University (华中师范大学)
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.1; TP393.092


Article ID: 2382415



Link to this article: https://www.wllwen.com/guanlilunwen/ydhl/2382415.html


