Research and Implementation of a Method for Filtering Harmful Content in Mobile Phone Short Messages

Published: 2018-05-19 07:46

Topics: short message + word segmentation; Source: Shanghai Jiao Tong University, 2008 master's thesis

[Abstract]: Mobile phone short messages (SMS) have grown explosively in recent years. While they bring great convenience to users, they have also become a serious information-security risk. Pornographic and violent content, political rumors, reactionary speech, fraud, and illegal advertising spread through this new channel and have become one of the significant factors affecting social stability. Illegal short messages test society's ability to respond to unlawful harm. Preventing and combating this new type of crime, which exploits modern information technology, is a new challenge for public security organs, procuratorates, and courts, as well as for banks and the industry and information technology authorities. This thesis proposes a short message classification and filtering mechanism based on text content classification, designs an improved short message filtering model based on the Bayesian algorithm, and develops a platform for intercepting and filtering text short messages. It describes the implementation of the model's key functional modules, which together identify short message content and filter messages automatically. The main work is as follows. First, based on the characteristics of short message classification, the thesis analyzes the asymmetry of misclassification weights. In normal use, what people least want is for a legitimate message to be misjudged as harmful and filtered out. To minimize the expected loss, classification accuracy must be high, and the weight assigned to misjudging a legitimate message as harmful must be greater than the weight assigned to misjudging a harmful message as legitimate. Second, the thesis designs the main modules for short message classification and filtering: short message collection, Chinese word segmentation, feature selection, and short message classification and filtering.
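The asymmetric misclassification weights described above can be illustrated with a small sketch. It is not the thesis's improved model, which the abstract does not specify. It assumes a multinomial naive Bayes classifier with Laplace smoothing, messages that are already segmented into tokens, and a hypothetical parameter `cost_ratio` that states how many times worse it is to block a legitimate message than to let a harmful one through. A message is blocked only when the posterior odds of "spam" exceed that ratio, which is the minimum-expected-loss rule for two classes.

```python
import math
from collections import Counter


class CostSensitiveNaiveBayes:
    """Multinomial naive Bayes with an asymmetric decision threshold.

    Assumption: `cost_ratio` is the cost of blocking a legitimate ("ham")
    message divided by the cost of passing a harmful ("spam") one.
    """

    def __init__(self, cost_ratio=9.0):
        self.cost_ratio = cost_ratio
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.doc_counts = {"ham": 0, "spam": 0}
        self.vocab = set()

    def fit(self, samples):
        # samples: iterable of (tokens, label); both labels must occur.
        for tokens, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def _log_joint(self, tokens, label):
        # log P(label) + sum of log P(token | label), Laplace-smoothed.
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # tokens never seen in training are skipped
                log_prob += math.log((self.word_counts[label][tok] + 1) / denom)
        return log_prob

    def predict(self, tokens):
        # Block only if P(spam | x) / P(ham | x) > cost_ratio.
        log_odds = self._log_joint(tokens, "spam") - self._log_joint(tokens, "ham")
        return "spam" if log_odds > math.log(self.cost_ratio) else "ham"
```

With `cost_ratio=1.0` the rule reduces to ordinary naive Bayes. Raising it makes the filter more reluctant to block, so borderline messages are delivered instead of discarded.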
Finally, the model is tested, and evaluation metrics borrowed from text classification and information retrieval are used to assess the quality of the platform's experimental results. The distinguishing features and innovations of the design and implementation lie in three areas. First, short message filtering is designed and implemented on the short message server. Unlike the usual approach of filtering on the handset, the server receives at the same time the large volume of identical messages sent in bulk through an SMS modem ("SMS cat"). Once one of them is judged to be spam, the others can be judged to be spam as well and discarded. This saves network traffic and avoids the limited processing power and low filtering efficiency of ordinary handsets. Second, the Chinese word segmentation module uses a multi-level hash table to look up Chinese dictionary entries quickly, which is much faster than querying a database-backed word list and so improves segmentation efficiency; segmentation uses the maximum matching method, which improves segmentation accuracy. Third, feature selection combines document frequency with term frequency. This reflects both how widespread a term is across documents of the same class and how much the term contributes to the meaning of an individual document. The method is closer to actual conditions than document frequency alone and purifies the classification feature vector more effectively. Text classification and information filtering techniques are thus brought into the short message filtering platform, and the experimental results show that the automatic filtering platform has good application prospects.
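The second contribution, dictionary lookup through a multi-level hash table combined with maximum matching, might look roughly like the sketch below. The abstract gives neither the exact hash layout nor the matching direction, so both are assumptions here: the index is keyed by first character and then by word length, matching runs left to right (forward maximum matching), and `max_len` is a hypothetical cap on word length. Python's `dict` and `set` stand in for the hash tables.

```python
from collections import defaultdict


def build_dictionary(words):
    """Two-level hash index: first character -> word length -> set of words."""
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if word:
            index[word[0]][len(word)].add(word)
    return index


def contains(index, word):
    # Use .get so that lookups never create empty buckets in the defaultdicts.
    by_length = index.get(word[0])
    if by_length is None:
        return False
    return word in by_length.get(len(word), ())


def forward_max_match(text, index, max_len=4):
    """Greedy left-to-right segmentation, trying the longest candidate first."""
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or contains(index, candidate):
                tokens.append(candidate)
                pos += size
                break
    return tokens
```

For example, with a dictionary containing 恭喜, 中奖, 汇款, 银行, 账号, and 银行账号, the text 恭喜中奖请汇款到银行账号 is split so that the longer entry 银行账号 wins over 银行 followed by 账号.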
In keeping with the joint directive issued by the Ministry of Public Security, the Ministry of Industry and Information Technology, the Ministry of State Security, and the State Council Information Office, the author believes that the methods studied in this thesis can help investigate and solve a number of illegal short message cases, and monitor and block harmful public short messages involving major sensitive events.
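Returning to the third contribution in the abstract, feature selection that combines document frequency with term frequency, the following sketch shows one way such a score could be formed. The abstract does not give the thesis's formula, so the scoring used here is an assumption: for one class, each term is scored as the fraction of class documents that contain it multiplied by its share of all token occurrences in the class. The function name `select_features` and the parameter `top_k` are hypothetical.

```python
from collections import Counter


def select_features(class_docs, top_k):
    """Rank the terms of one class by document-frequency ratio * term-frequency ratio.

    class_docs: list of token lists, one per message of the class.
    """
    doc_freq, term_freq = Counter(), Counter()
    for tokens in class_docs:
        term_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))   # each document counts once per term
    n_docs = len(class_docs)
    n_tokens = sum(term_freq.values())
    scores = {
        term: (doc_freq[term] / n_docs) * (term_freq[term] / n_tokens)
        for term in term_freq
    }
    # Highest score first; ties are broken by the term itself for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]
```

A term that appears many times but only in one message is penalized by the document-frequency factor, while a term spread thinly across many messages is penalized by the term-frequency factor, which matches the abstract's stated aim of capturing both properties.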
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree granted]: 2008
[Classification number]: TN929.53

[References]