Research on Methods for Identifying Undesirable Uyghur and Kazakh Web Pages

Published: 2018-10-09 18:39

[Abstract]: With the rapid development of information technology, the Internet has become an important tool for publishing and obtaining information quickly. In recent years the number of Uyghur- and Kazakh-language websites has grown rapidly; according to incomplete statistics, there are currently more than two thousand such websites in China, and the number keeps increasing. While minority-language websites provide ethnic minority users with rich information about their own cultures, some lawbreakers use the Internet to spread undesirable information such as reactionary and inflammatory speech. Such information seriously distorts the Party's principles and policies and misrepresents the facts, can easily lead the public to irrational judgments, and poses a great hidden danger to social harmony and stability. How to monitor and filter such information effectively has become a concern of government departments, and techniques for identifying undesirable Uyghur and Kazakh web pages have become a research focus for research institutions. The author first designs a model for identifying Uyghur and Kazakh websites and uses search engine technology to discover such websites on the Internet and collect data from them. The following techniques involved in the undesirable-web-page identification model are also studied: extraction of the main text of Uyghur and Kazakh web pages, Uyghur and Kazakh word segmentation, feature word extraction, text classification algorithms, and classifier performance metrics. Based on an analysis of the characteristics of undesirable Uyghur and Kazakh web pages, the chi-square test is used to extract feature words from the training set. To examine how different text classification algorithms affect the discriminative performance of the identification model, the author studies support vector machine (SVM), k-nearest neighbors (KNN), and naive Bayes classifiers, and designs a multiple linear regression model based on the principle of multiple linear regression. These four methods are tested and compared. The results show that when texts are represented as weighted feature vectors and the SVM uses a radial basis function (RBF) kernel, the identification model built on this algorithm reaches accuracy and recall above 95%, with stable performance and relatively high efficiency. The algorithm has also achieved good results in practical application.
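As a rough illustration of two steps named in the abstract, chi-square feature word extraction and classifier evaluation, the following minimal Python sketch may help. It is not the thesis's implementation: the function names, the binary labels (1 = undesirable, 0 = normal), and the assumption that pages are already segmented into token lists are all hypothetical choices made for the example.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: undesirable pages containing the term
    b: normal pages containing the term
    c: undesirable pages lacking the term
    d: normal pages lacking the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # The term appears in every page or in none: it carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels, top_k):
    """Return the top_k terms by chi-square score.

    docs: list of token lists (assumed to be segmented already)
    labels: list of 0/1 labels, 1 = undesirable (hypothetical encoding)
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Count document frequency, so each term counts once per page.
        (pos_df if label == 1 else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def accuracy_recall(y_true, y_pred):
    """Accuracy over all pages and recall on the undesirable class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = correct / len(y_true) if y_true else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall
```

Note that the chi-square score is symmetric between the two classes: a term that occurs only in normal pages scores as high as one that occurs only in undesirable pages, since both separate the classes equally well. The selected terms would then serve as the dimensions of the weighted feature vectors passed to a classifier such as an SVM.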
[Degree-granting institution]: Xinjiang Agricultural University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP393.092

[Similar literature]