
Research on Monitoring Scheduling and Deduplication of Text-Based Sensitive Information

Published: 2019-01-10 14:57
【Abstract】: The development of the Internet has brought great convenience to people's lives and has greatly advanced social progress. At the same time, however, some lawbreakers exploit the convenience and speed of online information dissemination to spread sensitive information containing pornographic, violent-terrorist, reactionary, and other harmful content, which severely harms national security, social development, and people's lives. Retrieving such sensitive information from the vast Internet in a timely manner and keeping it under surveillance has become a research hotspot in the field of network security. To discover sensitive information promptly, this thesis studies monitoring scheduling strategies for sensitive information and the deduplication of sensitive web pages. The main work is as follows:

1. A classified monitoring strategy for sensitive web pages based on page sensitivity is proposed. Sensitive-keyword matching against a page yields the matched keywords and their positions in the page; combining each keyword's intrinsic sensitivity with an influence factor for its position, an algorithm for computing the page's overall sensitivity is given (a minimal sketch of such a scoring scheme follows the abstract). Pages are then classified by sensitivity and monitored at different frequencies, which optimizes the monitoring of sensitive pages and improves the timeliness of discovering sensitive information. Experiments show that the strategy effectively improves both the timeliness of discovery and the proportion of high-priority sensitive information found.

2. A supplementary discovery strategy for sensitive information based on predicting the change times of non-sensitive pages is proposed. Using the number and intervals of a page's recent changes, the time of its next change is predicted, and pages that meet the time condition are crawled (see the prediction sketch after the abstract). This raises the crawl frequency of frequently changing pages and lowers that of unchanging ones, which speeds up the discovery of sensitive information on frequently changing pages and increases the total number of newly discovered sensitive pages. Experimental results show that this strategy complements the sensitivity-based monitoring strategy well and further improves the discovery rate of sensitive information.

3. A deduplication strategy based on sensitive-information digests is proposed. Sensitive-keyword matching yields the positions of a page's sensitive keywords; the sensitive context around each keyword is extracted, and the contexts of all keywords on the page are merged into the page's sensitive-information digest. The similarity of two digests is computed from their edit distance, and pages whose digests are sufficiently similar are treated as duplicates (a digest-and-edit-distance sketch follows the abstract). Experiments show that the strategy noticeably improves the removal of duplicate pages.

4. On the basis of the strategies and methods above, a system for sensitive-information monitoring and duplicate-display removal is designed and implemented. Some of our university's websites were scanned and monitored to test the system's effectiveness and stability. Operation of the test system shows that the proposed discovery and deduplication strategies can find sensitive information in a fairly timely manner.
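The abstract names the ingredients of the point-1 score (per-keyword sensitivity times a position influence factor, summed over matches) but not the exact formula, so the following Python sketch is only an assumed reading of it; the lexicon, position weights, and function names are illustrative, not the author's actual parameters.

```python
# Minimal sketch of a page-sensitivity score: each sensitive-keyword hit
# contributes its own sensitivity weighted by where on the page it appears.
# POSITION_WEIGHTS and the lexicon below are assumed, illustrative values.

POSITION_WEIGHTS = {"title": 2.0, "body": 1.0}  # assumed influence factors

def page_sensitivity(page, lexicon):
    """page: {"title": str, "body": str}; lexicon: {keyword: sensitivity in [0, 1]}."""
    score = 0.0
    for position, text in page.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for keyword, sensitivity in lexicon.items():
            # substring counting, which also suits unsegmented Chinese text
            score += text.count(keyword) * sensitivity * weight
    return score

lexicon = {"forbidden": 0.9, "restricted": 0.5}  # toy sensitive-word list
page = {"title": "forbidden content", "body": "restricted text, restricted again"}
print(page_sensitivity(page, lexicon))  # 1*0.9*2.0 + 2*0.5*1.0 = 2.8
```

A scheduler in the spirit of point 1 would then bucket pages by this score and assign higher recrawl frequencies to the more sensitive buckets.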
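Point 2's predictor is described only as using "the number and intervals of recent changes"; the sketch below assumes the simplest such estimator, where the next change is the last observed change plus the mean of the recent inter-change intervals. The thesis's actual formula may weight the history differently.

```python
from datetime import datetime, timedelta

def predict_next_change(change_times):
    """change_times: ascending datetimes at which the page was observed to change."""
    if len(change_times) < 2:
        return None  # not enough history to form even one interval
    intervals = [b - a for a, b in zip(change_times, change_times[1:])]
    mean_interval = sum(intervals, timedelta()) / len(intervals)
    return change_times[-1] + mean_interval

def due_for_crawl(change_times, now):
    """Crawl a page once its predicted next change time has passed."""
    predicted = predict_next_change(change_times)
    return predicted is not None and now >= predicted

# Toy usage: a page that changed roughly daily is predicted to change in about a day.
history = [datetime(2016, 5, 1), datetime(2016, 5, 2), datetime(2016, 5, 3, 2)]
print(predict_next_change(history))                   # 2016-05-04 03:00:00
print(due_for_crawl(history, datetime(2016, 5, 5)))   # True
```

This matches the stated effect: pages with short predicted intervals come due often, while pages that never change are crawled rarely.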
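Point 3 is concrete enough to sketch end to end: cut a context window around every sensitive-keyword hit, concatenate the windows into a digest, and call two pages duplicates when the edit-distance similarity of their digests clears a threshold. The window size and threshold below are assumed parameters, not values from the thesis.

```python
def sensitive_digest(text, keywords, window=10):
    """Concatenate a character window around every occurrence of each keyword."""
    parts = []
    for kw in keywords:
        start = 0
        while (i := text.find(kw, start)) != -1:
            parts.append(text[max(0, i - window): i + len(kw) + window])
            start = i + len(kw)
    return "".join(parts)

def edit_distance(a, b):
    """Classic single-row dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def similarity(a, b):
    """Edit-distance similarity normalized to [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def is_duplicate(digest1, digest2, threshold=0.8):  # threshold is an assumption
    return similarity(digest1, digest2) >= threshold

# Differences outside the keyword windows do not affect the digests,
# so these two near-identical sentences are recognized as duplicates.
d1 = sensitive_digest("the forbidden plan is ready tonight", ["forbidden"])
d2 = sensitive_digest("the forbidden plan is ready today!!", ["forbidden"])
print(similarity(d1, d2), is_duplicate(d1, d2))  # 1.0 True
```

Comparing short digests instead of whole pages keeps the quadratic edit-distance cost manageable, which is presumably why the thesis digests first and compares second.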
【Degree-granting institution】: Chongqing University
【Degree level】: Master's
【Year conferred】: 2016
【CLC number】: TP393.092;TP391.1


Document No.: 2406451



Link to this article: https://www.wllwen.com/guanlilunwen/ydhl/2406451.html

