Research on Key Technologies for Statistical Analysis of Ethnic Minority Information on the Internet

Published: 2018-06-20 19:12

Topics: focused search + public opinion monitoring; Source: Minzu University of China (中央民族大学), master's thesis, 2012

[Abstract]: With the rapid development of networks, the Internet has become the carrier of massive amounts of information. Search engines have made the Internet considerably more convenient to use, and they have also become an effective tool for studying the behavior of website users. In recent years, along with the rise of the Internet, ethnic issues, a major obstacle to China's development, have spread online with growing prominence. How to use existing search engines to monitor the spread of ethnic issues on the Internet has become a major topic in online public opinion monitoring. This thesis studies the key technologies involved in extracting specific information about ethnic issues from the web.
The thesis first reviews the development and principles of focused search engines and related key technologies, with emphasis on common web page classification algorithms, extraction of key information from web pages, and crawling strategies. These provide the theoretical basis for the design and implementation of the search-engine-based focused crawler algorithm proposed here. Search engine results do not fully match user needs, and in some cases the amount of information returned is clearly insufficient. Performing a further focused search over search engine results therefore has some value.
Information on the Internet appears mainly as HTML pages, and HTML has distinct classification characteristics. Much of the information in a page's code has little relevance to the search, which makes optimizing the mechanism for searching page code extremely important. Because the search is strongly goal-directed, the search requirements, such as the features shared by pages about a specific event, are clearly structured. Space vectors are therefore used to simplify the page code, and the algorithm is designed on the basis of the vector space model.
The algorithm divides the model into two modules: a Baidu search module and a focused search module. The Baidu search module submits the search terms to the Baidu search engine and collects the URLs and related information of the results, yielding an initial URL queue. The focused search module takes this initial URL queue as its starting point and, based on the vector space model, uses the KNN classification algorithm to run a focused crawl of the web and obtain the corresponding search results.
Finally, the thesis completes a preliminary implementation of the algorithm and gives a brief statistical analysis of the results. By analyzing the characteristics of the information in the search results against real-world events that influence online dissemination, it finds that the search results match the sources of sensitive information in society. This demonstrates the operability and effectiveness of the search results and provides data to support further optimization of the implementation.
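The two-module design described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation, which the abstract does not give. It assumes term-frequency vectors, cosine similarity, and a similarity-weighted KNN vote, and it expands links only from pages classified as relevant. The seed list stands in for the URL queue the Baidu search module would produce, and `TOY_WEB`, `LABELED`, `fetch`, and all function names are hypothetical. English tokens are used for brevity; Chinese pages would first need a word segmenter.

```python
import math
import re
from collections import Counter, deque

def vectorize(text):
    """Term-frequency vector over lowercase word tokens (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_label(vec, labeled, k=3):
    """Similarity-weighted vote among the k most similar labeled vectors."""
    nearest = sorted(((cosine(vec, v), lab) for v, lab in labeled),
                     key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for sim, lab in nearest:
        votes[lab] += sim
    best_label, best_score = max(votes.items(), key=lambda pair: pair[1])
    # A page with no overlap with any neighbor is treated as irrelevant.
    return best_label if best_score > 0 else "irrelevant"

def focused_crawl(seed_urls, fetch, labeled, k=3, max_pages=50):
    """Breadth-first crawl that follows links only from relevant pages."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    relevant, fetched = [], 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        fetched += 1
        if knn_label(vectorize(text), labeled, k) == "relevant":
            relevant.append(url)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return relevant

# Hypothetical labeled training examples.
LABELED = [(vectorize(t), lab) for t, lab in [
    ("ethnic minority festival culture news", "relevant"),
    ("minority region ethnic policy report", "relevant"),
    ("ethnic culture heritage minority language", "relevant"),
    ("football match score league", "irrelevant"),
    ("stock market price shares", "irrelevant"),
    ("smartphone review camera battery", "irrelevant"),
]]

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed/a": ("news report on ethnic minority culture festival", ["page/b", "page/c"]),
    "seed/x": ("football league match score update", ["page/d"]),
    "page/b": ("minority language heritage and ethnic culture", []),
    "page/c": ("stock market shares price falls", ["page/e"]),
    "page/d": ("ethnic minority policy report", []),
    "page/e": ("ethnic culture festival", []),
}

def fetch(url):
    """Stand-in for an HTTP fetch plus HTML text and link extraction."""
    return TOY_WEB[url]

# The seed list plays the role of the initial URL queue from the search module.
result = focused_crawl(["seed/a", "seed/x"], fetch, LABELED)
print(result)  # ['seed/a', 'page/b']
```

In this sketch `page/d` and `page/e` are never visited, because the pages linking to them are classified as irrelevant. That is the trade-off of this expansion policy: it keeps the crawl on topic but can miss relevant pages that are reachable only through off-topic ones.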
[Degree-granting institution]: Minzu University of China (中央民族大学)
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.09

[Similar Literature]