Research on Network Information Collection and Intelligent Processing Technology

Published: 2018-04-12 11:20

Topic: network information collection + KNN algorithm; Source: Guangdong University of Technology, master's thesis, 2012

[Abstract]: Whether for research or study, people rely on the Internet to find the latest professional information and news, but the explosive growth of information makes it increasingly difficult to quickly extract what is needed from this sea of information. On the one hand, the volume of online information grows daily and is updated very quickly, so searching takes a great deal of time. On the other hand, online information contains many duplicates and is poorly formatted, which makes searching harder still. Techniques for the rapid collection and intelligent processing of network information have emerged in response. Search engines let users retrieve large amounts of information but do not extract, organize, or process it. As information technology advances, users demand more of information retrieval, and search has moved from "general-purpose" to "personalized and intelligent." Many information collection tools are now available. They meet users' needs for obtaining information to some extent, but their processing of that information is unsatisfactory. Because text makes up most of the information on the network, automatically classifying web text has become the top priority of information processing.
Building on an analysis of existing information collection and processing techniques, this thesis first introduces the web crawler, a tool for fetching web pages, and analyzes how it collects page information as well as the methods used for web page deduplication and information extraction. It then studies in depth text classification, a key technique in intelligent information processing, improves existing feature selection methods and classification algorithms, and builds an automatic text classifier based on an improved KNN algorithm. With the Sogou corpus as the training corpus for the classification model, experiments determine the best K value and feature dimension for that corpus and verify the classification performance of the improved KNN algorithm.
The contributions of this thesis are:
(1) The feature selection method used in text processing is improved. The thesis proposes synonym merging: using the Tongyici Cilin thesaurus (《同义词词林》), synonyms among the feature terms are replaced and counted before feature selection, which effectively reduces the dimension of the feature space.
(2) An improved KNN algorithm is proposed. Class center vectors are introduced to improve the similarity formula: the distance between the test text to be classified and each category is added to the KNN similarity formula as a parameter, and the ratio of the number of feature terms that appear in both texts to the larger of the two texts' feature term counts serves as an adjustment factor.
(3) An automatic text classifier is built on the improved KNN algorithm. In the classification stage it first considers the relationship between the test text and each category. Only when that relationship is unclear is the test text compared with all training texts, and the category is then decided from the result of that comparison.
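
The three contributions listed in the abstract can be illustrated with a small Python sketch. The abstract does not give the thesis's actual formulas, so everything here is an assumption made for illustration: the `SYNONYMS` table is a hypothetical stand-in for a thesaurus such as Tongyici Cilin, cosine similarity is used as the base measure, the three terms of the adjusted similarity are simply multiplied together, and the `margin` threshold that decides when the class-level relationship counts as "unclear" is an invented parameter.

```python
import math
from collections import defaultdict

# Hypothetical synonym table standing in for a thesaurus such as Tongyici Cilin:
# each word maps to one representative feature term.
SYNONYMS = {"automobile": "car", "auto": "car", "soccer": "football"}

def vectorize(tokens):
    """Term-frequency vector built after merging synonyms into one feature."""
    vec = defaultdict(float)
    for tok in tokens:
        vec[SYNONYMS.get(tok, tok)] += 1.0
    return dict(vec)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def overlap_factor(a, b):
    """Shared feature terms divided by the larger of the two term counts."""
    if not a or not b:
        return 0.0
    return len(a.keys() & b.keys()) / max(len(a), len(b))

def class_centers(train):
    """Mean vector per label; train is a list of (vector, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for vec, label in train:
        acc = sums.setdefault(label, defaultdict(float))
        for t, w in vec.items():
            acc[t] += w
        counts[label] += 1
    return {lab: {t: w / counts[lab] for t, w in acc.items()}
            for lab, acc in sums.items()}

def classify(doc, train, k=3, margin=0.2):
    """Two-stage classification: class centers first, adjusted KNN as fallback."""
    centers = class_centers(train)
    to_center = {lab: cosine(doc, c) for lab, c in centers.items()}
    ranked = sorted(to_center, key=to_center.get, reverse=True)
    # Stage 1: accept the nearest class center if it is clearly ahead.
    if len(ranked) == 1 or to_center[ranked[0]] - to_center[ranked[1]] >= margin:
        return ranked[0]
    # Stage 2: KNN over all training texts. Each neighbour's similarity is
    # scaled by the overlap factor and by the similarity between the test
    # text and that neighbour's class center (an assumed combination).
    scored = sorted(
        ((cosine(doc, vec) * overlap_factor(doc, vec) * to_center[lab], lab)
         for vec, lab in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for score, lab in scored:
        votes[lab] += score
    return max(votes, key=votes.get)

train = [
    (vectorize(["car", "engine", "wheel"]), "auto"),
    (vectorize(["automobile", "engine", "road"]), "auto"),
    (vectorize(["football", "goal", "team"]), "sports"),
    (vectorize(["soccer", "team", "match"]), "sports"),
]
print(classify(vectorize(["auto", "engine"]), train))
print(classify(vectorize(["car", "team", "goal"]), train, margin=0.5))
```

In this sketch, synonym merging shrinks the vocabulary before any weighting takes place, and the class-center check lets clear-cut texts skip the more expensive comparison against every training text. The values of `k` and `margin` would need tuning on a real corpus, much as the thesis tunes K and the feature dimension on the Sogou corpus.
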
[Degree-granting institution]: Guangdong University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1;TP274.2
