Research on a Classifier for an Agricultural Information Search Engine

Published: 2018-03-23 12:02

Topic: naive Bayes　Focus: text information classification　Source: Northeast Agricultural University, master's thesis, 2015

[Abstract]: With the rapid development of the Internet, society has entered an era of information explosion. The volume of agricultural knowledge available online has surged, which makes it easier for agricultural practitioners to look up agricultural information. Knowledge means wealth, and practitioners extract valuable information from these sources. However, a massive volume of agricultural information does not guarantee that the required information can be found quickly and effectively, so fast positioning and classified retrieval of fine-grained agricultural information is necessary. This thesis takes the classifier of an agricultural information search engine as its research object. It reviews the current state of text classifiers and their development in China and abroad. Building on classification feature extraction, training samples, and existing classification algorithms, it starts from the way feature terms are extracted from agricultural information text and proposes a feature extraction method tailored to such text. On the basis of these feature terms it builds an agricultural information text training corpus. Because classification algorithms differ in effectiveness, it classifies agricultural information with an improved and optimized naive Bayes classifier, and it designs and implements an agricultural information search engine classifier system.

No two leaves in the world are exactly alike; every object is unique, and every text likewise has its own distinguishing features that can be used for recognition and classification. This thesis discusses, implements, and experimentally compares four text feature selection methods: information gain, mutual information, the chi-square statistic, and document frequency. It proposes a feature extraction method for agricultural information text based on document frequency, combined with TF-IDF, the vector space model, and cosine similarity. On this basis, following the principles of agricultural information classification and according to how distinguishable the texts are, it selects texts for each agricultural category and builds the agricultural information text training corpus.

No classification algorithm is absolutely superior; each has its own classification bias, and classifier performance varies with the text. This thesis experimentally compares four classification algorithms on agricultural information text: decision trees, k-nearest neighbors, support vector machines, and naive Bayes. It then applies and improves the naive Bayes classifier in two respects. First, the computation formula is changed: the binary model is converted to a multinomial model, the multinomial model formula is established, and the experimental results are compared. Second, the deployment is changed: the classifier is deployed in a distributed fashion across multiple computers, results are ranked with a Top-N algorithm, and the experimental results are compared.

Based on the results of several groups of classification experiments and on software design theory, the thesis combines the improved naive Bayes algorithm with the agricultural information text training corpus to design and implement the agricultural information search engine classifier system, and it reports test results for classifying agricultural information text. The experimental results show that the improved naive Bayes classifier achieves higher classification accuracy and faster classification speed, making it a practical and reliable agricultural information search engine classifier system. In summary, building on agricultural information text crawled by an agricultural information search engine, this thesis studies the agricultural information text classifier in terms of feature extraction, text training, and classification algorithms. Through experimental comparison it proposes a feature extraction method for agricultural information classification, builds an agricultural information text training corpus, improves the naive Bayes classifier at the algorithm level, and deploys the classifier system in a distributed fashion, thereby improving and optimizing the agricultural information text classifier. The thesis provides a theoretical basis and a basic experimental platform for agricultural information text classification, and the work can also be applied and promoted in practice.
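The abstract names document frequency, TF-IDF, the vector space model, and cosine similarity as the ingredients of its feature extraction, but it does not give the formulas. The following is a minimal Python sketch of how these pieces commonly fit together, not the thesis's implementation. The function names, the toy English tokens, the `min_df` threshold, and the weighting variant (term frequency normalized by document length, `idf = log(N / df)`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def select_by_df(docs, min_df=1):
    """Keep terms whose document frequency reaches min_df (hypothetical threshold)."""
    return {t for t, c in document_frequency(docs).items() if c >= min_df}


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per tokenized document."""
    n = len(docs)
    df = document_frequency(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        # Assumed variant: tf normalized by document length, idf = log(N / df).
        vectors.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors in the vector space model."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus standing in for tokenized agricultural texts (illustrative only).
docs = [
    ["rice", "planting", "irrigation"],
    ["rice", "pest", "control"],
    ["cattle", "feed", "breeding"],
]
vecs = tfidf_vectors(docs)
```

In this toy corpus the two rice documents share a term and so have a positive cosine similarity, while the rice and cattle documents share none and score zero. A training corpus built along the lines the abstract describes could use such scores to judge how representative a candidate text is of its category.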
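The first improvement described in the abstract replaces the binary naive Bayes model, which only records whether a term occurs, with a multinomial model, which counts how often it occurs. The abstract does not state the exact formula, so the sketch below uses the textbook multinomial form with Laplace smoothing as an assumption. The class name `MultinomialNB`, the `alpha` parameter, and the toy data are hypothetical, and `top_n` ranks categories on a single machine; the distributed deployment is not shown.

```python
import heapq
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Textbook multinomial naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant, an assumed default

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per category
        self.term_counts = defaultdict(Counter)  # term occurrences per category
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        self.total_docs = len(labels)
        return self

    def log_scores(self, tokens):
        """Return log P(c) + sum over token occurrences of log P(t | c) per category."""
        scores = {}
        vocab_size = len(self.vocab)
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / self.total_docs)
            for t in tokens:  # repeated tokens contribute once per occurrence
                if t in self.vocab:  # terms unseen in training are skipped
                    score += math.log((counts[t] + self.alpha) / denom)
            scores[label] = score
        return scores

    def top_n(self, tokens, n=1):
        """Return the n highest-scoring (category, log-score) pairs."""
        scores = self.log_scores(tokens)
        return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])


# Toy training data standing in for the agricultural training corpus.
train_docs = [
    ["rice", "planting", "rice", "irrigation"],
    ["wheat", "planting", "fertilizer"],
    ["cattle", "feed", "breeding"],
    ["pig", "feed", "vaccine"],
]
train_labels = ["crops", "crops", "livestock", "livestock"]

model = MultinomialNB().fit(train_docs, train_labels)
ranked = model.top_n(["rice", "planting", "feed"], n=2)
```

Working in log space avoids underflow when many small probabilities are multiplied. Because each score depends only on per-category counts, scoring can in principle be split across machines and the partial rankings merged, which is one plausible reading of the Top-N step the abstract mentions.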

[Degree-granting institution]: Northeast Agricultural University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.1

[References]