Research on Deduplication and Similarity Judgment of Chinese Agricultural Web Pages

Published: 2018-11-13 09:04

[Abstract]: With the rapid development of network information technology, the construction and service level of agricultural informatization have improved greatly. The massive volume of duplicated agricultural information on the Internet brings convenience to people working in agriculture, but it also makes it harder to obtain useful information quickly and accurately. How to manage duplicate and near-duplicate agricultural web pages effectively has therefore become an important research topic for agricultural vertical search engines. The main work of this thesis is as follows:
1) The key techniques of text deduplication and similarity judgment are studied in depth, including web page preprocessing, main-content extraction, Chinese word segmentation, feature weighting algorithms, web page deduplication methods, text similarity calculation methods, and similarity evaluation criteria. Using an agricultural web page corpus as the basis, the thesis focuses on web page deduplication, feature weighting, and similarity calculation.
2) The criteria for defining duplicate and near-duplicate pages among Chinese agricultural web pages are studied, and a corpus of Chinese agricultural web pages is constructed. A manually identified collection serves as the test set: it contains 225 groups of pages, each group holding 2 to 14 near-duplicate pages, for a total of 1110 pages.
3) The pages are first preprocessed, and the MD5 method is used to remove pages that are exactly identical to another page in the collection. The main content of the remaining pages is then extracted and segmented with the Paoding (庖丁解牛) word segmenter, and stop words are removed. Feature words are weighted with three schemes: Boolean weight, term frequency (TF) weight, and term frequency-inverse document frequency (TF-IDF) weight. Finally, three similarity algorithms (the vector space model, semantic similarity based on HowNet (《知网》), and latent semantic analysis) are applied to the feature vectors produced by each of the three weighting schemes, giving 9 sets of similarity judgment results for Chinese agricultural web pages.
4) The precision, recall, and F1 measure of the 9 experiments are analyzed and compared. The results show that no feature weighting algorithm has an absolute advantage; each of the three performs better or worse depending on the similarity method it is paired with. Among the similarity methods, latent semantic analysis gives the best results. The MD5 method removed 41 pages that exactly duplicated other pages, and the remaining 1069 pages were evaluated with the different combinations of similarity method and weighting scheme. Latent semantic analysis combined with Boolean weights gives the best similarity judgment for agricultural web pages, with an F1 measure of 90.1% and a precision of 93.7%.
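The pipeline described in the abstract (exact-duplicate removal with MD5, three term-weighting schemes, vector space similarity, and precision/recall/F1 scoring) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: pages are assumed to be already reduced to body text and segmented into token lists (the thesis used the Paoding segmenter), the function names, the IDF variant, and the 0.8 threshold are all hypothetical, and the HowNet-based and latent semantic analysis methods are not covered.

```python
import hashlib
import math
from collections import Counter


def md5_dedup(pages):
    """Keep the first page for each distinct MD5 digest of the page text."""
    seen, unique = set(), []
    for text in pages:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def weight_vectors(docs, scheme):
    """Turn tokenized documents into sparse {term: weight} vectors.

    scheme is "boolean", "tf" or "tfidf" (names chosen for this sketch).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "boolean":
            vec = {t: 1.0 for t in tf}
        elif scheme == "tf":
            vec = {t: float(c) for t, c in tf.items()}
        elif scheme == "tfidf":
            # Assumed IDF variant: log(N / df) + 1. The abstract does not
            # state which formula the thesis used.
            vec = {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        else:
            raise ValueError(f"unknown weighting scheme: {scheme}")
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def near_duplicate_pairs(vectors, threshold=0.8):
    """Index pairs whose cosine similarity reaches the (assumed) threshold."""
    return {
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    }


def precision_recall_f1(predicted, actual):
    """Score predicted near-duplicate pairs against hand-labelled pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Running `weight_vectors` once per scheme and comparing the resulting pair sets against the manually labelled groups would give one row of scores per weighting scheme for the vector space model; the other two similarity methods would need a HowNet resource and a matrix decomposition step that this sketch leaves out.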
[Degree-granting institution]: Xinjiang Agricultural University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1; TP393.092

[References]