Research on Classified Display Technology for Search Engines

Published: 2018-05-13 15:49

Topic: search engine + classified index; Source: Harbin Institute of Technology, 2012 master's thesis

[Abstract]: With advances in science and technology, Internet and communication technologies have developed rapidly, and the volume of information on the network is growing explosively. People increasingly rely on the network to obtain the information they need. While this growth brings convenience to users, it also causes a problem: quickly locating a target in the vast ocean of information has become more and more difficult. To address this problem, this thesis studies classified display technology for search engines, aiming to guide users through a suitable category hierarchy and reduce the time they waste.
The thesis divides the implementation of classified display into two parts: a classification module, which labels web pages with categories, and a search engine module, which builds the classified index, performs classified retrieval, and delivers the final classified display to the user.
In the classification module, the web page collection is first preprocessed to convert pages from text form into vector space form. The thesis proposes a body-text extraction algorithm based on web page segmentation, which locates the main text of a page by examining the nodes of its tag tree. A feature extraction algorithm based on document frequency then filters out words with too little discriminative power, completing the conversion of pages into space vectors. Next, the text classifier is trained. The thesis uses a decision-tree-based method to extend binary support vector machine classifiers to multi-class problems, and proposes a multiple feature selection technique better suited to hierarchical classification: a text is represented by different feature vectors at different category levels, and the same text feature is assigned different weights in the classifiers of different levels, which improves classification accuracy within the hierarchy.
In the search engine module, the open-source search engine Lucene serves as the base architecture of the system. The classified index is built using the concept of fields in Lucene index files, with the category information of each web page stored in the index. When a user wants to view the search results for a certain category, searching the field that corresponds to that category level yields the classified display.
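The field-based classified index described above can be illustrated with a minimal sketch. This is a toy in-memory model written in Python, not Lucene's API and not the thesis's implementation; the class name `CategoryIndex`, the field names `content` and `category_level<N>`, and the sample pages are all hypothetical. It assumes each page already carries a category path from the classification module and stores one category field per hierarchy level, so that a query can be restricted to any level.

```python
from collections import defaultdict


class CategoryIndex:
    """Toy inverted index that stores a page's category path as extra fields.

    Hypothetical illustration only: field names and structure are assumptions,
    and nothing here calls into Lucene.
    """

    def __init__(self):
        # Maps (field name, term) to the set of document ids containing it.
        self.postings = defaultdict(set)

    def add(self, doc_id, text, category_path):
        # Index the page text under the "content" field.
        for term in text.lower().split():
            self.postings[("content", term)].add(doc_id)
        # Index one category field per hierarchy level, for example
        # category_level1="sports" and category_level2="sports/football".
        for depth in range(1, len(category_path) + 1):
            field = f"category_level{depth}"
            value = "/".join(category_path[:depth])
            self.postings[(field, value)].add(doc_id)

    def search(self, query, category_path=None):
        """Return sorted ids of pages matching every query term,
        optionally restricted to one category at its hierarchy level."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(
            *(self.postings.get(("content", term), set()) for term in terms)
        )
        if category_path:
            field = f"category_level{len(category_path)}"
            value = "/".join(category_path)
            hits &= self.postings.get((field, value), set())
        return sorted(hits)
```

With three sample pages indexed as `(1, "football match results", ["sports", "football"])`, `(2, "football stadium construction budget", ["news", "economy"])` and `(3, "basketball match results", ["sports", "basketball"])`, the call `search("football")` returns both pages 1 and 2, while `search("football", ["sports"])` returns only page 1. In Lucene itself the same idea would presumably be expressed by adding category fields to each document and combining the text query with a term query on the chosen category field.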
Finally, the thesis implements the methods above. Classification accuracy and sample recall serve as the evaluation criteria for the classification module, while classified-display retrieval time and the accuracy of search results serve as the criteria for the search engine module. The experimental results are analyzed to confirm that classified display for search engines is feasible in practical applications.
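As a rough illustration of the classification-module criteria, the sketch below computes per-category precision and recall from true and predicted labels. Reading "classification accuracy" as per-category precision is an assumption, since the abstract does not define the metric; the function name and the sample labels are hypothetical and are not results from the thesis.

```python
from collections import Counter


def per_class_precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label that appears.

    Hypothetical helper: precision = correct predictions of a label divided by
    all predictions of that label; recall = correct predictions of a label
    divided by all true samples of that label. Undefined ratios are set to 0.0.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    true_count = Counter(true_labels)
    predicted_count = Counter(predicted_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )

    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        precision = (
            correct[label] / predicted_count[label]
            if predicted_count[label] else 0.0
        )
        recall = (
            correct[label] / true_count[label]
            if true_count[label] else 0.0
        )
        scores[label] = (precision, recall)
    return scores
```

For example, with true labels `["sports", "sports", "news", "news", "tech"]` and predictions `["sports", "news", "news", "news", "sports"]`, the category `sports` gets precision 0.5 and recall 0.5, because one of its two predictions is correct and one of its two true samples is found.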

[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1;TP393.092

[References]