
Research on Semantics-Based Search Result Clustering Methods

Published: 2018-05-16 09:45

  Topic: search results + clustering; Source: Beijing University of Posts and Telecommunications, master's thesis, 2014


[Abstract]: As the Internet has grown, more and more people obtain their information online. Search engines, acting as the intermediary between users and the Internet, handle the acquisition and retrieval of information and have brought great convenience. However, as the volume of information on the Internet increases, the results returned by search engines have become increasingly cluttered, containing many irrelevant, duplicated, and mixed results. Users often have to spend considerable time and effort browsing them before finding a satisfactory result. Some researchers have therefore applied clustering techniques from information retrieval to the categorization of search results, presenting the cluttered results to users by category; this approach is called search result clustering. Search result clustering uses clustering, an unsupervised machine learning technique, to group search results into classes according to the principle of "maximizing intra-class similarity and minimizing inter-class similarity", and extracts cluster labels to give users category-based navigation. Moreover, the objects being clustered are not traditional long texts but the short snippets of search results. Current search result clustering techniques mostly represent snippets with independent words, ignoring semantic information such as the semantic relations between words, and so suffer from serious semantic loss.
To address this semantic loss, this thesis studies semantics-based search result clustering methods in depth. The main research topics are: preprocessing and modeling methods for search results, classic search result clustering methods, and semantics-based search result clustering methods. Building on this research, the thesis proposes an OPTICS-based search result clustering algorithm and a WordNet-based suffix tree clustering algorithm. Both algorithms introduce improvements that target semantic loss, focusing on mining and exploiting the semantic information in search result snippets in order to improve clustering accuracy. Finally, the thesis runs clustering experiments on a search result dataset and compares the clustering performance of the two new algorithms. The experimental results show that the two improved algorithms achieve clearly higher clustering accuracy than the original algorithms and shorter running times, improving the browsability and real-time performance of search result clustering.
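The semantic loss described in the abstract can be illustrated with a small sketch. It is not the thesis's algorithm: the `CONCEPTS` table is a hypothetical, hand-written stand-in for a lexical resource such as WordNet, and the function names are chosen only for this example. It compares two snippets with cosine similarity, first treating words as independent terms and then after mapping synonyms to a shared concept.

```python
import math
from collections import Counter

# Hypothetical toy synonym table; a stand-in for a lexical resource
# such as WordNet, not the thesis's actual semantic model.
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "buy": "purchase",
    "purchase": "purchase",
}


def bag_of_words(snippet, use_concepts=False):
    """Represent a snippet as term counts, optionally mapping words to concepts."""
    tokens = snippet.lower().split()
    if use_concepts:
        tokens = [CONCEPTS.get(token, token) for token in tokens]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


snippet_1 = "buy car"
snippet_2 = "purchase automobile"

# Independent words: the snippets share no terms, so they look unrelated.
plain = cosine(bag_of_words(snippet_1), bag_of_words(snippet_2))

# Concept-mapped words: both snippets reduce to the same two concepts.
semantic = cosine(
    bag_of_words(snippet_1, use_concepts=True),
    bag_of_words(snippet_2, use_concepts=True),
)

print(f"independent words: {plain:.2f}")
print(f"with concept mapping: {semantic:.2f}")
```

With independent words the two snippets have no terms in common, so a clustering method driven by this similarity would tend to separate them even though they mean the same thing; after concept mapping they become near-identical. This is the kind of gap that a semantics-based method aims to close.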

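[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP391.1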



