A Chinese Search Result Clustering System Based on an Improved Suffix Tree

Published: 2018-06-19 19:27

Topic: search result clustering + suffix tree; source: Beijing Forestry University, master's thesis, 2013

[Abstract]: As technology advances, people have become closely tied to the Internet, and online communication and sharing have made daily life far more convenient. The rapid growth of online information, however, forces users to comb through the result lists returned by search engines. When a query is ambiguous, a user may have to read through many pages before finding a satisfactory answer. A search for "美洲虎" (jaguar), for example, may be aimed at a weapon, a car, or an animal, yet the returned list presents these categories intermixed, so a user who wants details on one category must page through many results. To address this, this thesis designs a search result clustering system on top of a conventional search engine. The system works in three steps. First, an HTML parser extracts the title and snippet of each result item returned by the search engine; a word segmentation tool segments the extracted text, tags parts of speech, and records each word's position and frequency; stop words are removed, and the remaining words form the keyword set of each result item. Next, a single suffix tree is built from the keyword sets of all result items, with words as the unit of insertion into the tree's nodes, and the words at each node are scored under constraints on position, word frequency, part of speech, and word length. Finally, the base clusters are merged and the high-scoring node words are taken as cluster labels. Experimental results show that the method yields clusters of high purity and labels that are accurate and discriminative, which makes the results convenient for users.
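The second and third steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes segmentation and stop-word removal have already produced one keyword list per result item, it uses a plain word-level suffix trie instead of a compressed suffix tree, and it borrows the 0.5 mutual-overlap merge rule from classic Suffix Tree Clustering. The `score` function is a placeholder that uses only document count and word length, because the abstract does not give the actual weights for position, frequency, and part of speech. All function names here are hypothetical.

```python
from collections import defaultdict


class Node:
    """One node of a word-level suffix trie."""

    def __init__(self):
        self.children = {}  # word -> Node
        self.docs = set()   # ids of result items whose keywords pass through here


def build_suffix_trie(keyword_lists):
    """Insert every suffix of every keyword list, one word per edge."""
    root = Node()
    for doc_id, words in enumerate(keyword_lists):
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.docs.add(doc_id)
    return root


def base_clusters(root, min_docs=2):
    """Map each phrase shared by at least min_docs result items to those items."""
    clusters = {}
    stack = [((), root)]
    while stack:
        phrase, node = stack.pop()
        for word, child in node.children.items():
            # A child's document set can only shrink further down, so prune here.
            if len(child.docs) >= min_docs:
                new_phrase = phrase + (word,)
                clusters[new_phrase] = set(child.docs)
                stack.append((new_phrase, child))
    return clusters


def score(phrase, docs):
    """Placeholder score (assumption): document count times total word length."""
    return len(docs) * sum(len(word) for word in phrase)


def merge_clusters(clusters, threshold=0.5):
    """Group base clusters whose document sets overlap by more than threshold
    in both directions (rule assumed from classic Suffix Tree Clustering)."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(clusters[a] & clusters[b])
            if (common > threshold * len(clusters[a])
                    and common > threshold * len(clusters[b])):
                parent[find(a)] = find(b)

    merged = defaultdict(lambda: {"docs": set(), "phrases": []})
    for p in phrases:
        group = merged[find(p)]
        group["docs"] |= clusters[p]
        group["phrases"].append(p)
    return list(merged.values())


def label(group, clusters):
    """Pick the highest-scoring phrase of a merged group as its label."""
    return max(group["phrases"], key=lambda p: (score(p, clusters[p]), p))


if __name__ == "__main__":
    # Hypothetical keyword lists for an ambiguous "jaguar" query.
    keyword_lists = [
        ["jaguar", "car", "price"],
        ["jaguar", "car", "dealer"],
        ["jaguar", "animal", "habitat"],
        ["jaguar", "animal", "rainforest"],
    ]
    found = base_clusters(build_suffix_trie(keyword_lists))
    for group in merge_clusters(found):
        print(" ".join(label(group, found)), sorted(group["docs"]))
```

Inserting whole words rather than characters keeps every shared phrase aligned with segmentation boundaries, which is the property the abstract relies on when it scores node words. A production system would replace the trie with a linear-time suffix tree construction and replace `score` with the thesis's weighted constraints.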
[Degree-granting institution]: Beijing Forestry University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3

[Similar literature]