Research on Search Need Recognition Based on Map-Reduce and Trie Trees

Published: 2018-10-13 16:32

[Abstract]: In the Internet era of explosive data growth, people face both opportunities and challenges. On the one hand, they keep mining useful information from the gold mine of big data; on the other hand, they can be left helpless by the volume of redundant information on the Web. Search engines, the most commonly used information retrieval tools, help people find what they need on the Internet, but they also bear a heavy burden from data growth. As search engine indexes grow ever larger, query processing becomes increasingly expensive, and most of the information a search engine retrieves is irrelevant to the user's need. If a search engine could predict the user's search need before running the search, it could offer a better search experience. Analyzing users' search needs in real time not only yields more personalized results but also avoids much unnecessary computation. User search needs have therefore become a focus of research in China and abroad. Predicting a user's need requires recognizing the user's search terms, which usually relies on log mining. Search logs today, however, are at the terabyte scale, which makes them difficult to process on a single machine. To address the characteristics of large-scale data computation, this thesis proposes Paratemp, a strategy for building need-recognition templates. Using Map-Reduce, the strategy trains on search logs on a distributed cluster and mines representative classification templates, yielding patterns that can identify users' search needs. Borrowing the confidence and support measures from association rule mining, the thesis also proposes criteria for filtering templates. Templates that pass the filter serve as the basis for classifying search needs. Once the templates have been extracted, an efficient natural language algorithm is needed to apply them to recognition. The thesis designs the Tempaser recognition algorithm, which parses search terms with a Trie, trading space for time, and thereby recognizes search needs. Experiments show that search need recognition based on Map-Reduce and Trie trees is both correct and efficient. The thesis concludes with a summary and an outlook on future work.
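The two stages described in the abstract can be illustrated with a small single-machine sketch. It is not the thesis's Paratemp or Tempaser implementation, whose details the abstract does not give. Everything concrete here is an assumption made for illustration: the `<X>` slot notation, the whitespace tokenization (Chinese queries would need word segmentation first), the log format of (query, category) pairs, the entity set, the threshold values, and all function names. Support is taken as the share of log records that produce a given template-to-category rule, and confidence as the share of a template's occurrences that carry that category, following the usual association rule definitions.

```python
from collections import Counter

SLOT = "<X>"  # hypothetical placeholder standing for one entity token


def map_record(query, category, entities):
    """Map step: emit ((template, category), 1) for each entity token in the query."""
    tokens = query.split()
    for i, tok in enumerate(tokens):
        if tok in entities:
            template = tuple(tokens[:i] + [SLOT] + tokens[i + 1:])
            yield (template, category), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def select_templates(log, entities, min_support, min_confidence):
    """Keep template -> category rules that pass both thresholds.

    With min_confidence above 0.5, at most one category per template can pass.
    """
    pairs = [kv for q, c in log for kv in map_record(q, c, entities)]
    rule_counts = reduce_counts(pairs)
    template_counts = Counter()
    for (template, _), n in rule_counts.items():
        template_counts[template] += n
    selected = {}
    for (template, category), n in rule_counts.items():
        support = n / len(log)
        confidence = n / template_counts[template]
        if support >= min_support and confidence >= min_confidence:
            selected[template] = category
    return selected


class TrieNode:
    def __init__(self):
        self.children = {}    # token -> TrieNode
        self.category = None  # set on the node that ends a template


def build_trie(templates):
    """Insert every selected template into a token-level Trie."""
    root = TrieNode()
    for template, category in templates.items():
        node = root
        for tok in template:
            node = node.children.setdefault(tok, TrieNode())
        node.category = category
    return root


def recognize(root, query):
    """Return the category of a template matching the whole query, or None."""
    def walk(node, tokens):
        if not tokens:
            return node.category
        head, rest = tokens[0], tokens[1:]
        for key in (head, SLOT):  # try a literal match first, then the slot
            child = node.children.get(key)
            if child is not None:
                found = walk(child, rest)
                if found is not None:
                    return found
        return None
    return walk(root, query.split())


if __name__ == "__main__":
    # Made-up toy log, for illustration only.
    log = [
        ("photoshop download", "software"),
        ("wechat download", "software"),
        ("firefox download", "software"),
        ("inception download", "video"),
        ("inception review", "video"),
    ]
    entities = {"photoshop", "wechat", "firefox", "inception"}
    templates = select_templates(log, entities, min_support=0.3, min_confidence=0.7)
    trie = build_trie(templates)
    print(recognize(trie, "chrome download"))
```

In this toy log only the rule `<X> download` -> `software` passes both thresholds (support 0.6, confidence 0.75), so an unseen query such as "chrome download" is classified as `software`. On a real cluster the map and reduce functions would run as distributed tasks rather than in one process. Lookup time in the Trie grows with the number of query tokens rather than with the number of stored templates, which is the space-for-time trade the abstract mentions.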
[Degree-granting institution]: Jiangxi Normal University
[Degree level]: Master's
[Year conferred]: 2015
[Classification number]: TP391.3
