Research and Implementation of Tightness Post-Processing Based on Hadoop and Support Vector Machines

Published: 2018-04-09 18:41

Topic: natural language processing　Focus: tightness　Source: Beijing Jiaotong University, 2015 master's thesis

[Abstract]: Accurately extracting and presenting the results a user is looking for has become the main goal of today's search engines. Search engines draw on many technologies; natural language processing is among the most important and underpins improvements in the others. Tightness is one of the key techniques applied after word segmentation and stop-word removal. It describes the relationship between the smallest units produced by segmentation (Terms) and is an important signal in relevance ranking for web search: it plays a decisive role in the ranking result and matters for both the precision and the recall of search results. Because the segmenter follows a minimal-cut strategy, it splits sentences as finely as possible, which breaks some long phrases into multiple Terms. Subsequent retrieval can then recall pages that do not match the user's intent, lowering precision and degrading the user experience. Taking a real project on the Sogou search engine as its background, the thesis studies new-word discovery strategies for Chinese word segmentation in search engines, designs a strategy-based algorithm for extracting relationships between Terms, assembles those relationships into features, classifies the features with a support vector machine (Support Vector Machine, SVM), and thereby improves the practical effectiveness of tightness. The main work is as follows: (1) Data preprocessing. The raw search logs are segmented and initial statistics are computed, yielding the base data for the later strategies. (2) Preliminary post-processing based on search session logs. A query difference value is computed over the search session data to select a subset of sessions, which is used for a first round of tightness post-processing. (3) Second-stage post-processing based on web page body text. For tightness results at the proper-noun level, new-word discovery methods such as information entropy and mutual information are used to derive feature relationships between pairs of Terms, and the feature values are classified with the SVM. (4) Experimental verification and analysis. The final offline data are validated against the training set; the tightness post-processing strategy improves relevance ranking and makes Sogou search results more accurate. (5) Strategy effect. Adjusting tightness values through the post-processing strategy makes relevance ranking more accurate, placing high-quality results higher and poor results lower.
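The abstract names information entropy and mutual information as the sources of the pairwise Term features but does not give the formulas. The sketch below is one common reading from the new-word discovery literature, not the thesis's own implementation: pointwise mutual information for each adjacent Term pair, plus the entropy of the Terms seen immediately to its left and right. The function name `pair_features` and the input format (a list of already-segmented sentences) are assumptions made for illustration. The resulting tuples are the kind of feature vector that could then be passed to an SVM classifier, which is not shown here.

```python
import math
from collections import Counter, defaultdict


def pair_features(sentences):
    """Return {(a, b): (pmi, left_entropy, right_entropy)} for adjacent Term pairs.

    `sentences` is a list of Term lists (hypothetical input format).
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # Terms seen immediately before the pair
    right_ctx = defaultdict(Counter)  # Terms seen immediately after the pair

    for terms in sentences:
        unigrams.update(terms)
        for i in range(len(terms) - 1):
            pair = (terms[i], terms[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left_ctx[pair][terms[i - 1]] += 1
            if i + 2 < len(terms):
                right_ctx[pair][terms[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def entropy(counter):
        # Shannon entropy (in bits) of the neighbouring-Term distribution.
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    features = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))  # pointwise mutual information
        features[(a, b)] = (pmi, entropy(left_ctx[(a, b)]), entropy(right_ctx[(a, b)]))
    return features
```

Under this reading, a pair with high mutual information and varied neighbours on both sides behaves like a self-contained unit, while a pair whose right-hand neighbour never varies is more likely a fragment of a longer phrase.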
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3;TP18
