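Research on Key Technologies for Internet Search Term Classification
Published: 2018-05-20 05:12
Topics: search keywords + pseudo-relevance feedback; Source: Zhejiang University, master's thesis, 2011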
[Abstract]: With the rapid development of the Internet, the volume of digital information online has begun to grow exponentially, and it is increasingly difficult for people to find the specific information they need in this sea of information. Search engines, which help people locate what they actually need among massive amounts of information, serve as the information-access platform for network users and have become an indispensable Internet application. Users depend on search engines more and more heavily, and search behavior has become an important part of their online activity. The most important element of that behavior is the search terms users submit: these terms directly or indirectly reflect users' latent interests and needs, and provide a sound basis for network services such as personalized applications and targeted online advertising.
This thesis therefore proposes classifying and analyzing search terms. To address the problem of Internet search term classification, it analyzes in detail the background against which Internet search terms arise, summarizes the definition of a search term, describes the characteristics of search terms in detail, and examines the difficulties of search term classification in light of existing techniques. It then proposes a two-stage solution: search term preprocessing based on pseudo-relevance feedback, followed by search term classification based on text classification techniques. Through pseudo-relevance feedback, the unfamiliar problem of classifying search terms is transformed into one that existing text classification techniques can solve.
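The two-stage idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which search engine, feedback depth, or classifier the thesis uses. `CORPUS` is a hypothetical stand-in for search-engine result snippets, `expand_query` is a hypothetical helper that treats the top-ranked snippets as relevant, and a small multinomial naive Bayes model stands in for the text classifier.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for the snippets a search engine would return.
CORPUS = [
    "iphone smartphone review battery camera",
    "android smartphone price camera specs",
    "football league match score goal",
    "basketball match score playoffs team",
    "iphone case apple store price",
]

def tokenize(text):
    return text.lower().split()

def expand_query(query, corpus=CORPUS, k=2):
    """Stage 1: pseudo-relevance feedback (toy version).

    Rank documents by term overlap with the query, assume the top k
    matching documents are relevant, and return the query tokens plus
    the tokens of those documents.
    """
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(d))), i) for i, d in enumerate(corpus)]
    # Highest overlap first; ties broken by corpus order for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    tokens = tokenize(query)
    for score, i in scored[:k]:
        if score > 0:
            tokens.extend(tokenize(corpus[i]))
    return tokens

class NaiveBayes:
    """Stage 2: multinomial naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_logp = None, -math.inf
        for label in sorted(self.class_counts):
            total = sum(self.word_counts[label].values())
            logp = math.log(self.class_counts[label] / n_docs)
            for t in tokens:
                logp += math.log((self.word_counts[label][t] + 1) / (total + vocab_size))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical labeled training texts for the classifier.
TRAIN = [
    ("smartphone camera battery price specs", "tech"),
    ("apple store android review", "tech"),
    ("match score goal team league", "sports"),
    ("playoffs football basketball", "sports"),
]
clf = NaiveBayes().fit([tokenize(t) for t, _ in TRAIN], [y for _, y in TRAIN])

def classify_search_term(term):
    """Expand the short search term, then classify the expanded text."""
    return clf.predict(expand_query(term))
```

On this toy data the term "iphone" never appears in the training texts, so the classifier has no evidence for it on its own; after expansion it is represented by words such as "smartphone" and "camera", which the classifier does know. That is the sense in which feedback turns a short-query problem into an ordinary text classification problem.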
In solving the search term classification problem, this thesis studies and compares several techniques used in text classification and proposes a reconstruction-based feature refinement method that further prunes the features remaining after an initial feature selection. Drawing on column selection methods, the method defines an objective function for choosing a feature subset from the initially selected features, solves that objective using ideas from greedy algorithms and transductive experimental design, and obtains a locally optimal reduced feature subset; experiments confirm that the method is usable. Through detailed and comprehensive experiments, the thesis also compares the classification results of multiple combinations of feature selection methods and classifiers, and selects the feature selection method and classifier best suited to this classification problem. Finally, it outlines directions in which search term classification can be further improved and applied.
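The abstract does not give the objective function, so the following is only a sketch of one common reading of "reconstruction-based" column selection: choose feature columns whose span reconstructs the whole document-feature matrix with the smallest squared error, adding one column at a time greedily. The function name `greedy_column_selection` and the toy matrix are hypothetical, and the transductive experimental design formulation used in the thesis is not reproduced here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_column_selection(X, k, eps=1e-12):
    """Greedily pick up to k column indices of X (a list of rows).

    At each step, add the column that most reduces the total squared
    error of reconstructing every column from the selected ones.
    The result is a locally optimal subset, not a global optimum.
    """
    n_rows, n_cols = len(X), len(X[0])
    # Residual of each column after projecting onto the selected columns.
    residuals = [[X[i][j] for i in range(n_rows)] for j in range(n_cols)]
    selected = []
    for _ in range(min(k, n_cols)):
        best_j, best_gain = None, eps
        for j in range(n_cols):
            norm_sq = dot(residuals[j], residuals[j])
            if j in selected or norm_sq <= eps:
                continue
            # Drop in total squared residual if column j were added.
            gain = sum(dot(residuals[j], r) ** 2 for r in residuals) / norm_sq
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break  # every column is already reconstructed
        pivot = residuals[best_j][:]
        pivot_norm_sq = dot(pivot, pivot)
        # Remove the pivot direction from every residual column.
        for i in range(n_cols):
            c = dot(pivot, residuals[i]) / pivot_norm_sq
            residuals[i] = [a - c * b for a, b in zip(residuals[i], pivot)]
        selected.append(best_j)
    return selected

# Toy document-feature matrix: feature 2 duplicates feature 0,
# and feature 3 is the sum of features 0 and 1.
X = [
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 2],
]
selected = greedy_column_selection(X, 3)
```

Because the toy matrix has only two independent feature columns, the loop stops after two picks even though three were requested: redundant features contribute nothing to reconstruction, which is the motivation for pruning an initial feature set this way.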
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP391.1