Research on Chinese Product Query Classification Based on User Behavior and Semantic Expansion
Published: 2018-07-17 03:43
[Abstract]: Web query classification assigns a query to one or more predefined categories. Web queries are usually very short, so they rarely express the user's intent in full. Manually labeling query categories is too costly, which leaves training data scarce and makes Web query classification harder still. Current research generally approaches the problem from two directions: automatically acquiring more training data to improve classifier accuracy, and expanding the query itself to enrich the feature information of the query text. Web query classification is an effective way to identify a user's query intent. It can be applied to Web search to improve search accuracy, and also to vertical search, product recommendation, advertising recommendation, and many other areas. This thesis studies Chinese product query classification, a special case of Web query intent classification. Product query classification was chosen as the research topic for two reasons. First, product queries are very important, especially now that more and more people shop online; accurate product query classification is convenient for customers, improves the user experience, and can bring substantial benefits to merchants. Second, ample data on product queries is available. The method not only addresses product query classification but can also be applied to other query classification domains.
This thesis uses two methods, user click behavior and query-similarity expansion, to automatically acquire a large amount of training and test data from product search logs, which addresses the shortage of training data that Web query classification usually faces. To deal with product queries being too short, two different expansion methods are used: one based on a search engine and one based on Chinese Wikipedia. The expansion method based on search engine results gives better classification performance, but it has to fetch the search engine's results online and process them, so it is less efficient. Weighing the strengths and weaknesses of search engine expansion, the thesis proposes a hybrid product query classification method. The original product query is first passed to the trained classifier. If the classification confidence is above a threshold, the query is classified directly; otherwise, the query is expanded using the search engine method and the expanded query is passed to the classifier to obtain the final result. The confidence threshold is determined experimentally, and the experiments show that this method obtains product query classification results accurately and efficiently. A combination of two classifiers is also used to further improve classification accuracy and efficiency. Finally, a hierarchical classification algorithm for product queries is implemented, and the hybrid classification algorithm is applied to hierarchical classification with good results.
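The hybrid step described in the abstract (classify the raw query, and fall back to search engine expansion only when confidence is low) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the keyword table, the smoothed confidence score, the canned search snippets, the 0.6 threshold, and all function names are assumptions introduced here. The thesis itself uses a trained classifier and an experimentally determined threshold.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model: category -> indicative terms (illustration only).
CATEGORY_TERMS: Dict[str, List[str]] = {
    "phones": ["phone", "smartphone", "android", "sim"],
    "shoes": ["shoes", "sneakers", "running", "sole"],
}

# Hypothetical stand-in for online search engine results (no network access).
FAKE_SEARCH_SNIPPETS: Dict[str, str] = {
    "galaxy s3": "samsung galaxy s3 android smartphone with sim slot",
}


def classify(text: str) -> Tuple[str, float]:
    """Return (best category, confidence) for a whitespace-tokenized text.

    Confidence is an add-one smoothed share of term hits, an assumed
    scoring rule chosen only to keep the sketch self-contained.
    """
    tokens = text.lower().split()
    scores = {
        category: sum(tokens.count(term) for term in terms)
        for category, terms in CATEGORY_TERMS.items()
    }
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = (scores[best] + 1) / (total + len(scores))
    return best, confidence


def expand_with_search(query: str) -> str:
    """Append canned snippets to the query, mimicking search-based expansion."""
    snippets = FAKE_SEARCH_SNIPPETS.get(query.lower(), "")
    return f"{query} {snippets}".strip()


def hybrid_classify(
    query: str,
    threshold: float = 0.6,  # assumed value; the thesis tunes it by experiment
    expand: Callable[[str], str] = expand_with_search,
) -> Tuple[str, float, bool]:
    """Return (category, confidence, whether expansion was used)."""
    category, confidence = classify(query)
    if confidence > threshold:
        # Confident enough: skip the costly online expansion.
        return category, confidence, False
    # Low confidence: expand the query, then classify the expanded text.
    category, confidence = classify(expand(query))
    return category, confidence, True


if __name__ == "__main__":
    print(hybrid_classify("running shoes"))
    print(hybrid_classify("galaxy s3"))
```

In this sketch the query "running shoes" is classified directly, while "galaxy s3" matches no known term, falls below the threshold, and is classified only after expansion. Raising the threshold sends more queries through the slower expansion path, which is the accuracy and efficiency trade-off the abstract refers to.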
[Degree-granting institution]: Sun Yat-sen University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3
Article ID: 2128850
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2128850.html