An Unsupervised Method for Structuring Chinese Product Attributes

Posted: 2018-04-02 13:01

Topic: structuring; Focus: relative non-selection conditional probability field; Source: Journal of Software (《软件学报》), 2017, Issue 02

[Abstract]: Extracting structured attribute information from unstructured product description text is important for e-commerce functions such as product comparison, product recommendation, and user demand prediction. Most existing structuring methods use supervised or semi-supervised classification to extract attribute values and attribute names, use a parser to analyze the grammatical dependencies between attribute values and attribute names, and then match values to names according to association rules. These methods have two shortcomings: (1) some attribute values, attribute names, and the correspondences between them must be labeled manually; (2) the accuracy of attribute value-attribute name matching is severely constrained by language habits, sentence logic, the corpus, and the quality of the candidate attribute name set. This paper proposes an unsupervised method for structuring Chinese product attributes. With the help of a search engine, the method analyzes grammatical relations based on the small-probability-event principle in order to extract attribute values and attribute names. It also proposes a relative non-selection conditional probability field and uses the PageRank algorithm to compute the pairing probability between attribute values and attribute names. The method incurs no manual labeling overhead, and it can automatically extract attribute values and match them to the corresponding attribute names whether or not the product description explicitly contains those names. Experiments were conducted on Chinese descriptions of 4 categories of products, using real corpus data from the Baidu search engine. For automatic generation of candidate attribute names, the proposed approach searches for the attribute value with a search engine and extracts common nouns from the search results that contain the value; compared with extracting common nouns only from the description sentence, it improves recall by more than 20%. For non-quantitative attributes, the proposed attribute value-attribute name matching method based on the relative non-selection conditional probability field, compared with the method based on dependency relations, improves Rank-1 accuracy by more than 30% and average MRR by more than 0.3.
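The abstract reports its matching results as Rank-1 accuracy and mean reciprocal rank (MRR). As a reminder of how these two standard metrics are usually computed, here is a minimal sketch. It is not the paper's evaluation code: the function names, the attribute values, and the ranked candidate lists are hypothetical toy data chosen only for illustration.

```python
from typing import Dict, List


def rank1_accuracy(rankings: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of attribute values whose top-ranked candidate is the gold attribute name."""
    hits = sum(
        1 for value, names in rankings.items()
        if names and names[0] == gold[value]
    )
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Average of 1/rank of the gold attribute name; contributes 0 when it is not ranked."""
    total = 0.0
    for value, names in rankings.items():
        if gold[value] in names:
            total += 1.0 / (names.index(gold[value]) + 1)
    return total / len(rankings)


# Hypothetical toy data: each attribute value maps to a ranked list of candidate names.
rankings = {
    "red": ["color", "style", "material"],      # gold name at rank 1
    "cotton": ["style", "material", "color"],   # gold name at rank 2
    "slim": ["size", "color", "fit"],           # gold name at rank 3
    "waterproof": ["brand", "size"],            # gold name not among the candidates
}
gold = {"red": "color", "cotton": "material", "slim": "fit", "waterproof": "function"}

r1 = rank1_accuracy(rankings, gold)
mrr = mean_reciprocal_rank(rankings, gold)
print(f"Rank-1 accuracy: {r1:.3f}, MRR: {mrr:.3f}")
```

Rank-1 accuracy rewards only an exact top hit, whereas MRR also gives partial credit when the correct attribute name appears lower in the ranking, which is why the two are commonly reported together.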

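[Author affiliation]: School of Computer Science, Northwestern Polytechnical University (西北工业大学计算机学院)
[Classification number]: TP311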

[Similar literature]