Research on a Product Semantic Annotation Method Based on an E-commerce Domain Classification Tree and Crowdsourcing

Published: 2018-06-28 01:20

Topics: e-commerce domain classification tree + semantic annotation; Source: East China Normal University, master's thesis, 2017

[Abstract]: With the rapid development of the e-commerce industry and Internet technology, a new business model called T20, which combines video with e-commerce, has emerged. Product frames that flash by in a video can be matched against the product images in a product resource library by an image matching algorithm, so that the user can be given a purchase link for the product. If more semantic tags are added to products when the resource library is built, users spend less time browsing product details, and products can be recommended to them according to the different tags. This thesis studies the semantic annotation of product text resources. In existing work on semantic annotation of text, the annotated resources (such as documents and web pages) are mostly structured text or long text, and the methods rely on knowledge organization systems such as domain ontologies or knowledge bases. In e-commerce, however, there is no shared, general-purpose domain ontology, and product description texts are "fragmented" and lack contextual semantic information. To address this, the thesis takes an e-commerce domain classification tree as the knowledge organization system and proposes a word-vector-based method for product semantic annotation, which adds semantic tags such as category and attributes to products.

The main research contents are as follows. First, product concepts, concept relations, and concept attributes are extracted from the product catalogue of an online product resource library and from the attribute descriptions of a large number of products, and are used to build a product classification tree for the e-commerce domain. Second, Word2vec word vectors are trained on e-commerce text and used to extract semantic features from product description texts. Third, the product concepts in the classification tree are treated as a known set of class labels, and a word-vector-based product classifier is trained; each product to be annotated is treated as an item to be classified, and the classifier maps it to a product concept in the tree, which gives the product its category. Based on this concept mapping, the concept attributes of the product are then read from the classification tree, and the similarity between each attribute in the attribute-value pairs of the product description text and the concept attributes is measured at both the lexical and the semantic level, which gives the product its attribute values. Finally, the product classifier is trained iteratively by combining crowdsourcing with active learning, which raises classification accuracy and improves the quality of the semantic annotation.

The main contributions are as follows. 1. A product semantic annotation method based on an e-commerce domain classification tree and word vectors is proposed. Used as the knowledge organization system, the classification tree expresses the hierarchical relations of domain knowledge about as well as a domain ontology does, while being simpler to build and easier to understand. Word2vec word vectors are used to generate semantic features for product descriptions, so that the descriptions carry explicit semantic information. Combining the two makes it possible to add semantic tags such as category, attributes, and attribute values to products when a product resource library is built. The method is applicable to building different product resource libraries and addresses the heterogeneity of product sources. 2. A method for improving the quality of product semantic annotation by combining crowdsourcing and active learning is proposed. It combines the high accuracy of crowdsourced annotation with the speed of machine classification: an active learning sampling strategy selects the machine classification results with low confidence and hands them to crowd workers for annotation. In this way, a small amount of product data with known class labels and a large amount of product data with unknown class labels can be used to train a fairly accurate product classifier over several iterations, improving classification quality while reducing annotation cost.
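The iterative loop in the second contribution can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the thesis's implementation: the two-dimensional toy vectors stand in for trained Word2vec vectors, a nearest-centroid classifier over averaged word vectors stands in for the product classifier, the margin between the two best cosine scores stands in for the confidence measure, and the hypothetical `crowd_label` function simulates crowd workers with a lookup table.

```python
import math
from collections import defaultdict


def mean_vector(tokens, vectors):
    """Average the vectors of known tokens (stand-in for Word2vec features)."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labeled, vectors):
    """One centroid per category: the mean feature vector of its products."""
    groups = defaultdict(list)
    for tokens, label in labeled:
        groups[label].append(mean_vector(tokens, vectors))
    return {lab: [sum(c) / len(vs) for c in zip(*vs)] for lab, vs in groups.items()}


def predict(tokens, centroids, vectors):
    """Return (best_label, margin); a small margin means low confidence."""
    feat = mean_vector(tokens, vectors)
    scores = sorted(((cosine(feat, c), lab) for lab, c in centroids.items()),
                    reverse=True)
    best_score, best_label = scores[0]
    second = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - second


def active_learning(labeled, unlabeled, vectors, crowd_label, rounds=3, batch=1):
    """Each round: retrain, send the least confident items to the crowd."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        centroids = train_centroids(labeled, vectors)
        pool.sort(key=lambda toks: predict(toks, centroids, vectors)[1])
        queried, pool = pool[:batch], pool[batch:]
        labeled.extend((toks, crowd_label(toks)) for toks in queried)
    return train_centroids(labeled, vectors), labeled


# Toy data (hypothetical): 2-d vectors instead of trained Word2vec vectors.
vectors = {"phone": [1.0, 0.0], "case": [0.8, 0.2], "cover": [0.5, 0.5],
           "shirt": [0.0, 1.0], "cotton": [0.1, 0.9]}
seed = [(["phone"], "electronics"), (["shirt"], "clothing")]
pool = [["phone", "case"], ["cotton", "shirt"], ["cover"]]
crowd_answers = {("phone", "case"): "electronics",
                 ("cotton", "shirt"): "clothing",
                 ("cover",): "electronics"}


def crowd_label(tokens):
    """Simulated crowd worker: looks the answer up in a fixed table."""
    return crowd_answers[tuple(tokens)]


model, final_labeled = active_learning(seed, pool, vectors, crowd_label)
```

In this toy run the ambiguous description `["cover"]` is equally close to both seed categories, so it is the first item sent to the simulated crowd. A real system would replace the centroid classifier and the margin score with whatever classifier and sampling strategy are actually in use, and would aggregate answers from several workers rather than trusting a single lookup.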
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1;F724.6
