Information Extraction Based on E-commerce Data and User Behavior

Published: 2018-10-15 11:47

[Abstract]: With the explosive growth of the Internet and e-commerce in China, e-commerce companies led by Alibaba are generating massive amounts of data and attracting hundreds of millions of users. In other words, the era of big data is fast approaching. Faced with this volume of data, the central questions are how to improve data utilization and how to extract the information that users want most and that carries the most value. E-commerce sits at the forefront of the Internet industry, so it has a particularly strong need to turn data into information quickly. This is the information extraction problem studied in this thesis, with a focus on the e-commerce domain.

Existing information extraction techniques mainly comprise named entity recognition (NER) and relation extraction. Current approaches to NER fall into rule- and dictionary-based methods, statistical methods, and hybrids of the two. Rule- and dictionary-based methods achieve high precision once the rules have been tuned for the target, but they are labor-intensive, hard to reuse and extend, and usually solve only specific application scenarios. Statistical methods often deliver unsatisfactory precision and recall and have higher algorithmic complexity, but they extend well and leave ample room for improvement; many researchers work on improving the underlying statistical models to reach higher precision and recall and so achieve genuinely automatic recognition. Classic NER models include the hidden Markov model (HMM), the maximum entropy hidden Markov model (ME-HMM), and the conditional random field (CRF). Relation extraction analyzes a large corpus to extract the relations between named entities, such as the affiliation between place names and organization names, the similarity between product names, and the synonymy between abbreviations and full names.

Information extraction is also a highly applied field: an algorithm has to be implemented as a working system before its effectiveness can be assessed accurately. However, the popular information extraction systems, such as the OPENIE family of software packages developed under the lead of the University of Washington, can only handle English text. An efficient and usable information extraction system for Chinese is urgently needed.

The main contributions of this thesis are:
1) It reviews the classic information extraction models: HMM, ME-HMM, and CRF for named entity recognition, and the word vector model for synonym relation extraction. It also reviews the evaluation metrics commonly used for information extraction tasks: precision, recall, and F-measure.
2) It adapts the classic hidden Markov model for named entity recognition to e-commerce data, proposing a word-based hidden Markov model (Lexical-HMM) that improves NER precision in e-commerce scenarios. For synonym relation extraction, it proposes a bipartite graph model built on user search and browsing behavior that extracts synonymy between entities efficiently and accurately; comparative experiments demonstrate the algorithm's effectiveness.
3) It designs and validates the proposed information extraction system. Built on the Spark platform and a manually constructed training set, and organized as a DAG (directed acyclic graph), the system extracts a named entity library and a synonym library from the input data efficiently and accurately; its efficiency and stability are verified.
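
The evaluation metrics named in the abstract (precision, recall, and F-measure) can be illustrated with a minimal sketch. It assumes extracted entities are compared to a gold standard by exact match on (text, type) pairs and uses the balanced F1 variant; the function name and the sample entities are hypothetical and do not come from the thesis.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: each entity is a (text, type) pair.
gold = {
    ("Alibaba", "ORG"),
    ("iPhone 6", "PRODUCT"),
    ("Hangzhou", "LOC"),
    ("Nike", "BRAND"),
}
predicted = {
    ("Alibaba", "ORG"),
    ("iPhone 6", "PRODUCT"),
    ("Hangzhou", "ORG"),  # right span, wrong type: counts as an error
}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Two of the three predictions are correct and two of the four gold entities are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.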
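
As background for the HMM-based tagging mentioned in the abstract, the following is a minimal sketch of Viterbi decoding for a standard first-order HMM. It is not the Lexical-HMM proposed in the thesis, whose details the abstract does not give. The tag set, the probabilities, and the `floor` smoothing value for unseen tokens are all made-up assumptions for illustration.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable state sequence for tokens under an HMM."""
    if not tokens:
        return []

    def log_emit(state, token):
        # Unseen tokens get a small floor probability (an assumed smoothing choice).
        return math.log(emit_p[state].get(token, floor))

    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{s: math.log(start_p[s]) + log_emit(s, tokens[0]) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit(s, tokens[t])
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters, invented for this example only.
states = ["BRAND", "PRODUCT", "O"]
start_p = {"BRAND": 0.5, "PRODUCT": 0.2, "O": 0.3}
trans_p = {
    "BRAND": {"BRAND": 0.1, "PRODUCT": 0.7, "O": 0.2},
    "PRODUCT": {"BRAND": 0.1, "PRODUCT": 0.6, "O": 0.3},
    "O": {"BRAND": 0.3, "PRODUCT": 0.3, "O": 0.4},
}
emit_p = {
    "BRAND": {"nike": 0.8, "running": 0.1, "shoes": 0.1},
    "PRODUCT": {"nike": 0.05, "running": 0.45, "shoes": 0.5},
    "O": {"nike": 0.1, "running": 0.5, "shoes": 0.4},
}

tags = viterbi(["nike", "running", "shoes"], states, start_p, trans_p, emit_p)
print(tags)  # ['BRAND', 'PRODUCT', 'PRODUCT']
```

Working in log space avoids numeric underflow on long token sequences, which matters for product titles and search queries of realistic length.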
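
The abstract does not describe how its bipartite graph model is built, so the sketch below is not the thesis's algorithm. It only illustrates the general idea of a query-item bipartite graph: two search queries become synonym candidates when users who issue them browse largely the same items. The Jaccard score, the `threshold` parameter, the function name, and the click log are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def synonym_candidates(clicks, threshold=0.5):
    """Rank query pairs by the Jaccard overlap of the items they lead to."""
    # One side of the bipartite graph is queries, the other is items;
    # each (query, item) click is an edge.
    items_by_query = defaultdict(set)
    for query, item in clicks:
        items_by_query[query].add(item)

    pairs = []
    for q1, q2 in combinations(sorted(items_by_query), 2):
        a, b = items_by_query[q1], items_by_query[q2]
        score = len(a & b) / len(a | b)
        if score >= threshold:
            pairs.append((q1, q2, score))
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical click log of (search query, browsed item) pairs.
clicks = [
    ("cellphone", "item1"), ("cellphone", "item2"), ("cellphone", "item3"),
    ("mobile phone", "item1"), ("mobile phone", "item2"), ("mobile phone", "item3"),
    ("laptop", "item4"), ("laptop", "item5"),
    ("notebook computer", "item4"), ("notebook computer", "item5"),
    ("notebook computer", "item6"),
]

result = synonym_candidates(clicks)
for q1, q2, score in result:
    print(f"{q1} ~ {q2}: {score:.2f}")
```

On this toy log the two surviving pairs are "cellphone" with "mobile phone" (score 1.0) and "laptop" with "notebook computer" (score 2/3). Comparing all query pairs is quadratic, so a production system would need to restrict comparisons to queries that share at least one item.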
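[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1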
