Research on Semantics-Based Text Vector Representation Methods

Published: 2018-01-20 14:49

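Keywords: text representation; semantics; text classification; opinion extraction; word vectors; neural networks　Source: University of Science and Technology of China, 2017 master's thesis　Type: degree thesis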

[Abstract]: The development and spread of Internet technology lets people obtain information quickly, and in turn the way people obtain information depends more and more on the Internet. Text is the main medium through which people get information online, and the volume of text on the Internet is growing explosively. To help people find the information they need more conveniently and accurately, Internet service providers need to classify, cluster, and rank texts. These tasks usually require representing text as vectors so that different machine learning models can be applied. From the user's point of view, texts should be classified, clustered, and ranked according to their semantics. Semantics is an abstract, high-level feature, whereas the widely used bag-of-words representation treats a text as a set of mutually independent tokens and ignores both the semantics of those tokens and the relations among them, so it generalizes poorly. Incorporating higher-level semantic information into text vector representations has therefore become a research goal for many scholars. The advantage of a semantics-based representation is that it encodes a text as a low-dimensional, dense vector that generalizes better: even when two texts express the same meaning with different wording, their semantic vector representations are similar, a similarity the bag-of-words model cannot capture. Topic models, including LDA and pLSI, recover the latent topics of a text by modeling its generative process and represent the text as a distribution over topics. Deep neural networks can learn features of data at different levels of abstraction and have therefore also been used to obtain semantic representations of text. Taking semantics-based text vector representation as its subject, this thesis carries out the following work:

1. In the unsupervised setting, the bag-of-words model cannot account for similarity between words, which limits generalization, and it suffers from the curse of dimensionality. To address both problems, this thesis proposes a representation based on word clusters (BOWL). A word cluster is a set of semantically similar words; each cluster expresses a "concept" and is a higher-level, more abstract feature than a single word, so the representation takes word semantics into account. The value of each dimension of the BOWL representation is computed with a k-max pooling operation. Experiments demonstrate the effectiveness and efficiency of the BOWL representation.

2. In the supervised setting, complex neural network architectures can capture more accurate semantic information, but training them is very time-consuming and often depends on GPUs. This thesis instead averages the word vectors of a text's words at the input layer of a neural network, applies the nonlinear transformation of a hidden layer to obtain a higher-level semantic vector representation of the text, and finally classifies texts in that vector space. Experiments show that this vector-averaging neural network greatly improves classification accuracy over the low-level bag-of-words representation. The thesis also uses experiments to illustrate how the neural network works and analyzes the optimization process.

3. For the specific task of extracting opinion tags from product review text, traditional word-matching methods do not generalize well. This thesis proposes matching review text to opinion tags by computing the semantic similarity between texts, and designs different similarity computations for long and short sentences. This is equivalent to using a kernel method to implicitly project texts into a semantic space and compute their distance there. Experiments show that the method greatly improves extraction recall and that the model generalizes better.
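As a rough illustration of the word-cluster idea in the first contribution (BOWL), the Python sketch below builds a cluster-based document vector. The abstract says only that each dimension is computed with k-max pooling, so the details here are assumptions: each cluster is summarized by a centroid vector, every known word in the document is scored by cosine similarity to that centroid, and the dimension's value is the mean of the k largest scores. The function name `bowl_vector`, the toy 2-d word vectors, and the two centroids are hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bowl_vector(tokens, word_vectors, centroids, k=2):
    """Return one value per word cluster.

    Assumed scoring: the mean of the k largest word-to-centroid cosine
    similarities among the document's known words (0.0 if there are none).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    features = []
    for centroid in centroids:
        sims = sorted((cosine(v, centroid) for v in known), reverse=True)
        top_k = sims[:k]  # k-max pooling keeps only the k strongest matches
        features.append(sum(top_k) / len(top_k) if top_k else 0.0)
    return features


# Toy 2-d word vectors (hypothetical values, for illustration only).
word_vectors = {
    "cheap": [1.0, 0.0],
    "inexpensive": [0.9, 0.1],
    "fast": [0.0, 1.0],
    "quick": [0.1, 0.9],
}
# Hypothetical cluster centroids: cluster 0 ~ "price", cluster 1 ~ "speed".
centroids = [[1.0, 0.0], [0.0, 1.0]]

doc_a = bowl_vector(["cheap", "fast"], word_vectors, centroids, k=1)
doc_b = bowl_vector(["inexpensive", "quick"], word_vectors, centroids, k=1)
similarity = cosine(doc_a, doc_b)
```

With these toy vectors the two documents share no words, so their bag-of-words overlap is empty, yet their cluster-based vectors point in nearly the same direction. In practice the centroids would come from clustering pretrained word vectors, which is outside the scope of this sketch.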

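[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1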

[References]