Semantics-Based Trend Analysis of Internet Buzzwords

Published: 2018-11-03 16:56

【Abstract】: In natural language processing, building computable semantic features for words and texts is the foundation of most tasks. This thesis proposes a word semantic similarity method that incorporates prior knowledge from outside the text to improve model accuracy when features are sparse. It also combines word semantic similarity with LDA (Latent Dirichlet Allocation) to define a semantic distance between texts, and obtains the events in a corpus through K-Means clustering. Both methods use external knowledge to improve the vectorization of words and texts, and thereby improve vector-based similarity computation. The thesis has two main parts.

Improved word semantic similarity computation: Vectorization is what makes word semantics computable. This thesis proposes an improved method for computing word semantic vectors that incorporates word relations. Following the idea of Word2Vec, the model predicts the context words from the current word and, at the same time, predicts the word's adjacent positions in a word-relation network. The model passes a word through an encoding matrix to obtain its semantic vector, then through a decoding matrix to obtain predictions for sparse features such as context words and word relations. The model is adjusted iteratively using the gradient of the error with respect to the model parameters, which finally yields the mapping from words to semantic vectors. By adding an extra word-relation network, the method alleviates the feature sparsity of the text itself and improves the accuracy of word semantic similarity computation.

Improved LDA-based event discovery: LDA-based event discovery obtains the topic-word vector of each text from an LDA model and forms text clusters by clustering on the cosine distance between topic-word vectors. This thesis proposes a text semantic distance that fuses word semantic similarity with the frequency-domain features of words, and uses it to improve the LDA-based event discovery algorithm. The texts are first split by time window and processed with LDA to obtain their topic-word vectors, which are then clustered with K-Means under the distance definition that incorporates word semantic similarity, giving events at time-window granularity. These events are then merged according to the word-frequency features of their topic words to obtain the final events. By incorporating word semantic similarity information from additional text, the method improves the accuracy of event discovery on short texts.

In controlled experiments against the comparison methods, the proposed methods show some improvement in accuracy. Because the model places no special requirements on the format or quantity of the relation data, it also has good generality and extensibility. The contributions of this thesis are: 1) expressing multiple kinds of relations between a word and other elements through a matrix representation of vectors and local dot products, and learning the vector representation of words by gradient descent; 2) redefining the distance between topic vectors by fusing word semantic similarity with word-frequency information, which improves event clustering.
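
The first contribution described in the abstract (one encoding matrix, with decoding into both context-word and word-relation predictions, trained by gradient descent) can be illustrated with a small sketch. This is not the thesis's implementation: the full-softmax loss, the use of two separate decoding matrices, the function name `train_embeddings`, the hyperparameters, and the eight-word toy vocabulary with its relation pairs are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def train_embeddings(context_pairs, relation_pairs, vocab_size,
                     dim=8, lr=0.1, epochs=200, seed=0):
    """Learn word vectors by predicting context words and relation neighbors.

    Assumed setup: one shared encoding matrix, one decoding matrix per
    prediction task, full-softmax cross-entropy, plain SGD.
    """
    rng = np.random.default_rng(seed)
    encode = rng.normal(scale=0.1, size=(vocab_size, dim))      # word -> semantic vector
    decode_ctx = rng.normal(scale=0.1, size=(dim, vocab_size))  # vector -> context word scores
    decode_rel = rng.normal(scale=0.1, size=(dim, vocab_size))  # vector -> relation neighbor scores
    tasks = ((context_pairs, decode_ctx), (relation_pairs, decode_rel))
    for _ in range(epochs):
        for pairs, decode in tasks:
            for word, target in pairs:
                h = encode[word].copy()
                grad_logits = softmax(h @ decode)
                grad_logits[target] -= 1.0               # d(cross-entropy)/d(logits)
                grad_h = decode @ grad_logits            # computed before decode changes
                decode -= lr * np.outer(h, grad_logits)  # in-place update of the task decoder
                encode[word] -= lr * grad_h
    return encode

# Hypothetical toy data: word ids and pairs are invented for this example.
CAT, DOG, MEOW, BARK, CAR, ROAD, ANIMAL, VEHICLE = range(8)
context_pairs = [(CAT, MEOW), (DOG, BARK), (CAR, ROAD)]            # sparse text co-occurrence
relation_pairs = [(CAT, ANIMAL), (DOG, ANIMAL), (CAR, VEHICLE)]    # external relation network

vectors = train_embeddings(context_pairs, relation_pairs, vocab_size=8)
print("cat~dog:", cosine(vectors[CAT], vectors[DOG]))
print("cat~car:", cosine(vectors[CAT], vectors[CAR]))
```

In this toy corpus "cat" and "dog" never share a context word, so the relation pairs are the only signal linking them. That is the sparse-feature situation the abstract says the extra relation network is meant to relieve, and the printed similarities let one check whether the shared relation neighbor pulls the two vectors together.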
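【Degree-granting institution】: North China University of Technology
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP391.1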
