A Domain-Specific Automatic Question Answering System Based on the LDA Model

Published: 2018-06-17 13:49

Topics: word segmentation + LDA model; Source: Anhui University, master's thesis, 2013

[Abstract]: As the Internet grows, the amount of information it holds keeps increasing, and people expect to find what they need online quickly. Current search engines, however, are often used ineffectively and still have many shortcomings, which limits how efficiently people can obtain information. An automatic question answering system can retrieve the content a user is looking for more intelligently, quickly, and accurately, and in recent years it has become a widely studied topic among researchers in China and abroad. With the goal of building an automatic question answering system for one domain, solutions to common computer faults, this thesis examines the full pipeline of such a system, from question processing through to the final answer. The research finds that domain word segmentation and semantic similarity computation are the core components of an automatic question answering system, and that, measured against current system requirements and the state of research, both leave considerable room for improvement. This thesis improves on these two aspects, validates each improvement experimentally in the corresponding section, and shows that the improvements do strengthen the retrieval results. Finally, it designs and implements a prototype system that automatically returns solutions to computer-fault questions posed by users.

First, the thesis discusses the methods commonly used in Chinese word segmentation. It analyzes in depth the two classical approaches, dictionary-based segmentation and statistics-based segmentation, briefly introduces other methods, and compares the characteristics and performance of the different methods. It then proposes a segmentation method based on a domain dictionary and the mutual information of word strings. The method incorporates semantic information, takes the characteristics of domain-specific terminology into account, and uses the mutual information of word strings to resolve segmentation ambiguity. Experiments show that these improvements raise segmentation performance on domain text.

Second, the thesis briefly discusses the concept of semantic similarity and the principles for computing it, and studies similarity computation based on edit distance, on dependency relations, and on semantic distance and ontologies. It also proposes a new method that improves on the classical similarity computation methods. The new method uses the LDA model: training on a domain corpus yields a domain-specific word-topic distribution, and because the method takes into account the semantic relatedness of words under the same topic, the resulting semantic similarity is more reliable.

Finally, the thesis presents the system design of the automatic question answering system for solutions to common computer faults. A sound design gives the system framework high cohesion and low coupling, which greatly reduces the cost of upgrading and later maintaining the system. A demo version of the system was developed on the .NET Framework under Windows XP, and practical testing shows that it runs well.
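
The abstract does not say how the word-topic distribution is turned into a similarity score, so the following is only a minimal sketch under stated assumptions: each word is mapped to p(topic | word) by Bayes' rule, and two words (or two segmented sentences) are compared by the cosine of those topic vectors. The table `PHI`, the prior `TOPIC_PRIOR`, and all function names are hypothetical and chosen for illustration; in practice the probabilities would come from an LDA model trained on a domain corpus.

```python
import math

# Hypothetical toy table of p(word | topic) for two topics. These numbers are
# illustrative only; real values would come from a trained LDA model.
PHI = {
    "hardware": {"disk": 0.40, "memory": 0.40, "driver": 0.15, "browser": 0.05},
    "software": {"disk": 0.05, "memory": 0.10, "driver": 0.35, "browser": 0.50},
}
TOPIC_PRIOR = {"hardware": 0.5, "software": 0.5}  # assumed uniform topic prior


def topic_vector(word):
    """Return p(topic | word) via Bayes' rule, in a fixed (sorted) topic order."""
    joint = [PHI[t].get(word, 0.0) * TOPIC_PRIOR[t] for t in sorted(PHI)]
    total = sum(joint)
    if total == 0.0:
        raise KeyError(f"word not in vocabulary: {word}")
    return [j / total for j in joint]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarity(w1, w2):
    """Similarity of two words as the cosine of their topic vectors."""
    return cosine(topic_vector(w1), topic_vector(w2))


def sentence_similarity(words1, words2):
    """Average the topic vectors of each segmented sentence, then compare them."""
    def mean_vector(words):
        vectors = [topic_vector(w) for w in words]
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    return cosine(mean_vector(words1), mean_vector(words2))


if __name__ == "__main__":
    print(word_similarity("disk", "memory"))
    print(word_similarity("disk", "browser"))
    print(sentence_similarity(["disk", "driver"], ["memory", "driver"]))
```

With this toy table, "disk" and "memory" load on the same topic and therefore score higher than "disk" and "browser", even though the strings share no characters. That is the kind of topic-level relatedness an edit-distance measure cannot capture.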
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.1
