Chinese Word Segmentation and Sentence Similarity for Traditional Chinese Medicine Symptoms

Published: 2018-05-13 03:44

Topic: traditional Chinese medicine + symptoms; Source: master's thesis, Zhejiang University, 2017

[Abstract]: Traditional Chinese medicine (TCM) is China's traditional system of medicine and a cultural treasure of the Chinese nation. As medical technology has developed, TCM has drawn growing attention for its holistic, dynamic, and dialectical character. Continuing breakthroughs in information technology and artificial intelligence have also opened new directions for its development. The state has now included TCM informatization in the Outline of the National Informatization Development Strategy. Because TCM informatization started late and has long been underfunded, research on it lags behind overall. This thesis applies natural language processing techniques to an in-depth study of TCM symptoms, a key element of TCM informatization. It focuses on word segmentation of TCM symptom text and on similarity computation for TCM symptom sentences. The specific contributions are as follows:

1) It characterizes TCM symptom data. Based on extensive observation, experiments, and Internet searches, it summarizes seven characteristics: varied expression, divergent interpretation, unclear description, single characters functioning as words, special usage of some characters and words, non-standard character usage, and incomplete dictionaries.

2) It reviews the main algorithms, technical difficulties, and evaluation metrics of Chinese word segmentation and analyzes the strengths and weaknesses of each algorithm. To address the shortcomings of existing segmentation algorithms and the characteristics of TCM symptom data, it designs a TCM symptom segmentation algorithm based on a bidirectional conditional probability statistical model and relative position. Compared with the mutual information model, the bigram model, the forward conditional probability model, and the bidirectional conditional probability model, the proposed method improves precision and recall by an average of 13.39% and 17.88%, respectively.

3) It reviews the main algorithms, technical difficulties, and evaluation metrics of Chinese sentence similarity computation and analyzes each algorithm's strengths and weaknesses in the TCM setting. It improves an existing word similarity measure and introduces a grading scheme for TCM symptom words, dividing them into six levels by importance. Combining word similarity and word importance, it improves the original semantic-vector-based sentence similarity method. The new method improves the accuracy of sentence similarity scoring by 11% over the traditional method.

4) To make TCM algorithms convenient for researchers in the TCM field, it designs and implements, from the perspective of TCM informatization, a complete, easy-to-use, and extensible TCM data mining platform. The platform treats every algorithm as an operator, and users run experiments by combining different operators.
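The segmentation results in the abstract are reported as precision and recall. The abstract does not say how these were computed, so the following is only a minimal sketch under the common word-level definition: a predicted word counts as correct when its character span exactly matches a word in the reference segmentation. The function names and the sample symptom phrase are hypothetical illustrations, not taken from the thesis.

```python
def word_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(gold, predicted):
    """Return (precision, recall, f1) for one sentence.

    Assumes word-level scoring: a predicted word is correct only if its
    span matches a gold word exactly. This is a common convention, not
    necessarily the one used in the thesis.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the reference keeps "头痛" (headache) as one word,
# while the hypothetical segmenter splits it into two single characters.
gold = ["头痛", "发热", "恶寒"]
predicted = ["头", "痛", "发热", "恶寒"]
precision, recall, f1 = segmentation_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example two of the four predicted words match the reference, so over-splitting a symptom word lowers precision more than recall. This is one way the "single characters functioning as words" characteristic can show up in the scores.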

[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1
