Research on Named Entity Extraction Methods for Symptom Phenotypes

Published: 2019-06-16 19:14

【Abstract】: Symptom phenotypes (symptoms and signs) are important entity-level information in clinical data and in bibliographic literature data, and they are the main basis for diagnosis and treatment in both Chinese and Western medicine. In medical data, however, this information is carried mainly by free-text clinical records (chiefly the chief complaint and the history of present illness) and by bibliographic records, so named entity extraction for symptom phenotypes is the first key step in making use of it. Named entity extraction from clinical records has become an active research direction in recent years, but most related work targets diseases, drugs and clinical problems; the more complex symptom phenotype entities have received little attention. Given the importance of symptom phenotype information in traditional Chinese medicine (TCM) diagnosis and treatment, this thesis studies extraction methods for symptom phenotype named entities using TCM clinical records (mainly the history of present illness) and PubMed bibliographic text. Using a relatively large annotated corpus together with unlabeled data, it investigates several approaches: Bootstrapping, classification learning (conditional random fields and structured support vector machines) and feature learning (word embedding and network embedding). The work covers the following three aspects.

(1) On the basis of manual review and data preprocessing, an annotated corpus of 1200 TCM clinical records, consisting mainly of histories of present illness, was constructed. An unsupervised symptom phenotype extraction method based on Bootstrapping and a named entity extraction method based on conditional random fields (CRF) were then developed; their F1 values reached 64.73% and 95.03% respectively, indicating that CRF largely meets the requirements for extracting symptom phenotype entities from history-of-present-illness text. To test fully open extraction performance, cross-test corpora were built across different disease types, across chief complaint and history of present illness, and across first and follow-up visits. CRF performance on these reached 82%, 58.21% and 81.18% respectively, which provides a reference for subsequent research on transferable named entity extraction.

(2) By introducing deep feature representations (word embedding and network embedding), combining them with structured support vector machine (SSVM) and CRF classification models, and integrating unlabeled clinical record data, several symptom phenotype extraction methods were developed (the WENER and GENER methods). The F1 values of WENER reached 98.08% (SSVM) and 97.63% (CRF). The F1 values of GENER based on character features reached 88.42% and 86.01% respectively, while those of GENER based on word features reached 95.04% and 95.00% respectively.

(3) For symptom phenotype entity extraction from medical literature, WENER and GENER were applied in experiments on 1200 PubMed bibliographic records. The F1 values of WENER reached 93.58% and 93.23% respectively, and those of GENER reached 93.57% and 92.04% respectively.

These results show that symptom phenotype named entity extraction based on deep representations has clear advantages both in its ability to integrate unlabeled corpora and in performance, and that it already has some practical value for named entity extraction in Chinese and English. Integrating larger unlabeled corpora would provide a technical basis for high-performance extraction of many types of medical named entities, and thereby promote the construction and development of large-scale medical knowledge graphs.
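
The F1 values quoted in the abstract are not accompanied by a scoring protocol. As a rough illustration only, the sketch below assumes the common convention for sequence labelers such as CRF: text is tagged per character with BIO labels, and a predicted entity counts as correct only if its boundaries and type exactly match a gold entity. The tag name `SYM`, the function names and the example sentence are hypothetical and are not taken from the thesis.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if inside:
            continue  # current entity keeps growing
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray "I-" with no matching "B-" is ignored (treated as "O").
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]):
    """Exact-match, entity-level precision, recall and F1 over sentences."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(idx,) + s for s in bio_to_spans(g)}
        pred_spans |= {(idx,) + s for s in bio_to_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sentence "患者咳嗽三天伴发热"
# (patient has coughed for three days, with fever), one tag per character.
gold = [["O", "O", "B-SYM", "I-SYM", "O", "O", "O", "B-SYM", "I-SYM"]]
# The prediction finds 咳嗽 (cough) but truncates 发热 (fever) to one character.
pred = [["O", "O", "B-SYM", "I-SYM", "O", "O", "O", "B-SYM", "O"]]

precision, recall, f1 = entity_f1(gold, pred)
```

Under exact matching the truncated entity earns no partial credit, so one of the two predicted entities is correct and one of the two gold entities is recovered. Schemes that award partial credit for overlapping boundaries would give different numbers, which is one reason scores from different studies are hard to compare without knowing the protocol.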
【Degree-granting institution】: Beijing Jiaotong University
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP391.1

【References】