Automatic Extraction and Fusion of Scientific and Technological Achievements

Published: 2018-04-15 23:05

Topics: information fusion + Web information extraction; Source: Central South University, 2014 master's thesis

[Abstract]: Extracting academic achievement information from Web pages and fusing it supports the scientific management of academic achievements and provides an important basic resource for deeper mining of experts' academic trajectories. Existing information extraction systems adapt poorly to frequent changes in Web page structure. In addition, because the resources are huge in scale, the information suffers from high redundancy, low credibility, and inconsistent descriptions, so the accuracy of the results is hard to guarantee. This thesis therefore targets experts' scientific and technological achievement information and focuses on two key techniques in Web information fusion: extraction and deduplication.
Many Web information extraction methods exist, but they either depend strongly on extraction templates or place strict requirements on how the page structure may change. To address this, the thesis proposes a Web information extraction algorithm that combines spatial relations with the DOM (Spatial Relation Based DOM, abbreviated SRB-DOM) to extract achievement information from Web pages. The method maps each element node in the DOM tree to an object in two-dimensional space and uses rectangle algebra to describe the spatial relations between these objects. It then uses the spatial relations between element nodes to extract the metadata of the achievement information, builds complete achievement records from the maximal unconnected boundary tuples (最大无连接边界元组), and so completes the extraction. Analysis and simulation experiments show that the method adapts to page structure changes far better than existing path-based information extraction algorithms.
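The mapping from element nodes to spatial relations can be illustrated with a minimal sketch. This is not the thesis's SRB-DOM implementation: it only shows the standard rectangle-algebra idea of describing two axis-aligned boxes by a pair of Allen interval relations, one per axis. The `Box` type, the function names, and the coordinates are hypothetical, and the boxes are assumed to have positive width and height.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical rendered bounding box of a DOM element node."""
    left: float
    top: float
    right: float
    bottom: float


def allen(a_start: float, a_end: float, b_start: float, b_end: float) -> str:
    """Allen interval relation of [a_start, a_end] relative to [b_start, b_end].

    Assumes both intervals have positive length.
    """
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"


def rectangle_relation(a: Box, b: Box) -> Tuple[str, str]:
    """Rectangle-algebra relation: Allen relation on the x axis and on the y axis."""
    return (
        allen(a.left, a.right, b.left, b.right),
        allen(a.top, a.bottom, b.top, b.bottom),
    )


# Hypothetical layout: a field label with its value to the right on the same row,
# and an author line directly below a title line.
label = Box(0, 0, 50, 20)
value = Box(60, 0, 200, 20)
title = Box(0, 0, 100, 20)
author = Box(0, 20, 100, 40)

print(rectangle_relation(label, value))   # ('before', 'equals')
print(rectangle_relation(title, author))  # ('equals', 'meets')
```

A relation such as `('before', 'equals')` holds for a label and its value regardless of which tags or paths produce that layout, which is the intuition behind avoiding path dependence.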
Because information sources are diverse and describe the same achievement in different ways, extraction produces many similar or duplicate results, so the data must be cleaned before it is fused and mined further. The thesis uses entropy increase to measure the importance of each data item in an achievement record, assigns a weight to each data item accordingly, computes the similarity between achievement records, and classifies the achievements. It then proposes a record completion algorithm based on data standardization (Data Standardization Based Record Combine, abbreviated DSBRC). The algorithm first standardizes the feature-based description of each achievement record, then labels the data state of each record to obtain a data state matrix, and derives the complete description of the achievement record from that matrix. Analysis and experiments show that the algorithm outperforms comparable algorithms in the accuracy and completeness of its results.
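The weighting and similarity step can be sketched as follows. The abstract does not give the thesis's entropy-increase formula, so this sketch substitutes a simpler, clearly different scheme as an assumption: each field is weighted by the Shannon entropy of its values across the records, and field-level similarity uses `difflib.SequenceMatcher`. All function names, field names, and sample records are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List


def field_entropy(values: List[str]) -> float:
    """Shannon entropy (in bits) of the value distribution of one field."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weights(records: List[Dict[str, str]], fields: List[str]) -> Dict[str, float]:
    """Weight each field by its share of the total entropy (an assumed scheme)."""
    entropies = {f: field_entropy([r.get(f, "") for r in records]) for f in fields}
    total = sum(entropies.values())
    if total == 0:
        # No field discriminates between records; fall back to equal weights.
        return {f: 1 / len(fields) for f in fields}
    return {f: e / total for f, e in entropies.items()}


def record_similarity(a: Dict[str, str], b: Dict[str, str], weights: Dict[str, float]) -> float:
    """Weighted sum of per-field string similarities, in the range [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a.get(f, ""), b.get(f, "")).ratio()
        for f, w in weights.items()
    )


# Hypothetical achievement records extracted from different sources.
records = [
    {"title": "Web information extraction based on DOM", "year": "2013", "type": "paper"},
    {"title": "Web Information Extraction Based on DOM.", "year": "2013", "type": "paper"},
    {"title": "Entropy weighting for record matching", "year": "2012", "type": "paper"},
]
weights = entropy_weights(records, ["title", "year", "type"])

near_duplicate = record_similarity(records[0], records[1], weights)
different = record_similarity(records[0], records[2], weights)
print(weights)
print(near_duplicate > different)
```

In this sample the `type` field has the same value in every record, so it carries no information and receives zero weight, while the title dominates the comparison. Records whose similarity exceeds a chosen threshold would then be grouped as candidates for merging.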
How well Web information extraction adapts to page structure changes strongly affects a system's practicality, so this adaptability should be improved as far as possible. Extracting information with the SRB-DOM algorithm proposed in this thesis removes the dependence on paths entirely and adapts much better than traditional path-based extraction methods. The proposed entropy-increase-based classification improves the classification accuracy of achievement records, and the DSBRC algorithm effectively improves the completeness and accuracy of record merging, which is valuable for subsequent in-depth data mining and knowledge discovery.

[Degree-granting institution]: Central South University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092

[References]