Research on Data Extraction and Result Aggregation Techniques for the Deep Web

Published: 2018-06-19 11:19

Topic: deep web; Source: master's thesis, Harbin Engineering University, 2012

【Abstract】: With the rapid growth of computer networks, online resources have become increasingly abundant. This broadens the channels through which people obtain information, but the disorder of that information also makes it hard for users to find what they need in the vast volume available. Search engines address this by providing retrieval and classification of online information. Some resources, however, cannot be indexed by traditional search engines. These are called deep web resources: online web databases that can be accessed but that traditional search engines cannot index. Deep web resources are increasingly popular because they are rich in content, highly specialized, quick to update automatically, massive in volume, and broad in domain coverage. Studying how to extract the data returned through deep web query interfaces, and how to aggregate the extracted results, therefore has significant theoretical and practical value. This thesis studies data extraction and result aggregation for deep web resources. For data extraction, it first briefly introduces MDR and summarizes the efficiency problems MDR encounters when extracting information from deep web pages, then draws on the MDR algorithm to propose an improved version that lowers the time complexity of extraction. The extraction algorithm represents an HTML page as a tag tree; before extraction, the page is cleaned and normalized and the tag tree is constructed. Data records are located using the structural similarity of tag trees. The similarity measure avoids the high time complexity of tree edit distance algorithms and the inability of element comparison methods to reflect the true tree structure, and it gives good results in deep web data extraction. Some data records, however, have low similarity to one another, and the tag tree similarity algorithm performs poorly on them. To identify data records with such tag structures, the thesis builds on the improved tag tree structural similarity test and proposes a data record extraction algorithm based on incomplete subtree matching. Result aggregation focuses on deduplicating the extracted results: records are sorted by attribute weight before deduplication, which reduces the number of comparisons and allows data records to be deduplicated quickly and effectively. Experiments show that the data record extraction algorithm based on tag tree path structural similarity is more efficient than MDR, and that the data record discovery algorithm based on incomplete subtree matching extracts better than both MDR and the tag tree path structural similarity algorithm. Deduplication after sorting by attribute weight is more efficient than direct deduplication.
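The abstract does not give the similarity formula itself, so the following is only an illustrative sketch of the general idea of comparing tag trees by their tag paths. It assumes a stand-in measure (multiset Jaccard over root-to-node tag paths) rather than the thesis's actual method, and the names `TagPathCollector`, `tag_paths`, and `path_similarity` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects every root-to-node tag path of an HTML fragment as a multiset."""

    def __init__(self):
        super().__init__()
        self.stack = []         # tags currently open, outermost first
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        # Attributes and text are ignored: only the tag structure matters here.
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed inner tags by popping back to the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_similarity(html_a, html_b):
    """Multiset Jaccard similarity of tag paths (assumed measure), in [0, 1]."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    union = sum((a | b).values())  # per-path maximum counts
    if union == 0:
        return 1.0                 # two fragments without tags are treated as identical
    return sum((a & b).values()) / union  # per-path minimum counts over maximum counts


if __name__ == "__main__":
    record_1 = "<li><a>Book A</a><span>$10</span></li>"
    record_2 = "<li><a>Book B</a><span>$12</span></li>"
    record_3 = "<li><a>Book C</a><span>$9</span><span>sale</span></li>"
    advert = "<div><img><p>ad</p></div>"
    print(path_similarity(record_1, record_2))  # same structure, different text
    print(path_similarity(record_1, record_3))  # one extra span
    print(path_similarity(record_1, advert))    # unrelated structure
```

Collecting paths takes one pass over each fragment, which is the kind of saving over tree edit distance that the abstract refers to; unlike a flat comparison of element names, the paths still encode where each tag sits in the tree. Sibling subtrees whose similarity exceeds a chosen threshold would then be treated as candidate data records.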
【Degree-granting institution】: Harbin Engineering University
【Degree level】: Master's
【Year degree awarded】: 2012
【Classification number】: TP393.09

【References】