Research on Automatic Identification and Classification of Deep Web Data Sources
Published: 2018-08-25 09:20
【Abstract】: The Deep Web, also called the invisible Web or hidden Web, refers to online content that standard search engines such as Google cannot reach. Hidden content of this kind is commonly estimated to account for about 90% of all information on the Web. According to a Bright Planet technical white paper, Deep Web resources are roughly 500 times larger than the Surface Web and contain far more valuable material, and more than half of that content is stored in domain-specific databases. Surface information can be retrieved through ordinary search engines, but a very large share of information remains out of their reach because it is hidden, and Deep Web data sources are constantly changing. Most hidden information is generated dynamically in response to queries and can only be obtained through Deep Web query interfaces, which standard search engines cannot crawl; this makes Deep Web information harder to acquire. To obtain Deep Web information effectively, the data sources must be identified and classified automatically.

This thesis studies two central problems: automatic identification and automatic classification of Deep Web data sources. The main research contents are:

(1) The form features of ordinary web pages and of Deep Web pages are analyzed. After merging, adding, and filtering candidate features, the form feature extraction scheme adopted in this thesis uses the values of the form controls, the number of controls, terms carrying semantic information, and related feature values as classification attributes (a sketch of this kind of feature extraction follows the abstract).

(2) A key problem of Deep Web data integration is studied: recognizing and classifying query interfaces. To address the limitations of the naive Bayes method, a rough set algorithm is used for optimization and reduction. The method builds a group of naive Bayes classifiers from two rounds of random sampling, reduces the classifier group with the attribute reduction method of rough set theory, classifies with the optimized group, and combines the individual results by a weighted average to produce the final classification (a sketch of the classifier-group idea also follows the abstract). Experimental results show that the optimized Bayesian classifier group clearly improves both precision and recall for identifying and classifying Deep Web query interfaces.

(3) The identification and classification performance is compared against other methods. Several classification methods from data mining, such as C4.5 decision trees and ID3, are analyzed and compared with the proposed algorithm; the results on precision and recall confirm that the method is feasible.

The approach taken in this thesis is to analyze the existing related research, study Deep Web data sources, and, building on previous results, validate the effectiveness of the improved algorithm with experimental data. Judged by the experiments, the results of the method are fairly satisfactory. Shortcomings inevitably remain, and the related problems and algorithms will be refined further in future work. Research on the Deep Web still has a long way to go, and its open problems will need to be solved one by one by the research community.
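The feature scheme in point (1) can be pictured with a small sketch. The following Python fragment is a minimal, hypothetical illustration of collecting form features from a page, assuming the HTML is available as a string. The specific feature names, the use of the standard-library HTMLParser, and the toy form are assumptions made for illustration only and are not the thesis's actual extraction scheme.

# Hypothetical sketch of the kind of form-feature extraction described in (1):
# count form controls by type, collect their value attributes, and gather
# label text as candidate semantic terms. Feature names are illustrative.
from html.parser import HTMLParser


class FormFeatureExtractor(HTMLParser):
    """Collects simple features from the first <form> element on a page."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.control_counts = {}   # e.g. {"text": 2, "select": 1, "submit": 1}
        self.control_values = []   # literal value attributes of controls
        self.terms = []            # words from text nodes inside the form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in ("input", "select", "textarea", "button"):
            kind = attrs.get("type", tag)
            self.control_counts[kind] = self.control_counts.get(kind, 0) + 1
            if attrs.get("value"):
                self.control_values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        if self.in_form:
            self.terms.extend(w.lower() for w in data.split() if w.isalpha())

    def features(self):
        """Flatten the collected information into a feature dictionary."""
        return {
            "num_controls": sum(self.control_counts.values()),
            "num_text_inputs": self.control_counts.get("text", 0),
            "has_select": int("select" in self.control_counts),
            "control_values": self.control_values,
            "semantic_terms": self.terms,
        }


sample_form = """<form action="/search"><label>Book title</label>
<input type="text" name="q"><select name="cat"><option>Fiction</option></select>
<input type="submit" value="Search"></form>"""
extractor = FormFeatureExtractor()
extractor.feed(sample_form)
print(extractor.features())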
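Point (2) combines several naive Bayes classifiers built from random subsamples and averages their outputs with weights. The sketch below, written against scikit-learn and NumPy, is only an assumed approximation: the toy data, the accuracy-based member weights, and the omission of the rough-set attribute reduction step are simplifications, not the thesis's actual algorithm.

# A minimal sketch of the classifier-group idea in (2): naive Bayes models
# trained on random subsamples, combined by a weighted average of their
# class probabilities. Rough-set attribute reduction is not reproduced here.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Toy binary feature matrix: rows are forms, columns are features such as
# "has text input", "has select", "contains the word 'search'", ...
X = rng.randint(0, 2, size=(200, 6))
# Toy labels: 1 = Deep Web query interface, 0 = ordinary form (e.g. login).
y = (X[:, 0] & X[:, 2]).astype(int)

classifiers, weights = [], []
for seed in (1, 2):                      # "two rounds of random sampling"
    Xs, ys = resample(X, y, random_state=seed)
    clf = BernoulliNB().fit(Xs, ys)
    classifiers.append(clf)
    # Weight each member by its accuracy on the full sample (illustrative).
    weights.append(clf.score(X, y))

weights = np.array(weights) / sum(weights)

def classify(forms):
    """Weighted average of the members' posterior probabilities."""
    probs = sum(w * clf.predict_proba(forms) for w, clf in zip(weights, classifiers))
    return probs.argmax(axis=1)

print(classify(X[:5]))

Bootstrap-style resampling plus weighted averaging is a generic ensemble pattern; the thesis's contribution lies in pruning the classifier group with rough-set attribute reduction before this combination step.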
【Degree-granting institution】: 西南大学 (Southwest University)
【Degree level】: Master's
【Year degree conferred】: 2013
【CLC number】: TP391.3
Article ID: 2202436
Link to this article: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2202436.html