A Wikipedia-Based System for Assessing the Data Quality of Web Pages

Published: 2018-05-05 02:08

Topics: Web data quality + support vector machine; Reference: Nanjing University of Posts and Telecommunications, 2014 master's thesis

[Abstract]: In recent years, information resources on the Web have grown explosively, and the Web is flooded with duplicated, tampered, and false information. Users browsing web pages often get lost in this sea of information and cannot tell whether what they retrieve is accurate and complete. Data quality assessment is a key step in solving this problem. Building on a survey of web page quality assessment techniques in China and abroad, and drawing on machine learning and information extraction, this thesis proposes a method that assesses a user-supplied source web page using Wikipedia pages as the reference benchmark. The method proceeds in three main steps. First, given a web page link entered by the user, keywords are extracted from the page and used to collect pages from Wikipedia. Next, machine learning is used to screen the quality of the Wikipedia pages, and information is extracted from the pages that pass screening and stored as semantic triples. Finally, the semantic triples are used to analyze the quality of the source page along multiple dimensions by comparison. The method has the following advantages. First, by integrating related Wikipedia pages as the benchmark, it makes full use of collective intelligence and reflects the quality defects of the source page well. Second, it uses a support vector machine (SVM) to screen the quality of Wikipedia pages and an LDA model to screen topic relevance, providing high-quality, highly relevant reference pages for the source page. Third, whereas traditional web page assessment methods are mainly non-semantic, the source page assessment in this thesis takes a semantic approach that fully exploits the semantic information of web pages. Theoretical analysis and experimental comparison demonstrate the feasibility and effectiveness of the method.
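
The final step of the abstract, comparing a source page against reference pages by way of semantic triples, might look roughly like the sketch below. This is an illustration under stated assumptions, not the thesis's implementation: the `Triple` type, the `compare_triples` function, the two metrics (completeness and accuracy), and the sample data are all hypothetical, and the abstract does not say which quality dimensions the thesis actually measures.

```python
from typing import Dict, Iterable, NamedTuple


class Triple(NamedTuple):
    """A (subject, predicate, object) semantic triple. Hypothetical type."""
    subject: str
    predicate: str
    obj: str


def compare_triples(source: Iterable[Triple],
                    reference: Iterable[Triple]) -> Dict[str, float]:
    """Score source-page triples against reference triples.

    Assumed metrics (not taken from the thesis):
      completeness = share of reference facts the source also mentions
      accuracy     = share of those shared facts whose values agree
    """
    # Index each triple by (subject, predicate) so objects can be compared.
    ref_index = {(t.subject, t.predicate): t.obj for t in reference}
    src_index = {(t.subject, t.predicate): t.obj for t in source}

    shared = src_index.keys() & ref_index.keys()
    agreeing = sum(1 for key in shared if src_index[key] == ref_index[key])

    completeness = len(shared) / len(ref_index) if ref_index else 0.0
    accuracy = agreeing / len(shared) if shared else 0.0
    return {"completeness": completeness, "accuracy": accuracy}


# Made-up sample data, for illustration only.
reference = [
    Triple("ExampleCity", "country", "A"),
    Triple("ExampleCity", "province", "B"),
    Triple("ExampleCity", "population", "100"),
    Triple("ExampleCity", "area", "50"),
]
source = [
    Triple("ExampleCity", "country", "A"),   # agrees with the reference
    Triple("ExampleCity", "province", "C"),  # conflicts with the reference
    Triple("ExampleCity", "mayor", "D"),     # absent from the reference
]

result = compare_triples(source, reference)
print(result)
```

Keying the index on (subject, predicate) is a simplification that keeps one object per key; a fuller version would need to handle multi-valued predicates and approximate matching of entity names.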
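[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092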

[Related literature]