Research on Web Page Cleaning Techniques for Web Data Fusion
Published: 2018-02-23 12:17
Keywords: web page cleaning; word-leaf ratio (WLR); duplicate web pages; topic segmentation; hierarchical retrieval. Source: Central South University, 2014 master's thesis. Document type: degree thesis.
[Abstract]: The Internet contains large numbers of duplicate web pages and a great deal of page noise, so users may spend more time than expected obtaining the information they need. Before Web data fusion can present that information, the page content must be cleaned.

Exploiting the hierarchical structure of page markup and the characteristics of page body text, this thesis removes page noise with a method based on the DOM tree and the word-leaf ratio (WLR). All operations are carried out on the DOM tree, so the complete structural information of the page body is preserved. Node statistics count only the leaf nodes a subtree contains; because all text content resides in leaf nodes, the statistics are more precise.

For duplicate-page detection, a "segment first, then extract" strategy improves how well the extracted feature items represent the full text. The classical TSF segmentation method is improved so that, guided by a sentence similarity matrix, block sizes are set dynamically and topic boundaries are recognized automatically, without user involvement, splitting the page text into locally coherent sub-topic fragments. Key sentences extracted from each fragment serve as its feature items; since the feature items follow the shifts between sub-topics, they represent the content of a page more completely.

Borrowing the simHash fingerprint-generation scheme, a feature fingerprint is produced for each topic fragment, and fragment similarity is judged by the Hamming distance between fingerprints. Before detection, the page collection is filtered by topic-fragment count and text length to reduce the number of pages that must be searched, and fragment fingerprints are looked up with a hierarchical scheme adapted from an existing grouped-retrieval method, raising retrieval efficiency.

Processing pages in this way improves the precision and recall of both noise cleaning and duplicate-page removal, avoids operating on irrelevant content and reprocessing the same pages, saves storage space, improves retrieval performance, reduces the time and space overhead of subsequent processing, and raises the efficiency and accuracy of the whole Web fusion system.
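The DOM-tree word-leaf-ratio cleaning described in the abstract can be sketched roughly as follows. The thesis's exact WLR formula and pruning threshold are not given on this index page, so the words-per-leaf-node ratio and the threshold used here are assumptions for illustration only:

```python
from html.parser import HTMLParser

class Node:
    """A minimal DOM node: tag, children, and the text held by leaves."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.texts = [], []

class TreeBuilder(HTMLParser):
    """Build a simple DOM tree from HTML markup."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent
    def handle_data(self, data):
        if data.strip():
            self.cur.texts.append(data.strip())

def leaf_stats(node):
    """Return (word_count, leaf_count) for a subtree.
    Only leaf nodes are counted, since all text lives in leaves."""
    words = sum(len(t.split()) for t in node.texts)
    leaves = 1 if not node.children else 0
    for child in node.children:
        w, l = leaf_stats(child)
        words += w
        leaves += l
    return words, max(leaves, 1)

def wlr(node):
    """Assumed WLR: words per leaf node. Content-rich subtrees score high;
    link lists and menus (many leaves, few words each) score low."""
    words, leaves = leaf_stats(node)
    return words / leaves

def prune_noise(node, threshold=2.0):
    """Drop child subtrees whose WLR falls below the (assumed) noise threshold."""
    node.children = [c for c in node.children if wlr(c) >= threshold]
    for child in node.children:
        prune_noise(child, threshold)
```

Because pruning happens directly on the tree, the surviving body text keeps its original structure, matching the abstract's point that all operations stay on the DOM tree.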
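The per-fragment fingerprinting follows the simHash idea: each token votes on every bit position, and the fingerprint keeps a 1 wherever the vote total is positive, so similar fragments yield fingerprints with a small Hamming distance. A minimal sketch, assuming MD5 as the per-token hash (the source does not specify one):

```python
import hashlib

def simhash(tokens, bits=64):
    """simHash-style fingerprint: each token's hash votes +1/-1 per bit
    position; the final bit is 1 where the vote total is positive."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two fragments are then judged near-duplicates when `hamming` of their fingerprints falls below a chosen threshold.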
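The hierarchical (grouped) fingerprint lookup can be sketched with the standard simHash pigeonhole trick: split each 64-bit fingerprint into k+1 bands, so any two fingerprints within Hamming distance k must agree exactly on at least one band, and only those colliding candidates need a full distance check. Whether the thesis uses exactly this banding is not stated here; the band count and threshold below are illustrative assumptions:

```python
from collections import defaultdict

BITS, BANDS = 64, 4  # 4 bands handle a distance threshold of k = 3

def bands(fp):
    """Split a 64-bit fingerprint into 4 disjoint 16-bit bands."""
    width = BITS // BANDS
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(BANDS)]

class FingerprintIndex:
    """Banded index: exact match on any one band yields the candidate set,
    so most of the collection is never compared bit-by-bit."""
    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]
        self.fps = {}
    def add(self, doc_id, fp):
        self.fps[doc_id] = fp
        for i, b in bands(fp):
            self.tables[i][b].add(doc_id)
    def candidates(self, fp):
        out = set()
        for i, b in bands(fp):
            out |= self.tables[i][b]
        return out
    def near_duplicates(self, fp, k=3):
        return [d for d in self.candidates(fp)
                if bin(self.fps[d] ^ fp).count("1") <= k]
```

The cheap pre-filters mentioned in the abstract (topic-fragment count, text length) would run before `candidates`, shrinking the collection even further before any fingerprint comparison.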
[Degree-granting institution]: Central South University
[Degree level]: Master's
[Year conferred]: 2014
[CLC classification]: TP393.092
Article ID: 1526617
Link: https://www.wllwen.com/guanlilunwen/ydhl/1526617.html