A Statistics-Based Algorithm for Generating Web Page Purification Templates

Published: 2018-04-02 10:21

Topic: web page purification　Focus: information extraction　Source: Science Technology and Engineering, 2013, Issue 04

[Abstract]: Most pages of the same site share an almost identical DOM tag tree. A processed version of that tag tree can serve as a template: every page of the site keeps only the content held by the template's leaf nodes, which purifies all pages of the site. First, a content block tree is extracted from each page in a set of sample pages from one site. For each tree, the number of text characters contained in every tag node is counted, and among sibling nodes only the one with the most characters is kept, which yields a unilateral subtree (UST). Next, the set of USTs is merged: at each level, the node that occurs most often is taken as the important content node, and chaining these nodes together forms the important unilateral subtree (PUST). Finally, the character counts of each parent node and its child node are compared; when the ratio exceeds a threshold, all nodes below the child node are deleted, which yields the site's important unilateral subtree (SPUST). This SPUST is the web page purification template for the site.
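The three steps in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the function names, the `sample_page` helper, the prefix-consistent merge rule, and the threshold value are all hypothetical, and the abstract does not say which way the parent/child ratio is taken (the sketch assumes parent count divided by child count).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical content-block-tree node: a tag, its own text, its children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def text_count(node):
    """Number of text characters contained in the node's whole subtree."""
    return len(node.text) + sum(text_count(c) for c in node.children)


def build_ust(root):
    """Step 1: among siblings keep only the child with the most text.

    Returns the UST as a root-to-leaf path of (tag, text_count) pairs.
    """
    path, node = [], root
    while node is not None:
        path.append((node.tag, text_count(node)))
        node = max(node.children, key=text_count, default=None)
    return path


def merge_usts(usts):
    """Step 2: at each depth pick the most frequent tag among the USTs that
    agree with the path chosen so far (an assumed merge rule), and average
    their text counts."""
    pust, candidates, depth = [], list(usts), 0
    while True:
        candidates = [p for p in candidates if len(p) > depth]
        if not candidates:
            return pust
        tag, _ = Counter(p[depth][0] for p in candidates).most_common(1)[0]
        candidates = [p for p in candidates if p[depth][0] == tag]
        mean = sum(p[depth][1] for p in candidates) / len(candidates)
        pust.append((tag, mean))
        depth += 1


def prune(pust, threshold):
    """Step 3: cut everything below the first child whose parent holds more
    than `threshold` times as much text (assumed ratio direction)."""
    for i in range(1, len(pust)):
        parent, child = pust[i - 1][1], pust[i][1]
        if child == 0 or parent / child > threshold:
            return [tag for tag, _ in pust[: i + 1]]
    return [tag for tag, _ in pust]


def sample_page(n):
    """Toy page: a short nav block and an article of three paragraphs."""
    para = lambda: Node("p", "x" * n, [Node("span", "y" * 10)])
    article = Node("div#article", children=[para(), para(), para()])
    main = Node("div#main", children=[Node("h1", "Title"), article])
    body = Node("body", children=[Node("div#nav", "Home News About"), main])
    return Node("html", children=[body])


usts = [build_ust(sample_page(n)) for n in (40, 60, 80)]
template = prune(merge_usts(usts), threshold=2.0)
print(template)
```

In this toy site the navigation block is dropped in step 1 because it holds less text than the main block, and the path is cut at `p` in step 3 because the article's text is spread over several paragraphs.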
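[Author affiliations]: Chongqing University of Education, Network Center, Department of Mathematics and Information Engineering; Chongqing University of Education, Network Center, Finance Office; Chongqing University of Education, Network Center, Academic Affairs Office
[Funding]: Supported by a Chongqing Education College research project (KY201176C)
[Classification number]: TP393.092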

[References]