Development of an HTML Parser for a Web Document Cleaning System

Published: 2018-03-23 17:33

Topic: HTML parser; Focus: lexer; Source: 《计算机应用研究》 (Application Research of Computers), 2002, Issue 02

[Abstract]: When building a Web-oriented information system, removing useless data such as scripts, advertising links, and navigation links improves the efficiency of information storage and retrieval. Merging and splitting Web documents on a semantic basis also helps with information management. Both are tasks of a Web document cleaning system. In Web document cleaning, both offline rule learning and online document cleaning must be built on an analysis of the structure and content of Web documents. Starting from the general concepts of HTML parsing and taking into account the requirements of the Web document cleaning system, this paper describes the structure of a self-developed HTML parser and discusses in detail the design of its components: the dictionary, the lexical analyzer, and the syntax analyzer.
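[Author affiliation]: Department of Computer Science and Technology, Nanjing University; State Key Laboratory for Novel Software Technology, Nanjing University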
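[Funding]: National Natural Science Foundation of China (60073030); Ministry of Education key project "Research on Key Technologies for Modern Distance Education"; Fujitsu Laboratories (Japan) funded project "Research on Web Document Cleaning Technology"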
[Classification number]: TP393.092
