Design and Implementation of a Web Information Extraction System Based on Web Page Structure

Published: 2018-06-12 21:11

Topic: DOM + information extraction; Reference: Jilin University, 2011 master's thesis

[Abstract]: With the growing reach of networks and their applications, the Internet has become the largest information repository in the world, but not all of this information is useful to users. The useful information is usually buried in a large amount of irrelevant structure and text, which seriously reduces the efficiency with which users obtain the topic information of a web page and lowers the usability of the Web. A network information extraction and integration system extracts data from the Internet and integrates it into XML or a relational database in order to provide users with data retrieval, data mining, OLAP and other information services. However, the data in HTML pages is not structured, and the pages contain many HTML tags, images, Flash advertisements and other elements unrelated to the page data, which makes it difficult for an information integration system to integrate the data. To address this difficulty, researchers have done a great deal of work, which has led to techniques for extracting the topic information of web pages. By deleting redundant tags and the images, Flash advertisements and other elements unrelated to the topic information, these techniques extract the real topic content of a page. This can markedly reduce page size and increase the usefulness of the information, which improves the efficiency and accuracy of the information integration system and lays the foundation for subsequent data services such as data retrieval, data mining and OLAP. Web page topic information extraction therefore has considerable research significance and application value in both theory and practice, and has become one of the research hotspots in the information systems field in recent years.

After extensive study, this thesis finds that current methods for extracting the topic information of web pages each have shortcomings of one kind or another. It therefore proposes a new extraction method based on the STU-DOM model: a page structure filtering and block segmentation algorithm built on that model, together with pruning based on topic relevance. A web page topic information extraction system is designed and implemented according to these algorithms.

Based on block theory, the STU tree model and the STU-DOM model are designed. The STU-DOM model can effectively describe the structure, content and block layout of a web page, which improves the accuracy, reliability and extensibility of the algorithms. Based on the STU-DOM model, an HTML structure filtering and block segmentation algorithm and a pruning algorithm based on topic relevance are proposed. These algorithms can automatically extract topic information from heterogeneous web pages with relatively high accuracy and generality. Several optimization strategies are proposed and implemented: the block granularity is refined, a function word table and a keyword table are designed, and topic relevance is computed with weights. These optimizations significantly improve the efficiency and accuracy of the algorithms and reduce the redundancy of web page information. Experimental tests show that the proposed method can extract the topic information of a web page automatically, accurately and quickly without changing the content, structure or layout of the page, and it therefore has considerable research significance and application value.
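
The abstract does not give the details of the STU-DOM model or of its algorithms, so the following is only a rough sketch of the general pipeline it describes: filter noise tags, split the page into blocks, then prune blocks by a weighted topic relevance score. The tag sets, the flat (non-tree) block list, the scoring formula, the threshold and all names (`BlockSplitter`, `topic_relevance`, `extract_topic_blocks`) are assumptions made for illustration, not the thesis's actual method.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets; the abstract does not list the thesis's actual rules.
NOISE_TAGS = {"script", "style", "iframe", "object"}
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}


class BlockSplitter(HTMLParser):
    """Drop text inside noise tags and split the rest into flat text blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished text blocks, in document order
        self._buffer = []       # text pieces of the block being built
        self._noise_depth = 0   # greater than 0 while inside a noise tag

    def _flush(self):
        # Join buffered text, normalise whitespace, keep non-empty blocks.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._noise_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def topic_relevance(block, keywords, function_words):
    """Average keyword weight over the block's non-function words (assumed formula)."""
    words = [w for w in re.findall(r"\w+", block.lower())
             if w not in function_words]
    if not words:
        return 0.0
    return sum(keywords.get(w, 0.0) for w in words) / len(words)


def extract_topic_blocks(html, keywords, function_words, threshold=0.2):
    """Return the blocks whose relevance score reaches the (assumed) threshold."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return [b for b in splitter.blocks
            if topic_relevance(b, keywords, function_words) >= threshold]


# Hypothetical page, keyword table (word -> weight) and function word table.
page = ("<div>Buy cheap watches now</div>"
        "<script>var x = 'dom extraction';</script>"
        "<div>DOM based extraction of the page topic</div>")
keyword_table = {"dom": 2.0, "extraction": 2.0, "topic": 1.0}
function_word_table = {"of", "the", "a", "now", "based"}

print(extract_topic_blocks(page, keyword_table, function_word_table))
```

In this toy input the advertisement block contains no keywords and the script text is filtered out before scoring, so only the second `div` survives pruning. A tree-based model such as the one the thesis describes would prune subtrees of the DOM instead of entries in a flat list, which is what would let it preserve the page's structure and layout.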
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2011
[Classification number]: TP393.09

[Similar literature]