Design and Implementation of an Improved Web-Based Information Extraction Algorithm

Published: 2018-03-03 22:33

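Topic: information extraction　Focus: pairwise sequence alignment　Source: University of Electronic Science and Technology of China, 2014 master's thesis　Type: degree thesis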

[Abstract]: With the rapid development of the Internet and related technologies, the Internet has become the main platform on which people publish and obtain information. The flood of information online makes it difficult for users to find what is useful, and the current ability to search Web pages for specific information does not yet meet users' needs. How to develop an effective information extraction method for use in Web page information extraction systems has therefore become a pressing research problem. This thesis studies a new information extraction algorithm that automatically extracts information from data-intensive pages. The work addresses the following problems. First, initialization: all sample pages in the training set are converted into HTML document form. Second, automatic removal of page noise: many websites today, for example commercial sites such as Taobao, group-buying sites, and travel sites, carry navigation bars, advertisements, logos, copyright notices, and other information unrelated to the main content. This thesis uses an improved pairwise sequence alignment algorithm to remove such noise from Web pages. Third, automatic template extraction: dynamic page technology is now adopted by many websites and applied throughout website design. The "dynamic" pages studied here are those produced by combining a template with a back-end database, and the Web information extraction method targets such pages. The denoised pages are repaired into well-formed standard pages that serve as the training set, and the template extraction algorithm is run on them. Finally, experiments are carried out on a collection of data-intensive Web pages from real websites. The results demonstrate the effectiveness of the improved pairwise sequence alignment for page denoising and of the information extraction system designed in this thesis for information extraction.
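The abstract does not describe how the improved alignment works, so the following is only a rough illustration of the general idea and not the thesis's algorithm. It assumes two pages generated from the same template, tokenizes each into tags and text, aligns the two token sequences with the textbook Needleman-Wunsch global alignment, and treats tokens that align identically as shared boilerplate (navigation, copyright, and so on). The function names `tokenize`, `align`, and `varying_text`, the scoring values, and the sample pages are all hypothetical.

```python
import re

# A tag such as <div>, or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Textbook Needleman-Wunsch global alignment (not the thesis's improved variant).

    Returns a list of (i, j) index pairs; None marks a gap on that side.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def varying_text(page_a, page_b):
    """Return text tokens of page_a that are not aligned to an identical token in page_b."""
    a, b = tokenize(page_a), tokenize(page_b)
    shared = {i for i, j in align(a, b)
              if i is not None and j is not None and a[i] == b[j]}
    return [tok for i, tok in enumerate(a)
            if i not in shared and not tok.startswith("<")]


# Hypothetical pages that share a navigation bar and a copyright line.
page_a = ("<div><ul><li>Home</li><li>About</li></ul>"
          "<p>Red kettle, 19 USD</p><span>Copyright Example</span></div>")
page_b = ("<div><ul><li>Home</li><li>About</li></ul>"
          "<p>Blue teapot, 25 USD</p><span>Copyright Example</span></div>")
print(varying_text(page_a, page_b))
```

In this sketch the shared navigation and copyright text align identically across the two pages and are dropped, while the product description that differs between them is kept as candidate content. The quadratic time and memory cost of plain Needleman-Wunsch is one reason a real system would likely need a refined variant.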
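[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP393.092;TP391.1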

[References]