Research on Automatic Indexing of Web Information

Published: 2018-06-27 01:35

Topic: Web information + automatic indexing; Source: Zhejiang University, 2014 doctoral dissertation

[Abstract]: The growth of the Internet and the advance of informatization projects have turned Web information into a vast resource space that supports information exchange and sharing and affects every aspect of human life. Web information is massive, unordered, heterogeneous, continuously updated, and diverse. To retrieve the resources they need from it quickly and accurately, people have come to recognize the importance of organizing and managing Web information and have begun to explore various Web information processing methods, of which automatic indexing is one. Taking the automatic extraction of index terms from Web information as its starting point, this study examines the Web coordinate system, the organizational structure of Web pages, and the reading habits of page viewers in order to identify the specific factors that influence automatic indexing of Web information. Building on a review of previous work, it proposes the following approach: use the page coordinate system to divide a web page into several regions, with split ratios that depend on the type of site; determine which region each Web information block belongs to and, for each site type, explore the weight of each region's information blocks in automatic indexing; and finally write programs to verify this approach and complete each stage of automatic indexing. The specific steps are as follows:

(1) Implement Web page collection. Both batch and manual collection of Web pages are implemented as the research requires, and problems that arise during collection, such as page encoding conversion and HTML-to-XML conversion, are solved.

(2) Divide each Web page into nine regions, using the page coordinate system together with viewers' reading habits. Each region occupies a certain proportion of the page, and the information blocks within a region are treated as one cluster of information blocks that share the same indexing weight in later computation and are processed together.

(3) Find suitable page split ratios for different types of sites. Each type of site has its own way of publishing page information. News sites, for example, tend to have few images and consist mainly of text reports; most of them also let viewers comment on a news item, so page height varies widely. Pages from news, sports, and science sites are tested with different split ratios to find a suitable ratio for each site type.

(4) Explore the weights of information blocks in different regions during automatic indexing. Viewers have characteristic visual focal points and reading habits when visiting a Web page, so page designers arrange page information with corresponding emphasis. Whether the relative importance of information in different page regions can be identified therefore directly affects the accuracy of the subsequent indexing results. Through sample experiments, the importance of information blocks in different regions of news and science site pages is explored, and the automatic indexing weights of page-region information blocks are obtained for each site type.

(5) Implement automatic indexing of Web pages. Taking into account the noise in Web page information and the characteristics of each region, and drawing on the strengths of text-based methods, a method for automatic indexing of Web information is presented, implemented in a program, and verified.

In addition, the dissertation examines how page width and page height correlate with the recall and precision of information extraction under different page split ratios, in the hope of assisting future research in this field. In summary, it explores the key technologies at each stage of automatic indexing of Web information, discusses suitable split ratios for pages from different types of sites, and studies the relationship between the page coordinate system and the automatic indexing process, providing a reference for related research.
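The region-based weighting described in steps (2), (4), and (5) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names `region_of` and `index_terms`, the 20%/80% split ratios, the region weights, and the simple word tokenizer are all assumptions chosen for illustration, since the abstract does not report the actual values or the term extraction method.

```python
import re
from collections import Counter


def region_of(x, y, page_w, page_h, col_split=(0.2, 0.8), row_split=(0.2, 0.8)):
    """Map a block position (x, y) to one of nine regions, numbered 1..9 row by row.

    The split ratios are hypothetical defaults; the abstract says suitable
    ratios differ by site type but does not give them.
    """
    col = sum(x >= page_w * s for s in col_split)  # 0 = left, 1 = middle, 2 = right
    row = sum(y >= page_h * s for s in row_split)  # 0 = top, 1 = middle, 2 = bottom
    return row * 3 + col + 1


def index_terms(blocks, page_w, page_h, weights, top_k=3):
    """Score terms by the weight of the region their block falls in.

    blocks  : iterable of (x, y, text) tuples, one per information block
    weights : dict mapping region number -> weight (hypothetical values)
    """
    scores = Counter()
    for x, y, text in blocks:
        # All blocks in the same region share one weight, as in step (2).
        w = weights.get(region_of(x, y, page_w, page_h), 0.0)
        for term in re.findall(r"[a-z]+", text.lower()):
            scores[term] += w
    return [term for term, _ in scores.most_common(top_k)]


# Hypothetical 1000 x 2000 page: body text in the centre, navigation top-left.
blocks = [
    (500, 1000, "indexing of web pages"),  # centre region (5)
    (50, 100, "home login"),               # top-left region (1), treated as noise
    (500, 300, "web indexing"),            # top-middle region (2)
]
weights = {5: 1.0, 2: 0.5, 1: 0.0}
top_terms = index_terms(blocks, 1000, 2000, weights, top_k=2)
```

Giving the top-left region a weight of zero is one simple way to suppress navigation noise. A fuller version would also filter stop words and use a tokenizer suited to the page language.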
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: TP393.09
