Design and Implementation of a Web Page Topic Tag System Based on Web Mining

Published: 2018-04-19 01:11

Thesis topic: Web pages + topic tags; Reference: Beijing University of Posts and Telecommunications, 2017 master's thesis

[Abstract]: With the rapid development of the Internet, the amount of information online has grown explosively. This greatly enriches the channels through which users obtain information, but it also makes Web information heterogeneous and redundant, which makes it harder for users to quickly and precisely locate the information they are interested in. With the arrival of the Web 2.0 era, tags have become a way of organizing information on the Internet. At present, some researchers index Web pages using techniques such as text classification and automatic summarization in order to improve the efficiency and accuracy of user retrieval. However, this coarse-grained extraction and indexing of key Web page information still cannot meet users' information-seeking needs, because it ignores the characteristics of the pages themselves. In addition, handling different types of pages in a uniform way yields output of limited accuracy and lacks scenario-specific analysis. Therefore, using appropriate techniques and ways of organizing page information to help users obtain valuable information has become a pressing problem for Web page topic tag extraction.

This thesis analyzes and studies Web pages using natural-language indexing, proposes a solution for constructing Web page topic tags, and implements a corresponding web page topic tag system. The main research contents and results are:

1) Extraction of web page topic tags. Using Web text mining techniques together with the characteristics of web pages themselves, the thesis designs a workflow for topic tag extraction and implements functional modules for data preparation, page information extraction, text preprocessing, and topic tag construction.
2) Tag construction techniques for three application scenarios. Keyword extraction methods and named entity recognition techniques are studied. On that basis, multi-feature fusion keyword extraction, named entity recognition, and TF-based (term frequency) keyword extraction are implemented for pages with body text, pages that require recognition of special information, and pages without body text, respectively, and are applied to topic tag construction for the different types of pages.
3) Topic tag extraction schemes for different categories of web pages. By analyzing and comparing the characteristics of news, video, and e-commerce pages, a suitable extraction scheme is proposed for each. The text that represents the central idea of the page is extracted first, then a suitable tag construction technique is chosen according to the page's characteristics to generate topic tags, and finally the results are visualized.
4) An application scheme for the system. The thesis uses topic tag extraction to give users data analysis capability in the form of batch URL analysis. After a batch of URLs is analyzed, users can view the results directly, which helps them uncover the value and meaning behind the data and understand it objectively.

Based on the above, the thesis builds and implements a web page topic tag system based on Web text mining. The system mines and analyzes Web pages to generate reasonably accurate topic tags for them, enabling effective organization and management of page information so that users can efficiently obtain the knowledge they need.
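
The abstract mentions TF-based keyword extraction for pages without body text. As a rough illustration only, the sketch below ranks terms by term frequency over whatever short text such a page still offers. It is not the thesis's implementation: the function name `tf_tags`, the stopword list, the regex tokenizer, the `top_k` parameter, and the use of title and metadata text as input are all assumptions made here for the example. Chinese text would additionally need a word segmenter in place of the regex.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; not taken from the thesis.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "on", "to", "with"}


def tf_tags(text, top_k=3):
    """Return up to top_k (term, term_frequency) pairs, highest frequency first."""
    # Lowercase, split into alphanumeric tokens, and drop stopwords.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOPWORDS]
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    # Sort by descending count; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, count / total) for term, count in ranked[:top_k]]


# Assumed input: title and metadata text of a page that has no body text.
title_and_meta = ("Wireless headphones sale: noise cancelling headphones, "
                  "bluetooth headphones and wireless earbuds")
print(tf_tags(title_and_meta))
```

Plain term frequency favors words that repeat within a single page, which is about all that is available when only a title or a few metadata fields exist. For pages with body text, the abstract describes a multi-feature fusion method instead, which this sketch does not attempt to reproduce.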
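[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.092;TP391.1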

[References]