Research and Implementation of a Topic-Focused Web Crawler for Web Mining

Published: 2018-03-25 20:16

Topic: Web mining　Focus: topic-focused web crawler　Source: Xidian University, 2012 master's thesis

[Abstract]: With the rapid development of the Internet, more and more information resources reach people through the network, and obtaining the information needed for daily life and production through search engines has become one of the mainstream ways of keeping informed. However, because Web information resources are growing explosively and are semi-structured, real-time, heterogeneous, and dispersed, how to mine and analyze Web resources and extract information on the specific topics people need has become an important research problem. This thesis studies topic-focused search oriented to Web mining in the context of enterprise competitive intelligence. After introducing the research background and the current state of the field, it concentrates on the core technologies of Web mining and topic-focused search engines. The specific work is as follows. Topic-focused web crawler: the thesis comprehensively analyzes the web search algorithms of existing search engines, improves the relevant search strategies, and proposes a non-greedy genetic search algorithm. Web document analysis: the HTML Tidy tool is used to convert a Web document into its corresponding tree structure, and different traversal algorithms are then applied to extract the relevant information according to the user's needs; after the crawler system extracts and segments the body text of a web page, it builds the feature vector of the text with an improved method for computing feature-term weights. Topic relevance evaluation: building on the vector space model's evaluation of the topic relevance of a page's body text, the system computes the topic relevance of each hyperlink by combining its anchor text, its own string, and the page that contains it. On the basis of this research, a topic-focused web crawler system for enterprise competitive intelligence is designed and implemented.
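The relevance evaluation described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes plain TF-IDF weights in place of the improved feature-term weighting (which the abstract does not specify), a simple regular-expression tokenizer in place of Chinese word segmentation, and a hypothetical linear combination with made-up weights for the three link signals (anchor text, the link's own string, and the relevance of the containing page). The corpus, the URLs under example.com, and all function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for word segmentation)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def tf_idf_vector(tokens, doc_freq, n_docs):
    """Build a sparse TF-IDF vector (plain weighting, not the thesis's improved one)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vec = {}
    for term, count in counts.items():
        # Smoothed IDF so that terms unseen in the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vec[term] = (count / total) * idf
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_relevance(anchor_text, url, parent_score, topic_vec, doc_freq, n_docs,
                   weights=(0.5, 0.2, 0.3)):
    """Combine anchor text, URL string, and parent page relevance.

    The weights are hypothetical; the abstract does not say how the
    three signals are combined.
    """
    w_anchor, w_url, w_parent = weights
    anchor_vec = tf_idf_vector(tokenize(anchor_text), doc_freq, n_docs)
    url_vec = tf_idf_vector(tokenize(url), doc_freq, n_docs)
    return (w_anchor * cosine(anchor_vec, topic_vec)
            + w_url * cosine(url_vec, topic_vec)
            + w_parent * parent_score)


# Illustrative corpus of page body texts (made up for this sketch).
corpus = [
    "competitive intelligence report on market rivals",
    "football scores and match results",
    "enterprise market analysis and competitor pricing",
]
tokenized = [tokenize(doc) for doc in corpus]
doc_freq = Counter(term for toks in tokenized for term in set(toks))
n_docs = len(corpus)

# The topic is described by a short keyword string.
topic_vec = tf_idf_vector(
    tokenize("enterprise competitive intelligence market competitor"),
    doc_freq, n_docs)

# Page-level relevance: cosine similarity between page vector and topic vector.
page_scores = [cosine(tf_idf_vector(toks, doc_freq, n_docs), topic_vec)
               for toks in tokenized]

# Link-level relevance for two links found on the first page.
on_topic = link_relevance("competitor market intelligence",
                          "http://example.com/market/intelligence",
                          page_scores[0], topic_vec, doc_freq, n_docs)
off_topic = link_relevance("match results",
                           "http://example.com/sports/football",
                           page_scores[0], topic_vec, doc_freq, n_docs)

print([round(s, 3) for s in page_scores])
print(round(on_topic, 3), round(off_topic, 3))
```

In a crawler, scores like `on_topic` and `off_topic` would typically order the queue of unvisited links, so that links more likely to lead to on-topic pages are fetched first. The off-topic link still receives a nonzero score through its parent page, which is one way a crawler can avoid discarding links too greedily.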
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
