Design and Implementation of Web Focused Search for Enterprise Information

Published: 2018-03-10 14:03

Topic: focused search　Entry point: focused crawler　Source: Nanjing Normal University, 2013 master's thesis　Document type: degree thesis

[Abstract]: Obtaining basic enterprise information from massive Web resources provides information support for tasks such as customer relationship management and the discovery of potential competitors, and is of great significance to an enterprise's survival and growth. Given the limitations of general-purpose search engines in handling such problems, this thesis designs and implements a focused search system for enterprise information to meet these needs.

Enterprise information pages on the Web fall into two broad categories: POI pages, in which enterprise information is presented as structured tables, and TOI pages, in which it is presented as unstructured text. Because the two page types differ considerably in structure, the focused search process has to handle them separately. Focused crawling and information extraction are the two core tasks of focused search. Around these two tasks, and for the two forms in which enterprise information appears, the thesis carries out the following work:

1. A POI-oriented focused crawler. Existing focused-crawler research is mostly topic-oriented, and POI-oriented user needs have received little study. Using classifier models such as naive Bayes and support vector machines, together with feature templates designed for the task, the thesis implements a POI-oriented focused crawler. The experimental results show that using a crawler to focus on POI-oriented user needs is feasible.

2. A TOI-oriented focused crawler. When handling text pages, most existing focused crawlers process all of the text in a page directly, which introduces a considerable amount of noise. The thesis adopts an improved page-relevance analysis algorithm that takes only the five text blocks most relevant to the topic, assigns each block a corresponding weight, and uses a classification model to judge the overall relevance of the page. The experiments again use naive Bayes and support vector machine classifiers. Compared with a Baseline focused crawler built on the full page text, the harvest rate is about 20% higher on average, with a maximum difference of 51.35%, which shows that the improved page-relevance algorithm is effective.

3. Enterprise information extraction. Using the set of relevant pages collected by the focused crawlers as the data source, enterprise information is extracted from the POI and TOI domains. Enterprise information in the POI domain has a standardized layout and a strongly regular structure, so the relatively simple POI domain is handled with a wrapper alone. For the more complex TOI domain, the thesis uses a statistical learning model and decomposes the task into two steps: first decide whether a sentence contains slot information, then decide which slot category each phrase in the sentence belongs to. The final slot filler is determined from the joint probability of the sentence and the phrase. The experiments define eight enterprise attributes as the slots to be filled. The average F-measure across the slots reaches 93.8%, which is 7.6% higher on average than a rule-based Baseline system and demonstrates the effectiveness of the algorithm.
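
The two-step slot filling described for the TOI domain can be illustrated with a minimal Python sketch. The abstract does not say how the joint probability is computed or which models produce the two scores, so everything here is an assumption made for illustration: the `Candidate` type, the `fill_slots` function, the product of the two probabilities as the joint score, and the `threshold` value are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    """One phrase taken from a sentence (all fields are hypothetical inputs)."""
    phrase: str
    sentence_prob: float          # step 1: P(sentence contains slot information)
    slot_probs: Dict[str, float]  # step 2: P(slot category | phrase)


def fill_slots(candidates: List[Candidate], threshold: float = 0.5) -> Dict[str, str]:
    """Pick, for each slot, the phrase with the highest joint score.

    Assumption: the joint probability is approximated as the product of the
    two step probabilities; the threshold value is illustrative only.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for cand in candidates:
        for slot, p_slot in cand.slot_probs.items():
            joint = cand.sentence_prob * p_slot
            if joint < threshold:
                continue  # too unlikely to be a real filler for this slot
            if slot not in best or joint > best[slot][1]:
                best[slot] = (cand.phrase, joint)
    return {slot: phrase for slot, (phrase, _) in best.items()}
```

In a sketch like this, a phrase from a sentence that the first step scores low cannot win a slot even if the second step is confident about it, which is one plausible reading of why the two decisions are combined rather than applied independently.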
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3

[Similar literature]