Research and Application of Web-Based Enterprise Information Acquisition Technology

Published: 2018-05-03 03:18

Topic: enterprise information + pattern extraction; Source: Shenyang Aerospace University, 2014 master's thesis

[Abstract]: The Internet holds abundant information resources and has become the main channel through which enterprises obtain information. Because of the sheer volume of information online, however, finding the information an enterprise needs remains difficult, and Web-based enterprise information acquisition has therefore become an active research topic. Starting from an enterprise's products, this thesis uses the Web to discover the enterprises that manufacture a given product and to locate their home pages. An enterprise home page carries a large amount of information about the enterprise, such as product introductions, company honors, and development goals, so obtaining the home page makes it possible to acquire enterprise information comprehensively and promptly. The main work is as follows. First, based on the naming conventions of enterprise names, the thesis proposes an LCS-based algorithm for extracting enterprise name patterns. An index is first built from known enterprise information so that the manufacturers of a given product name can be retrieved; the longest common subsequence (LCS) of the retrieved enterprise names is then extracted; finally, enterprise name patterns are extracted by matching the longest common subsequence against the enterprise names. Experimental results show that the method effectively extracts enterprise name patterns, which serve as the expansion term set for query expansion. Second, the thesis adopts a Bayesian information filtering method. Web pages returned by a search engine are passed through a Bayesian classifier, which keeps enterprise home pages and filters out pages that are not enterprise home pages. For feature selection, the thesis proposes a method for extracting navigation-bar anchor text based on web page link blocks: page blocks are identified from the character distance between links on the page, and anchor texts that average 3-5 characters in length and number two or more are extracted and used as feature words. Experiments were run on products in categories including machinery, electric power and electrical equipment, construction and building materials, and materials, and the results show that the method performs well.
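The abstract names the LCS step but does not give its details, so the following is only a minimal sketch of how a name pattern might be derived from a longest common subsequence. The enterprise names are made up for illustration, and the `*` wildcard pattern format and the `name_pattern` helper are assumptions, not the thesis's own notation.

```python
def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two strings (dynamic programming)."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table to recover one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def name_pattern(name: str, common: str, wildcard: str = "*") -> str:
    """Hypothetical pattern format: keep the characters of `name` that match
    `common` in order, and collapse every other run into one wildcard."""
    parts = []
    k = 0  # position of the next unmatched character in `common`
    for ch in name:
        if k < len(common) and ch == common[k]:
            parts.append(ch)
            k += 1
        elif not parts or parts[-1] != wildcard:
            parts.append(wildcard)
    return "".join(parts)


# Made-up enterprise names, used only to exercise the functions.
name_a = "东方机床有限公司"
name_b = "北海重工有限公司"
common = lcs(name_a, name_b)
pattern = name_pattern(name_a, common)
```

With these two names the shared subsequence is the legal-form suffix, and the pattern keeps that suffix behind a wildcard. For more than two names, folding `lcs` pairwise over the list is a common heuristic, though it is not guaranteed to return the true longest subsequence shared by all of them.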
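The navigation-bar anchor text extraction can be illustrated with a rough sketch built on Python's standard `html.parser`. The abstract does not state the distance threshold or how distance is measured, so `max_gap` (non-space characters between consecutive links), the `LinkBlockParser` class, and the sample page are all assumptions; the 3-5 character average and the minimum of two anchors follow the abstract.

```python
from html.parser import HTMLParser


class LinkBlockParser(HTMLParser):
    """Groups consecutive <a> elements into blocks by the amount of
    text that separates them (an assumed notion of link distance)."""

    def __init__(self, max_gap: int = 2):
        super().__init__()
        self.max_gap = max_gap  # assumed threshold, in non-space characters
        self.blocks = []        # each block is a list of anchor texts
        self._in_link = False
        self._anchor = []       # text pieces of the <a> being read
        self._gap = 0           # non-space characters since the last link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._anchor = []

    def handle_data(self, data):
        if self._in_link:
            self._anchor.append(data)
        else:
            self._gap += len("".join(data.split()))

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            text = "".join(self._anchor).strip()
            if text:
                # Start a new block when the link is far from the previous one.
                if not self.blocks or self._gap > self.max_gap:
                    self.blocks.append([])
                self.blocks[-1].append(text)
                self._gap = 0


def nav_anchor_texts(html, min_len=3, max_len=5, min_count=2):
    """Return anchor texts from blocks whose average anchor length is
    within [min_len, max_len] and that hold at least min_count anchors."""
    parser = LinkBlockParser()
    parser.feed(html)
    parser.close()
    features = []
    for block in parser.blocks:
        avg = sum(len(t) for t in block) / len(block)
        if len(block) >= min_count and min_len <= avg <= max_len:
            features.extend(block)
    return features


# Made-up page: a four-link navigation bar followed by one long in-text link.
page = (
    '<div><a href="/">网站首页</a> | <a href="/about">关于我们</a> | '
    '<a href="/products">产品中心</a> | <a href="/contact">联系我们</a></div>'
    '<p>本公司成立多年，主要生产各类设备。'
    '<a href="/news/1">点击这里查看最新的企业新闻动态报道</a></p>'
)
nav_features = nav_anchor_texts(page)
```

On this sample the four short navigation links form one block and are kept, while the long in-text link falls into its own block and is dropped. Real pages would likely need a tuned threshold and handling for image-only links.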
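The Bayesian filtering step might look like the sketch below, a multinomial naive Bayes classifier with Laplace smoothing that treats a page's anchor texts as its feature words. The abstract does not specify which Bayesian model or smoothing the thesis uses, so this variant, the `NaiveBayes` class, the `home`/`other` labels, and the tiny training set are all illustrative assumptions.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes over feature words, with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # smoothing constant

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def log_scores(self, words):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[c] / total_docs)  # log prior
            for w in words:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[c][w] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)


# Made-up training pages, each represented by its extracted anchor texts.
train_docs = [
    ["网站首页", "关于我们", "产品中心", "联系我们"],
    ["公司简介", "产品展示", "企业荣誉", "联系我们"],
    ["上一页", "下一页", "返回列表"],
    ["相关新闻", "下一页", "发表评论"],
]
train_labels = ["home", "home", "other", "other"]

clf = NaiveBayes().fit(train_docs, train_labels)
label_a = clf.predict(["关于我们", "产品展示", "在线留言"])
label_b = clf.predict(["下一页", "返回列表"])
```

Pages predicted as `home` would be kept as candidate enterprise home pages and the rest filtered out. A realistic filter would need far more training pages per product category than this toy set.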
[Degree-granting institution]: Shenyang Aerospace University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP393.092;TP391.1
