Research on Web-Based Enterprise Competitive Intelligence Collection Technology

Published: 2018-05-10 21:02

Topics: competitive intelligence + topical crawler; Source: Dalian University of Technology, 2012 master's thesis

[Abstract]: With the continuous development of information technology, more and more resources are presented to users through the Internet. This brings new opportunities for enterprise intelligence collection, but also new challenges: how enterprises can effectively obtain accurate and reliable information from the massive information resources on the Internet has become a hot research topic. General-purpose search engines can handle the retrieval needs of ordinary users, but for enterprise intelligence collection they fall short in page timeliness and personalization. This thesis aims to use the advantages of open-source software to obtain intelligence from the Internet through Web mining technology, and to build an automated enterprise intelligence collection platform that makes intelligence work easier for users, improves the efficiency with which enterprises obtain intelligence, and enhances their market competitiveness. Based on a study and analysis of techniques for acquiring enterprise competitive intelligence, the thesis designs an automated enterprise competitive intelligence collection system, which mainly addresses the problems users face when collecting information on the Internet and also provides decision support for managers. The specific work is as follows:
(1) The thesis first points out the practical significance of competitive intelligence work for enterprises under economic globalization, explains why enterprises need to build competitive intelligence systems, and identifies the shortcomings of the mainstream competitive intelligence software currently on the market.
(2) From a system development perspective, it studies a series of key technologies in Web information collection: the working principle of the topical crawler, customization of crawler seeds, Web document preprocessing, character encoding, Chinese word segmentation, and page formatting.
(3) It studies the architecture of the topical crawler in depth and optimizes the crawler's internal structure according to the page features of third-party portal websites.
(4) For Web documents obtained from high-quality data sources, an improved TF-IDF method is used to extract domain topic words, which serve as the basis for later intelligence processing and analysis. The improved algorithm considerably increases the accuracy of topic word extraction.
(5) Finally, based on the research in this thesis, an automated intelligence collection system for the pharmaceutical domain is designed and developed. The system can be configured to track pages on competitors' websites, collect information on a regular schedule, and convert it into a defined format for presentation to intelligence staff.
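Item (4) builds on TF-IDF weighting for topic word extraction. The abstract does not describe the thesis's improvement, so the sketch below shows only the standard TF-IDF baseline, not the author's method. The function name `tfidf_keywords`, the parameter `top_k`, and the sample documents are all hypothetical, and the input is assumed to be already tokenized (Chinese text would first need word segmentation).

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring terms for each tokenized document.

    Baseline TF-IDF only: tf = term count / document length,
    idf = ln(number of documents / documents containing the term).
    This is an illustrative assumption, not the thesis's improved variant.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))

    results = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([term for term, _ in ranked[:top_k]])
    return results


# Hypothetical, pre-tokenized sample documents.
docs = [
    ["drug", "trial", "approval", "drug"],
    ["drug", "market", "price"],
    ["market", "price", "competitor", "price"],
]

if __name__ == "__main__":
    for keywords in tfidf_keywords(docs, top_k=2):
        print(keywords)
```

Terms that occur in many documents receive a low inverse document frequency, so a word that is frequent within one document but rare across the collection ranks highest. This is the property that makes the baseline a plausible starting point for domain topic word extraction.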
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: G351;F272

[Similar literature]