Research and Application of Text Classification Algorithms in Large-Scale Heterogeneous Environments

Published: 2018-10-29 19:16

[Abstract]: Computer applications built around the network have entered a period of unprecedented growth, with new application environments and requirements emerging constantly. In large-scale applications such as search engines and social networks, data grows at a very high rate every day. How to process this data quickly, within a useful time frame, and extract its application value is a problem the industry is working to solve. At the same time, most of this data exists in heterogeneous forms, which makes it harder to exploit. Text classification remains an important technique in large-scale data environments: it lets us quickly determine the category of a previously unseen document, which is very helpful for information processing. Traditional classification algorithms have many strengths, but most are limited in speed, which makes them a poor fit for environments with high data throughput. To address these problems, the author makes the following attempts in this thesis: 1) drawing on well-established ideas from traditional classification, a fast text classification algorithm based on single-character computation is proposed; 2) to fetch web pages quickly, a simple, extensible distributed web crawler is designed; 3) the use of XML technology to integrate heterogeneous data is studied, and, for the web page processing stage, an algorithm is designed that uses a page's DOM structure to quickly extract its main text; 4) a working general-purpose retrieval system is implemented that integrates category-based search, so that users can further filter and refine search results and retrieval quality improves.
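The abstract does not describe the proposed single-character algorithm in detail, so the following is only a minimal sketch of the general idea, not the author's method. It assumes a multinomial naive Bayes model over individual characters with Laplace smoothing. The class name `CharNaiveBayes`, its parameters, and the toy training data are all hypothetical. Working on single characters avoids a word-segmentation step for Chinese text, which is one plausible source of a speed gain.

```python
import math
from collections import Counter, defaultdict


class CharNaiveBayes:
    """Multinomial naive Bayes over single characters (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.char_counts = defaultdict(Counter)  # label -> character frequencies
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocab = set()                       # all characters seen in training

    @staticmethod
    def _chars(text):
        # Treat every non-whitespace character as one feature; no word segmentation.
        return [ch for ch in text if not ch.isspace()]

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            chars = self._chars(text)
            self.char_counts[label].update(chars)
            self.doc_counts[label] += 1
            self.vocab.update(chars)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.char_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of smoothed per-character log likelihoods.
            score = math.log(n_docs / total_docs)
            for ch in self._chars(text):
                score += math.log((counts[ch] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented purely for illustration.
train_docs = ["足球比赛进球", "篮球比赛得分", "股票市场上涨", "银行利率下调"]
train_labels = ["sports", "sports", "finance", "finance"]
model = CharNaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict("足球得分")` scores each category by summing per-character log probabilities, so prediction cost grows linearly with document length and the number of categories.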
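[Degree-granting institution]: Hebei University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1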

[References]