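Research on Web Text Classification Technology
Published: 2018-05-03 23:39
Topics: web page text extraction + Chinese word segmentation; Reference: North China University of Technology, 2012 master's thesis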
[Abstract]: With the development of network technology, the Internet has become the main source from which people obtain information. The openness of the network, however, fills it with information of every kind. To let people quickly find the information that interests them, it is increasingly important to use web text classification to organize this disordered information. Web text classification underlies fields such as information filtering and search engines, and it has therefore gradually become an active research topic.

This thesis first introduces web text extraction and the theory related to text classification, including the HTML language, Chinese word segmentation, similarity calculation, weight calculation, feature extraction, and common text classification methods. Building on these methods, it then describes the design and implementation of a web text classification system.

The thesis covers the following work. For web text extraction, page text is extracted based on an analysis of the characteristics of HTML and the structure of typical web pages. For text classification, the KNN and naive Bayes text classification algorithms are analyzed in detail and used to classify texts by domain. Building on the analysis of naive Bayes, and to address its independence assumption, the tree-augmented naive Bayes (TAN) Bayesian network model is adopted to improve the classifier: it takes relationships between pairs of words into account and so relaxes the independence assumption to some extent. A method for judging the attitude of a text is also proposed: sentiment feature words are extracted from the text, their weights are analyzed, and the text's attitude is evaluated, giving a second-level classification of the text. Finally, the system is tested. Experiments on corpus texts show that it achieves a certain level of accuracy, and experiments on text extracted from web pages show that it has some practical value.
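The abstract does not show the thesis's implementation, so the following is only a minimal sketch of the extraction-then-classification pipeline it describes, under stated assumptions. The names `TextExtractor`, `extract_text`, and `NaiveBayes` are hypothetical, whitespace splitting stands in for Chinese word segmentation, the training data is a toy example, and the classifier is plain multinomial naive Bayes with Laplace smoothing (the TAN refinement and the sentiment layer are not included).

```python
import math
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (assumption: documents are already segmented into tokens).
train_docs = ["football match goal team".split(),
              "stock market bank shares".split()]
train_labels = ["sports", "finance"]
clf = NaiveBayes().fit(train_docs, train_labels)

page = ("<html><head><style>p{}</style></head><body>"
        "<p>team wins match</p><script>var x=1;</script></body></html>")
tokens = extract_text(page).split()
predicted = clf.predict(tokens)
```

For real Chinese pages, the whitespace split would be replaced by a proper word segmenter, and the raw counts could be replaced by feature-selected, weighted terms as the abstract outlines.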
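[Degree-granting institution]: North China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1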
Article ID: 1840635
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1840635.html