Research on Web Text Classification Methods and System Implementation
[Abstract]: In recent years the Web has rapidly grown into the largest public information source in the world. Correct classification of Web text is central to helping users locate the information they need quickly and conveniently within this vast body of resources. Web text classification derives from automatic classification technology and is an important part of Web text mining. It not only improves users' search efficiency and helps them locate target knowledge quickly and accurately, but also captures the interest profiles of different users, providing a basis for personalized services. Most current classification studies treat document categories as flat and disjoint and ignore the hierarchical relationships between categories. When the number of categories is large, training a flat classifier is very time-consuming, and classifying an unknown document requires comparing it against every category model, which is clearly impractical. Building on an in-depth study of Web text mining and automatic classification technology, this thesis implements a multi-level Web text classification system based on the hierarchical relationships between categories. The innovations and key technologies of this thesis are as follows:
1. A hierarchical training and classification model is established. Web pages are rich in content and span many fields. This thesis analyzes the problems that flat classification faces when there are many categories, proposes hierarchical classification as an alternative, and establishes a hierarchical training and classification model.
2. An automatic Web text extractor is designed and implemented. Noise such as advertisements and hyperlinks in Web pages severely hampers Web text classification. The extractor reduces each Web page to plain text consisting of a title and body text.
3. A keyword extraction method suited to Web pages is proposed. Words at different positions and with different parts of speech contribute differently to expressing a page's content. Based on this observation, the thesis proposes a keyword extraction method weighted by part of speech, position, and word frequency, which further filters out noise words in the page and achieves good results.
4. A classification method based on χ² statistic weighting is proposed. The χ² statistic reflects the correlation between features and categories well. This thesis applies the χ² statistic to text classification in a novel way, which simplifies the classification process and also yields better classification speed and accuracy in practical use.
Based on the characteristics of Web text, this thesis proposes a set of implementation schemes for large-scale, multi-category Web text classification and designs a multi-level classification system for Web text. The results show that, in practice, the system's classification performance is better than that of a general flat classifier.
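The abstract does not give the thesis's actual weighting formula or scoring rule, so the following is only a minimal sketch of how χ² weighting and top-down hierarchical classification could fit together. It assumes the standard 2x2 contingency-table form of the χ² statistic, scores a category by summing the χ² weights of the positively correlated terms a document contains, and uses a toy two-level hierarchy. The names `chi_square`, `train_node`, `classify`, `classify_hierarchical`, `TOP`, `SPORTS`, and `MODELS` are all hypothetical and chosen for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term/category pair.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def train_node(docs):
    """Train one node of the hierarchy (assumed scheme, not the thesis's).

    docs: list of (set_of_terms, child_label) pairs for this node.
    Returns {child_label: {term: chi-square weight}}, keeping only terms
    that are positively correlated with the label.
    """
    labels = {label for _, label in docs}
    vocab = set().union(*(terms for terms, _ in docs))
    weights = {label: {} for label in labels}
    for label in labels:
        for term in vocab:
            a = sum(1 for t, l in docs if l == label and term in t)
            b = sum(1 for t, l in docs if l != label and term in t)
            c = sum(1 for t, l in docs if l == label and term not in t)
            d = sum(1 for t, l in docs if l != label and term not in t)
            if a * d > c * b:  # term occurs more often inside the category
                weights[label][term] = chi_square(a, b, c, d)
    return weights


def classify(terms, weights):
    """Pick the child label whose weighted terms best match the document."""
    scores = {
        label: sum(w.get(term, 0.0) for term in terms)
        for label, w in weights.items()
    }
    return max(scores, key=scores.get)


def classify_hierarchical(terms, models, root="root"):
    """Walk down the hierarchy, comparing only against sibling categories."""
    path, node = [], root
    while node in models:
        node = classify(terms, models[node])
        path.append(node)
    return path


# Toy training data (invented for illustration only).
TOP = [
    ({"goal", "match", "team"}, "sports"),
    ({"team", "league", "dunk"}, "sports"),
    ({"stock", "market", "bank"}, "finance"),
    ({"bank", "loan", "rate"}, "finance"),
]
SPORTS = [
    ({"goal", "match", "team"}, "football"),
    ({"team", "league", "dunk"}, "basketball"),
]
MODELS = {"root": train_node(TOP), "sports": train_node(SPORTS)}

print(classify_hierarchical({"dunk", "team"}, MODELS))
```

At each level the document is compared only with the children of the current node rather than with every category model, which is the cost saving the abstract attributes to hierarchical classification over flat classification.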
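[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2010
[Classification number]: TP391.1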
Article ID: 2367860
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/2367860.html