Design and Implementation of an Automatic Web Page Classification Algorithm

Published: 2018-05-08 04:08

Topics: automatic web page classification + web page content extraction; Source: Nanchang University, 2012 master's thesis

[Abstract]: In this age of diverse digital information, people can obtain a wealth of information, including data, text, sound, and images, through channels such as the Internet, corporate intranets, and electronic libraries. Finding useful information simply, quickly, and efficiently, however, remains difficult. Automatic classification, and automatic web page classification in particular, is therefore becoming increasingly important. It can substantially reduce the time spent organizing documents, improve the efficiency of information gathering, make retrieval much easier for users, and support the effective archiving and management of documents. This thesis surveys the development history and current state of research on automatic web page classification in order to understand the strengths and weaknesses of current search engine systems. Building on a study of the Java development language, the Swing development technology, and the TF-IDF algorithm, it attempts to propose a new design for an automatic web page classification algorithm, together with an experimental scheme. In testing, the method proved reasonably well suited to the demands of large-scale automatic classification of Chinese web pages, with an average classification accuracy above 80% on the relevant pages. The work has considerable practical value in applied settings.
[Degree-granting institution]: Nanchang University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP393.092

[References]