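Text Classification Based on Natural Annotation
Published: 2018-03-09 08:10
Topic: text classification  Focus: link analysis  Source: Harbin Institute of Technology (哈尔滨工业大学), 2013 master's thesis  Document type: degree thesis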
[Abstract]: In text classification research and in search engines, classification corpora have traditionally been built through manual annotation, a process that demands considerable labor and is therefore costly. A taxonomy built this way is also inflexible: any change to it requires manual revision and dedicated maintenance. On the Internet, websites usually organize their structure according to a taxonomy and classify the information they offer at several levels through hierarchical navigation bars. These navigation bars yield a coarse, noisy classification, from which misclassified pages can then be removed by cluster analysis. Following this idea, this thesis proposes an automatic text classification system based on the natural annotation information of websites, implemented in the following steps.

First, the structure of each fetched page is analyzed to obtain its structural blocks, that is, regions of the page that serve the same function; the navigation bar falls into one of these blocks. Graph-based link analysis then derives the relationships between pages, from which the navigation bar of each page in the site is extracted.

Second, the anchor text in the extracted navigation bar is analyzed and used as classification keywords. The page's own information is analyzed to decide whether it has reached a leaf node of the site structure, which determines the page's level in the site hierarchy. The site's classification structure is then compared with the specified taxonomy to determine the page's category. The main content of the page is located by computing, for each line of the page, the ratio of body text to non-content formatting information, smoothing these values, and clustering them.
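As a rough illustration of the content-extraction step just described, the sketch below scores each line of a page by the share of its characters that is visible text rather than markup, smooths the scores with a moving average, and splits the lines into two groups with a one-dimensional 2-means. The abstract does not give the exact ratio, the smoothing window, or the clustering algorithm, so `text_ratio`, `smooth`, `two_means_1d`, `extract_body`, and the `radius` parameter are all assumptions made for this example, not the thesis's implementation.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # crude tag matcher, good enough for a sketch


def text_ratio(line):
    """Share of a line's characters that is visible text (assumed measure)."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text_len = len(TAG_RE.sub("", stripped).strip())
    return text_len / len(stripped)


def smooth(values, radius=1):
    """Centered moving average; the window shrinks at both ends."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def two_means_1d(values, iterations=20):
    """1-D k-means with k=2, seeded at the min and max; returns (low, high)."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return lo, hi


def extract_body(html_lines, radius=1):
    """Return the text of lines whose smoothed ratio falls in the high cluster."""
    if not html_lines:
        return []
    ratios = smooth([text_ratio(line) for line in html_lines], radius)
    lo, hi = two_means_1d(ratios)
    return [
        TAG_RE.sub("", line).strip()
        for line, r in zip(html_lines, ratios)
        if abs(r - hi) < abs(r - lo)  # strictly closer to the high center
    ]
```

Calling `extract_body(page_lines)` on a page split into lines returns the tag-stripped lines judged to be main content. Smoothing keeps an isolated markup-heavy line inside a long article from being dropped, at the cost of blurring the boundary by about `radius` lines.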
Results obtained in this way alone tend to contain a certain amount of noise, because websites apply different classification criteria and because of deceptive links, so further cleaning is needed. The data within each category are clustered to obtain their distribution; clusters that lie close together in the feature space are kept and outlying clusters are discarded, which improves classification accuracy.

The generated classification corpus is then used as the training set for an SVM classifier, and classification of the test set reaches a relatively high accuracy. Experiments on both English and Chinese corpora also give good results. This shows that the system achieves relatively high accuracy under a user-supplied taxonomy and is readily usable in text classification and information retrieval.
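The noise-cleaning step (cluster the documents inside one category, then discard outlying clusters) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the document representation, the clustering algorithm, or the rule for deciding which clusters are outliers. Here documents are plain term-frequency vectors compared by cosine similarity, clustering is a greedy single-pass "leader" scheme, and a cluster is kept when its centroid is similar enough to the largest cluster's centroid. The functions `vectorize`, `cosine`, `leader_cluster`, `filter_category` and the parameters `threshold` and `keep` are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Whitespace-tokenized term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def leader_cluster(vectors, threshold):
    """Greedy one-pass clustering: join the most similar cluster whose
    centroid similarity is at least `threshold`, else start a new one."""
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"members": [i], "centroid": Counter(vec)})
        else:
            best["members"].append(i)
            # Summed counts serve as the centroid: cosine ignores scale.
            best["centroid"].update(vec)
    return clusters


def filter_category(docs, threshold=0.2, keep=0.1):
    """Keep the largest cluster plus clusters close to it; drop the rest."""
    clusters = leader_cluster([vectorize(d) for d in docs], threshold)
    if not clusters:
        return []
    core = max(clusters, key=lambda c: len(c["members"]))
    kept = []
    for c in clusters:
        if c is core or cosine(c["centroid"], core["centroid"]) >= keep:
            kept.extend(c["members"])
    return [docs[i] for i in sorted(kept)]
```

The documents returned by `filter_category` for each category would then form the training set passed to a classifier such as the SVM mentioned in the abstract. Both thresholds are arbitrary here and would need tuning on real data.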
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.1
Article ID: 1587698
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1587698.html