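Domain-Oriented Web Text Collection and Classification
Published: 2018-03-21 18:49
Topic: focused crawler. Entry point: feature extraction. Source: Xi'an University of Architecture and Technology, 2011 master's thesis. Document type: degree thesis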
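[Abstract]: With the widespread adoption of the Internet and the growing informatization of every industry, domain-related Web text is accumulating rapidly. How to extract the knowledge that meets specific requirements from this mass of information is a current research focus in information processing. This thesis is set against the background of the Shaanxi Provincial Department of Education special research project "Research on Automatic Generation Methods for Conceptual Design Schemes Oriented to Domain-Specific Requirements". Using Web information collection and classification techniques, it studies two problems: discovering and collecting domain-related topical Web resources, and preprocessing and classifying the text of the collected pages. The main work is as follows:
(1) Topic description is studied by combining a professional lexicon with feature selection. Starting from a limited professional lexicon supplied by experts, feature extraction and feature selection are applied to existing representative domain texts and to topic-related texts collected from the Web. Topic feature words are selected, the lexicon is expanded, and the topic is represented explicitly by a vector of topic feature words.
(2) Because the pages a focused crawler collects are unpredictable, the structural characteristics of ordinary Web pages are analyzed and the main text is extracted with a method based on the line-block distribution function. This removes advertisements, navigation, and other irrelevant text that interfere with topic-relevance judgment and text classification. The method achieves good denoising results and is general-purpose.
(3) A focused-crawler search strategy based on comprehensive value evaluation is adopted. It weighs both page content analysis and link analysis and, combined with the PageRank algorithm, computes a comprehensive link value for each page in order to select topic-related URLs.
(4) The title and main text are extracted from each collected page, saved as text documents, and preprocessed. Based on existing mechanical-topic category information, a KNN-based text classification algorithm for mechanical topics is used to classify the document set into multiple subclasses, and the algorithm is analyzed experimentally.
Finally, combining the work above and taking excavators in the mechanical domain as the topic, a prototype system for Web text collection and mining in the mechanical domain is implemented.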
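The line-block idea in item (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the block thickness `k`, and the `threshold` value are all assumptions chosen for illustration. The sketch strips tags while keeping line breaks, sums the character counts of every `k` consecutive lines, and keeps the longest run of lines that starts where the block length jumps above the threshold and ends where it falls to zero.

```python
import re


def strip_tags(html):
    """Remove scripts, styles, comments, and tags, keeping the original line breaks."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    html = re.sub(r"<[^>]+>", "", html)
    return [line.strip() for line in html.split("\n")]


def block_lengths(lines, k):
    """Block i = number of non-whitespace characters in lines i .. i+k-1."""
    sizes = [len(re.sub(r"\s+", "", line)) for line in lines]
    return [sum(sizes[i:i + k]) for i in range(len(sizes))]


def extract_main_text(html, k=3, threshold=20):
    """Return the longest dense text region (k and threshold are illustrative values)."""
    lines = strip_tags(html)
    blocks = block_lengths(lines, k)
    best = ""
    i, n = 0, len(lines)
    while i < n:
        # A region starts on a non-empty line whose block length exceeds the threshold.
        if lines[i] and blocks[i] > threshold:
            j = i
            # It ends at the first block of k consecutive empty lines.
            while j < n and blocks[j] > 0:
                j += 1
            candidate = "\n".join(line for line in lines[i:j] if line)
            if len(candidate) > len(best):
                best = candidate
            i = j
        else:
            i += 1
    return best
```

Short navigation and advertisement lines produce small block lengths and never open a region, while consecutive long paragraphs do. In practice `k` and `threshold` would need tuning for the pages being crawled.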
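Items (1) and (4) of the abstract describe representing documents by topic feature words and classifying them with KNN. The abstract does not give the weighting scheme, distance measure, or value of k, so the sketch below makes its own choices: raw term-frequency vectors, cosine similarity, and similarity-weighted voting. The vocabulary, category names, and sample documents are hypothetical, and the input is assumed to be tokenized already (Chinese text would first need word segmentation).

```python
import math
from collections import Counter


def vectorize(tokens, vocabulary):
    """Term-frequency vector over the topic feature words (assumed weighting)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Pick the label with the largest summed similarity among the k nearest neighbours."""
    scored = [(cosine(query, vector), label) for vector, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical feature words and subclasses for an excavator corpus.
VOCABULARY = ["bucket", "boom", "hydraulic", "engine", "piston", "cylinder"]
SAMPLES = [
    (["boom", "bucket", "bucket", "hydraulic"], "working_device"),
    (["bucket", "boom", "cylinder"], "working_device"),
    (["engine", "piston", "piston"], "power_system"),
    (["engine", "cylinder", "piston"], "power_system"),
]
TRAINING = [(vectorize(tokens, VOCABULARY), label) for tokens, label in SAMPLES]

if __name__ == "__main__":
    query = vectorize(["the", "bucket", "and", "boom", "wear"], VOCABULARY)
    print(knn_classify(query, TRAINING, k=3))
```

Tokens outside the vocabulary are ignored, which is how a fixed set of topic feature words keeps the vectors short. A TF-IDF weighting could be substituted in `vectorize` without changing the rest of the sketch.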
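[Degree-granting institution]: Xi'an University of Architecture and Technology
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP393.09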