Research on an Automatic Summarization System Based on RSS Source Text

Published: 2018-06-22 00:03

Topics: automatic summarization + machine learning; source: Zhejiang University, master's thesis, 2012

[Abstract]: As the total volume of online information resources grows exponentially, how to retrieve information from massive data and grasp its main point is a problem worth studying. Search engines and RSS push technology have solved the "source" problem of information, but they have not adequately solved the "volume" problem. Automatic summarization is one effective way to compress and refine information. It uses computer technology to extract or summarize, from an original document, a short and coherent text that reflects the document's central content, helping users grasp the main point quickly, accurately, and comprehensively. This thesis argues that news summaries of different topic types follow different text feature combination models, so the result of automatic text classification should serve as a precondition for automatic summarization. A classification corpus is built through web crawling, page cleaning, and data storage, and on that basis automatic classification is implemented with different feature selection algorithms and classification algorithms. Two measures for summary sentences are proposed, Probability and Possibility. Based on a constructed summary corpus, supervised machine learning algorithms based on regression analysis (linear regression and logistic regression) are trained to determine the optimal parameters of the summary-sentence feature combination model. For Chinese text, an improved family of evaluation algorithms, ROUGE-CN, is proposed and used both to measure summary-sentence Probability and to evaluate machine summaries. Experiments comparing the summaries produced by the machine learning method against baseline summaries and Word summaries show that, with automatic classification as a precondition, supervised machine learning based on regression analysis can effectively improve the quality of machine summaries.
Taking the combination of online RSS data sources with the regression-based machine learning summarization method as its innovation, the thesis designs and implements an automatic summarization system based on RSS source text. The system takes online RSS source text as its data source, extracts the metadata of the original article by regular expression matching, and offers a choice of feature selection algorithms, automatic classification algorithms, machine learning algorithms, and compression ratios. Combining automatic classification and automatic summarization, it produces a classification label and a machine summary, and presents the news summary alongside the original text online.
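To make the scoring and evaluation steps concrete, here is a minimal Python sketch. It is not the thesis's implementation. The abstract does not define ROUGE-CN, so `rouge_n_recall` is plain ROUGE-N recall computed over character n-grams, which is one common way to apply ROUGE to unsegmented Chinese text. The logistic feature weights, the bias, and every function name are hypothetical and stand in for parameters that the thesis learns from its summary corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace (punctuation is kept)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """ROUGE-N recall over character n-grams (assumed stand-in for ROUGE-CN)."""
    ref = char_ngrams(reference, n)
    cand = char_ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: each reference n-gram is matched at most as often
    # as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def logistic_score(features, weights, bias):
    """Score one sentence with a logistic model; weights are hypothetical."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def select_sentences(sentences, scores, compression_ratio):
    """Keep the highest-scoring sentences, restored to document order."""
    k = max(1, round(len(sentences) * compression_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

In a pipeline of this shape, each sentence would be mapped to a feature vector, scored with `logistic_score` using weights fitted on the summary corpus, and filtered by `select_sentences` at the user's chosen compression ratio; `rouge_n_recall` would then compare the result against a reference summary.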
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
