Research on File Preprocessing Technology in Full-Text Retrieval Systems

Published: 2018-04-09 02:06

Topic: full-text retrieval  Focus: message queue  Source: University of Science and Technology of China, 2017 master's thesis

[Abstract]: With the development of computer and network technology, the volume of data produced by human society has grown explosively, and information retrieval studies how to find useful information in that data quickly and effectively. Information obtained from the network takes many forms, and a large share of it is semi-structured or unstructured. Structured information can be retrieved with database technology, but effective tools for retrieving unstructured information are lacking, which is why full-text retrieval technology emerged. A full-text retrieval system consists mainly of text preprocessing, index building, index management, and a web retrieval platform.

This thesis studies the technologies used in the file preprocessing module of a full-text retrieval system, chiefly real-time file monitoring, file type identification, and text content extraction. The module uses the Inotify mechanism to monitor the data source in real time and submits each monitored file path to a message queue implemented on the Advanced Message Queuing Protocol (AMQP). It then identifies the type of each queued file in turn and extracts the text content through an interface chosen according to that type. A large set of files was prepared to test the function and performance of the module. The experimental results show that it achieves a high identification accuracy and good text-extraction completeness, and that it basically meets the design requirements.

The thesis next studies a content-based file type identification algorithm. File content is divided by byte value, and the byte values and their frequencies are used to build a vector space model of each file. Identification uses K-nearest neighbors (KNN) as the classifier. To lower the computational complexity of classification and improve its efficiency, principal component analysis (PCA) and a clustering algorithm are introduced to reduce the dimensionality of the sample space. Experimental results show that the improved algorithm shortens classification time while keeping high classification efficiency and identification accuracy.

Finally, the thesis studies the use of the information gain feature selection algorithm and the TFIDF weighting algorithm in file classification. To address the drop in classification accuracy when the sample set is unevenly distributed across classes, inter-class concentration and intra-class dispersion are introduced on top of the traditional algorithms, the weighting and feature selection algorithms are improved accordingly, and a support vector machine (SVM) is used as the classifier. Experimental verification shows that the improved algorithms raise classification accuracy to some extent.
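The content-based identification step can be illustrated with a small sketch. The abstract does not give the thesis's implementation, so everything below is an assumption made for illustration: the function names (`byte_frequency_vector`, `knn_classify`), the Euclidean distance, the value of `k`, and the toy training data are all hypothetical, and the PCA and clustering steps used for dimensionality reduction are omitted. It shows only the general idea of turning a file into a 256-dimensional byte-frequency vector and classifying it with K-nearest neighbors.

```python
from collections import Counter
import math


def byte_frequency_vector(data):
    """Map raw bytes to a 256-dimensional vector of relative byte frequencies."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_classify(sample, training, k=3):
    """Return the majority label among the k training vectors nearest to sample.

    training is a list of (vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: three text-like and three binary-like samples.
training = [
    (byte_frequency_vector(b"hello world, plain ascii text"), "text"),
    (byte_frequency_vector(b"another line of readable text"), "text"),
    (byte_frequency_vector(b"yet more ordinary prose here"), "text"),
    (byte_frequency_vector(bytes(range(256))), "binary"),
    (byte_frequency_vector(bytes(range(128, 256)) * 2), "binary"),
    (byte_frequency_vector(bytes(range(0, 256, 2))), "binary"),
]

predicted = knn_classify(byte_frequency_vector(b"some unseen text sample"), training)
print(predicted)
```

In a real system the training vectors would come from labeled files of each supported type. Because every query is compared against every stored vector, classification cost grows with both the number of samples and the 256 dimensions, which is the cost the thesis reports reducing with PCA and clustering.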
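[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.3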
