A Large-Scale Text Reprint Recognition Algorithm Based on Cluster Words

Posted: 2018-03-28 07:03

Topic: reprint recognition. Focus: cluster words. Source: 《计算机应用》 (Journal of Computer Applications), 2010, Issue 06

[Abstract]: Text reprint recognition is the task of detecting sets of documents with identical or near-identical content in a large-scale text collection. It is widely needed in applications such as hot topic detection, condensing search engine results, and detecting plagiarism in academic articles. To cope with the increasingly diverse forms of text reprinting on the web and to further improve the efficiency of practical systems, the authors study and analyze various text features and comparison algorithms and propose a large-scale text reprint recognition algorithm based on cluster words. The algorithm identifies and extracts high-scoring cluster words according to the distribution attributes of words and uses them to represent each text. It then applies two operations to the text set, an extended linear comparison followed by a multidimensional comparison, and finally filters out the reprint recognition results. Comparative experiments show that the algorithm achieves high overall performance in precision, recall, and efficiency.
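The abstract does not give the scoring formula or the details of the two comparison passes, so the following Python sketch only illustrates the general shape of such a pipeline. Everything in it is an assumption made for illustration: the `clump_score` heuristic (frequency weighted by how tightly a word's occurrences sit together), the signature size `k`, the sort-and-compare-neighbours first pass, and the inverted-index Jaccard second pass are simple stand-ins, not the method of the paper.

```python
from collections import defaultdict
from itertools import combinations


def clump_score(positions, doc_len):
    """Hypothetical score: repeated words whose occurrences sit close together score higher."""
    if len(positions) < 2:
        return 0.0
    span = positions[-1] - positions[0] + 1
    # Frequency weighted by how small a share of the document the occurrences span.
    return len(positions) * (1.0 - span / (doc_len + 1))


def signature(tokens, k=3):
    """Represent a document by its (at most) k highest-scoring words."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    scores = {w: clump_score(p, len(tokens)) for w, p in positions.items()}
    # Keep only words with a positive score; break ties alphabetically.
    ranked = sorted((w for w in scores if scores[w] > 0),
                    key=lambda w: (-scores[w], w))
    return frozenset(ranked[:k])


def find_reprints(docs, k=3, threshold=0.5):
    """docs maps a document id to its token list; returns a set of (id, id) pairs."""
    sigs = {d: signature(t, k) for d, t in docs.items()}
    pairs = set()

    # Pass 1 (stand-in for a linear comparison): sort by signature and
    # compare each document only with its neighbour in that order.
    ordered = sorted(sigs, key=lambda d: sorted(sigs[d]))
    for a, b in zip(ordered, ordered[1:]):
        if sigs[a] and sigs[a] == sigs[b]:
            pairs.add(tuple(sorted((a, b))))

    # Pass 2 (stand-in for a multidimensional comparison): index every
    # signature word, then compare documents that share at least one word.
    index = defaultdict(set)
    for d, s in sigs.items():
        for w in s:
            index[w].add(d)
    for bucket in index.values():
        for a, b in combinations(sorted(bucket), 2):
            jaccard = len(sigs[a] & sigs[b]) / len(sigs[a] | sigs[b])
            if jaccard >= threshold:
                pairs.add((a, b))
    return pairs
```

In this sketch the first pass costs one sort plus a linear scan and catches exact signature matches cheaply, while the second pass tolerates partial overlap between signatures. The threshold of 0.5 is arbitrary and would need tuning on real data.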
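[Author affiliations]: Joint Research Institute of Computer Science, Capital Normal University (首都师范大学计算机科学联合研究院); Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science, Beijing Institute of Technology
[Funding]: National 863 Program of China (2007AA01Z438); 2008 Knowledge Innovation Fund of the Institute of Computing Technology, Chinese Academy of Sciences
[Classification number]: TP391.1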

Article ID: 1675243


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/1675243.html
