
A Large-Scale Text Reprint Detection Algorithm Based on Cluster Words

Posted: 2018-03-28 07:03

Topic: reprint detection. Focus: cluster words. Source: Journal of Computer Applications (《计算机应用》), 2010, No. 06


[Abstract]: Text reprint detection is the task of finding, within a large-scale text collection, sets of documents whose content is identical or similar. Demand for it is widespread in applications such as hot-topic detection, condensing search engine results, and detecting plagiarism in academic articles. To keep up with the growing variety of forms that reprinting takes on the web, and to further improve the efficiency of practical systems, the paper studies and analyzes various text features and comparison algorithms and proposes a large-scale text reprint detection algorithm based on cluster words. The algorithm identifies and extracts high-scoring cluster words according to the distribution attributes of words and uses them to represent each text; it then applies two operations to the text set, an extended linear comparison and a multidimensional comparison, to filter out the final reprint detection results. Comparative experiments show that the algorithm achieves high overall performance in precision, recall, and efficiency.
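The abstract names the stages of the pipeline but gives no formulas, so the following Python sketch is only an illustration of the general shape, not the paper's method. Everything specific in it is an assumption: the scoring function (occurrence count squared divided by positional span), the signature size `k`, the reading of "extended linear comparison" as a sorted pass with a neighbor window, and the reading of "multidimensional comparison" as a Jaccard check over signatures. All function names are hypothetical.

```python
from collections import defaultdict


def cluster_word_scores(tokens):
    """Score words by how densely their occurrences bunch together (assumed formula)."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    scores = {}
    for word, pos in positions.items():
        if len(pos) < 2:
            continue  # a single occurrence cannot form a cluster
        span = pos[-1] - pos[0] + 1
        # Assumption: many occurrences packed into a short span score high.
        scores[word] = len(pos) * len(pos) / span
    return scores


def signature(tokens, k=3):
    """Represent a document by its k highest-scoring cluster words."""
    scores = cluster_word_scores(tokens)
    top = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    return frozenset(top)


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_reprints(docs, k=3, window=2, threshold=0.5):
    """Return sorted pairs of document ids judged to be reprints of each other."""
    sigs = {doc_id: signature(tokens, k) for doc_id, tokens in docs.items()}
    # Pass 1 (assumed reading of "extended linear comparison"): sort documents
    # by signature and compare each one only with its next `window` neighbors.
    order = sorted(sigs, key=lambda d: (sorted(sigs[d]), d))
    candidates = set()
    for i, doc_id in enumerate(order):
        for other in order[i + 1:i + 1 + window]:
            if sigs[doc_id] & sigs[other]:
                candidates.add(tuple(sorted((doc_id, other))))
    # Pass 2 (assumed reading of "multidimensional comparison"): keep only
    # candidate pairs whose signatures overlap strongly enough.
    return sorted(
        pair for pair in candidates
        if jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    )
```

Calling `find_reprints` with a dictionary that maps document ids to token lists returns the candidate reprint pairs. The sorted pass keeps the first stage close to linear in the number of documents after sorting, at the cost of missing pairs whose signatures sort far apart; a real system would need a more robust candidate-generation step.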
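[Author affiliations]: Joint Research Institute of Computer Science, Capital Normal University; Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science, Beijing Institute of Technology
[Funding]: National 863 Program project (2007AA01Z438); 2008 Knowledge Innovation Fund of the Institute of Computing Technology, Chinese Academy of Sciences
[Classification number]: TP391.1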


