
Research on Data Mining Based on User Posts on Comment-Oriented Websites

Published: 2018-10-13 16:07
[Abstract]: With the rapid growth of the Internet, many comment-oriented websites that interact closely with their users have appeared. Their most prominent characteristics are real-time updates and rapid turnover of information. Because of these characteristics, such sites hold a great deal of valuable knowledge, and mining it offers important guidance for social development. This thesis takes BBS sites, the most typical representatives of this class, as its research object and uses a search engine to mine their comment content and extract potentially valuable information. It adopts a new page-ranking algorithm (the P-OPIC algorithm) that improves how thoroughly page content is mined and lets users locate target pages more quickly. The thesis studies the components and architecture of search engines and analyzes the operating mechanism of the open-source search engine Nutch. The main work is as follows:
(1) The crawler framework and indexing framework of Nutch are studied in detail, and the Nutch execution flow is analyzed in depth. The PageRank, HITS, and OPIC algorithms are studied, and an optimized algorithm based on OPIC is proposed. The optimized algorithm incorporates each page's PageRank value and an adjustment factor for BBS sites; the adjustment factor improves the stability of BBS page rankings.
(2) The data structures of Nutch are studied; a new data structure is added to Nutch and Chinese word segmentation is implemented. By modifying the data in the Nutch source code, the algorithm's impact on search engine system performance is reduced.
(3) An experimental method is proposed to study the algorithm's performance, and the OPIC algorithm is compared with the improved OPIC-based algorithm on measured data. Tested on BBS data, the improved algorithm interprets the keywords entered by users well, ranks pages considerably better than OPIC, and clearly improves ranking accuracy. The experimental results are analyzed and compared, and the strengths and weaknesses of the algorithm are summarized.
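
The abstract does not state the P-OPIC formula, so the following is only a minimal sketch of the baseline OPIC idea it builds on: every page holds some "cash", and each visit banks that cash in the page's history and then splits it equally among the page's outlinks. The function name `opic_scores`, the round-robin visiting order, and the handling of pages without outlinks (their cash is spread over all pages) are assumptions made for illustration; this is not the thesis's implementation, nor the code used in Nutch.

```python
from typing import Dict, List


def opic_scores(links: Dict[str, List[str]], rounds: int = 50) -> Dict[str, float]:
    """Estimate page importance with a simplified OPIC-style cash flow.

    `links` maps each page to the pages it links to. Assumption: every
    link target also appears as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    cash = {p: 1.0 / n for p in pages}   # total cash starts at 1
    history = {p: 0.0 for p in pages}    # cash accumulated over all visits

    for _ in range(rounds):
        for page in pages:               # round-robin visiting (an assumption)
            amount = cash[page]
            history[page] += amount      # record the cash seen at this visit
            cash[page] = 0.0
            # Pages without outlinks give their cash to every page.
            targets = links[page] or pages
            share = amount / len(targets)
            for target in targets:
                cash[target] += share

    total = sum(history.values())
    # A page's importance estimate is its share of all recorded cash.
    return {p: history[p] / total for p in pages}


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in opic_scores(graph).items():
        print(page, round(score, 3))
```

In this toy graph, page `b` is linked only from `a` and receives half of `a`'s cash, so it ends up with the lowest estimate. An extension along the lines described in the abstract would combine such a score with a PageRank value and a BBS-specific adjustment factor, but how they are combined is not given here.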
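[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP311.13; TP391.3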




Article ID: 2269206


Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2269206.html

