
Research on Web2.0 Network Hot Topic Discovery and Personalized Retrieval

Published: 2018-11-25 11:19
[Abstract]: In recent years, so-called Web2.0 sites and technologies have developed rapidly and fundamentally changed the face of the Internet. Web2.0 sites emphasize free creation and user participation, and hundreds of millions of Internet users have created a vast amount of lively, engaging content on this new generation of Web platforms. The ever-growing body of Internet information makes it hard for users to find what truly interests them in this sea of data, so information retrieval and search engine technologies have attracted wide attention and advanced considerably.

Existing Web information retrieval systems are mainly search engines, which still have several shortcomings: first, only a small proportion of Web2.0 content is indexed; second, the results do not reflect the information and hot topics currently popular on the network; third, the results are not ranked or filtered according to the user's interests. To address these problems, this thesis explores how, in a Web2.0 environment, to help users obtain popular hot topics from the ocean of Web2.0 information according to their own interests.

The thesis studies hot topic discovery in Web2.0 communities and personalized recommendation within Web information retrieval, with the aim of improving the user's retrieval experience. It first proposes a research framework, then examines the key technologies of each major component module, and proposes improved algorithms and models suited to the characteristics of Web2.0 sites. The main contents and contributions are:

1. Based on how Web2.0 sites organize information and structure it hierarchically, the thesis abstracts an Object-Oriented Distributed Deep Crawler (OODDC), which uses relatively little bandwidth to stay synchronized with the live data, greatly improving crawler efficiency and the timeliness of the collected data. Experimental results confirm the advantages of the object-oriented distributed real-time deep crawler.

2. After a detailed study of Web2.0 data formats and the tagging (Tag) of content, and building on traditional Web information extraction algorithms combined with the vector space model (VSM) and entity recognition algorithms, the thesis uses a vector made up of a small number of Tags and their weights to describe the features of Web objects such as web pages, images, videos, and blogs, establishing a unified Tag-based information representation model.

3. Based on the unified Tag-based information representation model, the thesis improves existing topic detection and tracking (TDT) algorithms, using a fast clustering algorithm to detect and aggregate network topics. Taking into account the effect of user feedback on how popular a piece of information is, it also proposes a network topic hotness evaluation algorithm (HotRank) that computes the hotness of the collected topics as a basis for ranking and recommendation. In practice, ranking retrieval results by relevance and hotness together proves more attractive to users.

4. To address the shortcomings of existing user interest models, the thesis proposes a topic-based online user interest model. The model automatically extracts the topics of the pages a user visits and can be updated at very low cost whenever the user's interests change. It can be applied to a variety of personalized services. Experiments show that a personalized recommendation system built on this model performs well.
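The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea behind contributions 2 and 3: each Web object is described by a small weighted Tag vector, and topics are formed by clustering those vectors. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the function names `tag_vector`, `cosine`, and `single_pass_cluster`, the use of tag frequency as the weight, the `top_k` cutoff, the 0.5 similarity threshold, and the choice of single-pass clustering as the "fast clustering algorithm".

```python
import math
from collections import Counter


def tag_vector(tags, top_k=5):
    """Describe an object by its top_k most frequent tags (hypothetical weighting).

    Returns a sparse, L2-normalized vector as a dict of tag -> weight.
    """
    counts = Counter(tags).most_common(top_k)
    norm = math.sqrt(sum(c * c for _, c in counts))
    return {tag: c / norm for tag, c in counts} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse tag vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector to its most similar topic, or open a new topic.

    Each topic is summarized by the running sum of its member vectors.
    Returns one topic index per input vector.
    """
    centroids = []
    labels = []
    for vec in vectors:
        best, best_sim = -1, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No existing topic is similar enough: start a new one.
            centroids.append(dict(vec))
            best = len(centroids) - 1
        else:
            # Fold the new member into the matched topic's summary vector.
            for t, w in vec.items():
                centroids[best][t] = centroids[best].get(t, 0.0) + w
        labels.append(best)
    return labels


objects = [
    tag_vector(["python", "crawler", "python"]),
    tag_vector(["python", "crawler"]),
    tag_vector(["music", "video"]),
]
labels = single_pass_cluster(objects)
```

In this toy input the first two objects share tags and fall into one topic, while the third shares none and opens a second topic. A single pass over the data keeps the cost low for streaming content, at the price of results that depend on arrival order.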
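[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP391.3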

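[Citing literature]

Related master's theses: 1 entry

1 Wang Xingxing (王星星); Design and Implementation of a Personalized Intelligence Recommendation System Based on Network Hot Topics [D]; Central China Normal University; 2014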



Article ID: 2355919


Link to this article: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2355919.html

