Design of a Hot Event Discovery System for the Tibetan-Language Web
[Abstract]: Since the Internet emerged as a medium in the 1970s, we have entered an era of unprecedented information abundance, and the way information spreads has changed greatly. More and more people share their views, ideas, and attitudes through online media. Because this information lacks unified organization and management, it is difficult to find and manage what we need, and people urgently need a tool for quickly retrieving relevant information from the network. Search engines serve this purpose, but because they rely on keyword matching and do not filter the results, they return many pages containing irrelevant information, and users spend a lot of time picking out what they need. For hot issues, search engines are even less helpful. News organizations do select the hot events in a given field every year, but because the cycle is annual and the selection is made by people, neither the timeliness nor the objectivity of the results can be guaranteed.
This thesis takes the corpus of the People's Net (People's Daily Online) Tibetan-language website as its research object. It uses Topic Detection and Tracking (TDT) technology to identify and track news events, and clusters those events in order to design a hot event discovery system. The system lets users see the hot events on the Tibetan-language web for any period of time, and its results are more objective than manual selection. The thesis first introduces the relevant theory and key technologies of TDT for identifying and tracking events in an online news stream. It then describes how a crawler fetches web pages within a specified range and extracts the body text while removing noise. Weight vectors are generated after word segmentation. Based on a study of hot event discovery algorithms, a method for calculating the heat of an event is proposed that improves the system's sensitivity to newly emerging hot events. An improved two-layer clustering strategy is then used to cluster the texts and obtain the list of events. Finally, the algorithm and approach are verified and evaluated in experiments on a 2011 news corpus. The results show that the system performs well.
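The pipeline described in the abstract (weight vectors, two-layer clustering, event heat) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the weighting formula, the heat formula, or the details of the improved two-layer strategy. The sketch therefore assumes smoothed TF-IDF weights, single-pass clustering on cosine similarity, a first layer that clusters stories within one day and a second layer that merges the daily clusters, and a heat score that decays with a half-life. All function names, thresholds, and the half-life are hypothetical, and the input is assumed to be already segmented into tokens.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """docs: list of token lists (already segmented). Returns {term: weight} dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: relative term frequency times a smoothed IDF.
        vectors.append({t: (c / len(doc)) * (math.log((n + 1) / (df[t] + 1)) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def single_pass(vectors, threshold):
    """Put each vector in the most similar cluster, or open a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, centroid([vectors[j] for j in c]))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters

def two_layer_events(stories, day_threshold=0.3, event_threshold=0.3):
    """stories: list of (day, tokens). Returns events as sorted lists of story indices."""
    vectors = tfidf_vectors([tokens for _, tokens in stories])
    by_day = defaultdict(list)
    for idx, (day, _) in enumerate(stories):
        by_day[day].append(idx)
    # Layer 1 (assumed): cluster the stories published on the same day.
    micro = []
    for day in sorted(by_day):
        idxs = by_day[day]
        for c in single_pass([vectors[i] for i in idxs], day_threshold):
            micro.append([idxs[j] for j in c])
    # Layer 2 (assumed): merge daily clusters across days by centroid similarity.
    micro_centroids = [centroid([vectors[i] for i in m]) for m in micro]
    return [sorted(i for k in c for i in micro[k])
            for c in single_pass(micro_centroids, event_threshold)]

def heat(event, stories, today, half_life=3.0):
    """Hypothetical heat: a story's contribution halves every `half_life` days."""
    return sum(0.5 ** ((today - stories[i][0]) / half_life) for i in event)
```

With a decay of this kind, an event that receives several stories in the last few days outranks an older event with the same total number of stories, which is one plausible way to obtain the sensitivity to new hot events that the abstract mentions.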
[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3