Research on Valuable Event Identification in Web Data Integration

Published: 2018-05-06 06:14

Topics: duplicate event representations + event representation unification; Source: Shandong University, 2014 doctoral dissertation

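【Abstract】: With the rapid development of Internet technology, the Web has become an enormous information source holding massive amounts of data. Because it is open, interactive, and convenient, it has become an important platform for obtaining information. Acquiring the required information from the Web accurately and effectively, and then analyzing and mining it further, is especially important for analytical applications such as market intelligence analysis and business intelligence. In contrast to the structured data of traditional data integration, Web pages contain large amounts of unstructured data. Within that data, a statement describing an activity that takes place at a specific time and place with specific participants is called an event. Identifying valuable events in Web pages means identifying the event information scattered across many pages and associating it with the events' value data, which provides data support for applications such as market intelligence analysis.

News reports on Web pages contain a large number of events and give users timely, wide-ranging information. However, the statements describing these events take different angles and are worded freely, so it is hard to tell whether they refer to the same event. The different statements that describe the same event across Web reports are called event representations. Across a large number of pages, aggregating event representations reveals the event they jointly refer to, and the mutually corroborating and complementary information among co-referring representations yields a fuller and more accurate picture of that event. In addition, analyzing event topics and integrating topic heat (popularity) information makes it possible to identify valuable events at different levels. The valuable events identified in this way carry richer and more accurate data and integrate value information from several levels, such as the event topic. They can support applications such as market intelligence analysis and serve as a basis for further data analysis and mining.

Valuable Web event identification has become one of the current research hotspots. Web events are massive in number, unstructured, freely described, and richly interrelated, and identifying valuable events requires not only discovering Web events but also integrating their value information. The following problems remain open.
(1) The same event is covered by different news reports, and the event representations in those reports differ considerably because they describe the event from different angles. These representations are spread over a large number of pages. How to discover duplicate event representations quickly and accurately, and how to aggregate the representations that refer to the same event, needs to be studied.
(2) Event representations describe an event from different angles. How to exploit the mutually corroborating and complementary information among them, unify the differently worded co-referring representations into a single representation, and ensure that the merged representation carries accurate and rich data needs to be solved.
(3) Different Web events can share a common topic. How to discover the topics of different events accurately, analyze topic-word heat, and identify valuable events at the topic level needs to be solved.

With Web data integration as its goal, this thesis studies the problems above. Its main contributions are the following three.
(1) A duplicate event representation discovery method based on dimension matching and a co-occurrence constraint. Events are represented with the 8 dimensions {agent, activity, object, time, location, cause, purpose, manner}, which gives them a degree of structure. The content of each dimension is matched with its own matcher, and an extended evidence-theory model combines the per-dimension matching results. For discovery over large numbers of Web pages, a co-occurrence constraint on the event representations of Web pages is proposed to reduce the number of representation matches between pages. Experimental results show that the method accurately aggregates large numbers of duplicate event representations that refer to the same event, reduces the number of matches between representations, and thereby shortens discovery time and improves efficiency.
(2) An event representation unification method based on recombining dimension content, motivated by the variety of forms that Web event representations of the same event take. The method selects the more accurate and detailed dimension content from a large number of duplicate representations and combines it into a single representation that reflects the real-world event. A Markov logic network combined with multiple first-order logic rules is used to judge and select the more complete and accurate dimension content, and the selected content, which is scattered across several representations, is assembled into one representation. Experimental results show that the method selects more complete and accurate dimension content effectively and that unification achieves high accuracy.
(3) An event topic analysis method based on topic-feature clustering and an extended LDA model, motivated by the fact that different events can share a topic. The method analyzes the topic words of events and their heat in order to identify valuable events at the topic level. The extended LDA model, DLDA, integrates the dimension information of events into LDA, avoids assigning topic probability to event dimensions that are irrelevant to the topic (such as the content of the time and location dimensions), and selects the topic-feature dimensions. Events are clustered on the content of the selected dimensions so that event topics are identified accurately. A topic-word synthesis rule is proposed to synthesize the topic words of events and analyze their heat. Experimental results show that the proposed method extracts event topic words and analyzes their heat accurately, and identifies valuable events effectively at the topic level.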
[Abstract]: With the rapid development of Internet technology, the Web has become a huge source of information. Because it is open, interactive, and convenient, it has become an important platform for obtaining information. Obtaining the required information from the Web accurately and effectively, and then analyzing and mining it further, is particularly important for analytical applications such as market intelligence analysis and business intelligence.

In contrast to the structured data of traditional data integration, Web pages contain a large amount of unstructured data, in which statements describing an activity that occurs at a particular time and place with particular participants are called events. Identifying valuable events in Web pages means identifying the event information dispersed across a large number of pages and associating it with the events' value data, which provides data support for applications such as market intelligence analysis.

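News reports on Web pages contain a large number of events and provide users with timely and extensive information. However, the statements describing these events differ in angle and wording, so it is difficult to tell whether they refer to the same event. The different statements that describe the same event are called event representations.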

Valuable Web event identification has become one of the current research hotspots. Because Web events are massive in number, unstructured, freely described, and richly interrelated, several problems remain to be solved:
(1) The same event is covered by different news reports, and its event representations differ considerably and are spread over many pages. How to discover duplicate event representations quickly and accurately and aggregate those that refer to the same event is a problem that needs to be studied;
(2) Event representations describe an event from different angles. How to make full use of the mutually corroborating and complementary information among them, unify the differently worded co-referring representations into a single representation, and ensure that the merged representation has accurate and rich data is a problem that needs to be solved;
(3) Different Web events can share a common topic. How to find the topics of different events accurately, analyze the heat (popularity) of topic words, and identify valuable events at the topic level is a problem that needs to be solved.

With Web data integration as its goal, this thesis studies the above problems in valuable Web event identification. Its contributions fall into the following three areas:

(1) A method for discovering duplicate event representations based on dimension matching and a co-occurrence constraint is proposed. Events are represented in 8 dimensions, and a co-occurrence constraint on the event representations of Web pages is used to reduce the number of matches between representations, so that duplicate event representations in Web pages are found accurately and efficiently.

In this method, an event is represented by the 8 dimensions {agent, activity, object, time, location, cause, purpose, manner}, which gives the event a degree of structure. The content of each dimension is matched by its own matcher, and an extended evidence-theory model combines the per-dimension results. The experimental results show that the method accurately aggregates large numbers of duplicate event representations of the same event and reduces the number of representation matches between Web pages.
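The thesis combines per-dimension matching results with an extended evidence-theory model whose details the abstract does not give. As a rough illustration only, the sketch below uses the classical Dempster rule of combination over the frame {same, different}. The string-similarity matcher, the `reliability` discount, and all function names are assumptions made for this example, not the author's design, and the co-occurrence constraint is not modeled.

```python
from difflib import SequenceMatcher

# The 8 event dimensions named in the abstract.
DIMENSIONS = ("agent", "activity", "object", "time",
              "location", "cause", "purpose", "manner")

def dimension_mass(a, b, reliability=0.8):
    """Return masses (same, different, unknown) for one dimension.

    Hypothetical matcher: plain string similarity, discounted by an
    assumed reliability so that some mass always stays on 'unknown'.
    """
    if not a or not b:
        return (0.0, 0.0, 1.0)  # missing value: total ignorance
    sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return (reliability * sim, reliability * (1.0 - sim), 1.0 - reliability)

def combine(m1, m2):
    """Classical Dempster combination on the frame {same, different}."""
    s1, d1, u1 = m1
    s2, d2, u2 = m2
    conflict = s1 * d2 + d1 * s2
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    k = 1.0 - conflict  # normalization constant
    same = (s1 * s2 + s1 * u2 + u1 * s2) / k
    diff = (d1 * d2 + d1 * u2 + u1 * d2) / k
    unknown = (u1 * u2) / k
    return (same, diff, unknown)

def match_belief(e1, e2):
    """Fuse the evidence from all 8 dimensions of two representations."""
    mass = (0.0, 0.0, 1.0)  # start from the vacuous belief
    for dim in DIMENSIONS:
        mass = combine(mass, dimension_mass(e1.get(dim), e2.get(dim)))
    return mass
```

Calling `match_belief(e1, e2)` on two dictionaries keyed by dimension name returns the fused masses; a pair could be treated as duplicates when the first component exceeds a chosen threshold. Dimensions missing from either representation contribute no evidence.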

(2) Because Web event representations of the same event take many forms, a method for unifying event representations through dimension content recombination is proposed. The more accurate and detailed dimension content in a large number of duplicate event representations is selected and combined into a single event representation that reflects the real event.

The method uses a Markov logic network that combines multiple first-order logic rules to judge and select the more complete and accurate dimension content in the event representations. The experimental results show that the method effectively selects the more complete and accurate dimension content and that event representation unification achieves high accuracy.
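The recombination idea can be illustrated with a much simpler stand-in. The sketch below does not use a Markov logic network. It replaces the thesis's first-order rules with one assumed heuristic: for each dimension, prefer the value corroborated by the most representations, and break ties in favor of the longer, more detailed value. The names `compatible` and `unify` are hypothetical.

```python
# The 8 event dimensions named in the abstract.
DIMENSIONS = ("agent", "activity", "object", "time",
              "location", "cause", "purpose", "manner")

def compatible(a, b):
    """Assumed corroboration test: one value contains the other."""
    a, b = a.lower(), b.lower()
    return a in b or b in a

def unify(representations):
    """Merge co-referring representations into a single representation.

    For each dimension, keep the value supported by the most
    representations; among equally supported values keep the longest.
    """
    unified = {}
    for dim in DIMENSIONS:
        values = [r[dim].strip() for r in representations if r.get(dim)]
        if not values:
            continue  # no representation fills this dimension

        def score(candidate):
            support = sum(compatible(candidate, v) for v in values)
            return (support, len(candidate))

        unified[dim] = max(values, key=score)
    return unified
```

Given three reports whose location values are "Jinan", "Jinan, Shandong", and "Beijing", this heuristic keeps "Jinan, Shandong": it is corroborated by two reports and is the more detailed of the two compatible values.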

(3) Because different events can share a common topic, an event topic analysis method based on topic-feature clustering and an extended LDA model is proposed. The topic words of events and their heat are analyzed, and valuable events are identified at the topic level.

The thesis proposes an extended LDA model, DLDA, which integrates the dimension information of events into the LDA model, avoids assigning topic probability to topic-irrelevant event dimensions (such as time and location), and selects the topic-feature dimensions. Events are clustered on the content of the selected topic-feature dimensions to identify event topics accurately, and a topic-word synthesis rule is proposed to synthesize the topic words of events and analyze their heat. The experimental results show that the proposed method accurately extracts event topic words and analyzes their heat, and that valuable events can be effectively identified at the topic level.
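The abstract does not define DLDA or topic-word heat formally, so the following is only a sketch of the underlying intuition, not the model. It drops the two dimensions the abstract names as topic-irrelevant (time and location) before counting words, and it assumes that heat is the fraction of events in which a word appears. The fixed `TOPIC_IRRELEVANT` set and both function names are assumptions; in the thesis the topic-feature dimensions are selected by the model.

```python
from collections import Counter

# Assumed fixed set; the thesis selects topic-feature dimensions with DLDA.
TOPIC_IRRELEVANT = {"time", "location"}

def topic_tokens(event):
    """Collect the distinct words from the topic-relevant dimensions."""
    tokens = set()
    for dim, text in event.items():
        if dim in TOPIC_IRRELEVANT or not text:
            continue
        tokens.update(text.lower().split())
    return tokens

def word_heat(events):
    """Assumed heat measure: share of events that mention each word."""
    counts = Counter()
    for event in events:
        counts.update(topic_tokens(event))
    n = len(events)
    return {word: c / n for word, c in counts.items()} if n else {}
```

With this measure, a word that occurs in the activity dimension of every event has heat 1.0, while words that occur only in time or location values never enter the ranking.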

【Degree-granting institution】: Shandong University
【Degree level】: Doctoral
【Year conferred】: 2014
【Classification number】: TP393.092;TP391.1
