Research on a Community-Analysis-Based Method for Discovering Polysemous Tags in Folksonomies

Published: 2019-06-14 20:29

【Abstract】: Folksonomies, which emerge from social tagging systems, are widely used in personalized recommendation and information retrieval. Social tagging systems owe their success largely to letting users annotate resources with any tags they choose. Yet this same uncontrolled tagging has left social tagging systems and folksonomies with persistent semantic ambiguity, which hinders their further development. This thesis studies one such ambiguity problem: polysemous tags in folksonomies. Most existing work concentrates on tags, resources, and the associations between them, and often ignores information that characterizes users. However, users are the principal actors in a social tagging system, and how they understand a tag directly shapes the meaning the tag carries. Moreover, mining tag semantics should not stop at the level of the user population as a whole; it should also reach down to the individual level. This thesis therefore partitions the folksonomy according to users' interests, analyzes how the context of the same tag differs across user communities, and identifies polysemous tags by comparing these differences. The work has two parts. First, it builds a relationship network based on user interests and applies a community detection algorithm to that network to discover user communities. Second, it proposes two metrics, semantic aggregation and semantic dispersion: semantic aggregation measures how semantically similar the tags within a context are to one another, and semantic dispersion measures how much a tag's contexts differ across communities. With these two metrics, differences in tag context between user communities can be compared quantitatively, and a tag can then be judged polysemous or not. Experiments were run on the Delicious and MovieLens datasets, with a polysemy discovery algorithm based on overlapping clustering as the baseline. The results show that the proposed method outperforms the baseline, most clearly on datasets containing many users with differing interests.
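The abstract does not give formulas for the two metrics or name the community detection algorithm, so the following Python sketch is only an illustration of the general idea under stated assumptions. It uses Jaccard overlap for every similarity, and it uses connected components of a thresholded user-similarity graph as a simple stand-in for community detection. The toy annotations, the threshold, and all function names are hypothetical and do not come from the thesis.

```python
from collections import defaultdict
from itertools import combinations

# Made-up (user, resource, tag) annotations, for illustration only.
ANNOTATIONS = [
    ("u1", "r1", "apple"), ("u1", "r1", "fruit"),
    ("u1", "r2", "fruit"), ("u1", "r2", "recipe"),
    ("u2", "r1", "apple"), ("u2", "r1", "fruit"), ("u2", "r3", "recipe"),
    ("u3", "r4", "apple"), ("u3", "r4", "iphone"),
    ("u3", "r5", "iphone"), ("u3", "r5", "mac"),
    ("u4", "r4", "apple"), ("u4", "r4", "mac"), ("u4", "r6", "iphone"),
]


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_communities(annotations, threshold=0.3):
    """Assumed stand-in for community detection: link users whose tag sets
    have Jaccard overlap >= threshold, then return connected components."""
    tags_of = defaultdict(set)
    for user, _, tag in annotations:
        tags_of[user].add(tag)
    users = sorted(tags_of)
    neighbours = {u: set() for u in users}
    for u, v in combinations(users, 2):
        if jaccard(tags_of[u], tags_of[v]) >= threshold:
            neighbours[u].add(v)
            neighbours[v].add(u)
    communities, seen = [], set()
    for start in users:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            user = stack.pop()
            if user not in component:
                component.add(user)
                stack.extend(neighbours[user] - component)
        seen |= component
        communities.append(component)
    return communities


def tag_context(annotations, community, tag):
    """Tags that co-occur with `tag` in posts made by members of `community`."""
    posts = defaultdict(set)
    for user, resource, t in annotations:
        if user in community:
            posts[(user, resource)].add(t)
    context = set()
    for tags in posts.values():
        if tag in tags:
            context |= tags - {tag}
    return context


def contexts_by_community(annotations, communities, tag):
    """Non-empty contexts of `tag`, one per community that uses it."""
    contexts = (tag_context(annotations, c, tag) for c in communities)
    return [ctx for ctx in contexts if ctx]


def semantic_aggregation(annotations, context):
    """Assumed definition: mean pairwise similarity of the context's tags,
    where two tags are similar if they annotate the same resources."""
    resources_of = defaultdict(set)
    for _, resource, t in annotations:
        resources_of[t].add(resource)
    pairs = list(combinations(sorted(context), 2))
    if not pairs:
        return 1.0  # a context with fewer than two tags is trivially cohesive
    total = sum(jaccard(resources_of[a], resources_of[b]) for a, b in pairs)
    return total / len(pairs)


def semantic_dispersion(contexts):
    """Assumed definition: 1 minus the mean pairwise overlap of contexts."""
    pairs = list(combinations(contexts, 2))
    if not pairs:
        return 0.0  # fewer than two contexts: nothing to compare
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    communities = user_communities(ANNOTATIONS)
    for tag in ("apple", "fruit"):
        contexts = contexts_by_community(ANNOTATIONS, communities, tag)
        cohesion = [semantic_aggregation(ANNOTATIONS, c) for c in contexts]
        print(tag, semantic_dispersion(contexts), cohesion)
```

In this toy data the tag "apple" appears in two user communities with non-overlapping contexts, which is the pattern the abstract associates with a polysemous tag, while "fruit" appears in only one. How the thesis combines the two metrics into a final decision is not stated in the abstract, so no decision rule is sketched here.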
【Degree-granting institution】: Dalian University of Technology
【Degree level】: Master's
【Year degree granted】: 2016
【Classification number】: TP391.1
