Research on Semi-Supervised Text Clustering Algorithms for Personalized Topics
Published: 2018-07-25 10:25
[Abstract]: With the worldwide spread of the Internet and the steady growth in the number of users, the data accumulated online is growing exponentially. A considerable share of this data is text. How to analyze this text effectively and mine valuable information from it has become an active research problem. In data mining, semi-supervised text clustering is one of the important techniques for text analysis: it can use a small amount of supervision information to improve clustering performance, and it has therefore attracted wide attention. Most existing semi-supervised text clustering algorithms either ignore the user's individual intent or cannot exploit it well, so they cannot produce a personalized partition of the texts; or they require a form of supervision that is hard for users to provide, which severely limits their applicability. Moreover, in practice the supervision a user can provide is very sparse relative to the huge volume of text, so its influence on the clustering process is also very limited.

Based on an analysis of the research background of semi-supervised text clustering and of the problems in existing semi-supervised clustering algorithms, the research content and results of this thesis are as follows:

(1) A new format for supervision information is proposed: keywords marked as "interested" or "not interested". This format is easy for users to provide and, to some extent, addresses both the expression of user personalization and the problem of the form of supervision information.

(2) Based on the limited supervision information provided by the user and on the distribution of words in the texts and in the latent topics, the supervision information is learned and expanded to address its scarcity.

LDA performs well on clustering problems and can uncover the latent topics among texts. This thesis therefore introduces LDA into semi-supervised text clustering and uses an urn model to simulate the text clustering process combined with the new form of supervision. Targeting the newly proposed form of supervision and expanding it using word distributions, the thesis proposes an extensible, user-preference-based semi-supervised text clustering algorithm (extended LDA, ex LDA). To verify the effectiveness of the algorithm, five groups of real datasets are selected from different perspectives from the 20-newsgroups news dataset. The experiments first analyze the rationality and effectiveness of the supervision information from the perspective of its form, and finally verify the effect of the expanded supervision information on the clustering results. Experiments on real datasets show that, compared with traditional and recent semi-supervised text clustering algorithms, the proposed ex LDA algorithm achieves higher accuracy on text clustering while also delivering a personalized partition of the texts.
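The abstract does not spell out how the keyword supervision is expanded, so the following is only a minimal sketch of one plausible reading, not the thesis's ex LDA procedure. It assumes a topic-word distribution is already available (for example from a fitted LDA model) and grows each seed set with the highest-probability words of the topic in which a seed keyword is most probable. The function name `expand_keywords`, the `top_n` parameter, and the toy distribution `topic_word` are all hypothetical.

```python
from typing import Dict, List, Set

# Toy topic-word distributions (hypothetical numbers, one dict per latent topic).
topic_word: List[Dict[str, float]] = [
    {"space": 0.40, "nasa": 0.30, "orbit": 0.20, "game": 0.10},
    {"game": 0.50, "team": 0.30, "hockey": 0.15, "space": 0.05},
]


def expand_keywords(
    seeds: Set[str],
    topic_word: List[Dict[str, float]],
    top_n: int = 2,
) -> Set[str]:
    """Grow a seed keyword set using the top words of each seed's dominant topic.

    Assumption: a seed's "dominant topic" is the topic that assigns it the
    highest probability. Seeds absent from every topic are kept unchanged.
    """
    expanded = set(seeds)
    for seed in seeds:
        # Probability of this seed word under each topic.
        scores = [dist.get(seed, 0.0) for dist in topic_word]
        if max(scores) == 0.0:
            continue  # seed is out of vocabulary; nothing to learn from it
        best = scores.index(max(scores))
        # Rank the dominant topic's words by probability, highest first.
        ranked = sorted(topic_word[best], key=topic_word[best].get, reverse=True)
        expanded.update(ranked[:top_n])
    return expanded


# User supervision in the "interested" / "not interested" keyword format.
interested = {"space"}
not_interested = {"hockey"}

print(sorted(expand_keywords(interested, topic_word)))      # ['nasa', 'space']
print(sorted(expand_keywords(not_interested, topic_word)))  # ['game', 'hockey', 'team']
```

In a fuller system the expanded sets would then bias the topic model, for instance through the word priors, and the expansion would be repeated as the topics are re-estimated. How ex LDA actually couples these steps is not stated in the abstract.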
[Degree-granting institution]: Guizhou University
[Degree level]: Master's
[Year of degree conferral]: 2016
[Classification number]: TP391.1
Article ID: 2143531
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2143531.html