
Research on User Information Mining and Its Applications for Online Communities

Published: 2018-06-05 22:04

  Topics: online communities + user information; source: Harbin Institute of Technology, 2014 doctoral dissertation


[Abstract]: In recent years, with the growth of online communities, the web has accumulated a massive amount of user information, including account information (such as usernames), demographic information (such as gender and age), social relationships (such as friendship and reply relationships), and user-generated content. This information can help companies better understand and target their customers, support better personalized information systems for users, and help sociologists better understand human behavior. Mining user information in online communities is therefore key to building new social applications and to understanding human behavior.
However, user information mining in online communities faces several challenges: the unstructured challenge, the cross-community challenge, and the non-quantified challenge. The unstructured challenge is that user information appears in unstructured form across many different types of web pages, whose diverse and dynamic layouts make automatic extraction difficult. The cross-community challenge is that one user's information is scattered in fragments across different communities, which makes it hard to form a complete picture of that user. The non-quantified challenge is that user attributes (such as influence and expertise) lack explicit, direct measures, which makes them hard to use directly. This thesis studies these three challenges and explores applications of user information. Its main contributions are summarized as follows:
(1) To address the unstructured challenge, this thesis studies username extraction from user-generated content web pages and proposes a weakly supervised learning method. The method uses a small number of usernames composed of statistically rare strings to automatically collect and label a large amount of training data, which removes the need of current supervised methods for manually labeled training data. The method also relies only on features extracted from a single page, overcoming the dependence of existing methods on multi-page features. Experimental results show that the method significantly outperforms supervised learning methods that use only single-page features and performs comparably to supervised methods that use multi-page features.
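The automatic labeling idea in item (1) can be illustrated with a minimal sketch. It assumes that a seed username made of a rare string almost never occurs on a page as anything other than a username, so plain string matching yields usable token labels. The function name `auto_label`, the tokenization, and the sample page are hypothetical and are not taken from the thesis.

```python
import re

def auto_label(page_text, seed_usernames):
    """Label each token 1 if it matches a seed username, else 0 (assumed scheme)."""
    seeds = {name.lower() for name in seed_usernames}
    labeled = []
    for token in re.findall(r"\S+", page_text):
        # Strip surrounding punctuation so "xq_zebra77:" still matches its seed.
        core = token.strip(".,:;!?()[]\"'")
        if not core:
            continue
        labeled.append((core, 1 if core.lower() in seeds else 0))
    return labeled

# Hypothetical page text and seed usernames, for illustration only.
page = "Posted by xq_zebra77: thanks for the help, kwyjibo_42!"
examples = auto_label(page, ["xq_zebra77", "kwyjibo_42"])
positives = [tok for tok, label in examples if label == 1]
```

The labeled tokens could then be paired with single-page features to train a classifier; the feature set itself is not described in the abstract and is left out here.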
(2) To address the cross-community challenge, this thesis studies cross-community user linking. It divides user linking into two steps: (a) same-name disambiguation, that is, deciding whether users with the same username belong to the same natural person; and (b) different-name resolution, that is, collecting all the different usernames used by one natural person. The thesis focuses on the same-name disambiguation task. First, it conducts a user questionnaire survey and an analysis of About.me data to quantify the importance of solving same-name disambiguation. The thesis describes this as the first work to quantitatively study how people use usernames. It then proposes a method that automatically obtains training data based on the language model probability of usernames, and validates the assumption underlying this method experimentally on a Yahoo! Answers data set. The method removes the need of current supervised methods for manually labeled data. Experimental results show that the classifier learned from the automatically labeled training set is effective.
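As an illustration of scoring usernames with a language model, as in item (2), the sketch below trains a character bigram model with add-one smoothing. The abstract does not specify the model, so the bigram order, the smoothing, and the class name `CharBigramLM` are assumptions. One plausible reading is that a low-probability (rare) username shared across two communities more likely belongs to one person, so such pairs could serve as automatically labeled positive examples.

```python
import math
from collections import Counter

class CharBigramLM:
    """Character bigram model with add-one smoothing (an illustrative choice)."""

    def __init__(self, usernames):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"$"}  # "$" marks the end of a username
        for name in usernames:
            chars = ["^"] + list(name.lower()) + ["$"]  # "^" marks the start
            self.vocab.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, name):
        """Natural-log probability of the whole username under the model."""
        chars = ["^"] + list(name.lower()) + ["$"]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            count = self.bigrams[(prev, cur)]
            total += math.log((count + 1) / (self.contexts[prev] + v))
        return total

# Toy training list, for illustration only.
lm = CharBigramLM(["john", "johnny", "joan", "jon", "anna", "ann"])
common = lm.log_prob("john")
rare = lm.log_prob("xq_zebra77")
```

Longer names score lower simply because they contain more characters, so a per-character normalization may be a sensible refinement before applying any threshold.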
(3) To address the non-quantified challenge, this thesis takes user expertise estimation as an example of measuring user information. Specifically, it studies expertise estimation in question answering communities and proposes a method based on a competition model. The method converts expertise estimation into the problem of estimating players' skill levels from the results of a series of two-player competitions. It overcomes the inability of link-analysis-based methods to model heterogeneous information, such as asker-answerer relationships and answer quality, in a unified way. By modeling the difficulty of each competition, it also overcomes the limitation of answer-quality-based methods, which treat every question equally. Experimental results show that, compared with link-analysis-based and answer-quality-based methods, the proposed competition model significantly improves performance when estimating the expertise of active users.
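To make the competition view in item (3) concrete, here is a minimal sketch that turns question threads into two-player matches and rates users from the results. The abstract does not name the rating model or the pairing rule, so two assumptions are made: an Elo-style update is swapped in as a simple stand-in, and the best answerer is taken to beat both the asker and every other answerer. The thread data and all function names are hypothetical.

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    """Elo probability that a player rated r_a beats a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def threads_to_matches(threads):
    """Assumed pairing rule: best answerer beats the asker and other answerers."""
    matches = []
    for t in threads:
        best = t["best"]
        matches.append((best, t["asker"]))
        matches.extend((best, u) for u in t["answerers"] if u != best)
    return matches

def estimate_expertise(threads, k=32.0, initial=1500.0):
    ratings = defaultdict(lambda: initial)
    for winner, loser in threads_to_matches(threads):
        surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
        # Beating a stronger opponent is more surprising, so ratings move further.
        ratings[winner] += k * surprise
        ratings[loser] -= k * surprise
    return dict(ratings)

# Hypothetical threads, for illustration only.
threads = [
    {"asker": "alice", "answerers": ["bob", "carol"], "best": "bob"},
    {"asker": "carol", "answerers": ["bob", "dave"], "best": "bob"},
    {"asker": "dave", "answerers": ["alice"], "best": "alice"},
]
ratings = estimate_expertise(threads)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because each update depends on the opponent's current rating, a win against a strong user counts for more than a win against a weak one, which is one simple way to avoid treating every question equally.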
(4) From an application perspective, building on structured, quantified, and cross-community linked user information, this thesis studies difficulty estimation for crowdsourcing tasks, taking question difficulty estimation in question answering communities as an example. Using the measured user expertise, it proposes a user-competition-based model to estimate question difficulty. The expertise measurements guide the difficulty estimation and address the inability of earlier methods to handle observations that take the form of partial order relations. Experimental results verify the effectiveness of the proposed model. Finally, the thesis uses cross-community user linking information to study cross-community question difficulty estimation.
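The partial-order observations in item (4) can be sketched by treating each question as a pseudo-player on the same scale as users. The comparisons assumed here are hypothetical: a question "beats" its asker, and the best answerer "beats" the question. User expertise is held fixed and question difficulty is fitted by gradient ascent on a logistic (Bradley-Terry style) likelihood, which is a stand-in and not necessarily the model used in the thesis.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def estimate_difficulty(comparisons, expertise, steps=2000, lr=0.1):
    """Fit question scores from (winner, loser) pairs; user scores stay fixed."""
    questions = {x for pair in comparisons for x in pair if x not in expertise}
    difficulty = {q: 0.0 for q in questions}

    def score(x):
        return expertise[x] if x in expertise else difficulty[x]

    for _ in range(steps):
        grad = {q: 0.0 for q in questions}
        for winner, loser in comparisons:
            # Gradient of log sigmoid(s_w - s_l): +(1 - p) for winner, -(1 - p) for loser.
            residual = 1.0 - sigmoid(score(winner) - score(loser))
            if winner in grad:
                grad[winner] += residual
            if loser in grad:
                grad[loser] -= residual
        for q in questions:
            difficulty[q] += lr * grad[q]
    return difficulty

# Hypothetical expertise scores and comparisons, for illustration only.
expertise = {"novice": -1.0, "regular": 0.0, "expert": 2.0}
comparisons = [
    ("q_easy", "novice"), ("regular", "q_easy"),  # novice asked, regular answered
    ("q_hard", "regular"), ("expert", "q_hard"),  # regular asked, expert answered
]
difficulty = estimate_difficulty(comparisons, expertise)
```

With exactly one win and one loss per question, the likelihood is maximized at the midpoint between the asker's and the best answerer's expertise, so the fitted values settle at -0.5 for `q_easy` and 1.0 for `q_hard`. A question with only wins or only losses would need a prior or regularization, which this sketch omits.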
In summary, this thesis works to address the unstructured, cross-community, and non-quantified challenges in user information mining and, from an application perspective, applies structured, quantified, and cross-community linked user information to difficulty estimation for crowdsourcing tasks. The research has produced some preliminary results, which the author hopes will be useful to other researchers in the field. As user information mining techniques continue to improve, they are expected to bring greater benefit to social applications and to research related to social computing.
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: TP393.092



