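A Quantitative Study of Chinese Hot Words and Dialect Vocabulary in the Internet Environment
Published: 2018-07-09 09:59
Topics: query logs + pinyin input method; Source: Tsinghua University, doctoral dissertation, 2014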
[Abstract]: With advances in science and technology, in particular the continuing development of information technology and the spread of the Internet, the Chinese language has changed enormously. Vocabulary, the most active part of a language, has changed most visibly. Lexical change in the Internet environment shows up in two main ways: hot words and new words keep emerging, and dialect words are widely used online. Studying lexical change helps improve the performance of Chinese information processing; identifying hot words and dialect words helps supplement dictionaries and supports quantitative research on language. This dissertation studies lexical change using data from two of its main sources: search engine query logs and Chinese pinyin input methods. The work comprises:
(1) A method for identifying hot words and new words from search engine queries. Analysis of the temporal dynamics of hot queries shows that hot words follow characteristic temporal patterns. The main burst period of each hot word is detected, and an algorithm based on the frequency ratio within the burst period is designed to discover hot words automatically.
(2) Expansion of the hot-word set by jointly considering semantic similarity and time-series similarity. This mines low-frequency queries related to hot words and overcomes the difficulty of identifying low-frequency hot words and new words. By analyzing temporal patterns in query frequency series, the work focuses on identifying the predictable subset of hot words.
(3) A method for automatically identifying dialect words from the user records of a Chinese pinyin input method. Regional features of input-method entries are extracted from users' geographic information, and colloquial features are extracted by analyzing the categories of programs from which users invoke the input method. Combining the two, a ranking method based on feature combination is proposed to identify dialect words. Experiments show that combining colloquial and regional features greatly improves dialect-word identification performance.
(4) A quantitative study of dialect regions. Different word lists are designed from the vocabulary and frequency information in pinyin input method data; the correlations between the frequency-ranked sequences of each word list across regions are examined to compare regional dialects, and hierarchical clustering is used to partition dialect areas. Features such as the discriminability and coverage of entries between a dialect region and its neighboring regions are analyzed to compile characteristic dialect words for each region. Finally, the geographic distribution of dialect vocabulary is visualized to support research on lexical relationships among dialects.
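The region-comparison idea in contribution (4), correlating frequency rankings of a word list across regions and then clustering hierarchically, can be illustrated with a minimal sketch. The abstract does not say which correlation measure or linkage rule the dissertation uses, so this sketch assumes Spearman rank correlation, the distance 1 - rho, and average linkage. The region names and frequencies are made-up toy values, not data from the dissertation.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def cluster(freqs):
    """Average-linkage agglomerative clustering on distance 1 - rho.

    freqs maps a region name to the frequencies of the same word list.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    dist = {frozenset((a, b)): 1 - spearman(freqs[a], freqs[b])
            for a, b in combinations(list(freqs), 2)}

    def linkage(pair):
        a, b = pair
        total = sum(dist[frozenset((p, q))] for p in a for q in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in freqs]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy data: frequencies of the same five words in four regions.
freqs = {
    "region_A": [50, 40, 30, 20, 10],
    "region_B": [48, 29, 41, 22, 9],
    "region_C": [10, 20, 30, 40, 50],
    "region_D": [12, 33, 18, 39, 52],
}
merges = cluster(freqs)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

In this toy setup, regions whose word-frequency rankings agree merge early and at small distances, while groups with opposite rankings merge last. On real data, the choice of word list would matter, which is consistent with the abstract's mention of designing different word lists.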
[Degree-granting institution]: Tsinghua University
[Degree level]: Doctoral
[Year conferred]: 2014
[Classification number]: TP391.1