A Personalized User Dictionary Update Method Based on Web Information
Published: 2018-01-23 04:35
Keywords: web information extraction; new word discovery; new word classification; personalized loading; pinyin input method. Source: Harbin Institute of Technology, 2013 master's thesis. Thesis type: degree thesis
【Abstract】: Chinese character input is one of the most important problems in Chinese information processing and a key component of intelligent human-machine interfaces. Among character input methods, pinyin input best matches users' habits and has now entered its third generation of development, the cloud input method. Today's mainstream input methods emphasize personalization, reflected mainly in word-frequency adjustment and automatic lexicon expansion. Word-frequency adjustment means continually reordering dictionary entries according to segmentation statistics of the user's input, so that candidates are always ranked most sensibly. Automatic lexicon expansion means crawling unprecedentedly large training corpora (terabyte scale) from search engines or the web, so that words of every kind can be added to the dictionary without restriction. This thesis improves the input method chiefly through lexicon expansion. The most important aspect of lexicon expansion, and the core of this work, is new word discovery. To that end, the thesis carries out the following research:

(1) Extraction and processing of web information: a web crawler fetches Sina (sina.com.cn) pages and extracts their content. Because the raw content still contains a great deal of noise (advertisements, copyright notices, and the like), it must be purified: each page in the raw collection is parsed and filtered to extract the useful information, mark the important information, and discard low-value material such as ads and copyright lines. After purification, a raw page becomes one with clear structure, compact content, and unambiguous information.

(2) Design and implementation of new word extraction: new words are extracted from the purified pages with a statistical method based on ordinary repeated strings. The Chinese text is first segmented on punctuation and a stop-word list; the occurrences of every two-, three-, and four-character string are then counted, and strings whose counts exceed a preset threshold become candidate new words. A repeated-string search algorithm next removes redundant substrings, and word-formation rules remove garbage strings. Finally, the candidates are compared against the input method's existing lexicon to form a new-word lexicon.

(3) New word classification and personalized lexicon loading: inspection of the original pages shows that the title field also carries the category of the body text, so a matching method extracts the category and classifies the new words accordingly. Based on a user's habits, one or several categories of the new-word lexicon can be selectively loaded or removed, reflecting the user's individual preferences.

Finally, to obtain a true and objective evaluation of the system, precision, recall, and F-score are used to measure the performance of new word extraction, while character accuracy and line accuracy are used to compare the input method's performance before and after the new-word lexicon is added. The evaluation shows that new word extraction scores well on all metrics, and that adding the new-word lexicon further improves the performance of the input method.
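The extraction step in (2) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the stop-word list, the frequency threshold, and the substring-removal rule (drop a shorter string when a longer candidate with the same count covers it) are all simplifying assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"的", "了", "和", "是", "在"}   # hypothetical stop-word list
THRESHOLD = 3                                 # hypothetical frequency threshold

def extract_candidates(text, lexicon):
    """Count 2-4 character strings in purified text and return candidate new words."""
    # Split on anything that is not a CJK character (punctuation, digits, spaces).
    segments = re.split(r"[^\u4e00-\u9fff]+", text)
    # Further split each segment on stop words.
    pieces = []
    for seg in segments:
        buf = ""
        for ch in seg:
            if ch in STOP_WORDS:
                if buf:
                    pieces.append(buf)
                buf = ""
            else:
                buf += ch
        if buf:
            pieces.append(buf)
    # Count every 2-, 3-, and 4-character substring.
    counts = Counter()
    for p in pieces:
        for n in (2, 3, 4):
            for i in range(len(p) - n + 1):
                counts[p[i:i + n]] += 1
    # Keep strings over the threshold that the lexicon does not already contain.
    candidates = {w: c for w, c in counts.items()
                  if c >= THRESHOLD and w not in lexicon}
    # Remove a substring when a longer kept candidate with the same count covers it
    # (a simplified stand-in for the thesis's repeated-substring removal).
    kept = {}
    for w, c in sorted(candidates.items(), key=lambda x: -len(x[0])):
        if not any(w in longer and kept[longer] == c for longer in kept):
            kept[w] = c
    return kept
```

Given three sentences that each contain the unseen string 云输入法, the sketch surfaces it as a single four-character candidate while discarding its embedded substrings.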
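The evaluation metrics used above can be computed as follows. This is a sketch under common definitions of the metrics; the simplified character-accuracy function assumes the input method's output and the reference text have equal length, which is an illustrative assumption rather than the thesis's exact protocol.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall, and F-score for new word extraction."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                      # correctly extracted new words
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def char_accuracy(output, reference):
    """Fraction of positions where the input method's output matches the reference."""
    assert len(output) == len(reference)
    correct = sum(o == r for o, r in zip(output, reference))
    return correct / len(reference) if reference else 1.0
```

Line accuracy is analogous, counting whole lines that match the reference exactly instead of individual characters.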
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year awarded】: 2013
【Classification number】: TP391.14
Article ID: 1456724
Link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1456724.html