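Research and Design of a Mail Filter Based on Natural Language Understanding and Domain Ontology
Published: 2018-05-11 21:32
Topics: mail filtering + Chinese-language email; Source: Lanzhou University of Technology, master's thesis, 2007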
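[Abstract, translated from the Chinese]: Spam, also called "unsolicited commercial email", has caused heavy losses to production and business activity in China. Several mail filtering products have been released, but a comparison of their underlying principles shows that current filtering methods all lack semantic awareness to some degree; as spam continues to evolve, current filtering algorithms will struggle to cope. To address the lack of semantics in how existing spam filters process message content, this thesis introduces natural language understanding methods into mail classification, so that the filter can filter and classify incoming mail at the semantic level and reduce the user's manual workload. In addition, concept analysis theory is introduced into natural language understanding: because the theory does not depend on any specific language, it is used to handle the complex structure and heavy colloquialism of Chinese, and a method for analyzing mail content based on concept analysis is designed on this basis. Drawing on the domain-specific terminology of the advertising industry, a domain ontology for advertising is built to serve as the basis and knowledge base for concept analysis. The technical approach is as follows. First, the definitions of the Chinese language and of language instances are stored in the ontology library, which removes the need for a database layer and simplifies system construction; the ontology is defined in Extensible Markup Language (XML), which lays a foundation for later extension. Description logic supports the natural language understanding and reasoning based on concept analysis. Second, exploiting description logic's support for layered design, a hierarchical ontology of the mail domain based on concept analysis is designed. Finally, building on this research and these design ideas, a mail filter based on natural language understanding and domain ontology is designed, and syntactic and semantic analysis algorithms suited to real mail filtering environments are proposed. The algorithms are tested with advertising spam as test cases, the test data are reported, and the results are satisfactory.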
[Abstract]: Spam, also called "unsolicited commercial email", has caused huge losses to production and business activities in China. Although several mail filtering products have been introduced, a comparison of their principles shows that current mail filtering methods suffer, to a greater or lesser degree, from a lack of semantics; once spam develops beyond a certain point, current filtering algorithms will find it difficult to cope. To solve the semantic deficiency of existing spam filters when processing email content, this thesis proposes applying natural language understanding methods to mail classification, so that the filter can filter and classify received mail at the semantic level and thereby reduce the user's manual workload. In addition, conceptual analysis theory is introduced into natural language understanding; because the theory does not involve any specific language, it is used to address the complex structure and heavily colloquial style of Chinese. On this basis, a method of mail content analysis based on conceptual analysis is designed. Using the characteristics of domain terminology in the advertising industry, a domain ontology of the advertising field is constructed, which serves as the basis and knowledge base for conceptual analysis. The main technical route is as follows. First, the definitions of the Chinese language and the instances of the language are defined in the ontology library, which eliminates the database layer and facilitates construction of the system; the ontology is defined in Extensible Markup Language (XML), which lays the foundation for future expansion. Description logic is used to support natural language understanding and reasoning based on conceptual analysis. Second, a hierarchical mail domain ontology based on conceptual analysis is designed by exploiting description logic's support for hierarchical design. Finally, according to the above research and design ideas, a mail filter based on natural language understanding and domain ontology is designed, and a syntactic and semantic analysis algorithm that fits real mail filtering environments is proposed. The algorithm is tested with advertising spam as test cases, the corresponding test data are given, and satisfactory results are obtained.
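The abstract gives neither the ontology schema nor the analysis algorithm, so the following is only a minimal sketch of the general idea of an XML-defined, hierarchical advertising ontology used as a knowledge base for filtering. It assumes a made-up schema (`concept` and `term` elements), made-up concept names and terms, and a made-up threshold, and it substitutes plain substring lookup for the concept analysis and description-logic reasoning the thesis describes. The function names `load_terms`, `matched_concepts`, and `looks_like_ad` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy ontology. The schema, concept names, and terms are
# illustrative assumptions, not taken from the thesis.
ONTOLOGY_XML = """
<concept name="Advertisement">
  <concept name="Promotion">
    <term>促销</term>
    <term>折扣</term>
    <term>免费</term>
  </concept>
  <concept name="Ordering">
    <term>订购</term>
    <term>热线</term>
  </concept>
</concept>
"""


def load_terms(xml_text):
    """Map each term to its concept path, e.g. 'Advertisement/Promotion'."""
    term_to_path = {}

    def walk(node, path):
        path = path + [node.get("name")]
        for term in node.findall("term"):
            term_to_path[term.text.strip()] = "/".join(path)
        for child in node.findall("concept"):
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return term_to_path


def matched_concepts(text, term_to_path):
    """Return the set of concept paths that have at least one term in the text.

    Substring matching is used so that no Chinese word segmentation is needed.
    """
    return {path for term, path in term_to_path.items() if term in text}


def looks_like_ad(text, term_to_path, min_concepts=2):
    """Flag a message when terms from at least min_concepts concepts occur."""
    return len(matched_concepts(text, term_to_path)) >= min_concepts
```

Requiring hits from more than one concept, rather than counting raw keyword hits, is one simple way to use the hierarchy: a message that mentions both a promotion and an ordering channel is treated as more advertisement-like than one that repeats a single term.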
[Degree-granting institution]: Lanzhou University of Technology
[Degree level]: Master's
[Year degree conferred]: 2007
[Classification number]: TP391.1;TP393.098
Article ID: 1875713
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1875713.html