Research and Implementation of an Email Classification Method Based on the WordNet Concept Vector Space Model

Published: 2019-02-22 09:39

[Abstract]: As computer technology and informatization advance, and especially as the Internet becomes ubiquitous, email has become a widely used means of communication because it is fast and inexpensive. For the same reason, email often reflects current social hot topics and the focus of public opinion. However, as email is used more and more frequently, the flood of spam, advertisements, and mass mailings increases the time users spend processing mail and hinders the organization and retrieval of information. If email could be classified, people could obtain the content they care about accurately, comprehensively, and quickly, which would improve work efficiency and reduce the waste of labor, money, and material resources. Email classification has therefore attracted the interest of many researchers. Existing email classification techniques fall into three groups: statistical, connectionist, and rule-based. Common statistical methods include Naive Bayes, KNN, class-centroid vectors, regression models, support vector machines, and maximum entropy models. The common connectionist method is the artificial neural network. Common rule-based methods include decision trees and association rules. These methods share one problem: none of them considers the semantic relationships between words in the mail text. In real mail text, however, words are often related, for example as synonyms or through hypernym-hyponym relations between synonym sets. Ignoring these relationships tends to produce a high-dimensional vector space, and the high dimensionality in turn lowers classification performance and accuracy.
To address this problem, this thesis proposes a feature extraction method based on the WordNet ontology: terms are replaced by synonym sets (synsets), the hypernym-hyponym relations between synsets are taken into account, and a concept vector space model of the mail text is built to serve as its feature vector, so that high-level information that can characterize a category can be extracted during training. The thesis also designs a method for determining the threshold (the percentage threshold method), in which the threshold can be adjusted to meet different recall and precision requirements. Finally, the proposed method is implemented, and experiments demonstrate the effectiveness of the email classification method based on the WordNet concept vector space model. The proposed method improves on existing email classification methods, with gains in classification performance and efficiency. These results enable users to obtain useful information quickly and accurately, thereby greatly improving their work efficiency.
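The idea of replacing terms with synonym sets and crediting their hypernyms can be illustrated with a minimal sketch. It is not the thesis's implementation: the lexicon below is a hand-made stand-in for WordNet, and the identifiers (`SYN_CAR`, `SYN_VEHICLE`, and so on), the `decay` factor, and the `max_levels` limit are all assumptions chosen for illustration. The abstract does not say how hypernym weights are computed.

```python
from collections import Counter

# Toy stand-in for WordNet (hypothetical data, not the real database):
# each term maps to one synonym set, each set to at most one hypernym.
TERM_TO_SYNSET = {
    "car": "SYN_CAR",
    "automobile": "SYN_CAR",
    "truck": "SYN_TRUCK",
    "lorry": "SYN_TRUCK",
    "price": "SYN_PRICE",
    "cost": "SYN_PRICE",
}
HYPERNYM = {
    "SYN_CAR": "SYN_VEHICLE",
    "SYN_TRUCK": "SYN_VEHICLE",
}


def concept_vector(tokens, decay=0.5, max_levels=1):
    """Count synonym sets instead of terms, then pass a decayed share of
    each occurrence up the hypernym chain (decay is an assumed weighting)."""
    vec = Counter()
    for tok in tokens:
        synset = TERM_TO_SYNSET.get(tok.lower())
        if synset is None:
            continue  # terms outside the toy lexicon are dropped in this sketch
        vec[synset] += 1.0
        weight, node = 1.0, synset
        for _ in range(max_levels):
            node = HYPERNYM.get(node)
            if node is None:
                break
            weight *= decay
            vec[node] += weight
    return dict(vec)


mail = "The automobile price rose while the truck cost and car cost fell".split()
vector = concept_vector(mail)
```

In this toy mail, five distinct in-lexicon terms collapse into three synonym-set dimensions plus one shared hypernym dimension, which is the kind of dimensionality reduction and higher-level feature the abstract describes.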
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree awarded]: 2008
[Classification number]: TP393.098

[References]