Development of Machine Learning Algorithms for Comparing RAMS Standards
Published: 2022-09-30 16:18
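Language is the tool humans use to communicate. Although everyone is familiar with it, our knowledge and culture directly shape how we communicate with others, so different sentences can carry the same meaning. Natural language processing is the field devoted to studying the interaction between computers and language. Over the past few decades, as information tools have grown in importance, analyzing text fragments has become easier and faster, and the field has attracted increasing attention. More specifically, text comparison is a key task in many applications, such as machine translation, information retrieval, and question answering. The main difficulty of this task is ensuring that a computer program can process text fragments or large corpora effectively enough to truly understand the meaning of sentences. In this research, we focus on applying the paraphrase identification task (deciding whether a pair of sentences are paraphrases) to the comparison of RAMS standard documents. Our approach examines a large number of lexical, syntactic, and semantic features. We study the impact of these features on model performance, and in particular the effect of combining them to ensure a comprehensive understanding of the sentences. We then use these features to train two different types of model: a "majority wins" algorithm and several machine learning classifiers (linear and non-linear). We find that feature selection and feature combination are key steps for good performance on the paraphrase identification task. We further conclude that although the majority wins algorithm, which is based on experience and traditional methods, performs well, almost all of the machine learning classifiers outperform it. By tuning the support vector classifier, we can...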
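The "majority wins" idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the three features chosen here (a Jaccard-style bag-of-words overlap, a longest-common-subsequence ratio, and a word-error-rate similarity), the whitespace tokenizer, the normalizations, and the 0.5 threshold are all assumptions made for illustration. Each feature casts one vote, and a sentence pair is labeled a paraphrase when more than half of the votes agree.

```python
from typing import Callable, List


def tokens(sentence: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumption; the thesis may differ)."""
    return sentence.lower().split()


def bow_similarity(a: str, b: str) -> float:
    """Bag-of-words overlap, computed here as Jaccard similarity of word sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_length(x: List[str], y: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer sentence's token count."""
    ta, tb = tokens(a), tokens(b)
    longest = max(len(ta), len(tb))
    return lcs_length(ta, tb) / longest if longest else 1.0


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    r, h = tokens(reference), tokens(hypothesis)
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1] / max(len(r), 1)


def wer_similarity(a: str, b: str) -> float:
    """Turn the error rate into a similarity clipped to [0, 1]."""
    return max(0.0, 1.0 - word_error_rate(a, b))


# Hypothetical feature set; the thesis studies more features than these.
FEATURES: List[Callable[[str, str], float]] = [
    bow_similarity,
    lcs_similarity,
    wer_similarity,
]


def majority_wins(a: str, b: str, threshold: float = 0.5) -> bool:
    """Label the pair a paraphrase if most features reach the threshold."""
    votes = [feature(a, b) >= threshold for feature in FEATURES]
    return sum(votes) > len(votes) / 2


if __name__ == "__main__":
    print(majority_wins("the cat sat on the mat", "the cat sat on a mat"))
    print(majority_wins("the cat sat on the mat", "stock markets fell sharply today"))
```

In a classifier-based variant, the same feature values would be passed as a numeric vector to a trained model rather than thresholded and counted, which is the second family of models the abstract compares against.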
[Pages]: 106
[Degree level]: Master's
[Table of contents]:
Abstract (in Chinese)
Abstract
Chapter 1 Research Context
1.1 Background
1.1.1 The OBOR Project
1.1.2 RAMS Standards
1.1.3 China National Institute of Standards (CNIS)
1.2 Problem Definition
1.3 Purpose of Study
1.4 Proposed Solution
Chapter 2 Literature Review
2.1 Introduction
2.2 Text comparison using traditional methods
2.3 Text comparison using machine learning methods
2.3.1 Introduction to Machine Learning and Deep Learning
2.3.2 Use of Machine Learning Classifiers
2.3.3 Use of Deep Learning Neural Networks
Chapter 3 Theoretical Framework
3.1 Challenges to overcome
3.1.1 Major concepts for text comparisons
3.1.2 Major issues faced for text comparisons
3.2 Global Methodology
3.3 Determination of lexical features
3.3.1 Introduction to the role of lexical features
3.3.2 Bag of Words
3.3.3 String Matching
3.3.4 Longest Common Substring
3.3.5 Longest Common Subsequence
3.3.6 Word Error Rate
3.3.7 Position Independent Word Error Rate
3.4 Determination of syntactic and semantic features
3.4.1 Syntactic Features
3.4.2 Semantic Features
3.5 Different methodologies to pursue text comparison
3.6 Determination of performances
Chapter 4 Experiments & Results
4.1 Datasets
4.1.1 Twitter Paraphrase Corpus
4.1.2 PPDB: Paraphrase Database
4.1.3 Microsoft Research Paraphrase Corpus
4.2 Experiment 1: simple feature comparison
4.2.1 Bag of Words
4.2.2 String Matching
4.2.3 Longest Common Subsequence
4.2.4 Longest Common Substring
4.2.5 Word Error Rate
4.2.6 Position Independent Word Error Rate
4.2.7 Part of Speech Tagging
4.2.8 Wu Palmer Similarity
4.2.9 Conclusion on Simple Feature Comparison
4.3 Experiment 2: "majority wins" comparison
4.3.1 Correlation among features
4.3.2 Analysis of the influence of each lexical feature
4.3.3 Analysis of the influence of the syntactic and semantic features
4.3.4 Results for the "Majority Wins" algorithm
4.3.5 Conclusion
4.4 Experiment 3: machine learning classification comparison
4.4.1 First round of experiments
4.4.2 Feature Selection
4.4.3 Algorithm Tuning
4.4.4 Final Results
4.5 Summary of Findings
Chapter 5 Discussion of Findings
5.1 Comparison with Baseline results
5.2 Discussion & Future Work
List of Nomenclatures
References
Acknowledgements
Appendix
Appendix A: main code implemented
Resume and Academic Achievements
Article ID: 3683874
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/3683874.html