Research on Methods for Extracting Mathematical Formulas from the Web
Published: 2019-01-10 19:05
[Abstract]: With the development of information technology, Web support for mathematical communication has become increasingly mature and complete. Users who retrieve and manage mathematical formulas on the Web need the support of a mathematical formula search engine. Mathematical formula search is one of the research topics of third-generation intelligent search engines, and a crawler built around mathematical formulas is a critical component of it: the quality of the crawler directly affects the functionality and performance of the search engine. This thesis focuses on such a crawler, mainly the recognition and extraction of mathematical formulas on the Web and the design of the system. Research on mathematical formula recognition has made considerable progress, but its results cannot be applied to formula communication and search. This thesis therefore studies the recognition of user-programmable mathematical formulas, concentrating on formulas in Web documents described in XML, LaTeX, and infix formats, as well as formulas produced by Microsoft Office and OpenOffice. It summarizes and analyzes how formulas in these notations appear on the Web and what external pattern features they exhibit, and it recognizes and extracts them by pattern matching. On this basis, the mathematical crawler system MathCrawler is designed and implemented on top of the open-source software Nutch. MathCrawler has a sound system architecture; it can crawl Internet documents that contain mathematical formula content and extract the formulas from them. Experiments show that the system performs well and extracts mathematical formulas fairly accurately.
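The abstract does not list the patterns MathCrawler uses, so the following is only a minimal sketch of the general idea of recognizing formulas by their external pattern features. It assumes two such features: MathML `<math>` elements for the XML format, and the common `$...$`, `$$...$$`, `\(...\)`, and `\[...\]` delimiters for the LaTeX format. The regular expressions and the function name `extract_formulas` are hypothetical, and Python is used purely for illustration (MathCrawler itself is built on Nutch).

```python
import re
from typing import List, Tuple

# Assumed pattern feature for XML-format formulas: a MathML <math> element.
MATHML_RE = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL | re.IGNORECASE)

# Assumed pattern features for LaTeX-format formulas: common math delimiters.
LATEX_RE = re.compile(
    r"\$\$(.+?)\$\$"                          # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"                         # display math: \[ ... \]
    r"|\\\((.+?)\\\)"                         # inline math:  \( ... \)
    r"|(?<!\$)\$(?!\$)([^$\n]+?)\$(?!\$)",    # inline math:  $ ... $
    re.DOTALL,
)


def extract_formulas(html: str) -> List[Tuple[str, str]]:
    """Return (format, formula) pairs found in a page's source text.

    MathML matches are returned with their tags; LaTeX matches are
    returned without delimiters. MathML results come first, then LaTeX
    results in document order.
    """
    found = []
    for match in MATHML_RE.finditer(html):
        found.append(("mathml", match.group(0)))
    for match in LATEX_RE.finditer(html):
        # Exactly one alternative matched; take its captured body.
        body = next(g for g in match.groups() if g is not None)
        found.append(("latex", body.strip()))
    return found


if __name__ == "__main__":
    page = r"<p>Euler: $e^{i\pi}+1=0$ and <math><mi>x</mi></math></p>"
    for kind, formula in extract_formulas(page):
        print(kind, formula)
```

A sketch like this will misfire on currency amounts such as "$5 and $10" and does not cover infix notation or office-document formulas; a real crawler would need additional features and filtering beyond simple delimiters.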
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.09; TP391.3
Article ID: 2406685
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2406685.html