支持正则表达式的文本匹配优化算法

发布时间：2018-03-17 07:00

本文选题：正则表达式　切入点：过滤　出处：《东北大学》2012年硕士论文　论文类型：学位论文

【摘要】：正则表达式本身具备描述复杂查询的能力,能够通过特定的语法描述一类文本的共同特征。正则表达式因其强大的表达能力和简洁的语法,使得其计算机语言以及相关领域中的应用十分广泛。因此,支持高效的正则表达式的文本匹配技术也就显得尤为重要。目前,支持正则表达式的文本匹配的搜索引擎种类很多。但是,基本上所有的正则表达式匹配算法,都采用了自动机理论。也就是说,正则表达式通常先转换成确定(或非确定)有限状态自动机,然后利用自动机在文本中进行搜索匹配。由于需要在线地处理正则表达式并构建自动机,因此,基本上支持正则表达式的文本匹配算法都是在线的。按照匹配类型不同,支持正则表达式的文本匹配可以分为正则表达式的全局匹配与正则表达式的局部匹配。全局匹配是判断字符串是否属于正则表达式表达语言。而局部匹配在判断字符串中的任何子串是否属于正则表达式所表达的语言的同时,并同时需要返回子串的位置信息。根据匹配类型的不同,本文设计了正则表达式的局部匹配和全局匹配的算法。这两个算法都支持离线的处理,并设计了相应的索引结构。对于正则表达式的全局匹配,定义了前-后缀的概念,提出了基于前-后缀的过滤策略,并同时选择了适用于这种过滤策略的索引结构trie树对字符串集合构建了索引。进而,提出了对正则表达表达式进行化简方法。首先对查询的字符串进行化简,通过计算得到其过滤因子：前-后缀集合。然后对待检索的字符串集合进行过滤,得到候选字符串集合。最后采用传统的有限自动机进行验证候选集合。而对于正则表达式的局部匹配,设计了基于BWT的索引结构,定义了正则表达式中不同操作符的运算规则,将正则表达式的文本匹配转换成位置信息列表的处理操作。首先,根据将正则表达式片段在索引中其在文本中出现的位置信息的列表；然后,根据定义的正则表达式中操作符的运算规则对位置信息列表进行相应的操作,得到的位置信息的列表也就是正则表达式在文本中进行局部匹配成功的字符串的位置。最后,进行了大量的实验测试,结果表明本文提出的两种支持正则表达式的文本匹配算法具备较高的查询效率。
[Abstract]:Regular expressions are capable of describing complex queries and can describe common features of a class of text through specific syntax. Because of its wide application in computer language and related fields, it is very important to support efficient text matching technology for regular expressions. At present, there are many kinds of search engines that support regular expression text matching. However, almost all regular expression matching algorithms adopt automata theory. Regular expressions are usually converted to deterministic (or non-deterministic) finite-state automata, then used to search for matching in text. Because regular expressions need to be processed online and automata are constructed, Basically, text matching algorithms that support regular expressions are online, depending on the type of match, Text matching that supports regular expressions can be divided into global matching of regular expressions and local matching of regular expressions. Global matching is to determine whether a string belongs to a regular expression expression language, and local matching is used to judge whether a string belongs to a regular expression language. While any substrings in a string belong to the language of a regular expression, It also needs to return the position information of the substring. According to the different matching types, the local matching and global matching algorithms of regular expressions are designed in this paper. Both of these algorithms support off-line processing, and the corresponding index structure is designed. In this paper, the concept of front suffix is defined, and a filtering strategy based on front suffix is proposed. At the same time, trie tree, an index structure suitable for this filtering strategy, is selected to index the collection of strings. In this paper, a method of simplifying regular expression expression is proposed. Firstly, the query string is simplified, and its filter factor is obtained by calculating the pre-suffix set. Then, the search string set is filtered. The candidate string set is obtained. Finally, the candidate set is verified by the traditional finite automata. For the local matching of regular expressions, the index structure based on BWT is designed, and the operation rules of different operators in regular expressions are defined. Converts the text match of a regular expression into a list of location information. First, based on the list of location information that appears in the text of the regular expression fragment in the index; then, According to the operation rules of the operators in the defined regular expression, the list of position information is the position of the string that the regular expression matches successfully in the text. Finally, A large number of experiments have been carried out and the results show that the two text matching algorithms which support regular expressions have high query efficiency.
【学位授予单位】：东北大学
【学位级别】：硕士
【学位授予年份】：2012
【分类号】：TP391.1

【参考文献】