带有通配符和长度约束的模式匹配问题求解及其应用研究

发布时间:2018-08-26 07:25
【摘要】:带有通配符模式匹配是模式识别领域中重要的研究方向之一,在计算生物学、信息检索、网络安全等研究领域中都得到了广泛关注。它是通过在模式识别问题中引入通配符这种特殊字符,来匹配字母表中的任意字符,这样带来更多灵活性的匹配结果,但也使得问题求解变得更为复杂。当频繁共现在一段文本区域内的多个模式之间表现为某种模式形式。例如在DNA序列中,启动子TATTA序列常常出现在CAATCT序列下游中间间隔30-50个通配符,其中也不是简单的重复。由这两个子序列共同组成的模式可提高序列特异性,可以标记“CAATCT[30-50]TATA...”。将其特点推广为子模式序列集间隔,其中两个相邻子模式的间隔在一定长度范围内,为表示这种灵活的位置间隔,将通配符从指代单个字符扩展为指代一定长度的子串,此通配符称作限长空位(Bounded length gaps)。同时通过引入one-off约束限制条件来保证匹配子串模式序列集的稳定性。研究带有通配符可变长度约束的模式匹配问题。本文围绕带通配符和长度约束的模式匹配问题中解结构的复杂性特征,从设计求解算法以及相似性度量模型应用等问题,展开一些研究工作,主要内容概括为以下三个方面:(1)结合带通配符和长度约束的精确模式匹配问题中求解的复杂性、精确性与完备性等特征,针对目前已有研究成果中还缺少针对性的模型建立求解策略。为此,借鉴约束可满足问题框架(CSPs),首先对带通配符和长度约束的精确模式匹配问题构建三元组求解模型。模型对该问题的约束条件和解空间等基本概念给出形式化的描述,并将该问题已知的8条特殊情况统一表述为问题的基本性质,其中包括在特殊条件下的完备性和相邻匹配解在文本中的位置关系。同时提出一种求解带通配符的模式串精确匹配问题的FIN算法。首先在定义了解空间的划分边界中,提出了采用分治策略的解空间划分算法,将带通配符的模式串精确匹配问题等价划分为若干独立的子问题,并从理论上说明了划分前后解结构等价性。实验结果表明FIN算法与解决相同问题的算法(PAIG)对比,该算法不仅可以得到匹配数目,而且可以得到完备的匹配解位置。(2)针对处理带通配符的近似模式匹配问题中,使用传统算法求解存在匹配子串结果质量不高、易丢解等问题,提出一种启发式的算法W-DPBI。该算法采取文本倒置搜索策略与流程的优化。与同类DP和SAIL-APPROX算法进行实验对比,结果表明该算法获取解的平均增长率可达21.9%,最高可达57%,匹配结果具有良好的优势,在特定情况下可以明显提高求解近似匹配结果的质量和能力,具有较好的运用灵活性和启发性。(3)结合模式匹配及其相关求解算法在计算生物学领域中应用以及相关实验方法研究的成果,针对药物基因和疾病基因序列中的相似度结构等特征,对所收集的数据信息源,采用近似匹配协同过滤算法并与相关算法组合求解搜索的策略,着重从已知疾病信息与基因信息角度来计算药物与疾病之间的相似度,应用于药物重定位与构建模型研究。实验结果表明,该方法能够明显提高潜在治疗关系的药物-疾病的富集程度。相比于已有分类模型和随机抽样结果,可以有效降低了预测的假阳性率,其模型参数可作为药物研发试验的参照。
[Abstract]:Pattern matching with wildcards is one of the most important research directions in pattern recognition. It has attracted wide attention in computational biology, information retrieval, network security and other fields. It matches any character in the alphabet by introducing a special character, wildcards, into pattern recognition, which brings more flexibility. For example, in DNA sequences, the promoter TATTA sequence often appears in the downstream middle of the CAATCT sequence with 30-50 wildcards, which are not simply duplicated. A pattern composed of subsequences can improve sequence specificity by marking "CAATCT [30-50] TATA...". A substring is called a bounded length gaps. At the same time, the stability of the sequence of matching substring patterns is guaranteed by introducing one-off constraints. The problem of pattern matching with variable length constraints of wildcards is studied. The main contents are summarized as follows: (1) Considering the complexity, accuracy and completeness of solving exact pattern matching problems with wildcards and length constraints, there is still a lack of needles in the existing research results. In this paper, a three-tuple solution model for exact pattern matching problem with wildcard and length constraints is constructed by using the Constraint Satisfaction Problem Framework (CSPs). The model formally describes the basic concepts of constraint conditions and solution space of the problem, and eight special cases of the problem are given. The basic properties of the problem are formulated in a unified way, including the completeness under special conditions and the location relationship between adjacent matching solutions in the text. At the same time, a FIN algorithm for exact matching of pattern strings with wildcards is proposed. The algorithm divides the exact matching problem of pattern strings with wildcards into several independent sub-problems and theoretically illustrates the structural equivalence of the solution before and after partitioning.The experimental results show that the FIN algorithm can not only obtain the number of matches, but also obtain the complete matching solution position. To solve the problem of approximate pattern matching with wildcards, a heuristic algorithm W-DPBI is proposed to solve the problem of low quality matching substrings and easy to be lost. The algorithm adopts the strategy of text inversion search and the optimization of process. Compared with similar DP and SAIL-APPROX algorithms, the results show that the algorithm is effective. The average growth rate of the solution obtained by the method is 21.9% and the maximum is 57%. The matching results have good advantages, which can obviously improve the quality and ability of solving approximate matching results under certain conditions, and have good flexibility and inspiration in application. (3) Combining pattern matching and related algorithms in computational biology applications and According to the similarity structure of drug gene and disease gene sequence, the strategy of approximate matching collaborative filtering algorithm combined with related algorithm is adopted to search the collected data information source. The emphasis is on calculating the relationship between drug and disease from the perspective of known disease information and gene information. Similarity is applied to drug relocation and modeling. Experimental results show that this method can significantly improve the drug-disease enrichment of potential therapeutic relationships. Compared with existing classification models and random sampling results, it can effectively reduce the predicted false positive rate, and its model parameters can be used as a reference for drug development trials.
【学位授予单位】:合肥工业大学
【学位级别】:博士
【学位授予年份】:2016
【分类号】:TP391.4

【参考文献】

相关期刊论文 前10条

1 黄海宁;张浩;汪海;;沙利度胺抗肿瘤机制及其作用靶点CRBN的研究进展[J];中国药理学通报;2015年06期

2 张浩;叶明全;;求解PMWOC问题的位并行算法[J];计算机应用研究;2015年10期

3 强继朋;谢飞;高隽;胡学钢;吴信东;;带任意长度通配符的模式匹配[J];自动化学报;2014年11期

4 项泰宁;郭丹;王海平;胡学钢;;带通配符的模式匹配问题及其解空间特征分析[J];计算机科学;2014年09期

5 吴信东;强继朋;谢飞;;Pattern Matching with Flexible Wildcards[J];Journal of Computer Science & Technology;2014年05期

6 王可鉴;石乐明;贺林;张永祥;杨仑;;中国药物研发的新机遇:基于医药大数据的系统性药物重定位[J];科学通报;2014年18期

7 沈璐;纪允;纪冬宝;李萍;;带可变长度通配符的模式匹配算法[J];计算机工程与应用;2015年15期

8 吴信东;谢飞;黄咏明;胡学钢;高隽;;带通配符和One-Off条件的序列模式挖掘[J];软件学报;2013年08期

9 王宝勋;刘秉权;孙承杰;王晓龙;孙林;;基于论坛话题段落划分的答案识别[J];自动化学报;2013年01期

10 张永祥;程肖蕊;周文霞;;药物重定位——网络药理学的重要应用领域[J];中国药理学与毒理学杂志;2012年06期

相关博士学位论文 前3条

1 刘应玲;带可变长度通配符的模式匹配算法研究[D];合肥工业大学;2014年

2 赵华;多模型下的近似字符串匹配算法研究[D];华中科技大学;2013年

3 孙德才;基于q-gram过滤的近似串匹配技术研究[D];湖南大学;2012年



本文编号:2204145

资料下载
论文发表

本文链接:https://www.wllwen.com/shoufeilunwen/xxkjbs/2204145.html


Copyright(c)文论论文网All Rights Reserved | 网站地图 |

版权申明:资料由用户db628***提供,本站仅收录摘要或目录,作者需要删除请E-mail邮箱bigeng88@qq.com