Research on Fine-Grained Classification and Serial Case Analysis Techniques for Criminal Cases

Published: 2018-09-08 17:00

[Abstract]: With the rapid development of information technology, intelligence information systems in the public security field face massive volumes of data, most of it text. Traditional manual processing can no longer meet operational needs, so more automated and intelligent text mining techniques are required to improve case-handling efficiency. Working with criminal case texts, this thesis focuses on two problems of broad concern to criminal investigators: fine-grained case classification and serial case analysis (linking related cases).

First, a two-level classification method, TLC-NBK, based on naive Bayes and a keyword co-occurrence graph is proposed. Case texts are short, have low term frequencies, and fall into categories that are hierarchical and imbalanced. To account for this, the method adds part-of-speech features to the document frequency (DF) method and proposes a two-factor evaluation algorithm for feature selection. It then performs naive Bayes classification with a multivariate Bernoulli model adapted to imbalanced categories, which assigns first-level case categories quickly and accurately. Building on the first-level classifier, keyword co-occurrence vectors are constructed for the second-level categories under each first-level category, with the document set as the basic unit. Weights are computed from the co-occurrence relations between keywords rather than from term frequency, and an inverse category frequency factor is proposed to adjust the co-occurrence weights. A simple vector distance algorithm then performs the fine-grained second-level classification. In addition, a synonym network is used to remove the interference of domain synonyms with the classification results.

Second, a density clustering method based on case features is proposed for the serial analysis of related cases. The method first combines rules and dictionaries to extract structured case features from unstructured case descriptions. It then defines a formula for the feature similarity between case texts that jointly considers the fine-grained case category, the time of the offence, and the location of the offence, with the weight of each dimension determined by the analytic hierarchy process (AHP). Finally, drawing on the classical density clustering algorithm OPTICS, a feature density clustering algorithm, OPTICS-FD, is proposed. It identifies dense clusters of serial cases and so assists criminal investigators in solving them.

Finally, experiments test the two-factor evaluation algorithm, the two-level classifier, case feature extraction, and serial case clustering. The results show that, in criminal case text mining, TLC-NBK improves accuracy and recall by 7.53% and 12.99%, respectively, over traditional methods, and that OPTICS-FD reaches a reduction rate of 66.52% and a recall of 91.25%, giving investigators better support for decision-making.
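The weighted feature similarity described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual formula, so everything below is an assumption made for illustration: the pairwise judgement matrix `PAIRWISE`, the exponential decay used for time and location, the scale parameters, and all function names are hypothetical. The AHP weights are approximated with the row geometric-mean method rather than the principal eigenvector.

```python
import math
from datetime import datetime


def ahp_weights(pairwise):
    """Approximate AHP priority weights using the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical pairwise judgements for (category, time, location):
# category is judged twice as important as each of the other two.
PAIRWISE = [
    [1.0, 2.0, 2.0],
    [0.5, 1.0, 1.0],
    [0.5, 1.0, 1.0],
]


def category_sim(a, b):
    """1.0 when two cases share the same fine-grained category, else 0.0."""
    return 1.0 if a == b else 0.0


def time_sim(t1, t2, scale_days=30.0):
    """Similarity that decays exponentially with the gap in days (assumed form)."""
    gap_days = abs((t1 - t2).total_seconds()) / 86400.0
    return math.exp(-gap_days / scale_days)


def location_sim(p1, p2, scale_km=5.0):
    """Similarity that decays with planar distance in km (assumed form)."""
    return math.exp(-math.dist(p1, p2) / scale_km)


def case_similarity(c1, c2, weights):
    """Weighted sum of the three per-dimension similarities."""
    sims = [
        category_sim(c1["category"], c2["category"]),
        time_sim(c1["time"], c2["time"]),
        location_sim(c1["xy_km"], c2["xy_km"]),
    ]
    return sum(w * s for w, s in zip(weights, sims))


WEIGHTS = ahp_weights(PAIRWISE)

case_a = {"category": "burglary", "time": datetime(2016, 3, 1), "xy_km": (0.0, 0.0)}
case_b = {"category": "burglary", "time": datetime(2016, 3, 4), "xy_km": (1.0, 1.0)}
case_c = {"category": "fraud", "time": datetime(2016, 3, 1), "xy_km": (0.0, 0.0)}

print(WEIGHTS)
print(case_similarity(case_a, case_b, WEIGHTS))
print(case_similarity(case_a, case_c, WEIGHTS))
```

A pairwise similarity of this kind could then be converted into a distance (for example, one minus the similarity) and passed to a density-based clustering routine. How OPTICS-FD actually does this is not specified in the abstract.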
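[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1;D918.2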

[References]