Research on Chinese Web Page Encoding Identification Based on Character Frequency Distribution

Published: 2018-10-29 14:23

[Abstract]: With the rapid development of computer technology, the Internet has become closely bound up with daily life and is now an important channel for sharing information. However, the emergence of large numbers of harmful web pages has made the network security situation increasingly severe and has drawn wide attention. Web content filtering is an important research area in network security, and encoding identification is a necessary prerequisite for it. For historical and regional reasons there are many Chinese encoding standards, and their coexistence complicates content filtering of Chinese web pages. How to identify the encoding of a web page quickly and accurately has therefore become an active research topic. This thesis describes the characteristics of Chinese encodings such as GB (Guobiao), Big5, and Unicode, and studies encoding identification algorithms including Bayesian classification, Unigram, and CodeFinder. These algorithms cannot exclude the interference of ASCII codes in web pages, which lowers both their recognition accuracy and their time efficiency. To address this shortcoming, the thesis proposes FKI, a Chinese web page encoding identification algorithm based on character frequency distribution. Using the frequency distribution of Chinese characters, FKI selects the most frequently used characters to build a high-frequency character table, takes the encoded forms of those characters as keywords, and searches for them in the page to be identified, thereby skipping over noise such as ASCII codes. By comparing how many keywords of each encoding scheme match in the page, it determines the page's actual encoding. Because the keywords are high-frequency characters that occur very often in Chinese web pages, the algorithm is applicable to encoding identification for almost all Chinese web pages.
The AC algorithm is modified to suit the matching of high-frequency Chinese character encodings in web pages. The improved AC algorithm builds a reverse state automaton and searches for keywords byte by byte. When a byte mismatch occurs, the byte corresponding to state "0" is used as the mismatch byte to compute the jump distance, which increases the jump distance on a mismatch and thus improves the matching efficiency for Chinese encodings. Finally, FKI is compared experimentally with the Unigram and CodeFinder algorithms. The results show that, compared with these two algorithms, FKI achieves higher encoding identification accuracy and better time efficiency, making it suitable for fast and accurate encoding identification of Chinese web pages whose encoding is unknown.
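The counting idea behind FKI can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the character lists, the candidate encodings, and the function names (`build_keyword_tables`, `count_matches`, `identify_encoding`) are all hypothetical, and plain `bytes.count` stands in for the improved AC matcher, whose details the abstract does not give.

```python
# Minimal sketch of frequency-keyword encoding identification.
# ASSUMPTION: these short character lists are illustrative only; the actual
# high-frequency character table used in the thesis is not given in the abstract.
SIMPLIFIED = "的一是不了在人有我中国大"
TRADITIONAL = "的一是不了在人有我中國大"

# ASSUMPTION: which characters are tried under which candidate encoding.
CANDIDATES = {
    "gb2312": SIMPLIFIED,
    "big5": TRADITIONAL,
    "utf-8": SIMPLIFIED + TRADITIONAL,
}


def build_keyword_tables(candidates=CANDIDATES):
    """Encode each high-frequency character under each candidate encoding."""
    tables = {}
    for encoding, chars in candidates.items():
        keywords = set()
        for ch in chars:
            try:
                keywords.add(ch.encode(encoding))
            except UnicodeEncodeError:
                # Character does not exist in this encoding; skip it.
                continue
        tables[encoding] = keywords
    return tables


def count_matches(data, keywords):
    """Count occurrences of every keyword byte sequence in the raw page bytes."""
    return sum(data.count(keyword) for keyword in keywords)


def identify_encoding(data, tables=None):
    """Return (best_encoding_or_None, scores) for raw page bytes."""
    if tables is None:
        tables = build_keyword_tables()
    scores = {enc: count_matches(data, kws) for enc, kws in tables.items()}
    best = max(scores, key=scores.get)
    # No keyword matched at all: report that the encoding is undetermined.
    return (best if scores[best] > 0 else None), scores


if __name__ == "__main__":
    page = "<html><body>我是中国人</body></html>".encode("gb2312")
    guess, scores = identify_encoding(page)
    print(guess, scores)
```

Because every keyword is a multi-byte sequence whose bytes lie outside the ASCII range, ASCII markup in the page never contributes to any score, which mirrors the noise-skipping property described above. A real implementation would replace the repeated `bytes.count` scans with a single multi-pattern pass, which is the role the improved AC algorithm plays in the thesis.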
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1;TP393.092

[Similar literature]