Web Page Main-Text Extraction and Bilingual Website Detection Based on Multi-Feature Fusion

Published: 2019-04-10 12:57

[Abstract]: With the rapid development of the Internet, the volume of online information is growing exponentially, while its quality is highly uneven. Obtaining information accurately, quickly, and comprehensively is therefore becoming harder, strong information extraction capability is attracting growing attention, and the sheer accumulation of information brings both new opportunities and new challenges for information extraction technology. Meanwhile, with rapid progress in natural language processing, machine translation has become increasingly practical in everyday life: products such as Youdao Translation, Google Translate, and Baidu Translate are now important tools for non-professionals who study or work with foreign-language material. Bilingual corpora are the foundation of machine translation and the key data for training, testing, and analyzing machine translation models. Their quantity and quality directly determine how well the model parameters are trained and, to a large extent, affect the performance of the resulting machine translation products. Building a large, high-quality bilingual corpus therefore has great practical value and academic significance for machine translation and natural language processing. This thesis designs and implements an efficient, high-performance bilingual text extraction system (a subsystem of an Internet bilingual corpus harvesting system; it does not include the crawler or sentence alignment).
The research covers two topics: web page main-text extraction and bilingual web page detection. For main-text extraction, the thesis applies multi-feature fusion. Unlike traditional approaches that build a DOM tree, it processes each page with a linearized reconstruction based on container tags, so that algorithms which would otherwise require tree operations reduce to processing a linear list. Multiple features, including length, word segmentation results, and sentence count, are combined to identify the outline of the main text, and the main text is then obtained by clustering based on information gain. For bilingual web page detection, the thesis judges whether the bilingual text obtained from the main text is mutually translated by computing a mutual translation rate based on local sentence anchor search. On top of this, to improve accuracy, it adds an auxiliary main-text judgment algorithm based on features such as named-entity overlap and pronoun ratio, and an algorithm that automatically generates templates from a large number of pages of the same website. The resulting main-text extraction and bilingual web page detection system reaches the top level in its field at present. Together with the downstream processing system, it produced a 30-million-scale Chinese-English bilingual corpus, which passed rigorous testing by the Software Evaluation Center of the Heilongjiang Electronic Information Products Supervision and Inspection Institute with an accuracy above 95%. The experimental results also confirm the effectiveness of the proposed multi-feature fusion method for bilingual corpus mining.
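The idea of replacing DOM-tree processing with a linear list cut at container tags can be illustrated with a small sketch. This is not the thesis's implementation: the set of container tags, the feature formula in `score`, and the `ratio` threshold are all assumptions made for illustration, and a simple score threshold stands in for the clustering based on information gain that the abstract describes. Word segmentation features are omitted.

```python
from html.parser import HTMLParser
import re

# Assumption: the thesis does not list its container tags; these are illustrative.
CONTAINER_TAGS = {"div", "p", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}
SENTENCE_END = re.compile(r"[.!?。！？]")


class Linearizer(HTMLParser):
    """Flatten a page into a flat list of text blocks, cutting at container tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: {"text": str, "link_chars": int}
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters seen inside <a> within this block
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, normalizing whitespace; empty blocks are dropped.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in CONTAINER_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in CONTAINER_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def linearize(html):
    """Return the page as a linear list of blocks (no tree is built)."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()   # deliver any text still buffered by the parser
    parser.flush()   # close the last open block
    return parser.blocks


def score(block):
    """Hypothetical fusion of three features: length, sentence count, link density."""
    text = block["text"]
    sentences = len(SENTENCE_END.findall(text))
    link_density = block["link_chars"] / max(1, len(text))
    return len(text) * (1 + sentences) * (1 - min(1.0, link_density))


def extract_body(html, ratio=0.3):
    """Keep blocks scoring at least `ratio` of the best block (threshold, not clustering)."""
    blocks = linearize(html)
    if not blocks:
        return ""
    scores = [score(b) for b in blocks]
    best = max(scores)
    kept = [b["text"] for b, s in zip(blocks, scores) if s > 0 and s >= ratio * best]
    return "\n".join(kept)
```

Calling `extract_body(page_html)` on a page whose navigation bar and footer consist mostly of links should return only the sentence-bearing blocks, since link-heavy blocks score near zero. A real system would replace the fixed threshold with a grouping step over neighbouring blocks.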
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092
