Research on Noise Removal Techniques for Tibetan Web Pages

Published: 2018-12-07 17:24

[Abstract]: With the rapid development of network information technology and the steady improvement of computer applications in Tibetan areas, more and more Tibetan web pages are appearing on the Internet. They help readers learn about the cultural life and folk customs of the Tibetan people, strengthen exchange, and promote the development of Tibetan areas. However, the useful information on a Tibetan web page is often surrounded by noise such as pop-up advertisements, superfluous images, and irrelevant links. This noise seriously reduces the efficiency of extracting useful information from Tibetan web pages, and removing it effectively has become an urgent problem in Tibetan information processing. This thesis surveys a large number of existing web page noise removal techniques and the content types of Tibetan web pages, studies the characteristics of DOM technology and its main operating specifications, and on that basis proposes a noise removal technique for Tibetan web pages that combines the DOM with display attributes. By analyzing how people implicitly behave when reading and browsing web content, the technique derives the features by which page elements group into blocks according to their display attributes. It adopts a display-attribute block model and demonstrates the model on an example page. A Tibetan web page is parsed into a DOM tree, its content is analyzed using the display attributes together with the block model, and the page is then denoised through a sequence of steps: display block partitioning, merging and deletion of DOM nodes, and simplification of the DOM tree. The core step of the technique is extracting the display attributes of the DOM tree nodes, so DOM parsing of Tibetan web pages must be implemented first.
Building on an in-depth study of web page parsing techniques, the thesis develops a DOM parser for Tibetan web pages in Java on the Eclipse platform. The parser turns a Tibetan HTML page into a tree of DOM nodes, each of which carries the complete tag attributes from the HTML document, so that the display attributes of any information block on the page can be extracted on demand. The parser also provides basic browser functionality: it can parse a Tibetan web page directly from an entered URL or from page source downloaded to the local computer. It has strong tag recognition and repair capabilities and is suitable for most Tibetan web pages. In addition, by analyzing the characteristics of Tibetan web page content, the thesis proposes a method that identifies noise blocks from the frequency of Tibetan syllable delimiters (tsheg) in the text and the hyperlink ratio of the page. The method effectively identifies the noise blocks contained in most Tibetan web pages. Finally, DOM node filtering is applied to the retained useful blocks to complete the noise removal. Extensive testing shows that the technique removes most of the noise from Tibetan web pages and has good practical value and application prospects.
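
The noise block test described in the abstract can be illustrated with a minimal Java sketch. The abstract does not give the exact formulas or thresholds, so everything here is an assumption made for illustration: the class name `TibetanNoiseScore`, the definition of syllable delimiter frequency as the share of tsheg characters (U+0F0B) in the block text, the definition of hyperlink ratio as the share of block text that lies inside links, and the threshold values used in `main`.

```java
import java.util.List;

// Illustrative sketch only: names, formulas, and thresholds are assumptions,
// not the thesis's actual implementation.
final class TibetanNoiseScore {
    // Tibetan syllable delimiter (tsheg), U+0F0B.
    static final char TSHEG = '\u0F0B';

    // Share of characters in the block that are tsheg marks.
    static double tshegDensity(String blockText) {
        if (blockText.isEmpty()) {
            return 0.0;
        }
        long count = blockText.chars().filter(c -> c == TSHEG).count();
        return (double) count / blockText.length();
    }

    // Share of the block's text that sits inside hyperlinks, capped at 1.0.
    static double linkRatio(String blockText, List<String> anchorTexts) {
        if (blockText.isEmpty()) {
            return 0.0;
        }
        int linked = anchorTexts.stream().mapToInt(String::length).sum();
        return Math.min(1.0, (double) linked / blockText.length());
    }

    // A block is treated as noise if it has too few tsheg marks
    // or too much of its text inside links.
    static boolean isNoise(String blockText, List<String> anchorTexts,
                           double minTshegDensity, double maxLinkRatio) {
        return tshegDensity(blockText) < minTshegDensity
                || linkRatio(blockText, anchorTexts) > maxLinkRatio;
    }

    public static void main(String[] args) {
        // Two Tibetan letters, each followed by a tsheg, with no links.
        String body = "\u0F40\u0F0B\u0F41\u0F0B";
        // A navigation-style block made up mostly of link text.
        String nav = "Home | About";
        System.out.println(isNoise(body, List.of(), 0.1, 0.5));
        System.out.println(isNoise(nav, List.of("Home", "About"), 0.1, 0.5));
    }
}
```

In a full pipeline these scores would be computed per display block after the DOM tree has been partitioned, and only blocks that pass the test would proceed to DOM node filtering. Suitable thresholds would have to be tuned on real Tibetan pages.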
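[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year degree conferred]: 2010
[Classification number]: TP393.092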
