Research on Topic-Specific Collection Methods for Tibetan Web Pages

Published: 2018-04-04 06:17

Topic: Web retrieval　Focus: Tibetan web page collection　Source: Chang'an University, master's thesis, 2012

[Abstract]: Compared with Chinese, Tibetan information processing technology has developed slowly, and search engines that support Tibetan are lacking. Tibetan content on the Internet is therefore often left in an "isolated state", which makes it hard for users to find and retrieve. A method for collecting Tibetan information from the web is thus particularly important for researchers working with Tibetan. Building on an analysis of the web page collection workflow, the basic working principles of web crawlers, and topic-focused page collection, this thesis studies collection methods for Tibetan web pages in depth:
1. It compares and analyzes the features that distinguish Tibetan web pages from other pages, such as fonts, Tibetan syllable points (tsheg), and high-frequency Tibetan words, and designs algorithms for identifying Tibetan pages.
2. It examines the key techniques of a Tibetan topic crawler, including Tibetan word segmentation, topic judgment methods, and crawl strategies, and proposes a Tibetan topic judgment method based on "guide words".
3. It studies the Heritrix software and improves and extends its key modules, Extractor and Frontierscheduler, to crawl Tibetan topic-specific websites with the "guide word" algorithm. It also extends the Queue-assignment-policy module with a hash algorithm, which greatly improves the crawler's collection efficiency.
4. It uses the HTMLParse software to extract the collected news content and stores each item's title, publication time, source, and body text in a database.
5. It normalizes the encoding of the collected Tibetan page text, converting it to the international standard Unicode encoding.
Using these results, and taking page precision and recall as reference metrics, several thresholds of the "guide word" topic judgment algorithm were tested. Based on the test results, pages were crawled from 中国西藏网 (China Tibet Net), with a crawl accuracy of about 62%. The test data indicate that the results are effective for topic-specific collection of Tibetan information and have considerable practical and theoretical reference value.
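The abstract does not give the page-identification algorithm or the evaluation formulas, so the following Python sketch is only an illustration of the general idea, not the thesis's method. It assumes the page text has already been decoded to Unicode, and it judges a page by two signals mentioned in the abstract: the share of characters in the Unicode Tibetan block (U+0F00 to U+0FFF) and the number of syllable points (tsheg, U+0F0B). The function names and the thresholds `min_ratio` and `min_tsheg` are hypothetical. The second function computes precision and recall in the standard way over sets of page identifiers.

```python
# Illustrative sketch only; names and thresholds are assumptions, not from the thesis.

TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF  # Unicode Tibetan block
TSHEG = "\u0f0b"  # Tibetan syllable point (intersyllabic mark)


def tibetan_stats(text):
    """Return (share of non-space characters in the Tibetan block, tsheg count)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0, 0
    tibetan = sum(1 for ch in chars if TIBETAN_START <= ord(ch) <= TIBETAN_END)
    return tibetan / len(chars), text.count(TSHEG)


def looks_tibetan(text, min_ratio=0.5, min_tsheg=3):
    """Heuristic check: enough Tibetan characters and enough syllable points."""
    ratio, tsheg_count = tibetan_stats(text)
    return ratio >= min_ratio and tsheg_count >= min_tsheg


def precision_recall(retrieved, relevant):
    """Standard precision and recall over two collections of page identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    sample = "བོད་ཡིག་གི་དྲ་ངོས།"  # a short Tibetan phrase used as test input
    print(tibetan_stats(sample), looks_tibetan(sample))
    print(precision_recall(["a", "b", "c"], ["b", "c", "d", "e"]))
```

In practice the thresholds would need tuning on labeled pages, and pages in legacy non-Unicode Tibetan encodings would first have to be converted, which is what the normalization step in item 5 addresses.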
[Degree-granting institution]: Chang'an University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.09;TP391.1
