Research on Content Similarity in the Mobile Internet
Published: 2018-07-15 20:05
[Abstract]: As the Internet has developed, the volume of online information has grown explosively. Mirror sites, reposted pages, and copied pages fill the web with large amounts of similar content, which lowers the quality of search engine results, wastes storage hardware, and degrades the experience of mobile users. The growth of the mobile Internet in recent years has made the problem worse. To address the lack of research on content similarity in the mobile Internet, this thesis focuses on web page main-text extraction and web page similarity computation. For main-text extraction, it first compares statistics-based extraction, visual-block-based extraction, and other extraction techniques, and then proposes a main-text extraction technique based on topic-similar blocks. For similarity computation, it first compares vector-based, feature-based, page-text-structure-based, and semantics-based similarity techniques, and then proposes a web page similarity algorithm based on feature words. The extraction technique based on topic-similar blocks relies on the similarity between the title tag and the content of each block, and extracts the main text of a page by building a web page tree. Experiments show that the algorithm achieves high extraction accuracy on complex pages. The feature-word-based similarity algorithm first extracts the feature words of a page and then computes page similarity using locality-sensitive hashing, block lookup, and related techniques. Experiments show that the algorithm improves recall and precision for short-text pages, reduces complexity, and is suitable for large-scale data.
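The abstract does not give the details of the feature-word similarity algorithm, so the following is only a minimal sketch of the general idea it names: extract feature words, hash them into a locality-sensitive fingerprint, and use block lookup to find candidates. Everything specific here is an assumption for illustration and not the thesis's actual method: SimHash as the locality-sensitive hash, term frequency over whitespace tokens as the feature-word selector (Chinese text would need a real word segmenter), 64-bit fingerprints split into 4 blocks, and the hypothetical names `feature_words`, `simhash`, `block_keys`, and `FingerprintIndex`.

```python
import hashlib
from collections import Counter, defaultdict

BITS = 64    # fingerprint width (assumed)
BLOCKS = 4   # number of equal-width blocks used for lookup (assumed)


def feature_words(text, top_k=8):
    """Stand-in feature-word extractor: the top_k most frequent tokens."""
    return Counter(text.lower().split()).most_common(top_k)


def simhash(weighted_words):
    """Fold weighted word hashes into one BITS-wide fingerprint."""
    totals = [0] * BITS
    for word, weight in weighted_words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def block_keys(fp):
    """Split a fingerprint into (block index, block value) lookup keys."""
    width = BITS // BLOCKS
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(BLOCKS)]


class FingerprintIndex:
    """Block lookup: two fingerprints that differ in fewer than BLOCKS
    bits must agree exactly on at least one block (pigeonhole), so only
    pages sharing a block need a full Hamming comparison."""

    def __init__(self, max_distance=3):
        if max_distance >= BLOCKS:
            raise ValueError("max_distance must be smaller than BLOCKS")
        self.max_distance = max_distance
        self.buckets = defaultdict(set)
        self.fingerprints = {}

    def add(self, doc_id, text):
        fp = simhash(feature_words(text))
        self.fingerprints[doc_id] = fp
        for key in block_keys(fp):
            self.buckets[key].add(doc_id)

    def near_duplicates(self, text):
        fp = simhash(feature_words(text))
        candidates = set()
        for key in block_keys(fp):
            candidates |= self.buckets[key]
        return sorted(
            d for d in candidates
            if hamming(fp, self.fingerprints[d]) <= self.max_distance
        )
```

In this sketch, a page is registered with `index.add(page_id, text)` and queried with `index.near_duplicates(text)`. The block lookup avoids comparing a query against every stored fingerprint, which is one plausible reading of how such an approach keeps complexity low on large collections.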
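[Degree-granting institution]: Huazhong University of Science and Technology (华中科技大学)
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1; TP393.092
Article ID: 2125230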
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2125230.html