An Algorithm for Analyzing Literature Records in HTML Pages
Published: 2019-04-26 00:39
[Abstract]: To help publishers promptly locate the literature they need among large numbers of web pages, an algorithm is needed that automatically extracts literature information from HyperText Markup Language (HTML) pages. To this end, a literature record analysis algorithm based on conditional random fields (CRF) is designed. First, a segmentation algorithm for the document object tree is designed, which uses separator tags to split the page data into independent blocks, each consisting of a sequence of tags and text. Next, this sequence is used as the feature vector of a conditional random field model to build a labeling model for literature information. Finally, a heuristic algorithm is designed to extract the literature information data from the labeling model, and experiments verify its effectiveness.
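The segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which tags act as separators, so the `SEPARATOR_TAGS` set, the `BlockSegmenter` class, and the `segment` helper are all assumptions made for illustration. The sketch walks the page with Python's standard `html.parser` and cuts a new block whenever a separator tag opens or closes, producing the kind of tag-and-text sequence that could later be fed to a CRF labeler.

```python
from html.parser import HTMLParser

# Assumed separator tags; the abstract does not list the actual set.
SEPARATOR_TAGS = {"li", "tr", "p", "div", "br", "hr"}


class BlockSegmenter(HTMLParser):
    """Splits a page into blocks of ("tag", name) / ("text", value) tokens."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks
        self._current = []   # tokens of the block being built

    def _flush(self):
        # Keep a block only if it carries at least one text token.
        if any(kind == "text" for kind, _ in self._current):
            self.blocks.append(self._current)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SEPARATOR_TAGS:
            self._flush()                      # separator: cut here
        else:
            self._current.append(("tag", tag))  # inline tag: keep as a token

    def handle_endtag(self, tag):
        if tag in SEPARATOR_TAGS:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._current.append(("text", text))

    def close(self):
        super().close()
        self._flush()  # emit whatever is left at end of input


def segment(html):
    """Return the list of token blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `segment` on a two-item `<ul>` list, for example, yields two blocks, one per `<li>`, each holding the inline tags and text fragments in document order. A real system would likely derive richer features per token before CRF training; that part is outside what the abstract specifies.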
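[Affiliations]: School of Information Engineering, Beijing Institute of Graphic Communication; Postdoctoral Research Station of Computer Science and Technology, Tsinghua University; Radio and Television Satellite Direct Broadcast Management Center, State Administration of Press, Publication, Radio, Film and Television
[Funding]: Beijing Municipal Education Commission Science and Technology Innovation Service Capacity Building Project (PXM2016_014223_000025); Beijing Institute of Graphic Communication University-Level Key Project (ea201507); Beijing Institute of Graphic Communication Faculty Development, Doctoral Start-up Fund Project (27170116005/062); Beijing Institute of Graphic Communication Research Project, Publication Data Asset Evaluation Laboratory Construction Project (20190116005/006)
[CLC number]: TP393.092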
Article ID: 2465603
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2465603.html