Development of an Automatic Email Address Extraction System Based on Search Engines

Published: 2018-11-05 17:42

[Abstract]: Information extraction has become an active research topic, and the so-called "Rich Data, Poor Information" problem in the results returned by search engines also urgently needs a solution. Combining the two is both interesting and of practical value. This thesis combines widely used search engines with information extraction technology to develop a search-engine-based email address extraction system. The system effectively addresses problems common to existing email address harvesters: low accuracy, limited user control, and repeated extraction of the same results across successive runs. The main work and contributions are as follows. First, URL splicing is used to query the major search engines and obtain their returned data as the source data. After the user submits keywords and the starting results page to process, the system assembles the URL of the first results page according to the URL structure that the search engine uses for that page. Compared with previous work, this thesis implements automatic page turning, that is, it obtains the address of the "next page" link. In addition, to give users more control, they can restrict the range of result pages to be processed. Second, the HTMLParser package is used to parse the HTML pages, and regular expressions are used to extract the email addresses. To obtain more comprehensive information, HTMLParser is also used to extract the URL links inside each page so that linked pages can be processed in depth. Users can choose how many levels of linked pages to process. Third, to further increase user control, users can choose to filter the final results by mail server domain (for example 163.com, 126.com, or edu.cn). To prevent addresses extracted in one run from being extracted again in the next, the results are saved in an Access database. The extracted results can also be saved manually as a text file. Finally, the system was tested, the problems found were fixed, and the results were analyzed and evaluated. The system proved stable and ran normally for 15 hours (from 8:00 to 23:00), which is sufficient for practical needs. Both recall and precision exceeded 94%, higher than the results achieved by existing email address harvesters.
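The pipeline described in the abstract (URL splicing over a page range, regular-expression extraction, domain filtering, and skipping previously stored addresses) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the search URL template, its parameter names `q` and `start`, the page size, the regular expression, and all function names are assumptions made for illustration, and an in-memory set stands in for the Access database. The sketch also assumes that domain filtering means keeping only addresses under the chosen domains.

```python
import re
from urllib.parse import urlencode

# Hypothetical search endpoint and paging scheme; real search engines
# each use their own URL structure and parameter names.
SEARCH_BASE = "https://search.example.com/s"
RESULTS_PER_PAGE = 10

# A deliberately simple email pattern (an assumption, not the thesis's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def build_page_urls(keyword, first_page, last_page):
    """Splice one results-page URL per page in the user-chosen range."""
    urls = []
    for page in range(first_page, last_page + 1):
        query = urlencode({"q": keyword, "start": (page - 1) * RESULTS_PER_PAGE})
        urls.append(f"{SEARCH_BASE}?{query}")
    return urls


def extract_emails(text):
    """Return unique addresses found in text, lowercased, in first-seen order."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()  # simplification: treats addresses as case-insensitive
        if address not in seen:
            seen.add(address)
            result.append(address)
    return result


def filter_by_domain(emails, allowed_suffixes):
    """Keep addresses whose domain equals or ends with one of the suffixes."""
    kept = []
    for address in emails:
        domain = address.rsplit("@", 1)[1]
        if any(domain == s or domain.endswith("." + s) for s in allowed_suffixes):
            kept.append(address)
    return kept


def drop_known(emails, known):
    """Return only addresses not stored by earlier runs, then record them."""
    fresh = [address for address in emails if address not in known]
    known.update(fresh)
    return fresh
```

In use, each URL from `build_page_urls` would be fetched, the page text passed to `extract_emails`, and the result narrowed with `filter_by_domain` and `drop_known` before being saved. Fetching pages and following links to a chosen depth are omitted here.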
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
