Design and Implementation of a Student Apartment Listing Data Collection Platform

Published: 2018-09-08 20:23

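【Summary】: 留彼工坊 (Liubi Gongfang) Technology Co., Ltd. is an O2O (online-to-offline) internet startup that provides student-apartment rental information to international students in the United Kingdom. Operating online, the company must offer users a good service experience and obtain the apartment information it needs quickly and accurately. Its listing data currently comes from partner institutions such as Unite-Students and from peer platforms; apartment status and rental information are exchanged by email and updated by hand. Email is inefficient and costly to manage, and during the peak rental season availability and lease-term information changes frequently. The business therefore needs a more automated way to synchronize listing information across platforms and obtain up-to-date, accurate apartment data. Web scraping is one effective means. Across apartment platforms the information structure is broadly the same, but the details of the display pages differ. Given these site-specific scraping requirements, the key problems of this project are how to design the overall system architecture, keep the module complexity of crawler code under control, decouple module functions, and clean, structure, and import the data, all in order to reduce the effort of writing crawlers and lower production cost. During an internship at the company, the author took part in developing the back-end data center for apartment data. Drawing on the company's existing, unfinished Pyspider-based crawler application, the author redeveloped a new system based on Scrapy. The data center is called Sharingan, to distinguish it from Livety, the back end of the main site. Livety selects the exact listing data to display on the front-end pages and manages users, while Sharingan serves mainly as the listing database, storing and managing the structured listing data collected from different platforms; it also acts as the scheduling and deployment platform for the web crawlers and carries out a series of data-processing tasks. The two back ends communicate through a messaging system to keep the coupling between them low. The author's specific work on the project was as follows: (1) Took part in modeling the relational schema of the listing database. After studying the business requirements and the student-apartment rental information on each platform in depth, the author defined a structured data storage model, which provides the basis and the specification for structured extraction, import, and storage of listing data. (2) Took part in designing the system architecture of the data center. Based on the overall requirements and the practical experience gained from the legacy crawler system, the author established a general pattern for web data collection and extraction and determined the new system's architecture, frameworks, technologies, and module integration plan, clarifying the development requirements, the architectural design, and the outline design of the internal modules. (3) Was responsible for implementing specific modules and for developing and integrating subsystems, including the Fragment, Processor, and Validator modules of the Scrapy crawler, the Spider scheduling and monitoring modules, the database import module, and the data center's messaging system, resulting in a complete system that is usable in its initial form. (4) Was responsible for writing the related tests to ensure the system runs correctly; testing uncovered program errors in the system and its modules, which were then fixed. Since its initial launch the system has run well. It currently collects data from each platform on a schedule and supplies apartment data to the internal display system, and its extensibility lays the foundation for a more general collection platform covering more kinds of data.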
[Abstract]: 留彼工坊 (Liubi Gongfang) Technology Co., Ltd. is an O2O internet startup that provides student housing rental information to international students in the United Kingdom. As an online business, the company needs to give users a good experience and to obtain the apartment information it requires quickly and accurately. At present its listing data is obtained through cooperation with institutions such as Unite-Students and from peer platforms, with apartment status and rental information communicated by email and updated manually. Email, however, is inefficient and expensive to manage, and availability and lease-term information changes frequently during the peak rental season. The business therefore requires a more automated way to synchronize listing information between platforms so as to obtain up-to-date and accurate apartment data, and web scraping is an effective way to do this. Although the information structure of apartments is roughly the same across platforms, the details of the display pages differ. Faced with these customized scraping requirements, and in order to reduce the work of writing crawlers and lower production costs, the key problems of this project are how to design the overall system architecture, control the complexity of the crawler modules, decouple module functions, and clean, structure, and import the data. During my internship at the company I took part in the development of the back-end data center for apartment data. With reference to the company's original, unfinished crawler application based on Pyspider, I developed a new system based on Scrapy. To distinguish it from Livety, the back end of the main site, the data center is called Sharingan. Livety is responsible for selecting the exact listing data to display on the front-end pages and for managing users, while Sharingan serves mainly as the listing database, storing and managing the structured listing data collected from different platforms, and also as the scheduling and deployment platform for the web crawlers, carrying out a series of data-processing tasks. The two back ends communicate through a messaging system in order to achieve low coupling between the systems. My work on the project was as follows: (1) I took part in modeling the relational schema of the listing database. After studying the business requirements and the student-apartment rental information on the various platforms, I defined a structured data storage model, which provides the basis and specification for the structured extraction, import, and storage of the listing data. (2) I took part in the design of the data center's system architecture. Based on the overall requirements and the practical experience gained from the legacy crawler system, I established a general pattern for web data collection and extraction and determined the new system's architecture, frameworks, technologies, and module integration scheme, clarifying the development requirements, the architecture design, and the outline design of the internal modules. (3) I was responsible for implementing specific modules and for developing and integrating subsystems, including the Fragment, Processor, and Validator modules of the Scrapy crawler, the Spider scheduling and monitoring modules, the database import module, and the data center's messaging system. The result is a complete system that is usable in its initial form. (4) I was responsible for writing the relevant tests to ensure the correct operation of the system; through testing, program errors in the system and its modules were found and fixed. Since its initial launch the system has been running well. It currently collects data from the various platforms on a schedule and provides apartment data to the internal display system, and its extensibility lays the foundation for a more general collection platform that handles more kinds of data.
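As a rough illustration of the decoupling the abstract describes, the following is a minimal Python sketch of an extract, clean, validate pipeline. The abstract names Fragment, Processor, and Validator modules but does not describe their interfaces, so every function name, field name, and HTML pattern below is a hypothetical stand-in, and plain standard-library code is used in place of Scrapy itself.

```python
import re
from dataclasses import dataclass


@dataclass
class Listing:
    """Structured record ready for import (field names are hypothetical)."""
    name: str
    weekly_rent_gbp: float
    lease_weeks: int
    available: bool


# Hypothetical per-site patterns: only this table changes from site to site.
SAMPLE_SITE_PATTERNS = {
    "name": r'<h1 class="name">(.*?)</h1>',
    "rent": r'<span class="rent">(.*?)</span>',
    "lease": r'<span class="lease">(.*?)</span>',
    "status": r'<span class="status">(.*?)</span>',
}


def extract_fragment(html: str, patterns: dict) -> dict:
    """Fragment step (assumed role): pull raw strings out of one page."""
    raw = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, html, re.S)
        if match:
            raw[field] = match.group(1).strip()
    return raw


def process(raw: dict) -> Listing:
    """Processor step (assumed role): clean raw strings into typed values."""
    rent_match = re.search(r"\d+(?:\.\d+)?", raw.get("rent", ""))
    lease_match = re.search(r"\d+", raw.get("lease", ""))
    return Listing(
        name=raw.get("name", ""),
        weekly_rent_gbp=float(rent_match.group()) if rent_match else 0.0,
        lease_weeks=int(lease_match.group()) if lease_match else 0,
        available=raw.get("status", "").lower() != "sold out",
    )


def validate(listing: Listing) -> list:
    """Validator step (assumed role): return a list of problems, empty if OK."""
    errors = []
    if not listing.name:
        errors.append("missing name")
    if listing.weekly_rent_gbp <= 0:
        errors.append("rent must be positive")
    if listing.lease_weeks <= 0:
        errors.append("lease length must be positive")
    return errors


def run_pipeline(html: str, patterns: dict = SAMPLE_SITE_PATTERNS):
    """Chain the three steps; only valid listings should go on to import."""
    listing = process(extract_fragment(html, patterns))
    return listing, validate(listing)
```

In a layout like this, supporting a new platform would mostly mean supplying a new pattern table, while the cleaning and validation steps stay shared. That is one plausible reading of how such a split reduces per-site crawler effort, not a description of the thesis's actual code.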
【Degree-Granting Institution】: Beijing Jiaotong University
【Degree Level】: Master's
【Year Degree Awarded】: 2017
【Classification Number】: TP311.52;TP274.2

【References】