deep web, Ajax, crawler, protocol
With the rapid development of the Internet, general-purpose web crawlers have become increasingly unable to meet people’s individual needs, as they are no longer efficient enough to fetch deep web pages. The presence of deep web pages on websites and the widespread use of Ajax make it difficult for general-purpose web crawlers to fetch information quickly and efficiently. Building on the original Robots Exclusion Protocol (REP), this paper proposes a Robots Exclusion and Guidance Protocol (REGP), which integrates the independent, scattered extensions of the original Robots Protocol developed by major search engine companies. Our protocol extends the file format and command set of the REP as well as two labels of the Sitemap Protocol. Through our protocol, websites can express their requirements for restricting and guiding visiting crawlers and can provide crawlers with general-purpose, fast access to deep web pages and Ajax pages, making it easier for crawlers to obtain the open data on websites. Finally, this paper presents a specific application scenario in which both a website and a crawler work with support from our protocol. A series of experiments is also conducted to demonstrate the efficiency of the proposed protocol.
Tsinghua University Press
Dajie Ge, Zhijun Ding. Robots Exclusion and Guidance Protocol. Tsinghua Science and Technology 2016, 21(6): 643-659.