Big Data Mining and Analytics

Keywords

adaptive web scraping, deep learning, Long Short-Term Memory (LSTM), web data extraction, You Only Look Once (YOLO)

Abstract

Data are crucial to the growth of e-commerce in today’s world of highly demanding, hyper-personalized consumer experiences, and these data are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks. The You Only Look Once (YOLO) algorithm provides automated web page detection, locating product details as images on web pages, and Tesseract LSTM extracts the text of those product details from the detected images. This state-of-the-art system does not need a core data extraction engine and can therefore adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.
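As an illustration of the detection precision metric reported above, the following is a minimal sketch, not the authors' evaluation code. It assumes that a predicted box counts as a true positive when its intersection over union (IoU) with a not-yet-matched ground-truth box is at least 0.5. The function names, the box format, and the threshold are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_precision(predicted: List[Box], truth: List[Box],
                        threshold: float = 0.5) -> float:
    """Precision = true positives / all predictions.

    Each ground-truth box can be matched by at most one prediction.
    The 0.5 IoU threshold is an assumption, not a value from the paper.
    """
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        # Pick the remaining ground-truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    return true_positives / len(predicted) if predicted else 0.0
```

With three predicted boxes of which two overlap distinct ground-truth boxes above the threshold, the function returns a precision of 2/3.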

Publisher

Tsinghua University Press
