Read this in other languages: 한국어
-
Product review crawler with preprocessing (OCR, etc.)
-
Works in combination with honeycomb (Flask server + SQLAlchemy) and a MySQL DB.
- Can use a proxy IP if you have one.
- Can change the user-agent if you want.
- Failover process for when the crawler stops, e.g. when the IP is banned.
- Error handling with a retry pattern when fetching to the web server.
- Define the fetch URL for the web server.
- Use PaddleOCR to analyze the seller's product description.
- Because the images are large, there is image-cutting logic that splits them without dropping any characters.
- TODO: multiprocessing will be added to this process.
- Use a tkinter GUI for the tagging tools.
- Two types:
- OCR tagging
- OCR tagging tools.
- Review tagging
- Review tagging tools.
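
The retry pattern mentioned in the list above could look roughly like the sketch below. It is an illustration only, not the project's actual code: the helper name `fetch_with_retry`, its parameters, and the exponential backoff policy are all assumptions. The request itself is passed in as a callable, so the sketch does not depend on any particular HTTP library. When the last attempt fails, the exception is re-raised so the caller can start its failover process.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it succeeds or `max_attempts` is reached.

    Hypothetical helper: the name, defaults, and backoff policy are
    illustrative assumptions, not taken from the crawler itself.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError:
            # OSError covers ConnectionError, TimeoutError, and
            # urllib.error.URLError, which are all subclasses of it.
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle failover
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```
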
-
product_link_crawler.py
- Crawls the Naver Shopping price comparison tabs.
- Crawls the data and fetches it to the server.
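
As a rough illustration of the proxy and user-agent options, the sketch below builds a request with the standard library's `urllib.request`. It is not taken from `product_link_crawler.py`: the function name `build_request`, the example URL, the user-agent string, and the proxy address are all placeholders, and the real crawler may use a different HTTP library.

```python
import urllib.request
from typing import Optional, Tuple


def build_request(
    url: str,
    user_agent: str,
    proxy: Optional[str] = None,
) -> Tuple[urllib.request.OpenerDirector, urllib.request.Request]:
    """Return an opener and a request carrying the chosen user-agent and proxy.

    Hypothetical helper written for this example only.
    """
    handlers = []
    if proxy is not None:
        # Route both HTTP and HTTPS traffic through the same proxy address.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    # With no explicit handler, urllib keeps its default proxy handling
    # (proxy settings read from the environment).
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return opener, request


# Placeholder values; nothing is sent until opener.open(request) is called.
opener, request = build_request(
    "https://example.com/products",
    user_agent="Mozilla/5.0 (compatible; example-crawler)",
    proxy="http://127.0.0.1:8080",
)
```
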
-
review_crawler.py
- Crawls the per-product price comparison page.
- Get OCR data with PaddleOCR.
- Get review data from JSON payloads.
- Get spec data provided by Naver.
- Fetch the results to the server.
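
One possible way to cut a tall product image for OCR without losing characters is to split it into horizontal strips that overlap slightly. The sketch below only computes the strip boundaries. It is an assumption about how such logic could work, not the repository's implementation: the name `strip_bounds` and the sizes are hypothetical, and the approach only keeps every text line whole if `overlap` is at least as tall as the tallest line of text.

```python
from typing import List, Tuple


def strip_bounds(height: int, strip_height: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (top, bottom) pixel rows for overlapping horizontal strips.

    Hypothetical helper written for this example only.
    """
    if height <= 0:
        return []
    if strip_height <= 0 or not 0 <= overlap < strip_height:
        raise ValueError("need strip_height > 0 and 0 <= overlap < strip_height")
    bounds = []
    top = 0
    while True:
        bottom = min(top + strip_height, height)
        bounds.append((top, bottom))
        if bottom == height:
            break
        # Step back by `overlap` rows so a text line cut at the bottom of
        # this strip appears again near the top of the next one.
        top = bottom - overlap
    return bounds


print(strip_bounds(height=2500, strip_height=1000, overlap=100))
# prints [(0, 1000), (900, 1900), (1800, 2500)]
```

Each pair could then be used to crop the image (for example with Pillow's `Image.crop`) before the strip is passed to PaddleOCR. Text that falls inside an overlap region is recognized twice, so the results would need to be de-duplicated afterwards.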