
Product crawler that collects JSON packets from network traffic and preprocesses them / A Selenium crawler that reads network logs from ChromeDriver



ket0825/preview


Preview (Product review) project

Read this in other languages: 한국어

  • Product review crawler with preprocessing (OCR and more).

  • Works together with honeycomb (a Flask server with SQLAlchemy) and a MySQL database.


Composition

Driver

  • Can use a proxy IP if you have one.
  • Can change the user-agent if needed.
  • Fails over when the crawler stops, for example when the IP is banned.
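
The driver options above could be wired up roughly as follows. This is a minimal sketch, not the repository's actual code: the helper names `build_chrome_args`, `make_driver`, and `run_with_failover` are hypothetical, and the failover here simply moves on to the next proxy when a task raises. `--proxy-server` and `--user-agent` are standard Chrome command-line switches.

```python
from typing import Callable, Iterable, List, Optional


def build_chrome_args(proxy: Optional[str] = None,
                      user_agent: Optional[str] = None) -> List[str]:
    """Return Chrome switches for the optional proxy and user-agent."""
    args = []
    if proxy:
        # "host:port" string, e.g. a placeholder such as "203.0.113.10:8080"
        args.append(f"--proxy-server={proxy}")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def make_driver(proxy: Optional[str] = None, user_agent: Optional[str] = None):
    """Create a Chrome driver (Selenium is imported lazily on purpose)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(proxy, user_agent):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


def run_with_failover(task: Callable[[str], object], proxies: Iterable[str]):
    """Call task(proxy) with each proxy in turn until one succeeds."""
    last_error = None
    for proxy in proxies:
        try:
            return task(proxy)
        except Exception as exc:  # assumption: a ban surfaces as an exception
            last_error = exc
    raise RuntimeError("all proxies failed") from last_error
```

Keeping the argument building separate from driver creation makes it possible to check the options without launching a browser.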

Route handler

  • Handles errors with a retry pattern when sending requests to the web server.
  • Defines the URLs used to send data to the web server.
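
As an illustration of the retry pattern, here is a small sketch with exponential backoff. The names `BASE_URL`, `route`, and `fetch_with_retry` are hypothetical, and the base URL is a placeholder; the real route handler may use different exceptions, delays, and endpoints.

```python
import time

BASE_URL = "http://localhost:5000"  # hypothetical address of the web server


def route(path: str) -> str:
    """Build a full URL for a server route."""
    return f"{BASE_URL}/{path.lstrip('/')}"


def fetch_with_retry(send, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return send()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
```

`send` can be any zero-argument callable, for example a `lambda` wrapping an HTTP POST to `route("...")`. Passing `sleep` as a parameter keeps the backoff easy to test.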

Image processing

  • Uses PaddleOCR to analyze the product description provided by the seller.
  • Because the images are large, they are cut into pieces by logic that does not drop any characters.
  • TODO: add multiprocessing to this step.
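
One possible way to cut a tall image without losing characters is to use overlapping tiles. The sketch below is an assumption about how this could be done, not the project's actual cutting logic; `split_rows` is a hypothetical helper that only computes the row ranges.

```python
def split_rows(height, tile_height, overlap):
    """Return (top, bottom) row ranges that cover an image of the given height.

    Consecutive ranges share `overlap` rows, so text lying on a cut line
    also appears in the next tile.
    """
    if not 0 <= overlap < tile_height:
        raise ValueError("overlap must be >= 0 and smaller than tile_height")
    bounds = []
    top = 0
    while True:
        bottom = min(top + tile_height, height)
        bounds.append((top, bottom))
        if bottom >= height:
            break
        top = bottom - overlap  # step back so the next tile repeats the seam
    return bounds


print(split_rows(2500, 1000, 100))
# [(0, 1000), (900, 1900), (1800, 2500)]
```

Each range can then be cropped (with Pillow, `image.crop((0, top, image.width, bottom))`) and passed to the OCR engine. The overlap should be at least as tall as the tallest text line and much smaller than the tile height; duplicate detections in the overlapping rows would need to be merged afterwards.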

GUI

  • Tagging tools built with a tkinter GUI.
  • Two types:
    1. OCR tagging
      • Tool for tagging OCR results.
    2. Review tagging
      • Tool for tagging reviews.

Where the crawlers run

  1. product_link_crawler.py

    • Crawls the price comparison tabs of Naver Shopping.
    • Sends the crawled data to the server.
  2. review_crawler.py

    • Crawls the price comparison page of each product.
    • Gets OCR data with PaddleOCR.
    • Gets review data from JSON payloads.
    • Gets the spec data provided by Naver.
    • Sends the data to the server.
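
Reading JSON payloads from the browser's network log could look roughly like the sketch below. It assumes Chrome performance logging is enabled in Selenium; `json_responses` is a hypothetical helper, and the URL fragment to match is up to the caller.

```python
import json


def json_responses(log_entries, url_part):
    """Return (request_id, url) for JSON responses whose URL contains url_part."""
    found = []
    for entry in log_entries:
        # Each performance log entry wraps a DevTools event as a JSON string.
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        response = message["params"]["response"]
        if "json" in response.get("mimeType", "") and url_part in response.get("url", ""):
            found.append((message["params"]["requestId"], response["url"]))
    return found


# Usage with a live driver (not run here):
#   options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
#   entries = driver.get_log("performance")
#   for request_id, url in json_responses(entries, "review"):
#       body = driver.execute_cdp_cmd(
#           "Network.getResponseBody", {"requestId": request_id})["body"]
#       payload = json.loads(body)
```

Filtering on the log first keeps the number of `Network.getResponseBody` calls small, since each call is a round trip to the browser.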

Dockerfile (for testing product_link_crawler.py only)


Sequence Diagram

product_link_crawler.py

[Image: sequence diagram for product_link_crawler.py]

review_crawler.py

[Image: sequence diagram for review_crawler.py]
