This web crawler is an integral part of an ongoing project and is not meant for use by anyone else. These files are provided for reading purposes only and should not be downloaded, forked, or copied in any way; any such activity is unwelcome. The basic purpose of this commit is to store our work online.
- asyncio_crawler.py
- crawler.py
- compressor.py
- parse.py
- server.py
- proxy_server.py
- dbconnections.py
- dboperations.py
Crawler --> Compressor --> MongoDB --> Parse --> Server --> Crawler
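
The workflow above is a loop: the crawler fetches pages, the compressor stores them in MongoDB, the parser extracts new links, and the server feeds fresh URLs back to the crawler. As a rough illustration only, the sketch below shows what the fetching stage might look like with asyncio; the names (`fetch`, `crawl`, `SEED_URLS`) and the use of aiohttp are assumptions for the example, not the project's actual code.

```python
# Minimal sketch of the crawler stage (hypothetical names; the real
# asyncio_crawler.py may differ). Assumes aiohttp is installed.
import asyncio
import aiohttp

SEED_URLS = ["https://example.com"]  # placeholder seed URLs

async def fetch(session, url):
    # Download one page; return the raw HTML, or None on failure.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            if resp.status == 200:
                return await resp.text()
    except aiohttp.ClientError:
        return None
    return None

async def crawl(urls):
    # Fetch all URLs concurrently; downstream stages (compressor, parser)
    # would consume the returned pages.
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
    return [p for p in pages if p is not None]

if __name__ == "__main__":
    results = asyncio.run(crawl(SEED_URLS))
    print(f"Fetched {len(results)} page(s)")
```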
Database 1 : Server : 2 collections --> `https` and `http`, which store URLs of the two schemes and are used by server.py.

Database 2 : Compressor : 3 collections --> `https3`, `Not Crawled`, and `Not Text`, used by multiple scripts.
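
For illustration, here is a hedged sketch of how these two databases and their collections might be opened with pymongo; the connection URI and variable names are placeholders (see dbconnections.py for the actual setup).

```python
# Sketch only: open the two databases described above with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

# Database 1 ("Server"): URLs split by scheme, read by server.py.
server_db = client["Server"]
https_urls = server_db["https"]
http_urls = server_db["http"]

# Database 2 ("Compressor"): crawl bookkeeping used by multiple scripts.
compressor_db = client["Compressor"]
https3 = compressor_db["https3"]
not_crawled = compressor_db["Not Crawled"]
not_text = compressor_db["Not Text"]
```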
This project is done in collaboration with [ExpressHermes](https://github.com/ExpressHermes), and our contributions are listed below:
- Structure and workflow design; writing asyncio_crawler.py, crawler.py, parse.py, proxy_Server.py, and parts of dboperations.py (about 5%; the major contribution to it is by Shivam Yadav).
- Writing server.py, compressor.py, dbconnection.py, and dboperations.py, and managing the MongoDB database.