
Web-Crawler

Disclaimer:

This Web-Crawler is an integral part of an ongoing project and is not meant for use by anyone else. These files are for reading purposes only and should not be downloaded, forked, or copied in any way. Any such activity will not be welcomed. The basic purpose of this commit is to store our work online and NOT TO BE COPIED.

The Crawler is divided into 8 scripts:

  1. asyncio_crawler.py
  2. crawler.py
  3. compressor.py
  4. parse.py
  5. server.py
  6. proxy_server.py
  7. dbconnections.py
  8. dboperations.py

The crawler workflow is cyclic:

Crawler ---> Compressor ---> MongoDb ---> Parse ---> Server ---> Crawler

The role of each file is described below:

asyncio_crawler.py

: This is the central file that maintains the cyclic flow of the entire process. It uses asynchronous programming to improve the efficiency of the cyclic structure. To start the flow, the crawler is seeded with some uncrawled links stored in the MongoDb database; it then starts and maintains the cycle.
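
A minimal sketch of what such an asyncio-driven cycle could look like (the helper names here are hypothetical placeholders, not the actual functions in asyncio_crawler.py):

```python
import asyncio

# Hypothetical stand-ins for the real crawl/parse/server steps.
async def crawl(url):
    """Download a page and return the path of the stored copy (placeholder)."""
    await asyncio.sleep(0)          # stand-in for real network I/O
    return f"/tmp/{abs(hash(url))}.html"

async def extract_new_links(path):
    """Parse the stored page and return links not yet crawled (placeholder)."""
    await asyncio.sleep(0)
    return []

async def crawl_cycle(seed_urls):
    """Keep the crawl -> store -> parse -> feed-back cycle running."""
    queue = asyncio.Queue()
    for url in seed_urls:           # seed with uncrawled links from MongoDb
        queue.put_nowait(url)

    while not queue.empty():
        url = await queue.get()
        page_path = await crawl(url)
        for link in await extract_new_links(page_path):
            queue.put_nowait(link)  # newly discovered links re-enter the cycle

if __name__ == "__main__":
    asyncio.run(crawl_cycle(["https://example.com"]))
```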

crawler.py

: This downloads web-pages and stores them in a local directory. For now, only web-pages whose MIME type has a text schema can be downloaded. If a page returns a status code other than 200, it is stored separately in the database so those web-pages can be fed back to asyncio_crawler.py to restart the cycle. Pages whose MIME type has a schema other than text are stored separately in a collection named 'Not Text'.
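
A hedged sketch of the download logic described above, using the requests library (the output directory, field names, and collection objects are assumptions for illustration, not necessarily what crawler.py uses):

```python
import os
import requests

def download_page(url, out_dir="pages", not_crawled=None, not_text=None):
    """Fetch a URL; save text pages locally, route others to MongoDb collections.

    not_crawled / not_text are assumed to be pymongo collection objects.
    """
    resp = requests.get(url, timeout=10)

    if resp.status_code != 200:
        # Non-200 pages are stored so asyncio_crawler.py can retry them later.
        if not_crawled is not None:
            not_crawled.insert_one({"url": url, "status": resp.status_code})
        return None

    mime = resp.headers.get("Content-Type", "")
    if not mime.startswith("text"):
        # Non-text MIME types go to the 'Not Text' collection instead.
        if not_text is not None:
            not_text.insert_one({"url": url, "mime": mime})
        return None

    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{abs(hash(url))}.html")
    with open(path, "w", encoding=resp.encoding or "utf-8") as f:
        f.write(resp.text)
    return path
```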

compressor.py

: Reads the files stored in the local directory by the crawler and pushes them into the MongoDb database, in the collection https3. Before pushing, pages are compressed using built-in MongoDb functionality.
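
A rough sketch of the read-compress-insert step, using Python's zlib as a stand-in for the compression mechanism and assuming a pymongo collection object for https3:

```python
import os
import zlib

from bson.binary import Binary

def push_compressed(pages_dir, https3_collection):
    """Read crawled files from disk, compress them, and insert into https3."""
    for name in os.listdir(pages_dir):
        path = os.path.join(pages_dir, name)
        with open(path, "rb") as f:
            raw = f.read()

        https3_collection.insert_one({
            "file": name,
            "content": Binary(zlib.compress(raw)),  # compressed page body
            "parsed": False,                        # parse.py picks unparsed docs
        })
        os.remove(path)  # local copy no longer needed once stored in MongoDb
```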

parse.py

: Reads random files that have not yet been parsed from the https3 collection. It then uses the BeautifulSoup parser to extract all the 'a' tags and their 'href' attributes and passes them to server.py.
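
A minimal BeautifulSoup sketch of the link extraction described above; the 'parsed' flag and the zlib decompression mirror the compressor sketch and are assumptions, not the project's actual field names:

```python
import zlib

from bs4 import BeautifulSoup

def extract_links(https3_collection):
    """Pick one unparsed document and extract href values from its <a> tags."""
    doc = https3_collection.find_one({"parsed": False})
    if doc is None:
        return []

    html = zlib.decompress(doc["content"]).decode("utf-8", errors="ignore")
    soup = BeautifulSoup(html, "html.parser")

    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Mark the document so it is not picked up again on the next pass.
    https3_collection.update_one({"_id": doc["_id"]},
                                 {"$set": {"parsed": True}})
    return links  # these are then handed to server.py
```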

server.py

: This checks whether a link to be crawled has already been crawled. If it has, the link is not crawled again; otherwise it is crawled. If a link can be crawled, it is first entered into the database named 'Server' and then passed to crawler.py.
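
A hedged sketch of that check, assuming the 'Server' database keeps crawled URLs in the https and http collections keyed by a 'url' field:

```python
def is_new_link(url, https_collection, http_collection):
    """Return True if the URL has not been seen before; record it if new."""
    collection = https_collection if url.startswith("https") else http_collection

    if collection.find_one({"url": url}) is not None:
        return False                        # already crawled (or queued), skip it

    collection.insert_one({"url": url})     # record before handing to crawler.py
    return True
```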

dbconnections.py

: A centralised file for establishing the connection with the MongoDb cluster and then with its databases and collections.
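
A minimal pymongo connection sketch; the URI and variable names below are placeholders, while the real cluster address lives in dbconnections.py:

```python
from pymongo import MongoClient

# Placeholder URI; the real cluster address and credentials are in dbconnections.py.
client = MongoClient("mongodb://localhost:27017")

server_db     = client["Server"]
compressor_db = client["Compressor"]

https_collection       = server_db["https"]
http_collection        = server_db["http"]
https3_collection      = compressor_db["https3"]
not_crawled_collection = compressor_db["Not Crawled"]
not_text_collection    = compressor_db["Not Text"]
```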

dboperations.py

: Contains all the MongoDb queries required to complete the task.
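
For illustration, a couple of such query helpers might look like this (the function and field names are hypothetical):

```python
def fetch_uncrawled(not_crawled_collection, limit=10):
    """Return a batch of URLs that previously failed, to re-seed the cycle."""
    return [doc["url"] for doc in not_crawled_collection.find().limit(limit)]

def mark_parsed(https3_collection, doc_id):
    """Flag a stored page as parsed so parse.py does not pick it again."""
    https3_collection.update_one({"_id": doc_id}, {"$set": {"parsed": True}})
```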

proxy_server.py

: Written to enable proxy rotation to prevent web sites from blocking the crawler. This is not very effective because only free proxies are used. User-Agent rotation was also tried but could not be used successfully.
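
A rough sketch of proxy rotation with requests; the proxy addresses are illustrative only, since free proxy endpoints change constantly, which is part of why this approach was unreliable:

```python
import itertools
import requests

# Illustrative free-proxy addresses; real ones would need to be refreshed regularly.
PROXIES = [
    "http://103.152.112.162:80",
    "http://51.158.68.133:8811",
]
_proxy_pool = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    """Try each proxy in turn until one of them returns a response."""
    for _ in range(len(PROXIES)):
        proxy = next(_proxy_pool)
        try:
            return requests.get(url,
                                proxies={"http": proxy, "https": proxy},
                                timeout=10)
        except requests.RequestException:
            continue            # free proxies fail often; rotate to the next one
    return None
```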

MongoDb Structure:

Database 1 : Server : 2 collections -----> https, http (store URLs of different schemas; used by server.py).
Database 2 : Compressor : 3 collections -----> https3, Not Crawled, Not Text (used by multiple scripts).

This project is done in collaboration with Shivam Yadav (https://github.com/ExpressHermes), and our contributions are mentioned below:

Ansh Lehri's Contribution (Repository Owner)
: Structure and workflow design; writing asyncio_crawler.py, crawler.py, parse.py, proxy_server.py, and parts of dboperations.py (about 5%), though the major contribution to that file is by Shivam Yadav.
Shivam Yadav's Contribution
: Writing the scripts server.py, compressor.py, dbconnections.py, dboperations.py, and managing the MongoDb database.

The Crawler is up and running on an AWS EC2 instance, managed by Ansh Lehri.

About

A sequence of Python scripts to crawl the web.
