This is a simple web scraper API that lets users submit the URL of a web page and scrape a list of all the links found on that page. The API is built using Django Rest Framework (DRF).
- Users can see a list of all pages that have been scraped, along with the number of links found.
- Users can see the details of all links on a particular page, including the URL and name.
- Users can submit a URL and the system will extract all the links on that page and add them to the database.
- Pagination is available for the list of pages and links.
- Users can see which pages are currently being processed.
- Framework: Django Rest Framework (DRF)
- Database: SQLite
- Test Suite: PyTest
- Scraping Tool: Beautiful Soup
- Concurrency: Python's threading library
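
Based on the features above, the underlying data model is presumably a page/link pair. The sketch below only illustrates that assumption; the actual model and field names in the repository may differ.

```python
# models.py -- illustrative sketch, not the repository's actual code.
from django.db import models


class ScrapedPage(models.Model):
    """A page whose links have been (or are being) scraped."""
    url = models.URLField(unique=True)
    is_processing = models.BooleanField(default=True)  # surfaced as "currently being processed"
    created_at = models.DateTimeField(auto_now_add=True)

    @property
    def total_links(self):
        # Number of links found on this page.
        return self.links.count()


class Link(models.Model):
    """A single link found on a scraped page."""
    page = models.ForeignKey(ScrapedPage, related_name="links", on_delete=models.CASCADE)
    url = models.URLField(max_length=2048)
    name = models.CharField(max_length=512, blank=True)  # the link's anchor text
```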
- Clone this repository to your local machine:
$ git clone https://github.com/rodzun/linkscraper.git
- Install the required dependencies (preferably inside a virtual environment):
$ pip install -r requirements.txt
- Run the migrations:
$ python manage.py migrate
- Start the development server:
$ python manage.py runserver
Endpoints:
- To see the list of scraped pages:
http://localhost:8000/api/scraped_pages/
- To see the list of links found on the page with id 1 (change the number to query a different page id):
http://localhost:8000/api/scraped_pages/1/links
- To request scraping of a URL:
http://localhost:8000/api/add_page/
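
For example, the GET endpoints can be queried from Python with the requests library. This is a minimal sketch; it assumes the development server is running on localhost:8000 and that a page with id 1 already exists.

```python
# Illustrative only: query the GET endpoints with the requests library.
import requests

BASE = "http://localhost:8000/api"

# List all scraped pages (paginated).
pages = requests.get(f"{BASE}/scraped_pages/").json()
print(pages)

# List the links found on the page with id 1.
links = requests.get(f"{BASE}/scraped_pages/1/links").json()
print(links)
```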
- Once the project is running (after following the steps above), the API can be exercised with tools like Postman or similar clients.
- The GET endpoints are listed in the previous section. Any HTTP client can be used to inspect the responses; I used Google Chrome with the JSON Formatter extension (https://chrome.google.com/webstore/detail/json-formatter/bcjindcccaagfpapjjmafapmmgkkhgoa?hl=en) to get clean, properly formatted JSON in the browser.
- The POST endpoint was tested with the httpie client, using the following command to scrape the https://www.google.com/ web page. Change that parameter to test other URLs:
$ http POST http://localhost:8000/api/add_page/ url=https://www.google.com/
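
The same request can be sent from Python with the requests library (a sketch under the same assumptions as above; httpie sends the url field as JSON by default, so the payload below is equivalent):

```python
# Illustrative only: request scraping of a URL via the POST endpoint.
import requests

response = requests.post(
    "http://localhost:8000/api/add_page/",
    json={"url": "https://www.google.com/"},
)
print(response.status_code, response.json())
```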
- To run the Test Suite, run the following command inside the Tests folder:
$ pytest
or the following to see more verbose output:
$ pytest --verbose
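
As an illustration of the kind of test the suite uses patch and mock for, the sketch below defines a hypothetical link-extraction helper inline and tests it with requests.get patched out, so no real network call is made. It is not the repository's actual test code.

```python
# Illustrative sketch, not the project's real tests.
from unittest.mock import Mock, patch

import requests
from bs4 import BeautifulSoup


def extract_links(url):
    # Hypothetical helper mirroring what the scraper likely does:
    # download the page and collect every anchor tag's href and text.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    return [
        {"url": a["href"], "name": a.get_text(strip=True)}
        for a in soup.find_all("a", href=True)
    ]


@patch("requests.get")
def test_extract_links_parses_anchor_tags(mock_get):
    fake_html = '<html><body><a href="https://example.com/">Example</a></body></html>'
    mock_get.return_value = Mock(status_code=200, text=fake_html)

    links = extract_links("https://www.google.com/")

    mock_get.assert_called_once_with("https://www.google.com/")
    assert links == [{"url": "https://example.com/", "name": "Example"}]
```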
- Django for the web framework.
- Django Rest Framework for API development.
- Requests for HTTP requests.
- BeautifulSoup4 for scraping web pages.
- PyTest for unit testing.
- The threading library for concurrency.
For the full list of dependencies, see the requirements.txt file.
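
Since scraping a page can take a while, the add_page endpoint presumably hands the work off to a background thread so the HTTP response can return immediately and the page shows up as "being processed" in the meantime. The names below are hypothetical; this is only a sketch of that pattern, not the project's actual code.

```python
# Illustrative sketch of dispatching the scraping work to a background thread.
import threading


def scrape_and_store_links(page_id):
    # Hypothetical worker: fetch the page, parse its links with Beautiful Soup,
    # save them to the database, then mark the page as no longer processing.
    ...


def start_scraping(page_id):
    # Daemon thread: the API response returns right away while scraping
    # continues in the background.
    worker = threading.Thread(target=scrape_and_store_links, args=(page_id,), daemon=True)
    worker.start()
    return worker
```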
- The Test Suite doesn't cover all possible scenarios. It mainly covers the most important points; the idea was to show my skills with different testing tools and techniques, like patch, mock, etc. This was because of the time constraint.
- Proper Git usage was not followed, as it was not the main goal in the project description. The repo mainly contains the whole project as an initial commit.
- Some details may have slipped through because of the constraints of the challenge. The project can be improved in many ways.