A service for crawling websites (experimental)
- Docker
- Make
- Phoenix framework
- Redis for job processing
- Cassandra for persistent storage
From the project root, inside a shell, run:
make pull
- to pull latest images

make init
- to install fresh dependencies

make up
- to run app containers
Now you can visit localhost:4000 from your browser.
make down
- to extinguish running containers

make help
- for additional commands
- The user adds a new source URL -> a new async job is started
- Inside the job (see the sketch after this list):
  - Normalize the URL (validate the scheme, remove the trailing slash, etc.)
  - Store the link in the DB; if the link already exists, exit
  - Parse HTML links and metadata
  - Store them in different tables
  - Normalize the links and check whether each one is relative or absolute
  - Check whether links are external
  - For each non-external link -> schedule a new async job with some random interval
- That's literally it
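
The job code itself is not shown in this README, so here is a minimal sketch of the normalization and scheduling steps in Elixir. The module name, queue name, delay bounds, and the use of Exq as the Redis-backed job queue are all assumptions; only the overall flow comes from the list above.

```elixir
defmodule Crawler.CrawlJob do
  @moduledoc """
  Sketch of a crawl job: normalize the URL, then schedule follow-up jobs.
  Module, queue and delay values are assumptions, not taken from the real app.
  """

  # Hypothetical queue name and delay bounds (in seconds).
  @queue "crawl"
  @min_delay 5
  @max_delay 60

  # Entry point invoked by the queue worker with the raw URL as its argument.
  def perform(raw_url) do
    with {:ok, url} <- normalize(raw_url) do
      # ...store the link, bail out if it already exists, parse the HTML,
      # then call schedule/1 for every internal link found (omitted here).
      {:ok, url}
    end
  end

  # Normalize a raw URL: validate the scheme, drop the fragment and the trailing slash.
  def normalize(raw_url) do
    uri = URI.parse(raw_url)

    cond do
      uri.scheme not in ["http", "https"] ->
        {:error, :invalid_scheme}

      uri.host in [nil, ""] ->
        {:error, :missing_host}

      true ->
        path = String.trim_trailing(uri.path || "", "/")
        {:ok, URI.to_string(%URI{uri | path: path, fragment: nil})}
    end
  end

  # Schedule a follow-up crawl for an internal link after a random interval,
  # assuming Exq as the Redis-backed job queue (the README only says "Redis").
  def schedule(url) do
    delay = Enum.random(@min_delay..@max_delay)
    Exq.enqueue_in(Exq, @queue, delay, __MODULE__, [url])
  end
end
```

Here Exq.enqueue_in/5 pushes the follow-up job into Redis with the given delay, which corresponds to the "random interval" step above.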
To see it in action, go to localhost:4000/crawl and enter any URL.
To see some search results, visit localhost:4000/search.
The default keyspace is storage
Tables:
site_statistics
- contains source URLs and counts of parsed links

sites
- contains the URL and the parsed HTML

sites_by_meta
- contains the URL and parsed metadata
For LIKE-style search queries a SASI index needs to be configured.
See schema.cql and cassandra.yaml for more detail.
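
For illustration, below is a hedged sketch of how these tables might be queried from Elixir. The Xandra driver and the column names (url, title) are assumptions; only the storage keyspace and the table names come from this README, and the LIKE query only works once the SASI index from schema.cql is in place.

```elixir
# Minimal sketch, assuming the Xandra Cassandra driver; column names are guesses,
# only the keyspace and table names are taken from this README.
{:ok, conn} = Xandra.start_link(nodes: ["127.0.0.1:9042"])

# Plain read from the statistics table.
{:ok, stats} = Xandra.execute(conn, "SELECT * FROM storage.site_statistics")
IO.inspect(Enum.to_list(stats), label: "site_statistics")

# LIKE-style search against the metadata table; this needs the SASI index
# configured in schema.cql (with SASI enabled in cassandra.yaml).
{:ok, matches} =
  Xandra.execute(conn, "SELECT url, title FROM storage.sites_by_meta WHERE title LIKE 'elixir%'")

IO.inspect(Enum.to_list(matches), label: "sites_by_meta matches")
```

Whether a prefix pattern like 'elixir%' or an infix pattern like '%elixir%' is accepted depends on how the SASI index is declared in schema.cql.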
- Visit localhost:4000/jobs to see crawling jobs in action
- Visit localhost:4000/dashboard to see core metrics of the system
MIT. Please see the license file for more information.