Authors: Jon Catanio, Tyler Dahl, Alex Greene, Lohit Vankineni, & Shubham Kahal
In order to use the `searchengine` package, the `PYTHONPATH` environment variable needs to be updated:

```
$ export PYTHONPATH=${PYTHONPATH}:<absolute_path/to/SearchEngine/>
```
Please see the `requirements.txt` file in the root directory. You should be able to install the dependencies with `pip3 install -r requirements.txt`.
Please refer to the README located here to learn how to set up the MySQL database. A testing database that has pre-crawled information with PageRanks assigned can be found here.
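Once the database is loaded, a quick sanity check is to query it directly. The snippet below is only a sketch: the connection parameters and database name are assumptions (use whatever the MySQL README specifies), and `Links` is the table the crawler populates.

```python
# Sketch: confirm the database is populated (connection details are assumptions).
import mysql.connector  # pip3 install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="<your password>",
    database="searchengine",  # assumed name; match your local setup
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM Links")  # Links is the crawled-link table
print("Crawled links:", cursor.fetchone()[0])
conn.close()
```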
- In the `searchengine` directory there is a Python3 file titled `server.py`. Simply run `python3 server.py` and a Flask server will be running on localhost port 5000.
- Navigate to the `frontend` directory and open the `index.html` file in your web browser of choice (we know Chrome and Safari work if you have trouble with this).
- Enter search queries and wait shortly for results. (There will be no results if the database is not populated.) If you want to check the server without the frontend, see the sketch after this list.
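The following is a rough sketch of hitting the Flask server directly with `requests`. The route and parameter name here are assumptions, not the documented API, so check `server.py` for the endpoint that `index.html` actually calls.

```python
# Sketch: query the local Flask server directly.
# The "/search" route and "q" parameter are assumptions; see server.py for the real endpoint.
import requests

resp = requests.get("http://localhost:5000/search", params={"q": "registrar office hours"})
print(resp.status_code)
print(resp.text[:500])  # response format depends on server.py
```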
Navigate to the `crawler` directory before following the directions below. To start a crawl, run:

```
python3 crawl.py
```
Our crawler starts out with a URL stack to increase the chances of reaching pages that are not reachable from the Cal Poly homepage. The script that updates the stack is called `stack.py`.
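For orientation, the crawl itself boils down to a stack-driven fetch loop over Cal Poly pages. The snippet below is only a minimal illustration of that idea, not the actual `crawl.py`, which also extracts page content and writes results to the MySQL database; the seed URL and domain filter are placeholders.

```python
# Minimal illustration of a stack-based crawl (crawl.py does much more: parsing, DB writes).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

stack = ["https://www.calpoly.edu/"]  # placeholder seed; stack.py pushes additional URLs
visited = set()

while stack:
    url = stack.pop()
    if url in visited:
        continue
    visited.add(url)
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        continue
    # Push every on-domain link we have not seen yet.
    for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, anchor["href"])
        if "calpoly.edu" in link and link not in visited:
            stack.append(link)
```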
`stack.py` uses a custom Google search engine to retrieve search results. To communicate with Google's API, an API key and a search engine ID are required.
Run `pip install google-api-python-client`.
Create a file called `config.py` within the `crawler` directory. Inside, paste:

```python
key = '<api key goes here>'
cse_id = '<search engine id goes here>'
```
Replace the bracketed text with the credentials provided by @alexgreene. Remove the brackets, but keep the quotation marks.
To add links to the stack, run `python3 stack.py site:csc.calpoly.edu/~ 1`, where `site:csc.calpoly.edu/~` would be replaced with your search term and `1` with the number of pages of Google search results to pull URLs from.
Note: We are limited to 100 total pages per day, with each page returning 10 results.
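Internally, `stack.py` presumably builds a Custom Search client from the credentials in `config.py` and pages through results ten at a time. The sketch below shows that pattern with `google-api-python-client`; the actual structure of `stack.py` may differ, and the query and page count here are just example values.

```python
# Sketch: pull URLs from Google Custom Search, ten results per requested page.
from googleapiclient.discovery import build

import config  # the config.py created above, providing key and cse_id

service = build("customsearch", "v1", developerKey=config.key)

query = "site:csc.calpoly.edu/~"  # example search term
num_pages = 1                     # example page count (each page costs one daily query)

urls = []
for page in range(num_pages):
    res = service.cse().list(q=query, cx=config.cse_id, start=1 + 10 * page).execute()
    urls.extend(item["link"] for item in res.get("items", []))

print(urls)
```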
To assign a PageRank to each link, navigate to the `searchengine/algorithms` directory and run `python3 PageRank.py`; this will begin assigning ranks to each link.

Note: The database needs to be set up and populated before running the PageRank file.
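`PageRank.py` pulls the link graph from the database; the ranking step itself is the standard iterative PageRank update. Below is a minimal, self-contained sketch of that computation on an in-memory graph, purely for illustration (the real script reads from and writes to MySQL rather than a Python dict).

```python
# Standard iterative PageRank on a small in-memory link graph (illustration only).
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                # Dangling page: spread its rank evenly across all pages.
                for other in pages:
                    new_ranks[other] += damping * ranks[page] / n
            else:
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks

example_graph = {
    "csc.calpoly.edu/a": ["csc.calpoly.edu/b", "csc.calpoly.edu/c"],
    "csc.calpoly.edu/b": ["csc.calpoly.edu/c"],
    "csc.calpoly.edu/c": ["csc.calpoly.edu/a"],
}
print(pagerank(example_graph))
```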
For other information, please visit some of the other directories in the package; most have their own README files associated with them.
The website can be found here; it includes information from our final report.
The live demo can be found on our AWS EC2 instance. This will only be up for a short period of time, as we do not want to pay for it to stay up. If you would like to see the live version, please contact me at joncatanio [at] gmail [dot] com. The live version uses a pre-crawled database and will fall out of date as new pages are added to the `csc.calpoly.edu` domain, since we are not constantly crawling on that instance.
If you are experiencing some slowness in getting results, that is simply due to us using Beautiful Soup to fetch relevant link titles and descriptions after the database API returns links. We realize this was not the best design decision and could have been remedied by storing that data in the `Links` table after each crawl. Other than that it operates as expected and the results should be very relevant! :)
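For reference, the per-result lookup is roughly the pattern below, which is why each query pays a network round trip per link. This is only an illustration, not the exact server code.

```python
# Illustration: fetching a title and description for each result at query time.
import requests
from bs4 import BeautifulSoup

def title_and_description(url):
    html = requests.get(url, timeout=5).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta["content"] if meta and meta.has_attr("content") else ""
    return title, description

print(title_and_description("https://csc.calpoly.edu/"))
```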