Using the Semantic Scholar API, this script, `SS_crawler.py`, searches scientific papers with customized queries.
- MacBook Pro 2019 / Mac OSX 12.0.1 (Monterey) / python 3.8.12
- RaspberryPi / Raspbian 10 (buster) / python 3.7.3
Run once by providing arguments. The usage is:
$ python SS_crawler.py -o -q your+favorite search+keywords -N 3
Options:
- `-o`: One-shot mode (run once and exit)
- `-q`: Search query, with words concatenated by `+` signs. Multiple queries must be separated by spaces.
- `-N`: Number of posted papers per search query

Omit the `-o` option to run periodically at the specified date and time.
The best practice is to run the script on a network-connected server such as a Raspberry Pi (see 4. Recommended Usage for details).
$ python SS_crawler.py -q your+favorite search+keywords -N 3
To modify the posting date and time, change the variables `day_off` and `posting_hour` in the header of `SS_crawler.py`. See 3. Details for more advanced options.
Semantic Scholar [1] is a machine-learning-assisted publication search engine. The advantages of using Semantic Scholar include, but are not limited to:
- Search across journal/conference papers in addition to preprint servers (e.g., arXiv, bioRxiv, and PsyArXiv).
- Each paper comes with a list of articles highly influenced by it (and thus highly related to it).
- The recently updated Semantic Scholar API [2] provides easy access to the search engine from customized code.
Utilizing these features, we can automate the daily literature survey to find papers that are highly relevant to your work. I hope your research will benefit from `SS_crawler.py`.
To customize the script, modify the header part of `SS_crawler.py` as follows.
Find the variable `query_list` in the header. Multiple queries can be specified. Words in a query must be concatenated with `+` signs. For example:
query_list = ('face+presentation+attack+recognition', 'sequential+probability+ratio+test')
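Each query is ultimately sent to the Semantic Scholar Academic Graph search endpoint [2]. The following is a minimal sketch of what that request could look like; the exact request code and the field list in `SS_crawler.py` may differ, and the helper names here are illustrative.

```python
# Illustrative sketch of querying the Semantic Scholar paper-search
# endpoint [2]. The `+` signs in a query are read as spaces by the API.
# Field list and helper names are assumptions, not SS_crawler.py's code.
import json
import urllib.request

API = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, limit: int = 3) -> str:
    """Compose the search URL for one query string from query_list."""
    return f"{API}?query={query}&limit={limit}&fields=title,year,externalIds"

def search_papers(query: str, limit: int = 3) -> list:
    """Fetch up to `limit` results for one query (requires network)."""
    with urllib.request.urlopen(build_search_url(query, limit)) as resp:
        return json.load(resp)["data"]
```

For example, `search_papers('sequential+probability+ratio+test', 3)` would return a list of result dictionaries with `title`, `year`, and `externalIds` keys.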
By default, `SS_crawler.py` outputs the search results to the console. To improve readability and searchability, the results can be posted to a personal Slack channel (or to any URL you want) via a webhook whose address is specified with the variable `slack_url`. To get the webhook URL, see [3].
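Posting through a Slack incoming webhook boils down to an HTTP POST with a JSON body [3]. Below is a minimal sketch, assuming a simple per-paper message format; the message layout and helper names are illustrative, not the script's actual code.

```python
# Sketch of posting one search result to Slack via an incoming webhook [3].
# `slack_url` is the header variable from SS_crawler.py; the message
# format below is an assumption for illustration.
import json
import urllib.request

def format_message(title: str, year: int, url: str) -> str:
    """Render one paper as a Slack message body."""
    return f"*{title}* ({year})\n{url}"

def post_to_slack(slack_url: str, text: str) -> None:
    """Send `text` to the channel behind the incoming-webhook URL."""
    req = urllib.request.Request(
        slack_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```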
In addition to searching with the default queries defined in `query_list`, `SS_crawler.py` has an experimental function to find classic papers. Reading classic papers is educational: we can learn which papers had what impact on later research (and it is simply enjoyable). Semantic Scholar helps visualize this impact with its list of influential papers, which makes it well suited to finding classics.
Set the `ifClassic` variable to `True` in the header to activate the classic-paper search option. Currently, the classic-paper search is not supported in one-shot execution.
Modify `classic_query_list` to define queries for classic papers. Instead of searching every query, one query from `classic_query_list` is randomly chosen per search. I usually set these queries to be more abstract than those in `query_list`. To modify the range and time window, change the variable `range_classic`. The default is `np.arange(1935, 2025, 10)`: one of the 10-year time windows is randomly chosen, and papers are searched within that window. Hereafter, papers found with `query_list` are referred to as "regular papers" to distinguish them from classic papers.
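The random selection described above can be sketched as follows. The variable names `classic_query_list` and `range_classic` come from the script's header, but the selection code itself is an illustrative assumption.

```python
# Sketch of picking one classic-paper query and one 10-year window.
# Variable names follow SS_crawler.py's header; the selection logic
# shown here is an ASSUMPTION about how the script might do it.
import random
import numpy as np

classic_query_list = ('neural+network', 'signal+detection+theory')
range_classic = np.arange(1935, 2025, 10)

def pick_classic_search():
    """Randomly choose one query and one 10-year window, e.g. (q, 1955, 1964)."""
    query = random.choice(classic_query_list)
    start = int(random.choice(range_classic))
    return query, start, start + 9
```

The returned `(query, start, end)` triple would then be passed to the search, restricting results to papers published between `start` and `end`.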
Modify the variables `Npapers_to_display` and `Nclassic_to_display` to set the number of displayed regular and classic papers, respectively.
SS_crawler saves the IDs of already-posted papers as `.pkl` files. The IDs of regular papers and classic papers are saved in `published_ss.pkl` and `published_ss_old.pkl`, respectively. To clear the history, simply delete these files. If a specific paper ID must be removed, delete that ID from the `.pkl` file. Note that this function is adapted from the arXiv API crawler found in [4].
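The deduplication idea (adapted from [4]) amounts to keeping a pickled set of posted IDs and posting only unseen papers. A minimal sketch, assuming the history is stored as a Python set (the actual on-disk structure in `SS_crawler.py` may differ):

```python
# Sketch of the posted-ID history: already-posted paper IDs live in a
# pickle file, and only unseen papers get posted. Storing the history as
# a set is an assumption; helper names are illustrative.
import os
import pickle

HISTORY = "published_ss.pkl"  # regular-paper history file

def load_history(path: str = HISTORY) -> set:
    """Return the set of already-posted paper IDs (empty if no file yet)."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return set()

def filter_and_record(paper_ids, path: str = HISTORY) -> list:
    """Keep only unseen IDs, then add them to the history file."""
    seen = load_history(path)
    new = [pid for pid in paper_ids if pid not in seen]
    with open(path, "wb") as f:
        pickle.dump(seen | set(new), f)
    return new
```

Deleting the file resets the history; unpickling the set, removing one ID, and re-pickling it drops a single paper.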
I usually connect a Raspberry Pi to the Internet and run `SS_crawler.py` under GNU Screen so that closing the terminal will not terminate the script. After installing Screen (e.g., with `apt` or `yum`), simply start a new screen session with:
$ screen -S the_name_of_your_screen
then run `SS_crawler.py`. For more details, see [5].
For example:
$ python3 SS_crawler_ver1.py -o -q mixed+selectivity+nature -N 1
[1] Semantic Scholar
[2] API | Semantic Scholar
[3] Sending messages using Incoming Webhooks
[4] github.com/kushanon/oreno-curator
[5] Screen User's Manual