Using the Semantic Scholar API, this script, `SS_crawler.py`, searches scientific papers with customized queries.
- MacBook Pro 2019 / Mac OSX 12.0.1 (Monterey) / python 3.8.12
- RaspberryPi / Raspbian 10 (buster) / python 3.7.3
Run once by providing arguments. The usage is:
$ python SS_crawler.py -o -q your+favorite search+keywords -N 3
Options:
- `-o`: One-shot mode (run once and exit)
- `-q`: Search query, with words concatenated by `+` signs. Multiple queries must be separated by spaces.
- `-N`: Number of posted papers per search query

Omit the `-o` option to run periodically at the specified date and time.
The best practice is to run the script on a network-connected server such as a Raspberry Pi (see 4. Recommended Usage for details).
$ python SS_crawler.py -q your+favorite search+keywords -N 3
To modify the posting date and time, change the variables `day_off` and `posting_hour` in the header of `SS_crawler.py`. See 3. Details for more advanced options.
Semantic Scholar [1] is a machine-learning-assisted publication search engine. The advantages of using Semantic Scholar include, but are not limited to:
- Search across journal/conference papers in addition to preprint servers (e.g., arXiv, bioRxiv, and PsyArXiv).
- Each paper comes with a list of articles highly influenced by it (and thus highly related to it).
- The recently updated Semantic Scholar API [2] provides easy access to the search engine from customized code.
Utilizing these features, we can automate the daily literature survey to find papers that are highly relevant to your work. I hope your research will benefit from `SS_crawler.py`.
To customize the script, modify the header part of `SS_crawler.py` as follows.
Find the variable `query_list` in the header. Multiple queries can be specified. Words in a query must be concatenated with `+` signs. For example:
query_list = ('face+presentation+attack+recognition', 'sequential+probability+ratio+test')
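Each query is ultimately sent to the Semantic Scholar Academic Graph search endpoint [2]. The following is a minimal sketch of what that request could look like; the exact request code and the field list in `SS_crawler.py` may differ, and the helper names here are illustrative.

```python
# Illustrative sketch of querying the Semantic Scholar paper-search
# endpoint [2]. The `+` signs in a query are read as spaces by the API.
# Field list and helper names are assumptions, not SS_crawler.py's code.
import json
import urllib.request

API = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, limit: int = 3) -> str:
    """Compose the search URL for one query string from query_list."""
    return f"{API}?query={query}&limit={limit}&fields=title,year,externalIds"

def search_papers(query: str, limit: int = 3) -> list:
    """Fetch up to `limit` results for one query (requires network)."""
    with urllib.request.urlopen(build_search_url(query, limit)) as resp:
        return json.load(resp)["data"]
```

For example, `search_papers('sequential+probability+ratio+test', 3)` would return a list of result dictionaries with `title`, `year`, and `externalIds` keys.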
By default, `SS_crawler.py` outputs the search results to the console. To improve readability and searchability, the results can be posted to a personal Slack channel (or to any URL you want) via a webhook whose address is specified with the variable `slack_url`. To get the webhook URL, see [3].
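Posting through a Slack incoming webhook boils down to an HTTP POST with a JSON body [3]. Below is a minimal sketch, assuming a simple per-paper message format; the message layout and helper names are illustrative, not the script's actual code.

```python
# Sketch of posting one search result to Slack via an incoming webhook [3].
# `slack_url` is the header variable from SS_crawler.py; the message
# format below is an assumption for illustration.
import json
import urllib.request

def format_message(title: str, year: int, url: str) -> str:
    """Render one paper as a Slack message body."""
    return f"*{title}* ({year})\n{url}"

def post_to_slack(slack_url: str, text: str) -> None:
    """Send `text` to the channel behind the incoming-webhook URL."""
    req = urllib.request.Request(
        slack_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```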
In addition to searching with the default queries defined in `query_list`, `SS_crawler.py` has an experimental function to find classic papers. Reading classic papers is educational: we can learn which papers had what impact on later research (and it is simply enjoyable). Semantic Scholar helps visualize this impact with its list of influential papers, which makes it well suited to finding classics.
Set the `ifClassic` variable to `True` in the header to activate the classic-paper search option. Currently, the classic-paper search is not supported in one-shot execution.
Modify `classic_query_list` to define queries for classic papers. Instead of searching every query, one query from `classic_query_list` is randomly chosen per search. I usually set these queries to be more abstract than those in `query_list`. To modify the range and time window, change the variable `range_classic`. The default is `np.arange(1935, 2025, 10)`: one of the 10-year time windows is randomly chosen, and papers are searched within that window. Hereafter, papers found with `query_list` are referred to as "regular papers" to distinguish them from classic papers.
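The random selection described above can be sketched as follows. The variable names `classic_query_list` and `range_classic` come from the script's header, but the selection code itself is an illustrative assumption.

```python
# Sketch of picking one classic-paper query and one 10-year window.
# Variable names follow SS_crawler.py's header; the selection logic
# shown here is an ASSUMPTION about how the script might do it.
import random
import numpy as np

classic_query_list = ('neural+network', 'signal+detection+theory')
range_classic = np.arange(1935, 2025, 10)

def pick_classic_search():
    """Randomly choose one query and one 10-year window, e.g. (q, 1955, 1964)."""
    query = random.choice(classic_query_list)
    start = int(random.choice(range_classic))
    return query, start, start + 9
```

The returned `(query, start, end)` triple would then be passed to the search, restricting results to papers published between `start` and `end`.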
Modify the variables `Npapers_to_display` and `Nclassic_to_display` to set the number of displayed regular and classic papers, respectively.
SS_crawler saves the IDs of already-posted papers as `.pkl` files. The IDs of regular papers and classic papers are saved in `published_ss.pkl` and `published_ss_old.pkl`, respectively. To clear the history, simply delete these files. If a specific paper ID must be removed, delete that ID from the `.pkl` file. Note that this function is adapted from the arXiv API crawler found in [4].
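The deduplication idea (adapted from [4]) amounts to keeping a pickled set of posted IDs and posting only unseen papers. A minimal sketch, assuming the history is stored as a Python set (the actual on-disk structure in `SS_crawler.py` may differ):

```python
# Sketch of the posted-ID history: already-posted paper IDs live in a
# pickle file, and only unseen papers get posted. Storing the history as
# a set is an assumption; helper names are illustrative.
import os
import pickle

HISTORY = "published_ss.pkl"  # regular-paper history file

def load_history(path: str = HISTORY) -> set:
    """Return the set of already-posted paper IDs (empty if no file yet)."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return set()

def filter_and_record(paper_ids, path: str = HISTORY) -> list:
    """Keep only unseen IDs, then add them to the history file."""
    seen = load_history(path)
    new = [pid for pid in paper_ids if pid not in seen]
    with open(path, "wb") as f:
        pickle.dump(seen | set(new), f)
    return new
```

Deleting the file resets the history; unpickling the set, removing one ID, and re-pickling it drops a single paper.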
I usually connect a Raspberry Pi to the Internet and run `SS_crawler.py` under GNU Screen so that closing the terminal will not terminate the script. After installing Screen (e.g., with `apt` or `yum`), simply start a new screen session with:
$ screen -S the_name_of_your_screen
then run `SS_crawler.py`. For more details, see [5].
For example:
$ python3 SS_crawler_ver1.py -o -q mixed+selectivity+nature -N 1
[1] Semantic Scholar
[2] API | Semantic Scholar
[3] Sending messages using Incoming Webhooks
[4] github.com/kushanon/oreno-curator
[5] Screen User's Manual