#!/usr/bin/env python3
# htmlsave.py
import os
import sys
import argparse
import requests
from datetime import date
"""
The "createfilename()" function uses a specific ID separator of 'XXXXXXX' before and
after the author ID value. Originally, the file would start with the author ID and
have the date string appended to it. However, some author IDs in Google Scholar will
contain hyphens '-' and underscores '_' that create significant errors when trying
to process commands from the Command Line Interface. For example, one author in the
WSDL group has an ID in the form "-eRx..." that is read as an optional argument in
Linux. Prefacing it with \ can work with manually-entered commands, but this is not
as easily implemented when using automated scripts. Therefore, a regular character
separator was added. Currently, 7 sets of 'X' is used as that is unlikely to be seen
in an actual author ID field. The date field is added to identify when the content
has been downloaded from Google Scholar. This provides a record to reflect changes
to the website and content over time. The beginning and ending strings are used to
capture a range of articles due to pagination issues where GS only displays some of
the articles at any time. Originally, GS showed only a certain range of pages, then
the code was revised to display all articles, then it reverted back to showing only
a range of articles/page. These are now set to 4 digits to capture from 0001 - 9999
articles. The decision to use 4 digits instead of 3 digits was to prepare for the
possibility that any user had more than 999 articles listed.
The createURL() function provides the format of the URL to be captured by the program.
Originally, this was within the main code. However, ongoing changes in the HTML code
for the Google Scholar webpage made it necessary to revise the code. Having this as
a separate function facilitates future changes to the URL to be captured..
"""
# This function provides the format of the filename for saving the HTML content.
# The result has the structure XXXXXXX<authorID>XXXXXXX-2021-08-14-0000-0099.html
def createfilename():
    id_separator = 'XXXXXXX'
    end_value = begin_value + 99
    filename = (id_separator + authorID + id_separator + '-' + str(today) + '-' +
                ('%04d' % begin_value) + '-' + ('%04d' % end_value) + '.html')
    return filename
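
# A worked illustration, assuming a hypothetical author ID 'AbCdEfGhIJK', a run
# date of 2021-08-14, and begin_value = 0; createfilename() would then return
# 'XXXXXXXAbCdEfGhIJKXXXXXXX-2021-08-14-0000-0099.html'.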
# This function formats the URL that is used to capture the HTML content.
# The URL requests articles sorted by most recent publication date.
def createURL():
    captureURL = ('https://scholar.google.com/citations?hl=en&user=' + authorID +
                  '&view_op=list_works&sortby=pubdate&cstart=' + str(begin_value) +
                  '&pagesize=' + str(page_size))
    return captureURL
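
# For illustration, the same hypothetical author ID with begin_value = 0 and
# page_size = 100 would yield:
# https://scholar.google.com/citations?hl=en&user=AbCdEfGhIJK&view_op=list_works&sortby=pubdate&cstart=0&pagesize=100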
"""
Originally, the program inquired of the user for an author ID, processed the ID to
capture a webpage, and repeated the loop until the user indicated no further IDs to
be processed. This was revised to run from the command line with author ID arguments
without inquiry to the user. The 'sys.argv' library function is used to capture the
author IDs from the command line. A loop is available to capture 100 articles at a
time as that is the maximum that can be currently displayed in a single GS webpage
within a single URL link. Capturing more would require additional automation.
"""
# Import author IDs from the command line and download Google Scholar webpages.
# The program loops through the arguments to capture one or multiple author IDs.
# The '--output' option and its value are expected as the first two arguments,
# so author IDs begin at sys.argv[3].
parser = argparse.ArgumentParser(description='Specify Output Location')
parser.add_argument('--output', type=str, nargs='?', required=True, help='Output Location')
args, unknownargs = parser.parse_known_args()
save_path = args.output
arguments = len(sys.argv)
today = date.today()
if arguments <= 3:
    sys.stdout.write('No author IDs provided to process ...\n')
for a in range(3, arguments):
    authorID = sys.argv[a]
    sys.stdout.write('Processing Author ID ' + authorID + ' ...\n')
    # Loop through the program to download author ID webpages. Currently, the
    # Google Scholar website allows capturing of up to 100 articles at a time.
    start_value = 0
    begin_value = start_value
    page_size = 100
    page = requests.get(createURL())
"""
The status code of the requests.get function enables the program to verify
if a valid webpage has been received. If an invalid author ID is entered, a
status code '404' is received, and the HTML page is not downloaded. A simple
'302' redirect is allowed when the final page has the correct '200' status code.
The one limitation is that the user must know the current author ID in order
to download articles. In cases where an author's ID has been changed, the
page will redirect in a way that gives a '302' redirect for users who are
logged into the Google Scholar service but gives a '404' for unknown users.
This program will not be recognized as a known user, so it cannot capture
the redirected page even when entering the URL in a browser would redirect
successfully. However, successfully navigating a redirect for changed author
IDs would not be useful because the new page shows only the first 20 articles
irrespective of the range provided in the original URL link. Therefore, it is
beneficial for the user to receive an error and be responsible for identifying
the author's updated ID. Additionally, the program uses a while loop to capture
additional pages of articles until it finds a specific qualifier string. The
string ">There are no articles in this profile.<" is currently the qualifier.
"""
    # The program requests further pages of articles until the qualifier string
    # is found.
    qualifier = '>There are no articles in this profile.<'
    article_test = True
    # The program checks the status code to verify that a valid page was
    # received. A status code of '200' is valid; a '302' redirect to a '200' is
    # normally accepted as well.
    statuscode = page.status_code
    x = 1
    # The program loops to capture articles as long as the qualifier is not
    # found and a status code of '200' is registered.
    while statuscode == 200 and article_test:
        new_filename = createfilename()
        complete_Name = os.path.join(save_path, new_filename)
        if os.path.exists(complete_Name):
            sys.stdout.write('Overwriting existing file with same name ...\n')
        else:
            sys.stdout.write('Creating new file ...\n')
        with open(complete_Name, 'wb') as html_file:
            html_file.write(page.content)
        sys.stdout.write('File saved as "' + complete_Name + '"\n')
        begin_value = begin_value + 100
        page = requests.get(createURL())
        new_test = page.text
        if qualifier in new_test:
            article_test = False
        statuscode = page.status_code
        x = x + 1
    # The program notifies the user when no further articles are found within a
    # valid GS page.
    if statuscode == 200 and not article_test:
        sys.stdout.write('There are no more articles to capture ...\n')
    # The program notifies the user when an invalid page is returned.
    if statuscode != 200:
        sys.stderr.write('Incorrect author ID or inaccessible webpage.\n')
        continue