forked from oduwsdl/Scholar-Groups
#!/usr/bin/env python3
#ukvsconvert.py
import sys
import argparse
import fileinput
import json
import re
"""
This program imports entries structured in a dictionary-type key/value format (UKVS) and
converts the entries to JSON, BIBTEX, Markdown, or HTML. The program is run from the Command
Line Interface and can read from STDIN or accept a designated file as input. The arguments
--json, --bibtex, --md, or --html identify the desired output, which is written to STDOUT and
can be redirected to a file.
The argparse library is imported to recognize and interpret the specified arguments.
The createjson() function is designed to import the UKVS entries and convert them into a JSON
recognized format. Normal JSON conventions are used, such as brackets for an [array], and a
space after the colon in the "Key": "Value" pairings. The function is separated from the main
portion of code to facilitate any future need for revision. Currently, the function uses
simple string replacement to extract and format the information. Originally, an attempt was
made to parse the items with a dictionary library, but this created errors if quotation marks
were part of a title, which was the case in at least one instance. For example, an article
title of "Why are some files "lost" in the cloud?" would often be identified as two fields
instead of one. Because titles and source information listings often contain quotation marks
and colons, it seemed that a dictionary parser was not a good choice.
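
As an illustration of the quoting problem described above, here is a minimal sketch that is
not part of this program. It assumes the standard json module is acceptable for output and
uses the sample title from the paragraph above; the variable names are hypothetical.

```python
import json

# Hypothetical sample: the title mentioned above, with embedded quotation marks.
title = 'Why are some files "lost" in the cloud?'

# json.dumps escapes the inner quotation marks, so the title stays a single field.
encoded = json.dumps({"Title": title})
print(encoded)

# Decoding the result recovers the original title unchanged.
decoded = json.loads(encoded)
print(decoded["Title"] == title)
```

The string-replacement approach in createjson() has to produce the same kind of escaping by
hand, which is why the title field receives special treatment there.
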
The createbibtex() function imports the UKVS entries and converts them to BIBTEX format. In
this case, all entries are specified as "@misc" types as there is no easy way to identify
the actual type of entry from the often-abbreviated notation used by Google Scholar. There
are two deviations from normal BIBTEX conventions: (1) Instead of only having one set of
curly braces around all authors, each author is also in braces. This was done so that any
BIBTEX interpreter would not convert initials to lowercase as if they were a full name.
Thus, the field entry is "author = {{ML Nelson} and {MC Weigle} and {SM Jones}}" here.
(2) A comma is added at the end of every entry even though it is not required, because it
makes the entries easier to identify if more fields are appended later.
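
The author bracing described in deviation (1) can be sketched as follows. This is only an
illustration under the assumption that authors arrive as one comma-separated string; the
helper name brace_authors is hypothetical and does not exist in this program.

```python
def brace_authors(authors):
    # Hypothetical helper: split the comma-separated author string into names.
    names = [name.strip() for name in authors.split(",")]
    # A single author needs only the outer braces.
    if len(names) == 1:
        return "author = {" + names[0] + "}"
    # Several authors: brace each name so initials keep their capitalization.
    return "author = {" + " and ".join("{" + n + "}" for n in names) + "}"

print(brace_authors("ML Nelson, MC Weigle, SM Jones"))
# author = {{ML Nelson} and {MC Weigle} and {SM Jones}}
```
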
The createhtml() function imports the UKVS entries and converts them to an HTML style list.
Because HTML files require lines of code before and after a list to identify document type,
style, title, head, and body, the function is more involved than the other functions. The
program does not currently use a specified style, but the corresponding line remains in the
code, commented out, so that a user can easily enable that functionality. Additionally, the
program allows the user to specify the title for the webpage, if desired, using the optional
"--title" argument followed by the title text. All list entries are enclosed within <ol> tags
to identify them as an ordered list.
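
A minimal sketch of the ordered-list markup follows. It is not the code used by createhtml():
the function name render_items, the tuple layout, and the sample entry are assumptions, and
the html.escape calls are an addition that this program does not perform.

```python
from html import escape

def render_items(entries):
    # entries: hypothetical list of (authors, url, title, source) tuples.
    lines = ['<ol>']
    for authors, url, title, source in entries:
        # Escape each field so quotation marks and angle brackets cannot break the markup.
        link = '<a href="' + escape(url) + '">' + escape(title) + '</a>'
        lines.append('<li>' + escape(authors) + ', <b>' + link + '</b>, ' + escape(source) + '.</li>')
    lines.append('</ol>')
    return lines

# Hypothetical sample entry; the URL and source are placeholders.
sample = [('ML Nelson, MC Weigle', 'https://example.org/paper',
           'Why are some files "lost" in the cloud?', 'Example Journal')]
for line in render_items(sample):
    print(line)
```
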
"""
# This function converts the entries in a UKVS file to the conventional JSON format. Each
# entry is identified with a hash of the title followed by the year of publication.
def createjson():
    sys.stdout.write('{\n' + '"Article Results": [')
    inp = fileinput.input(args.inputfile)
    for idx,line in enumerate(inp):
        # Separate consecutive entries with a comma
        if idx > 0:
            sys.stdout.write(',\n')
        item_hash,item_year,item_list = line.split(' ', 2)
        item_list = item_list.replace('":"', '": "')
        directURL, title, authors, source, citedby, citations, pageyear = item_list.split('", "')
        directURL = directURL.replace('{ "', '"')
        directURL = (directURL + '"')
        # Escape quotation marks that appear inside the title
        fixed = re.sub(r'"([\s\w-]*)"([\s\w-]*)"',r'"\1\"\2\"', title)
        fixed = re.sub(r'Van de Sompel Herbert,"', r'Van de Sompel Herbert,\"', fixed)
        title = ('"' + fixed + '"')
        authors = ('"' + authors + '"')
        # Multiple authors are written as a JSON array
        if ',' in authors:
            authors = authors.replace(', ', '",\n "')
            authors = authors.replace('": "', '": [\n "')
            authors = (authors + '\n ]')
        source = ('"' + source + '"')
        citedby = ('"' + citedby + '"')
        if citations == '": "':
            citations = citations.replace('": "', '": ')
            citations = ('"' + citations)
        else:
            citations = ('"' + citations + '"')
        if pageyear == '": "':
            pageyear = pageyear.replace('": "', '": ')
            pageyear = pageyear.replace('"}', '')
            pageyear = ('"' + pageyear)
        else:
            pageyear = ('"' + pageyear)
        json_entry = (' \n{ \n ' + directURL + ',\n ' +
                      title + ',\n ' + authors + ',\n ' + source + ',\n' +
                      citedby + ',\n ' + citations + ',\n ' + pageyear )
        sys.stdout.write(json_entry)
    sys.stdout.write(']' + '}')
# This function converts the entries in a UKVS file to the conventional BIBTEX format. Each
# entry is identified as "@misc" type in the current configuration. Additionally, when the
# author field has multiple authors, each author is also enclosed in curly braces {}.
def createbibtex():
    for line in fileinput.input(args.inputfile):
        #print(line)
        item_hash,item_year,item_list = line.split(' ', 2)
        directURL, title, authors, source, citedby, citations, pageyear = item_list.split('", "')
        directURL = directURL.replace('{ "DirectURL":"', 'url = {')
        title = title.replace('Title":"', 'title = {')
        authors = authors.replace('Authors":"', 'author = {')
        # Wrap each author in its own braces when there is more than one
        if ',' in authors:
            authors = (authors + '}')
            authors = authors.replace('{', '{{')
            authors = authors.replace(', ', '} and {')
        source = source.replace('Source":"', 'howpublished = {')
        pageyear = pageyear.replace('PageYear":"', 'date = {')
        pageyear = pageyear.replace('"}\n', '')
        bibtex_entry = ('@misc{' + item_hash + ':' + item_year + ',\n ' +
                        title + '},\n ' + authors + '},\n ' +
                        pageyear + '},\n ' + source + '},\n ' +
                        directURL + '},\n},\n') # A non-conventional comma ends entries
        sys.stdout.write(bibtex_entry +"\n")
# This function converts the entries in a UKVS file to a Markdown list. Depending on the
# --list argument, entries are optionally grouped under year headings.
def createmd():
    sys.stdout.write('# ' + args.title +'\n')
    #sys.stdout.write('<p> </p>\n')
    entries = []
    start = float("inf")
    end = -float("inf")
    for line in fileinput.input(args.inputfile):
        item_hash,item_year,item_list = line.split(' ', 2)
        directURL, title, authors, source, citedby, citations, pageyear = item_list.split('", "')
        directURL = directURL.replace('{ "DirectURL":"', '')
        title = title.replace('Title":"', '')
        authors = authors.replace('Authors":"', '')
        citedby = citedby.replace('CitedBy":"', '')
        pageyear = pageyear.replace('PageYear":"', '')
        pageyear = pageyear.replace('"}\n', '')
        # Fall back to the page year when the source field is empty. This check must run
        # before the 'Source":"' prefix is stripped; otherwise it can never match.
        if source == 'Source":"':
            source = pageyear
        else:
            source = source.replace('Source":"', '')
        try:
            entries.append((authors, directURL, title, source, int(pageyear)))
            start = min(int(pageyear), start)
            end = max(int(pageyear), end)
        except ValueError: # YEAR NOT PROVIDED - WILL FIX
            entries.append((authors, directURL, title, source, 0))
    # Explicit --startyear/--endyear values override the range found in the data
    if args.startyear:
        start = int(args.startyear)
    if args.endyear:
        end = int(args.endyear)
    if args.list == 'all':
        prevyear = None
        for item in entries:
            year = int(item[4])
            if year < start or year > end:
                continue
            if year != prevyear:
                sys.stdout.write('## ' + str(year) + '\n')
            sys.stdout.write('1. ' + item[0] + ', <b><a href="' + item[1] + '">' + item[2] + '</a></b>, ' + \
                item[3] + '.<p> </p>\n')
            prevyear = year
    elif args.list == '1':
        prevyear = None
        for item in entries:
            year = int(item[4])
            if year < start or year > end:
                continue
            if year != prevyear:
                sys.stdout.write(' ## ' + str(year) + '\n')
            sys.stdout.write('1. ' + item[0] + ', <b><a href="' + item[1] + '">' + item[2] + '</a></b>, ' + \
                item[3] + '.<p> </p>\n')
            prevyear = year
    elif args.list =='none' or args.list is None:
        for item in entries:
            year = int(item[4])
            if year < start or year > end:
                continue
            sys.stdout.write('1. ' + item[0] + ', <b><a href="' + item[1] + '">' + item[2] + '</a></b>, ' + \
                item[3] + '.<p> </p>\n')
# This function converts the entries in a UKVS file to an HTML page containing ordered
# lists. Depending on the --list argument, entries are optionally grouped by year.
def createhtml():
    sys.stdout.write('<html>\n')
    sys.stdout.write('<head>\n')
    sys.stdout.write('<title>' + args.title + '</title>\n')
    sys.stdout.write('</head>\n') # close <head> before <body> opens
    sys.stdout.write('<body bgcolor="white">\n')
    sys.stdout.write('<h2>' + args.title + '</h2>\n')
    sys.stdout.write('<p> </p>\n')
    entries = []
    start = float("inf")
    end = -float("inf")
    for line in fileinput.input(args.inputfile):
        item_hash,item_year,item_list = line.split(' ', 2)
        directURL, title, authors, source, citedby, citations, pageyear = item_list.split('", "')
        directURL = directURL.replace('{ "DirectURL":"', '')
        title = title.replace('Title":"', '')
        authors = authors.replace('Authors":"', '')
        citedby = citedby.replace('CitedBy":"', '')
        pageyear = pageyear.replace('PageYear":"', '')
        pageyear = pageyear.replace('"}\n', '')
        # Fall back to the page year when the source field is empty
        if source == 'Source":"':
            source = pageyear
        else:
            source = source.replace('Source":"', '')
        try:
            entries.append((authors, directURL, title, source, int(pageyear)))
            start = min(int(pageyear), start)
            end = max(int(pageyear), end)
        except ValueError: # YEAR NOT PROVIDED - WILL FIX
            entries.append((authors, directURL, title, source, 0))
    # Explicit --startyear/--endyear values override the range found in the data
    if args.startyear:
        start = int(args.startyear)
    if args.endyear:
        end = int(args.endyear)
    if args.list == 'all':
        # One ordered list per year, so numbering restarts under each heading
        prevyear = None
        for item in entries:
            year = int(item[4])
            if year < start or year > end:
                continue
            if year != prevyear:
                if prevyear is not None:
                    sys.stdout.write("</ol>")
                sys.stdout.write('<h2>' + str(year) + '</h2>\n<ol>\n')
            sys.stdout.write('<li>' + item[0] + ', <b><a href="' + item[1] + '">' + item[2] + '</a></b>, ' + \
                item[3] + '.<p> </p></li>\n')
            prevyear = year
        # Close the list opened for the last year, if any entry was written
        if prevyear is not None:
            sys.stdout.write("</ol>\n")
    elif args.list == '1':
        # A single ordered list, so numbering continues across year headings
        prevyear = None
        sys.stdout.write("<ol>\n")
        for item in entries:
            year = int(item[4])
            if year < start or year > end:
                continue
            if year != prevyear:
                sys.stdout.write('<h2>' + str(year) + '</h2>\n')
            sys.stdout.write('<li>' + item[0] + ', <b><a href="' + item[1] + '">' + item[2] + '</a></b>, ' + \
                item[3] + '.<p> </p></li>\n')
            prevyear = year
        sys.stdout.write("</ol>")
    elif args.list =='none' or args.list is None:
        # A single ordered list without year headings
        sys.stdout.write("<ol>\n")
        for item in entries:
            year = int(item[4])
            if year < start or year > end:
                continue
            sys.stdout.write('<li>' + item[0] + ', <b><a href="' + item[1] + '">' + item[2] + '</a></b>, ' + \
                item[3] + '.<p> </p></li>\n')
        sys.stdout.write("</ol>\n")
    sys.stdout.write('</body>\n')
    sys.stdout.write('</html>\n')
"""
The program is designed to be run from the Command Line Interface. The argparse library
is imported to define and recognize arguments. Currently, --json, --bibtex, --md, and --html
are the four options for exported formats. Although convention indicates that "--" on the
front of an argument makes it optional, these four formats are configured so that one
argument is required, and only one may be selected. An optional "--title" argument is
included so that the user may designate the title of the page; this is only useful when the
--md or --html option is selected. The optional "--startyear" and "--endyear" arguments
restrict the output to a range of years, and the optional "--list" argument ('all', '1', or
'none') selects how entries are grouped into ordered lists. These three arguments also apply
only to the --md and --html options.
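
The required, mutually exclusive format options can be sketched on their own as follows.
This is a reduced illustration rather than the parser defined below: it lists only two of the
formats, parses an explicit argument list instead of the real command line, and the file name
entries.ukvs is a placeholder.

```python
import argparse

parser = argparse.ArgumentParser(description='Sketch of the format options')
# Exactly one of the format flags must be given.
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('--json', action='store_true')
group.add_argument('--html', action='store_true')
parser.add_argument('--title', default='')
parser.add_argument('--startyear', type=int)
parser.add_argument('inputfile', nargs='?')

# Parse an explicit list so the sketch does not depend on sys.argv.
demo = parser.parse_args(['--html', '--title', 'Publications',
                          '--startyear', '2015', 'entries.ukvs'])
print(demo.html, demo.json, demo.title, demo.startyear, demo.inputfile)
# True False Publications 2015 entries.ukvs
```

Passing both --json and --html to this parser makes argparse print a usage error and exit.
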
"""
parser = argparse.ArgumentParser(description='Converts UKVS file to selected filetype')
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('--json', action='store_true', help='Converts to JSON format')
group.add_argument('--bibtex', action='store_true', help='Converts to BIBTEX format')
group.add_argument('--md', action='store_true', help='Converts to Markdown format')
group.add_argument('--html', action='store_true', help='Converts to HTML format')
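# An empty default title avoids a TypeError when --title is omitted with --md or --html.
parser.add_argument('--title', type=str, default='', help='Provides title for the Markdown or HTML page if desired')
parser.add_argument('--startyear', type=str, help='Only include entries from this year onward')
parser.add_argument('--endyear', type=str, help='Only include entries up to this year')
parser.add_argument('--list', type=str, help="Ordered list layout: 'all', '1', or 'none'")
# fileinput treats '-' as STDIN; without this default it would try to open the
# command-line flags (sys.argv[1:]) as files when no input file is given.
parser.add_argument('inputfile', type=str, nargs='?', default='-', help='enter the UKVS file name')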
args = parser.parse_args()
# The user can select the format argument option to be called.
if args.json:
    createjson()
elif args.bibtex:
    createbibtex()
elif args.md:
    createmd()
elif args.html:
    createhtml()