"KeyError: '@@'" raises while processing large corpora #1

ManiaCello · 2019-02-26T22:26:48Z

Hello! I've been trying to select data from a large corpus of 21M sentences, using a representative corpus of 100k sentences, and hit a "KeyError: '@@'" exception.
I ran the script with the following parameters:

./cynical-selection.py --task ../ds/100k.google --unadapted ../ds/10M.os.en --no-lower --batch

Full text of the exception:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "./cynical-selection.py", line 695, in main_loop
    unadapted_squish = squish_corpus(unadapted_data, replace)
  File "./cynical-selection.py", line 297, in squish_corpus
    squished.append(' '.join([replace[token] for token in line.split()]))
  File "./cynical-selection.py", line 297, in
    squished.append(' '.join([replace[token] for token in line.split()]))
KeyError: '@@'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./cynical-selection.py", line 819, in
    main()
  File "./cynical-selection.py", line 798, in main
    selected = threading_wrapper(task_data, unadapted_data, args)
  File "./cynical-selection.py", line 775, in threading_wrapper
    zip(repeat(task_data), parts_list, repeat(args)))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 296, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
KeyError: '@@'

I also tried running the program with a 10M-sentence general corpus, but that didn't resolve the issue. It runs without errors on corpora of 2M sentences or fewer.
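
To illustrate the failure pattern, here is a minimal sketch of the expression from line 297 of the traceback. It assumes `replace` is a plain dict mapping each token to its replacement; how the script actually builds that table is not shown in this report. The helper names `squish_line`, `squish_line_safe`, and `missing_tokens` are hypothetical and do not come from cynical-selection.py. The sketch only shows how a token absent from the table (here '@@') produces the error, plus one possible way to list such tokens or pass them through unchanged. Whether passing them through is the right fix for the selection algorithm is an open question.

```python
def squish_line(line, replace):
    """Same expression as in the traceback: a strict dict lookup per token."""
    return ' '.join([replace[token] for token in line.split()])


def squish_line_safe(line, replace):
    """Hypothetical defensive variant: tokens missing from the table are kept as-is."""
    return ' '.join([replace.get(token, token) for token in line.split()])


def missing_tokens(lines, replace):
    """Collect every token that would raise KeyError in squish_line."""
    return {token for line in lines for token in line.split() if token not in replace}


# Toy lookup table (an assumption for illustration) with no entry for '@@'.
replace = {'hello': 'hello', 'wor': 'wor', 'ld': 'ld'}
lines = ['hello wor @@ ld']

try:
    squish_line(lines[0], replace)
except KeyError as err:
    print('KeyError:', err)

print(missing_tokens(lines, replace))
print(squish_line_safe(lines[0], replace))
```

Running `missing_tokens` over the unadapted corpus against the table might help narrow down whether '@@' is the only token missing or one of several.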

10M.os.en-100k.google-20190226_2152.log.zip
