"KeyError: '@@'" raises while processing large corpora #1

Open
ManiaCello opened this issue Feb 26, 2019 · 0 comments

Hello! I've been trying to select data from a large corpus of 21M sentences, using a representative corpus of 100k sentences, and ran into a "KeyError: '@@'" exception.
I ran the script with the following parameters:

./cynical-selection.py --task ../ds/100k.google --unadapted ../ds/10M.os.en --no-lower --batch

Full text of the exception:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "./cynical-selection.py", line 695, in main_loop
    unadapted_squish = squish_corpus(unadapted_data, replace)
  File "./cynical-selection.py", line 297, in squish_corpus
    squished.append(' '.join([replace[token] for token in line.split()]))
  File "./cynical-selection.py", line 297, in <listcomp>
    squished.append(' '.join([replace[token] for token in line.split()]))
KeyError: '@@'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./cynical-selection.py", line 819, in <module>
    main()
  File "./cynical-selection.py", line 798, in main
    selected = threading_wrapper(task_data, unadapted_data, args)
  File "./cynical-selection.py", line 775, in threading_wrapper
    zip(repeat(task_data), parts_list, repeat(args)))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 296, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
KeyError: '@@'

I also tried running the program with a 10M-sentence general corpus, but that didn't resolve the issue. It runs without errors on corpora of 2M sentences or fewer.
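
The failing line in the traceback is a per-token lookup, `replace[token]`, so the error means the token '@@' has no entry in `replace` for at least one chunk. A minimal sketch of how the missing tokens could be listed, assuming `replace` is a plain dict keyed by token (the helper `find_missing_tokens` and the sample data are hypothetical and not taken from cynical-selection.py):

```python
from collections import Counter


def find_missing_tokens(lines, replace):
    """Count whitespace-separated tokens that have no entry in `replace`."""
    missing = Counter()
    for line in lines:
        for token in line.split():
            if token not in replace:
                missing[token] += 1
    return missing


# Hypothetical sample data; the real mapping is built inside the script.
replace = {"hello": "hello", "world": "world"}
lines = ["hello world", "hello @@ world", "@@ @@"]

missing = find_missing_tokens(lines, replace)
print(missing.most_common())
```

Running a check like this on each chunk before the lookup might show whether '@@' is the only token without an entry or just the first one encountered.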

10M.os.en-100k.google-20190226_2152.log.zip
