protobuf DecodeError when updating a (weird) document #8

Open
f11r opened this issue Dec 20, 2017 · 4 comments

f11r commented Dec 20, 2017

Running the following code leads to a google.protobuf.message.DecodeError: Error parsing message.

Admittedly, this is a very weird document: a Chinese(?) text snuck into my English pipeline and caused this error. I iteratively removed parts of the document until I arrived at the smallest one for which my pipeline still produced the error, and then serialized it to a string. I'm therefore not certain whether the final document could be reduced further while still causing the error.

I'm using version 3.8.0 of corenlp, python-stanford-corenlp and corenlp-protobuf, protobuf 3.5.0.post1.

import atexit
import corenlp
from corenlp_protobuf import Document

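# Send an already tokenized and tagged Document and request only the parser.
client = corenlp.CoreNLPClient(
    annotators=['parse'],
    properties={'annotators': 'parse',
                # Skip CoreNLP's check that the annotators 'parse' depends on are present.
                'enforceRequirements': False,
                # Exchange protobuf-serialized annotations instead of plain text.
                'inputFormat': 'serialized',
                'outputFormat': 'serialized',
                'serializer': ('edu.stanford.nlp.pipeline.' 'ProtobufAnnotationSerializer')
                },
)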
atexit.register(client.stop)
client.start()

doc = Document.FromString(
    b"\n\xfc\x012\t\xe3\x80\x8d r-o-s  \xe5\x8b\x95\xe3\x81\x97\xe3\x81\x84\n\xe4\xbe\x9b\n/\n\xe3\x82\xb9\n\xe2\x96\xa0\t\xe5\x8c\x96\n\xe9\x9a\x9b\n\xe3\x81\xae\xe7\x99\xba\xe8\xb6\xb3\n\xe2\x96\xa0\t\xe3\x83\xb3\n\xe8\xa1\x93\n\xe3\x81\xae\xe7\x99\xba\xe4\xbf\xa1\n\xe2\x96\xa0\t\xe5\x8b\x95\n\xe3\x83\xbc\n\xe3\x83\xac\n\xe6\x9d\x90\xe6\x97\x85\xe8\xa1\x8c\n\xe2\x96\xa0\t\xe5\x87\xba\n\xe8\xab\x87\n\xe9\x80\xa3\xe6\x83\x85\xe5\xa0\xb1\n\xe2\x96\xa0\t\xe4\xbf\xae\n\xe4\xbf\xae\n\xe4\xb8\xbb\xe8\xa6\x81\xe6\xa5\xad\n\xe3\x80\x81U (\xe3\x82\xa1\xe3\x83\xb3\xe3\x83\x89) \xe3\x81\x8b\n0\xe3\x83\x88\xe3\x81\xab\xe6\x8b\xa0\n\xe3\x80\x81\xe3\x81\x8a\n\xe3\x81\x88\n\xe7\xad\x89\n\xe3\x82\x8b\n\xe3\x80\x81\n\xe3\x83\x9aR \xe5\xb9\xb4\n\xe6\xad\xb3)\n\xe3\x81\x84\n\xe7\x85\xa7\nw.de\n.\n\xe5\xae\xb9\nd  9  :3\n5\nW\n\xe3\x82\x92\n\x12\xba\x11\n\x0f\n\x012\x12\x02CD\x1a\x0122\x00R\x012\n\x0f\n\x01\t\x12\x02SP\x1a\x01\t2\x00R\x01\t\n\x17\n\x03\xe3\x80\x8d\x12\x03NFP\x1a\x03\xe3\x80\x8d2\x01 R\x03\xe3\x80\x8d\n\x0f\n\x01r\x12\x02NN\x1a\x01r2\x00R\x01r\n\x11\n\x01-\x12\x04HYPH\x1a\x01-2\x00R\x01-\n\x0f\n\x01o\x12\x02NN\x1a\x01o2\x00R\x01o\n\x11\n\x01-\x12\x04HYPH\x1a\x01-2\x00R\x01-\n\x11\n\x01s\x12\x03POS\x1a\x01s2\x01 R\x01s\n\x0f\n\x01 \x12\x02SP\x1a\x01 2\x00R\x01 
\n'\n\t\xe5\x8b\x95\xe3\x81\x97\xe3\x81\x84\x12\x02NN\x1a\t\xe5\x8b\x95\xe3\x81\x97\xe3\x81\x842\x00R\t\xe5\x8b\x95\xe3\x81\x97\xe3\x81\x84\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe4\xbe\x9b\x12\x02XX\x1a\x03\xe4\xbe\x9b2\x00R\x03\xe4\xbe\x9b\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x10\n\x01/\x12\x03SYM\x1a\x01/2\x00R\x01/\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe3\x82\xb9\x12\x02NN\x1a\x03\xe3\x82\xb92\x00R\x03\xe3\x82\xb9\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe2\x96\xa0\x12\x02NN\x1a\x03\xe2\x96\xa02\x00R\x03\xe2\x96\xa0\n\x0f\n\x01\t\x12\x02SP\x1a\x01\t2\x00R\x01\t\n\x15\n\x03\xe5\x8c\x96\x12\x02NN\x1a\x03\xe5\x8c\x962\x00R\x03\xe5\x8c\x96\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe9\x9a\x9b\x12\x02NN\x1a\x03\xe9\x9a\x9b2\x00R\x03\xe9\x9a\x9b\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n'\n\t\xe3\x81\xae\xe7\x99\xba\xe8\xb6\xb3\x12\x02NN\x1a\t\xe3\x81\xae\xe7\x99\xba\xe8\xb6\xb32\x00R\t\xe3\x81\xae\xe7\x99\xba\xe8\xb6\xb3\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe2\x96\xa0\x12\x02NN\x1a\x03\xe2\x96\xa02\x00R\x03\xe2\x96\xa0\n\x0f\n\x01\t\x12\x02SP\x1a\x01\t2\x00R\x01\t\n\x15\n\x03\xe3\x83\xb3\x12\x02NN\x1a\x03\xe3\x83\xb32\x00R\x03\xe3\x83\xb3\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe8\xa1\x93\x12\x02NN\x1a\x03\xe8\xa1\x932\x00R\x03\xe8\xa1\x93\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n'\n\t\xe3\x81\xae\xe7\x99\xba\xe4\xbf\xa1\x12\x02NN\x1a\t\xe3\x81\xae\xe7\x99\xba\xe4\xbf\xa12\x00R\t\xe3\x81\xae\xe7\x99\xba\xe4\xbf\xa1\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe2\x96\xa0\x12\x02NN\x1a\x03\xe2\x96\xa02\x00R\x03\xe2\x96\xa0\n\x0f\n\x01\t\x12\x02SP\x1a\x01\t2\x00R\x01\t\n\x15\n\x03\xe5\x8b\x95\x12\x02NN\x1a\x03\xe5\x8b\x952\x00R\x03\xe5\x8b\x95\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe3\x83\xbc\x12\x02NN\x1a\x03\xe3\x83\xbc2\x00R\x03\xe3\x83\xbc\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\
xe3\x83\xac\x12\x02NN\x1a\x03\xe3\x83\xac2\x00R\x03\xe3\x83\xac\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n'\n\t\xe6\x9d\x90\xe6\x97\x85\xe8\xa1\x8c\x12\x02NN\x1a\t\xe6\x9d\x90\xe6\x97\x85\xe8\xa1\x8c2\x00R\t\xe6\x9d\x90\xe6\x97\x85\xe8\xa1\x8c\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe2\x96\xa0\x12\x02NN\x1a\x03\xe2\x96\xa02\x00R\x03\xe2\x96\xa0\n\x0f\n\x01\t\x12\x02SP\x1a\x01\t2\x00R\x01\t\n\x15\n\x03\xe5\x87\xba\x12\x02NN\x1a\x03\xe5\x87\xba2\x00R\x03\xe5\x87\xba\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe8\xab\x87\x12\x02NN\x1a\x03\xe8\xab\x872\x00R\x03\xe8\xab\x87\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n'\n\t\xe9\x80\xa3\xe6\x83\x85\xe5\xa0\xb1\x12\x02NN\x1a\t\xe9\x80\xa3\xe6\x83\x85\xe5\xa0\xb12\x00R\t\xe9\x80\xa3\xe6\x83\x85\xe5\xa0\xb1\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe2\x96\xa0\x12\x02NN\x1a\x03\xe2\x96\xa02\x00R\x03\xe2\x96\xa0\n\x0f\n\x01\t\x12\x02SP\x1a\x01\t2\x00R\x01\t\n\x15\n\x03\xe4\xbf\xae\x12\x02NN\x1a\x03\xe4\xbf\xae2\x00R\x03\xe4\xbf\xae\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe4\xbf\xae\x12\x02NN\x1a\x03\xe4\xbf\xae2\x00R\x03\xe4\xbf\xae\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n'\n\t\xe4\xb8\xbb\xe8\xa6\x81\xe6\xa5\xad\x12\x02NN\x1a\t\xe4\xb8\xbb\xe8\xa6\x81\xe6\xa5\xad2\x00R\t\xe4\xb8\xbb\xe8\xa6\x81\xe6\xa5\xad\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x19\n\x04\xe3\x80\x81U\x12\x02NN\x1a\x04\xe3\x80\x81U2\x01 R\x04\xe3\x80\x81u\n\x12\n\x01(\x12\x05-LRB-\x1a\x01(2\x00R\x01(\n'\n\t\xe3\x82\xa1\xe3\x83\xb3\xe3\x83\x89\x12\x02NN\x1a\t\xe3\x82\xa1\xe3\x83\xb3\xe3\x83\x892\x00R\t\xe3\x82\xa1\xe3\x83\xb3\xe3\x83\x89\n\x13\n\x01)\x12\x05-RRB-\x1a\x01)2\x01 
R\x01)\n\x15\n\x03\xe3\x81\x8b\x12\x02XX\x1a\x03\xe3\x81\x8b2\x00R\x03\xe3\x81\x8b\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n*\n\n0\xe3\x83\x88\xe3\x81\xab\xe6\x8b\xa0\x12\x02XX\x1a\n0\xe3\x83\x88\xe3\x81\xab\xe6\x8b\xa02\x00R\n0\xe3\x83\x88\xe3\x81\xab\xe6\x8b\xa0\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x1e\n\x06\xe3\x80\x81\xe3\x81\x8a\x12\x02XX\x1a\x06\xe3\x80\x81\xe3\x81\x8a2\x00R\x06\xe3\x80\x81\xe3\x81\x8a\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe3\x81\x88\x12\x02XX\x1a\x03\xe3\x81\x882\x00R\x03\xe3\x81\x88\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe7\xad\x89\x12\x02XX\x1a\x03\xe7\xad\x892\x00R\x03\xe7\xad\x89\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe3\x82\x8b\x12\x02XX\x1a\x03\xe3\x82\x8b2\x00R\x03\xe3\x82\x8b\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x14\n\x03\xe3\x80\x81\x12\x01.\x1a\x03\xe3\x80\x812\x00R\x03\xe3\x80\x81\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x1a\n\x04\xe3\x83\x9aR\x12\x03NNP\x1a\x04\xe3\x83\x9aR2\x01 R\x04\xe3\x83\x9ar\n\x16\n\x03\xe5\xb9\xb4\x12\x03NNP\x1a\x03\xe5\xb9\xb42\x00R\x03\xe5\xb9\xb4\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe6\xad\xb3\x12\x02LS\x1a\x03\xe6\xad\xb32\x00R\x03\xe6\xad\xb3\n\x12\n\x01)\x12\x05-RRB-\x1a\x01)2\x00R\x01)\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe3\x81\x84\x12\x02NN\x1a\x03\xe3\x81\x842\x00R\x03\xe3\x81\x84\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe7\x85\xa7\x12\x02XX\x1a\x03\xe7\x85\xa72\x00R\x03\xe7\x85\xa7\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x19\n\x04w.de\x12\x03ADD\x1a\x04w.de2\x00R\x04w.de\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x0e\n\x01.\x12\x01.\x1a\x01.2\x00R\x01.\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe5\xae\xb9\x12\x02NN\x1a\x03\xe5\xae\xb92\x00R\x03\xe5\xae\xb9\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x10\n\x01d\x12\x02XX\x1a\x01d2\x01 R\x01d\n\x0f\n\x01 \x12\x02SP\x1a\x01 2\x00R\x01 
\n\x10\n\x019\x12\x02CD\x1a\x0192\x01 R\x019\n\x0f\n\x01 \x12\x02SP\x1a\x01 2\x00R\x01 \n\x12\n\x02:3\x12\x02CD\x1a\x02:32\x00R\x02:3\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x0f\n\x015\x12\x02CD\x1a\x0152\x00R\x015\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x10\n\x01W\x12\x03NNP\x1a\x01W2\x00R\x01w\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\n\x15\n\x03\xe3\x82\x92\x12\x02NN\x1a\x03\xe3\x82\x922\x00R\x03\xe3\x82\x92\n\x0f\n\x01\n\x12\x02SP\x1a\x01\n2\x00R\x01\n\x10\x00\x18i")
client.update(doc)
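
The manual reduction described above could also be automated. The following is only a minimal sketch of a greedy shrinking loop, not a full delta-debugging implementation, and its result is not guaranteed to be the smallest failing input. The names `shrink` and `still_fails` are hypothetical. For this issue, `still_fails` would be a user-written predicate that rebuilds a Document from the remaining tokens, calls `client.update`, and returns True when a DecodeError is raised.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def shrink(items: Sequence[T], still_fails: Callable[[List[T]], bool]) -> List[T]:
    """Greedily drop chunks of `items` while `still_fails` keeps returning True."""
    current = list(items)
    chunk = max(len(current) // 2, 1)
    while True:
        i = 0
        while i < len(current):
            # Try the input with current[i:i + chunk] removed.
            candidate = current[:i] + current[i + chunk:]
            if candidate and still_fails(candidate):
                current = candidate  # the chunk was not needed to trigger the failure
            else:
                i += chunk           # the chunk is needed; keep it and move on
        if chunk == 1:
            break
        chunk //= 2                  # retry with smaller chunks
    return current


if __name__ == "__main__":
    # Toy predicate: the "failure" needs both 3 and 17 to be present.
    print(shrink(range(20), lambda xs: 3 in xs and 17 in xs))
```

Each call to the predicate costs one round trip to the server, so starting with large chunks keeps the number of requests down when most of the document is irrelevant to the error.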

f11r commented Jan 2, 2018

The same problem seems to appear when parsing lists like the following one (from Wikipedia). That makes the problem much worse, since it isn't limited to texts in the wrong language.

Post
*Hy Finn – Standard Oil of California
*Indian Maiden – Land-O-Lake Butter/Dairy Products
*Jack – Jack-In-The-Box Restaurants
*Jeff – Burger Chef Restaurants
*Jello-O Baby – Jell-O
*Joe Camel – Camel Brand Cigarettes
*John H. Goodwill – Goodwill
*Jolly Green Giant – Green Giant Vegetables 
*Josephine The Plumber – Comet Cleanser
*Juan Valdez – National Federation of Coffee Growers
*Kedso The Clown – Keds Brand Sneakers
*Keebler Elves (Ernie The Elf) – Keebler Foods (Crackers)
*Kenner Gooney Bird – Kenner’s Toy Company
*Klondike The Polar Bear – Klondike Ice Cream Bar
*Kool Aid Man – Kool-Aid Drink Mix
*Kool Penguin – Kool Cigarettes
*Little Caesar – Little Caesar Pizza
*Little Foster – Foster’s Freeze Restaurants
*Little Mikey (the Freckly-faced Kid) – Life Cereal
*Little Miss Sunbeam – Sunbeam Bread
*Lucky The Leprechaun – General Mill’s Lucky Charms Cereal
*Madge The Manicurist – Palmolive Dish Washing Detergent
*Marky Maypo – Maypo Cereal
*Marlboro Man –

arunchaganty self-assigned this Jan 22, 2018

arunchaganty (Contributor) commented

@f11r I think the problem is that you have enforceRequirements set to False, which is probably causing the CoreNLPServer to fail. After observing your error, I ran the following:

client = corenlp.CoreNLPClient(timeout=10000, annotators=['parse'])
client.update(doc)

The response has binarized trees.

This might also fix the error on the second document you mentioned.


f11r commented Jan 22, 2018

Yes, this "fixes" it by segmenting/tokenizing the document again. However, I need to keep the tokenization the way it is provided to corenlp and only use the parser (I'm aware of the quality implications of using only part of the pipeline). If you think this is caused by the CoreNLPServer, should I post this to the corenlp repository issues?

arunchaganty (Contributor) commented

Ah I see. The problem is indeed caused by CoreNLPServer -- I'll have a look at trying to fix the server next chance I get.
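
If the failure does originate in the server, one way to narrow it down is to look at the raw response bytes before they reach the protobuf parser. The sketch below assumes that the response is framed as a single message preceded by a varint length prefix; that framing is an assumption about the client/server exchange, and `read_varint` and `describe_delimited` are hypothetical helper names that are not part of python-stanford-corenlp. How to obtain the raw body depends on the client version and is not shown.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a protobuf base-128 varint at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # the low 7 bits carry the payload
        if not byte & 0x80:               # a clear high bit marks the last byte
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def describe_delimited(body: bytes) -> str:
    """Report whether `body` looks like one length-prefixed message."""
    try:
        declared, start = read_varint(body)
    except ValueError as exc:
        return f"no valid length prefix ({exc})"
    actual = len(body) - start
    if declared == actual:
        return f"ok: {declared} payload bytes"
    return f"mismatch: prefix declares {declared} bytes, body has {actual}"


if __name__ == "__main__":
    print(describe_delimited(b"\x03abc"))
    # A made-up plain-text body, standing in for a non-protobuf error reply.
    print(describe_delimited(b"hypothetical plain-text error body"))
```

A mismatch would suggest that the server replied with something other than a serialized document, for example a plain-text error message. A matching length alone does not prove that the payload is a valid Document.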
