Fixing bug that assumed # codepoints was equal to # UTF-8 bytes.
Tobin Baker committed Aug 12, 2015
1 parent 5b3c414 commit 2d53469
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion src/edu/washington/escience/myria/column/Column.java
@@ -2,6 +2,7 @@

import java.io.Serializable;
import java.nio.ByteBuffer;
+ import java.nio.charset.StandardCharsets;
import java.util.BitSet;

import org.joda.time.DateTime;
@@ -288,7 +289,8 @@ protected static ColumnMessage defaultStringProto(final Column<?> column) {
StringBuilder sb = new StringBuilder();
int startP = 0, endP = 0;
for (int i = 0; i < column.size(); i++) {
- endP = startP + column.getString(i).length();
+ int len = column.getString(i).getBytes(StandardCharsets.UTF_8).length;
+ endP = startP + len;
inner.addStartIndices(startP);
inner.addEndIndices(endP);
sb.append(column.getString(i));
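
For readers who want to try the corrected offset arithmetic outside Myria, here is a minimal standalone sketch. The class `Utf8Offsets` and the method `byteOffsets` are hypothetical names, not part of the Myria codebase, and the sketch assumes each start offset is simply the previous end offset; the rest of the real loop is not shown in this diff.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Myria: computes [start, end) byte ranges
// for strings laid out back to back in a single UTF-8 buffer.
public class Utf8Offsets {

    public static int[][] byteOffsets(List<String> values) {
        int[][] offsets = new int[values.size()][2];
        int start = 0;
        for (int i = 0; i < values.size(); i++) {
            // Measure the encoded size, not String.length(), which counts
            // UTF-16 code units rather than UTF-8 bytes.
            int len = values.get(i).getBytes(StandardCharsets.UTF_8).length;
            offsets[i][0] = start;
            offsets[i][1] = start + len;
            start += len; // assumption: the next string begins where this one ends
        }
        return offsets;
    }

    public static void main(String[] args) {
        // "abc" is 3 bytes, U+00E9 is 2 bytes, the two CJK characters are 3 bytes each.
        List<String> values = List.of("abc", "\u00e9", "\u65e5\u672c");
        for (int[] range : byteOffsets(values)) {
            System.out.println(range[0] + ".." + range[1]);
        }
        // prints 0..3, then 3..5, then 5..11
    }
}
```

With `String.length()` in place of the encoded length, the same input would yield the ranges 0..3, 3..4 and 4..6, which no longer line up with the UTF-8 bytes.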

2 comments on commit 2d53469

@jingjingwang

Nice catch! Interesting that it hasn't been discovered until now.

@senderista

As long as all data was ASCII, the bug wouldn't have manifested. I didn't find it from buggy behavior; the code just caught my eye.
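
A quick way to see why ASCII-only data hides the problem is the sketch below. The class `LengthVsBytes` and the method `utf8Length` are hypothetical names used only for illustration. Strictly speaking, `String.length()` counts UTF-16 code units rather than code points, so the example prints the code point count as well.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Myria.
public class LengthVsBytes {

    // Number of bytes the string occupies once encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static void report(String label, String s) {
        System.out.println(label
            + ": length()=" + s.length()
            + " codePoints=" + s.codePointCount(0, s.length())
            + " utf8Bytes=" + utf8Length(s));
    }

    public static void main(String[] args) {
        report("ascii", "myria");            // length()=5 codePoints=5 utf8Bytes=5
        report("accented", "caf\u00e9");     // length()=4 codePoints=4 utf8Bytes=5
        report("emoji", "\uD83D\uDE00");     // length()=2 codePoints=1 utf8Bytes=4
    }
}
```

For pure ASCII input all three counts agree, so offsets computed from `length()` happen to be right; they drift apart as soon as a character needs more than one UTF-8 byte.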