Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize in-memory representation of string and datetime columns #764

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

senderista
Copy link
Contributor

This will proceed in two stages:

  1. Deserialize protobuf messages containing STRING_TYPE or DATETIME_TYPE columns into new packed Column implementations
  2. Define new ColumnBuilder implementations that build these packed Column implementations, possibly adding a new append-only ColumnBuilder interface (since the packed string representation can't efficiently replace/swap by row index)

Fixes #762

@senderista
Copy link
Contributor Author

@jingjingwang: So I've noticed that although I can avoid copying from a protobuf msg by stealing the ByteString as a read-only ByteBuffer, I can't go the other way; the only way to initialize a ByteString from a ByteBuffer is to copy all the bytes. I'm wondering at this point if only saving the copy one way is worth it, or if I should just be copying everything to/from a byte[] both ways. Alternatively, I could possibly avoid copying altogether by using a ByteString instead of a byte[] or ByteBuffer in the StringColumn impl, but I'm loath to do that since it's a specialized protobuf datatype. What do you think?

@jingjingwang
Copy link
Contributor

I think saving one direction is definitely better than not saving :) Plus the code looks neat so I don't see any reason not to do it. I also have concern on using a protobuf specialized datatype in StringColumn impls, and we can't avoid copying for other types like LongColumn anyway.

@senderista
Copy link
Contributor Author

I now think I should generalize the issue/PR into creating zero-copy (buffer-stealing) versions of all column types, so a column of any type can be serialized/deserialized from a protobuf msg with no buffer copy (by keeping all data in a ByteString). Additionally, I want to define a new ColumnBuilder type that implements WritableColumn but not ReplaceableColumn (it appears that only MutableTupleBuffer uses the ReplaceableColumn interface). That would allow us to optimize StringColumnBuilder to directly copy UTF-8 bytes from a String into a buffer which we could serialize into a ByteString (or incrementally serialize using ByteString.concat()), and avoid keeping String objects around any longer than we have to.

@jingjingwang
Copy link
Contributor

Sounds like it's going to be a big change! I'm not completely sure how the code will look like after the zero-copy change, but if it's doable then looks great. One concern is how do we handle ReplaceableColumn in a nice way.

@senderista
Copy link
Contributor Author

@jingjingwang what do you think of this PR after a year? :) How high a priority do you think it should be to avoid storing objects in String and DateTime columns? Not sure what memory savings this would really yield, but it seems like memory usage is currently a big issue for Myria users.

@jingjingwang
Copy link
Contributor

According to the discussion above, we were planing to add zero-copy to this PR. Could you add this improvement (if you plan to) and also clean the PR up? I can review it after it's up-to-date.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants