Design of arc, node, and polygon drawing #272
Comments
Hmm; I'm not sure that blurs user-level concerns with implementation, because aren't our users concerned with adding data associated with each node, by index (to ensure each node gets that data only once and unambiguously)? I think the user does care about such an index, at least in many cases.
I don't follow; to associate a value with a node they would edit a cell in a table, so they would never even see the node id.
What I mean is the fact that each node has a unique identity is of user-level importance. Sure, the actual implementation as a node id is an implementation detail, but such an implementation is tightly tied to the notion of identity, and perhaps definitional of it, and so I don't think it's inappropriate to expose that to the user.
@jbednar I've added a proposal 3 to capture your suggestion of a graph-like representation.
While I totally understand your concern about the scalability of proposals 1 and 2, I've just done some testing, and what I've found is that the bottleneck is the rendering step, not the search. To put it another way, for very large paths or large collections of paths, things will slow down for other reasons long before the node matching becomes an issue.
Also, @jlstevens just pointed out that if we only have to check for matches on the feature vertices (i.e. start and end nodes) the matching won't be expensive at all.
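To illustrate that point, here is a minimal sketch (plain Python lists standing in for the real data structures; all names are illustrative, not from the actual implementation) of matching restricted to feature vertices, which only ever touches two vertices per arc regardless of path length:

```python
import random

# Toy illustration: plain Python lists stand in for the real data
# structures. If identity only needs to be checked on the feature
# vertices (start and end nodes), the search touches two vertices per
# arc rather than every vertex of every path.
arcs = [[(random.random(), random.random()) for _ in range(1000)]
        for _ in range(100)]

def match_feature_vertices(arcs, coord):
    """Return (arc_index, 'start'/'end') pairs for arcs whose feature
    vertices match the given coordinate."""
    hits = []
    for i, arc in enumerate(arcs):
        if arc[0] == coord:
            hits.append((i, "start"))
        if arc[-1] == coord:
            hits.append((i, "end"))
    return hits

# Force two arcs to share a feature vertex, then look it up
arcs[1][0] = arcs[0][-1]
shared = match_feature_vertices(arcs, arcs[0][-1])
```

With 100 arcs of 1000 vertices each, the search inspects only 200 endpoints instead of 100,000 vertices.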
My suggestion would be to go with proposal 2 for the time being and, if that proves to be problematic either due to scalability or for other reasons, tackle proposal 3 as part of the next phase of work.
But can't there be arbitrarily many items associated with each node, as in millions? Seems like once someone sets up a simulation using such a polygon, there can be some very long time series of data per node, with millions or billions of time steps for some long climate simulation potentially. Ensuring that such data stays consistent between multiple copies of the node just doesn't seem sensible to me. Using a different data structure before and after simulation is of course an option, but seems very ugly. Simply naturally supporting the idea that a node is a unique thing seems so much more straightforward.
Why would anyone ever even consider sending this data to the frontend? Edit: We are only talking about the data structure for sending the data to the frontend for rendering, editing and annotating the nodes, which can be entirely different from the representation in Python. That said, even in Python I would strongly argue against shoving timeseries into the same data structure used for storing the nodes; linking the two by id is going to be much more efficient. Therefore the data structure sent to the frontend should never contain timeseries; at most it will contain one scalar value per node per dimension.
But doesn't such linking need an id for each node? It seems like any proposal should include the full life cycle of this data, including any drawing tool version, a regular plotting version with some (limited number of) hover columns, and how long series of data associated with nodes will be stored. Basically, how will this all work in practice? That's the only thing really worth optimizing for.
That's accurate, but that can be assigned at any point during or after the drawing.
I actually think the graph-like storage format is a good idea in the longer term but I also pretty strongly disagree with this. Trying to make one data structure work for the full lifecycle of the data also forces a bunch of design decisions at every level. As I already mentioned, it forces us to write a new bokeh renderer, but beyond that it also introduces a new, quite non-standard data format to be supported in HoloViews, forcing us to extend the interfaces, and would then also require an entirely new element and plotting class. That's a lot of custom, very specialized code that deals with a format that isn't in common usage and that only has large benefits in certain cases. Indeed the graph representation is actually most useful when plotting because it avoids the node matching issue, but in terms of memory footprint I expect the gains to be very minimal because very few nodes will ever be shared (only start and end nodes ever are). On the Python end the graph-based format therefore offers little advantage and would have to be converted to other formats such as GEOS (i.e. shapely) geometries for interoperability anyway.
This consideration seems to me entirely separate; the storage of additional information like this is always going to be in a separate table/data-structure indexed by id, and I suspect will be strongly driven by whatever simulator is used to generate the data. Mixing the very different considerations for dealing with these two types of data seems overly prescriptive and therefore a very bad idea to me. Overall, while I think the graph-based representation would be the most optimal at the drawing stage, I'm not at all convinced it should be replicated for the whole workflow. Conversion between formats will always be relatively cheap, and is never going to be the bottleneck. So a little bit of glue code which converts the data appropriately is in my opinion much preferable to being overly prescriptive. Finally, I'd also like to optimize for a solution we can deliver in the short term, while not necessarily precluding a better longer-term solution. In the short term, here are my pros/cons for switching from the current format, which would require node matching, to the graph-based representation (feel free to add more):
Just to clarify, I am not saying that a single data structure should be used for all stages. I'm saying that all stages should be taken into consideration in the proposals, so that we don't optimize for some local case without anticipating the knock-on effects elsewhere. The simplest assumption is a single data structure to rule them all, but that default position can be supplanted by others when there is good reason. Also, be careful to separate performance considerations from consistency considerations -- I'm not so worried about the computational cost of synchronizing shared nodes, but about the fact that at the user level, the user can have any number of items associated with nodes, and there needs to be a reasonable way to deal with that without major programming complexity, computational complexity, or memory problems. In certain cases it's fine to duplicate that information, but in the general case it's not, and I want to ensure that the overall solution doesn't require any unrealistic steps like that.
I still do not quite follow this point; this is only a major problem if two conditions are met: a) there's a lot of information associated with each node and b) there are lots of duplicates. The current scheme would only ever associate a few scalar values with each node during the drawing and annotating steps. What happens after that is a totally different issue, and any time-series or other data associated with each node should then be stored in a secondary data structure indexed by node ID. Secondly, there will only ever be a few duplicates since we are only allowing feature vertices to be shared. So again I agree with you in principle, and if we ever want a better representation for arbitrary meshes then these considerations become a lot more important, but given the actual use case we have, this seems like an extremely expensive case of premature optimization to me.
This is probably what I disagree with you most strongly on; not only do I see this as being horrendously inefficient and just generally a bad idea, but I think it also massively increases complexity, because now suddenly this format needs to support everything (scalar values, ragged arrays, efficient storage of timeseries, etc.) and provide convenient ways of querying the data in a format that is useful.
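As a minimal sketch of the "linked by id" arrangement described above (plain Python dicts; all names and values are made up for the example), per-node scalar annotations stay in the drawing data structure while bulky data such as timeseries lives in a secondary table keyed by node id:

```python
# Per-node scalar annotations used during drawing/annotating
# (hypothetical fields; the real schema may differ).
nodes = {
    0: {"x": 0.0, "y": 0.0, "label": "outlet"},
    1: {"x": 1.0, "y": 0.5, "label": "junction"},
}

# Secondary structure: long timeseries linked only by node id,
# never copied into the drawing data structure.
timeseries = {
    0: [0.1, 0.2, 0.15],   # e.g. simulated flow at node 0
    1: [0.3, 0.25, 0.4],
}

def series_for(node_id):
    """Look up a node's timeseries via its id; the drawing structure
    stays small no matter how long the series grows."""
    return timeseries.get(node_id, [])
```

The point of the split is that the frontend only ever sees `nodes`, while simulation output of arbitrary length accumulates separately in `timeseries`.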
What I really want to know at this point is whether you're on board with making the current implementation work using proposal 1 or 2, which would require only the following steps (most of which I've already prototyped):
And revisit the graph-based format proposal if it proves necessary.
I'm not saying that's the simplest approach -- it's the simplest starting point. It's just a baseline, not a final solution. There can be many reasons a more complex approach is better overall.
In that case I suppose I really don't understand the point you're making. To me it seems like by far the most complex starting point as well, since a single data structure that handles all this stuff will, almost by definition, have the combined complexity of a number of simpler data structures which each handle their specific task well.
A single list of nodes with associated arbitrarily many columns of data, and a separate data structure tracking connections between those nodes, is not a complex starting point. We should have a live meeting to discuss, rather than going back and forth by GitHub issue.
Earlier you were talking about time series as well, hence my confusion. Yes, let's have a meeting.
For future reference, I believe holoviz/holoviews#3258 is relevant to this discussion. |
This issue attempts to outline the design for the polygon and polyline drawing/editing tools and annotators. The motivation here is to figure out how they fit into a workflow for specifying geometries, which are then used to define watersheds from which a mesh can be computed, and how their current limitations could be addressed.
Before beginning we will define some terminology starting at the highest level and working down:
The important thing to note is that each of these geometries needs to afford the ability to be annotated.
In the next section I will outline the current design of the drawing and editing tools and describe the limitations of the current approach.
Current design
Drawing/Editing
In the current design there are two main tools used for drawing and editing polylines and nodes:
Annotating
Once the vertices and arcs are drawn, they can be annotated by editing their values in a table. To edit the values of an arc, it can be selected in the table and the values edited directly. To edit vertex values, the arc must be selected, which will display the vertices in the plot and populate the vertex table.
Proposed workflow
Limitations
Once an arc has been split, the nodes/feature vertices in each part are completely decoupled, e.g. overlapping nodes or feature vertices which are parts of different polylines cannot be dragged at the same time, and their values can diverge when editing the vertex tables. This means this approach is only suited to a linear workflow where all nodes are annotated before splitting arcs, to avoid their locations and annotated values diverging.
Positives
Alternative proposals
The main issue with the approach outlined above is that feature vertices that are part of two or more arcs cannot be edited together. It would therefore be desirable to find a solution that ensures that shared feature vertices are not decoupled and can be edited together. To make this possible, there are three proposals I could imagine:
Proposal 0 - Accept the current limitations
Not a real proposal, simply an acknowledgement of the current limitations.
Proposal 1 - Add unique ids to vertices
Completely redesign how the drawing and editing tools work by adding unique ids to each vertex. When you start drawing a new arc by snapping to an existing vertex, the vertex is cloned; the same happens when an existing arc is split. Whenever an edit is made to a node, it searches all other arcs to check if the same node occurs there and then applies the same edit to that node.
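A minimal sketch of how this could work (illustrative only, not the actual implementation; dicts and lists stand in for the real vertex and arc structures):

```python
import itertools

# Monotonically increasing unique ids for vertices (an assumption;
# any unique-id scheme would do).
_next_id = itertools.count()

def new_vertex(x, y):
    """Create a vertex carrying a unique id."""
    return {"id": next(_next_id), "x": x, "y": y}

def clone(vertex):
    """Snapping to an existing vertex (or splitting an arc) clones it;
    the copy keeps the same id so edits can be propagated."""
    return dict(vertex)

def propagate_edit(arcs, vertex_id, **updates):
    """Search all arcs for copies of the vertex and apply the same edit
    to each, keeping the copies in sync."""
    for arc in arcs:
        for v in arc:
            if v["id"] == vertex_id:
                v.update(updates)

# Two arcs sharing a feature vertex via snapping
a = new_vertex(0.0, 0.0)
arc1 = [a, new_vertex(1.0, 0.0)]
arc2 = [clone(a), new_vertex(0.0, 1.0)]

# Dragging the shared vertex updates both copies
propagate_edit([arc1, arc2], a["id"], x=0.5)
```

The search cost is linear in the total vertex count, though as noted above it shrinks dramatically if only feature vertices can be shared.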
Pros/Cons
Proposal 2 - Match on coordinate
Instead of establishing identity using a unique id, the same could be achieved by establishing identity by the exact coordinate of a vertex. This would afford all the same capabilities as a unique id but would be quite expensive to compute.
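Sketched in plain Python (an illustration under stated assumptions, not the proposed implementation), coordinate matching amounts to a linear scan over every vertex of every arc on each edit:

```python
def matching_vertices(arcs, x, y, tol=0.0):
    """Find all (arc_index, vertex_index) pairs whose coordinates match
    the given point exactly (or within tol), establishing identity by
    position rather than by a stored id."""
    matches = []
    for ai, arc in enumerate(arcs):
        for vi, (vx, vy) in enumerate(arc):
            if abs(vx - x) <= tol and abs(vy - y) <= tol:
                matches.append((ai, vi))
    return matches

# Two arcs that start at the same point and are therefore treated as
# sharing that vertex
arcs = [[(0.0, 0.0), (1.0, 0.0)],
        [(0.0, 0.0), (0.0, 1.0)]]
```

Note that exact float comparison is fragile after transformations, which is one reason a tolerance parameter (or restricting the scan to feature vertices) would likely be needed in practice.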
Pros/Cons
Proposal 3 - Multi-node graph representation
The issues in proposals 1 and 2 mostly stem from the format used to represent the arcs, since they are stored as lists of coordinates, which means that even though arcs might share a node they have to list it separately. @jbednar instead proposes that there could be a new format for drawing paths which basically consists of a graph-like representation, where the nodes are held in one data structure and the connectivity between nodes is held in another data structure, which describes which nodes to connect into an arc. In such a scheme there would be no need to try to match the identity of the nodes, which would be much more scalable.
The format would be something like this:
Node data source
Arc data source
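Since the original example tables are not reproduced here, the following is a hypothetical column-oriented sketch of what the two data sources might look like (all column names are illustrative, loosely modelled on a Bokeh ColumnDataSource):

```python
# Node data source: each node appears exactly once, so there is no
# identity-matching problem and no duplicated annotations.
node_source = {
    "id":    [0, 1, 2, 3],
    "x":     [0.0, 1.0, 1.0, 0.0],
    "y":     [0.0, 0.0, 1.0, 1.0],
    "label": ["a", "b", "c", "d"],
}

# Arc data source: connectivity only, as ordered lists of node ids.
# Arcs 0 and 1 share nodes 0 and 2 without duplicating them.
arc_source = {
    "arc_id": [0, 1],
    "nodes":  [[0, 1, 2], [2, 3, 0]],
}

def arc_coords(arc_idx):
    """Resolve an arc's node ids back to coordinates, as a renderer
    would need to do before drawing."""
    idx = {nid: i for i, nid in enumerate(node_source["id"])}
    return [(node_source["x"][idx[n]], node_source["y"][idx[n]])
            for n in arc_source["nodes"][arc_idx]]
```

Editing a node's row updates every arc that references it, which is the scalability advantage claimed for this proposal; the cost is the id-to-coordinate resolution step at render time.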
Pros/Cons