Skip to content
This repository has been archived by the owner on Mar 8, 2020. It is now read-only.

semantic: Discrepancy in parsing of header comments #56

Open
r0mainK opened this issue Sep 13, 2019 · 9 comments
Open

semantic: Discrepancy in parsing of header comments #56

r0mainK opened this issue Sep 13, 2019 · 9 comments
Labels

Comments

@r0mainK
Copy link

r0mainK commented Sep 13, 2019

Basically, if we have:

// comment 1
// comment 2
package foo
// comment 3
// comment 4
import "bar"

Then:

  • comment 1, comment 2, comment 3 and comment 4 go in the Comments attribute of the root node
  • comment 1 and comment 2 go in the Doc attribute of the root node
  • comment 3 and comment 4 go in the Doc attribute of the GenDecl node in the root node's Decl attribute

However, if we have:

// comment 1

// comment 2
package foo
// comment 3

// comment 4
import "bar"

Then:

  • comment 1, comment 2, comment 3 and comment 4 still go in the Comments attribute of the root node
  • only comment 2 goes in the Doc attribute of the root node
  • only comment 4 goes in the Doc attribute of the GenDecl node in the root node's Decl attribute

Basically, any comments placed before a newline before the package/import keywords get removed from the Doc attributes of the respective nodes.

@dennwc dennwc added the bug label Sep 13, 2019
@dennwc
Copy link
Member

dennwc commented Sep 13, 2019

This is how Go identifies which comments correspond to which node: if the comment is immediately before the line, it's attached to a node.

In terms of Babelfish, this may be considered a bug, but not for the Native mode, since it's by definition "whatever the native parser produces".

However, I think the same is true for Semantic, so we should place comments to corresponding places at lest in that mode.

@r0mainK
Copy link
Author

r0mainK commented Sep 13, 2019

Okay, updating the title to sematinc: ... then.

@r0mainK r0mainK changed the title native: Discrepancy in parsing of header comments semantic: Discrepancy in parsing of header comments Sep 13, 2019
@kuba--
Copy link
Member

kuba-- commented Sep 20, 2019

It's somehow similar to how CDT handles comments in cpp-driver. I think we've already discussed it, and we decided that it's hard to define which comment belongs to what statement moreover what does it mean that comments belongs to some nested node in tree.
As a consensus we agreed to put comments like flat list on the same, top level under root.
In other words it will be no nested comments assigned to some nodes.
WDYT?

@dennwc
Copy link
Member

dennwc commented Sep 20, 2019

@kuba-- Correct, but I was thinking about CDT issue in terms of Native mode. Then we can add a processing stage for Semantic to place those comments (as separate nodes) inside the tree based on positions.

Same can be done for Go here. But we may need to pull some comments out from existing nodes first.

@kuba--
Copy link
Member

kuba-- commented Sep 20, 2019

But I'm not sure if it's really needed. Comments are comments: text + position. I'm not sure if we really need to complicate it. So far (based on use cases), I don't see the reason.

Personally, I would keep them as a flat, separate branch from root node.
I would even argue, that it may simplify analysis (avoid some comments noise).
And if we really need some comments analysis, maybe we can write a tool/lib, which let match nodes to comments based on positions.

@dennwc
Copy link
Member

dennwc commented Sep 20, 2019

Yeah, I agree about simplicity of traversal. This is one of the reasons why I advocate for graphs: comments may point to nodes, not nodes to comments. This way they won't appear in AST during the traversal, but are still easily accessible.

But for now we have to work with trees. Having a flat list of comments with positions works for advanced users that can map them back to nodes, but not for a simple use case. For example, if you open play.bblf.sh and paste the code with comments, you will expect them to appear near relevant nodes (but not in Native, as the issue points out).

@kuba--
Copy link
Member

kuba-- commented Sep 22, 2019

I would ask opposite question - do people really care where they see comments in tree. I know what we have in bblfsh playground, so far. My question is what could be wrong if we change it?
And honestly, I have a feeling that people who are focused mainly on comments they will have an easier task to parse it.
Maybe we should ask first, what can be use case, instead of trying to guess. I really believe that having comments as independent branch (from source code) and let them match nodes (later based on positions) can be easier to implement and more readable comparing to what we have (where we gather info, first and then propagate it)

@r0mainK
Copy link
Author

r0mainK commented Sep 23, 2019

From a user's point of view I would say this. Currently in some drivers, this one included, we have both implemented:

  • a flat list at the root node of the UAST, here the Comments attribute
  • nodes containing comments near the relevant nodes

Now, I would tend to say you ought to keep only the first and actually generalize this, in semantic mode, to all drivers. That is because as you said it's easy to map the comments using positional information if need be, comments do not need to be augmented with structural information imo, and I tend to think that for a lot of use cases one would simply extract all comments. That being said, I use Babelfish often, so I might be biased, and I can see how having comments at their correct position might help.

What I think is that, unless you want to follow a data-sparse approach in which case picking one or the other should be done, then there is no problem with having both. But in that case, I think it would be good to say so in the doc, possibly port the flat representation to drivers where it does not exist, and finally fill the gaps where they exist, e.g. here where some comments appear only in the flat representation - because these kind of discrepancies will definitely confuse users.

@creachadair
Copy link
Contributor

creachadair commented Sep 23, 2019

Mapping comments to syntactic constructs is very tricky to get right, and relies on conventions that differ across languages. There is no single rule that gives the "right answer" across languages, because the way people write comments feeds back on how the tools interpret them—meaning you write comments in different layouts if you expect different interpretation.

If we want to pick a universal rule, it's going to have to be pretty simplistic: Having a flat list of comments for the whole file, for example, should always work—even if we have to synthesize it for languages that don't provide it natively. The corner cases about what counts as "adjacent" are very tricky, and are not treated the same way between tools like Doxygen, Javadoc, gofmt, clang-format, and so on.

I agree we should try to provide users with a reasonable experience, but I don't foresee being able to pick a simple rule that accomplishes that. I did an extensive survey of comment layout rules at my last job, and if I still have the notes from it I will try to post them somewhere so you can get a sense of how annoyingly inconsistent people are about their expectations. 😀

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants