| Metadata |                                            |
| -------- | ------------------------------------------ |
| cEP      | 18                                          |
| Version  | 1.0                                         |
| Title    | Integration of ANTLR into coala core        |
| Authors  | Viresh Gupta <mailto:viresh16118@iiitd.ac.in> |
| Status   | Proposed                                    |
| Type     | Feature                                     |
This document describes how an API based on ANTLR will be constructed and maintained.

ANTLR provides parsers for various language grammars, so we can provide an interface from within coala and make it available to bear writers, enabling them to write advanced linting bears. The API will support the more flexible visitor-based method of AST traversal (as opposed to listener-based mechanisms).
The proposal is to introduce a parallel concept to the coala-bears library, which will be called `coala-antlr` from here on.

Here is the detailed implementation, step by step:
- There will be a separate repository named `coala-antlr`, which will be installable via `pip install coala-antlr`.
- A new package will be introduced in the `coala-antlr` repository to maintain all the dependencies for `antlr-ast`-based bears. This will be the `coantlib` package.
- The `coantlib` package will provide endpoints relevant to the visitor model of traversing the AST generated via ANTLR.
- Another package named `coantparsers` will be created inside the `coala-antlr` repository. This package will hold parsers and lexers generated beforehand (with the Python target) for the supported set of grammars.
- `coantlib` will have an `ASTLoader` class that will be responsible for loading the AST of a given file depending on the user input, using the file extension as a fallback. For example, a `.c` file will be loaded using the C parser from `coantparsers` if no language is specified.
- An `ASTWalker` class will be responsible for providing a single method that resolves the language supplied by the bear and creates a walker instance of the appropriate type.
- `coantlib.walkers` will have several walker classes derived from the `ASTWalker` class that are language specific and grammar dependent, e.g. a `Py3ASTWalker`.
- The bears dependent on `coantlib` will reside in the same repository as `coantlib`. All of them will live in a separate package, the `coantbears` package, installable together via a single option in the `setup.py` along with `coantlib`.
- The bears will be able to use a high-level API by defining the `LANGUAGES` instance variable, which will return to them a walker with high-level APIs in case the language is supported (e.g. if `LANGUAGES = {'Python'}`, they will automatically get a `BasePyASTWalker` instance in the `walker` class variable). A usage sketch follows this list.
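To make the intended developer experience concrete, here is a purely illustrative sketch; the `PyMethodNameBear` name is hypothetical, while `ASTBear`, `LANGUAGES`, `initialise()` and `get_methods()` refer to the prototypes shown later in this document:

```python
# Purely illustrative sketch of the intended bear-writer experience.
# PyMethodNameBear is a hypothetical name; ASTBear, LANGUAGES,
# initialise() and get_methods() are the prototypes described below.
class PyMethodNameBear(ASTBear):
    LANGUAGES = {'Python'}

    def run(self, filename, file, tree, *args, **kwargs):
        self.initialise(file, filename)
        # self.walker is a BasePyASTWalker since LANGUAGES is {'Python'}.
        for method in self.walker.get_methods():
            ...  # check the naming convention and yield a Result
```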
Managing a new repository is a heavy task, so much of it will be automated as follows:
- The parsers in `coala-antlr/coantparsers` will be generated and pushed via CI builds whenever a new commit is pushed (a sketch of such a generation step is shown after this list).
- The `cib` tool can be enhanced to deal with the installation of bears that require only some specified parsers (e.g. a `PyUnusedVarBear` would only require the parser for Python).
- There will be nightly cron jobs to keep the parsers up to date.
- For adding support for a new language, a new walker class deriving from `ASTWalker` will be added, and the rest will be taken care of automatically by the library.
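As a rough illustration of the CI generation step, the following sketch drives the ANTLR tool for each supported grammar. The grammar file names, the `grammars/` directory and the jar location are assumptions made only for illustration:

```python
# Sketch of a CI step that regenerates the bundled parsers.
# The jar name, grammar list and directory layout are illustrative.
import subprocess
from pathlib import Path

GRAMMARS = ['C.g4', 'Python3.g4', 'JavaScript.g4']  # illustrative list


def generate_parsers(antlr_jar='antlr-4.7.1-complete.jar'):
    for grammar in GRAMMARS:
        subprocess.run(
            ['java', '-jar', antlr_jar,
             '-Dlanguage=Python3',   # generate the Python target
             '-visitor',             # emit visitor classes as well
             '-o', 'coantparsers',
             str(Path('grammars') / grammar)],
            check=True)


if __name__ == '__main__':
    generate_parsers()
```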
Here is a prototype for the implementations within `coantlib`:
```python
import antlr4
import inspect

import coantparsers
from antlr4 import CommonTokenStream, InputStream
from coalib.bearlib.languages.definitions import *
from coalib.bearlib.languages.Language import parse_lang_str, Languages


def resolve_with_package_info(lang):
    """
    Use the inspect module to search amongst all
    classes that inherit ASTWalker within this module
    and return the class whose supported language matches.
    """


class ASTLoader(object):
    # Maps a language to its pre-generated [lexer, parser] pair.
    mapping = {
        C: [coantparsers.clexer, coantparsers.cparser],
        Python: [coantparsers.pylexer, coantparsers.pyparser],
        JavaScript: [coantparsers.jslexer, coantparsers.jsparser],
        # ...
    }

    @staticmethod
    def load_file(file, filename, lang='auto'):
        """Load the file's AST into memory."""
        if lang == 'auto':
            ext = get_extension(filename)
        else:
            ext = Languages(parse_lang_str(lang)[0])[0]
        if ext is None:
            raise FileExtensionNotSupported('Unknown File Type')
        input_stream = InputStream(''.join(file))
        lexer = ASTLoader.mapping[ext][0](input_stream)
        # The parser consumes a token stream, not the lexer directly.
        parser = ASTLoader.mapping[ext][1](CommonTokenStream(lexer))
        return parser


class ASTWalker(object):
    treeRoot = None
    tree = None

    @staticmethod
    def make_walker(lang):
        """Resolve the walker class of the appropriate type."""
        if lang == 'auto':
            return ASTWalker
        else:
            return resolve_with_package_info(lang)

    @staticmethod
    def get_walker(file, filename, lang):
        """Return a walker for the required language."""
        ret_class = ASTWalker.make_walker(lang)
        if not ret_class:
            raise Exception('Unsupported language')
        else:
            return ret_class(file, filename)

    def __init__(self, file, filename):
        self.tree = ASTLoader.load_file(file, filename)
        self.treeRoot = self.tree
        ...
```
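For illustration, a hypothetical caller of this prototype might look like the following; the file name is made up:

```python
# Hypothetical usage of the prototype above.
with open('sample.py') as source:
    file_lines = source.readlines()

# Explicit language; 'auto' would fall back to the file extension.
walker = ASTWalker.get_walker(file_lines, 'sample.py', 'Python')
```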
These kinds of walkers will supply a high-level walker API:
"""Prototype of Python ASTWalkers."""
class BasePyASTWalker(ASTWalker):
"""Base class for Python Walkers. This will be an abstract class."""
LANGUAGES = {'Python'}
def get_imports():
"""Return list of strings containing import statements."""
def get_methods():
"""Return list of strings containing all methods from the file."""
...
class Py2ASTWalker(BasePyASTWalker):
"""Python 2 specific walker."""
LANGUAGES = {'Python 2'}
def get_print_statements():
"""Return list of print statements, since print is a keyword in py2."""
...
class Py3ASTWalker(BasePyASTWalker):
"""Python 3 specific walker."""
LANGUAGES = {'Python3'}
def get_integer_div_stmts():
"""
Return a list of integer divide statements.
since / and // are different in py3, this is py3 specific
"""
...
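As a sketch of how such a method could be implemented on top of a generated parser, the following assumes the community Python3 grammar from grammars-v4, so the `import_stmt` rule name (and hence `visitImport_stmt`) is an assumption tied to that particular grammar:

```python
# Sketch only: assumes the grammars-v4 Python3 grammar, whose parser
# exposes an import_stmt rule and a generated Python3Visitor.
from Python3Visitor import Python3Visitor


class ImportCollector(Python3Visitor):
    def __init__(self):
        self.imports = []

    def visitImport_stmt(self, ctx):
        # ctx.getText() returns the statement text, minus hidden whitespace.
        self.imports.append(ctx.getText())
        return self.visitChildren(ctx)
```

A `BasePyASTWalker.get_imports` could then run such a collector over `self.tree` and return the gathered strings.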
"""Prototype of ASTBear class."""
from coantlib import ASTWalker
from coalib.bears.LocalBear import Bear
class ASTBear(LocalBear):
"""
ASTBear class.
This class is similar to Local Bear except for the added capability of
initialising walker for use by the instance of this bear.
"""
walker = None
def initialise(file, filename):
"""Initialise self.walker according to the file and language."""
if LANGUAGES:
self.walker = ASTWalker.get_walker(file,
filename,
self.LANGUAGES[0])
else:
self.walker = ASTWalker.get_walker(file, filename, 'auto')
def run(self,
filename,
file,
tree,
*args,
dependency_results=None,
**kwargs
):
"""Actual run method - to be implemented by bear."""
raise NotImplementedError('Needs to be done by bear')
"""A sample Bear."""
from coantlib.ASTBear import ASTBear
from coalib.results.Result import Result
class TestBear(ASTBear):
"""A test bear that uses walker from coantlib for linting."""
def run(self,
filename,
file,
tree,
*args,
dependency_results=None,
**kwargs
):
"""Essence of logic for bear."""
self.initialise(file, filename)
violations = {}
method_list = self.walker.get_methods()
for method in method_list:
"""
logic for checking naming convention
and yielding an apt result
wherever required
"""
...
ANTLR has a `HiddenTokenStream` which receives all the tokens irrelevant to the parser, such as whitespace. These are not passed on to the parser and, as a result, are not returned when simply fetching text from a node in the parse tree.
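For grammars whose lexer routes such tokens to the hidden channel (rather than skipping them outright, as the Python3 grammar does for spaces), the token stream still holds them and they can be recovered. A minimal sketch under that assumption:

```python
# Sketch: recovering hidden-channel tokens from a CommonTokenStream.
# Only works if the grammar sends whitespace to the HIDDEN channel
# (-> channel(HIDDEN)) instead of skipping it (-> skip).
from antlr4 import Token


def text_with_left_whitespace(stream, terminal_node):
    token = terminal_node.getPayload()
    hidden = stream.getHiddenTokensToLeft(token.tokenIndex,
                                          Token.HIDDEN_CHANNEL)
    prefix = ''.join(t.text for t in hidden) if hidden else ''
    return prefix + terminal_node.getText()
```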
Consider the code:
```python
def do_nothing():
    pass


def run_coala(console_printer=None,
              log_printer=None,
              print_results=do_nothing,
              print_section_beginning=do_nothing,
              nothing_done=do_nothing,
              autoapply=True,
              force_show_patch=False,
              arg_parser=None,
              arg_list=None,
              args=None,
              debug=False):
    """
    This function signature has been taken from coala-core for fairly complex
    example.
    """
    print('This function will do something')
    if (console_printer and
        log_printer and
        not (args == None and debug == False)
        ):
        print('These conditions were made to trick !')
        print(' !(a & b) == !a or !b')
```
After converting this source back using:
```python
#!/usr/bin/env python
from antlr4 import *
from Python3Lexer import Python3Lexer
from Python3Parser import Python3Parser
from Python3Visitor import Python3Visitor
import sys


class Visitor(Python3Visitor):
    def visitFile_input(self, ctx):
        # Print the text of the whole file as stored in the parse tree.
        print(ctx.getText())


def main(args):
    file = FileStream(args[1])
    lexer = Python3Lexer(file)
    stream = CommonTokenStream(lexer)
    parser = Python3Parser(stream)
    tree = parser.file_input()
    if parser.getNumberOfSyntaxErrors() != 0:
        print("File contains {} "
              "syntax errors".format(parser.getNumberOfSyntaxErrors()))
        return
    visitor = Visitor()
    visitor.stream = stream
    visitor.visit(tree)


if __name__ == '__main__':
    main(sys.argv)
```
We get the following output:
```
defdo_nothing():
pass
defrun_coala(console_printer=None,log_printer=None,print_results=do_nothing,print_section_beginning=do_nothing,nothing_done=do_nothing,autoapply=True,force_show_patch=False,arg_parser=None,arg_list=None,args=None,debug=False):
"""
This function signature has been taken from coala-core for fairly complex
example.
"""
print('This function will do something')
if(console_printerandlog_printerandnot(args==Noneanddebug==False)):
print('These conditions were made to trick !')
print(' !(a & b) == !a or !b')
<EOF>
```
The most important points to note in the above (especially with respect to a Python grammar scenario) are:

- All spaces between two tokens have been removed unless present inside a string.
- All newlines have been stripped unless the newline is specified by the grammar to be present in the parse tree. With reference to the Python grammar, the newlines appear irregularly because they are preserved only in the case of the `compound_stmt` rule.
- All information about the indentation in the code is lost, which makes it impossible for a fixer to suggest a complex fix without guessing the indentation.
Despite all of the above, ANTLR provides us a line number and a column number for each `TerminalNode`.
Thus, keeping all of the above in mind, the bears written with this library can provide some fixes for linting errors; such a fix is provided in the implementation by the `PyPluralBear`. Once a complete round-trip mechanism is possible, any kind of fix can be done and the library can also provide functions for doing so, but this is not in the scope of this project, due to the aforementioned facts.
Manual space insertion can be done using:
```python
#!/usr/bin/env python
from antlr4 import *
from Python3Lexer import Python3Lexer
from Python3Parser import Python3Parser
from Python3Visitor import Python3Visitor
import sys


class Visitor(Python3Visitor):
    def visitTerminal(self, node):
        # Record position, text and parent rule of every terminal node.
        line = node.getPayload().line
        col = node.getPayload().column + 1
        self.nodeinfo.append((line, col, node.getText(), type(node.parentCtx)))


def print_source(node_list):
    current_line = 1
    current_column = 1
    for element in sorted(node_list):
        # Advance to the line the token was recorded on.
        while current_line < element[0]:
            current_column = 1
            current_line += 1
            print("")
        # Pad with spaces up to the recorded column.
        while current_column <= element[1] - 1:
            current_column += 1
            print(" ", end="")
        if element[2] != '\n' and element[2] != '\t' and element[2].strip() != '':
            print(element[2], end="")
        current_column = current_column + len(element[2])
        if element[2].count('\n') and element[2] != '\n':
            current_column = 1
            current_line += element[2].count('\n')
    print("")


def main(args):
    file = FileStream(args[1])
    lexer = Python3Lexer(file)
    stream = CommonTokenStream(lexer)
    parser = Python3Parser(stream)
    tree = parser.file_input()
    if parser.getNumberOfSyntaxErrors() != 0:
        print("File contains {} "
              "syntax errors".format(parser.getNumberOfSyntaxErrors()))
        return
    visitor = Visitor()
    visitor.stream = stream
    visitor.nodeinfo = []
    visitor.visit(tree)
    print_source(visitor.nodeinfo)


if __name__ == '__main__':
    main(sys.argv)
```
But the above also poses complications, especially when the source grammar uses stray nodes. For example, the Python grammar uses nodes that the grammar for Java doesn't use. This forces ANTLR programs written against the Python grammar to deal with stray character nodes as of this cEP. Since the library can interface with any arbitrary (but correct) ANTLR grammar, it currently cannot detect these stray nodes automatically.

An example of these stray nodes can be seen by using the Python3 grammar with a source that is indented using spaces.
Sample output for the above input code using the improvised pretty printer:
```
def do_nothing():
    hpass
def run_coala(console_printer=None,
              log_printer=None,
              print_results=do_nothing,
              print_section_beginning=do_nothing,
              nothing_done=do_nothing,
              autoapply=True,
              force_show_patch=False,
              arg_parser=None,
              arg_list=None,
              args=None,
              debug=False):
    """
    This function signature has been taken from coala-core for fairly complex
    example.
    """
    F
    eprint('This function will do something')
    ift(console_printer and
        log_printer and
        not (args == None and debug == False)
        ):
        =print('These conditions were made to trick !')
        eprint(' !(a & b) == !a or !b')
<EOF>
```
Notice the stray `h` in front of `pass`; in the actual node, it coincides with the position of the character `p`. Similarly, there are an extra `F`, `e`, `t` and `=` occupying spaces seemingly randomly (but this is an effect of the context rules provided in the grammar). Also notice that replacing tab characters with whitespace drastically reduced the indent levels.
For example, let's try to provide a fix from a bear that reduces the complexity of conditions inside `if` statements. For the above example, it would be about converting
```python
if (console_printer and
        log_printer and
        not (args == None and debug == False)):
    print('These conditions were made to trick !')
    print(' !(a & b) == !a or !b')
```
into
```python
if (console_printer and
        log_printer and
        args != None or debug != False):
    print('These conditions were made to trick !')
    print(' !(a & b) == !a or !b')
```
For doing the above transformation, ideally one would need to transform the tree at those `and` nodes that have a `not` node as a parent, and then reconstruct the source code from this transformed node.
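As a sketch of how a bear could locate the relevant region, the following assumes the grammars-v4 Python3 grammar, where negation is parsed by a `not_test` rule (the rule name, and hence `visitNot_test`, are grammar-specific assumptions). The recorded spans feed into the string-level approach described next:

```python
# Sketch: record the source span of every `not` expression so the bear
# can rewrite that span at string level. Assumes the grammars-v4
# Python3 grammar (not_test rule).
from Python3Visitor import Python3Visitor


class NotSpanCollector(Python3Visitor):
    def __init__(self):
        self.spans = []

    def visitNot_test(self, ctx):
        # ctx.start/ctx.stop are the first and last tokens of the rule.
        self.spans.append((ctx.start.line, ctx.start.column,
                           ctx.stop.line, ctx.stop.column))
        return self.visitChildren(ctx)
```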
Since ANTLR doesn't provide a mechanism for node transformation as of this cEP, we will have to do this transformation manually at the string level. Thus, in order to do the above transformation, we save the starting line and column number from the `not` node, as well as the ending line and column number. The bear will then process this information and construct a new string, which will be sent to coala for diff generation.
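For illustration, the hand-off to coala could look like the following sketch; it uses coala's `Diff.from_string_arrays` and `Result.from_values`, while `rewrite_not_conditions` is a hypothetical helper standing in for the string-level transformation itself:

```python
# Sketch: turn the rewritten file into a coala Result carrying a diff.
# rewrite_not_conditions() is a hypothetical helper performing the
# string-level transformation described above.
from coalib.results.Diff import Diff
from coalib.results.Result import Result


def make_fix_result(bear, filename, file):
    new_file = rewrite_not_conditions(file)  # hypothetical helper
    diff = Diff.from_string_arrays(file, new_file)
    return Result.from_values(origin=bear,
                              message='Condition can be simplified.',
                              file=filename,
                              diffs={filename: diff})
```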
Thus, fixes from bears using this library are possible, but limited to being done by the bear itself. Also, as noted above, ANTLR has poor support for retrieving from the parse tree the original source tokens that were discarded by the grammar, so the feature to fix code using library-provided functions is out of scope for this project and will be proposed as a future project.