Metadata | |
---|---|
cEP | 0033 |
Version | 0.1 |
Title | Handle Nested Programming Language |
Authors | Naveen Naidu mailto:naveennaidu479@gmail.com |
Status | Proposed |
Type | Feature |
This cEP describes the architecture that needs to be implemented in order to enable coala to lint nested programming languages present in a file.
coala in its present state can provide efficient static analysis only for files that contain a single programming language. However, multiple programming languages can coexist in a single source file, e.g. PHP and HTML, HTML and Jinja, Python and Jinja, code blocks and RST, etc.
coala does not yet support these kinds of files. This project would enable coala to handle such files and allow people to write code analysis the way they already do, while applying it to the right locations in the right files. Users of coala would not have to write new bears or analysis routines; the implementation is intended to work with the existing bears.
There are several ways to approach this. In this cEP, we implement an abstract approach that supports arbitrary combinations of languages.
A source file containing multiple programming languages is loaded by coala. The original nested file is split into n temporary files, where n is the number of languages present in the original file. Each temporary file contains only the snippets belonging to one language.
These temporary files are then passed to their respective language bears (chosen by the user), which run the static analysis routines. After being linted, the temporary files are assembled again.
The above process creates the illusion that the original file is being linted, but in reality we divide the file into different parts and lint them one by one.
The figure below, would help give a clear picture
The original file is segregated into small units, where each unit contains only one language. These smallest units of the original nested file are termed nl_sections.
An nl_section can further be defined as a group of lines in the source code that belongs purely to one particular language.
All the nl_sections that belong to one particular language are grouped to form a valid file of that programming language.
The following figure would make the things clear
In the above figure, we have a file which contains both HTML and PHP code. This original file can be broken up into 4 different sections.
Each nl_section will contain the following information:
- The Programming language of the lines
- Index of the section
- Starting Line number in the original file
- Ending Line number in the original file
- Starting line number in the linted file
- Ending line number in the linted file
class NlSection(TextRange):

    def __init__(self,
                 orig_start: NlSectionPosition,
                 orig_end: (NlSectionPosition, None) = None,
                 index=None,
                 language=None):
        """
        Creates a new NlSection.

        :param orig_start: A NlSectionPosition indicating the start of the
                           section in the original file.
        :param orig_end:   A NlSectionPosition indicating the end of the
                           section in the original file.
                           If ``None`` is given, the start object will be
                           used here. The end must be in the same file and
                           be greater than the start, as negative ranges
                           are not allowed.
        :param language:   The programming language of the lines.
        :param index:      The index of the nl_section.
        :raises TypeError: Raised when
                           - orig_start is not of type NlSectionPosition.
                           - orig_end is neither of type NlSectionPosition,
                             nor is it None.
        :raises ValueError: Raised when the files of start and end mismatch.
        """
        TextRange.__init__(self, orig_start, orig_end)
        self.index = index
        self.language = language

        # linted_start: The start of the section in the linted file.
        #               Initially it is the same as the start in the
        #               original file. It changes only when patches are
        #               applied.
        # linted_end:   The end of the section in the linted file.
        #               Initially it is the same as the end in the original
        #               file. It changes only when patches are applied.
        self.linted_start = self.start
        self.linted_end = self.end

        if self.start.file != self.end.file:
            raise ValueError('File of start and end position do not match.')

    @classmethod
    def from_values(cls,
                    file,
                    start_line=None,
                    start_column=None,
                    end_line=None,
                    end_column=None,
                    index=None,
                    language=None):
        start = NlSectionPosition(file, start_line, start_column)
        if end_line or (end_column and end_column > start_column):
            end = NlSectionPosition(file,
                                    end_line if end_line else start_line,
                                    end_column)
        else:
            end = None

        return cls(start, end, index, language)
The nested language architecture would be implemented inside a new folder called nestedlib, located under the coalib directory.
The following figure depicts the architecture that is going to be implemented to enable nested languages support.
- NLCore
  This is the heart of the project. It is responsible for managing the execution flow of the program.
- NLInfoExtractor
  NLInfoExtractor is responsible for extracting information from the user's input. The extracted information includes the languages present in the file, the bears to run for each programming language and the settings of these bears.
- Parser
  Parser splits the original nested file into different nl_sections.
- NLFileHandler
  NLFileHandler is responsible for creating the temporary files (each containing all the sections belonging to a single language) that are passed on for the actual analysis.
This project will be implemented in four phases:
- Information Gathering
- Segregating the file
- Linting the file
- Assembling
Users can inform coala about the presence of nested languages in the files using the --handle-nested flag.
e.g.:
coala --handle-nested --files=html_php_file.html --language=html,php --bears=HTMLBear,PHPBear
- --language: Informs coala about the languages present in the file.
- --bears: Informs coala about the bears to run on the file.
The aim of information gathering is to extract the following information from the arguments passed to coala:
- Languages present in the file
- Language and Bear association
- Argument list (would be used to create coala sections)
The languages present in the file can easily be gathered from the --language flag. The language and bear association can be obtained by matching the meta information of each bear against the given languages.
NlInfoExtractor is responsible for the above tasks.
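The language and bear association could be derived roughly as follows. This is only a sketch: the two bear classes are hypothetical stand-ins, and the assumption is that each bear advertises its supported languages through a LANGUAGES set in its meta information.

```python
from collections import defaultdict


class HTMLBear:
    """Hypothetical stand-in for a bear; only the meta information matters."""
    LANGUAGES = {'HTML'}


class PHPBear:
    """Hypothetical stand-in for a bear; only the meta information matters."""
    LANGUAGES = {'PHP'}


def get_language_bear_dict(languages, bears):
    """Map each requested language to the names of the bears supporting it."""
    language_bear_dict = defaultdict(list)
    for language in languages:
        for bear in bears:
            # Compare case-insensitively, since users pass e.g. "html".
            supported = {name.lower() for name in bear.LANGUAGES}
            if language.lower() in supported:
                language_bear_dict[language].append(bear.__name__)
    return dict(language_bear_dict)


language_bear_dict = get_language_bear_dict(['html', 'php'],
                                            [HTMLBear, PHPBear])
print(language_bear_dict)
```

A language for which no bear was passed simply does not appear in the resulting dictionary, so the caller can detect and report it.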
The original nested language file is divided into several temporary in-memory files, where each file contains the snippets of one particular language. Each of these files needs to be linted by its respective bears.
In order to do so, we have to create virtual coala sections for each file.
Each section contains information about the files to be linted, the bears to run on those files and the settings to initialize the bears with. To create these sections, we need to pass arguments to the parse_cli() method. The arguments used to start coala in nested language mode cannot be used directly to create the coala sections.
The NlInfoExtractor therefore converts the original arguments into several argument lists, where each argument list resembles the arguments a user would have passed to lint a single temporary file with the appropriate bears.
For example:
coala --handle-nested --files=html_php_file.html --language=html,php --bears=HTMLBear,PHPBear
gets converted to:
coala --files=temp_html_file --language=html --bears=HTMLBear
and
coala --files=temp_php_file --language=php --bears=PHPBear
The above arguments, when passed to the parse_cli() method, create two sections:
[nl.html]
files = temp_html_file
bears = HTMLBear
setting = value
[nl.php]
files = temp_php_file
bears = PHPBear
setting = value
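The conversion from one nested invocation to one argument list per language might be sketched as below. The helper name make_arg_lists is hypothetical, the temporary file naming (temp_<language>_file) follows the example above, and the --files spelling is taken from the first example command in this cEP.

```python
def make_arg_lists(language_bear_dict):
    """
    Build one argument list per language, keyed by the name of the virtual
    coala section (``nl.<language>``) it is meant to create.
    """
    arg_lists = {}
    for language, bears in language_bear_dict.items():
        arg_lists['nl.' + language] = [
            '--files=temp_{}_file'.format(language),
            '--language=' + language,
            '--bears=' + ','.join(bears),
        ]
    return arg_lists


arg_lists = make_arg_lists({'html': ['HTMLBear'], 'php': ['PHPBear']})
for section_name, arguments in arg_lists.items():
    print(section_name, arguments)
```

Bear settings are left out of this sketch; they would be appended to the argument list of the section whose bear they belong to.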
def extract_info(args):
    """
    Return a dictionary called `nl_info_dict` with all the extracted
    information.

    Let's assume that we have the following arguments:

    >>> args = coala --handle-nested --files=html_php_file.html --language=html,php --bears=HTMLBear,PHPBear
    >>> nl_info_dict = extract_info(args)
    >>> nl_info_dict
    {
        "file_name": "html_php_file.html",
        "absolute_file_path": "/home/test/html_php_file.html",
        "languages": ["html", "php"],
        "bears": ["HTMLBear", "PHPBear"],
        "language_bear_dict": language_bear_dict,
        "arg_dict": arg_dict
    }

    >>> nl_info_dict['language_bear_dict']
    {
        "html": ["HTMLBear"],
        "php": ["PHPBear"]
    }

    >>> nl_info_dict['arg_dict']
    {
        "nl.html": {"file_name": "temp_html_file",
                    "bears": "HTMLBear",
                    "settings": [(setting, value)]
                    },
        "nl.php": {"file_name": "temp_php_file",
                   "bears": "PHPBear",
                   "settings": [(setting, value)]
                   }
    }
    """


def get_lang_info(args):
    """Return a list of languages present in the nested file"""


def get_bear_info(args):
    """Return the list of bears to be run on the file"""


def get_file_info(args):
    """Returns the list of files to be linted"""


def get_absolute_file_path(file_name):
    """Returns the absolute file path of each file"""


def get_language_bear_dict(languages, bears):
    """
    Return a language_bear dict, where each language is mapped to a set of
    bears
    """


def get_argument_dict(file_list, language_bear_dict, settings):
    """
    Returns an argument_dict, where each temp_file is mapped to a set of
    arguments.
    """
This phase deals with splitting the original file into different nl_sections and creating temporary files for each nested language, which contain the snippets of that language from the original nested file.
This phase uses the Parser and the NlFileHandler parts of the architecture.
- Parser is responsible for splitting the original file into different nl_sections.
- The input and output of the parser are standardized: the input is the contents of the original nested file, and the output is the list of nl_sections encompassing the snippets of the different programming languages.
- Standardizing the parser removes any restriction on how a parser should parse the contents. A parser can also use third-party APIs as long as its output conforms to the nl_section format.
- Since different combinations of languages need different parsers, we would create a new folder at nestedlib/parsers to host the collection of all parsers.
- A new superclass called Parser would be created. It would hold the common methods that all parsers might need, and every parser would be derived from it.
from coalib.nestedlib.NlSection import NlSection


class Parser:

    def parse(self, file_contents):
        '''
        Returns a list of nl_sections.

        :param file_contents: The contents of the original nested file

        >>> file_contents = \
        """
        <!DOCTYPE html>
        <head>
        <title>Hello world as text</title>
        <?php
        echo "<p>Hello world</p>";
        ?>
        </html>
        """

        >>> p = Parser()
        >>> nl_sections = p.parse(file_contents)
        >>> nl_sections
        ( <NlSection object(index=1, language=html, orig_start=1, orig_end=3)>,
          <NlSection object(index=2, language=php, orig_start=4, orig_end=6)>,
          <NlSection object(index=3, language=html, orig_start=7, orig_end=7)>
        )
        '''

    def detect_language(self, string):
        """
        Return the language of the particular string.

        This will be overridden by the child class.
        """

    def create_nl_section(self, file, start, end, language, index):
        """
        Creates a NlSection object from the values.

        Returns the NlSection object.
        """
        # NlSection.from_values() expects the file and the line numbers
        # under the names `file`, `start_line` and `end_line`.
        return NlSection.from_values(file=file, start_line=start,
                                     end_line=end, language=language,
                                     index=index)
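A concrete parser for the HTML and PHP example could work line by line, as in the sketch below. It is deliberately simplified: it assumes that the PHP opening and closing tags sit on their own lines, and it uses a hypothetical Section namedtuple instead of NlSection so that it runs without coala.

```python
from collections import namedtuple

# Hypothetical lightweight stand-in for NlSection.
Section = namedtuple('Section', 'index language start end')


def parse_html_php(file_contents):
    """Split the contents into consecutive html/php sections by line."""
    sections = []
    lines = file_contents.splitlines()
    current_language = None
    start = None
    in_php = False

    for line_nr, line in enumerate(lines, start=1):
        if '<?php' in line:
            in_php = True
        language = 'php' if in_php else 'html'

        if language != current_language:
            # The language changed: close the previous section, if any.
            if current_language is not None:
                sections.append(Section(len(sections) + 1, current_language,
                                        start, line_nr - 1))
            current_language = language
            start = line_nr

        if '?>' in line:
            # The closing tag still belongs to the php section.
            in_php = False

    if current_language is not None:
        sections.append(Section(len(sections) + 1, current_language,
                                start, len(lines)))
    return sections


source = '\n'.join([
    '<!DOCTYPE html>',
    '<head>',
    '<title>Hello world as text</title>',
    '<?php',
    'echo "<p>Hello world</p>";',
    '?>',
    '</html>',
])
for section in parse_html_php(source):
    print(section)
```

For the seven-line example this yields three sections (html lines 1 to 3, php lines 4 to 6, html line 7), matching the docstring of parse(). Inline PHP within an HTML line would need column information, which this sketch ignores.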
This section deals with how the segregated sections (nl_sections) from the parser are combined to form the temporary files on which linting will be done.
NlFileHandler is responsible for creating the temporary files. These temporary files are stored in memory.
- The nl_sections are passed to NlFileHandler.
- NlFileHandler then creates an nl_file_dict, where the key is the name of the temporary file and the value is its contents. It is similar to the file_dict that is created by coala in instantiate_process() during the execution of a section.
- It is important to keep in mind that the temporary segregated files are not actually present on disk. In the normal flow of coala, during the execution of the coala sections, coala tries to find the files by the file names mentioned in the coala section and then creates a file_dict. That cannot happen in our case, so we explicitly replace the file_dict with the nl_file_dict.
def get_nl_file_dict(nl_file_info, nl_sections):
    r"""
    Returns a nl_file_dict, where the key is the name of the temporary file
    and the value is the contents of that file.

    :param nl_file_info: The information extracted from the arguments.
                         This is generated by the NlInfoExtractor.
    :param nl_sections:  The segregated sections of the original file.

    >>> nl_file_info
    {
        "file_name": "html_php_file.html",
        "absolute_file_path": "/home/test/html_php_file.html",
        "languages": ["html", "php"],
        "bears": ["HTMLBear", "PHPBear"],
        "language_bear_dict": language_bear_dict,
        "arg_dict": arg_dict
    }

    >>> nl_sections
    ( <NlSection object(index=1, language=html, orig_start=1, orig_end=3)>,
      <NlSection object(index=2, language=php, orig_start=4, orig_end=6)>,
      <NlSection object(index=3, language=html, orig_start=7, orig_end=7)>
    )

    >>> file_contents = get_file_contents(nl_file_info['file_name'])
    >>> file_contents
    '''
    <!DOCTYPE html>
    <head>
    <title>Hello world as text</title>
    <?php
    echo "<p>Hello world</p>";
    ?>
    </html>
    '''

    >>> nl_file_dict = get_nl_file_dict(nl_file_info, nl_sections)
    >>> nl_file_dict
    {
        "temp_html_file":('''<!DOCTYPE html>\n
                             <head>\n
                             <title>Hello world as text</title>\n
                             \n
                             \n
                             \n
                             </html>'''),
        "temp_php_file":('''\n
                            \n
                            \n
                            \t\t<?php\n
                            \t\t\techo "<p>Hello world</p>";\n
                            \t?>''')
    }
    """


def get_file_contents(file_name):
    """Returns the contents of the file"""
def assemble(nl_diff_dict, nl_sections, nl_info_dict):
    """
    Assembles the temporary files.

    The sections are extracted in increasing order of their index and
    then written directly to the original file.
    """
    with open(nl_info_dict['absolute_file_path'], 'w') as file:
        # Sort the sections themselves by index, not the index attribute.
        for nl_section in sorted(nl_sections,
                                 key=lambda section: section.index):
            temp_file_name = nl_section.temp_file
            start = nl_section.linted_start
            end = nl_section.linted_end
            linted_file_content = nl_diff_dict[temp_file_name]
            # get_file_content() is a helper that is not shown here; it is
            # expected to return the content between `start` and `end`.
            section_content = get_file_content(linted_file_content,
                                               start, end)
            file.write(section_content)
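The blank-line padding visible in the get_nl_file_dict() example above keeps every snippet on its original line number, so results reported by a bear map directly back to the nested file. A minimal sketch of that idea follows; the function name build_nl_file_dict and the plain (language, start, end) tuples are assumptions made for illustration, not the proposed API.

```python
def build_nl_file_dict(file_contents, sections):
    """
    Build one padded temporary file per language.

    ``sections`` is a list of (language, start_line, end_line) tuples with
    1-based, inclusive line numbers.
    """
    lines = file_contents.split('\n')
    nl_file_dict = {}
    for language in sorted({lang for lang, _, _ in sections}):
        # Pad only up to the last line that belongs to this language.
        last_end = max(end for lang, _, end in sections if lang == language)
        padded = [''] * last_end
        for lang, start, end in sections:
            if lang == language:
                # Copy the snippet to the same line numbers it had before.
                padded[start - 1:end] = lines[start - 1:end]
        nl_file_dict['temp_{}_file'.format(language)] = '\n'.join(padded)
    return nl_file_dict


original = '\n'.join([
    '<!DOCTYPE html>',
    '<head>',
    '<title>Hello world as text</title>',
    '<?php',
    'echo "<p>Hello world</p>";',
    '?>',
    '</html>',
])
files = build_nl_file_dict(
    original, [('html', 1, 3), ('php', 4, 6), ('html', 7, 7)])
print(files['temp_php_file'].split('\n'))
```

The trade-off of padding is that bears which complain about blank lines may report issues on the padded lines; those results would have to be filtered out.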
This phase deals with linting the temporary files created by the NlFileHandler using the nl_sections.
- In coala, files are linted when the sections created by coala are executed.
- Since we are explicitly making the sections for nested languages, care has been taken to keep the file names in the nl.lang section and the file names in nl_file_dict the same.
- During the execution of the section, in instantiate_process(), we do not access the physical file; rather, we replace the file_dict with nl_file_dict and make the necessary changes.
- Once the file_dict is changed, coala continues the process as usual. To coala, it now looks as if the temporary files were actually present on the physical drive, and the linting starts.
Whenever a bear suggests a patch and the user chooses to apply it, we also need to update linted_start and linted_end. This is necessary because applying a patch may add or delete lines and thereby shift the positions of the following lines. Keeping track of the start and end of each nl_section in the linted files makes it easier to extract the nl_section from the temporary files.
In order to do so, we will have to make changes to the apply() method of ApplyAction. We use the update_nl_sections() function to update the values.
In Diff.py we add a new function get_diff_info() that gives us information about the diff:
def get_diff_info(self):
    """
    Returns a tuple containing the line numbers of changed, deleted and
    added lines.
    """
    changed_lines = []
    deleted_lines = []
    added_lines = []

    for line_nr in self._changes:
        line_diff = self._changes[line_nr]

        if line_diff.change:
            changed_lines.append(line_nr)
        elif line_diff.delete:
            deleted_lines.append(line_nr)
        if line_diff.add_after:
            added_lines.append(line_nr)

    return changed_lines, deleted_lines, added_lines
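To illustrate the bookkeeping described above, the sketch below shifts linted_start and linted_end after a patch. It rests on an assumption that this cEP does not spell out: the linted positions refer to lines in the temporary file of the section's own language, so only sections of that language are affected. SectionRange and update_linted_ranges are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SectionRange:
    """Hypothetical stand-in holding only the fields needed here."""
    index: int
    language: str
    linted_start: int
    linted_end: int


def update_linted_ranges(sections, language, patched_line, added, deleted):
    """
    Adjust the linted ranges after a patch touched ``patched_line`` in the
    temporary file of ``language``. Returns the index of the patched section.
    """
    delta = added - deleted
    target = None
    for section in sections:
        if (section.language == language
                and section.linted_start <= patched_line
                <= section.linted_end):
            target = section
            break
    if target is None:
        raise ValueError('No section of that language contains the line.')

    # The patched section grows or shrinks in place ...
    target.linted_end += delta
    # ... and later sections of the same temporary file move with it.
    for section in sections:
        if section.language == language and section.index > target.index:
            section.linted_start += delta
            section.linted_end += delta
    return target.index


sections = [
    SectionRange(1, 'html', 1, 3),
    SectionRange(2, 'php', 4, 6),
    SectionRange(3, 'html', 7, 7),
]
# A patch adds two lines after line 2 of the temporary html file.
patched_index = update_linted_ranges(sections, 'html', 2, added=2, deleted=0)
print(patched_index, sections)
```

The counts of added and deleted lines could come from the statistics of the diff, and the affected line from get_diff_info().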
The following methods belong to NlCore:
from coalib.results.Diff import Diff

# `diff` is the Diff object of the patch that is about to be applied.
diff_stats = diff.stats()
diff_info = diff.get_diff_info()


def find_section_index(diff_stats, diff_info, nl_sections):
    """
    Returns the section index to which the patch is about to be applied.
    """


def update_nl_sections(diff_stats, diff_info, nl_sections):
    """
    Updates the `linted_start` and the `linted_end` of the nl_sections
    """
    index = find_section_index(diff_stats, diff_info, nl_sections)

    for nl_section in nl_sections:
        if nl_section.index == index:
            # Update the values of `linted_start` and `linted_end`.
            ...
This phase deals with assembling the linted temporary files back into the original file.
This phase has two parts:
- Extracting the sections from the linted temporary files.
- Assembling these sections.
Once all the coala sections have been executed, we have an nl_diff_dict, where the key is the name of the temporary file and the value is the linted contents of that temporary file.
We have an assemble() method inside the NlFileHandler, which uses the information from the nl_sections to extract the sections from the nl_diff_dict and write them to the original file.
def assemble(nl_diff_dict, nl_sections, nl_info_dict):
    """
    Assembles the temporary files.

    The sections are extracted in increasing order of their index and
    then written directly to the original file.
    """
    with open(nl_info_dict['absolute_file_path'], 'w') as file:
        # Sort the sections themselves by index, not the index attribute.
        for nl_section in sorted(nl_sections,
                                 key=lambda section: section.index):
            temp_file_name = nl_section.temp_file
            start = nl_section.linted_start
            end = nl_section.linted_end
            linted_file_content = nl_diff_dict[temp_file_name]
            # get_file_content() is a helper that is not shown here; it is
            # expected to return the content between `start` and `end`.
            section_content = get_file_content(linted_file_content,
                                               start, end)
            file.write(section_content)
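The extraction and assembly steps can be tried out in memory with the sketch below. It returns the assembled text instead of writing it to the original file, and it represents each section as a plain (index, temp_file_name, linted_start, linted_end) tuple; both choices, as well as the name assemble_sections, are assumptions made to keep the example self-contained.

```python
def assemble_sections(nl_diff_dict, sections):
    """
    Rebuild the nested file from the linted temporary files.

    ``sections`` holds (index, temp_file_name, linted_start, linted_end)
    tuples with 1-based, inclusive line numbers.
    """
    assembled = []
    # Tuples sort by their first element, i.e. by section index.
    for _, temp_file_name, start, end in sorted(sections):
        linted_lines = nl_diff_dict[temp_file_name].split('\n')
        assembled.extend(linted_lines[start - 1:end])
    return '\n'.join(assembled)


nl_diff_dict = {
    'temp_html_file': '<!DOCTYPE html>\n<head>\n<title>Hi</title>\n\n\n\n</html>',
    'temp_php_file': '\n\n\n<?php\necho 1;\n?>',
}
sections = [
    (3, 'temp_html_file', 7, 7),
    (1, 'temp_html_file', 1, 3),
    (2, 'temp_php_file', 4, 6),
]
print(assemble_sections(nl_diff_dict, sections))
```

Because only the lines inside each section's linted range are copied, the blank padding lines of the temporary files never reach the assembled output.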
from coalib.nestedlib.NlInfoExtractor import extract_info
from coalib.results.Diff import Diff
from coalib.nestedlib.NlFileHandler import (
    get_file_contents, get_nl_file_dict, assemble)
import coalib.nestedlib.Parser


def get_arg_info(arg):
    """
    Return argument list and nl_info_dict that will be used to make
    the coala sections

    >>> arg_list = get_arg_list(arg)
    [ ( (file, temp_html_file), (bears, HTMLBear ), (settings, value) ),
      ( (file, temp_php_file), (bears, PHPBear ), (settings, value) )
    ]
    """
    nl_info_dict = extract_info(arg)
    arg_list = make_arg_list(nl_info_dict)
    return nl_info_dict, arg_list


def get_file_metadata(nl_info_dict):
    """
    Returns the nl_sections and nl_file_dict
    """
    parser = detect_parser(nl_info_dict['languages'])
    file_contents = get_file_contents(nl_info_dict['absolute_file_path'])
    nl_sections = parser.parse(file_contents)
    nl_file_dict = get_nl_file_dict(nl_info_dict, nl_sections)
    return nl_sections, nl_file_dict


def update_nl_sections(diff_stats, diff_info, nl_sections):
    """
    Updates the `linted_start` and the `linted_end` of the nl_sections
    """
    index = find_section_index(diff_stats, diff_info, nl_sections)

    for nl_section in nl_sections:
        if nl_section.index == index:
            # Update the values of `linted_start` and `linted_end`.
            ...


def assemble_files(nl_diff_dict, nl_sections, nl_info_dict):
    """
    Assembles the file and returns back to coala_main
    """
    return assemble(nl_diff_dict, nl_sections, nl_info_dict)
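from coalib.nestedlib.NlCore import get_arg_info, get_file_metadata


def main(debug=False):
    configure_logging()

    args = None  # to have args variable in except block
                 # when parse_args fails
    nl_info_dict = nl_arg_dict = nl_sections = nl_file_dict = None
    try:
        args = default_arg_parser().parse_args()
        handle_nested = args.handle_nested
        if handle_nested:
            nl_info_dict, nl_arg_dict = get_arg_info(args)
            nl_sections, nl_file_dict = get_file_metadata(nl_info_dict)

        # ... remaining contents of coala.py, including the except
        # block that belongs to this try ...

    # Keyword arguments must not be followed by positional ones, so the
    # nested language information is passed by keyword.
    return mode_normal(console_printer, None, args, debug=debug,
                       handle_nested=handle_nested,
                       nl_info_dict=nl_info_dict,
                       nl_arg_dict=nl_arg_dict,
                       nl_sections=nl_sections,
                       nl_file_dict=nl_file_dict)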