This project hosts a recursive descent parser for JSON in pure Python. I started the project by following a tutorial on the topic, Building Recursive Descent Parsers: The Definitive Guide. Once through the tutorial, I applied tests from the JSONTestSuite to the resulting code. I noticed quite a few test cases failing, especially those related to escaping, numbers, and special characters. So, I set out to build a more compliant and flexible JSON parser in pure Python.
`json-rd` is different from standard JSON in a few ways:

- Allows comments starting with `#`
- Allows single-quoted strings
- Is less strict about number formatting in some cases (e.g. `1 000` and `- 1` are considered valid and parsed successfully)
- Ignores trailing garbage after a legal JSON structure
To learn more about recursive descent parsing, I highly recommend working through Building Recursive Descent Parsers: The Definitive Guide. The file `parser.py` is from this original tutorial. Thanks to the author for sharing such a great learning resource with the world.
The project requires Python 3.9+. Once you have it, run:

```
poetry install
```

To run the tests:

```
python -m src.test_jsonparser
```
In `jsonparser`, go to the `quoted_string` method. Find the line:

```python
quote = self.char("\"'")
```

Replace it with:

```python
quote = self.char('"')
```
This is my VSCode debug configuration:
```jsonc
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Module",
            "type": "python",
            "request": "launch",
            "module": "src.test_jsonparser",
            "justMyCode": true
        }
    ]
}
```
To debug a particular case, one can add a conditional breakpoint such as:

```python
"partial_filename" in m
```

where `m` is one of the loop variables iterating through the various sample JSON files (see the `test_match` and `test_no_match` functions in `test_jsonparser.py`).
How does a recursive descent parser work in a nutshell? What's a nice way to think about one as a programmer/engineer?
First of all, it is important to understand some of the primitives in `parser.py`. The most basic ones are the following:
- `char`: Expect the next character to be X. We can express X as a character class (e.g. `0-9`). If the expectation is not met, throw `ParseError`.
- `keyword`: Expect a sequence of characters XYZ. On failure to find the sequence, throw `ParseError`.

The next most important primitive is:

- `match`: Take a list of rules (essentially Python functions) and apply them in order. Return on the first successful match; on failure, throw `ParseError`. Each of the `match` rules tends to be implemented through `char`, `keyword`, and basic Python such as conditionals/loops.
These are the only primitives required to build a recursive descent parser!
For example, an EBNF rule `a ::= b | c | d` can be defined as 4 Python functions `a`, `b`, `c`, `d`; inside `a`, one can simply do `match(b, c, d)` to apply the rules in the requisite order.
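As a concrete sketch of that idea, here is a minimal, self-contained translation of `a ::= b | c | d` into Python functions. The method names mimic the primitives described above, but the signatures and behavior are illustrative assumptions, not the actual `parser.py` API:

```python
# Minimal sketch of recursive descent primitives (hypothetical; the real
# parser.py may differ in signatures and behavior).

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def keyword(self, kw):
        # Expect the exact character sequence `kw` at the current position.
        if self.text.startswith(kw, self.pos):
            self.pos += len(kw)
            return kw
        raise ParseError(f"expected {kw!r} at position {self.pos}")

    def match(self, *rules):
        # Apply the rules in order; return the first successful result.
        for rule in rules:
            saved = self.pos
            try:
                return rule()
            except ParseError:
                self.pos = saved  # backtrack before trying the next rule
        raise ParseError(f"no rule matched at position {self.pos}")

# EBNF: a ::= b | c | d
class Demo(Parser):
    def a(self):
        return self.match(self.b, self.c, self.d)

    def b(self):
        return self.keyword("true")

    def c(self):
        return self.keyword("false")

    def d(self):
        return self.keyword("null")

print(Demo("false").a())  # prints: false
```

Each grammar rule becomes one function, and alternation (`|`) becomes a `match` call over the alternatives; that is the whole mapping.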
Moreover, the library offers some nice convenience functions built on top of the primitives mentioned above:

- `maybe_char`: Try matching a char; on failure, return `None`.
- `maybe_keyword`: Try matching a keyword; on failure, return `None`.
- `maybe_match`: Try matching a given rule/function; on failure, return `None`.

These help avoid wrapping every potentially failing match in a `try`/`except` block.
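A sketch of how such a `maybe_*` wrapper could be built on top of a throwing primitive (names and signatures assumed for illustration; the actual `parser.py` may differ):

```python
# Hypothetical sketch: maybe_keyword converts a ParseError into None,
# restoring the input position so callers can branch on the result.

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def keyword(self, kw):
        # Throwing primitive: succeed and advance, or raise ParseError.
        if self.text.startswith(kw, self.pos):
            self.pos += len(kw)
            return kw
        raise ParseError(f"expected {kw!r} at position {self.pos}")

    def maybe_keyword(self, kw):
        # Non-throwing wrapper: return None instead of raising on failure.
        saved = self.pos
        try:
            return self.keyword(kw)
        except ParseError:
            self.pos = saved
            return None

p = Parser("true false")
print(p.maybe_keyword("false"))  # prints: None
print(p.maybe_keyword("true"))   # prints: true
```

The key detail is restoring `self.pos` on failure, so a failed optional match leaves the parser exactly where it was.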
Once you understand the parser primitives and convenience functions, you are ready to translate an EBNF or other grammar specification into Python functions built on top of the primitives.
- Building Recursive Descent Parsers: The Definitive Guide (Boolean World): tutorial on recursive descent parsers
- JSON.org: nice railroad diagram of the grammar
- JSON McKeeman Form: simple grammar specification in an EBNF-style format
- Railroad Diagram Generator: generate railroad diagrams for visualizing a grammar
It was great fun learning about recursive descent parsers, and then building a JSON parser on that understanding. That's one reason.
Secondly, I am trying to enhance an internal product at Hexmos that embeds JSON as a sub-language within its DSL and has custom flexibility requirements on top of standard JSON. To fulfill these additional requirements, I wanted more fine-grained control over the JSON parsing process.