event-based parser policy pt6: docs
biojppm committed May 5, 2024
1 parent 9713381 commit b955880
Showing 9 changed files with 191 additions and 44 deletions.
32 changes: 20 additions & 12 deletions README.md
Expand Up @@ -729,13 +729,22 @@ See also [the roadmap](./ROADMAP.md) for a list of future work.
ryml deliberately makes no effort to follow the standard in the
following situations:

* Containers are not accepted as mapping keys: keys must be scalars.
* ryml's tree does NOT accept containers as mapping keys: keys
must be scalars. HOWEVER, this is a limitation only of the tree. The
event-based parser engine DOES parse container keys. The parser
engine is the result of a recent refactor, and is meant to be used
by other programming languages to create their native
data structures. This engine is fully tested and fully conformant
(other than the general error permissiveness noted below). But
because it is recent, it is still undocumented, and it requires some
API cleanup before being ready for isolated use. Please get in touch
if you are interested in integrating the event-based parser engine
without the standalone `ryml::parse_*()`.
* Tab characters after `:` and `-` are not accepted tokens, unless
ryml is compiled with the macro `RYML_WITH_TAB_TOKENS`. This
requirement exists because checking for tabs introduces branching
into the parser's hot code and in some cases costs as much as 10%
in parsing time.
* Anchor names must not end with a terminating colon: eg `&anchor: key: val`.
* Non-unique map keys are allowed. Enforcing key uniqueness in the
parser or in the tree would cause log-linear parsing complexity (for
root children on a mostly flat tree), and would increase code size
Expand All @@ -754,18 +763,17 @@ following situations:
reflects the usual practice of having at most 1 or 2 tag directives;
also, be aware that this feature is under consideration for removal
in YAML 1.3.

Also, ryml tends to be on the permissive side where the YAML standard
dictates there should be an error; in many of these cases, ryml will
tolerate the input. This may be good or bad, but in any case is being
improved on (meaning ryml will grow progressively less tolerant of
YAML errors in the coming releases). So we strongly suggest to stay
away from those dark corners of YAML which are generally a source of
problems, which is a good practice anyway.
* ryml tends to be on the permissive side in several cases where the
YAML standard dictates that there should be an error; in many of these
cases, ryml will tolerate the input. This may be good or bad, but in
any case is being improved on, meaning ryml will grow progressively
less tolerant of YAML errors in the coming releases. So we strongly
suggest staying away from those dark corners of YAML which are
generally a source of problems; this is good practice anyway.

If you do run into trouble and would like to investigate conformance
of your YAML code, beware of existing online YAML linters, many of
which are not fully conformant; instead, try using
of your YAML code, **beware** of existing online YAML linters, many of
which are not fully conformant. Instead, try using
[https://play.yaml.io](https://play.yaml.io), an amazing tool which
lets you dynamically input your YAML and continuously see the results
from all the existing parsers (kudos to @ingydotnet and the people
Expand Down
108 changes: 108 additions & 0 deletions changelog/current.md
@@ -0,0 +1,108 @@
**All the changes described come from a single PR: [#PR414](https://github.com/biojppm/rapidyaml/pull/414).**


### Parser refactor

The parser was completely refactored ([#PR414](https://github.com/biojppm/rapidyaml/pull/414)). This was a large and hard job carried out over several months, and the result is:

- A new event-based parser engine is now in place, enabling the improvements described below. This engine uses a templated event handler, where each event is a function call, which spares branches on the event handler. The parsing code was fully rewritten, and is now much simpler (albeit longer), and much easier to work with and fix.
- YAML standard-conformance was improved significantly. Along with many smaller fixes and additions (too many to list here), the main changes are the following:
  - The parser engine can now successfully parse container keys, emitting all the events in the correct order, **but** as before, the ryml tree cannot accommodate these (and this constraint is no longer enforced by the parser, but instead by `EventHandlerTree`). For an example of a handler which can accommodate container keys, see the one which is used for the test suite at `test/test_suite/test_suite_event_handler.hpp`.
  - Anchor keys can now be terminated with colon (eg, `&anchor: key: val`), as dictated by the standard.
- The parser engine can now be used to create native trees in other programming languages, or in cases where the user *must* have container keys.
- Parsing performance improved (benchmark results incoming) from reduced parser branching.
- Emitting performance improved (benchmark results incoming), as the emitting code no longer has to read the full scalars to decide on an appropriate emit style.
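
The templated-handler idea described above can be illustrated with a minimal, self-contained sketch. This is not ryml's engine or its handler interface: `parse_flat_seq`, `RecordingHandler` and `CountingHandler` are hypothetical names, and the toy grammar only covers a flat flow sequence of plain scalars. It merely shows how passing the handler as a template parameter turns each event into a direct member call, so that the same parsing code can feed different consumers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler: records each event as text.
struct RecordingHandler
{
    std::vector<std::string> events;
    void begin_seq() { events.push_back("+SEQ"); }
    void end_seq()   { events.push_back("-SEQ"); }
    void scalar(const std::string &s) { events.push_back("=VAL :" + s); }
};

// Hypothetical handler: ignores structure and only counts scalars.
struct CountingHandler
{
    int num_scalars = 0;
    void begin_seq() {}
    void end_seq() {}
    void scalar(const std::string &) { ++num_scalars; }
};

// Toy engine: parses a flat flow sequence such as "[a, b, c]".
// The handler type is a template parameter, so every event is a
// direct member call instead of a dispatch on an event code.
template<class Handler>
void parse_flat_seq(const std::string &src, Handler &handler)
{
    assert(src.size() >= 2 && src.front() == '[' && src.back() == ']');
    handler.begin_seq();
    const std::string body = src.substr(1, src.size() - 2);
    std::size_t pos = 0;
    while(pos < body.size())
    {
        std::size_t comma = body.find(',', pos);
        if(comma == std::string::npos)
            comma = body.size();
        const std::string item = body.substr(pos, comma - pos);
        // trim surrounding spaces; skip items that are empty
        const std::size_t b = item.find_first_not_of(' ');
        const std::size_t e = item.find_last_not_of(' ');
        if(b != std::string::npos)
            handler.scalar(item.substr(b, e - b + 1));
        pos = comma + 1;
    }
    handler.end_seq();
}
```

Calling `parse_flat_seq("[a, b]", handler)` with a `RecordingHandler` collects a textual event stream, while the same call with a `CountingHandler` only increments a counter; the parsing code is identical in both cases.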


### Strict JSON parser

- A strict JSON parser was added. Use the `parse_json_...()` family of functions to parse JSON in a stricter (and faster) mode than flow-style YAML.


### YAML style preserved while parsing

- The YAML style information is now fully preserved through parsing/emitting round trips. This was made possible because the event model of the new parsing engine now incorporates style varieties. So, for example:
  - a scalar parsed from a plain/single-quoted/double-quoted/block-literal/block-folded scalar will always be emitted using its original style in the YAML source
  - a container parsed in block-style will always be emitted in block-style
  - a container parsed in flow-style will always be emitted in flow-style

  Because of this, the style of YAML emitted by ryml changes from previous releases.
- Scalar filtering was improved and is now done directly in the source being parsed (which may be in place or in the arena), except in the cases where the scalar expands and does not fit its initial range, in which case the scalar is filtered out of place to the tree's arena.
- Filtering can now be disabled while parsing, to ensure a fully-readonly parse (but this feature is still experimental and somewhat untested, given the scope of the rewrite work).
- The parser now offers methods to filter scalars in place or out of place.
- Style flags were added to `NodeType_e`:
```
FLOW_SL ///< mark container with single-line flow style (seqs as '[val1,val2]', maps as '{key: val,key2: val2}')
FLOW_ML ///< mark container with multi-line flow style (seqs as '[\n val1,\n val2\n]', maps as '{\n key: val,\n key2: val2\n}')
BLOCK ///< mark container with block style (seqs as '- val\n', maps as 'key: val')
KEY_LITERAL ///< mark key scalar as multiline, block literal |
VAL_LITERAL ///< mark val scalar as multiline, block literal |
KEY_FOLDED ///< mark key scalar as multiline, block folded >
VAL_FOLDED ///< mark val scalar as multiline, block folded >
KEY_SQUO ///< mark key scalar as single quoted '
VAL_SQUO ///< mark val scalar as single quoted '
KEY_DQUO ///< mark key scalar as double quoted "
VAL_DQUO ///< mark val scalar as double quoted "
KEY_PLAIN ///< mark key scalar as plain scalar (unquoted, even when multiline)
VAL_PLAIN ///< mark val scalar as plain scalar (unquoted, even when multiline)
```
- Style predicates were added to `NodeType`, `Tree`, `ConstNodeRef` and `NodeRef`:
```
bool is_container_styled() const;
bool is_block() const;
bool is_flow_sl() const;
bool is_flow_ml() const;
bool is_flow() const;
bool is_key_styled() const;
bool is_val_styled() const;
bool is_key_literal() const;
bool is_val_literal() const;
bool is_key_folded() const;
bool is_val_folded() const;
bool is_key_squo() const;
bool is_val_squo() const;
bool is_key_dquo() const;
bool is_val_dquo() const;
bool is_key_plain() const;
bool is_val_plain() const;
```
- Style modifiers were also added:
```
void set_container_style(NodeType_e style);
void set_key_style(NodeType_e style);
void set_val_style(NodeType_e style);
```
- Emit helper predicates were added, and are used when an emitted node was built programmatically without style flags:
```
/** choose a YAML emitting style based on the scalar's contents */
NodeType_e scalar_style_choose(csubstr scalar) noexcept;
/** query whether a scalar can be encoded using single quotes.
* It may not be possible, notably when there is leading
* whitespace after a newline. */
bool scalar_style_query_squo(csubstr s) noexcept;
/** query whether a scalar can be encoded using plain style (no
* quotes, not a literal/folded block scalar). */
bool scalar_style_query_plain(csubstr s) noexcept;
```
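
As a rough illustration of why many scalars can be filtered directly in the source buffer, consider single-quoted scalars, where the only escape is `''` standing for a single `'`. The filtered text can never be longer than the raw text, so it can be written over the raw text. The sketch below is an assumption-laden toy, not ryml's implementation: `filter_squo_inplace` is a hypothetical helper, it ignores line folding, and it does not cover the expanding case that requires filtering out of place into the arena.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of ryml's API): filter the contents of
// a single-quoted scalar in place, turning each escaped '' into '.
// `buf` holds the text between the outer quotes; the return value is
// the filtered length. The write index never overtakes the read index,
// so no second buffer is needed.
std::size_t filter_squo_inplace(char *buf, std::size_t len)
{
    assert(buf != nullptr || len == 0);
    std::size_t w = 0;
    for(std::size_t r = 0; r < len; ++r)
    {
        buf[w++] = buf[r];
        // an escaped quote: keep the first ', skip the second
        if(buf[r] == '\'' && r + 1 < len && buf[r + 1] == '\'')
            ++r;
    }
    return w;
}
```

The caller keeps using the same buffer and only shortens the length it tracks for the scalar, which is what makes a shrinking filter cheap.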

### Breaking changes

As a result of the refactor, there are some limited changes that affect client code. Even though this was a large refactor, effort was directed at keeping maximal backwards compatibility, and the changes are not wide. But they still exist:

- The existing `parse_...()` methods in the `Parser` class were all removed. Use the corresponding `parse_...(Parser*, ...)` function from the header [`c4/yml/parse.hpp`](https://github.com/biojppm/master/src/c4/yml/parse.hpp) (link valid after this branch is merged).
- When instantiated by the user, the parser now needs to receive an `EventHandlerTree` object, which is responsible for building the tree. Although fully functional and tested, the structure of this class is still somewhat experimental and is still likely to change. There is an alternative event handler implementation responsible for producing the events for the YAML test suite in `test/test_suite/test_suite_event_handler.hpp`.
- The declaration and definition of `NodeType` was moved to a separate header file `c4/yml/node_type.hpp` (previously it was in `c4/yml/tree.hpp`).
- Some of the node type flags were removed, and several flags (and combination flags) were added.
- Most of the existing flags are kept, as well as their meaning.
- `KEYQUO` and `VALQUO` are now masks of the several style flags for quoted scalars. In general, however, client code using these flags and `.is_val_quoted()` or `.is_key_quoted()` is not likely to require any changes.
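
To make the "mask of several style flags" idea concrete, here is a small sketch using bit flags. The flag names mirror the ones listed in this changelog, but the bit values, the `VAL_STYLE` mask, and the `ToyNodeType` struct are assumptions made for illustration; ryml's actual `NodeType_e` layout and `NodeType` class may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit values: ryml's real NodeType_e values may differ.
enum StyleBits : std::uint32_t
{
    KEY_SQUO  = 1u << 0,
    VAL_SQUO  = 1u << 1,
    KEY_DQUO  = 1u << 2,
    VAL_DQUO  = 1u << 3,
    KEY_PLAIN = 1u << 4,
    VAL_PLAIN = 1u << 5,
    // masks combining the quoted styles
    KEYQUO = KEY_SQUO | KEY_DQUO,
    VALQUO = VAL_SQUO | VAL_DQUO,
    // hypothetical helper mask covering the val scalar styles above
    VAL_STYLE = VAL_SQUO | VAL_DQUO | VAL_PLAIN
};

// Toy stand-in for a node type holding style bits.
struct ToyNodeType
{
    std::uint32_t bits;

    bool is_val_squo() const { return (bits & VAL_SQUO) != 0; }
    bool is_val_dquo() const { return (bits & VAL_DQUO) != 0; }
    bool is_val_plain() const { return (bits & VAL_PLAIN) != 0; }
    // "quoted" means any of the bits in the mask is set
    bool is_key_quoted() const { return (bits & KEYQUO) != 0; }
    bool is_val_quoted() const { return (bits & VALQUO) != 0; }

    // replace the val style, leaving all other bits untouched
    void set_val_style(std::uint32_t style)
    {
        assert(style != 0 && (style & ~static_cast<std::uint32_t>(VAL_STYLE)) == 0);
        bits = (bits & ~static_cast<std::uint32_t>(VAL_STYLE)) | style;
    }
};
```

With this layout, code that only asks `is_val_quoted()` keeps working whether the scalar was single-quoted or double-quoted, which is consistent with the note that such client code is unlikely to need changes.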


### New type for node IDs

A type `id_type` was added to signify the integer type for the node id, defaulting to the backwards-compatible `size_t` which was previously used in the tree. In the future, this type is likely to change, *and probably to a signed type*, so client code is encouraged to always use `id_type` instead of `size_t`, and specifically not to rely on the signedness of this type.
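
A sketch of what signedness-agnostic client code could look like follows. Everything here is a stand-in: the local `id_type` alias, the `NO_ID` sentinel and `ToyTree` are hypothetical and are not ryml declarations. The point is only that comparing ids against a sentinel with `==`/`!=` keeps working if the alias later becomes a signed type, whereas tests such as `id >= 0` or wrap-around arithmetic would not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in alias: in real code this would come from the library.
using id_type = std::size_t;
// Hypothetical sentinel meaning "no node"; -1 converts to a valid
// value whether id_type is signed or unsigned.
constexpr id_type NO_ID = static_cast<id_type>(-1);

// Toy tree: next_sibling[i] is the id of the node after i, or NO_ID.
struct ToyTree
{
    std::vector<id_type> next_sibling;
};

// Count the nodes in a sibling chain starting at `first`.
// Ids are only compared against the sentinel, never against 0.
id_type count_siblings(const ToyTree &t, id_type first)
{
    id_type n = 0;
    id_type id = first;
    while(id != NO_ID)
    {
        const std::size_t idx = static_cast<std::size_t>(id);
        assert(idx < t.next_sibling.size());
        ++n;
        id = t.next_sibling[idx];
    }
    return n;
}
```

Declaring loop counters and stored ids as `id_type`, and casting explicitly only at the point of indexing a standard container, keeps the change of type a recompile rather than a rewrite.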


### Reference resolver is now exposed

The reference (ie, alias) resolver object is now exposed in
[`c4/yml/reference_resolver.hpp`](https://github.com/biojppm/master/src/c4/yml/reference_resolver.hpp) (link valid after this PR is merged). Previously this object was temporarily instantiated in `Tree::resolve()`. Exposing it now enables the user to reuse this object through different calls, saving a potential allocation on every call.
2 changes: 2 additions & 0 deletions doc/Doxyfile
Expand Up @@ -952,6 +952,8 @@ WARN_LOGFILE =
INPUT = \
./doxy_main.md \
../src \
../test/test_suite/test_suite_event_handler.hpp \
../test/test_suite/test_suite_event_handler.cpp \
../samples/quickstart.cpp \
../ext/c4core/src/c4/substr.hpp \
../ext/c4core/src/c4/charconv.hpp \
Expand Down
1 change: 1 addition & 0 deletions doc/doxy_main.md
Expand Up @@ -11,6 +11,7 @@
* @ref doc_tree
* @ref doc_node_classes
* For serialization/deserialization, see @ref doc_serialization.
* @ref doc_ref_utils - how to resolve references to anchors
* @ref doc_tag_utils - how to resolve tags
* @ref doc_callbacks - how to set up error/allocation/deallocation
callbacks either globally for the library, or for specific objects
Expand Down
4 changes: 4 additions & 0 deletions doc/index.rst
Expand Up @@ -47,20 +47,24 @@ ryml is written in C++11, and compiles cleanly with:
<https://github.com/biojppm/c4conf>`_.


----

Table of contents
=================

.. toctree::
:maxdepth: 3

Doxygen docs <doxygen/index.html#http://>
YAML playground <https://play.yaml.io/main/parser?input=IyBFZGl0IE1lIQoKJVlBTUwgMS4yCi0tLQpmb286IEhlbGxvLCBZQU1MIQpiYXI6IFsxMjMsIHRydWVdCmJhejoKLSBvbmUKLSB0d28KLSBudWxsCg==>
./sphinx_quicklinks
./sphinx_is_it_rapid
./sphinx_yaml_standard
./sphinx_using
./sphinx_other_languages
./sphinx_alternative_libraries

----

API teaser
==========
Expand Down
12 changes: 11 additions & 1 deletion doc/sphinx_other_languages.rst
Expand Up @@ -8,11 +8,21 @@ out the general approach, other languages are likely to follow (all of
this is possible because we’re using `SWIG <http://www.swig.org/>`__,
which makes it easy to do so).


JavaScript
----------

A JavaScript+WebAssembly port is available, compiled through
`emscripten <https://emscripten.org/>`__.
`emscripten <https://emscripten.org/>`__. Here's a quick example of
how to compile ryml with emscripten using ``emcmake``:

.. code:: bash

   git clone --recursive https://github.com/biojppm/rapidyaml
   cd rapidyaml
   emcmake cmake -S . -B build \
      -DCMAKE_CXX_FLAGS="-s DISABLE_EXCEPTION_CATCHING=0"

Here's a quick example on how to configure, compile and run the tests
using `emscripten`:
Expand Down
7 changes: 5 additions & 2 deletions doc/sphinx_quicklinks.rst
Expand Up @@ -3,7 +3,11 @@ Quick links

* `API documentation (Doxygen) <./doxygen/index.html>`_

* `Github repo <https://github.com/biojppm/rapidyaml>`_
* `YAML playground <https://play.yaml.io/main/parser?input=IyBFZGl0IE1lIQoKJVlBTUwgMS4yCi0tLQpmb286IEhlbGxvLCBZQU1MIQpiYXI6IFsxMjMsIHRydWVdCmJhejoKLSBvbmUKLSB0d28KLSBudWxsCg==>`_

* YAML Test Suite `online <https://matrix.yaml.info>`_ / `Github <https://github.com/yaml/yaml-test-suite>`_

* `rapidyaml Github repo <https://github.com/biojppm/rapidyaml>`_

* `Issues <https://github.com/biojppm/rapidyaml/issues>`_

Expand All @@ -17,7 +21,6 @@ Quick links

* `README [0.6.0] <https://github.com/biojppm/rapidyaml/blob/v0.6.0/README.md>`_


* Since latest release (master branch):

* `README [master] <https://github.com/biojppm/rapidyaml/blob/master/README.md>`_
Expand Down
3 changes: 1 addition & 2 deletions doc/sphinx_using.rst
Expand Up @@ -210,8 +210,7 @@ of ryml:
low-level multi-platform utilities for C++. When
``RYML_STANDALONE=ON``, c4core is incorporated into ryml as if it is
the same library. Defaults to ``ON``.
- ``RYML_INSTALL=ON/OFF``. enable/disable install target. Defaults to
``ON``.
- ``RYML_INSTALL=ON/OFF``. enable/disable install target. Defaults to ``ON``.

If you’re developing ryml or just debugging problems with ryml itself,
the following cmake variables can be helpful:
Expand Down
66 changes: 39 additions & 27 deletions doc/sphinx_yaml_standard.rst
Expand Up @@ -10,22 +10,43 @@ appear cases which ryml fails to parse. Your `bug reports or pull
requests <https://github.com/biojppm/rapidyaml/issues>`__ are very
welcome.

See also `the roadmap <https://github.com/biojppm/rapidyaml/tree/master/ROADMAP.md>`__ for a list of future work.
.. note::

If you do run into trouble and would like to investigate
conformance of your YAML code, **do not use existing online YAML
linters**, many of which are not fully conformant; instead, try
using `https://play.yaml.io
<https://play.yaml.io/main/parser?input=IyBFZGl0IE1lIQoKJVlBTUwgMS4yCi0tLQpmb286IEhlbGxvLCBZQU1MIQpiYXI6IFsxMjMsIHRydWVdCmJhejoKLSBvbmUKLSB0d28KLSBudWxsCg==>`__,
an amazing tool which lets you dynamically input your YAML and
continuously see the results from all the existing parsers (kudos
to @ingydotnet and the people from the YAML test suite). And of
course, if you detect anything wrong with ryml, please `open an
issue <https://github.com/biojppm/rapidyaml/issues>`__ so that we
can improve.


Known limitations
-----------------
Deliberate deviations
---------------------

ryml deliberately makes no effort to follow the standard in the
following situations:

- Containers are not accepted as mapping keys: keys must be scalars.
- ryml's tree does NOT accept containers as mapping keys: keys
must be scalars. HOWEVER, this is a limitation only of the tree. The
event-based parser engine DOES parse container keys. The parser
engine is the result of a recent refactor, and is meant to be used
by other programming languages to create their native
data structures. This engine is fully tested and fully conformant
(other than the general error permissiveness noted below). But
because it is recent, it is still undocumented, and it requires some
API cleanup before being ready for isolated use. Please get in touch
if you are interested in integrating the event-based parser engine
without the standalone ``ryml::parse_*()``.
- Tab characters after ``:`` and ``-`` are not accepted tokens, unless
ryml is compiled with the macro ``RYML_WITH_TAB_TOKENS``. This
requirement exists because checking for tabs introduces branching
into the parser’s hot code and in some cases costs as much as 10% in
parsing time.
- Anchor names must not end with a terminating colon: eg
``&anchor: key: val``.
- Non-unique map keys are allowed. Enforcing key uniqueness in the
parser or in the tree would cause log-linear parsing complexity (for
root children on a mostly flat tree), and would increase code size
Expand All @@ -45,31 +66,22 @@ following situations:
tag directives; also, be aware that this feature is under
consideration for removal in YAML 1.3.

Also, ryml tends to be on the permissive side where the YAML standard
dictates there should be an error; in many of these cases, ryml will
tolerate the input. This may be good or bad, but in any case is being
improved on (meaning ryml will grow progressively less tolerant of YAML
errors in the coming releases). So we strongly suggest to stay away from
those dark corners of YAML which are generally a source of problems,
which is a good practice anyway.

.. note::
Known (unintended) deviations
-----------------------------

ryml tends to be on the permissive side in several cases where the
YAML standard dictates that there should be an error; in many of these
cases, ryml will tolerate the input without producing an error. This
is being improved on, meaning **ryml will grow progressively less
tolerant of YAML errors** in the coming releases. So we strongly
suggest staying away from those dark corners of YAML which are
generally a source of problems; this is good practice anyway.

If you do run into trouble and would like to investigate
conformance of your YAML code, **do not use existing online YAML
linters**, many of which are not fully conformant; instead, try
using `https://play.yaml.io
<https://play.yaml.io/main/parser?input=IyBFZGl0IE1lIQoKJVlBTUwgMS4yCi0tLQpmb286IEhlbGxvLCBZQU1MIQpiYXI6IFsxMjMsIHRydWVdCmJhejoKLSBvbmUKLSB0d28KLSBudWxsCg==>`__,
an amazing tool which lets you dynamically input your YAML and
continuously see the results from all the existing parsers (kudos
to @ingydotnet and the people from the YAML test suite). And of
course, if you detect anything wrong with ryml, please `open an
issue <https://github.com/biojppm/rapidyaml/issues>`__ so that we
can improve.


YAML test suite
===============
---------------

As part of its CI testing, ryml uses the `YAML test
suite <https://github.com/yaml/yaml-test-suite>`__. This is an extensive
Expand Down Expand Up @@ -99,7 +111,7 @@ several hundred thousand individual tests to which ryml is subjected,
which are added to the unit tests in ryml, which also employ the same
extensive combinatorial approach.

Also, note that in `their own words <http://matrix.yaml.io/>`__, the
Also, note that in `their own words <http://matrix.yaml.info/>`__, the
tests from the YAML test suite *contain a lot of edge cases that don’t
play such an important role in real world examples*. And yet, despite
the extreme focus of the test suite, currently ryml only fails a minor
Expand Down
