## Results

| programme | model of word | comments |
| --- | --- | --- |
| wc | class `[^ \f\n\r\t\v]+` | only ascii; super-successful; limited accuracy |
| swift | class `[\w]+` | I don't know what `\w` is; the documentation appears to say that it's equivalent to `[\p{W}\p{Nd}]+`, but I know this is a simplification |
| binary | class `[\p{L}\p{M}\p{N}\p{Pc}\u200b\u200c\u200d\u2060]+` | this was the programme I came up with using upper-bound binary-search |
| re2c | class `[\p{L}\p{M}\p{N}\p{Pc}\u200b\u200c\u200d\u2060]+` | this is an equivalent with re2c |
| re2c | regex `((word_begin=(\p{L}\p{M}* \| (\p{N}\[0-9])) (word_mid* word_end)?) \| number=(binary\|hex\|integer\|real\|money))+` | see src/re2c_next_regex.re.c |

I used my system's wc. I ran each of them through `time <programme> < input`. For input, I wanted to test multiple languages on a large book that's freely available. Going by the list of literary works by number of translations, the Bible is probably the best bet; I downloaded it from Bible Super Search. (Confusingly, I didn't see which version it was, but that's not really important for our purposes.)

Korean Bible word count.

The word counts agree, except for wc. From what limited knowledge I have of Korean, I think this is not surprising. The swift programme is a lot slower; I would like to think that this is an inherent problem with the Swift language, but I'm forced to admit I had never seen Swift code before and it's possible that I don't know what I'm doing.

65 Bibles in different languages concatenated.

I concatenated all 65 versions of the Bible in different languages. I would have been interested to see what word count the swift programme gave, but it spectacularly balked. The expanded regex merges only a few thousandths of the words together in this source. The re2c and wc programmes have similar run-times.

The re2c strategy of breaking a code-point up into bytes causes a huge amount of state fragmentation. On the other hand, binary search is O(log n), albeit with a fixed and small n here, so I was curious whether taking in whole code-points and doing a binary search would be that much slower. It's about twice as slow in this case. I tried optimizing it by skipping the nul-check on non-words and by having two tables, but it didn't make much difference, so I took that complication out.

Executable file size.

It looks like the swift and binary executables come out at about the same size. I would guess Swift uses a multi-stage hash-table internally. That's probably reading too much into it.

## What is this?

The model of words being delimited by white-space—specifically the isspace set `{ ' ', '\f', '\n', '\r', '\t', '\v' }`—is an approximation that does reasonably well on English texts. Among this simple model's detractions: it doesn't actually point out word boundaries; it doesn't always work; it does poorly on more structured input (are "+" and "{" really words?); and it was developed for ascii when (rightfully) most modern text files are utf-8.
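
For scale, that whole model fits in a few lines of C. This is a minimal sketch of the idea, not the actual wc source:

```c
/* Minimal sketch of the white-space-delimited model (not the real wc):
 * count maximal runs of bytes that are not in the isspace set. */
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    long words = 0;
    int c, in_word = 0;
    while ((c = getchar()) != EOF) {
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;     /* a new run of non-space bytes starts a word */
            words++;
        }
    }
    printf("%ld\n", words);
    return 0;
}
```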

Wikipedia says that text-segmentation is "non-trivial." I looked up how to do this and every example was in Swift, so I wrote a programme in Swift to compare with C. I initially was going to use a Patricia trie, but I wanted an upper-bound, fixed binary-search.

Then I realized that this would be the ideal use-case for re2c, which was conveniently updated to include Unicode category definitions in 2019; this made it trivial to have a richer regular-expression definition of a word. This is very useful because not only may graphemes be several code-points, but the context of a code-point can easily be taken into account—the word context is limited to a regular language that can be recognized by a finite-state automaton, rather than a type-1 context-sensitive language.
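
To give a flavour of what that looks like, here is a hypothetical, cut-down sketch of the class row from the table above (it is not src/re2c_next_regex.re.c). It assumes re2c >= 1.2 so that the bundled unicode_categories.re definitions can be included, it approximates \p{M} and \p{N} with a few of their sub-categories, and it would be run through the re2c invocation at the bottom of this readme before compiling:

```c
/* word_sketch.re.c -- hypothetical illustration only, not src/re2c_next_regex.re.c.
 * Assumes re2c >= 1.2 with its bundled unicode_categories.re visible to the
 * include directive; generate the C with the re2c command at the bottom of
 * this readme. */
#include <stdio.h>

/*!include:re2c "unicode_categories.re" */

/* Count words in a nul-terminated utf-8 string. */
static long count_words(const unsigned char *YYCURSOR)
{
    const unsigned char *YYMARKER;
    long words = 0;
    for (;;) {
        /*!re2c
        re2c:define:YYCTYPE = "unsigned char";
        re2c:yyfill:enable  = 0;

        // Approximation of the class in the table: letters, some marks and
        // numbers, connector punctuation, and the zero-width (non-)joiners.
        word = (L | Mn | Mc | Nd | Nl | Pc | [\u200b\u200c\u200d\u2060])+;

        [\x00] { return words; }        // end of input
        word   { words++; continue; }   // one word consumed
        *      { continue; }            // skip a non-word code unit
        */
    }
}

int main(void)
{
    printf("%ld\n", count_words((const unsigned char *)"Hello, 세계 123")); /* 3 */
    return 0;
}
```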

## Binary class

I did this manually, and for that it turned out easier to build a model of the encoding. I am no expert on utf-8, but here's what I came up with for the code-point U+uvwxyz; these are the bytes consumed.

"11111---" Error (for now?)
"11110uvv" "10vvwwww" "10xxxxyy" "10yyzzzz" 4-byte,
"11110---" "10------" "10------"            otherwise error,
"11110---" "10------"                       otherwise error,
"11110---"                                  otherwise error.
"1110wwww" "10xxxxyy" "10yyzzzz" 3-byte,
"1110----" "10------"            otherwise error,
"1110----"                       otherwise error.
"110xxxyy" "10yyxxxx" 2-byte,
"110-----"            otherwise error.
"10------" Continuation-byte not in the context of the above is in error.
"0yyyzzzz" 1-byte, (with checks for nul-termination if applicable.)

Errors are well-defined groups of bytes; how to handle them is ultimately user-defined. A random continuation byte requires up to 3 back-reads to tell whether it's an error. Surrogates, over-long encodings, and unicode code-points that are not (yet?) assigned are further left up to the application.
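
As a concrete reading of that model, here is a hypothetical structural check (my own illustration, not code from this repository) that returns the byte-length of the next well-formed sequence, or 0 for an error:

```c
#include <stddef.h>

/* Hypothetical helper mirroring the byte patterns above; not code from this
 * repository.  Returns the length of the next structurally well-formed
 * sequence at s, or 0 for an error.  Surrogates and over-long encodings are
 * deliberately left to the application, as in the model above. */
static size_t utf8_seq_len(const unsigned char *s, size_t remaining)
{
    size_t need, i;
    if (remaining == 0) return 0;
    if (s[0] < 0x80)                need = 1;   /* 0yyyzzzz */
    else if ((s[0] & 0xE0) == 0xC0) need = 2;   /* 110xxxyy */
    else if ((s[0] & 0xF0) == 0xE0) need = 3;   /* 1110wwww */
    else if ((s[0] & 0xF8) == 0xF0) need = 4;   /* 11110uvv */
    else return 0;        /* stray 10------ continuation or 11111--- lead */
    if (remaining < need) return 0;             /* truncated (see below) */
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;    /* expected 10------ */
    return need;
}
```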

Utf-8 pays careful attention to preserving the code-point order of unicode, so I have side-stepped the issue by working directly in utf-8 space: no narrowing conversion to unicode is applied, and the search is unicode-agnostic. In the binary-search method, the table entries alternate between entering and leaving the class. Erroneous indices therefore pick up the surrounding properties; in this case, they are most likely counted among the class of not-words—an error in a word would probably split it in two.
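
To make the alternating-table idea concrete, here is a simplified sketch of the upper-bound binary search. The handful of boundary values is invented for illustration (the real table is generated from UnicodeData.txt), and it is keyed on decoded code-points, whereas the actual programme stays in utf-8 space:

```c
#include <stddef.h>
#include <stdint.h>

/* Boundaries where membership in the word class flips, sorted ascending:
 * even indices enter the class, odd indices leave it.  These few values
 * are illustrative only. */
static const uint32_t word_bounds[] = {
    0x30, 0x3A,   /* 0-9 */
    0x41, 0x5B,   /* A-Z */
    0x5F, 0x60,   /* _   */
    0x61, 0x7B,   /* a-z */
    0xAA, 0xAB    /* U+00AA feminine ordinal indicator */
};

/* Upper-bound binary search: count the boundaries <= x; an odd count means
 * x falls inside one of the word runs. */
static int is_word(uint32_t x)
{
    size_t lo = 0, hi = sizeof word_bounds / sizeof *word_bounds;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (word_bounds[mid] <= x) lo = mid + 1;
        else hi = mid;
    }
    return (int)(lo & 1);   /* parity of the upper bound */
}
```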

Chunking a file into blocks (of >= 4 bytes) might create truncated code-points: having sought to the end-byte of a block, we might be in the middle of a valid code-point. I keep an auxiliary 3-byte buffer to (possibly) complete the code-point next time around. Using the above scheme, any of the following truncations are stored at the back of the buffer for the next read.

```
                  110-----
                  1110----
                  11110---
         1110---- 10------
         11110--- 10------
11110--- 10------ 10------
```

All others are complete or errors. This is simplified by not caring to distinguish 2-byte, 3-byte, etc, errors, which in practice give little information: they are just errors. If one did care, the list would be longer.
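
A hypothetical sketch of that carry-over (my own, not the repository's buffer code): look back at most three bytes from the end of the block, and hold back exactly the truncations listed above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical carry-over check, not code from this repository.  If the block
 * ends in one of the truncations listed above, copy those (at most 3) bytes
 * into carry and return how many were held back; the caller trims them from
 * this block and prepends carry to the next read.  Everything else returns 0:
 * the tail is either complete or already an error. */
static size_t hold_back(const unsigned char *block, size_t len,
                        unsigned char carry[3])
{
    size_t back, need;
    for (back = 1; back <= 3 && back <= len; back++) {
        const unsigned char b = block[len - back];
        if ((b & 0xC0) == 0x80) continue;          /* 10------: keep looking */
        /* A lead byte, `back` bytes from the end: hold the tail back only if
         * the sequence it starts needs more bytes than remain. */
        need = (b & 0xE0) == 0xC0 ? 2
             : (b & 0xF0) == 0xE0 ? 3
             : (b & 0xF8) == 0xF0 ? 4 : 1;         /* ascii or error: 1 */
        if (need > back) {
            memcpy(carry, block + len - back, back);
            return back;
        }
        return 0;  /* the final sequence is complete (or already an error) */
    }
    return 0;      /* 3 trailing continuation bytes: an error, not a truncation */
}
```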

## Duplicating my results

The Unicode data file is placed in the root of one's project: https://www.unicode.org/Public/UNIDATA/UnicodeData.txt. Specifically, I used https://www.unicode.org/Public/16.0.0/ucd/UnicodeData.txt; it is 2.2 MB.

Every .c in test/ generates a new executable. I should probably update the Makefile instead of explaining.

The re2c-generated file is large, but the GNU extension `--case-ranges` shortens it a lot:

```
re2c [src/*.re.c test/*.re.c] -o build/[*.c] -8 --case-ranges --input-encoding utf8 -i
```