-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathTODO
109 lines (88 loc) · 4.44 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
# Cosmetic
- Generate properly indented C code
- Debugging: map IR labels to source labels
(Could be as simple as emitting a comment?)
# Functionality/bugs
- Port IPC to compiler / RTS
- Improved correctness: use TNFA/TDFA simulator in Interpreter instead of
regex-posix?
- implement `#n` (1.18), currently disabled in bsd.sh
- Ranges with 0 address (GNU extension). Very similar to 1,<> but allows the
end address to match on the first line (or something).
* Also apply appropriately to change-range special case...
* And reconcile with the pre-first line feature
- Change and !:
'8,9!chello': replaces *each* line with hello except line 8 and 9
(are negated ranges even implemented?)
a positive range such as 8,9chello replaces both lines with a single hello
- D/P (PrintFirstLine and DeleteFirstLine) are parsed but unimplemented?
- tested by gnused's uniq.sed
- commented-out test cases in bsd.sh too
- now that pattern space is exposed to IR, we can use a regexp replace for it
# Testing
* Add some testing of regex matching, e.g. comparing regex-posix, TNFA and TDFA.
* Add tests for IPC functionality
* Add mode for bsd.sh that compares interpreted to compiled
- Or just record known-good outputs for every test...
- And move that into sedness
# Optimizations
## "Advanced"
* Track matched regexes until pattern space changes, check for equivalent
regexes as well as impossible matches.
* Combine multiple regexes into a "switch" thing whenever there are multiple
branches on matches.
## String operations
- Deduplicate substrings in substitutions
- Optimize out IsEmptyString used by case conversion
Perhaps only practical on literals?
Optimizing on regex submatches in general would be useful, but requires more
information about the parsed regex to figure out the length of a group.
- Optimize out concatenation of empty strings
- Precalculate length of append lists to avoid repeated reallocs
## Regex
* Strip out failed states
1. Identify failed states using minLength
2. Remove all transitions to failed states
3. Remove failed states themselves (unless that happens automatically when
they are not mentioned and have no transitions?)
* Handle literals in TNFA/TDFA to make it more efficient (e.g. madding.sed)
- Doesn't seem feasible to keep literals through TNFA construction as they
would need to be split
- Rather recognize in TDFA? that there's a path where only one character can
match, and then you can string together more of them. Or even do it as an
optimization on the IR?
- Emit memcmp to match a long-enough literal
* TNFA minimization - might make TDFA generation faster
* TDFA minimization - combine equivalent states
* More efficient searching:
- If regex starts with a single character, use memchr
- If regex starts with a literal, use memmem for initial guess/skip
- In a state where we match something like .*literal or .*char we can skip
with memchr/memmem
- Fancy: something with the start-of-match tag - "abandon" states where the
start of match tag is already set but would be moved forward. Prototype in
SimulateTNFA, then see if the logic can be ported to the TDFA construction?
- Simpler: in cases where we don't emit tags at all the location/length of the
match would not be relevant so we can exit with a match on the first final
state and can look for e.g. ".*re".
This seemed to be a bit slower though. I think because there's no support
for non-greedy .* and returning shortest instead of longest match.
* Do more with memmem - e.g. precalculate tables and constants, check for
string search algorithms with more expensive precalculation.
### Done
* Return bool from match functions, remove `match_t::result`
* Track used groups in matches, feed back to regex compiler
* Skip retry if the regex is anchored at the start
* Parts of minLength optimization:
- Split up state entry point to have one without range check
- When transitioning to a state with a lower minLength, skip check
- Precalculate minLength and store in TDFA
* Use knowledge of regex to eliminate useless next-match calls
- anchored at start: can't match again
- anchored at end: first match will contain the end of the string
- matches any suffix / ends with .*: first match will match all
- also when matching any prefix
### Abandoned
* Port search with states as in SimulateTNFA to TDFA construction
- Turns out this was wrong. Can it even work? Let's just do the other
optimizations :)