+    Identify positions of 4-byte encoded characters in a UTF-8 string.
+    This function scans through the input text and identifies the positions
+    of characters that are encoded with 4 bytes in UTF-8. These are the
+    non-BMP characters (those outside the Basic Multilingual Plane), such as
+    certain emoji and some rare Chinese, Japanese, and Korean characters.
+
+    :param text: The input string to be analyzed.
+    :type text: str
+    :return: A list of positions where 4-byte encoded characters are found, counted in LanguageTool offsets (each such character occupies 2 positions).
+    :rtype: List[int]
+    """
+    positions = []
+    char_index = 0
+    for char in text:
+        if len(char.encode('utf-8')) == 4:
+            positions.append(char_index)
+            # Advance one extra position: LanguageTool counts a 4-byte character
+            # as 2 UTF-16 code units, whereas Python counts it as 1 character.
+            char_index += 1
+        char_index += 1
+    return positions
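For reviewers who want to try the scan outside the diff, here is a minimal standalone sketch of the same loop. The function name `find_four_byte_positions` is hypothetical (the `def` line is not visible in this hunk), and the sketch assumes the input contains no lone surrogates, which `str.encode` would reject.

```python
from typing import List


def find_four_byte_positions(text: str) -> List[int]:
    """Return the UTF-16 offsets of characters that need 4 bytes in UTF-8.

    Hypothetical name; mirrors the loop added in this diff.
    """
    positions: List[int] = []
    utf16_index = 0
    for char in text:
        if len(char.encode("utf-8")) == 4:
            positions.append(utf16_index)
            # A non-BMP character is a surrogate pair, i.e. 2 UTF-16 code units.
            utf16_index += 1
        utf16_index += 1
    return positions
```

For the input `"a😀b😀"`, the second emoji is reported at offset 4 rather than 3, because the first emoji already occupied two positions.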
 @total_ordering
 class Match:
@@ -92,8 +100,12 @@ class Match:
 
             - 'message': The message describing the error.
     :type attrib: Dict[str, Any]
+    :param text: The original text in which the error occurred (the whole text, not just the context).
+    :type text: str
 
     Attributes:
+        PREVIOUS_MATCHES_TEXT (Optional[str]): The text of the previous match object.
+        FOUR_BYTES_POSITIONS (Optional[List[int]]): The positions of 4-byte encoded characters in the text, registered by the previous match object (kept for optimization purposes if the text is the same).
         ruleId (str): The ID of the rule that was violated.
         message (str): The message describing the error.
         replacements (list): A list of suggested replacements for the error.
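The two class-level attributes `PREVIOUS_MATCHES_TEXT` and `FOUR_BYTES_POSITIONS` suggest a simple "rescan only when the text changes" cache. The sketch below illustrates one way that could work and how the cached positions might be used to turn a LanguageTool offset into a Python string index. `OffsetCache`, `positions_for`, and `to_python_offset` are hypothetical names, and the subtraction rule is an assumption inferred from the comment in the scanning loop, not code taken from this diff.

```python
from typing import List, Optional


class OffsetCache:
    """Hypothetical stand-in for the class-level cache on Match."""

    PREVIOUS_MATCHES_TEXT: Optional[str] = None
    FOUR_BYTES_POSITIONS: Optional[List[int]] = None

    @classmethod
    def positions_for(cls, text: str) -> List[int]:
        # Rescan only when the text differs from the one seen previously.
        if cls.FOUR_BYTES_POSITIONS is None or text != cls.PREVIOUS_MATCHES_TEXT:
            positions: List[int] = []
            index = 0
            for char in text:
                if len(char.encode("utf-8")) == 4:
                    positions.append(index)
                    index += 1  # second UTF-16 code unit of the pair
                index += 1
            cls.PREVIOUS_MATCHES_TEXT = text
            cls.FOUR_BYTES_POSITIONS = positions
        return cls.FOUR_BYTES_POSITIONS


def to_python_offset(utf16_offset: int, text: str) -> int:
    """Convert a UTF-16 code-unit offset into a Python string index (assumed rule)."""
    positions = OffsetCache.positions_for(text)
    # Every 4-byte character before the offset was counted twice in UTF-16.
    return utf16_offset - sum(1 for p in positions if p < utf16_offset)
```

Since many matches usually come from one text, the scan would run once per text rather than once per match.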
@@ -103,13 +115,40 @@ class Match:
         errorLength (int): The length of the error.
         category (str): The category of the rule that was violated.
         ruleIssueType (str): The issue type of the rule that was violated.
+
+    Example of a match object received from the LanguageTool API:
Returns the line and column number of the error in the context.
+
+        :param original_text: The original text in which the error occurred. This is needed to calculate the line and column number, because the context no longer contains newline characters.
+        :type original_text: str
+        :return: A tuple containing the line and column number of the error.
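To make the docstring concrete, here is a minimal sketch of deriving a line and column from the full text and a character offset. The function name `line_and_column` and the 1-based numbering are assumptions for illustration; the method body is not visible in this hunk, so the actual convention may differ.

```python
from typing import Tuple


def line_and_column(original_text: str, offset: int) -> Tuple[int, int]:
    """Return an assumed 1-based (line, column) pair for a character offset."""
    before = original_text[:offset]
    # Each newline before the offset starts a new line.
    line = before.count("\n") + 1
    # rfind returns -1 when there is no newline, so line_start becomes 0.
    line_start = before.rfind("\n") + 1
    column = offset - line_start + 1
    return line, column
```

The newlines have to come from the original text: the context string returned with each match has them stripped, so it cannot be used for this count.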